Pandas Pivot_table和Groupby

Pandas透视表和Groupby
个人感觉是就是一个高级版的groupby,可以传入agg函数,比groupby聚合后调用的几种函数来的更高级.


数据如下:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
C:\Python34\python.exe C:/Users/Leo/Desktop/MingDong_Work/Work_3/business_statAna/electricity_analyse.py
PSCode Month DATATYPE ProcElectricity LifeElectricity
0 440100000004 201510 3 7.4555 0.0000
1 440100000004 201510 5 8.0744 0.0000
2 440100000004 201512 5 8.0984 0.0000
3 440100000004 201601 4 20.4855 0.0000
4 440100000004 201604 4 20.4855 0.0000
5 440100000004 201607 4 27.4033 0.0000
6 440100000004 201610 5 22.8613 0.0000
7 440100000004 201701 5 0.0000 7.9799
8 440100000004 201704 5 0.0000 5.2596
9 440100000028 201501 1 0.0000 0.0000
10 440100000028 201508 1 0.0000 0.0000
11 440100000028 201509 1 0.0000 0.0000
12 440100000028 201704 1 0.0000 0.0000
13 440100000028 201705 1 0.0000 0.0000
14 440100000028 201706 1 0.0000 0.0000
15 440100000028 201707 1 0.0000 0.0000
16 440100000028 201708 1 0.0000 0.0000
17 440100000028 201709 1 0.0000 0.0000
18 440100000028 201710 1 0.0000 0.0000
19 440100000071 201510 1 0.0000 0.0000
20 440100000071 201601 3 0.0000 0.0000
21 440100000075 201510 5 15.4600 4.9200
22 440100000075 201601 4 8.1900 4.0500
23 440100000075 201604 2 16.9000 4.6600
24 440100000075 201604 4 8.1900 4.0500
25 440100000075 201607 4 26.5600 6.1500
26 440100000075 201610 5 18.4600 6.3200
27 440100000075 201701 2 8.0800 2.9900
28 440100000075 201704 5 9.2700 2.4700
29 440100000075 201707 5 20.0100 3.9000
... ... ... ... ... ...
107599 449013000184 201709 1 0.0000 0.0000
107600 449013000188 201707 1 0.0000 0.0000
107601 449013000188 201708 1 0.0000 0.0000
107602 449013000188 201709 1 0.0000 0.0000
107603 449014000001 201501 3 0.0000 0.0000
107604 449014000006 201507 1 0.0000 0.0000
107605 449014000050 201601 1 0.0000 0.0000
107606 449016000002 201601 1 0.0000 0.0000
107607 449016000002 201604 1 0.0000 0.0000
107608 449016000002 201607 1 0.0000 0.0000
107609 449016000002 201701 1 0.0000 0.0000
107610 449016000004 201601 1 0.0000 0.0000
107611 449016000004 201604 1 0.0000 0.0000
107612 449016000004 201607 1 0.0000 0.0000
107613 449016000004 201610 1 0.0000 0.0000
107614 449016000004 201701 1 0.0000 0.0000
107615 449016000012 201601 1 1666.0000 0.0000
107616 449016000013 201601 1 0.0000 0.0000
107617 449016000013 201604 1 0.0000 0.0000
107618 449016000013 201607 1 0.0000 0.0000
107619 449016000013 201701 1 0.0000 0.0000
107620 449016000026 201601 1 0.0000 0.0000
107621 449016000026 201604 1 0.0000 0.0000
107622 449016000026 201607 1 0.0000 0.0000
107623 449016000026 201610 1 0.0000 0.0000
107624 449016000038 201701 1 0.0000 0.0000
107625 449016000038 201704 1 0.0000 0.0000
107626 449016000062 201701 1 0.0000 0.0000
107627 449016000062 201704 1 0.0000 0.0000
107628 449016000094 201707 1 0.0000 0.0000

[107629 rows x 5 columns]

Process finished with exit code 0

解释:

  • 细看一下数据会发现存在一些PSCode和DataMonth一样的数据但是数值和DATATYPE都不同,但是统计的时候又不能舍弃其他类型的,因此采用最大值填充

解决办法:

  • 想到思路以后脑子里就蹦出了groupby(代码如下)
1
df_group_result = self.total_df.groupby(['PSCode', 'Month']).max().reset_index()
  • 同事给我的方法是透视表,于是尝试了一下(以前有groupby忽略了这个这么好用的东西)
1
2
3
pivot_test = pd.pivot_table(self.total_df, index=['PSCode', 'Month'], values=['ProcElectricity', 'LifeElectricity'], aggfunc=np.mean).reset_index()

# 但是不知道为啥输出结果的时候把我的ProcElectricity, LifeElectricity两列数据顺序调换了.虽然无关紧要,但是还是比较奇怪的.

总结:

  • Pivot_table的注重呈现方式比groupby好,而groupby注重在分组结果而不是整体.
文章作者: sunhailin-Leo
文章链接: http://www.leoyuki.xyz/2017/12/29/PythonPandasPivot_table/
版权声明: 本博客所有文章除特别声明外,均采用 CC BY-NC-SA 3.0 许可协议。转载请注明来自 LeoBlog