Pandas pivot_table and groupby

Personally, I feel pivot_table is essentially an advanced version of groupby: you can pass it an aggregation function via aggfunc, which is more flexible than the handful of aggregation methods available after a plain groupby.

The data looks like this:
```
C:\Python34\python.exe C:/Users/Leo/Desktop/MingDong_Work/Work_3/business_statAna/electricity_analyse.py
              PSCode   Month  DATATYPE  ProcElectricity  LifeElectricity
0       440100000004  201510         3           7.4555           0.0000
1       440100000004  201510         5           8.0744           0.0000
2       440100000004  201512         5           8.0984           0.0000
3       440100000004  201601         4          20.4855           0.0000
4       440100000004  201604         4          20.4855           0.0000
5       440100000004  201607         4          27.4033           0.0000
6       440100000004  201610         5          22.8613           0.0000
7       440100000004  201701         5           0.0000           7.9799
8       440100000004  201704         5           0.0000           5.2596
9       440100000028  201501         1           0.0000           0.0000
10      440100000028  201508         1           0.0000           0.0000
11      440100000028  201509         1           0.0000           0.0000
12      440100000028  201704         1           0.0000           0.0000
13      440100000028  201705         1           0.0000           0.0000
14      440100000028  201706         1           0.0000           0.0000
15      440100000028  201707         1           0.0000           0.0000
16      440100000028  201708         1           0.0000           0.0000
17      440100000028  201709         1           0.0000           0.0000
18      440100000028  201710         1           0.0000           0.0000
19      440100000071  201510         1           0.0000           0.0000
20      440100000071  201601         3           0.0000           0.0000
21      440100000075  201510         5          15.4600           4.9200
22      440100000075  201601         4           8.1900           4.0500
23      440100000075  201604         2          16.9000           4.6600
24      440100000075  201604         4           8.1900           4.0500
25      440100000075  201607         4          26.5600           6.1500
26      440100000075  201610         5          18.4600           6.3200
27      440100000075  201701         2           8.0800           2.9900
28      440100000075  201704         5           9.2700           2.4700
29      440100000075  201707         5          20.0100           3.9000
...              ...     ...       ...              ...              ...
107599  449013000184  201709         1           0.0000           0.0000
107600  449013000188  201707         1           0.0000           0.0000
107601  449013000188  201708         1           0.0000           0.0000
107602  449013000188  201709         1           0.0000           0.0000
107603  449014000001  201501         3           0.0000           0.0000
107604  449014000006  201507         1           0.0000           0.0000
107605  449014000050  201601         1           0.0000           0.0000
107606  449016000002  201601         1           0.0000           0.0000
107607  449016000002  201604         1           0.0000           0.0000
107608  449016000002  201607         1           0.0000           0.0000
107609  449016000002  201701         1           0.0000           0.0000
107610  449016000004  201601         1           0.0000           0.0000
107611  449016000004  201604         1           0.0000           0.0000
107612  449016000004  201607         1           0.0000           0.0000
107613  449016000004  201610         1           0.0000           0.0000
107614  449016000004  201701         1           0.0000           0.0000
107615  449016000012  201601         1        1666.0000           0.0000
107616  449016000013  201601         1           0.0000           0.0000
107617  449016000013  201604         1           0.0000           0.0000
107618  449016000013  201607         1           0.0000           0.0000
107619  449016000013  201701         1           0.0000           0.0000
107620  449016000026  201601         1           0.0000           0.0000
107621  449016000026  201604         1           0.0000           0.0000
107622  449016000026  201607         1           0.0000           0.0000
107623  449016000026  201610         1           0.0000           0.0000
107624  449016000038  201701         1           0.0000           0.0000
107625  449016000038  201704         1           0.0000           0.0000
107626  449016000062  201701         1           0.0000           0.0000
107627  449016000062  201704         1           0.0000           0.0000
107628  449016000094  201707         1           0.0000           0.0000

[107629 rows x 5 columns]

Process finished with exit code 0
```
Explanation:
- Looking closely at the data, you will notice rows that share the same PSCode and Month but differ in both the values and DATATYPE. Since the other types cannot simply be discarded when aggregating, the duplicates are collapsed by taking the maximum value.

Solution:
- Once I had the idea, groupby immediately sprang to mind (code below):
```python
df_group_result = self.total_df.groupby(['PSCode', 'Month']).max().reset_index()
```
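As a self-contained sketch (with a small made-up DataFrame standing in for total_df), the deduplication-by-maximum step looks like this:

```python
import pandas as pd

# Hypothetical miniature version of total_df: the first two rows share the
# same (PSCode, Month) but differ in DATATYPE and ProcElectricity.
df = pd.DataFrame({
    'PSCode':          ['440100000004', '440100000004', '440100000028'],
    'Month':           [201510, 201510, 201501],
    'DATATYPE':        [3, 5, 1],
    'ProcElectricity': [7.4555, 8.0744, 0.0],
    'LifeElectricity': [0.0, 0.0, 0.0],
})

# Group on the duplicated key, keep the maximum of every other column,
# then turn PSCode/Month from index levels back into ordinary columns.
df_group_result = df.groupby(['PSCode', 'Month']).max().reset_index()
print(df_group_result)
```

The two rows for ('440100000004', 201510) collapse into one, keeping DATATYPE 5 and ProcElectricity 8.0744.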
- A colleague suggested using a pivot table instead, so I gave it a try (having always reached for groupby, I had overlooked how handy this is):
```python
import numpy as np

# aggfunc decides how duplicated (PSCode, Month) rows are collapsed;
# np.max would mirror the groupby().max() approach above.
pivot_test = pd.pivot_table(self.total_df,
                            index=['PSCode', 'Month'],
                            values=['ProcElectricity', 'LifeElectricity'],
                            aggfunc=np.mean).reset_index()
```
Summary:
- pivot_table puts more emphasis on presentation than groupby does, and lets you choose the aggregation up front via aggfunc; groupby focuses on the per-group results rather than the overall presentation.
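To make the comparison concrete, here is a small sketch (with invented data) showing that, given the same keys and the same aggfunc, the two approaches produce identical numbers; the difference is mainly ergonomics:

```python
import pandas as pd

# Invented example data with one duplicated (PSCode, Month) pair.
df = pd.DataFrame({
    'PSCode':          ['A', 'A', 'B'],
    'Month':           [201510, 201510, 201601],
    'ProcElectricity': [7.0, 9.0, 3.0],
    'LifeElectricity': [0.0, 2.0, 1.0],
})

# groupby: aggregate everything that is not a key.
g = df.groupby(['PSCode', 'Month']).max().reset_index()

# pivot_table: same keys, same aggregation, but columns picked via values=.
p = pd.pivot_table(df, index=['PSCode', 'Month'],
                   values=['ProcElectricity', 'LifeElectricity'],
                   aggfunc='max').reset_index()

cols = ['PSCode', 'Month', 'ProcElectricity', 'LifeElectricity']
print(g[cols])
print(p[cols])  # identical values, only the default column order differs
```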