2.4. Pivoting and Reshaping (pivot table and reshape) · soton_Data Analysis

### 2.4.1. Reshape

![](https://img.kancloud.cn/0a/b5/0ab58d67da5b9e59ef918598829273e3_1257x644.png)

```
import numpy as np
import pandas as pd

df = pd.DataFrame({'foo': ['one', 'one', 'one', 'two', 'two', 'two'],
                   'bar': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'baz': [1, 2, 3, 4, 5, 6]})
df
```

![](https://img.kancloud.cn/2f/9b/2f9b4b3bc993e92c339781e28318369e_181x299.png)

```
# pivot turns the unique values of foo into the row index and the unique values of bar into the column names.
# The (index, columns) pairs must not contain duplicates: each unique pair maps to the value in the same row.
df.pivot(index='foo', columns='bar', values='baz')
```

![](https://img.kancloud.cn/79/16/7916d6c0fd2f2b8d6a31b165c60390c4_158x177.png)

*****

The most common use case is time series data:

* date: the date
* group: the sales team
* sells: the sales achieved by the team

The index is the date, which identifies each observation, and the columns are the distinct variables (here, the groups).

```
from datetime import datetime, timedelta

today = datetime.now().date()
# The 5 dates from today back to 4 days ago, repeated 6 times in one list
date = [today - timedelta(days=i) for i in range(5)] * 6
# 'A' to 'F', each repeated 5 times; lists can be concatenated with +
group = ['A'] * 5 + ['B'] * 5 + ['C'] * 5 + ['D'] * 5 + ['E'] * 5 + ['F'] * 5
# 30 random integers from 1000 (inclusive) to 100000 (exclusive), as a list
sells = np.random.randint(1000, 100000, size=(30,)).tolist()
# Build the dictionary
data = {"date": date, "group": group, "sells": sells}
# Build the DataFrame from the dictionary
df_pivot = pd.DataFrame(data)
# Sort the rows by group first, then by date within each group
df_pivot.sort_values(['group', 'date'])
```

![](https://img.kancloud.cn/3f/4b/3f4bb7514ef0472f05165e5073a1b3b3_343x288.png)

*****

```
df_pivot.pivot(index='date', columns='group', values='sells')
```

![](https://img.kancloud.cn/b0/19/b01922fece6346c53e7fcf6ed8dabd43_555x300.png)

*****

If the values parameter is not specified, pivot uses every remaining column as a value column, and the result gets multi-level columns with the value column names as the top level.

```
df_pivot['cumsells'] = df_pivot['sells'] * 2 + 1000
df_pivot
```

![](https://img.kancloud.cn/79/c4/79c40a57d66ef8157bb5344be9e988c8_390x502.png)

```
# Both value columns (sells and cumsells) are shown, under multi-level columns
df_pivot.pivot(index='date', columns='group')
```

![](https://img.kancloud.cn/5b/38/5b3802c85d9c69ab645a6ebaed2046bb_986x355.png)

*****

**STACK**

![](https://img.kancloud.cn/b6/5d/b65df2a526ffd67b467fea64c4ce4d68_1221x539.png)

**stack() and unstack()** both work with multi-level indexes:

* stack(): moves a column level into the index
* unstack(): moves an index level into the columns

```
# Zip the two lists into a list of tuples
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]))
# Build a multi-level index from the tuples; the two levels are named first and second
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
# Generate an 8 x 2 array of random numbers, use index as the row index, and set the column names
df_mul = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df_mul
```

![](https://img.kancloud.cn/f4/d0/f4d02b5127dd77a72d5cb0f776cc5f1d_338x483.png)

```
# Move the column labels into the innermost index level, so that each value
# is identified by the index alone instead of by an index and a column
df_mul.stack()
```

![](https://img.kancloud.cn/58/be/58be65fa0e89c75e1b6cb1dbd5d14922_273x344.png)

How many rows does the resulting data set have?

* m: number of rows
* n: number of columns
* total: m * n (for df_mul above, 8 * 2 = 16, assuming no missing values)

*****

stack() and unstack() can be viewed as inverse operations. unstack takes values that are identified by a multi-level index alone and turns them into values identified by an index together with a column label.

![](https://img.kancloud.cn/d2/1f/d21fa077cf098671e34035b35d40bfd0_1080x575.png)

![](https://img.kancloud.cn/7c/e4/7ce4c1499e0540d2235c540d98fd39ae_1024x567.png)

![](https://img.kancloud.cn/f4/87/f4873093b13c7a0adb1d1810fa5abd9e_1080x573.png)

*****

**Melt**

![](https://img.kancloud.cn/7c/8d/7c8df0549056bdf71bd238fa5b4cefa3_1065x495.png)

melt keeps some columns as identifier variables (id_vars) and treats the other columns as measured variables (value_vars). The function generates two columns, "variable" and "value"; their names can be customized with the var_name and value_name parameters.

```
pd.melt(frame, id_vars=None, value_vars=None, var_name=None, value_name='value', col_level=None)
```

![](https://img.kancloud.cn/e2/ff/e2ff9d7f4100ba82d9d6e6058878f816_705x238.png)

![](https://img.kancloud.cn/85/01/8501b404d79c78f84a72291a372f74d7_680x240.png)

*****

### 2.4.2. Pivot table

For this part, see Section 10.4 of 《利用Python進行數據分析》 (*Python for Data Analysis*).

```
# df is assumed here to be a data set with 'color', 'director_name' and 'duration' columns,
# not the small df defined in 2.4.1. The string 'sum' replaces np.sum, which recent pandas versions warn about.
pd.pivot_table(df, values=['duration'], columns=['director_name'], index=['color'], aggfunc=['sum'], margins=True)
```

**pivot_table** provides functionality similar to Excel pivot tables. The key parameters are:

![](https://img.kancloud.cn/58/ce/58ce4dfbe2880c36b5f905933ba036b1_1451x290.png)

![](https://img.kancloud.cn/4d/03/4d03e8f6bdc0d6b3d5b264c460f2ccc7_1402x298.png)

*****

**crosstab** computes a cross-tabulation of two or more factors. By default it computes a frequency table of the factors.

![](https://img.kancloud.cn/a4/c1/a4c1bbe9ca892c3e184a276c54f1c086_610x498.png)

*****

![](https://img.kancloud.cn/de/34/de344a733615ada277341b1717a1d739_1428x556.png)
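
To make pivot_table and crosstab concrete, here is a minimal sketch on a small made-up table. The DataFrame `sales`, its columns `team`, `region` and `amount`, and all of the values are assumptions chosen for illustration; they do not come from the data sets used in this section.

```python
import pandas as pd

# Hypothetical sales records; names and values are made up for illustration.
sales = pd.DataFrame({
    "team":   ["A", "A", "B", "B", "B", "C"],
    "region": ["north", "south", "north", "north", "south", "south"],
    "amount": [10, 20, 30, 40, 50, 60],
})

# pivot_table aggregates duplicate (index, columns) pairs, here with a sum.
# The pair (B, north) occurs twice, which plain pivot would reject.
# fill_value=0 fills the missing (C, north) cell; margins=True adds "All" totals.
table = pd.pivot_table(sales, values="amount", index="team", columns="region",
                       aggfunc="sum", fill_value=0, margins=True)

# crosstab counts how often each (team, region) pair occurs by default.
counts = pd.crosstab(sales["team"], sales["region"])

print(table)
print(counts)
```

The difference between the two is what fills the cells: pivot_table aggregates a value column (amount), while crosstab with no extra arguments only counts occurrences.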