2.1 Index selection · soton_Data Analysis

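[TOC]

*****

**Index selection:** filter out a subset of the data according to some condition, much like a query in SQL.

* Slicing by integer position is half-open: the start is included and the end is excluded.
* Slicing by label name in pandas is closed at both ends.
* A position outside the index range or a label that does not exist raises an error (`IndexError` for a single out-of-range position with `iloc`, `KeyError` for a missing label with `loc`).

*****

A label is a value in the index. There are two ways to select by index:

1. With `loc` (select by the label names of the index)
2. With `iloc` (select by the integer positions of rows and columns)

*****

loc

![](https://img.kancloud.cn/da/6a/da6acb2132ae136c712ca5495bc7bd15_1280x447.png)

There is no need to use a callable function for index selection.

*****

iloc

![](https://img.kancloud.cn/93/ee/93eece3aafdf5b94085af89a2b970fbf_1010x317.png)

*****

### 2.1.1. Label-based selection: .loc

**Series operations**

```
import numpy as np
import pandas as pd

# Generate 6 random numbers and build a Series with the labels 'a'..'f' as its index
s1 = pd.Series(np.random.randn(6), index=list('abcdef'))
```

![](https://img.kancloud.cn/dd/ee/ddee452c845548220c87b0f512a8675f_202x193.png)

*****

Select data by a single label name

```
s1.loc['c']
# Result (the values are random, so yours will differ)
0.7140279186233623
```

*****

Select data by a list of labels

```
s1.loc[['a','e','f']]
```

![](https://img.kancloud.cn/e5/72/e572641fbf69dde38ef0e4920d5430dc_160x92.png)

*****

Select data by a label slice. The slice follows the order in which the labels were created, and both endpoints are included.

```
s1.loc['d':'f']
```

![](https://img.kancloud.cn/ea/52/ea520d78084daa1010d0270f2f117961_152x97.png)

*****

Test whether each element of s1 is greater than 0; this returns a Boolean array

![](https://img.kancloud.cn/15/3f/153fe7cb123ee8060f7308dd9d749dfb_131x197.png)

Select data with the Boolean array

![](https://img.kancloud.cn/6a/29/6a29d401de85392d70dcacf610aff7bf_168x157.png)

**DataFrame operations**

A DataFrame is two-dimensional, so both the index and the columns can be used for selection. The row index of the DataFrame used here is made of integers.

*****

Select data by row and column

![](https://img.kancloud.cn/0f/d0/0fd06245749ddd913615bc1769f4dcd1_297x125.png)

*****

Select by list

![](https://img.kancloud.cn/f9/33/f9331e5f51fd2038b2e27285836ddfc7_363x210.png)

*****

Select by slice

![](https://img.kancloud.cn/c3/df/c3df634f925ec7ca262beba7d9811b0e_360x270.png)

*****

Select rows with a Boolean array built from the height column: `df.loc[df['height'] >= 200,['Player','height']]`

![](https://img.kancloud.cn/cf/0d/cf0d2e9aaa26c1cc4fe6f134bfd321da_318x344.png)

The following code produces a Series

![](https://img.kancloud.cn/db/8d/db8daa81a52f23c0bdd4696693736a0d_245x269.png)

*****

### 2.1.2. Position-based selection: .iloc

Select data by positional subscripts of the index

![](https://img.kancloud.cn/34/b5/34b5242ad9af80a5a75cd5e19d0d6bcf_389x560.png)

Conventional slicing uses integer positional subscripts

![](https://img.kancloud.cn/06/fa/06fae5659b5cd71784b9d78fb7ab2760_214x209.png)

*****

Select rows by a list

![](https://img.kancloud.cn/90/0b/900b4472addb1ef6a42ec3bf6cb83c8e_677x193.png)

Select rows by a slice of integer positions

![](https://img.kancloud.cn/6a/17/6a17d3942839d7386f3e41e47ab18873_659x203.png)

*****

Use the Player column as the index

![](https://img.kancloud.cn/9e/ba/9ebab083ee7fff946fe7f19dc9f6c042_694x352.png)

*****

When selecting rows or columns with `iloc`, only integer positions can be used. In the example, a slice selects the rows and a list selects the columns; the comma separates the row argument from the column argument.

![](https://img.kancloud.cn/75/e8/75e8299c2a04c283ba246c5a9b00aa98_235x429.png)

### 2.1.3. Random sampling

Use the `sample()` method to randomly select rows or columns; by default it samples rows. The function accepts one argument that specifies either the number of items to return or the fraction of the total.

```
# Randomly draw 10 rows
df.sample(10)
```

![](https://img.kancloud.cn/63/f4/63f4552ed83d8e7705da020e0413de3a_789x385.png)

*****

```
# Randomly return one percent of all rows
df.sample(frac=0.01)
```

![](https://img.kancloud.cn/eb/d2/ebd2dcf9ff7ebbca3813d4fdc43c4811_887x442.png)

*****

**Sample columns by setting axis**

`.head()` shows the first five rows by default

```
# Randomly draw three columns and show the first five rows
df.sample(n = 3,axis = 1).head()
```

![](https://img.kancloud.cn/aa/db/aadbeb4d24226aedff59399e4ee4ab3a_362x206.png)

*****

`sample()` also takes `random_state`, a random seed. With a fixed seed, the same random selection is returned on every run: `df.sample(n = 3,axis = 1,random_state=10).head()`

![](https://img.kancloud.cn/21/66/2166d5e61e432436e37a74d06e53bf98_300x206.png)

*****

### 2.1.4. Using isin()

This function returns a Boolean vector that tells, for each value in the Series, whether it appears in the given list. The vector can then be used as a condition to filter out multiple rows.

```
# Filter for "Chicago" and "New York"
s = df['birth_city']
s.isin(["Chicago","New York"])
```

![](https://img.kancloud.cn/da/df/dadffee8b64561eaf3d8358187d35fa1_266x228.png)

Indexing a DataFrame with a Boolean vector selects rows by default

![](https://img.kancloud.cn/f2/7d/f27d42f87b101ba7d45a002dda5c8c24_364x278.png)

*****

![](https://img.kancloud.cn/53/4d/534d313717771c475b5c5a202a686af9_790x290.png)

*****

**Boolean selection on multiple columns**

For a DataFrame, `isin()` can take a dict whose keys are the corresponding column names.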
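As a minimal sketch of the two forms of `isin()` described above, the snippet below uses a tiny made-up table in place of the players data (the rows and the `height` values are assumptions for illustration only).

```python
import pandas as pd

# Hypothetical miniature of the players table; the rows are invented for illustration.
df = pd.DataFrame({
    "Player": ["A", "B", "C", "D"],
    "birth_city": ["Chicago", "Boston", "New York", "Chicago"],
    "height": [180, 198, 211, 175],
})

# Series.isin: one Boolean per row, True where the value is in the list.
city_mask = df["birth_city"].isin(["Chicago", "New York"])
chicago_or_ny = df[city_mask]  # a Boolean Series selects rows

# DataFrame.isin with a dict: keys are column names, values are the allowed
# values for that column. Columns not named in the dict are False everywhere.
mask = df.isin({"birth_city": ["Chicago"], "height": [180, 211]})

print(chicago_or_ny)
print(mask)
```

The dict form returns a Boolean DataFrame with the same shape as `df`, so it still has to be reduced to one Boolean per row before it can filter rows.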
We often combine `isin()` with `all()` or `any()` to filter data:

* `all`: True when every element along the given axis is True
* `any`: True when at least one element along the given axis is True

![](https://img.kancloud.cn/6a/05/6a054fe5de920b6d770cda8dac5344b2_859x538.png)

*****

Find the columns whose values are all True

![](https://img.kancloud.cn/64/1b/641b1c594e247b1f6737fe141c0cb632_480x77.png)

*****

`axis=1` evaluates each row and `axis=0` evaluates each column. With `any(axis=1)`, a row that contains at least one True is marked True, and `loc` then returns those rows.

![](https://img.kancloud.cn/47/66/4766bfcb196c65ed8ecf714b1b951ade_223x277.png)

![](https://img.kancloud.cn/9c/11/9c113f8064c5225a6255001148bdecff_804x302.png)

### 2.1.5. Data filtering

`loc` is powerful enough to support many complex operations on the data. The first is filtering, similar to the WHERE clause in SQL: select the players with height >= 180 and weight >= 80.

*****

Select the rows that satisfy the condition

![](https://img.kancloud.cn/5d/7c/5d7c1f29f5b874a065a06f061e893573_1081x286.png)

*****

* If height >= 180 and weight >= 80, the value is "high"
* If 170 <= height <= 180 and 70 <= weight <= 80, the value is "msize"
* Otherwise the value is "small"

The steps run in order, so a row with height 180 and weight 80, which satisfies both of the first two conditions, ends up as "msize".

```
#1 Create a flag column and write "high" into it for the rows that satisfy the condition
df.loc[(df['height'] >=180) & (df['weight'] >=80),"flag"] = "high"
```

![](https://img.kancloud.cn/dc/5f/dc5f7dd89f118628501fc5365e829c15_949x243.png)

*****

```
#2 Write "msize" into the flag column for the rows that satisfy the condition
df.loc[((df['height'] <=180) & (df['height']>=170)) & ((df['weight'] <=80) & (df['weight'] >=70)),"flag"] = "msize"
```

*****

```
#3 Every other row gets "small". ~ negates conditions 1 and 2 combined,
#  so it matches the rows that satisfy neither of them.
df.loc[~(((df['height'] >=180) & (df['weight'] >=80)) |(((df['height'] <=180) & (df['height']>=170))&((df['weight'] <=80) & (df['weight'] >=70)))),"flag"] = "small"
```

*****

Count the frequency of each value in the flag column

![](https://img.kancloud.cn/d5/65/d5655184be2d7c3e25996356649b8ba5_278x137.png)

*****

Test the height of each row against a condition; rows that satisfy it are marked True

![](https://img.kancloud.cn/c5/0c/c50c77b5864633e988f302347e82a512_194x282.png)

*****

### 2.1.6. The query() method

Filter data with an expression, similar to a WHERE expression in SQL.

![](https://img.kancloud.cn/41/fc/41fc4dd935033d4fecbf7170c696aca8_1006x311.png)

![](https://img.kancloud.cn/d6/2a/d62a01a25d1615450e5b8d189579f099_970x262.png)

**Note** Inside `query()`, a Python variable cannot be referenced by its bare name; prefix the variable name with `@`.

![](https://img.kancloud.cn/44/9c/449c7f5c1649e7377f7006e866a7653e_878x396.png)

### 2.1.7. Setting the index

The `set_index()` method turns one or more columns into the index.

```
# keys: column(s) to use as the index; drop: remove those columns from the data (default True)
# append: add to the existing index instead of replacing it; inplace: modify the original DataFrame
df.set_index(keys, drop=True, append=False, inplace=False, verify_integrity=False)
```

*****

Use the Player and collage columns as the index

![](https://img.kancloud.cn/ab/f8/abf8cbca28162f4db031f55af1b3d92f_729x329.png)

*****

Move the index back into the DataFrame as ordinary columns and restore a simple integer index

```
df1.reset_index(level=None, drop=False, inplace=False, col_level=0, col_fill='')
```

![](https://img.kancloud.cn/f2/b3/f2b370f2cb84b8440c029f5177310a6d_994x321.png)

### 2.1.8. The where method

* Selecting with a Boolean array returns only a subset of the data
* `where()` guarantees that the result has the same structure as the original data

*****

Boolean selection keeps only the elements that satisfy the condition, together with their index labels

![](https://img.kancloud.cn/4a/cd/4acd967abfc95c919be5e85009457c11_323x179.png)

*****

`where()` keeps the structure of the original data; entries that do not satisfy the condition become NaN

![](https://img.kancloud.cn/91/4c/914c840a4d35a53db3aa413beb3b0b4c_144x239.png)

### 2.1.9. Duplicate data

* `duplicated` returns a Boolean array with one entry per row that tells whether the row is a duplicate; True means duplicate
* `drop_duplicates` removes duplicate rows

*****

**The keep parameter controls which rows are kept**

* `keep='first'` (default): mark / drop duplicates except for the first occurrence
* `keep='last'`: mark / drop duplicates except for the last occurrence
* `keep=False`: mark / drop all duplicates

*****

![](https://img.kancloud.cn/40/5e/405efc776e1f04f0fa603e502fcdf540_688x426.png)

*****

```
# Check whether any row duplicates an earlier row
df2.duplicated()
```

![](https://img.kancloud.cn/f5/53/f55365e770d85ff5df540c4da8332150_134x168.png)

*****

```
# Check column a for duplicate values, marking every occurrence
df2.duplicated('a',keep = False)
```

![](https://img.kancloud.cn/91/23/912360baa7d65a18c433cb10bcf6af40_150x175.png)

![](https://img.kancloud.cn/66/68/6668e5eb1578fda72546a88cfc15a1ca_355x214.png)

![](https://img.kancloud.cn/5b/25/5b2550a8179c04b1c82885647fa5729d_317x208.png)

*****

Drop duplicate rows

![](https://img.kancloud.cn/bd/be/bdbe065c0e44605dc8db3b964753f54f_253x321.png)

*****

Drop rows whose a and b columns are both identical. The second and fourth rows are equal, and by default the fourth row is dropped.

![](https://img.kancloud.cn/3c/9d/3c9deb5959b14d203ccd8b95783565a9_362x298.png)

*****

With `keep = 'last'`, the last occurrence is kept and the duplicates before it are dropped

![](https://img.kancloud.cn/f5/4c/f54c0ed60c897ca3787713185d023fbd_472x316.png)

### 2.1.10. MultiIndex

A hierarchical index lets us work with more complex data.

```
arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
# Build a two-level index from the nested list; the first level is named first, the second is named second
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])
```

The first level has four distinct values and the second level has two. The first level falls into four groups of two identical values each; the second level also falls into four groups, each holding two different values.

![](https://img.kancloud.cn/85/6f/856f8792c288004dc46471f63c94fb1b_642x116.png)

*****

Convert the arrays into tuples

![](https://img.kancloud.cn/6f/06/6f0644f051209f9a3c7541a551449152_267x238.png)

*****

```
# Build a two-level index from the tuples created in the previous step
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
# Get the values of the first level
index.get_level_values(0)
```

![](https://img.kancloud.cn/51/47/5147bebbf83af9be356deaa9ba336934_821x44.png)

*****

```
# Create a Series of random numbers that uses the multi-level index as its index
pd.Series(np.random.randn(8), index=index)
```

![](https://img.kancloud.cn/9d/af/9dafa7a1346800a97bc92fd87e9b5ddc_260x217.png)

*****

**Index selection**

```
# Create a random data set with 8 rows and 4 columns. The two-level index is the row index and list('ABCD') gives the columns
df = pd.DataFrame(np.random.randn(8, 4), index=index,columns = list('ABCD'))
```

![](https://img.kancloud.cn/1d/a3/1da36028bec0be3b1ceb0ad802f8f51e_432x384.png)

*****

```
# Using a multi-level index is like looking up a chapter and a section in a book:
# the chapter is the first level and the section is the second level.
# loc selects by label name; the labels of the levels go together in one tuple
df.loc[('bar', 'two')]
```

![](https://img.kancloud.cn/35/b8/35b8b2435b15fbe33004b48885142b20_299x107.png)

*****

Select the value where a given multi-level row label meets a given column

![](https://img.kancloud.cn/9b/ff/9bff5de74cf6166479e326d7b646b96c_272x80.png)

*****

Put several multi-level labels in a list to select several rows

![](https://img.kancloud.cn/fc/3e/fc3e44b72dee26071f66573e7258bcff_446x194.png)

*****

Select all rows under the first-level label bar

![](https://img.kancloud.cn/72/58/7258ab97f296212c54864def864772e7_410x139.png)

*****

With plain labels, a multi-level index cannot skip an outer level and select rows by an inner level alone. For example, trying to select every row whose second-level label is 'one' this way raises an error.

![](https://img.kancloud.cn/13/b0/13b01a07db3c85e3dc02fefa62ac947b_462x128.png)

*****

**Slicers can be used on a MultiIndex**

* You can use lists, tuples, or Boolean arrays as indexers
* You can use `slice(None)` to select everything on a level. You do not need to specify every level; the unspecified deeper levels are implied as `slice(None)`
* All axes must be specified, meaning both the index and the columns are stated explicitly

**Correct**

~~~python
# Select first-level labels A1 through A3 (the ... stands for the indexers of the remaining levels) and all columns
df.loc[(slice('A1', 'A3'), ...), :]
~~~

**Wrong X** The columns are not specified

~~~python
df.loc[(slice('A1', 'A3'), ...)]
~~~

*****

![](https://img.kancloud.cn/7f/80/7f808e0aee43fd87aa360b083c5866a0_431x390.png)

*****

```
# slice(None) selects every label on the first level, 'one' selects the label 'one' on the second level, and : selects all columns
df.loc[(slice(None),'one'),:]
```

![](https://img.kancloud.cn/eb/99/eb995eb8b60fa94bb463cc7587e05c06_424x212.png)

*****

```
# IndexSlice offers a more natural syntax and can replace slice
# Create the IndexSlice
idx = pd.IndexSlice
# idx[:, 'one']: every label on the first level and 'one' on the second level. The final : selects all columns.
df.loc[idx[:,'one'],:]
```

![](https://img.kancloud.cn/6e/fa/6efa1b050d553a9c567b2ca74c673c82_427x231.png)

*****

![](https://img.kancloud.cn/d5/7f/d57fb5042be9f32248f0f0888c337142_321x253.png)

*****

**The xs() function selects data on a specified level of the index**

`level=1` refers to the second level; the first level is `level=0`. Select the rows whose second-level label is 'one'.

![](https://img.kancloud.cn/53/4f/534fe3d5bcf2f0b530e72759d6612295_382x259.png)

![](https://img.kancloud.cn/5b/95/5b9532e8d3c5222f2afae874066795dc_438x276.png)

![](https://img.kancloud.cn/68/24/6824bb2a9404c002fae28e86b0cfda5f_444x178.png)

![](https://img.kancloud.cn/ee/c6/eec6662ea9a2c51e26f4a4814bc69aa5_443x202.png)

**Sorting the index**

Sort by the first level first, then sort the lower level within each group

![](https://img.kancloud.cn/51/de/51de38ceff6464d1672040b4247e10a3_450x390.png)

*****

Sort the rows by the second level of the index

![](https://img.kancloud.cn/9a/e0/9ae0d503ce90504a837ac8ad121a453f_432x398.png)
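
The MultiIndex selection and sorting steps above can be pulled together in one runnable sketch. It reuses the `arrays` from this section but, as an assumption made only to keep the results checkable, fills the frame with `np.arange` instead of random numbers.

```python
import numpy as np
import pandas as pd

arrays = [['bar', 'bar', 'baz', 'baz', 'foo', 'foo', 'qux', 'qux'],
          ['one', 'two', 'one', 'two', 'one', 'two', 'one', 'two']]
index = pd.MultiIndex.from_arrays(arrays, names=['first', 'second'])

# Deterministic values (an assumption for this sketch) instead of np.random.randn
df = pd.DataFrame(np.arange(32).reshape(8, 4), index=index, columns=list('ABCD'))

# Three ways to select every row whose second-level label is 'one'
by_slice = df.loc[(slice(None), 'one'), :]   # explicit slice(None) on level 0
idx = pd.IndexSlice
by_idx = df.loc[idx[:, 'one'], :]            # same selection with IndexSlice
by_xs = df.xs('one', level=1)                # xs drops the selected level by default

# Sort rows by the second level; the remaining level breaks ties
by_second = df.sort_index(level=1)

print(by_slice)
print(by_xs)
print(by_second)
```

`by_slice` and `by_idx` keep both index levels, while `by_xs` is indexed by the first level only; pass `drop_level=False` to `xs()` if the full index is needed.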