# 十九、數據整理(上)
> 作者:[Chris Albon](https://chrisalbon.com/)
>
> 譯者:[飛龍](https://github.com/wizardforcel)
>
> 協議:[CC BY-NC-SA 4.0](http://creativecommons.org/licenses/by-nc-sa/4.0/)
## 在 Pandas 中通過分組應用函數
```py
import pandas as pd
# 創建示例數據幀
data = {'Platoon': ['A','A','A','A','A','A','B','B','B','B','B','C','C','C','C','C'],
'Casualties': [1,4,5,7,5,5,6,1,4,5,6,7,4,6,4,6]}
df = pd.DataFrame(data)
df
```
| | Casualties | Platoon |
| --- | --- | --- |
| 0 | 1 | A |
| 1 | 4 | A |
| 2 | 5 | A |
| 3 | 7 | A |
| 4 | 5 | A |
| 5 | 5 | A |
| 6 | 6 | B |
| 7 | 1 | B |
| 8 | 4 | B |
| 9 | 5 | B |
| 10 | 6 | B |
| 11 | 7 | C |
| 12 | 4 | C |
| 13 | 6 | C |
| 14 | 4 | C |
| 15 | 6 | C |
```py
# 按照 df.platoon 對 df 分組
# 然后將滾動平均 lambda 函數應用于 df.casualties
df.groupby('Platoon')['Casualties'].apply(lambda x:x.rolling(center=False,window=2).mean())
'''
0 NaN
1 2.5
2 4.5
3 6.0
4 6.0
5 5.0
6 NaN
7 3.5
8 2.5
9 4.5
10 5.5
11 NaN
12 5.5
13 5.0
14 5.0
15 5.0
dtype: float64
'''
```
## 在 Pandas 中向分組應用操作
```py
# 導入模塊
import pandas as pd
# 創建數據幀
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],
'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df
```
| | regiment | company | name | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- |
| 0 | Nighthawks | 1st | Miller | 4 | 25 |
| 1 | Nighthawks | 1st | Jacobson | 24 | 94 |
| 2 | Nighthawks | 2nd | Ali | 31 | 57 |
| 3 | Nighthawks | 2nd | Milner | 2 | 62 |
| 4 | Dragoons | 1st | Cooze | 3 | 70 |
| 5 | Dragoons | 1st | Jacon | 4 | 25 |
| 6 | Dragoons | 2nd | Ryaner | 24 | 94 |
| 7 | Dragoons | 2nd | Sone | 31 | 57 |
| 8 | Scouts | 1st | Sloan | 2 | 62 |
| 9 | Scouts | 1st | Piger | 3 | 70 |
| 10 | Scouts | 2nd | Riani | 2 | 62 |
| 11 | Scouts | 2nd | Ali | 3 | 70 |
```py
# 創建一個 groupby 變量,按團隊(regiment)對 preTestScores 分組
groupby_regiment = df['preTestScore'].groupby(df['regiment'])
groupby_regiment
# <pandas.core.groupby.SeriesGroupBy object at 0x113ddb550>
```
“這個分組變量現在是`GroupBy`對象。 除了分組的鍵`df ['key1']`的一些中間數據之外,它實際上還沒有計算任何東西。 我們的想法是,該對象具有將所有操作應用于每個分組所需的所有信息。” -- PyDA
使用`list()`顯示分組的樣子。
```py
list(df['preTestScore'].groupby(df['regiment']))
'''
[('Dragoons', 4 3
5 4
6 24
7 31
Name: preTestScore, dtype: int64), ('Nighthawks', 0 4
1 24
2 31
3 2
Name: preTestScore, dtype: int64), ('Scouts', 8 2
9 3
10 2
11 3
Name: preTestScore, dtype: int64)]
'''
df['preTestScore'].groupby(df['regiment']).describe()
```
| | count | mean | std | min | 25% | 50% | 75% | max |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| regiment | | | | | | | | |
| Dragoons | 4.0 | 15.50 | 14.153916 | 3.0 | 3.75 | 14.0 | 25.75 | 31.0 |
| Nighthawks | 4.0 | 15.25 | 14.453950 | 2.0 | 3.50 | 14.0 | 25.75 | 31.0 |
| Scouts | 4.0 | 2.50 | 0.577350 | 2.0 | 2.00 | 2.5 | 3.00 | 3.0 |
```py
# 每個團隊的 preTestScore 均值
groupby_regiment.mean()
'''
regiment
Dragoons 15.50
Nighthawks 15.25
Scouts 2.50
Name: preTestScore, dtype: float64
'''
df['preTestScore'].groupby([df['regiment'], df['company']]).mean()
'''
regiment company
Dragoons 1st 3.5
2nd 27.5
Nighthawks 1st 14.0
2nd 16.5
Scouts 1st 2.5
2nd 2.5
Name: preTestScore, dtype: float64
'''
df['preTestScore'].groupby([df['regiment'], df['company']]).mean().unstack()
```
| company | 1st | 2nd |
| --- | --- | --- |
| regiment | | |
| Dragoons | 3.5 | 27.5 |
| Nighthawks | 14.0 | 16.5 |
| Scouts | 2.5 | 2.5 |
```py
# 按團隊和公司(company)對整個數據幀分組
df.groupby(['regiment', 'company']).mean()
```
| | | preTestScore | postTestScore |
| --- | --- | --- | --- |
| regiment | company | | |
| Dragoons | 1st | 3.5 | 47.5 |
| 2nd | 27.5 | 75.5 |
| Nighthawks | 1st | 14.0 | 59.5 |
| 2nd | 16.5 | 59.5 |
| Scouts | 1st | 2.5 | 66.0 |
| 2nd | 2.5 | 66.0 |
```py
# 每個團隊和公司的觀測數量
df.groupby(['regiment', 'company']).size()
'''
regiment company
Dragoons 1st 2
2nd 2
Nighthawks 1st 2
2nd 2
Scouts 1st 2
2nd 2
dtype: int64
'''
# 按團隊對數據幀分組,對于每個團隊,
for name, group in df.groupby('regiment'):
# 打印團隊名稱
print(name)
# 打印它的數據
print(group)
'''
Dragoons
regiment company name preTestScore postTestScore
4 Dragoons 1st Cooze 3 70
5 Dragoons 1st Jacon 4 25
6 Dragoons 2nd Ryaner 24 94
7 Dragoons 2nd Sone 31 57
Nighthawks
regiment company name preTestScore postTestScore
0 Nighthawks 1st Miller 4 25
1 Nighthawks 1st Jacobson 24 94
2 Nighthawks 2nd Ali 31 57
3 Nighthawks 2nd Milner 2 62
Scouts
regiment company name preTestScore postTestScore
8 Scouts 1st Sloan 2 62
9 Scouts 1st Piger 3 70
10 Scouts 2nd Riani 2 62
11 Scouts 2nd Ali 3 70
'''
```
按列分組:
特別是在這種情況下:按列對數據類型(即`axis = 1`)分組,然后使用`list()`查看該分組的外觀。
```py
list(df.groupby(df.dtypes, axis=1))
'''
[(dtype('int64'), preTestScore postTestScore
0 4 25
1 24 94
2 31 57
3 2 62
4 3 70
5 4 25
6 24 94
7 31 57
8 2 62
9 3 70
10 2 62
11 3 70),
(dtype('O'), regiment company name
0 Nighthawks 1st Miller
1 Nighthawks 1st Jacobson
2 Nighthawks 2nd Ali
3 Nighthawks 2nd Milner
4 Dragoons 1st Cooze
5 Dragoons 1st Jacon
6 Dragoons 2nd Ryaner
7 Dragoons 2nd Sone
8 Scouts 1st Sloan
9 Scouts 1st Piger
10 Scouts 2nd Riani
11 Scouts 2nd Ali)]
df.groupby('regiment').mean().add_prefix('mean_')
```
| | mean_preTestScore | mean_postTestScore |
| --- | --- | --- |
| regiment | | |
| Dragoons | 15.50 | 61.5 |
| Nighthawks | 15.25 | 59.5 |
| Scouts | 2.50 | 66.0 |
```py
# 創建獲取分組狀態的函數
def get_stats(group):
return {'min': group.min(), 'max': group.max(), 'count': group.count(), 'mean': group.mean()}
bins = [0, 25, 50, 75, 100]
group_names = ['Low', 'Okay', 'Good', 'Great']
df['categories'] = pd.cut(df['postTestScore'], bins, labels=group_names)
df['postTestScore'].groupby(df['categories']).apply(get_stats).unstack()
```
| | count | max | mean | min |
| --- | --- | --- | --- | --- |
| categories | | | | |
| Good | 8.0 | 70.0 | 63.75 | 57.0 |
| Great | 2.0 | 94.0 | 94.00 | 94.0 |
| Low | 2.0 | 25.0 | 25.00 | 25.0 |
| Okay | 0.0 | NaN | NaN | NaN |
## 在 Pandas 數據幀上應用操作
```py
# 導入模型
import pandas as pd
import numpy as np
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'year': [2012, 2012, 2013, 2014, 2014],
'reports': [4, 24, 31, 2, 3],
'coverage': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df
```
| | coverage | name | reports | year |
| --- | --- | --- | --- | --- |
| Cochice | 25 | Jason | 4 | 2012 |
| Pima | 94 | Molly | 24 | 2012 |
| Santa Cruz | 57 | Tina | 31 | 2013 |
| Maricopa | 62 | Jake | 2 | 2014 |
| Yuma | 70 | Amy | 3 | 2014 |
```py
# 創建大寫轉換的 lambda 函數
capitalizer = lambda x: x.upper()
```
將`capitalizer`函數應用于`name`列。
`apply()`可以沿數據幀的任意軸應用函數。
```py
df['name'].apply(capitalizer)
'''
Cochice JASON
Pima MOLLY
Santa Cruz TINA
Maricopa JAKE
Yuma AMY
Name: name, dtype: object
'''
```
將`capitalizer` lambda 函數映射到序列`name`中的每個元素。
`map()`對序列的每個元素應用操作。
```py
df['name'].map(capitalizer)
'''
Cochice JASON
Pima MOLLY
Santa Cruz TINA
Maricopa JAKE
Yuma AMY
Name: name, dtype: object
'''
```
將平方根函數應用于整個數據幀中的每個單元格。
`applymap()`將函數應用于整個數據幀中的每個元素。
```py
# 刪除字符串變量,以便 applymap() 可以運行
df = df.drop('name', axis=1)
# 返回數據幀每個單元格的平方根
df.applymap(np.sqrt)
```
| | coverage | reports | year |
| --- | --- | --- | --- |
| Cochice | 5.000000 | 2.000000 | 44.855323 |
| Pima | 9.695360 | 4.898979 | 44.855323 |
| Santa Cruz | 7.549834 | 5.567764 | 44.866469 |
| Maricopa | 7.874008 | 1.414214 | 44.877611 |
| Yuma | 8.366600 | 1.732051 | 44.877611 |
在數據幀上應用函數。
```py
# 創建叫做 times100 的函數
def times100(x):
# 如果 x 是字符串,
if type(x) is str:
# 原樣返回它
return x
# 如果不是,返回它乘上 100
elif x:
return 100 * x
# 并留下其它東西
else:
return
df.applymap(times100)
```
| | coverage | reports | year |
| --- | --- | --- | --- |
| Cochice | 2500 | 400 | 201200 |
| Pima | 9400 | 2400 | 201200 |
| Santa Cruz | 5700 | 3100 | 201300 |
| Maricopa | 6200 | 200 | 201400 |
| Yuma | 7000 | 300 | 201400 |
## 向 Pandas 數據幀賦予新列
```py
import pandas as pd
# 創建空數據幀
df = pd.DataFrame()
# 創建一列
df['name'] = ['John', 'Steve', 'Sarah']
# 查看數據幀
df
```
| | name |
| --- | --- |
| 0 | John |
| 1 | Steve |
| 2 | Sarah |
```py
# 將一個新列賦予名為 age 的 df,它包含年齡列表
df.assign(age = [31, 32, 19])
```
| | name | age |
| --- | --- | --- |
| 0 | John | 31 |
| 1 | Steve | 32 |
| 2 | Sarah | 19 |
# 將列表拆分為大小為 N 的分塊
在這個片段中,我們接受一個列表并將其分解為大小為 n 的塊。 在處理具有最大請求大小的 API 時,這是一種非常常見的做法。
這個漂亮的函數由 Ned Batchelder 貢獻,發布于 [StackOverflow](http://stackoverflow.com/questions/312443/how-do-you-split-a-list-into-evenly-sized-chunks-in-python)。
```py
# 創建名稱列表
first_names = ['Steve', 'Jane', 'Sara', 'Mary','Jack','Bob', 'Bily', 'Boni', 'Chris','Sori', 'Will', 'Won','Li']
# 創建叫做 chunks 的函數,有兩個參數 l 和 n
def chunks(l, n):
# 對于長度為 l 的范圍中的項目 i
for i in range(0, len(l), n):
# 創建索引范圍
yield l[i:i+n]
# 從函數 chunks 的結果創建一個列表
list(chunks(first_names, 5))
'''
[['Steve', 'Jane', 'Sara', 'Mary', 'Jack'],
['Bob', 'Bily', 'Boni', 'Chris', 'Sori'],
['Will', 'Won', 'Li']]
'''
```
## 在 Pandas 中使用正則表達式將字符串分解為列
```py
# 導入模塊
import re
import pandas as pd
# 創建帶有一列字符串的數據幀
data = {'raw': ['Arizona 1 2014-12-23 3242.0',
'Iowa 1 2010-02-23 3453.7',
'Oregon 0 2014-06-20 2123.0',
'Maryland 0 2014-03-14 1123.6',
'Florida 1 2013-01-15 2134.0',
'Georgia 0 2012-07-14 2345.6']}
df = pd.DataFrame(data, columns = ['raw'])
df
```
| | raw |
| --- | --- |
| 0 | Arizona 1 2014-12-23 3242.0 |
| 1 | Iowa 1 2010-02-23 3453.7 |
| 2 | Oregon 0 2014-06-20 2123.0 |
| 3 | Maryland 0 2014-03-14 1123.6 |
| 4 | Florida 1 2013-01-15 2134.0 |
| 5 | Georgia 0 2012-07-14 2345.6 |
```py
# df['raw'] 的哪些行包含 'xxxx-xx-xx'?
df['raw'].str.contains('....-..-..', regex=True)
'''
0 True
1 True
2 True
3 True
4 True
5 True
Name: raw, dtype: bool
'''
# 在 raw 列中,提取字符串中的單個數字
df['female'] = df['raw'].str.extract('(\d)', expand=True)
df['female']
'''
0 1
1 1
2 0
3 0
4 1
5 0
Name: female, dtype: object
'''
# 在 raw 列中,提取字符串中的 xxxx-xx-xx
df['date'] = df['raw'].str.extract('(....-..-..)', expand=True)
df['date']
'''
0 2014-12-23
1 2010-02-23
2 2014-06-20
3 2014-03-14
4 2013-01-15
5 2012-07-14
Name: date, dtype: object
'''
# 在 raw 列中,提取字符串中的 ####.##
df['score'] = df['raw'].str.extract('(\d\d\d\d\.\d)', expand=True)
df['score']
'''
0 3242.0
1 3453.7
2 2123.0
3 1123.6
4 2134.0
5 2345.6
Name: score, dtype: object
'''
# 在 raw 列中,提取字符串中的單詞
df['state'] = df['raw'].str.extract('([A-Z]\w{0,})', expand=True)
df['state']
'''
0 Arizona
1 Iowa
2 Oregon
3 Maryland
4 Florida
5 Georgia
Name: state, dtype: object
'''
df
```
| | raw | female | date | score | state |
| --- | --- | --- | --- | --- | --- |
| 0 | Arizona 1 2014-12-23 3242.0 | 1 | 2014-12-23 | 3242.0 | Arizona |
| 1 | Iowa 1 2010-02-23 3453.7 | 1 | 2010-02-23 | 3453.7 | Iowa |
| 2 | Oregon 0 2014-06-20 2123.0 | 0 | 2014-06-20 | 2123.0 | Oregon |
| 3 | Maryland 0 2014-03-14 1123.6 | 0 | 2014-03-14 | 1123.6 | Maryland |
| 4 | Florida 1 2013-01-15 2134.0 | 1 | 2013-01-15 | 2134.0 | Florida |
| 5 | Georgia 0 2012-07-14 2345.6 | 0 | 2012-07-14 | 2345.6 | Georgia |
## 由兩個數據幀貢獻列
```py
# 導入庫
import pandas as pd
# 創建數據幀
dataframe_one = pd.DataFrame()
dataframe_one['1'] = ['1', '1', '1']
dataframe_one['B'] = ['b', 'b', 'b']
# 創建第二個數據幀
dataframe_two = pd.DataFrame()
dataframe_two['2'] = ['2', '2', '2']
dataframe_two['B'] = ['b', 'b', 'b']
# 將每個數據幀的列轉換為集合,
# 然后找到這兩個集合的交集。
# 這將是兩個數據幀共享的列的集合。
set.intersection(set(dataframe_one), set(dataframe_two))
# {'B'}
```
## 從多個列表構建字典
```py
# 創建官員名稱的列表
officer_names = ['Sodoni Dogla', 'Chris Jefferson', 'Jessica Billars', 'Michael Mulligan', 'Steven Johnson']
# 創建官員軍隊的列表
officer_armies = ['Purple Army', 'Orange Army', 'Green Army', 'Red Army', 'Blue Army']
# 創建字典,它是兩個列表的 zip
dict(zip(officer_names, officer_armies))
'''
{'Chris Jefferson': 'Orange Army',
'Jessica Billars': 'Green Army',
'Michael Mulligan': 'Red Army',
'Sodoni Dogla': 'Purple Army',
'Steven Johnson': 'Blue Army'}
'''
```
## 將 CSV 轉換為 Python 代碼來重建它
```py
# 導入 pandas 包
import pandas as pd
# 將 csv 文件加載為數據幀
df_original = pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv')
df = pd.read_csv('http://vincentarelbundock.github.io/Rdatasets/csv/datasets/iris.csv')
# 打印創建數據幀的代碼
print('==============================')
print('RUN THE CODE BELOW THIS LINE')
print('==============================')
print('raw_data =', df.to_dict(orient='list'))
print('df = pd.DataFrame(raw_data, columns = ' + str(list(df_original)) + ')')
'''
==============================
RUN THE CODE BELOW THIS LINE
==============================
raw_data = {'Sepal.Length': [5.0999999999999996, 4.9000000000000004, 4.7000000000000002, 4.5999999999999996, 5.0, 5.4000000000000004, 4.5999999999999996, 5.0, 4.4000000000000004, 4.9000000000000004, 5.4000000000000004, 4.7999999999999998, 4.7999999999999998, 4.2999999999999998, 5.7999999999999998, 5.7000000000000002, 5.4000000000000004, 5.0999999999999996, 5.7000000000000002, 5.0999999999999996, 5.4000000000000004, 5.0999999999999996, 4.5999999999999996, 5.0999999999999996, 4.7999999999999998, 5.0, 5.0, 5.2000000000000002, 5.2000000000000002, 4.7000000000000002, 4.7999999999999998, 5.4000000000000004, 5.2000000000000002, 5.5, 4.9000000000000004, 5.0, 5.5, 4.9000000000000004, 4.4000000000000004, 5.0999999999999996, 5.0, 4.5, 4.4000000000000004, 5.0, 5.0999999999999996, 4.7999999999999998, 5.0999999999999996, 4.5999999999999996, 5.2999999999999998, 5.0, 7.0, 6.4000000000000004, 6.9000000000000004, 5.5, 6.5, 5.7000000000000002, 6.2999999999999998, 4.9000000000000004, 6.5999999999999996, 5.2000000000000002, 5.0, 5.9000000000000004, 6.0, 6.0999999999999996, 5.5999999999999996, 6.7000000000000002, 5.5999999999999996, 5.7999999999999998, 6.2000000000000002, 5.5999999999999996, 5.9000000000000004, 6.0999999999999996, 6.2999999999999998, 6.0999999999999996, 6.4000000000000004, 6.5999999999999996, 6.7999999999999998, 6.7000000000000002, 6.0, 5.7000000000000002, 5.5, 5.5, 5.7999999999999998, 6.0, 5.4000000000000004, 6.0, 6.7000000000000002, 6.2999999999999998, 5.5999999999999996, 5.5, 5.5, 6.0999999999999996, 5.7999999999999998, 5.0, 5.5999999999999996, 5.7000000000000002, 5.7000000000000002, 6.2000000000000002, 5.0999999999999996, 5.7000000000000002, 6.2999999999999998, 5.7999999999999998, 7.0999999999999996, 6.2999999999999998, 6.5, 7.5999999999999996, 4.9000000000000004, 7.2999999999999998, 6.7000000000000002, 7.2000000000000002, 6.5, 6.4000000000000004, 6.7999999999999998, 5.7000000000000002, 5.7999999999999998, 6.4000000000000004, 6.5, 7.7000000000000002, 7.7000000000000002, 6.0, 6.9000000000000004, 5.5999999999999996, 7.7000000000000002, 6.2999999999999998, 6.7000000000000002, 7.2000000000000002, 6.2000000000000002, 6.0999999999999996, 6.4000000000000004, 7.2000000000000002, 7.4000000000000004, 7.9000000000000004, 6.4000000000000004, 6.2999999999999998, 6.0999999999999996, 7.7000000000000002, 6.2999999999999998, 6.4000000000000004, 6.0, 6.9000000000000004, 6.7000000000000002, 6.9000000000000004, 5.7999999999999998, 6.7999999999999998, 6.7000000000000002, 6.7000000000000002, 6.2999999999999998, 6.5, 6.2000000000000002, 5.9000000000000004], 'Petal.Width': [0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.29999999999999999, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.10000000000000001, 0.20000000000000001, 0.40000000000000002, 0.40000000000000002, 0.29999999999999999, 0.29999999999999999, 0.29999999999999999, 0.20000000000000001, 0.40000000000000002, 0.20000000000000001, 0.5, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.29999999999999999, 0.29999999999999999, 0.20000000000000001, 0.59999999999999998, 0.40000000000000002, 0.29999999999999999, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 1.3999999999999999, 1.5, 1.5, 1.3, 1.5, 1.3, 1.6000000000000001, 1.0, 1.3, 1.3999999999999999, 1.0, 1.5, 1.0, 1.3999999999999999, 1.3, 1.3999999999999999, 1.5, 1.0, 1.5, 1.1000000000000001, 1.8, 1.3, 1.5, 1.2, 1.3, 1.3999999999999999, 1.3999999999999999, 1.7, 1.5, 1.0, 1.1000000000000001, 1.0, 1.2, 1.6000000000000001, 1.5, 1.6000000000000001, 1.5, 1.3, 1.3, 1.3, 1.2, 1.3999999999999999, 1.2, 1.0, 1.3, 1.2, 1.3, 1.3, 1.1000000000000001, 1.3, 2.5, 1.8999999999999999, 2.1000000000000001, 1.8, 2.2000000000000002, 2.1000000000000001, 1.7, 1.8, 1.8, 2.5, 2.0, 1.8999999999999999, 2.1000000000000001, 2.0, 2.3999999999999999, 2.2999999999999998, 1.8, 2.2000000000000002, 2.2999999999999998, 1.5, 2.2999999999999998, 2.0, 2.0, 1.8, 2.1000000000000001, 1.8, 1.8, 1.8, 2.1000000000000001, 1.6000000000000001, 1.8999999999999999, 2.0, 2.2000000000000002, 1.5, 1.3999999999999999, 2.2999999999999998, 2.3999999999999999, 1.8, 1.8, 2.1000000000000001, 2.3999999999999999, 2.2999999999999998, 1.8999999999999999, 2.2999999999999998, 2.5, 2.2999999999999998, 1.8999999999999999, 2.0, 2.2999999999999998, 1.8], 'Petal.Length': [1.3999999999999999, 1.3999999999999999, 1.3, 1.5, 1.3999999999999999, 1.7, 1.3999999999999999, 1.5, 1.3999999999999999, 1.5, 1.5, 1.6000000000000001, 1.3999999999999999, 1.1000000000000001, 1.2, 1.5, 1.3, 1.3999999999999999, 1.7, 1.5, 1.7, 1.5, 1.0, 1.7, 1.8999999999999999, 1.6000000000000001, 1.6000000000000001, 1.5, 1.3999999999999999, 1.6000000000000001, 1.6000000000000001, 1.5, 1.5, 1.3999999999999999, 1.5, 1.2, 1.3, 1.3999999999999999, 1.3, 1.5, 1.3, 1.3, 1.3, 1.6000000000000001, 1.8999999999999999, 1.3999999999999999, 1.6000000000000001, 1.3999999999999999, 1.5, 1.3999999999999999, 4.7000000000000002, 4.5, 4.9000000000000004, 4.0, 4.5999999999999996, 4.5, 4.7000000000000002, 3.2999999999999998, 4.5999999999999996, 3.8999999999999999, 3.5, 4.2000000000000002, 4.0, 4.7000000000000002, 3.6000000000000001, 4.4000000000000004, 4.5, 4.0999999999999996, 4.5, 3.8999999999999999, 4.7999999999999998, 4.0, 4.9000000000000004, 4.7000000000000002, 4.2999999999999998, 4.4000000000000004, 4.7999999999999998, 5.0, 4.5, 3.5, 3.7999999999999998, 3.7000000000000002, 3.8999999999999999, 5.0999999999999996, 4.5, 4.5, 4.7000000000000002, 4.4000000000000004, 4.0999999999999996, 4.0, 4.4000000000000004, 4.5999999999999996, 4.0, 3.2999999999999998, 4.2000000000000002, 4.2000000000000002, 4.2000000000000002, 4.2999999999999998, 3.0, 4.0999999999999996, 6.0, 5.0999999999999996, 5.9000000000000004, 5.5999999999999996, 5.7999999999999998, 6.5999999999999996, 4.5, 6.2999999999999998, 5.7999999999999998, 6.0999999999999996, 5.0999999999999996, 5.2999999999999998, 5.5, 5.0, 5.0999999999999996, 5.2999999999999998, 5.5, 6.7000000000000002, 6.9000000000000004, 5.0, 5.7000000000000002, 4.9000000000000004, 6.7000000000000002, 4.9000000000000004, 5.7000000000000002, 6.0, 4.7999999999999998, 4.9000000000000004, 5.5999999999999996, 5.7999999999999998, 6.0999999999999996, 6.4000000000000004, 5.5999999999999996, 5.0999999999999996, 5.5999999999999996, 6.0999999999999996, 5.5999999999999996, 5.5, 4.7999999999999998, 5.4000000000000004, 5.5999999999999996, 5.0999999999999996, 5.0999999999999996, 5.9000000000000004, 5.7000000000000002, 5.2000000000000002, 5.0, 5.2000000000000002, 5.4000000000000004, 5.0999999999999996], 'Species': ['setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica'], 'Sepal.Width': [3.5, 3.0, 3.2000000000000002, 3.1000000000000001, 3.6000000000000001, 3.8999999999999999, 3.3999999999999999, 3.3999999999999999, 2.8999999999999999, 3.1000000000000001, 3.7000000000000002, 3.3999999999999999, 3.0, 3.0, 4.0, 4.4000000000000004, 3.8999999999999999, 3.5, 3.7999999999999998, 3.7999999999999998, 3.3999999999999999, 3.7000000000000002, 3.6000000000000001, 3.2999999999999998, 3.3999999999999999, 3.0, 3.3999999999999999, 3.5, 3.3999999999999999, 3.2000000000000002, 3.1000000000000001, 3.3999999999999999, 4.0999999999999996, 4.2000000000000002, 3.1000000000000001, 3.2000000000000002, 3.5, 3.6000000000000001, 3.0, 3.3999999999999999, 3.5, 2.2999999999999998, 3.2000000000000002, 3.5, 3.7999999999999998, 3.0, 3.7999999999999998, 3.2000000000000002, 3.7000000000000002, 3.2999999999999998, 3.2000000000000002, 3.2000000000000002, 3.1000000000000001, 2.2999999999999998, 2.7999999999999998, 2.7999999999999998, 3.2999999999999998, 2.3999999999999999, 2.8999999999999999, 2.7000000000000002, 2.0, 3.0, 2.2000000000000002, 2.8999999999999999, 2.8999999999999999, 3.1000000000000001, 3.0, 2.7000000000000002, 2.2000000000000002, 2.5, 3.2000000000000002, 2.7999999999999998, 2.5, 2.7999999999999998, 2.8999999999999999, 3.0, 2.7999999999999998, 3.0, 2.8999999999999999, 2.6000000000000001, 2.3999999999999999, 2.3999999999999999, 2.7000000000000002, 2.7000000000000002, 3.0, 3.3999999999999999, 3.1000000000000001, 2.2999999999999998, 3.0, 2.5, 2.6000000000000001, 3.0, 2.6000000000000001, 2.2999999999999998, 2.7000000000000002, 3.0, 2.8999999999999999, 2.8999999999999999, 2.5, 2.7999999999999998, 3.2999999999999998, 2.7000000000000002, 3.0, 2.8999999999999999, 3.0, 3.0, 2.5, 2.8999999999999999, 2.5, 3.6000000000000001, 3.2000000000000002, 2.7000000000000002, 3.0, 2.5, 2.7999999999999998, 3.2000000000000002, 3.0, 3.7999999999999998, 2.6000000000000001, 2.2000000000000002, 3.2000000000000002, 2.7999999999999998, 2.7999999999999998, 2.7000000000000002, 3.2999999999999998, 3.2000000000000002, 2.7999999999999998, 3.0, 2.7999999999999998, 3.0, 2.7999999999999998, 3.7999999999999998, 2.7999999999999998, 2.7999999999999998, 2.6000000000000001, 3.0, 3.3999999999999999, 3.1000000000000001, 3.0, 3.1000000000000001, 3.1000000000000001, 3.1000000000000001, 2.7000000000000002, 3.2000000000000002, 3.2999999999999998, 3.0, 2.5, 3.0, 3.3999999999999999, 3.0], 'Unnamed: 0': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150]}
'''
df = pd.DataFrame(raw_data, columns = ['Unnamed: 0', 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species'])
# 如果你打算檢查結果
# 1\. 輸入此單元格中上面單元格生成的代碼
raw_data = {'Petal.Width': [0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.29999999999999999, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.10000000000000001, 0.20000000000000001, 0.40000000000000002, 0.40000000000000002, 0.29999999999999999, 0.29999999999999999, 0.29999999999999999, 0.20000000000000001, 0.40000000000000002, 0.20000000000000001, 0.5, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.40000000000000002, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.10000000000000001, 0.20000000000000001, 0.20000000000000001, 0.29999999999999999, 0.29999999999999999, 0.20000000000000001, 0.59999999999999998, 0.40000000000000002, 0.29999999999999999, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 0.20000000000000001, 1.3999999999999999, 1.5, 1.5, 1.3, 1.5, 1.3, 1.6000000000000001, 1.0, 1.3, 1.3999999999999999, 1.0, 1.5, 1.0, 1.3999999999999999, 1.3, 1.3999999999999999, 1.5, 1.0, 1.5, 1.1000000000000001, 1.8, 1.3, 1.5, 1.2, 1.3, 1.3999999999999999, 1.3999999999999999, 1.7, 1.5, 1.0, 1.1000000000000001, 1.0, 1.2, 1.6000000000000001, 1.5, 1.6000000000000001, 1.5, 1.3, 1.3, 1.3, 1.2, 1.3999999999999999, 1.2, 1.0, 1.3, 1.2, 1.3, 1.3, 1.1000000000000001, 1.3, 2.5, 1.8999999999999999, 2.1000000000000001, 1.8, 2.2000000000000002, 2.1000000000000001, 1.7, 1.8, 1.8, 2.5, 2.0, 1.8999999999999999, 2.1000000000000001, 2.0, 2.3999999999999999, 2.2999999999999998, 1.8, 2.2000000000000002, 2.2999999999999998, 1.5, 2.2999999999999998, 2.0, 2.0, 1.8, 2.1000000000000001, 1.8, 1.8, 1.8, 2.1000000000000001, 1.6000000000000001, 1.8999999999999999, 2.0, 2.2000000000000002, 1.5, 1.3999999999999999, 2.2999999999999998, 2.3999999999999999, 1.8, 1.8, 2.1000000000000001, 2.3999999999999999, 2.2999999999999998, 1.8999999999999999, 2.2999999999999998, 2.5, 2.2999999999999998, 1.8999999999999999, 2.0, 2.2999999999999998, 1.8], 'Sepal.Width': [3.5, 3.0, 3.2000000000000002, 3.1000000000000001, 3.6000000000000001, 3.8999999999999999, 3.3999999999999999, 3.3999999999999999, 2.8999999999999999, 3.1000000000000001, 3.7000000000000002, 3.3999999999999999, 3.0, 3.0, 4.0, 4.4000000000000004, 3.8999999999999999, 3.5, 3.7999999999999998, 3.7999999999999998, 3.3999999999999999, 3.7000000000000002, 3.6000000000000001, 3.2999999999999998, 3.3999999999999999, 3.0, 3.3999999999999999, 3.5, 3.3999999999999999, 3.2000000000000002, 3.1000000000000001, 3.3999999999999999, 4.0999999999999996, 4.2000000000000002, 3.1000000000000001, 3.2000000000000002, 3.5, 3.6000000000000001, 3.0, 3.3999999999999999, 3.5, 2.2999999999999998, 3.2000000000000002, 3.5, 3.7999999999999998, 3.0, 3.7999999999999998, 3.2000000000000002, 3.7000000000000002, 3.2999999999999998, 3.2000000000000002, 3.2000000000000002, 3.1000000000000001, 2.2999999999999998, 2.7999999999999998, 2.7999999999999998, 3.2999999999999998, 2.3999999999999999, 2.8999999999999999, 2.7000000000000002, 2.0, 3.0, 2.2000000000000002, 2.8999999999999999, 2.8999999999999999, 3.1000000000000001, 3.0, 2.7000000000000002, 2.2000000000000002, 2.5, 3.2000000000000002, 2.7999999999999998, 2.5, 2.7999999999999998, 2.8999999999999999, 3.0, 2.7999999999999998, 3.0, 2.8999999999999999, 2.6000000000000001, 2.3999999999999999, 2.3999999999999999, 2.7000000000000002, 2.7000000000000002, 3.0, 3.3999999999999999, 3.1000000000000001, 2.2999999999999998, 3.0, 2.5, 2.6000000000000001, 3.0, 2.6000000000000001, 2.2999999999999998, 2.7000000000000002, 3.0, 2.8999999999999999, 2.8999999999999999, 2.5, 2.7999999999999998, 3.2999999999999998, 2.7000000000000002, 3.0, 2.8999999999999999, 3.0, 3.0, 2.5, 2.8999999999999999, 2.5, 3.6000000000000001, 3.2000000000000002, 2.7000000000000002, 3.0, 2.5, 2.7999999999999998, 3.2000000000000002, 3.0, 3.7999999999999998, 2.6000000000000001, 2.2000000000000002, 3.2000000000000002, 2.7999999999999998, 2.7999999999999998, 2.7000000000000002, 3.2999999999999998, 3.2000000000000002, 2.7999999999999998, 3.0, 2.7999999999999998, 3.0, 2.7999999999999998, 3.7999999999999998, 2.7999999999999998, 2.7999999999999998, 2.6000000000000001, 3.0, 3.3999999999999999, 3.1000000000000001, 3.0, 3.1000000000000001, 3.1000000000000001, 3.1000000000000001, 2.7000000000000002, 3.2000000000000002, 3.2999999999999998, 3.0, 2.5, 3.0, 3.3999999999999999, 3.0], 'Species': ['setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'setosa', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'versicolor', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica', 'virginica'], 'Unnamed: 0': [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59, 60, 61, 62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74, 75, 76, 77, 78, 79, 80, 81, 82, 83, 84, 85, 86, 87, 88, 89, 90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101, 102, 103, 104, 105, 106, 107, 108, 109, 110, 111, 112, 113, 114, 115, 116, 117, 118, 119, 120, 121, 122, 123, 124, 125, 126, 127, 128, 129, 130, 131, 132, 133, 134, 135, 136, 137, 138, 139, 140, 141, 142, 143, 144, 145, 146, 147, 148, 149, 150], 'Sepal.Length': [5.0999999999999996, 4.9000000000000004, 4.7000000000000002, 4.5999999999999996, 5.0, 5.4000000000000004, 4.5999999999999996, 5.0, 4.4000000000000004, 4.9000000000000004, 5.4000000000000004, 4.7999999999999998, 4.7999999999999998, 4.2999999999999998, 5.7999999999999998, 5.7000000000000002, 5.4000000000000004, 5.0999999999999996, 5.7000000000000002, 5.0999999999999996, 5.4000000000000004, 5.0999999999999996, 4.5999999999999996, 5.0999999999999996, 4.7999999999999998, 5.0, 5.0, 5.2000000000000002, 5.2000000000000002, 4.7000000000000002, 4.7999999999999998, 5.4000000000000004, 5.2000000000000002, 5.5, 4.9000000000000004, 5.0, 5.5, 4.9000000000000004, 4.4000000000000004, 5.0999999999999996, 5.0, 4.5, 4.4000000000000004, 5.0, 5.0999999999999996, 4.7999999999999998, 5.0999999999999996, 4.5999999999999996, 5.2999999999999998, 5.0, 7.0, 6.4000000000000004, 6.9000000000000004, 5.5, 6.5, 5.7000000000000002, 6.2999999999999998, 4.9000000000000004, 6.5999999999999996, 5.2000000000000002, 5.0, 5.9000000000000004, 6.0, 6.0999999999999996, 5.5999999999999996, 6.7000000000000002, 5.5999999999999996, 5.7999999999999998, 6.2000000000000002, 5.5999999999999996, 5.9000000000000004, 6.0999999999999996, 6.2999999999999998, 6.0999999999999996, 6.4000000000000004, 6.5999999999999996, 6.7999999999999998, 6.7000000000000002, 6.0, 5.7000000000000002, 5.5, 5.5, 5.7999999999999998, 6.0, 5.4000000000000004, 6.0, 6.7000000000000002, 6.2999999999999998, 5.5999999999999996, 5.5, 5.5, 6.0999999999999996, 5.7999999999999998, 5.0, 5.5999999999999996, 5.7000000000000002, 5.7000000000000002, 6.2000000000000002, 5.0999999999999996, 5.7000000000000002, 6.2999999999999998, 5.7999999999999998, 7.0999999999999996, 6.2999999999999998, 6.5, 7.5999999999999996, 4.9000000000000004, 7.2999999999999998, 6.7000000000000002, 7.2000000000000002, 6.5, 6.4000000000000004, 6.7999999999999998, 5.7000000000000002, 5.7999999999999998, 6.4000000000000004, 6.5, 7.7000000000000002, 7.7000000000000002, 6.0, 6.9000000000000004, 5.5999999999999996, 7.7000000000000002, 6.2999999999999998, 6.7000000000000002, 7.2000000000000002, 6.2000000000000002, 6.0999999999999996, 6.4000000000000004, 7.2000000000000002, 7.4000000000000004, 7.9000000000000004, 6.4000000000000004, 6.2999999999999998, 6.0999999999999996, 7.7000000000000002, 6.2999999999999998, 6.4000000000000004, 6.0, 6.9000000000000004, 6.7000000000000002, 6.9000000000000004, 5.7999999999999998, 6.7999999999999998, 6.7000000000000002, 6.7000000000000002, 6.2999999999999998, 6.5, 6.2000000000000002, 5.9000000000000004], 'Petal.Length': [1.3999999999999999, 1.3999999999999999, 1.3, 1.5, 1.3999999999999999, 1.7, 1.3999999999999999, 1.5, 1.3999999999999999, 1.5, 1.5, 1.6000000000000001, 1.3999999999999999, 1.1000000000000001, 1.2, 1.5, 1.3, 1.3999999999999999, 1.7, 1.5, 1.7, 1.5, 1.0, 1.7, 1.8999999999999999, 1.6000000000000001, 1.6000000000000001, 1.5, 1.3999999999999999, 1.6000000000000001, 1.6000000000000001, 1.5, 1.5, 1.3999999999999999, 1.5, 1.2, 1.3, 1.3999999999999999, 1.3, 1.5, 1.3, 1.3, 1.3, 1.6000000000000001, 1.8999999999999999, 1.3999999999999999, 1.6000000000000001, 1.3999999999999999, 1.5, 1.3999999999999999, 4.7000000000000002, 4.5, 4.9000000000000004, 4.0, 4.5999999999999996, 4.5, 4.7000000000000002, 3.2999999999999998, 4.5999999999999996, 3.8999999999999999, 3.5, 4.2000000000000002, 4.0, 4.7000000000000002, 3.6000000000000001, 4.4000000000000004, 4.5, 4.0999999999999996, 4.5, 3.8999999999999999, 4.7999999999999998, 4.0, 4.9000000000000004, 4.7000000000000002, 4.2999999999999998, 4.4000000000000004, 4.7999999999999998, 5.0, 4.5, 3.5, 3.7999999999999998, 3.7000000000000002, 3.8999999999999999, 5.0999999999999996, 4.5, 4.5, 4.7000000000000002, 4.4000000000000004, 4.0999999999999996, 4.0, 4.4000000000000004, 4.5999999999999996, 4.0, 3.2999999999999998, 4.2000000000000002, 4.2000000000000002, 4.2000000000000002, 4.2999999999999998, 3.0, 4.0999999999999996, 6.0, 5.0999999999999996, 5.9000000000000004, 5.5999999999999996, 5.7999999999999998, 6.5999999999999996, 4.5, 6.2999999999999998, 5.7999999999999998, 6.0999999999999996, 5.0999999999999996, 5.2999999999999998, 5.5, 5.0, 5.0999999999999996, 5.2999999999999998, 5.5, 6.7000000000000002, 6.9000000000000004, 5.0, 5.7000000000000002, 4.9000000000000004, 6.7000000000000002, 4.9000000000000004, 5.7000000000000002, 6.0, 4.7999999999999998, 4.9000000000000004, 5.5999999999999996, 5.7999999999999998, 6.0999999999999996, 6.4000000000000004, 5.5999999999999996, 5.0999999999999996, 5.5999999999999996, 6.0999999999999996, 5.5999999999999996, 5.5, 4.7999999999999998, 5.4000000000000004, 5.5999999999999996, 5.0999999999999996, 5.0999999999999996, 5.9000000000000004, 5.7000000000000002, 5.2000000000000002, 5.0, 5.2000000000000002, 5.4000000000000004, 5.0999999999999996]}
df = pd.DataFrame(raw_data, columns = ['Unnamed: 0', 'Sepal.Length', 'Sepal.Width', 'Petal.Length', 'Petal.Width', 'Species'])
# 查看原始數據幀的前幾行
df.head()
```
| | Unnamed: 0 | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
```py
# 查看使用我們的代碼創建的,數據幀的前幾行
df_original.head()
```
| | Unnamed: 0 | Sepal.Length | Sepal.Width | Petal.Length | Petal.Width | Species |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | 1 | 5.1 | 3.5 | 1.4 | 0.2 | setosa |
| 1 | 2 | 4.9 | 3.0 | 1.4 | 0.2 | setosa |
| 2 | 3 | 4.7 | 3.2 | 1.3 | 0.2 | setosa |
| 3 | 4 | 4.6 | 3.1 | 1.5 | 0.2 | setosa |
| 4 | 5 | 5.0 | 3.6 | 1.4 | 0.2 | setosa |
## 將分類變量轉換為虛擬變量
```py
# 導入模塊
import pandas as pd
# 創建數據幀
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
'sex': ['male', 'female', 'male', 'female', 'female']}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'sex'])
df
```
| | first_name | last_name | sex |
| --- | --- | --- | --- |
| 0 | Jason | Miller | male |
| 1 | Molly | Jacobson | female |
| 2 | Tina | Ali | male |
| 3 | Jake | Milner | female |
| 4 | Amy | Cooze | female |
```py
# 從 sex 變量創建一組虛擬變量
df_sex = pd.get_dummies(df['sex'])
# 將虛擬變量連接到主數據幀
df_new = pd.concat([df, df_sex], axis=1)
df_new
```
| | first_name | last_name | sex | female | male |
| --- | --- | --- | --- | --- | --- |
| 0 | Jason | Miller | male | 0.0 | 1.0 |
| 1 | Molly | Jacobson | female | 1.0 | 0.0 |
| 2 | Tina | Ali | male | 0.0 | 1.0 |
| 3 | Jake | Milner | female | 1.0 | 0.0 |
| 4 | Amy | Cooze | female | 1.0 | 0.0 |
```py
# 連接新列的替代方案
df_new = df.join(df_sex)
df_new
```
| | first_name | last_name | sex | female | male |
| --- | --- | --- | --- | --- | --- |
| 0 | Jason | Miller | male | 0.0 | 1.0 |
| 1 | Molly | Jacobson | female | 1.0 | 0.0 |
| 2 | Tina | Ali | male | 0.0 | 1.0 |
| 3 | Jake | Milner | female | 1.0 | 0.0 |
| 4 | Amy | Cooze | female | 1.0 | 0.0 |
## 將分類變量轉換為虛擬變量
```py
# 導入模塊
import pandas as pd
import patsy
# 創建數據幀
raw_data = {'countrycode': [1, 2, 3, 2, 1]}
df = pd.DataFrame(raw_data, columns = ['countrycode'])
df
```
| | countrycode |
| --- | --- |
| 0 | 1 |
| 1 | 2 |
| 2 | 3 |
| 3 | 2 |
| 4 | 1 |
```py
# 將 countrycode 變量轉換為三個二元變量
patsy.dmatrix('C(countrycode)-1', df, return_type='dataframe')
```
| | C(countrycode)[1] | C(countrycode)[2] | C(countrycode)[3] |
| --- | --- | --- | --- |
| 0 | 1.0 | 0.0 | 0.0 |
| 1 | 0.0 | 1.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 |
| 3 | 0.0 | 1.0 | 0.0 |
| 4 | 1.0 | 0.0 | 0.0 |
## 將字符串分類變量轉換為數字變量
```py
# 導入模塊
import pandas as pd
raw_data = {'patient': [1, 1, 1, 2, 2],
'obs': [1, 2, 3, 1, 2],
'treatment': [0, 1, 0, 1, 0],
'score': ['strong', 'weak', 'normal', 'weak', 'strong']}
df = pd.DataFrame(raw_data, columns = ['patient', 'obs', 'treatment', 'score'])
df
```
| | patient | obs | treatment | score |
| --- | --- | --- | --- | --- |
| 0 | 1 | 1 | 0 | strong |
| 1 | 1 | 2 | 1 | weak |
| 2 | 1 | 3 | 0 | normal |
| 3 | 2 | 1 | 1 | weak |
| 4 | 2 | 2 | 0 | strong |
```py
# 創建一個函數,將 df['score'] 的所有值轉換為數字
def score_to_numeric(x):
if x=='strong':
return 3
if x=='normal':
return 2
if x=='weak':
return 1
df['score_num'] = df['score'].apply(score_to_numeric)
df
```
| | patient | obs | treatment | score | score_num |
| --- | --- | --- | --- | --- | --- |
| 0 | 1 | 1 | 0 | strong | 3 |
| 1 | 1 | 2 | 1 | weak | 1 |
| 2 | 1 | 3 | 0 | normal | 2 |
| 3 | 2 | 1 | 1 | weak | 1 |
| 4 | 2 | 2 | 0 | strong | 3 |
## 將變量轉換為時間序列
```py
# 導入庫
import pandas as pd
# 創建索引為一組名稱的數據集
raw_data = {'date': ['2014-06-01T01:21:38.004053', '2014-06-02T01:21:38.004053', '2014-06-03T01:21:38.004053'],
'score': [25, 94, 57]}
df = pd.DataFrame(raw_data, columns = ['date', 'score'])
df
```
| | date | score |
| --- | --- | --- |
| 0 | 2014-06-01T01:21:38.004053 | 25 |
| 1 | 2014-06-02T01:21:38.004053 | 94 |
| 2 | 2014-06-03T01:21:38.004053 | 57 |
```py
# 轉置數據集,使索引(在本例中為名稱)為列
df["date"] = pd.to_datetime(df["date"])
df = df.set_index(df["date"])
df
```
| | date | score |
| --- | --- | --- |
| date | | |
| --- | --- | --- |
| 2014-06-01 01:21:38.004053 | 2014-06-01 01:21:38.004053 | 25 |
| 2014-06-02 01:21:38.004053 | 2014-06-02 01:21:38.004053 | 94 |
| 2014-06-03 01:21:38.004053 | 2014-06-03 01:21:38.004053 | 57 |
## 在 Pandas 數據幀中計數
```py
# 導入庫
import pandas as pd
year = pd.Series([1875, 1876, 1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884,
1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894])
guardCorps = pd.Series([0,2,2,1,0,0,1,1,0,3,0,2,1,0,0,1,0,1,0,1])
corps1 = pd.Series([0,0,0,2,0,3,0,2,0,0,0,1,1,1,0,2,0,3,1,0])
corps2 = pd.Series([0,0,0,2,0,2,0,0,1,1,0,0,2,1,1,0,0,2,0,0])
corps3 = pd.Series([0,0,0,1,1,1,2,0,2,0,0,0,1,0,1,2,1,0,0,0])
corps4 = pd.Series([0,1,0,1,1,1,1,0,0,0,0,1,0,0,0,0,1,1,0,0])
corps5 = pd.Series([0,0,0,0,2,1,0,0,1,0,0,1,0,1,1,1,1,1,1,0])
corps6 = pd.Series([0,0,1,0,2,0,0,1,2,0,1,1,3,1,1,1,0,3,0,0])
corps7 = pd.Series([1,0,1,0,0,0,1,0,1,1,0,0,2,0,0,2,1,0,2,0])
corps8 = pd.Series([1,0,0,0,1,0,0,1,0,0,0,0,1,0,0,0,1,1,0,1])
corps9 = pd.Series([0,0,0,0,0,2,1,1,1,0,2,1,1,0,1,2,0,1,0,0])
corps10 = pd.Series([0,0,1,1,0,1,0,2,0,2,0,0,0,0,2,1,3,0,1,1])
corps11 = pd.Series([0,0,0,0,2,4,0,1,3,0,1,1,1,1,2,1,3,1,3,1])
corps14 = pd.Series([ 1,1,2,1,1,3,0,4,0,1,0,3,2,1,0,2,1,1,0,0])
corps15 = pd.Series([0,1,0,0,0,0,0,1,0,1,1,0,0,0,2,2,0,0,0,0])
variables = dict(guardCorps = guardCorps, corps1 = corps1,
corps2 = corps2, corps3 = corps3, corps4 = corps4,
corps5 = corps5, corps6 = corps6, corps7 = corps7,
corps8 = corps8, corps9 = corps9, corps10 = corps10,
corps11 = corps11 , corps14 = corps14, corps15 = corps15)
horsekick = pd.DataFrame(variables, columns = ['guardCorps',
'corps1', 'corps2',
'corps3', 'corps4',
'corps5', 'corps6',
'corps7', 'corps8',
'corps9', 'corps10',
'corps11', 'corps14',
'corps15'])
horsekick.index = [1875, 1876, 1877, 1878, 1879, 1880, 1881, 1882, 1883, 1884,
1885, 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894]
horsekick
```
| | guardCorps | corps1 | corps2 | corps3 | corps4 | corps5 | corps6 | corps7 | corps8 | corps9 | corps10 | corps11 | corps14 | corps15 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 1875 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1876 | 2 | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| 1877 | 2 | 0 | 0 | 0 | 0 | 0 | 1 | 1 | 0 | 0 | 1 | 0 | 2 | 0 |
| 1878 | 1 | 2 | 2 | 1 | 1 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1879 | 0 | 0 | 0 | 1 | 1 | 2 | 2 | 0 | 1 | 0 | 0 | 2 | 1 | 0 |
| 1880 | 0 | 3 | 2 | 1 | 1 | 1 | 0 | 0 | 0 | 2 | 1 | 4 | 3 | 0 |
| 1881 | 1 | 0 | 0 | 2 | 1 | 0 | 0 | 1 | 0 | 1 | 0 | 0 | 0 | 0 |
| 1882 | 1 | 2 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 2 | 1 | 4 | 1 |
| 1883 | 0 | 0 | 1 | 2 | 0 | 1 | 2 | 1 | 0 | 1 | 0 | 3 | 0 | 0 |
| 1884 | 3 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 1 | 1 |
| 1885 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 | 2 | 0 | 1 | 0 | 1 |
| 1886 | 2 | 1 | 0 | 0 | 1 | 1 | 1 | 0 | 0 | 1 | 0 | 1 | 3 | 0 |
| 1887 | 1 | 1 | 2 | 1 | 0 | 0 | 3 | 2 | 1 | 1 | 0 | 1 | 2 | 0 |
| 1888 | 0 | 1 | 1 | 0 | 0 | 1 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 1889 | 0 | 0 | 1 | 1 | 0 | 1 | 1 | 0 | 0 | 1 | 2 | 2 | 0 | 2 |
| 1890 | 1 | 2 | 0 | 2 | 0 | 1 | 1 | 2 | 0 | 2 | 1 | 1 | 2 | 2 |
| 1891 | 0 | 0 | 0 | 1 | 1 | 1 | 0 | 1 | 1 | 0 | 3 | 3 | 1 | 0 |
| 1892 | 1 | 3 | 2 | 0 | 1 | 1 | 3 | 0 | 1 | 1 | 0 | 1 | 1 | 0 |
| 1893 | 0 | 1 | 0 | 0 | 0 | 1 | 0 | 2 | 0 | 0 | 1 | 3 | 0 | 0 |
| 1894 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | 1 | 0 | 0 |
```py
# 計算每個團隊中每個死亡人數的次數
result = horsekick.apply(pd.value_counts).fillna(0); result
```
| | guardCorps | corps1 | corps2 | corps3 | corps4 | corps5 | corps6 | corps7 | corps8 | corps9 | corps10 | corps11 | corps14 | corps15 |
| 0 | 9.0 | 11.0 | 12.0 | 11.0 | 12.0 | 10.0 | 9.0 | 11.0 | 13.0 | 10.0 | 10.0 | 6 | 6 | 14.0 |
| 1 | 7.0 | 4.0 | 4.0 | 6.0 | 8.0 | 9.0 | 7.0 | 6.0 | 7.0 | 7.0 | 6.0 | 8 | 8 | 4.0 |
| 2 | 3.0 | 3.0 | 4.0 | 3.0 | 0.0 | 1.0 | 2.0 | 3.0 | 0.0 | 3.0 | 3.0 | 2 | 3 | 2.0 |
| 3 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | 3 | 2 | 0.0 |
| 4 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1 | 1 | 0.0 |
```py
# 計算每個月死亡總數出現在 guardCorps 的次數
pd.value_counts(horsekick['guardCorps'].values, sort=False)
'''
0 9
1 7
2 3
3 1
dtype: int64
'''
horsekick['guardCorps'].unique()
# array([0, 2, 1, 3])
```
## 在 Pandas 中創建流水線
Pandas 的流水線功能允許你將 Python 函數串聯在一起,來構建數據處理流水線。
```py
import pandas as pd
# 創建空數據幀
df = pd.DataFrame()
# Create a column
df['name'] = ['John', 'Steve', 'Sarah']
df['gender'] = ['Male', 'Male', 'Female']
df['age'] = [31, 32, 19]
# 查看數據幀
df
```
| | name | gender | age |
| --- | --- | --- | --- |
| 0 | John | Male | 31 |
| 1 | Steve | Male | 32 |
| 2 | Sarah | Female | 19 |
```py
# 創建函數,
def mean_age_by_group(dataframe, col):
# 它按列分組數據,并返回每組的均值
return dataframe.groupby(col).mean()
# 創建函數,
def uppercase_column_name(dataframe):
# 它大寫所有列標題
dataframe.columns = dataframe.columns.str.upper()
# 并返回它
return dataframe
# 創建流水線,它應用 mean_age_by_group 函數
(df.pipe(mean_age_by_group, col='gender')
# 之后應用 uppercase_column_name 函數
.pipe(uppercase_column_name)
)
```
| | AGE |
| --- | --- |
| gender | |
| Female | 19.0 |
| Male | 31.5 |
## 使用`for`循環創建 Pandas 列
```py
import pandas as pd
import numpy as np
raw_data = {'student_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
'test_score': [76, 88, 84, 67, 53, 96, 64, 91, 77, 73, 52, np.NaN]}
df = pd.DataFrame(raw_data, columns = ['student_name', 'test_score'])
# 創建列表來儲存數據
grades = []
# 對于列中的每一行
for row in df['test_score']:
# 如果大于某個值
if row > 95:
# 添加字母分數
grades.append('A')
# 或者,如果大于某個值
elif row > 90:
# 添加字母分數
grades.append('A-')
# 或者,如果大于某個值
elif row > 85:
# 添加字母分數
grades.append('B')
# 或者,如果大于某個值
elif row > 80:
# 添加字母分數
grades.append('B-')
# 或者,如果大于某個值
elif row > 75:
# 添加字母分數
grades.append('C')
# 或者,如果大于某個值
elif row > 70:
# 添加字母分數
grades.append('C-')
# 或者,如果大于某個值
elif row > 65:
# 添加字母分數
grades.append('D')
# 或者,如果大于某個值
elif row > 60:
# 添加字母分數
grades.append('D-')
# 否則
else:
# 添加不及格分數
grades.append('Failed')
# 從列表創建一列
df['grades'] = grades
# 查看新數據幀
df
```
| | student_name | test_score | grades |
| --- | --- | --- | --- |
| 0 | Miller | 76.0 | C |
| 1 | Jacobson | 88.0 | B |
| 2 | Ali | 84.0 | B- |
| 3 | Milner | 67.0 | D |
| 4 | Cooze | 53.0 | Failed |
| 5 | Jacon | 96.0 | A |
| 6 | Ryaner | 64.0 | D- |
| 7 | Sone | 91.0 | A- |
| 8 | Sloan | 77.0 | C |
| 9 | Piger | 73.0 | C- |
| 10 | Riani | 52.0 | Failed |
| 11 | Ali | NaN | Failed |
## 創建項目計數
```py
from collections import Counter
# 創建一個今天吃的水果的計數器
fruit_eaten = Counter(['Apple', 'Apple', 'Apple', 'Banana', 'Pear', 'Pineapple'])
# 查看計數器
fruit_eaten
# Counter({'Apple': 3, 'Banana': 1, 'Pear': 1, 'Pineapple': 1})
# 更新菠蘿的計數(因為你只吃菠蘿)
fruit_eaten.update(['Pineapple'])
# 查看計數器
fruit_eaten
# Counter({'Apple': 3, 'Banana': 1, 'Pear': 1, 'Pineapple': 2})
# 查看計數最大的三個項目
fruit_eaten.most_common(3)
# [('Apple', 3), ('Pineapple', 2), ('Banana', 1)]
```
## 基于條件創建一列
```py
# 導入所需模塊
import pandas as pd
import numpy as np
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'age': [42, 52, 36, 24, 73],
'preTestScore': [4, 24, 31, 2, 3],
'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns = ['name', 'age', 'preTestScore', 'postTestScore'])
df
```
| | name | age | preTestScore | postTestScore |
| --- | --- | --- | --- | --- |
| 0 | Jason | 42 | 4 | 25 |
| 1 | Molly | 52 | 24 | 94 |
| 2 | Tina | 36 | 31 | 57 |
| 3 | Jake | 24 | 2 | 62 |
| 4 | Amy | 73 | 3 | 70 |
```py
# 創建一個名為 df.elderly 的新列
# 如果 df.age 大于 50 則值為 yes,否則為 no
df['elderly'] = np.where(df['age']>=50, 'yes', 'no')
# 查看數據幀
df
```
| | name | age | preTestScore | postTestScore | elderly |
| --- | --- | --- | --- | --- | --- |
| 0 | Jason | 42 | 4 | 25 | no |
| 1 | Molly | 52 | 24 | 94 | yes |
| 2 | Tina | 36 | 31 | 57 | no |
| 3 | Jake | 24 | 2 | 62 | no |
| 4 | Amy | 73 | 3 | 70 | yes |
## 從詞典鍵和值創建列表
```py
# 創建字典
dict = {'county': ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'],
'year': [2012, 2012, 2013, 2014, 2014],
'fireReports': [4, 24, 31, 2, 3]}
# 創建鍵的列表
list(dict.keys())
# ['fireReports', 'year', 'county']
# 創建值的列表
list(dict.values())
'''
[[4, 24, 31, 2, 3],
[2012, 2012, 2013, 2014, 2014],
['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma']]
'''
```
## Pandas 中的交叉表
```py
# 導入庫
import pandas as pd
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
'company': ['infantry', 'infantry', 'cavalry', 'cavalry', 'infantry', 'infantry', 'cavalry', 'cavalry','infantry', 'infantry', 'cavalry', 'cavalry'],
'experience': ['veteran', 'rookie', 'veteran', 'rookie', 'veteran', 'rookie', 'veteran', 'rookie','veteran', 'rookie', 'veteran', 'rookie'],
'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'experience', 'name', 'preTestScore', 'postTestScore'])
df
```
| | regiment | company | experience | name | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | Nighthawks | infantry | veteran | Miller | 4 | 25 |
| 1 | Nighthawks | infantry | rookie | Jacobson | 24 | 94 |
| 2 | Nighthawks | cavalry | veteran | Ali | 31 | 57 |
| 3 | Nighthawks | cavalry | rookie | Milner | 2 | 62 |
| 4 | Dragoons | infantry | veteran | Cooze | 3 | 70 |
| 5 | Dragoons | infantry | rookie | Jacon | 4 | 25 |
| 6 | Dragoons | cavalry | veteran | Ryaner | 24 | 94 |
| 7 | Dragoons | cavalry | rookie | Sone | 31 | 57 |
| 8 | Scouts | infantry | veteran | Sloan | 2 | 62 |
| 9 | Scouts | infantry | rookie | Piger | 3 | 70 |
| 10 | Scouts | cavalry | veteran | Riani | 2 | 62 |
| 11 | Scouts | cavalry | rookie | Ali | 3 | 70 |
按公司和團隊創建交叉表。按公司和團隊計算觀測數量。
```py
pd.crosstab(df.regiment, df.company, margins=True)
```
| company | cavalry | infantry | All |
| --- | --- | --- | --- |
| regiment | | | |
| Dragoons | 2 | 2 | 4 |
| Nighthawks | 2 | 2 | 4 |
| Scouts | 2 | 2 | 4 |
| All | 6 | 6 | 12 |
```py
# 為每個團隊創建公司和經驗的交叉表
pd.crosstab([df.company, df.experience], df.regiment, margins=True)
```
| | regiment | Dragoons | Nighthawks | Scouts | All |
| --- | --- | --- | --- | --- | --- |
| company | experience | | | | |
| cavalry | rookie | 1 | 1 | 1 | 3 |
| | veteran | 1 | 1 | 1 | 3 |
| infantry | rookie | 1 | 1 | 1 | 3 |
| | veteran | 1 | 1 | 1 | 3 |
| All | | 4 | 4 | 4 | 12 |
## 刪除重復
```py
# 導入模塊
import pandas as pd
raw_data = {'first_name': ['Jason', 'Jason', 'Jason','Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Miller', 'Miller','Ali', 'Milner', 'Cooze'],
'age': [42, 42, 1111111, 36, 24, 73],
'preTestScore': [4, 4, 4, 31, 2, 3],
'postTestScore': [25, 25, 25, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df
```
| | first_name | last_name | age | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- |
| 0 | Jason | Miller | 42 | 4 | 25 |
| 1 | Jason | Miller | 42 | 4 | 25 |
| 2 | Jason | Miller | 1111111 | 4 | 25 |
| 3 | Tina | Ali | 36 | 31 | 57 |
| 4 | Jake | Milner | 24 | 2 | 62 |
| 5 | Amy | Cooze | 73 | 3 | 70 |
```py
# 確定哪些觀測是重復的
df.duplicated()
'''
0 False
1 True
2 False
3 False
4 False
5 False
dtype: bool
'''
df.drop_duplicates()
```
| | first_name | last_name | age | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- |
| 0 | Jason | Miller | 42 | 4 | 25 |
| 2 | Jason | Miller | 1111111 | 4 | 25 |
| 3 | Tina | Ali | 36 | 31 | 57 |
| 4 | Jake | Milner | 24 | 2 | 62 |
| 5 | Amy | Cooze | 73 | 3 | 70 |
```py
# 刪除 first_name 列中的重復項
# 但保留重復集中的最后一個觀測
df.drop_duplicates(['first_name'], keep='last')
```
| | first_name | last_name | age | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- |
| 2 | Jason | Miller | 1111111 | 4 | 25 |
| 3 | Tina | Ali | 36 | 31 | 57 |
| 4 | Jake | Milner | 24 | 2 | 62 |
| 5 | Amy | Cooze | 73 | 3 | 70 |
## Pandas 數據幀的描述性統計
```py
# 導入模塊
import pandas as pd
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'age': [42, 52, 36, 24, 73],
'preTestScore': [4, 24, 31, 2, 3],
'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns = ['name', 'age', 'preTestScore', 'postTestScore'])
df
```
| | name | age | preTestScore | postTestScore |
| --- | --- | --- | --- | --- |
| 0 | Jason | 42 | 4 | 25 |
| 1 | Molly | 52 | 24 | 94 |
| 2 | Tina | 36 | 31 | 57 |
| 3 | Jake | 24 | 2 | 62 |
| 4 | Amy | 73 | 3 | 70 |
5 rows × 4 columns
```py
# 所有年齡之和
df['age'].sum()
# 227
df['preTestScore'].mean()
# 12.800000000000001
df['preTestScore'].cumsum()
'''
0 4
1 28
2 59
3 61
4 64
Name: preTestScore, dtype: int64
'''
df['preTestScore'].describe()
'''
count 5.000000
mean 12.800000
std 13.663821
min 2.000000
25% 3.000000
50% 4.000000
75% 24.000000
max 31.000000
Name: preTestScore, dtype: float64
'''
df['preTestScore'].count()
# 5
df['preTestScore'].min()
# 2
df['preTestScore'].max()
# 31
df['preTestScore'].median()
# 4.0
df['preTestScore'].var()
# 186.69999999999999
df['preTestScore'].std()
# 13.663820841916802
df['preTestScore'].skew()
# 0.74334524573267591
df['preTestScore'].kurt()
# -2.4673543738411525
df.corr()
```
| | age | preTestScore | postTestScore |
| --- | --- | --- | --- |
| age | 1.000000 | -0.105651 | 0.328852 |
| preTestScore | -0.105651 | 1.000000 | 0.378039 |
| postTestScore | 0.328852 | 0.378039 | 1.000000 |
3 rows × 3 columns
```py
# 協方差矩陣
df.cov()
```
| | age | preTestScore | postTestScore |
| --- | --- | --- | --- |
| age | 340.80 | -26.65 | 151.20 |
| preTestScore | -26.65 | 186.70 | 128.65 |
| postTestScore | 151.20 | 128.65 | 620.30 |
3 rows × 3 columns
## 丟棄行或者列
```py
# 導入模塊
import pandas as pd
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'year': [2012, 2012, 2013, 2014, 2014],
'reports': [4, 24, 31, 2, 3]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df
```
| | name | reports | year |
| --- | --- | --- | --- |
| Cochice | Jason | 4 | 2012 |
| Pima | Molly | 24 | 2012 |
| Santa Cruz | Tina | 31 | 2013 |
| Maricopa | Jake | 2 | 2014 |
| Yuma | Amy | 3 | 2014 |
```py
# 丟棄觀測(行)
df.drop(['Cochice', 'Pima'])
```
| | name | reports | year |
| --- | --- | --- | --- |
| Santa Cruz | Tina | 31 | 2013 |
| Maricopa | Jake | 2 | 2014 |
| Yuma | Amy | 3 | 2014 |
```py
# 丟棄變量(列)
# 注意:`axis = 1`表示我們指的是列,而不是行
df.drop('reports', axis=1)
```
| | name | year |
| --- | --- | --- |
| Cochice | Jason | 2012 |
| Pima | Molly | 2012 |
| Santa Cruz | Tina | 2013 |
| Maricopa | Jake | 2014 |
| Yuma | Amy | 2014 |
如果它包含某個值(這里是`Tina`),丟棄一行。
具體來說:創建一個名為`df`的新數據框,名稱列中的單元格的值不等于`Tina`。
```py
df[df.name != 'Tina']
```
| | name | reports | year |
| --- | --- | --- | --- |
| Cochice | Jason | 4 | 2012 |
| Pima | Molly | 24 | 2012 |
| Maricopa | Jake | 2 | 2014 |
| Yuma | Amy | 3 | 2014 |
按照行號丟棄一行(在本例中為第 3 行)。
請注意,Pandas使用從零開始的編號,因此 0 是第一行,1 是第二行,等等。
```py
df.drop(df.index[2])
```
| | name | reports | year |
| --- | --- | --- | --- |
| Cochice | Jason | 4 | 2012 |
| Pima | Molly | 24 | 2012 |
| Maricopa | Jake | 2 | 2014 |
| Yuma | Amy | 3 | 2014 |
可以擴展到范圍。
```py
df.drop(df.index[[2,3]])
```
| | name | reports | year |
| --- | --- | --- | --- |
| Cochice | Jason | 4 | 2012 |
| Pima | Molly | 24 | 2012 |
| Yuma | Amy | 3 | 2014 |
或相對于 DF 的末尾來丟棄。
```py
df.drop(df.index[-2])
```
| | name | reports | year |
| --- | --- | --- | --- |
| Cochice | Jason | 4 | 2012 |
| Pima | Molly | 24 | 2012 |
| Santa Cruz | Tina | 31 | 2013 |
| Yuma | Amy | 3 | 2014 |
你也可以選擇相對于起始或末尾的范圍。
```py
df[:3] # 保留前三個
```
| | name | reports | year |
| --- | --- | --- | --- |
| Cochice | Jason | 4 | 2012 |
| Pima | Molly | 24 | 2012 |
| Santa Cruz | Tina | 31 | 2013 |
```py
df[:-3] # 丟掉后三個
```
| | name | reports | year |
| --- | --- | --- | --- |
| Cochice | Jason | 4 | 2012 |
| Pima | Molly | 24 | 2012 |
## 枚舉列表
```py
# 創建字符串列表
data = ['One','Two','Three','Four','Five']
# 對于 enumerate(data) 中的每個項目
for item in enumerate(data):
# 打印整個枚舉的元素
print(item)
# 只打印值(沒有索引)
print(item[1])
'''
(0, 'One')
One
(1, 'Two')
Two
(2, 'Three')
Three
(3, 'Four')
Four
(4, 'Five')
Five
'''
```
## 在 Pandas 中將包含列表的單元擴展為自己的變量
```py
# 導入 pandas
import pandas as pd
# 創建數據集
raw_data = {'score': [1,2,3],
'tags': [['apple','pear','guava'],['truck','car','plane'],['cat','dog','mouse']]}
df = pd.DataFrame(raw_data, columns = ['score', 'tags'])
# 查看數據集
df
```
| | score | tags |
| --- | --- | --- |
| 0 | 1 | [apple, pear, guava] |
| 1 | 2 | [truck, car, plane] |
| 2 | 3 | [cat, dog, mouse] |
```py
# 將 df.tags 擴展為自己的數據幀
tags = df['tags'].apply(pd.Series)
# 將每個變量重命名為標簽
tags = tags.rename(columns = lambda x : 'tag_' + str(x))
# 查看 tags 數據幀
tags
```
| | tag_0 | tag_1 | tag_2 |
| --- | --- | --- | --- |
| 0 | apple | pear | guava |
| 1 | truck | car | plane |
| 2 | cat | dog | mouse |
```py
# 將 tags 數據幀添加回原始數據幀
pd.concat([df[:], tags[:]], axis=1)
```
| | score | tags | tag_0 | tag_1 | tag_2 |
| --- | --- | --- | --- | --- | --- |
| 0 | 1 | [apple, pear, guava] | apple | pear | guava |
| 1 | 2 | [truck, car, plane] | truck | car | plane |
| 2 | 3 | [cat, dog, mouse] | cat | dog | mouse |
## 過濾 pandas 數據幀
```py
# 導入模塊
import pandas as pd
data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'year': [2012, 2012, 2013, 2014, 2014],
'reports': [4, 24, 31, 2, 3],
'coverage': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, index = ['Cochice', 'Pima', 'Santa Cruz', 'Maricopa', 'Yuma'])
df
```
| | coverage | name | reports | year |
| --- | --- | --- | --- | --- |
| Cochice | 25 | Jason | 4 | 2012 |
| Pima | 94 | Molly | 24 | 2012 |
| Santa Cruz | 57 | Tina | 31 | 2013 |
| Maricopa | 62 | Jake | 2 | 2014 |
| Yuma | 70 | Amy | 3 | 2014 |
```py
# 查看列
df['name']
'''
Cochice Jason
Pima Molly
Santa Cruz Tina
Maricopa Jake
Yuma Amy
Name: name, dtype: object
'''
df[['name', 'reports']]
```
| | name | reports |
| --- | --- | --- |
| Cochice | Jason | 4 |
| Pima | Molly | 24 |
| Santa Cruz | Tina | 31 |
| Maricopa | Jake | 2 |
| Yuma | Amy | 3 |
```py
# 查看前兩行
df[:2]
```
| | coverage | name | reports | year |
| --- | --- | --- | --- | --- |
| Cochice | 25 | Jason | 4 | 2012 |
| Pima | 94 | Molly | 24 | 2012 |
```py
# 查看 Coverage 大于 50 的行
df[df['coverage'] > 50]
```
| | coverage | name | reports | year |
| --- | --- | --- | --- | --- |
| Pima | 94 | Molly | 24 | 2012 |
| Santa Cruz | 57 | Tina | 31 | 2013 |
| Maricopa | 62 | Jake | 2 | 2014 |
| Yuma | 70 | Amy | 3 | 2014 |
```py
# 查看 Coverage 大于 50 并且 Reports 小于 4 的行
df[(df['coverage'] > 50) & (df['reports'] < 4)]
```
| | coverage | name | reports | year |
| --- | --- | --- | --- | --- |
| Maricopa | 62 | Jake | 2 | 2014 |
| Yuma | 70 | Amy | 3 | 2014 |
## 尋找數據幀的列中的最大值
```py
# 導入模塊
%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
# 創建數據幀
raw_data = {'first_name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'],
'last_name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze'],
'age': [42, 52, 36, 24, 73],
'preTestScore': [4, 24, 31, 2, 3],
'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['first_name', 'last_name', 'age', 'preTestScore', 'postTestScore'])
df
```
| | first_name | last_name | age | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- |
| 0 | Jason | Miller | 42 | 4 | 25 |
| 1 | Molly | Jacobson | 52 | 24 | 94 |
| 2 | Tina | Ali | 36 | 31 | 57 |
| 3 | Jake | Milner | 24 | 2 | 62 |
| 4 | Amy | Cooze | 73 | 3 | 70 |
```py
# 獲取 preTestScore 列中的最大值的索引
df['preTestScore'].idxmax()
# 2
```
## 尋找數據幀中的唯一值
```py
import pandas as pd
import numpy as np
raw_data = {'regiment': ['51st', '29th', '2nd', '19th', '12th', '101st', '90th', '30th', '193th', '1st', '94th', '91th'],
'trucks': ['MAZ-7310', np.nan, 'MAZ-7310', 'MAZ-7310', 'Tatra 810', 'Tatra 810', 'Tatra 810', 'Tatra 810', 'ZIS-150', 'Tatra 810', 'ZIS-150', 'ZIS-150'],
'tanks': ['Merkava Mark 4', 'Merkava Mark 4', 'Merkava Mark 4', 'Leopard 2A6M', 'Leopard 2A6M', 'Leopard 2A6M', 'Arjun MBT', 'Leopard 2A6M', 'Arjun MBT', 'Arjun MBT', 'Arjun MBT', 'Arjun MBT'],
'aircraft': ['none', 'none', 'none', 'Harbin Z-9', 'Harbin Z-9', 'none', 'Harbin Z-9', 'SH-60B Seahawk', 'SH-60B Seahawk', 'SH-60B Seahawk', 'SH-60B Seahawk', 'SH-60B Seahawk']}
df = pd.DataFrame(raw_data, columns = ['regiment', 'trucks', 'tanks', 'aircraft'])
# 查看前幾行
df.head()
```
| | regiment | trucks | tanks | aircraft |
| --- | --- | --- | --- | --- |
| 0 | 51st | MAZ-7310 | Merkava Mark 4 | none |
| 1 | 29th | NaN | Merkava Mark 4 | none |
| 2 | 2nd | MAZ-7310 | Merkava Mark 4 | none |
| 3 | 19th | MAZ-7310 | Leopard 2A6M | Harbin Z-9 |
| 4 | 12th | Tatra 810 | Leopard 2A6M | Harbin Z-9 |
```py
# 通過將 pandas 列轉換為集合
# 創建唯一值的列表
list(set(df.trucks))
# [nan, 'Tatra 810', 'MAZ-7310', 'ZIS-150']
# 創建 df.trucks 中的唯一值的列表
list(df['trucks'].unique())
# ['MAZ-7310', nan, 'Tatra 810', 'ZIS-150']
```
## 地理編碼和反向地理編碼
在使用地理數據時,地理編碼(將物理地址或位置轉換為經緯度)和反向地理編碼(將經緯度轉換為物理地址或位置)是常見任務。
Python 提供了許多軟件包,使任務變得異常簡單。 在下面的教程中,我使用 pygeocoder(Google 的 geo-API 的包裝器)來進行地理編碼和反向地理編碼。
首先,我們要加載我們想要在腳本中使用的包。 具體來說,我正在為地理函數加載 pygeocoder,為數據幀結構加載 pandas,為缺失值(`np.nan`)函數加載 numpy。
```py
# 加載包
from pygeocoder import Geocoder
import pandas as pd
import numpy as np
```
地理數據有多種形式,在這種情況下,我們有一個 Python 字典,包含五個經緯度的字符串,每個坐標在逗號分隔的坐標對中。
```py
# 創建原始數據的字典
data = {'Site 1': '31.336968, -109.560959',
'Site 2': '31.347745, -108.229963',
'Site 3': '32.277621, -107.734724',
'Site 4': '31.655494, -106.420484',
'Site 5': '30.295053, -104.014528'}
```
雖然技術上沒必要,因為我最初使用 R,我是數據幀的忠實粉絲,所以讓我們把模擬的數據字典變成數據幀。
```py
# 將字典轉換為 pandas 數據幀
df = pd.DataFrame.from_dict(data, orient='index')
# 查看數據幀
df
```
| | 0 |
| --- | --- |
| Site 1 | 31.336968, -109.560959 |
| Site 2 | 31.347745, -108.229963 |
| Site 3 | 32.277621, -107.734724 |
| Site 4 | 31.655494, -106.420484 |
| Site 5 | 30.295053, -104.014528 |
你現在可以看到,我們有了包含五行的數據幀,每行包含一個經緯度字符串。 在我們處理數據之前,我們需要1)將字符串分成緯度和經度,然后將它們轉換為浮點數。以下代碼就是這樣。
```py
# 為循環創建兩個列表
lat = []
lon = []
# 對于變量中的每一行
for row in df[0]:
# 嘗試
try:
# 用逗號分隔行,轉換為浮點
# 并將逗號前的所有內容追加到 lat
lat.append(float(row.split(',')[0]))
# 用逗號分隔行,轉換為浮點
# 并將逗號后的所有內容追加到 lon
lon.append(float(row.split(',')[1]))
# 但是如果你得到了錯誤
except:
# 向 lat 添加缺失值
lat.append(np.NaN)
# 向 lon 添加缺失值
lon.append(np.NaN)
# 從 lat 和 lon 創建新的兩列
df['latitude'] = lat
df['longitude'] = lon
```
讓我們看看現在有了什么。
```py
# 查看數據幀
df
```
| | 0 | latitude | longitude |
| --- | --- | --- | --- |
| Site 1 | 31.336968, -109.560959 | 31.336968 | -109.560959 |
| Site 2 | 31.347745, -108.229963 | 31.347745 | -108.229963 |
| Site 3 | 32.277621, -107.734724 | 32.277621 | -107.734724 |
| Site 4 | 31.655494, -106.420484 | 31.655494 | -106.420484 |
| Site 5 | 30.295053, -104.014528 | 30.295053 | -104.014528 |
真棒。這正是我們想要看到的,一列用于緯度的浮點和一列用于經度的浮點。
為了反轉地理編碼,我們將特定的經緯度對(這里為第一行,索引為`0`)提供給 pygeocoder 的`reverse_geocoder`函數。
```py
# 將經度和緯度轉換為某個位置
results = Geocoder.reverse_geocode(df['latitude'][0], df['longitude'][0])
```
現在我們可以開始提取我們想要的數據了。
```py
# 打印經緯度
results.coordinates
# (31.3372728, -109.5609559)
# 打印城市
results.city
# 'Douglas'
# 打印國家/地區
results.country
# 'United States'
# 打印街道地址(如果可用)
results.street_address
# 打印行政區
results.administrative_area_level_1
# 'Arizona'
```
對于地理編碼,我們需要將包含地址或位置(例如城市)的字符串,傳入地理編碼函數中。 但是,并非所有字符串的格式都是 Google 的 geo-API 可以理解的。 如果由`.geocode().valid_address`函數驗證有效,我們可以轉換。
```py
# 驗證地址是否有效(即在 Google 的系統中)
Geocoder.geocode("4207 N Washington Ave, Douglas, AZ 85607").valid_address
# True
```
因為輸出是True,我們現在知道這是一個有效的地址,因此可以打印緯度和經度坐標。
```py
# 打印經緯度
results.coordinates
# (31.3372728, -109.5609559)
```
但更有趣的是,一旦地址由 Google 地理 API 處理,我們就可以解析它并輕松地分隔街道號碼,街道名稱等。
```py
# 尋找特定地址中的經緯度
result = Geocoder.geocode("7250 South Tucson Boulevard, Tucson, AZ 85756")
# 打印街道號碼
result.street_number
# '7250'
# 打印街道名
result.route
# 'South Tucson Boulevard'
```
你就實現了它。Python 使整個過程變得簡單,只需幾分鐘即可完成分析。祝好運!
## 地理定位城市和國家
本教程創建一個函數,嘗試獲取城市和國家并返回其經緯度。 但是當城市不可用時(通常是這種情況),則返回該國中心的經緯度。
```py
from geopy.geocoders import Nominatim
geolocator = Nominatim()
import numpy as np
def geolocate(city=None, country=None):
'''
輸入城市和國家,或僅輸入國家。 如果可以的話,返回城市的經緯度坐標,否則返回該國家中心的經緯度。
'''
# 如果城市存在
if city != None:
# 嘗試
try:
# 地理定位城市和國家
loc = geolocator.geocode(str(city + ',' + country))
# 并返回經緯度
return (loc.latitude, loc.longitude)
# 否則
except:
# 返回缺失值
return np.nan
# 如果城市不存在
else:
# 嘗試
try:
# 地理定位國家中心
loc = geolocator.geocode(country)
# 返回經緯度
return (loc.latitude, loc.longitude)
# 否則
except:
# 返回缺失值
return np.nan
# 地理定位城市和國家
geolocate(city='Austin', country='USA')
# (30.2711286, -97.7436995)
# 僅僅地理定位國家
geolocate(country='USA')
# (39.7837304, -100.4458824)
```
## 使用 pandas 分組時間序列
```py
# 導入所需模塊
import pandas as pd
import numpy as np
df = pd.DataFrame()
df['german_army'] = np.random.randint(low=20000, high=30000, size=100)
df['allied_army'] = np.random.randint(low=20000, high=40000, size=100)
df.index = pd.date_range('1/1/2014', periods=100, freq='H')
df.head()
```
| | german_army | allied_army |
| --- | --- | --- |
| 2014-01-01 00:00:00 | 28755 | 33938 |
| 2014-01-01 01:00:00 | 25176 | 28631 |
| --- | --- | --- |
| 2014-01-01 02:00:00 | 23261 | 39685 |
| --- | --- | --- |
| 2014-01-01 03:00:00 | 28686 | 27756 |
| --- | --- | --- |
| 2014-01-01 04:00:00 | 24588 | 25681 |
| --- | --- | --- |
### Truncate the dataframe
```py
df.truncate(before='1/2/2014', after='1/3/2014')
```
| | german_army | allied_army |
| --- | --- | --- |
| 2014-01-02 00:00:00 | 26401 | 20189 |
| 2014-01-02 01:00:00 | 29958 | 23934 |
| 2014-01-02 02:00:00 | 24492 | 39075 |
| 2014-01-02 03:00:00 | 25707 | 39262 |
| 2014-01-02 04:00:00 | 27129 | 35961 |
| 2014-01-02 05:00:00 | 27903 | 25418 |
| 2014-01-02 06:00:00 | 20409 | 25163 |
| 2014-01-02 07:00:00 | 25736 | 34794 |
| 2014-01-02 08:00:00 | 24057 | 27209 |
| 2014-01-02 09:00:00 | 26875 | 33402 |
| 2014-01-02 10:00:00 | 23963 | 38575 |
| 2014-01-02 11:00:00 | 27506 | 31859 |
| 2014-01-02 12:00:00 | 23564 | 25750 |
| 2014-01-02 13:00:00 | 27958 | 24365 |
| 2014-01-02 14:00:00 | 24915 | 38866 |
| 2014-01-02 15:00:00 | 23538 | 33820 |
| 2014-01-02 16:00:00 | 23361 | 30080 |
| 2014-01-02 17:00:00 | 27284 | 22922 |
| 2014-01-02 18:00:00 | 24176 | 32155 |
| 2014-01-02 19:00:00 | 23924 | 27763 |
| 2014-01-02 20:00:00 | 23111 | 32343 |
| 2014-01-02 21:00:00 | 20348 | 28907 |
| 2014-01-02 22:00:00 | 27136 | 38634 |
| 2014-01-02 23:00:00 | 28649 | 29950 |
| 2014-01-03 00:00:00 | 21292 | 26395 |
```py
# 設置數據幀的索引
df.index = df.index + pd.DateOffset(months=4, days=5)
df.head()
```
| | german_army | allied_army |
| --- | --- | --- |
| 2014-05-06 00:00:00 | 28755 | 33938 |
| 2014-05-06 01:00:00 | 25176 | 28631 |
| 2014-05-06 02:00:00 | 23261 | 39685 |
| 2014-05-06 03:00:00 | 28686 | 27756 |
| 2014-05-06 04:00:00 | 24588 | 25681 |
```py
# 將變量提前一小時
df.shift(1).head()
```
| | german_army | allied_army |
| --- | --- | --- |
| 2014-05-06 00:00:00 | NaN | NaN |
| 2014-05-06 01:00:00 | 28755.0 | 33938.0 |
| 2014-05-06 02:00:00 | 25176.0 | 28631.0 |
| 2014-05-06 03:00:00 | 23261.0 | 39685.0 |
| 2014-05-06 04:00:00 | 28686.0 | 27756.0 |
```py
# 將變量延后一小時
df.shift(-1).tail()
```
| | german_army | allied_army |
| --- | --- | --- |
| 2014-05-09 23:00:00 | 26903.0 | 39144.0 |
| 2014-05-10 00:00:00 | 27576.0 | 39759.0 |
| 2014-05-10 01:00:00 | 25232.0 | 35246.0 |
| 2014-05-10 02:00:00 | 23391.0 | 21044.0 |
| 2014-05-10 03:00:00 | NaN | NaN |
```py
# 對每小時觀測值求和來按天匯總
df.resample('D').sum()
```
| | german_army | allied_army |
| --- | --- | --- |
| 2014-05-06 | 605161 | 755962 |
| 2014-05-07 | 608100 | 740396 |
| 2014-05-08 | 589744 | 700297 |
| 2014-05-09 | 607092 | 719283 |
| 2014-05-10 | 103102 | 135193 |
```py
# 對每小時觀測值求平均來按天匯總
df.resample('D').mean()
```
| | german_army | allied_army |
| --- | --- | --- |
| 2014-05-06 | 25215.041667 | 31498.416667 |
| 2014-05-07 | 25337.500000 | 30849.833333 |
| 2014-05-08 | 24572.666667 | 29179.041667 |
| 2014-05-09 | 25295.500000 | 29970.125000 |
| 2014-05-10 | 25775.500000 | 33798.250000 |
```py
# 對每小時觀測值求最小值來按天匯總
df.resample('D').min()
```
| | german_army | allied_army |
| --- | --- | --- |
| 2014-05-06 | 24882.0 | 31310.0 |
| 2014-05-07 | 25311.0 | 30969.5 |
| 2014-05-08 | 24422.5 | 28318.0 |
| 2014-05-09 | 24941.5 | 32082.5 |
| 2014-05-10 | 26067.5 | 37195.0 |
```py
# 對每小時觀測值求中值來按天匯總
df.resample('D').median()
```
| | german_army | allied_army |
| --- | --- | --- |
| 2014-05-06 | 24882.0 | 31310.0 |
| 2014-05-07 | 25311.0 | 30969.5 |
| 2014-05-08 | 24422.5 | 28318.0 |
| 2014-05-09 | 24941.5 | 32082.5 |
| 2014-05-10 | 26067.5 | 37195.0 |
```py
# 對每小時觀測值取第一個值來按天匯總
df.resample('D').first()
```
| | german_army | allied_army |
| --- | --- | --- |
| 2014-05-06 | 28755 | 33938 |
| 2014-05-07 | 26401 | 20189 |
| 2014-05-08 | 21292 | 26395 |
| 2014-05-09 | 25764 | 22613 |
| 2014-05-10 | 26903 | 39144 |
```py
# 對每小時觀測值取最后一個值來按天匯總
df.resample('D').last()
```
| | german_army | allied_army |
| --- | --- | --- |
| 2014-05-06 | 28214 | 32110 |
| 2014-05-07 | 28649 | 29950 |
| 2014-05-08 | 28379 | 32600 |
| 2014-05-09 | 26752 | 22379 |
| 2014-05-10 | 23391 | 21044 |
```py
# 對每小時觀測值取第一個值,最后一個值,最高值,最低值來按天匯總
df.resample('D').ohlc()
```
| | german_army | allied_army |
| --- | --- | --- |
| | open | high | low | close | open | high | low | close |
| 2014-05-06 | 28755 | 29206 | 20037 | 28214 | 33938 | 39955 | 23417 | 32110 |
| 2014-05-07 | 26401 | 29958 | 20348 | 28649 | 20189 | 39262 | 20189 | 29950 |
| 2014-05-08 | 21292 | 29786 | 20296 | 28379 | 26395 | 38197 | 20404 | 32600 |
| 2014-05-09 | 25764 | 29952 | 20738 | 26752 | 22613 | 39695 | 20189 | 22379 |
| 2014-05-10 | 26903 | 27576 | 23391 | 23391 | 39144 | 39759 | 21044 | 21044 |
## 按時間分組數據
2016 年 3 月 13 日,Pandas 版本 0.18.0 發布,重采樣功能的運行方式發生了重大變化。 本教程遵循 v0.18.0,不適用于以前版本的 pandas。
首先讓我們加載我們關心的模塊。
```py
# 導入所需模塊
import pandas as pd
import datetime
import numpy as np
```
接下來,讓我們創建一些樣例數據,我們可以將它們按時間分組作為樣本。 在這個例子中,我創建了一個包含兩列 365 行的數據幀。一列是日期,第二列是數值。
```py
# 為今天創建 datetime 變量
base = datetime.datetime.today()
# 創建一列變量
# 包含 365 天的 datetime 值
date_list = [base - datetime.timedelta(days=x) for x in range(0, 365)]
# 創建 365 個數值的列表
score_list = list(np.random.randint(low=1, high=1000, size=365))
# 創建空數據幀
df = pd.DataFrame()
# 從 datetime 變量創建一列
df['datetime'] = date_list
# 將列轉換為 datetime 類型
df['datetime'] = pd.to_datetime(df['datetime'])
# 將 datetime 列設為索引
df.index = df['datetime']
# 為數值得分變量創建一列
df['score'] = score_list
# 讓我們看看數據
df.head()
```
| | datetime | score |
| --- | --- | --- |
| datetime | | |
| 2016-06-02 09:57:54.793972 | 2016-06-02 09:57:54.793972 | 900 |
| 2016-06-01 09:57:54.793972 | 2016-06-01 09:57:54.793972 | 121 |
| 2016-05-31 09:57:54.793972 | 2016-05-31 09:57:54.793972 | 547 |
| 2016-05-30 09:57:54.793972 | 2016-05-30 09:57:54.793972 | 504 |
| 2016-05-29 09:57:54.793972 | 2016-05-29 09:57:54.793972 | 304 |
在 pandas 中,按時間分組的最常用方法是使用`.resample()`函數。 在 v0.18.0 中,此函數是兩階段的。 這意味著`df.resample('M')`創建了一個對象,我們可以對其應用其他函數(`mean`,`count`,`sum`等)
```py
# 按月對數據分組,并取每組(即每個月)的平均值
df.resample('M').mean()
```
| | score |
| --- | --- |
| datetime | |
| 2015-06-30 | 513.629630 |
| 2015-07-31 | 561.516129 |
| 2015-08-31 | 448.032258 |
| 2015-09-30 | 548.000000 |
| 2015-10-31 | 480.419355 |
| 2015-11-30 | 487.033333 |
| 2015-12-31 | 499.935484 |
| 2016-01-31 | 429.193548 |
| 2016-02-29 | 520.413793 |
| 2016-03-31 | 349.806452 |
| 2016-04-30 | 395.500000 |
| 2016-05-31 | 503.451613 |
| 2016-06-30 | 510.500000 |
```py
# 按月對數據分組,并獲取每組(即每個月)的總和
df.resample('M').sum()
```
| | score |
| --- | --- |
| datetime | |
| 2015-06-30 | 13868 |
| 2015-07-31 | 17407 |
| 2015-08-31 | 13889 |
| 2015-09-30 | 16440 |
| 2015-10-31 | 14893 |
| 2015-11-30 | 14611 |
| 2015-12-31 | 15498 |
| 2016-01-31 | 13305 |
| 2016-02-29 | 15092 |
| 2016-03-31 | 10844 |
| 2016-04-30 | 11865 |
| 2016-05-31 | 15607 |
| 2016-06-30 | 1021 |
分組有很多選項。 你可以在[ Pandas 的時間序列文檔](http://pandas.pydata.org/pandas-docs/stable/timeseries.html)中了解它們的更多信息,但是,為了你的方便,我也在下面列出了它們。
| 值 | 描述 |
| --- | --- |
| B | business day frequency |
| C | custom business day frequency (experimental) |
| D | calendar day frequency |
| W | weekly frequency |
| M | month end frequency |
| BM | business month end frequency |
| CBM | custom business month end frequency |
| MS | month start frequency |
| BMS | business month start frequency |CBMS| custom business month start frequency |
| Q | quarter end frequency |
| BQ | business quarter endfrequency |
| QS | quarter start frequency |
| BQS | business quarter start frequency |
| A | year end frequency |
| BA | business year end frequency |
| AS | year start frequency |
| BAS | business year start frequency |
| BH | business hour frequency |
| H | hourly frequency |
| T | minutely frequency |
| S | secondly frequency |
| L | milliseonds |
| U | microseconds |
| N | nanoseconds |
## 按小時分組數據
```py
# 導入庫
import pandas as pd
import numpy as np
# 創建 2000 個元素的時間序列
# 每五分鐘一個元素,起始于 2000.1.1
time = pd.date_range('1/1/2000', periods=2000, freq='5min')
# 創建 pandas 序列,帶有 0 到 100 的隨機值
# 將 time 用于索引
series = pd.Series(np.random.randint(100, size=2000), index=time)
# 查看前幾行
series[0:10]
'''
2000-01-01 00:00:00 40
2000-01-01 00:05:00 13
2000-01-01 00:10:00 99
2000-01-01 00:15:00 72
2000-01-01 00:20:00 4
2000-01-01 00:25:00 36
2000-01-01 00:30:00 24
2000-01-01 00:35:00 20
2000-01-01 00:40:00 83
2000-01-01 00:45:00 44
Freq: 5T, dtype: int64
'''
# 按索引的小時值對數據分組,然后按平均值進行匯總
series.groupby(series.index.hour).mean()
'''
0 50.380952
1 49.380952
2 49.904762
3 53.273810
4 47.178571
5 46.095238
6 49.047619
7 44.297619
8 53.119048
9 48.261905
10 45.166667
11 54.214286
12 50.714286
13 56.130952
14 50.916667
15 42.428571
16 46.880952
17 56.892857
18 54.071429
19 47.607143
20 50.940476
21 50.511905
22 44.550000
23 50.250000
dtype: float64
'''
```
## 對行分組
```py
# 導入模塊
import pandas as pd
# 示例數據幀
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],
'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df
```
| | regiment | company | name | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- |
| 0 | Nighthawks | 1st | Miller | 4 | 25 |
| 1 | Nighthawks | 1st | Jacobson | 24 | 94 |
| 2 | Nighthawks | 2nd | Ali | 31 | 57 |
| 3 | Nighthawks | 2nd | Milner | 2 | 62 |
| 4 | Dragoons | 1st | Cooze | 3 | 70 |
| 5 | Dragoons | 1st | Jacon | 4 | 25 |
| 6 | Dragoons | 2nd | Ryaner | 24 | 94 |
| 7 | Dragoons | 2nd | Sone | 31 | 57 |
| 8 | Scouts | 1st | Sloan | 2 | 62 |
| 9 | Scouts | 1st | Piger | 3 | 70 |
| 10 | Scouts | 2nd | Riani | 2 | 62 |
| 11 | Scouts | 2nd | Ali | 3 | 70 |
```py
# 創建分組對象。 換句話說,
# 創建一個表示該特定分組的對象。
# 這里,我們按照團隊來分組 pre-test 得分。
regiment_preScore = df['preTestScore'].groupby(df['regiment'])
# 展示每個團隊的 pre-test 得分的均值
regiment_preScore.mean()
'''
regiment
Dragoons 15.50
Nighthawks 15.25
Scouts 2.50
Name: preTestScore, dtype: float64
'''
```
## Pandas 中的分層數據
```py
# 導入模塊
import pandas as pd
# 創建數據幀
raw_data = {'regiment': ['Nighthawks', 'Nighthawks', 'Nighthawks', 'Nighthawks', 'Dragoons', 'Dragoons', 'Dragoons', 'Dragoons', 'Scouts', 'Scouts', 'Scouts', 'Scouts'],
'company': ['1st', '1st', '2nd', '2nd', '1st', '1st', '2nd', '2nd','1st', '1st', '2nd', '2nd'],
'name': ['Miller', 'Jacobson', 'Ali', 'Milner', 'Cooze', 'Jacon', 'Ryaner', 'Sone', 'Sloan', 'Piger', 'Riani', 'Ali'],
'preTestScore': [4, 24, 31, 2, 3, 4, 24, 31, 2, 3, 2, 3],
'postTestScore': [25, 94, 57, 62, 70, 25, 94, 57, 62, 70, 62, 70]}
df = pd.DataFrame(raw_data, columns = ['regiment', 'company', 'name', 'preTestScore', 'postTestScore'])
df
```
| | regiment | company | name | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- |
| 0 | Nighthawks | 1st | Miller | 4 | 25 |
| 1 | Nighthawks | 1st | Jacobson | 24 | 94 |
| 2 | Nighthawks | 2nd | Ali | 31 | 57 |
| 3 | Nighthawks | 2nd | Milner | 2 | 62 |
| 4 | Dragoons | 1st | Cooze | 3 | 70 |
| 5 | Dragoons | 1st | Jacon | 4 | 25 |
| 6 | Dragoons | 2nd | Ryaner | 24 | 94 |
| 7 | Dragoons | 2nd | Sone | 31 | 57 |
| 8 | Scouts | 1st | Sloan | 2 | 62 |
| 9 | Scouts | 1st | Piger | 3 | 70 |
| 10 | Scouts | 2nd | Riani | 2 | 62 |
| 11 | Scouts | 2nd | Ali | 3 | 70 |
```py
# 設置分層索引但將列保留在原位
df = df.set_index(['regiment', 'company'], drop=False)
df
```
| | | regiment | company | name | preTestScore | postTestScore |
| --- | --- | --- | --- | --- | --- | --- |
| regiment | company | | | | | |
| | Nighthawks | 1st | Nighthawks | 1st | Miller | 4 | 25 |
| 1st | Nighthawks | 1st | Jacobson | 24 | 94 |
| 2nd | Nighthawks | 2nd | Ali | 31 | 57 |
| 2nd | Nighthawks | 2nd | Milner | 2 | 62 |
| | Dragoons | 1st | Dragoons | 1st | Cooze | 3 | 70 |
| 1st | Dragoons | 1st | Jacon | 4 | 25 |
| 2nd | Dragoons | 2nd | Ryaner | 24 | 94 |
| 2nd | Dragoons | 2nd | Sone | 31 | 57 |
| | Scouts | 1st | Scouts | 1st | Sloan | 2 | 62 |
| 1st | Scouts | 1st | Piger | 3 | 70 |
| 2nd | Scouts | 2nd | Riani | 2 | 62 |
| 2nd | Scouts | 2nd | Ali | 3 | 70 |
```py
# 將分層索引設置為團隊然后公司
df = df.set_index(['regiment', 'company'])
df
```
| | | name | preTestScore | postTestScore |
| --- | --- | --- | --- | --- |
| regiment | company | | | |
| | Nighthawks | 1st | Miller | 4 | 25 |
| 1st | Jacobson | 24 | 94 |
| 2nd | Ali | 31 | 57 |
| 2nd | Milner | 2 | 62 |
| | Dragoons | 1st | Cooze | 3 | 70 |
| 1st | Jacon | 4 | 25 |
| 2nd | Ryaner | 24 | 94 |
| 2nd | Sone | 31 | 57 |
| | Scouts | 1st | Sloan | 2 | 62 |
| 1st | Piger | 3 | 70 |
| 2nd | Riani | 2 | 62 |
| 2nd | Ali | 3 | 70 |
```py
# 查看索引
df.index
MultiIndex(levels=[['Dragoons', 'Nighthawks', 'Scouts'], ['1st', '2nd']],
labels=[[1, 1, 1, 1, 0, 0, 0, 0, 2, 2, 2, 2], [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]],
names=['regiment', 'company'])
# 交換索引中的級別
df.swaplevel('regiment', 'company')
```
| | | name | preTestScore | postTestScore |
| --- | --- | --- | --- | --- |
| company | regiment | | | |
| 1st | Nighthawks | Miller | 4 | 25 |
| | Nighthawks | Jacobson | 24 | 94 |
| 2nd | Nighthawks | Ali | 31 | 57 |
| | Nighthawks | Milner | 2 | 62 |
| 1st | Dragoons | Cooze | 3 | 70 |
| | Dragoons | Jacon | 4 | 25 |
| 2nd | Dragoons | Ryaner | 24 | 94 |
| | Dragoons | Sone | 31 | 57 |
| 1st | Scouts | Sloan | 2 | 62 |
| | Scouts | Piger | 3 | 70 |
| 2nd | Scouts | Riani | 2 | 62 |
| | Scouts | Ali | 3 | 70 |
```py
# 按需求和數據
df.sum(level='regiment')
```
| | preTestScore | postTestScore |
| --- | --- | --- |
| regiment | | |
| Nighthawks | 61 | 238 |
| Dragoons | 62 | 246 |
| Scouts | 10 | 264 |