六、數據可視化 · UCB DS100 數據科學的原理與技巧

# 六、數據可視化 > 原文：[DS-100/textbook/notebooks/ch06](https://nbviewer.jupyter.org/github/DS-100/textbook/tree/master/notebooks/ch06/) > > 校驗：[飛龍](https://github.com/wizardforcel) > > 自豪地采用[谷歌翻譯](https://translate.google.cn/) > 圖表中有一種魔力。曲線的輪廓瞬間揭示出整體情況 - 流行病，恐慌或繁榮時代的歷史。曲線將信息傳給大腦，激活想象力，并具有說服力。 > > Henry D. Hubbard 數據可視化是數據科學的每個分析步驟的必不可少的工具，從數據清理，到 EDA，到傳達結論和預測。由于人類的大腦的視覺感知高度發達，精心挑選的繪圖往往比文本描述更有效地揭示數據中的趨勢和異常。為了高效地使用數據可視化，你必須精通編程工具來生成繪圖，以及可視化原則。在本章中，我們將介紹`seaborn`和`matplotlib`，這是我們創建繪圖的首選工具。我們還將學習如何發現誤導性的可視化，以及如何使用數據轉換，平滑和降維來改善可視化。 ## 定量數據我們通常使用不同類型的圖表來可視化定量（數字）數據和定性（序數或標稱）數據。對于定量數據，我們通常使用直方圖，箱形圖和散點圖。我們可以使用`seaborn`繪圖庫在 Python 中創建這些圖。我們將使用包含泰坦尼克號上乘客信息的數據集。 ```py # Import seaborn and apply its plotting styles import seaborn as sns sns.set() # Load the dataset and drop N/A values to make plot function calls simpler ti = sns.load_dataset('titanic').dropna().reset_index(drop=True) # This table is too large to fit onto a page so we'll output sliders to # pan through different sections. df_interact(ti) # (182 rows, 15 columns) total ``` ### 直方圖我們可以看到數據集每行包含一個乘客，包括乘客的年齡和乘客為機票支付的金額。讓我們用直方圖來可視化年齡。我們可以使用`seaborn`的`distplot`函數： ```py # Adding a semi-colon at the end tells Jupyter not to output the # usual <matplotlib.axes._subplots.AxesSubplot> line sns.distplot(ti['age']); ``` ![](https://img.kancloud.cn/1f/de/1fdee584d01b044266d1ace085d756a7_392x270.png) 默認情況下，`seaborn`的`distplot`函數將輸出一個平滑的曲線，大致擬合分布。我們還可以添加`rug`繪圖，在`x`軸上標記每個點： ```py sns.distplot(ti['age'], rug=True); ``` ![](https://img.kancloud.cn/e2/1f/e21f7940803626d5934825119cf7bd08_392x270.png) 我們也可以繪制分布本身。調整桶的數量表明船上有許多兒童。 ```py sns.distplot(ti['age'], kde=False, bins=30); ``` ![](https://img.kancloud.cn/fd/98/fd98fd0917f53cf263879d037d250c8a_372x270.png) ### 箱形圖箱形圖是查看大部分數據所在位置的便捷方式。通常，我們使用數據的 25 和 75 百分位數作為箱子的起點和終點，并在箱子內為 50 百分位數（中位數）繪制一條線。我們繪制兩個“胡須”，除了離群點外，它們可以擴展顯示剩余數據，離群點數據被標記為胡須外的各個點。 ```py sns.boxplot(x='fare', data=ti); ``` ![](https://img.kancloud.cn/58/bf/58bfee09f82ee84c60015385763fc46a_349x270.png) 我們通常使用四分位間距（IQR）來確定，哪些點被認為是箱形圖的異常值。 IQR 是數據的 75 百分位數與 25 百分位數的差。 ```py lower, upper = np.percentile(ti['fare'], [25, 75]) iqr = upper - lower iqr # 60.299999999999997 ``` 比 75 百分位數大`1.5×IQR`，或者比 25 百分位數小`1.5×IQR`的值，被認為是離群點，我們可以在上面的箱形圖中看到，它們被單獨標記： ```py upper_cutoff = upper + 1.5 * iqr lower_cutoff = lower - 1.5 * iqr upper_cutoff, lower_cutoff # (180.44999999999999, -60.749999999999986) ``` 雖然直方圖一次顯示整個分布，但當我們按不同類別分割數據時，箱型圖通常更容易理解。例如，我們可以為每類乘客制作一個箱形圖： ```py sns.boxplot(x='fare', y='who', data=ti); ``` ![](https://img.kancloud.cn/74/d1/74d14a79e8aecc448fe3302eabd6a94b_418x270.png) 單獨的箱形圖比下面的重疊直方圖更容易理解，它繪制相同數據： ```py sns.distplot(ti.loc[ti['who'] == 'woman', 'fare']) sns.distplot(ti.loc[ti['who'] == 'man', 'fare']) sns.distplot(ti.loc[ti['who'] == 'child', 'fare']); ``` ![](https://img.kancloud.cn/f6/3f/f63f632a643fd6d0f93386158e820f2c_390x270.png) ### 使用 Seaborn 的簡要注解你可能已經注意到，為`who`列創建單個箱形圖的`boxplot`調用，比制作疊加直方圖的等效代碼更簡單。雖然`sns.distplot`接受數據數組或序列，但大多數其他`seaborn`函數允許你傳入一個`DataFrame`，并指定在`x`和`y`軸上繪制哪一列。例如： ```py # Plots the `fare` column of the `ti` DF on the x-axis sns.boxplot(x='fare', data=ti); ``` 當列是類別的時候（`'who'`列包含`'woman'`，`'man'`和`'child'`），`seaborn`會在繪圖之前自動按照類別分割數據。這意味著，我們不必像我們為`sns.distplot`所做的那樣，自己過濾掉每個類別。 ```py # fare (numerical) on the x-axis, # who (nominal) on the y-axis sns.boxplot(x='fare', y='who', data=ti); ``` ![](https://img.kancloud.cn/74/d1/74d14a79e8aecc448fe3302eabd6a94b_418x270.png) ### 散點圖散點圖用于比較兩個定量變量。我們可以使用散點圖比較泰坦尼克號數據集的年齡和票價列。 ```py sns.lmplot(x='age', y='fare', data=ti); ``` ![](https://img.kancloud.cn/2c/a7/2ca7b17a96fb35a3c31e74b60ac80a79_340x340.png) 默認情況下，`seaborn`也會使回歸直線擬合我們的散點圖，并自舉散點圖，在回歸線周圍創建 95% 置信區間，如上圖中的淺藍色陰影所示。這里，回歸線似乎不太適合散點圖，所以我們可以關閉回歸。 ```py sns.lmplot(x='age', y='fare', data=ti, fit_reg=False); ``` ![](https://img.kancloud.cn/0a/49/0a49330c8e1278fced1291a9a8dcad71_340x340.png) 我們可以使用類別變量對點進行著色。讓我們再次使用`who`列： ```py sns.lmplot(x='age', y='fare', hue='who', data=ti, fit_reg=False); ``` ![](https://img.kancloud.cn/8b/1b/8b1bf95db61abcfe76f0fb87a051c97d_423x340.png) 我們可以從這個圖中看出，所有年齡在 18 歲左右的乘客都標記為小孩。雖然兩張最貴的門票是男性購買的，但男女乘客票價似乎沒有明顯的差異。 ## 定性數據對于定性或類別數據，我們通常使用條形圖和點圖。我們將展示，如何使用`seaborn`和泰坦尼克號幸存者數據集創建這些繪圖。 ```py # Import seaborn and apply its plotting styles import seaborn as sns sns.set() # Load the dataset ti = sns.load_dataset('titanic').reset_index(drop=True) # This table is too large to fit onto a page so we'll output sliders to # pan through different sections. df_interact(ti) # (891 rows, 15 columns) total ``` ### 條形圖在`seaborn`中，有兩種類型的條形圖。第一種使用`countplot`方法來計算每個類別在列中出現的次數。 ```py # Counts how many passengers survived and didn't survive and # draws bars with corresponding heights sns.countplot(x='alive', data=ti); ``` ![](https://img.kancloud.cn/54/ce/54ce88851f0bdfedde9366088ea55f23_397x270.png) ```py sns.countplot(x='class', data=ti); ``` ![](https://img.kancloud.cn/d1/09/d109402388ba7a7eb61dc5b19370d5d8_397x270.png) ```py # As with box plots, we can break down each category further using color sns.countplot(x='alive', hue='class', data=ti); ``` ![](https://img.kancloud.cn/e9/d3/e9d3dc9c0472748c6b3b22eba29a9cf7_397x270.png) 另一方面，`barplot`方法按照類別列對`DataFrame`進行分組，并根據每個組內的數字列的平均值繪制條的高度。 ```py # For each set of alive/not alive passengers, compute and plot the average age. sns.barplot(x='alive', y='age', data=ti); ``` 通過對原始`DataFrame`進行分組，并計算年齡列的平均值，可以計算每個條形的高度： ```py ti[['alive', 'age']].groupby('alive').mean() ``` | | age | | --- | --- | | alive | | | no | 30.626179 | | yes | 28.343690 | 默認情況下，`barplot`方法也會為每個平均值，計算自舉的 95% 置信區間，在上面的條形圖中標記為黑線。置信區間表明，如果數據集包含泰坦尼克號乘客的隨機樣本，那么在 5% 的顯著性水平下，幸存者和死亡者的年齡差異無統計學意義。當我們有更大的數據集時，這些置信區間需要很長的時間才能生成，因此有時會關閉它們： ```py sns.barplot(x='alive', y='age', data=ti, ci=False); ``` ![](https://img.kancloud.cn/f9/50/f950aef4a596a8bd9e0da18657361e44_390x270.png) ### 點圖點圖與條形圖類似。它不是繪制條形圖，而是標記條形圖末尾處的單個點。我們使用`pointplot`方法在`seaborn`中制作點圖。與`barplot`方法一樣，`pointplot`方法也自動將`DataFrame`分組，并計算每組數值變量的平均值，將 95% 置信區間標記為以每個點為中心的垂直線。 ```py # For each set of alive/not alive passengers, compute and plot the average age. sns.pointplot(x='alive', y='age', data=ti); ``` ![](https://img.kancloud.cn/39/f1/39f11134f005e98dec96c1f0e9e8d4d9_390x270.png) 比較跨類別的變化時，點圖非常有用： ```py # Shows the proportion of survivors for each passenger class sns.pointplot(x='class', y='survived', data=ti); ``` ![](https://img.kancloud.cn/48/12/48120125e053269515a7fda18e256dc0_394x270.png) ```py # Shows the proportion of survivors for each passenger class, # split by whether the passenger was an adult male sns.pointplot(x='class', y='survived', hue='adult_male', data=ti); ``` ![](https://img.kancloud.cn/39/64/3964ad3e1cebc7e52da3e25cf1f6836d_394x270.png) ## 使用`matplotlib`定制繪圖盡管`seaborn`能讓我們快速創建多種類型的繪圖，但它并不能讓我們細致地控制圖表。例如，我們不能使用`seaborn`來修改繪圖的標題，更改`x`或`y`軸標簽，或將注釋添加到繪圖。相反，我們必須使用`seaborn`所基于的`matplotlib`庫。 `matplotlib`提供了基本的積木，用于在 Python 中創建繪圖。雖然它提供了很好的控制權，但它也更加麻煩 - 使用`matplotlib`重新創建之前章節中的`seaborn`繪圖需要多行代碼。實際上，我們可以將`seaborn`視為一組有用的快捷方式，用于創建`matplotlib`繪圖。盡管我們更喜歡在`seaborn`中創建繪圖的原型，但為了定制繪圖以便發布，我們需要學習`matplotlib`的基本部分。在我們查看第一個簡單示例之前，我們必須在筆記本中開啟`matplotlib`支持： ```py # This line allows matplotlib plots to appear as images in the notebook # instead of in a separate window. %matplotlib inline # plt is a commonly used shortcut for matplotlib import matplotlib.pyplot as plt ``` ### 定制圖表和軸域為了在`matplotlib`中創建繪圖，我們創建圖形（Figure），然后在圖形中添加一個軸域（Axes）。在`matplotlib`中，軸域是單個圖表，而圖形可以在一個表格布局中包含多個軸域。軸域包含標記，圖上繪制的線或補丁。 ```py # Create a figure f = plt.figure() # Add an axes to the figure. The second and third arguments create a table # with 1 row and 1 column. The first argument places the axes in the first # cell of the table. ax = f.add_subplot(1, 1, 1) # Create a line plot on the axes ax.plot([0, 1, 2, 3], [1, 3, 4, 3]) # Show the plot. This will automatically get called in a Jupyter notebook # so we'll omit it in future cells plt.show() ``` ![](https://img.kancloud.cn/a1/79/a179d3a645aed8bd8153be18bbc72b6f_375x252.png) 為了自定義繪圖，我們可以在軸域對象上使用其他方法： ```py f = plt.figure() ax = f.add_subplot(1, 1, 1) x = np.arange(0, 10, 0.1) # Setting the label kwarg lets us generate a legend ax.plot(x, np.sin(x), label='sin(x)') ax.plot(x, np.cos(x), label='cos(x)') ax.legend() ax.set_title('Sinusoids') ax.set_xlabel('x') ax.set_ylabel('y'); ``` ![](https://img.kancloud.cn/e0/9b/e09b494aaedeeb9cc836a7dedc5741e4_404x278.png) 作為一種捷徑，`matplotlib`在`plt`模塊上提供了繪制方法，會自動初始化圖形和軸域。 ```py # Shorthand to create figure and axes and call ax.plot plt.plot(x, np.sin(x)) # When plt methods are called multiple times in the same cell, the # existing figure and axes are reused. plt.scatter(x, np.cos(x)); ``` ![](https://img.kancloud.cn/57/78/5778bd5c980ef87ba8fb497cc9e94a1e_383x252.png) `plt`模塊的方法與軸域相似，因此我們可以使用`plt`捷徑重新創建其中一個上述的圖。 ```py x = np.arange(0, 10, 0.1) plt.plot(x, np.sin(x), label='sin(x)') plt.plot(x, np.cos(x), label='cos(x)') plt.legend() # Shorthand for ax.set_title plt.title('Sinusoids') plt.xlabel('x') plt.ylabel('y') # Set the x and y-axis limits plt.xlim(-1, 11) plt.ylim(-1.2, 1.2); ``` ![](https://img.kancloud.cn/00/26/00263d8690193d1369335f0678104129_397x278.png) ### 定制標記為了改變繪圖標記本身的屬性（例如上圖中的線），我們可以將其他參數傳遞給`plt.plot`。 ```py additional arguments into plt.plot. plt.plot(x, np.sin(x), linestyle='--', color='purple'); ``` 查看`matplotlib`文檔，是找出哪些參數可用于每種方法的最簡單方法。另一種方法是存儲返回的線條對象： ```py In [1]: line, = plot([1,2,3]) ``` 這些線條對象有許多可以控制的屬性，下面是在 IPython 中使用 Tab 補全的完整列表： ``` In [2]: line.set line.set line.set_drawstyle line.set_mec line.set_aa line.set_figure line.set_mew line.set_agg_filter line.set_fillstyle line.set_mfc line.set_alpha line.set_gid line.set_mfcalt line.set_animated line.set_label line.set_ms line.set_antialiased line.set_linestyle line.set_picker line.set_axes line.set_linewidth line.set_pickradius line.set_c line.set_lod line.set_rasterized line.set_clip_box line.set_ls line.set_snap line.set_clip_on line.set_lw line.set_solid_capstyle line.set_clip_path line.set_marker line.set_solid_joinstyle line.set_color line.set_markeredgecolor line.set_transform line.set_contains line.set_markeredgewidth line.set_url line.set_dash_capstyle line.set_markerfacecolor line.set_visible line.set_dashes line.set_markerfacecoloralt line.set_xdata line.set_dash_joinstyle line.set_markersize line.set_ydata line.set_data line.set_markevery line.set_zorder ``` 但是`setp`調用（設置屬性的縮寫）可能非常有用，尤其是在交互式工作時，因為它支持自檢，所以你可以在工作時了解有效的調用： ``` In [7]: line, = plot([1,2,3]) In [8]: setp(line, 'linestyle') linestyle: [ ``'-'`` | ``'--'`` | ``'-.'`` | ``':'`` | ``'None'`` | ``' '`` | ``''`` ] and any drawstyle in combination with a linestyle, e.g. ``'steps--'``. In [9]: setp(line) agg_filter: unknown alpha: float (0.0 transparent through 1.0 opaque) animated: [True | False] antialiased or aa: [True | False] ... ... much more output omitted ... ``` 在第一種形式中，它向你顯示`'linestyle'`屬性的有效值，并在第二種形式中向你顯示，可以在線條對象上設置的所有可接受屬性。這使得你可以輕松發現，如何定制圖形來獲得你所需的視覺效果。 ### 任意文本和 LaTeX 支持在`matplotlib`中，可以相對于單獨的軸對象或整個圖形添加文本。這些命令將文本添加到軸域： + `set_title()` - 添加標題 + `set_xlabel()` - 向 x 軸添加軸標簽 + `set_ylabel()` - 向 y 軸添加軸標簽 + `text()` - 在任意位置添加文本 + `annotate()` - 添加注解，帶有可選的箭頭以及這些作用于整個圖形： + `figtext()` - 在任意位置添加文本 + `suptitle()` - 添加標題并且任何文本字段都可以包含用于數學的 LaTeX 表達式，只要它們包含在`$`符號中即可。這個例子演示了所有這些： ```py fig = plt.figure() fig.suptitle('bold figure suptitle', fontsize=14, fontweight='bold') ax = fig.add_subplot(1, 1, 1) fig.subplots_adjust(top=0.85) ax.set_title('axes title') ax.set_xlabel('xlabel') ax.set_ylabel('ylabel') ax.text(3, 8, 'boxed italics text in data coords', style='italic', bbox={'facecolor':'red', 'alpha':0.5, 'pad':10}) ax.text(2, 6, 'an equation: $E=mc^2$', fontsize=15) ax.text(3, 2, 'unicode: Institut für Festk?rperphysik') ax.text(0.95, 0.01, 'colored text in axes coords', verticalalignment='bottom', horizontalalignment='right', transform=ax.transAxes, color='green', fontsize=15) ax.plot([2], [1], 'o') ax.annotate('annotate', xy=(2, 1), xytext=(3, 4), arrowprops=dict(facecolor='black', shrink=0.05)) ax.axis([0, 10, 0, 10]); ``` ![](https://img.kancloud.cn/9d/47/9d47fdf0561ff0687d6c19e44abdf681_389x291.png) ### 使用`matplotlib`定制`seaborn`繪圖既然我們已經看到了如何使用`matplotlib`來定制繪圖，我們可以使用相同的方法來定制`seaborn`繪圖，因為`seaborn`在背后使用`matplotlib`創建了繪圖。 ```py # Load seaborn import seaborn as sns sns.set() sns.set_context('talk') # Load dataset ti = sns.load_dataset('titanic').dropna().reset_index(drop=True) ti.head() ``` | | survived | pclass | sex | age | ... | deck | embark_town | alive | alone | | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | | 0 | 1 | 1 | female | 38.0 | ... | C | Cherbourg | yes | False | | 1 | 1 | 1 | female | 35.0 | ... | C | Southampton | yes | False | | 2 | 0 | 1 | male | 54.0 | ... | E | Southampton | no | True | | 3 | 1 | 3 | female | 4.0 | ... | G | Southampton | yes | False | | 4 | 1 | 1 | female | 58.0 | ... | C | Southampton | yes | True | 5 rows × 15 columns 我們以這個繪圖開始： ```py sns.lmplot(x='age', y='fare', hue='who', data=ti, fit_reg=False); ``` ![](https://img.kancloud.cn/8b/1b/8b1bf95db61abcfe76f0fb87a051c97d_423x340.png) 我們可以看到，該圖需要標題和更好的`x`和`y`軸標簽。另外，票價最高的兩個人幸存了下來，所以我們可以在我們的繪圖中注明他們。 ```py sns.lmplot(x='age', y='fare', hue='who', data=ti, fit_reg=False) plt.title('Fare Paid vs. Age of Passenger, Colored by Passenger Type') plt.xlabel('Age of Passenger') plt.ylabel('Fare in USD') plt.annotate('Both survived', xy=(35, 500), xytext=(35, 420), arrowprops=dict(facecolor='black', shrink=0.05)); ``` ![](https://img.kancloud.cn/14/c6/14c6e8e5a43c6b1419e894cbe88eabf5_449x357.png) 在實踐中，我們使用`seaborn`快速探索數據，然后一旦我們決定在論文或展示中使用這些圖，轉向`matplotlib`以便調優。 ## 可視化的原則現在我們有了創建和更改繪圖的工具，現在我們轉向數據可視化的關鍵原則。與數據科學的其他部分非常相似，很難準確地用一個數字來衡量特定可視化的效率。盡管如此，還是有一些通用原則，可以使可視化更有效地顯示數據趨勢。我們討論了六類原則：尺度，條件，感知，轉換，上下文和平滑。 ### 尺度原則尺度原則與涉及用于繪制數據的`x`和`y`軸的選擇。在 2015 年的美國國會聽證會上，Chaffetz 代表討論了計劃生育項目的調查。他展示了以下繪圖，最初出現在美國生命聯合會的一篇報告中。它比較了墮胎和癌癥篩查程序的數量，兩者都是由計劃生育提供的。（完整報告可在 <https://oversight.house.gov/interactivepage/plannedparenthood> 查閱。）這個繪圖有什么疑點？它繪制了多少個數據點？ ![](https://img.kancloud.cn/f6/4a/f64aff560d02425b1ede8b6c15331a5e_400x336.png) 這個繪圖違反了尺度原則；它沒有很好選擇其`x`和`y`軸。當我們為我們的繪圖選擇`x`軸和`y`軸時，我們應該在整個軸上保持一致的尺度。然而，上圖用于墮胎和癌癥篩查的線條具有不同的尺度 - 墮胎的線條的起點和癌癥篩查線條的終點在`y`軸上彼此接近，但代表了大不相同的數字。此外，僅繪制了 2006 年和 2013 年的點數，但`x`軸包含之間的每年的不必要的刻度線。為了改善這個繪圖，我們應該在相同的`y`軸尺度上重新繪制相同的點： ```py # HIDDEN pp = pd.read_csv("data/plannedparenthood.csv") plt.plot(pp['year'], pp['screening'], linestyle="solid", marker="o", label='Cancer') plt.plot(pp['year'], pp['abortion'], linestyle="solid", marker="o", label='Abortion') plt.title('Planned Parenthood Procedures') plt.xlabel("Year") plt.ylabel("Service") plt.xticks([2006, 2013]) plt.legend(); ``` ![](https://img.kancloud.cn/fa/5d/fa5d81fd6c998536acb0ec721b5bf5a3_426x287.png) 我們可以看到，與癌癥篩查數量的大幅下降相比，墮胎數量的變化非常小。我們可能會對數量變化的百分比感興趣，而不是過程的數量。 ```py # HIDDEN percent_change = pd.DataFrame({ 'percent_change': [ pp['screening'].iloc[1] / pp['screening'].iloc[0] - 1, pp['abortion'].iloc[1] / pp['abortion'].iloc[0] - 1, ], 'procedure': ['cancer', 'abortion'], 'type': ['percent_change', 'percent_change'], }) ax = sns.barplot(x='procedure', y='percent_change', data=percent_change) plt.title('Percent Change in Number of Procedures') plt.xlabel('') plt.ylabel('Percent Change') plt.ylim(-0.6, 0.6) plt.axhline(y=0, c='black'); ``` ![](https://img.kancloud.cn/93/8c/938cb88828512ccf023ca2b72fc8b193_401x269.png) 選擇`x`軸和`y`軸的限制時，我們傾向于重點關注帶有大部分數據的區域，特別是在處理長尾數據時。考慮以下繪圖，它的放大版本在右側： ![](https://img.kancloud.cn/76/11/761167c93e470925c47377e38b8b2577_700x279.png) 右圖更有助于理解數據集。如果需要，我們可以對數據的不同區域繪制多個圖，來顯示整個數據范圍。在本節的后面，我們討論數據轉換，這也有助于可視化長尾數據。 ### 條件原則條件原則為我們提供了技巧，來展示我們數據的子組之間的分布和關系。美國勞工統計局負責監督與美國經濟健康有關的科學調查。他們的網站包含一個工具，使用這些數據生成報告。數據用于生成這個圖表，它比較了每周收入的中位數，按性別分組。使用這種繪圖，最容易實現哪些比較？它們是最有趣或最重要的比較嘛？ ![](https://img.kancloud.cn/4b/e5/4be5818912226d3160d93e7397e3924c_600x504.png) 這個繪圖讓我們一眼就看到，隨著教育水平提高，每周收入會增加。然而，很難準確地確定，對于每種教育水平收入增加的程度，在相同的教育水平上，比較男性和女性的每周收入甚至更加困難。我們可以通過使用點圖而不是條形圖來發現這兩種趨勢。 ```py # HIDDEN cps = pd.read_csv("data/edInc2.csv") ax = sns.pointplot(x="educ", y="income", hue="gender", data=cps) ticks = ["<HS", "HS", "<BA", "BA", ">BA"] ax.set_xticklabels(ticks) ax.set_xlabel("Education") ax.set_ylabel("Income") ax.set_title("2014 Median Weekly Earnings\nFull-Time Workers over 25 years old"); ``` ![](https://img.kancloud.cn/44/b8/44b8b25ab28bdf578ef8c066611dd292_405x304.png) 連接點的線條更清晰地表明，本科學位的每周收入相對較大。將男性和女性的數值直接放在一起，可以更容易地看出，隨著教育水平的提高，男性和女性之間的工資差距趨于增加。為了有助于比較數據中的兩個子組，請沿`x`或`y`軸對齊標記，并為不同的子組使用不同的顏色或標記。線條比條形更傾向于顯示數據的趨勢，對于序數和數值數據都是有用的選擇。 ### 感知原則人類感知具有特定屬性，在可視化設計中考慮它們十分重要。人類感知的第一個重要屬性是，比起其它顏色，我們更強烈地感知某些顏色，尤其是綠色。此外我們認為，較淺的陰影區域比較深的陰影區域大。例如，在我們剛剛討論的每周收入圖中，較淺的條形似乎比較深的條形吸引更多注意力： ![](https://img.kancloud.cn/4b/e5/4be5818912226d3160d93e7397e3924c_600x504.png) 實際上，你應該確保圖表的調色板在感知上是統一的。這意味著，例如，條形圖中條形之間的顏色的感知強度不會發生變化。對于定量數據，你有兩種選擇：如果你的數據從低到高排列，并且你想強調較大的值，則使用順序配色方案，將較淺的顏色分配給較大的值。如果應該強調低值和高值，則使用分色配色方案，將更濃的顏色分配給更接近中心的值。 `seaborn `內置了許多有用的調色板。你可以瀏覽其文檔來了解如何在調色板之間切換：<http://seaborn.pydata.org/tutorial/color_palettes.html> 人類感知的第二個重要屬性是，當我們比較長度時，我們通常更準確，當比較面積時我們更不準確。考慮下面的非洲國家的 GDP 圖。 ![](https://img.kancloud.cn/d4/d1/d4d1fe7231c67b5619d22c59a8b744e4_600x504.png) 按數值計算，南非的 GDP 是阿爾及利亞的兩倍，但從上面的繪圖來看并不容易分辨。相反，我們可以在點圖上繪制 GDP： ![](https://img.kancloud.cn/d9/23/d923d3dbda2b62483401fc70cf867dff_600x504.png) 這更清晰，因為它使我們能夠比較長度而不是面積。出于相同的原因，餅圖和三維圖很難解釋；我們傾向于在實踐中避免使用這些圖表。人類感知的第三個也是最后一個屬性是，人眼在改變基線方面存在困難。考慮以下的層疊面積圖，繪制隨時間變化的二氧化碳排放量，按國家分組。 ![](https://img.kancloud.cn/3b/69/3b6902d254090e2c97da86f6c398683a_600x504.png) 觀察英國的排放量隨時間增加或減少是很困難的，由于基線擺動問題：該面積的基線上下擺動。當兩個高度相似時（例如 2000 年），英國的排放量是否大于中國的排放量，也難以比較。層疊條形圖中出現類似的基線擺動問題。在下面的圖中，很難比較德國和墨西哥的 15-64 歲的人數。 ![](https://img.kancloud.cn/73/cc/73cc1791bf499f05af403c82b15739b6_640x360.png) 我們通常可以通過切換為折線圖，來改善層疊面積或條形圖。以下是隨時間變化的排放量數據，繪制成線條而不是面積： ```py # HIDDEN co2 = pd.read_csv("data/CAITcountryCO2.csv", skiprows = 2, names = ["Country", "Year", "CO2"]) last_year = co2.Year.iloc[-1] q = f"Country != 'World' and Country != 'European Union (15)' and Year == {last_year}" top14_lasty = co2.query(q).sort_values('CO2', ascending=False).iloc[:14] top14 = co2[co2.Country.isin(top14_lasty.Country) & (co2.Year >= 1950)] from cycler import cycler linestyles = (['-', '--', ':', '-.']*3)[:7] colors = sns.color_palette('colorblind')[:4] lines_c = cycler('linestyle', linestyles) color_c = cycler('color', colors) fig, ax = plt.subplots(figsize=(9, 9)) ax.set_prop_cycle(lines_c * color_c) x, y ='Year', 'CO2' for name, df in top14.groupby('Country'): ax.semilogy(df[x], df[y], label=name) ax.set_xlabel(x) ax.set_ylabel(y + "Emissions [Million Tons]") ax.legend(ncol=2, frameon=True); ``` ![](https://img.kancloud.cn/d3/ec/d3ecd4e69b881a478f3a52f8b44d5d1f_565x542.png) 這個繪圖并不會改動基線，因此比較國家間的排放量更容易。我們也可以更清楚地看到，哪些國家的排放量增加最多。 ### 轉換原則數據轉換的原則為我們提供了實用方法，來轉換可視化數據，以便更有效地揭示趨勢。我們通常應用數據轉換來揭示偏斜數據中的模式，和變量之間的非線性關系。下圖顯示了泰坦尼克號上每位乘客的票價分布。正如你所看到的，分布是左偏的。 ```py # HIDDEN ti = sns.load_dataset('titanic') sns.distplot(ti['fare']) plt.title('Fares for Titanic Passengers') plt.xlabel('Fare in USD') plt.ylabel('Density'); ``` ![](https://img.kancloud.cn/85/1c/851c9fc62a6f17feaf9392a076e51530_401x287.png) 雖然此直方圖顯示所有票價，但由于票價聚集在直方圖的左側，因此很難在數據中看到詳細模式。為了解決這個問題，我們可以在繪制票價之前對票價取自然對數： ```py # HIDDEN sns.distplot(np.log(ti.loc[ti['fare'] > 0, 'fare']), bins=25) plt.title('log(Fares) for Titanic Passengers') plt.xlabel('log(Fare) in USD') plt.ylabel('Density'); ``` ![](https://img.kancloud.cn/e1/b6/e1b6472df608cabcdb8c56ccc6c324b8_401x287.png) 我們可以從對數數據圖中看到，票價分布的眾數大致為 $e^2 = 7.40$ 美元，較小的眾數為大約 $e^{3.4} = 30.00$ 美元。為什么繪制數據的自然對數有助于避免偏斜？較大數的對數趨于接近較小數的對數： | 值 | 對數值 | | --- | --- | | 1 | 0.00 | | 10 | 2.30 | | 50 | 3.91 | | 100 | 4.60 | | 500 | 6.21 | | 1000 | 6.90 |