Visualizing linear relationships · seaborn 0.9 Chinese documentation

# Visualizing linear relationships

> Translator: [cancan233](https://github.com/cancan233)

Many datasets contain multiple quantitative variables, and the goal of an analysis is often to relate those variables to each other. We [previously discussed](#/docs/5) functions that can accomplish this by showing how two variables relate. It can be very helpful, though, to use statistical models to estimate a simple relationship between two noisy sets of observations. The functions discussed in this chapter do so through the common framework of linear regression.

In the spirit of Tukey, the regression plots in seaborn are primarily intended to add a visual guide that helps to emphasize patterns in a dataset during exploratory data analysis. In other words, seaborn is not itself a package for statistical analysis. To obtain quantitative measures related to the fit of regression models, you should use [statsmodels](https://www.statsmodels.org/). The goal of seaborn, however, is to make exploring a dataset through visualization quick and easy, as doing so is just as important as (if not more important than) exploring a dataset through tables of statistics.

```python
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
```

```python
sns.set(color_codes=True)
```

```python
tips = sns.load_dataset("tips")
```

## Functions to draw linear regression models

Two main functions in seaborn are used to visualize a linear relationship as determined through regression. These functions, [`regplot()`](../generated/seaborn.regplot.html#seaborn.regplot "seaborn.regplot") and [`lmplot()`](../generated/seaborn.lmplot.html#seaborn.lmplot "seaborn.lmplot"), are closely related and share much of their core functionality. It is important to understand how they differ, however, so that you can quickly choose the right tool for a particular job.

In the simplest invocation, both functions draw a scatterplot of two variables, `x` and `y`, then fit the regression model `y ~ x` and plot the resulting regression line along with a 95% confidence interval for that regression:

```python
sns.regplot(x="total_bill", y="tip", data=tips);
```

![http://seaborn.pydata.org/_images/regression_7_0.png](https://img.kancloud.cn/48/a2/48a2515949213bfbe4da13817a0e7fab_390x271.jpg)

```python
sns.lmplot(x="total_bill", y="tip", data=tips);
```

![http://seaborn.pydata.org/_images/regression_8_0.png](https://img.kancloud.cn/92/db/92db9b67f4d806e2106598b5356a4386_352x352.jpg)

Note that the two resulting plots are identical except for the shape of the figure. We will explain why shortly. For now, the other main difference to know about is that [`regplot()`](../generated/seaborn.regplot.html#seaborn.regplot "seaborn.regplot") accepts the `x` and `y` variables in a variety of formats, including simple numpy arrays, pandas `Series` objects, or references to variables in a pandas `DataFrame` object passed to `data`. In contrast, [`lmplot()`](../generated/seaborn.lmplot.html#seaborn.lmplot "seaborn.lmplot") has `data` as a required parameter, and the `x` and `y` variables must be specified as strings. This data format is called "long-form" or ["tidy"](https://vita.had.co.nz/papers/tidy-data.pdf) data. Other than this input flexibility, [`regplot()`](../generated/seaborn.regplot.html#seaborn.regplot "seaborn.regplot") offers a subset of the features of [`lmplot()`](../generated/seaborn.lmplot.html#seaborn.lmplot "seaborn.lmplot"), so we will demonstrate them using the latter.

It is possible to fit a linear regression when one of the variables takes discrete values. However, the simple scatterplot produced by this kind of dataset is often not optimal:

```python
sns.lmplot(x="size", y="tip", data=tips);
```

![http://seaborn.pydata.org/_images/regression_10_0.png](https://img.kancloud.cn/52/22/5222d012642f4645db315c26014ebac3_352x352.jpg)

One option is to add some random noise ("jitter") to the discrete values to make their distribution clearer. Note that jitter is applied only to the scatterplot data and does not influence the fit of the regression line itself:

```python
sns.lmplot(x="size", y="tip", data=tips, x_jitter=.05);
```

![http://seaborn.pydata.org/_images/regression_12_0.png](https://img.kancloud.cn/e5/f1/e5f1851bc7f8c940a5fffced3d928616_352x352.jpg)

A second option is to collapse over the observations in each discrete bin and plot an estimate of central tendency along with a confidence interval:

```python
sns.lmplot(x="size", y="tip", data=tips, x_estimator=np.mean);
```

![http://seaborn.pydata.org/_images/regression_14_0.png](https://img.kancloud.cn/78/4d/784d74bca4eaf00da2fdbc2550547e7e_352x352.jpg)

## Fitting different kinds of models

The simple linear regression model used above is very easy to fit, but it is not appropriate for some kinds of datasets. The [Anscombe's quartet](https://en.wikipedia.org/wiki/Anscombe%27s_quartet) dataset shows a few examples where simple linear regression gives an identical estimate of a relationship even though simple visual inspection clearly shows differences. For example, in the first case, linear regression is a good model:

```python
anscombe = sns.load_dataset("anscombe")
```

```python
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'I'"),
           ci=None, scatter_kws={"s": 80});
```

![http://seaborn.pydata.org/_images/regression_17_0.png](https://img.kancloud.cn/6c/53/6c5316d8f7bbdee0a7baecd572804d26_352x352.jpg)

The linear relationship in the second dataset is the same, but the plot clearly shows that this is not a good model:

```python
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'II'"),
           ci=None, scatter_kws={"s": 80});
```

![http://seaborn.pydata.org/_images/regression_19_0.png](https://img.kancloud.cn/c3/67/c367004001b0c55984037b09f82c5b0d_352x352.jpg)

In the presence of this kind of higher-order relationship, [`regplot()`](../generated/seaborn.regplot.html#seaborn.regplot "seaborn.regplot") and [`lmplot()`](../generated/seaborn.lmplot.html#seaborn.lmplot "seaborn.lmplot") can fit a polynomial regression model to explore simple kinds of nonlinear trends in the dataset:

```python
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'II'"),
           order=2, ci=None, scatter_kws={"s": 80});
```

![http://seaborn.pydata.org/_images/regression_21_0.png](https://img.kancloud.cn/9e/83/9e833730073cc36b2d5b9d6c533d1165_352x352.jpg)

A different problem is posed by "outlier" observations, which deviate for some reason other than the main relationship under study:

```python
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'III'"),
           ci=None, scatter_kws={"s": 80});
```

![http://seaborn.pydata.org/_images/regression_23_0.png](https://img.kancloud.cn/82/10/8210d8bdd5bfbc09bf5d90c9c47c6e80_352x352.jpg)

In the presence of outliers, it can be useful to fit a robust regression, which uses a different loss function to downweight relatively large residuals:

```python
sns.lmplot(x="x", y="y", data=anscombe.query("dataset == 'III'"),
           robust=True, ci=None, scatter_kws={"s": 80});
```

![http://seaborn.pydata.org/_images/regression_25_0.png](https://img.kancloud.cn/52/55/5255764aac07a85e0a9928a9c8ebeef6_352x352.jpg)

When the `y` variable is binary, simple linear regression also "works" but provides implausible predictions:

```python
tips["big_tip"] = (tips.tip / tips.total_bill) > .15
sns.lmplot(x="total_bill", y="big_tip", data=tips,
           y_jitter=.03);
```

![http://seaborn.pydata.org/_images/regression_27_0.png](https://img.kancloud.cn/6d/4d/6d4d64519fe56f074879c00cb3b0b588_352x352.jpg)

The solution in this case is to fit a logistic regression, so that the regression line shows the estimated probability of `y = 1` for a given value of `x`:

```python
sns.lmplot(x="total_bill", y="big_tip", data=tips,
           logistic=True, y_jitter=.03);
```

![http://seaborn.pydata.org/_images/regression_29_0.png](https://img.kancloud.cn/22/18/2218fe1d03b77697eca8f59f76a4701f_352x352.jpg)

Note that the logistic regression estimate is considerably more computationally intensive than simple regression (this is true of robust regression as well). Because the confidence interval around the regression line is computed using a bootstrap procedure, you may wish to turn it off for faster iteration (using `ci=None`).

An altogether different approach is to fit a nonparametric regression using a [lowess smoother](https://en.wikipedia.org/wiki/Local_regression). This approach makes the fewest assumptions, but it is computationally intensive, so confidence intervals are currently not computed at all:

```python
sns.lmplot(x="total_bill", y="tip", data=tips,
           lowess=True);
```

![http://seaborn.pydata.org/_images/regression_31_0.png](https://img.kancloud.cn/fc/94/fc94522fd459462bc46e34811592cd6d_352x352.jpg)

The [`residplot()`](../generated/seaborn.residplot.html#seaborn.residplot "seaborn.residplot") function can be a useful tool for checking whether the simple regression model is appropriate for a dataset. It fits and removes a simple linear regression and then plots the residual value for each observation. Ideally, these values should be randomly scattered around `y = 0`:

```python
sns.residplot(x="x", y="y", data=anscombe.query("dataset == 'I'"),
              scatter_kws={"s": 80});
```

![http://seaborn.pydata.org/_images/regression_33_0.png](https://img.kancloud.cn/14/10/1410e4a57c1d1f839e78d0b0611f0b07_400x271.jpg)

If there is structure in the residuals, it suggests that simple linear regression is not appropriate:

```python
sns.residplot(x="x", y="y", data=anscombe.query("dataset == 'II'"),
              scatter_kws={"s": 80});
```

![http://seaborn.pydata.org/_images/regression_35_0.png](https://img.kancloud.cn/1b/15/1b15e23701ad2e338b56bb1e9bfcca7c_400x271.jpg)

## Conditioning on other variables

The plots above show many ways to explore the relationship between a pair of variables. Often, however, a more interesting question is "how does the relationship between these two variables change as a function of a third variable?" This is where the difference between [`regplot()`](../generated/seaborn.regplot.html#seaborn.regplot "seaborn.regplot") and [`lmplot()`](../generated/seaborn.lmplot.html#seaborn.lmplot "seaborn.lmplot") appears. While [`regplot()`](../generated/seaborn.regplot.html#seaborn.regplot "seaborn.regplot") always shows a single relationship, [`lmplot()`](../generated/seaborn.lmplot.html#seaborn.lmplot "seaborn.lmplot") combines [`regplot()`](../generated/seaborn.regplot.html#seaborn.regplot "seaborn.regplot") with [`FacetGrid`](../generated/seaborn.FacetGrid.html#seaborn.FacetGrid "seaborn.FacetGrid") to provide an easy interface for showing a linear regression on "faceted" plots, which let you explore interactions with up to three additional categorical variables.

The best way to separate out a relationship is to plot both levels on the same axes and use color to distinguish them:

```python
sns.lmplot(x="total_bill", y="tip", hue="smoker", data=tips);
```

![http://seaborn.pydata.org/_images/regression_37_0.png](https://img.kancloud.cn/77/b1/77b18e84354a2007076129b7471f7cd5_408x352.jpg)

In addition to color, you can use different scatterplot markers so that plots reproduce better in black and white. You also have full control over the colors used:

```python
sns.lmplot(x="total_bill", y="tip", hue="smoker", data=tips,
           markers=["o", "x"], palette="Set1");
```

![http://seaborn.pydata.org/_images/regression_39_0.png](https://img.kancloud.cn/6d/84/6d84522d8d33a8a4a31407d2c6e1db85_408x352.jpg)

To add another variable, you can draw multiple "facets", with each level of the variable appearing in the rows or columns of the grid:

```python
sns.lmplot(x="total_bill", y="tip", hue="smoker", col="time", data=tips);
```

![http://seaborn.pydata.org/_images/regression_41_0.png](https://img.kancloud.cn/0f/86/0f86cb74791d00d5f9e61a4f7a388321_772x352.jpg)

```python
sns.lmplot(x="total_bill", y="tip", hue="smoker",
           col="time", row="sex", data=tips);
```

![http://seaborn.pydata.org/_images/regression_42_0.png](https://img.kancloud.cn/f7/f5/f7f526432f96b3e66c51e1765fa2c2e6_772x712.jpg)

## Controlling the size and shape of the plot

Earlier we noted that the default plots made by [`regplot()`](../generated/seaborn.regplot.html#seaborn.regplot "seaborn.regplot") and [`lmplot()`](../generated/seaborn.lmplot.html#seaborn.lmplot "seaborn.lmplot") look the same but have different sizes and shapes. This is because [`regplot()`](../generated/seaborn.regplot.html#seaborn.regplot "seaborn.regplot") is an "axes-level" function that draws onto a specific axes. This means that you can make multi-panel figures yourself and control exactly where the regression plot goes. If no axes object is explicitly provided, it simply uses the "currently active" axes, which is why the default plot has the same size and shape as most other matplotlib functions. To control the size, you need to create a figure object yourself:

```python
f, ax = plt.subplots(figsize=(5, 6))
sns.regplot(x="total_bill", y="tip", data=tips, ax=ax);
```

![http://seaborn.pydata.org/_images/regression_44_0.png](https://img.kancloud.cn/7e/aa/7eaa5c99c23b0d0f82c703ebb7e477c5_335x380.jpg)

In contrast, the size and shape of the [`lmplot()`](../generated/seaborn.lmplot.html#seaborn.lmplot "seaborn.lmplot") figure are controlled through its `height` and `aspect` parameters, which apply to each facet in the plot, not to the overall figure itself:

```python
sns.lmplot(x="total_bill", y="tip", col="day", data=tips,
           col_wrap=2, height=3);
```

![http://seaborn.pydata.org/_images/regression_46_0.png](https://img.kancloud.cn/94/68/9468473b9a89743e2b2af12bb6da4d0b_424x424.jpg)

```python
sns.lmplot(x="total_bill", y="tip", col="day", data=tips,
           aspect=.5);
```

![http://seaborn.pydata.org/_images/regression_47_0.png](https://img.kancloud.cn/ef/46/ef4667f4ad87931972386ad2c490f2f5_712x352.jpg)

## Plotting a regression in other contexts

A few other seaborn functions use [`regplot()`](../generated/seaborn.regplot.html#seaborn.regplot "seaborn.regplot") in the context of a larger, more complex plot. The first is the [`jointplot()`](../generated/seaborn.jointplot.html#seaborn.jointplot "seaborn.jointplot") function that we introduced in the [distributions tutorial](distributions.html#distribution-tutorial). In addition to the plot styles discussed there, [`jointplot()`](../generated/seaborn.jointplot.html#seaborn.jointplot "seaborn.jointplot") can use [`regplot()`](../generated/seaborn.regplot.html#seaborn.regplot "seaborn.regplot") to show the linear regression fit on the joint axes when you pass `kind="reg"`:

```python
sns.jointplot(x="total_bill", y="tip", data=tips, kind="reg");
```

![http://seaborn.pydata.org/_images/regression_49_0.png](https://img.kancloud.cn/b7/87/b787f1e75150a9aa91dfe6564d2ef71e_421x424.jpg)

Using the [`pairplot()`](../generated/seaborn.pairplot.html#seaborn.pairplot "seaborn.pairplot") function with `kind="reg"` combines [`regplot()`](../generated/seaborn.regplot.html#seaborn.regplot "seaborn.regplot") and [`PairGrid`](../generated/seaborn.PairGrid.html#seaborn.PairGrid "seaborn.PairGrid") to show the linear relationships between variables in a dataset. Take care to note how this differs from [`lmplot()`](../generated/seaborn.lmplot.html#seaborn.lmplot "seaborn.lmplot"). In the figure below, the two axes do not show the same relationship conditioned on two levels of a third variable; rather, [`PairGrid`](../generated/seaborn.PairGrid.html#seaborn.PairGrid "seaborn.PairGrid") is used to show multiple relationships between different pairings of the variables in the dataset:

```python
sns.pairplot(tips, x_vars=["total_bill", "size"], y_vars=["tip"],
             height=5, aspect=.8, kind="reg");
```

![http://seaborn.pydata.org/_images/regression_51_0.png](https://img.kancloud.cn/ca/b3/cab342e0a913e7c8ccfd822ae3d14ee9_563x352.jpg)

Like [`lmplot()`](../generated/seaborn.lmplot.html#seaborn.lmplot "seaborn.lmplot"), but unlike [`jointplot()`](../generated/seaborn.jointplot.html#seaborn.jointplot "seaborn.jointplot"), conditioning on an additional categorical variable is built into [`pairplot()`](../generated/seaborn.pairplot.html#seaborn.pairplot "seaborn.pairplot") through the `hue` parameter:

```python
sns.pairplot(tips, x_vars=["total_bill", "size"], y_vars=["tip"],
             hue="smoker", height=5, aspect=.8, kind="reg");
```

![http://seaborn.pydata.org/_images/regression_53_0.png](https://img.kancloud.cn/9e/b9/9eb9d6c3524e00bb19f8994388fc62a0_624x352.jpg)
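
The regression line and 95% confidence band that appear throughout this chapter can be approximated by hand, which may help clarify what the bootstrap procedure does and why it is costly. The following is only a minimal sketch using NumPy, not seaborn's actual implementation. The function name `bootstrap_regression_band` and the synthetic data are assumptions made for illustration, and the interval is a simple percentile interval over refitted lines.

```python
import numpy as np


def bootstrap_regression_band(x, y, grid, n_boot=1000, ci=95, seed=0):
    """Fit y ~ x by least squares and bootstrap a confidence band on `grid`.

    Hypothetical helper for illustration; not part of seaborn.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    grid = np.asarray(grid, dtype=float)
    rng = np.random.default_rng(seed)

    # Point estimate: ordinary least squares line evaluated on the grid.
    # np.polyfit returns coefficients from highest degree to lowest.
    slope, intercept = np.polyfit(x, y, deg=1)
    line = intercept + slope * grid

    # Resample (x, y) pairs with replacement and refit the line each time.
    boot_lines = np.empty((n_boot, grid.size))
    for i in range(n_boot):
        idx = rng.integers(0, x.size, size=x.size)
        b_slope, b_intercept = np.polyfit(x[idx], y[idx], deg=1)
        boot_lines[i] = b_intercept + b_slope * grid

    # Percentile interval of the refitted lines at each grid point.
    tail = (100 - ci) / 2
    lower, upper = np.percentile(boot_lines, [tail, 100 - tail], axis=0)
    return line, lower, upper


# Synthetic data standing in for a noisy linear relationship.
rng = np.random.default_rng(42)
x = rng.uniform(0, 50, size=200)
y = 1.0 + 0.1 * x + rng.normal(scale=1.0, size=200)

grid = np.linspace(x.min(), x.max(), 100)
line, lower, upper = bootstrap_regression_band(x, y, grid)
```

Each bootstrap iteration refits the model, so a model that is slower to fit (such as a logistic or robust regression) makes the band proportionally slower to compute. That is the cost you avoid by passing `ci=None`.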