Gradient Descent · UCB DS100 Principles and Techniques of Data Science

# Gradient Descent

> Original: [https://www.bookbookmark.ds100.org/ch/11/gradient_descence_define.html](https://www.bookbookmark.ds100.org/ch/11/gradient_descence_define.html)

```
# HIDDEN
# Clear previously defined variables
%reset -f

# Set directory for data loading to work properly
import os
os.chdir(os.path.expanduser('~/notebooks/11'))
```

```
# HIDDEN
import warnings
# Ignore numpy dtype warnings. These warnings are caused by an interaction
# between numpy and Cython and can be safely ignored.
# Reference: https://stackoverflow.com/a/40846742
warnings.filterwarnings("ignore", message="numpy.dtype size changed")
warnings.filterwarnings("ignore", message="numpy.ufunc size changed")

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
%matplotlib inline
import ipywidgets as widgets
from ipywidgets import interact, interactive, fixed, interact_manual
import nbinteract as nbi

sns.set()
sns.set_context('talk')
np.set_printoptions(threshold=20, precision=2, suppress=True)
pd.options.display.max_rows = 7
pd.options.display.max_columns = 8
# Use the full option name: the bare 'precision' key is ambiguous
# in recent pandas versions.
pd.set_option('display.precision', 2)
# This option stops scientific notation for pandas
# pd.set_option('display.float_format', '{:.2f}'.format)
```

```
# HIDDEN
tips = sns.load_dataset('tips')
tips['pcttip'] = tips['tip'] / tips['total_bill'] * 100
```

```
# HIDDEN
def mse(theta, y_vals):
    return np.mean((y_vals - theta) ** 2)

def grad_mse(theta, y_vals):
    return -2 * np.mean(y_vals - theta)

def plot_loss(y_vals, xlim, loss_fn):
    # Plot the loss curve over a grid of theta values
    thetas = np.arange(xlim[0], xlim[1] + 0.01, 0.05)
    losses = [loss_fn(theta, y_vals) for theta in thetas]

    plt.figure(figsize=(5, 3))
    plt.plot(thetas, losses, zorder=1)
    plt.xlim(*xlim)
    plt.title(loss_fn.__name__)
    plt.xlabel(r'$ \theta $')
    plt.ylabel('Loss')

def plot_theta_on_loss(y_vals, theta, loss_fn, **kwargs):
    # Mark a single theta value on the loss curve
    loss = loss_fn(theta, y_vals)
    default_args = dict(label=r'$ \theta $', zorder=2,
                        s=200, c=sns.xkcd_rgb['green'])
    plt.scatter([theta], [loss], **{**default_args, **kwargs})

def plot_tangent_on_loss(y_vals, theta, loss_fn, eps=1e-6):
    # Estimate the slope with a central difference, then draw the tangent
    slope = ((loss_fn(theta + eps, y_vals) - loss_fn(theta - eps, y_vals))
             / (2 * eps))
    xs = np.arange(theta - 1, theta + 1, 0.05)
    ys = loss_fn(theta, y_vals) + slope * (xs - theta)
    plt.plot(xs, ys, zorder=3, c=sns.xkcd_rgb['green'], linestyle='--')
```

We are interested in creating a function that can minimize a loss function without forcing the user to decide in advance which values of $\theta$ to try. In other words, while the `simple_minimize` function has the following signature:

```
simple_minimize(loss_fn, dataset, thetas)
```

we would like a function with the following signature:

```
minimize(loss_fn, dataset)
```

This function needs to find the minimizing value of $\theta$ automatically, no matter how small or large it is. We will use a technique called gradient descent to implement this new `minimize` function.

### Intuition

As with loss functions, we will first discuss the intuition behind gradient descent and then formalize our understanding with mathematics.

Since the `minimize` function is not given values of $\theta$ to try, we start by picking a $\theta$ anywhere we like. Then we can iteratively improve our estimate of $\theta$. To improve an estimate of $\theta$, we look at the slope of the loss function at that choice of $\theta$.

For example, suppose we use the MSE on the simple dataset $\textbf{y} = [12.1, 12.8, 14.9, 16.3, 17.2]$ and our current choice of $\theta$ is 12.

```
# HIDDEN
pts = np.array([12.1, 12.8, 14.9, 16.3, 17.2])
plot_loss(pts, (11, 18), mse)
plot_theta_on_loss(pts, 12, mse)
```

![](https://img.kancloud.cn/88/55/8855d293936eeafe358572524b842815_355x246.jpg)

We would like to choose a new value of $\theta$ that decreases the loss. To do this, we look at the slope of the loss function at $\theta = 12$:

```
# HIDDEN
pts = np.array([12.1, 12.8, 14.9, 16.3, 17.2])
plot_loss(pts, (11, 18), mse)
plot_tangent_on_loss(pts, 12, mse)
```

![](https://img.kancloud.cn/73/6e/736e39ce0679f984501767f33e2c1692_355x246.jpg)

The slope is negative, which means that increasing $\theta$ will decrease the loss. If $\theta = 16.5$, on the other hand, the slope of the loss function is positive:

```
# HIDDEN
pts = np.array([12.1, 12.8, 14.9, 16.3, 17.2])
plot_loss(pts, (11, 18), mse)
plot_tangent_on_loss(pts, 16.5, mse)
```

![](https://img.kancloud.cn/a7/09/a7099d8803b196e403225cac1133e0aa_355x246.jpg)

When the slope is positive, decreasing $\theta$ will decrease the loss.

The slope of the tangent line tells us which direction to move $\theta$ in order to decrease the loss. If the slope is negative, we want $\theta$ to move in the positive direction. If the slope is positive, $\theta$ should move in the negative direction. Mathematically, we write:

$$
\theta^{(t+1)} = \theta^{(t)} - \frac{\partial}{\partial \theta} L(\theta^{(t)}, \textbf{y})
$$

where $\theta^{(t)}$ is the current estimate and $\theta^{(t+1)}$ is the next estimate.

For the MSE, we have:

$$
\begin{aligned}
L(\theta, \textbf{y}) &= \frac{1}{n} \sum_{i = 1}^{n}(y_i - \theta)^2 \\
\frac{\partial}{\partial \theta} L(\theta, \textbf{y}) &= \frac{1}{n} \sum_{i = 1}^{n} -2(y_i - \theta) \\
&= -\frac{2}{n} \sum_{i = 1}^{n} (y_i - \theta) \\
\end{aligned}
$$

When $\theta^{(t)} = 12$, we can compute $-\frac{2}{n} \sum_{i = 1}^{n} (y_i - \theta) = -5.32$. Thus, $\theta^{(t+1)} = 12 - (-5.32) = 17.32$.

We have plotted the old value of $\theta$ as a green outlined circle and the new value as a filled circle on the loss curve below.

```
# HIDDEN
pts = np.array([12.1, 12.8, 14.9, 16.3, 17.2])
plot_loss(pts, (11, 18), mse)
plot_theta_on_loss(pts, 12, mse, c='none',
                   edgecolor=sns.xkcd_rgb['green'], linewidth=2)
plot_theta_on_loss(pts, 17.32, mse)
```

![](https://img.kancloud.cn/16/f1/16f11378c5e11b488a4f1284d10dbdb4_355x246.jpg)

Although $\theta$ moved in the right direction, it overshot the minimum: the new value 17.32 is just as far from the minimum at 14.66 as the starting value 12 was. We can remedy this by multiplying the slope by a small constant before subtracting it from $\theta$. Our final update formula is:

$$
\theta^{(t+1)} = \theta^{(t)} - \alpha \cdot \frac{\partial}{\partial \theta} L(\theta^{(t)}, \textbf{y})
$$

where $\alpha$ is a small constant. For example, if we set $\alpha = 0.3$, this is the new $\theta^{(t+1)}$:

```
# HIDDEN
def plot_one_gd_iter(y_vals, theta, loss_fn, grad_loss, alpha=0.3):
    # Take one gradient descent step and plot the old and new theta
    new_theta = theta - alpha * grad_loss(theta, y_vals)
    plot_loss(y_vals, (11, 18), loss_fn)
    plot_theta_on_loss(y_vals, theta, loss_fn, c='none',
                       edgecolor=sns.xkcd_rgb['green'], linewidth=2)
    plot_theta_on_loss(y_vals, new_theta, loss_fn)
    print(f'old theta: {theta}')
    print(f'new theta: {new_theta}')
```

```
# HIDDEN
plot_one_gd_iter(pts, 12, mse, grad_mse)
```

```
old theta: 12
new theta: 13.596
```

![](https://img.kancloud.cn/45/26/4526883471c3031f5633f84f12690bed_355x246.jpg)

Here are the $\theta$ values for successive iterations of this process. Notice that $\theta$ changes more slowly as it approaches the minimum loss, because the slope is also smaller there.

```
# HIDDEN
plot_one_gd_iter(pts, 13.60, mse, grad_mse)
```

```
old theta: 13.6
new theta: 14.236
```

![](https://img.kancloud.cn/7f/72/7f7277692df60f71c835aa0890e47eca_355x246.jpg)

```
# HIDDEN
plot_one_gd_iter(pts, 14.24, mse, grad_mse)
```

```
old theta: 14.24
new theta: 14.492
```

![](https://img.kancloud.cn/57/45/5745727916bc6795daab99b228bd4e91_355x246.jpg)

```
# HIDDEN
plot_one_gd_iter(pts, 14.49, mse, grad_mse)
```

```
old theta: 14.49
new theta: 14.592
```

![](https://img.kancloud.cn/3a/74/3a7452dc16def8ad38c724e5d5891006_355x246.jpg)

### Gradient Descent Analysis

We now have the full gradient descent algorithm:

1. Choose a starting value of $\theta$ (0 is a common choice).
2. Compute $\theta - \alpha \cdot \frac{\partial}{\partial \theta} L(\theta, \textbf{y})$ and store it as the new value of $\theta$.
3. Repeat until $\theta$ stops changing between iterations (in practice, until the change falls below a small tolerance).

You will more commonly see the gradient $\nabla_\theta$ in place of the partial derivative $\frac{\partial}{\partial \theta}$. The two notations are essentially equivalent, but since the gradient notation is more common, we will use it in the gradient update formula from now on:

$$
\theta^{(t+1)} = \theta^{(t)} - \alpha \cdot \nabla_\theta L(\theta^{(t)}, \textbf{y})
$$

To review the notation:

* $\theta^{(t)}$ is the current estimate of $\theta^*$ at the $t$th iteration.
* $\theta^{(t+1)}$ is the next choice of $\theta$.
* $\alpha$ is called the learning rate and is usually set to a small constant. Sometimes it is useful to start with a larger $\alpha$ and decrease it over time. If $\alpha$ changes between iterations, we write $\alpha^t$ to mark that $\alpha$ varies with time $t$.
* $\nabla_\theta L(\theta^{(t)}, \textbf{y})$ is the partial derivative / gradient of the loss function with respect to $\theta$ at time $t$.

You can now see the importance of choosing a differentiable loss function: $\nabla_\theta L(\theta, \textbf{y})$ is a crucial part of the gradient descent algorithm. (Although it is possible to estimate the gradient by computing the difference in loss between two slightly different values of $\theta$ and dividing by the distance between them, this typically increases the running time of gradient descent so much that it becomes impractical.)

The gradient descent algorithm is simple yet powerful, since we can use it with many types of models and many types of loss functions. It is the computational tool of choice for fitting many important models, including linear regression on large datasets and neural networks.

### Defining the `minimize` Function

Now we return to our original task: defining the `minimize` function. We will have to change the function signature slightly, since we now also need the gradient of the loss function.

```
def minimize(loss_fn, grad_loss_fn, dataset, alpha=0.2, progress=True):
    '''
    Uses gradient descent to minimize loss_fn. Returns the minimizing value of
    theta_hat once theta_hat changes less than 0.001 between iterations.
    '''
    theta = 0
    while True:
        if progress:
            print(f'theta: {theta:.2f} | loss: {loss_fn(theta, dataset):.2f}')
        gradient = grad_loss_fn(theta, dataset)
        new_theta = theta - alpha * gradient

        # Stop once the update is smaller than the tolerance
        if abs(new_theta - theta) < 0.001:
            return new_theta

        theta = new_theta
```

Then we can define functions that compute the MSE and its gradient:

```
def mse(theta, y_vals):
    return np.mean((y_vals - theta) ** 2)

def grad_mse(theta, y_vals):
    return -2 * np.mean(y_vals - theta)
```

Finally, we can use the `minimize` function to compute the minimizing value for $\textbf{y} = [12.1, 12.8, 14.9, 16.3, 17.2]$.

```
%%time
theta = minimize(mse, grad_mse, np.array([12.1, 12.8, 14.9, 16.3, 17.2]))
print(f'Minimizing theta: {theta}')
print()
```

```
theta: 0.00 | loss: 218.76
theta: 5.86 | loss: 81.21
theta: 9.38 | loss: 31.70
theta: 11.49 | loss: 13.87
theta: 12.76 | loss: 7.45
theta: 13.52 | loss: 5.14
theta: 13.98 | loss: 4.31
theta: 14.25 | loss: 4.01
theta: 14.41 | loss: 3.90
theta: 14.51 | loss: 3.86
theta: 14.57 | loss: 3.85
theta: 14.61 | loss: 3.85
theta: 14.63 | loss: 3.84
theta: 14.64 | loss: 3.84
theta: 14.65 | loss: 3.84
theta: 14.65 | loss: 3.84
theta: 14.66 | loss: 3.84
theta: 14.66 | loss: 3.84
Minimizing theta: 14.658511131035242

CPU times: user 7.88 ms, sys: 3.58 ms, total: 11.5 ms
Wall time: 8.54 ms
```

We can see that gradient descent quickly arrives at essentially the same solution as the analytic method (the two agree to within the stopping tolerance):

```
np.mean([12.1, 12.8, 14.9, 16.3, 17.2])
```

```
14.66
```

### Minimizing the Huber Loss

Now we can apply gradient descent to minimize the Huber loss on our dataset of tip percentages.

The Huber loss is:

$$
L_\delta(\theta, \textbf{y}) = \frac{1}{n} \sum_{i=1}^n \begin{cases}
    \frac{1}{2}(y_i - \theta)^2 & | y_i - \theta | \le \delta \\
    \delta \left( | y_i - \theta | - \frac{1}{2} \delta \right) & \text{otherwise}
\end{cases}
$$

The gradient of the Huber loss is:

$$
\nabla_\theta L_\delta(\theta, \textbf{y}) = \frac{1}{n} \sum_{i=1}^n \begin{cases}
    -(y_i - \theta) & | y_i - \theta | \le \delta \\
    -\delta \cdot \text{sign}(y_i - \theta) & \text{otherwise}
\end{cases}
$$

(Note that in previous definitions of the Huber loss we used the variable $\alpha$ to denote the transition point. To avoid confusion with the $\alpha$ used in gradient descent, we replace the transition point parameter of the Huber loss with $\delta$.)

```
def huber_loss(theta, dataset, delta = 1):
    d = np.abs(theta - dataset)
    return np.mean(
        np.where(d <= delta,
                 (theta - dataset)**2 / 2.0,
                 delta * (d - delta / 2.0))
    )

def grad_huber_loss(theta, dataset, delta = 1):
    d = np.abs(theta - dataset)
    return np.mean(
        np.where(d <= delta,
                 -(dataset - theta),
                 -delta * np.sign(dataset - theta))
    )
```

Let's minimize the Huber loss on the tips dataset:

```
%%time
theta = minimize(huber_loss, grad_huber_loss, tips['pcttip'], progress=False)
print(f'Minimizing theta: {theta}')
print()
```

```
Minimizing theta: 15.506849531471964

CPU times: user 194 ms, sys: 4.13 ms, total: 198 ms
Wall time: 208 ms
```

### Summary

Gradient descent gives us a general way to minimize a loss function when we cannot solve for the minimizing value of $\theta$ analytically. As our models and loss functions grow more complex, we will turn to gradient descent as our tool of choice for fitting models.
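As a sanity check on the gradient formula used in this section, the sketch below compares the analytic MSE gradient at $\theta = 12$ with a central-difference estimate, then takes one update step with $\alpha = 0.3$. The helper name `numerical_grad` and its `eps` default are our own illustrative choices and are not part of the original text; the step size mirrors the one used when drawing tangent lines above.

```python
import numpy as np

def mse(theta, y_vals):
    return np.mean((y_vals - theta) ** 2)

def grad_mse(theta, y_vals):
    return -2 * np.mean(y_vals - theta)

def numerical_grad(loss_fn, theta, y_vals, eps=1e-6):
    # Hypothetical helper (an assumption for illustration): estimates the
    # slope with a central difference, costing two loss evaluations per call.
    return (loss_fn(theta + eps, y_vals) - loss_fn(theta - eps, y_vals)) / (2 * eps)

pts = np.array([12.1, 12.8, 14.9, 16.3, 17.2])

analytic = grad_mse(12, pts)             # slope from the closed-form derivative
estimate = numerical_grad(mse, 12, pts)  # slope from finite differences
new_theta = 12 - 0.3 * analytic          # one gradient descent step, alpha = 0.3

print(f'analytic gradient:  {analytic:.4f}')
print(f'numerical gradient: {estimate:.4f}')
print(f'new theta:          {new_theta:.3f}')
```

Both gradients should come out at about $-5.32$, and the step should land at about 13.596, matching the values worked out earlier. For a single parameter the numerical estimate is cheap, but it needs two loss evaluations per parameter per step, which is why the analytic gradient is preferred as models grow.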