Computing with Mixed Distance Functions · TensorFlow Machine Learning Cookbook, Second Edition (Chinese translation)

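# Computing with Mixed Distance Functions

When dealing with data observations that have multiple features, we should be aware that the features may be measured on different scales. In this recipe, we account for that in order to improve our housing value predictions.

## Getting ready

It is important to extend the nearest-neighbor algorithm so that it takes differently scaled variables into account. In this example, we show how to scale the distance function for each variable. Specifically, we scale the distance function as a function of the feature variance.

The key to a weighted distance function is a weight matrix. Written with matrix operations, the distance function becomes the following formula:

![](https://img.kancloud.cn/a1/66/a166465e9c744df8645d918e24216abd_2670x370.png)

Here, `A` is a diagonal weight matrix that we use to scale the distance metric for each feature.

In this recipe, we try to improve our MSE on the Boston housing values dataset. This dataset is a good example of features that live on different scales, so the nearest-neighbor algorithm benefits from a scaled distance function.

## How to do it...

We proceed with the recipe as follows:

1. First, we load the necessary libraries and start a graph session:

```py
# Written against the TensorFlow 1.x graph API (tf.Session, tf.placeholder).
# Under TensorFlow 2.x these calls live in tf.compat.v1 and need eager
# execution to be disabled first.
import matplotlib.pyplot as plt
import numpy as np
import tensorflow as tf
import requests

sess = tf.Session()
```

2. Next, we load the data and store it in a NumPy array. Note again that we only use certain columns for prediction. We do not use the id, nor the variables that have very low variance:

```py
housing_url = 'https://archive.ics.uci.edu/ml/machine-learning-databases/housing/housing.data'
housing_header = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE', 'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
cols_used = ['CRIM', 'INDUS', 'NOX', 'RM', 'AGE', 'DIS', 'TAX', 'PTRATIO', 'B', 'LSTAT']
num_features = len(cols_used)
housing_file = requests.get(housing_url)
# Parse the whitespace-separated text into rows of floats
housing_data = [[float(x) for x in y.split(' ') if len(x)>=1] for y in housing_file.text.split('\n') if len(y)>=1]
# Column 13 (MEDV) is the target; keep it as a column vector
y_vals = np.transpose([np.array([y[13] for y in housing_data])])
x_vals = np.array([[x for i,x in enumerate(y) if housing_header[i] in cols_used] for y in housing_data])
```

3. Now, we scale the `x` values to lie between 0 and 1 with min-max scaling:

```py
# np.ptp (max - min per column) is used as a function because the
# ndarray.ptp method is not available in NumPy 2.x
x_vals = (x_vals - x_vals.min(0)) / np.ptp(x_vals, axis=0)
```

4. Then, we create the diagonal weight matrix, which scales the distance metric by the standard deviation of each feature:

```py
weight_diagonal = x_vals.std(0)
weight_matrix = tf.cast(tf.diag(weight_diagonal), dtype=tf.float32)
```

5. Now, we split the data into training and test sets. We also declare `k`, the number of nearest neighbors, and set the batch size equal to the size of the test set:

```py
train_indices = np.random.choice(len(x_vals), round(len(x_vals)*0.8), replace=False)
test_indices = np.array(list(set(range(len(x_vals))) - set(train_indices)))
x_vals_train = x_vals[train_indices]
x_vals_test = x_vals[test_indices]
y_vals_train = y_vals[train_indices]
y_vals_test = y_vals[test_indices]
k = 4
batch_size = len(x_vals_test)
```

6. We declare the placeholders that we need next. There are four of them: the `x` inputs and the `y` targets for both the training and the test sets:

```py
x_data_train = tf.placeholder(shape=[None, num_features], dtype=tf.float32)
x_data_test = tf.placeholder(shape=[None, num_features], dtype=tf.float32)
y_target_train = tf.placeholder(shape=[None, 1], dtype=tf.float32)
y_target_test = tf.placeholder(shape=[None, 1], dtype=tf.float32)
```

7. Now, we can declare our distance function. For readability, we break it up into its components. Note that we have to tile the weight matrix by the batch size and use `tf.matmul()`, which multiplies matrices batch-wise over the leading dimension (older TensorFlow releases exposed this as `batch_matmul()`):

```py
# Shape: [batch_size, n_train, num_features]
subtraction_term = tf.subtract(x_data_train, tf.expand_dims(x_data_test,1))
first_product = tf.matmul(subtraction_term, tf.tile(tf.expand_dims(weight_matrix,0), [batch_size,1,1]))
second_product = tf.matmul(first_product, tf.transpose(subtraction_term, perm=[0,2,1]))
# The diagonal holds the squared weighted distance to each training point
distance = tf.sqrt(tf.matrix_diag_part(second_product))
```

8. After we calculate all the training distances for each test point, we need to return the top k-NNs. We can do this with the `top_k()` function. Since this function returns the largest values and we want the smallest distances, we ask for the largest of the negated distance values. We then compute the prediction as a weighted average over the top `k` neighbors, with weights derived from their distances:

```py
top_k_xvals, top_k_indices = tf.nn.top_k(tf.negative(distance), k=k)
# Normalize the (negated) distances so that each row of weights sums to 1
x_sums = tf.expand_dims(tf.reduce_sum(top_k_xvals, 1),1)
x_sums_repeated = tf.matmul(x_sums,tf.ones([1, k], tf.float32))
x_val_weights = tf.expand_dims(tf.divide(top_k_xvals,x_sums_repeated), 1)
top_k_yvals = tf.gather(y_target_train, top_k_indices)
prediction = tf.squeeze(tf.matmul(x_val_weights,top_k_yvals), axis=[1])
```

9. To evaluate our model, we calculate the MSE of our predictions:

```py
mse = tf.divide(tf.reduce_sum(tf.square(tf.subtract(prediction, y_target_test))), batch_size)
```

10. Now, we can loop through our test batches and calculate the MSE for each:

```py
num_loops = int(np.ceil(len(x_vals_test)/batch_size))
for i in range(num_loops):
    min_index = i*batch_size
    # Clamp to the size of the test set, since we slice the test arrays
    max_index = min((i+1)*batch_size, len(x_vals_test))
    x_batch = x_vals_test[min_index:max_index]
    y_batch = y_vals_test[min_index:max_index]
    feed = {x_data_train: x_vals_train, x_data_test: x_batch,
            y_target_train: y_vals_train, y_target_test: y_batch}
    predictions = sess.run(prediction, feed_dict=feed)
    batch_mse = sess.run(mse, feed_dict=feed)
    print('Batch #' + str(i+1) + ' MSE: ' + str(np.round(batch_mse,3)))

# Output reported in the original (it varies with the random split):
# Batch #1 MSE: 21.322
```

11. As a final comparison, we can plot the distribution of the actual housing values in the test set against the test set predictions with the following code:

```py
bins = np.linspace(5, 50, 45)
plt.hist(predictions, bins, alpha=0.5, label='Prediction')
plt.hist(y_batch, bins, alpha=0.5, label='Actual')
plt.title('Histogram of Predicted and Actual Values')
plt.xlabel('Med Home Value in $1,000s')
plt.ylabel('Frequency')
plt.legend(loc='upper right')
plt.show()
```

The preceding code produces the following histogram:

![](https://img.kancloud.cn/fd/90/fd90a70c425edefe695971e149370acf_387x281.png)

Figure 3: The two histograms of the predicted and actual housing values on the Boston dataset; this time, we scaled the distance function differently for each feature

## How it works...

We decreased our MSE on the test set by introducing a method of scaling the distance function for each feature. Here, we scaled the distance function by a factor of each feature's standard deviation. This gives a more accurate measure of which points are the closest neighbors. From this, we also took the weighted average of the top `k` neighbors as a function of distance to get the housing value prediction.

## There's more...

This scaling factor can also be used to down-weight or up-weight features in the nearest-neighbor distance calculation. This is useful in situations where we trust some features more than others.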
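
To make the effect of the weight matrix concrete, here is a minimal NumPy sketch of the same idea outside TensorFlow. It assumes the weighted distance has the form sqrt((x - y)^T A (x - y)) with a diagonal `A`, which is what the recipe's matrix products compute. The helper names `weighted_distances` and `knn_predict`, the toy data, and the weight values are all made up for illustration, and the prediction is a plain average of the `k` nearest targets rather than the recipe's distance-based weighting.

```python
import numpy as np

def weighted_distances(x_train, x_test, weights):
    """Return D[i, j] = sqrt(d^T A d) with d = x_test[i] - x_train[j]
    and A = diag(weights)."""
    # diff has shape (n_test, n_train, n_features)
    diff = x_test[:, None, :] - x_train[None, :, :]
    # Sum over features of weight * diff**2, without building A explicitly
    return np.sqrt(np.einsum('ijk,k,ijk->ij', diff, weights, diff))

def knn_predict(x_train, y_train, x_test, weights, k=1):
    """Average the targets of the k nearest training points (illustrative only)."""
    dist = weighted_distances(x_train, x_test, weights)
    nearest = np.argsort(dist, axis=1)[:, :k]
    return y_train[nearest].mean(axis=1)

# Hypothetical toy data: three training points with two features each
x_train = np.array([[0.5, 1.0],   # point A
                    [0.0, 0.2],   # point B
                    [1.0, 1.0]])  # point C
y_train = np.array([10.0, 20.0, 30.0])
x_test = np.array([[0.6, 0.2]])

equal_weights = np.array([1.0, 1.0])     # ordinary Euclidean distance
trust_first = np.array([10.0, 0.1])      # up-weight feature 0, down-weight feature 1

pred_equal = knn_predict(x_train, y_train, x_test, equal_weights, k=1)
pred_weighted = knn_predict(x_train, y_train, x_test, trust_first, k=1)
print(pred_equal, pred_weighted)
```

With equal weights the nearest neighbor of the test point is B, because it matches on the second feature. Once the first feature is up-weighted, A becomes the nearest neighbor, so the prediction changes even though the data did not. This is the sense in which the diagonal of `A` expresses how much we trust each feature.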