Working with Data Sources · TensorFlow Machine Learning Cookbook, Second Edition (Chinese translation)

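# Working with data sources

For most of this book, we will rely on datasets to fit machine learning algorithms. This section shows how to access each of these datasets through TensorFlow and Python.

> Some of the data sources rely on external websites being maintained so that you can access the data. If these websites change or remove the data, some of the code in this section may need to be updated. You can find updated code on the author's GitHub page: [https://github.com/nfmcclure/tensorflow_cookbook](https://github.com/nfmcclure/tensorflow_cookbook).

## Getting ready

Some of the datasets we will use are built into Python libraries, some require a Python script to download, and some are downloaded manually from the Internet. Almost all of them require an active Internet connection to retrieve.

## How to do it...

1. Iris data: This is arguably the most classic dataset in machine learning, and perhaps in all of statistics. It records sepal length, sepal width, petal length, and petal width for three types of iris flowers: Iris setosa, Iris virginica, and Iris versicolor. There are 150 measurements in total, which means 50 for each species. To load the dataset in Python, we use scikit-learn's datasets function, as follows:

    ```py
    from sklearn import datasets

    iris = datasets.load_iris()
    print(len(iris.data))    # 150
    print(len(iris.target))  # 150
    # Columns: sepal length, sepal width, petal length, petal width
    print(iris.data[0])      # [ 5.1 3.5 1.4 0.2]
    # Targets: 0 = I. setosa, 1 = I. versicolor, 2 = I. virginica
    print(set(iris.target))  # {0, 1, 2}
    ```

2. Birth weight data: This data was originally collected at Baystate Medical Center, Springfield, Mass., in 1986 (1). It contains birth weight measurements along with other demographic and medical measurements of the mother and the family history. The source describes 189 observations of 11 variables; note that the file loaded below yields 9 values per row, as the output shows. The following code shows how to access this data in Python:

    ```py
    import requests

    birthdata_url = 'https://github.com/nfmcclure/tensorflow_cookbook/raw/master/01_Introduction/07_Working_with_Data_Sources/birthweight_data/birthweight.dat'
    birth_file = requests.get(birthdata_url)
    # splitlines() handles both '\r\n' and '\n' line endings
    birth_data = birth_file.text.splitlines()
    birth_header = birth_data[0].split('\t')
    # Convert every non-empty field of every non-empty row to float
    birth_data = [[float(x) for x in y.split('\t') if len(x) >= 1]
                  for y in birth_data[1:] if len(y) >= 1]
    print(len(birth_data))     # 189
    print(len(birth_data[0]))  # 9
    ```

3. Boston housing data: Carnegie Mellon University maintains a library of datasets in its StatLib library. This data is easily accessible through the University of California, Irvine's machine learning repository ([https://archive.ics.uci.edu/ml/index.php](https://archive.ics.uci.edu/ml/index.php)). There are 506 observations of house values, along with various demographic data and housing attributes (14 variables). The following code shows how to access this data in Python through the Keras library:

    ```py
    from keras.datasets import boston_housing

    (x_train, y_train), (x_test, y_test) = boston_housing.load_data()
    # The first 13 names are the feature columns; MEDV is the target
    housing_header = ['CRIM', 'ZN', 'INDUS', 'CHAS', 'NOX', 'RM', 'AGE',
                      'DIS', 'RAD', 'TAX', 'PTRATIO', 'B', 'LSTAT', 'MEDV']
    print(x_train.shape[0])  # 404
    print(x_train.shape[1])  # 13
    ```

4. MNIST handwriting data: The MNIST (Modified National Institute of Standards and Technology) dataset is a subset of the larger NIST handwriting database. It is hosted on Yann LeCun's website ([https://yann.lecun.com/exdb/mnist/](https://yann.lecun.com/exdb/mnist/)). It is a database of 70,000 images of single digits (0-9), with about 60,000 labeled for a training set and 10,000 for a test set. This dataset is used so often in image recognition that TensorFlow provides built-in functions to access it. In machine learning, it is also important to provide validation data to prevent overfitting (target leakage). For this reason, TensorFlow sets aside 5,000 of the training images as a validation set. The following code shows how to access this data in Python:

    ```py
    # Note: tensorflow.examples.tutorials ships with TensorFlow 1.x
    from tensorflow.examples.tutorials.mnist import input_data

    mnist = input_data.read_data_sets("MNIST_data/", one_hot=True)
    print(len(mnist.train.images))       # 55000
    print(len(mnist.test.images))        # 10000
    print(len(mnist.validation.images))  # 5000
    # The label at index 1 is a 3 (one-hot encoded)
    print(mnist.train.labels[1, :])      # [ 0. 0. 0. 1. 0. 0. 0. 0. 0. 0.]
    ```

5. Spam-ham text data: UCI's machine learning repository also holds an SMS spam text message dataset. We can access the `.zip` file and extract the spam-ham text data as follows:

    ```py
    import requests
    import io
    from zipfile import ZipFile

    zip_url = 'http://archive.ics.uci.edu/ml/machine-learning-databases/00228/smsspamcollection.zip'
    r = requests.get(zip_url)
    z = ZipFile(io.BytesIO(r.content))
    file = z.read('SMSSpamCollection')
    # Decode, then drop any non-ASCII characters
    text_data = file.decode()
    text_data = text_data.encode('ascii', errors='ignore')
    text_data = text_data.decode().split('\n')
    # Each non-empty line is "<label>\t<message>"
    text_data = [x.split('\t') for x in text_data if len(x) >= 1]
    [text_data_target, text_data_train] = [list(x) for x in zip(*text_data)]
    print(len(text_data_train))   # 5574
    print(set(text_data_target))  # {'ham', 'spam'}
    print(text_data_train[1])     # Ok lar... Joking wif u oni...
    ```

6. Movie review data: Bo Pang from Cornell University has released a movie review dataset that classifies reviews as good or bad (3). You can find the data at [http://www.cs.cornell.edu/people/pabo/movie-review-data/](http://www.cs.cornell.edu/people/pabo/movie-review-data/). To download, extract, and transform this data, we can run the following code:

    ```py
    import requests
    import io
    import tarfile

    movie_data_url = 'http://www.cs.cornell.edu/people/pabo/movie-review-data/rt-polaritydata.tar.gz'
    r = requests.get(movie_data_url)
    # Stream data into temp object
    stream_data = io.BytesIO(r.content)
    tmp = io.BytesIO()
    while True:
        s = stream_data.read(16384)
        if not s:
            break
        tmp.write(s)
    stream_data.close()
    tmp.seek(0)
    # Extract tar file
    tar_file = tarfile.open(fileobj=tmp, mode="r:gz")
    pos = tar_file.extractfile('rt-polaritydata/rt-polarity.pos')
    neg = tar_file.extractfile('rt-polaritydata/rt-polarity.neg')
    # Save pos/neg reviews (also deal with encoding)
    pos_data = []
    for line in pos:
        pos_data.append(line.decode('ISO-8859-1').encode('ascii', errors='ignore').decode())
    neg_data = []
    for line in neg:
        neg_data.append(line.decode('ISO-8859-1').encode('ascii', errors='ignore').decode())
    tar_file.close()
    print(len(pos_data))  # 5331
    print(len(neg_data))  # 5331
    # Print out first negative review
    print(neg_data[0])    # simplistic , silly and tedious .
    ```

7. CIFAR-10 image data: The Canadian Institute for Advanced Research has released an image set containing 80 million labeled color images (each scaled to 32 x 32 pixels). There are 10 different target classes (airplane, automobile, bird, and so on). CIFAR-10 is a subset of 60,000 images, with 50,000 in the training set and 10,000 in the test set. Since we will use this dataset in several ways, and because it is one of our larger datasets, we will not run a script each time we need it. To get this dataset, navigate to [http://www.cs.toronto.edu/~kriz/cifar.html](http://www.cs.toronto.edu/~kriz/cifar.html) and download the CIFAR-10 dataset. We will cover how to use it in the relevant chapters.

8. The works of Shakespeare text data: Project Gutenberg (5) is a project that releases electronic versions of free books. It has compiled all of the works of Shakespeare into a single text. The following code shows how to access this text file through Python:

    ```py
    import requests

    shakespeare_url = 'http://www.gutenberg.org/cache/epub/100/pg100.txt'
    # Get Shakespeare text
    response = requests.get(shakespeare_url)
    shakespeare_file = response.content
    # Decode binary into string
    shakespeare_text = shakespeare_file.decode('utf-8')
    # Drop first few descriptive paragraphs.
    shakespeare_text = shakespeare_text[7675:]
    print(len(shakespeare_text))  # Number of characters: 5582212
    ```

9. English-German sentence translation data: The Tatoeba project ([http://tatoeba.org](http://tatoeba.org)) collects sentence translations in many languages. Its data has been released under a Creative Commons license. From this data, ManyThings.org ([http://www.manythings.org](http://www.manythings.org)) has compiled sentence-to-sentence translations into text files available for download. Here we use the English-German translation file, but you can change the URL to whichever language you would like to use:

    ```py
    import requests
    import io
    from zipfile import ZipFile

    sentence_url = 'http://www.manythings.org/anki/deu-eng.zip'
    r = requests.get(sentence_url)
    z = ZipFile(io.BytesIO(r.content))
    file = z.read('deu.txt')
    # Format data: decode, drop non-ASCII characters, split into lines
    eng_ger_data = file.decode()
    eng_ger_data = eng_ger_data.encode('ascii', errors='ignore')
    eng_ger_data = eng_ger_data.decode().split('\n')
    # Each non-empty line is "<English>\t<German>"
    eng_ger_data = [x.split('\t') for x in eng_ger_data if len(x) >= 1]
    [english_sentence, german_sentence] = [list(x) for x in zip(*eng_ger_data)]
    print(len(english_sentence))  # 137673
    print(len(german_sentence))   # 137673
    print(eng_ger_data[10])       # ['I won!', 'Ich habe gewonnen!']
    ```

## How it works...

When a recipe uses one of these datasets, we will refer you back to this section and assume that the data has been loaded in the way described above. If further data transformation or preprocessing is needed, that code will be provided in the recipe itself.

## See also

The following are additional references for the data resources used in this book:

1. Hosmer, D.W., Lemeshow, S., and Sturdivant, R. X. (2013). Applied Logistic Regression: 3rd Edition.
2. Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml](http://archive.ics.uci.edu/ml). Irvine, CA: University of California, School of Information and Computer Science.
3. Bo Pang, Lillian Lee, and Shivakumar Vaithyanathan. Thumbs up? Sentiment Classification using Machine Learning Techniques. Proceedings of EMNLP 2002. [http://www.cs.cornell.edu/people/pabo/movie-review-data/](http://www.cs.cornell.edu/people/pabo/movie-review-data/)
4. Krizhevsky. (2009). Learning Multiple Layers of Features from Tiny Images. [http://www.cs.toronto.edu/~kriz/cifar.html](http://www.cs.toronto.edu/~kriz/cifar.html)
5. Project Gutenberg. Accessed April 2016. [http://www.gutenberg.org/](http://www.gutenberg.org/)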
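
The zip-and-parse pattern used in the SMS spam and English-German steps can be tried without a network connection. The following is a minimal sketch, not the book's code: `parse_tab_separated` is a hypothetical helper introduced here, and the two sample messages are invented. An in-memory archive stands in for the bytes that `requests.get(...).content` would return.

```python
import io
from zipfile import ZipFile


def parse_tab_separated(raw_bytes):
    """Decode bytes, drop non-ASCII characters, and split rows on tabs.

    Hypothetical helper for illustration; it mirrors the decode/encode/split
    steps of the recipe but uses splitlines() so that both '\n' and '\r\n'
    line endings are handled.
    """
    text = raw_bytes.decode('utf-8')
    text = text.encode('ascii', errors='ignore').decode()
    return [line.split('\t') for line in text.splitlines() if len(line) >= 1]


# Build a tiny zip archive in memory. It stands in for `r.content`;
# the two messages below are made up for this sketch.
sample = 'ham\tSee you at caf\u00e9 later\nspam\tWIN a prize now\n'
buffer = io.BytesIO()
with ZipFile(buffer, 'w') as archive:
    archive.writestr('SMSSpamCollection', sample)
zip_bytes = buffer.getvalue()

# Same access pattern as the recipe: wrap the bytes and read one member.
z = ZipFile(io.BytesIO(zip_bytes))
rows = parse_tab_separated(z.read('SMSSpamCollection'))
[labels, messages] = [list(x) for x in zip(*rows)]

print(labels)
print(messages)
```

Keeping the parsing in one small function makes it easier to adjust if a hosted file changes its line endings or layout, which the note at the top of this section warns can happen.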