Loading and Preparing the text8 Dataset · Mastering TensorFlow 1.x

# Loading and Preparing the text8 Dataset

We now perform the same loading and preprocessing steps with the text8 dataset:

```py
from datasetslib.text8 import Text8

text8 = Text8()
# downloads data, converts words to ids, converts files to a list of ids
text8.load_data()
print('Train:', text8.part['train'][0:5])
print('Vocabulary Length = ', text8.vocab_len)
```

We find that the vocabulary contains roughly 254,000 words:

```py
Train: [5233, 3083, 11, 5, 194]
Vocabulary Length = 253854
```

Some tutorials manipulate this data by keeping only the most common words or by truncating the vocabulary to 10,000 words. We, however, use the complete dataset and the complete vocabulary from the first file of the text8 dataset.

Prepare the CBOW pairs:

```py
text8.skip_window = 2
text8.reset_index_in_epoch()
# in CBOW input is the context word and output is the target word
y_batch, x_batch = text8.next_batch_cbow()

print('The CBOW pairs : context,target')
for i in range(5 * text8.skip_window):
    print('(', [text8.id2word[x_i] for x_i in x_batch[i]],
          ',', y_batch[i], text8.id2word[y_batch[i]], ')')
```

The output is:

```py
The CBOW pairs : context,target
( ['anarchism', 'originated', 'a', 'term'] , 11 as )
( ['originated', 'as', 'term', 'of'] , 5 a )
( ['as', 'a', 'of', 'abuse'] , 194 term )
( ['a', 'term', 'abuse', 'first'] , 1 of )
( ['term', 'of', 'first', 'used'] , 3133 abuse )
( ['of', 'abuse', 'used', 'against'] , 45 first )
( ['abuse', 'first', 'against', 'early'] , 58 used )
( ['first', 'used', 'early', 'working'] , 155 against )
( ['used', 'against', 'working', 'class'] , 127 early )
( ['against', 'early', 'class', 'radicals'] , 741 working )
```

Prepare the skip-gram pairs:

```py
text8.skip_window = 2
text8.reset_index_in_epoch()
# in skip-gram input is the target word and output is the context word
x_batch, y_batch = text8.next_batch()

print('The skip-gram pairs : target,context')
for i in range(5 * text8.skip_window):
    print('(', x_batch[i], text8.id2word[x_batch[i]],
          ',', y_batch[i], text8.id2word[y_batch[i]], ')')
```

The output is:

```py
The skip-gram pairs : target,context
( 11 as , 5233 anarchism )
( 11 as , 3083 originated )
( 11 as , 5 a )
( 11 as , 194 term )
( 5 a , 3083 originated )
( 5 a , 11 as )
( 5 a , 194 term )
( 5 a , 1 of )
( 194 term , 11 as )
( 194 term , 5 a )
```
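To make the windowing behind these pairs concrete, here is a minimal, library-free sketch of how CBOW and skip-gram pairs can be derived from a list of word ids with `skip_window = 2`. It is not the `datasetslib` implementation: the helper names `build_vocab`, `cbow_pairs` and `skipgram_pairs` are hypothetical, and the ids it assigns (by descending frequency, ties broken alphabetically) are an assumption, so they will not match the ids printed above.

```python
from collections import Counter


def build_vocab(words):
    """Hypothetical helper: map each distinct word to an integer id.

    Assumption: ids are assigned by descending frequency, with ties
    broken alphabetically so the mapping is deterministic.
    """
    counts = Counter(words)
    ordered = sorted(counts, key=lambda w: (-counts[w], w))
    word2id = {w: i for i, w in enumerate(ordered)}
    id2word = {i: w for w, i in word2id.items()}
    return word2id, id2word


def window(ids, t, skip_window):
    """Return the ids within skip_window positions of index t, excluding t."""
    return ids[t - skip_window:t] + ids[t + 1:t + skip_window + 1]


def cbow_pairs(ids, skip_window):
    """Yield (context_ids, target_id): the context predicts the target."""
    for t in range(skip_window, len(ids) - skip_window):
        yield window(ids, t, skip_window), ids[t]


def skipgram_pairs(ids, skip_window):
    """Yield (target_id, context_id): the target predicts each context word."""
    for t in range(skip_window, len(ids) - skip_window):
        for c in window(ids, t, skip_window):
            yield ids[t], c


# The opening words of text8, as they appear in the outputs above.
words = ("anarchism originated as a term of abuse first used "
         "against early working class radicals").split()
word2id, id2word = build_vocab(words)
ids = [word2id[w] for w in words]

print('The CBOW pairs : context,target')
for context, target in cbow_pairs(ids, 2):
    print([id2word[c] for c in context], ',', id2word[target])

print('The skip-gram pairs : target,context')
for target, context in list(skipgram_pairs(ids, 2))[:10]:
    print(id2word[target], ',', id2word[context])
```

Only positions with a full window on both sides produce pairs in this sketch, which is why the first target is `as` rather than `anarchism`. Each such position yields one CBOW pair and `2 * skip_window` skip-gram pairs.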