One-hot encoding · Deep Learning with Python

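* [ ] Each **word** is associated with a unique **integer index**
* [ ] This **integer index *i*** is converted into a **binary vector** of length *N* (where *N* is the vocabulary size); the vector is all zeros except for the *i*-th entry, which is 1

*****

**Word-level one-hot encoding**

~~~
import numpy as np

# Initial data: one entry per sample (here a sample is a sentence,
# but it could also be an entire document)
samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Build an index of all tokens in the data
token_index = {}
for sample in samples:
    # Tokenize the sample with split(). In a real application you would
    # also strip punctuation and special characters from the samples
    for word in sample.split():
        if word not in token_index:
            # Assign a unique index to each unique word.
            # Note that index 0 is not assigned to any word
            token_index[word] = len(token_index) + 1

# Vectorize the samples, considering only the first max_length words of each
max_length = 10

# Store the results in this array
results = np.zeros(shape=(len(samples),
                          max_length,
                          max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        index = token_index.get(word)
        results[i, j, index] = 1.
~~~

**Character-level one-hot encoding**

~~~
import string
import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

characters = string.printable  # All printable ASCII characters
# Map each character to a unique index starting at 1 (index 0 is unused).
# The keys must be the characters so that token_index.get(character) works
token_index = dict(zip(characters, range(1, len(characters) + 1)))

max_length = 50
results = np.zeros((len(samples), max_length, max(token_index.values()) + 1))
for i, sample in enumerate(samples):
    # Consider only the first max_length characters of each sample
    for j, character in enumerate(sample[:max_length]):
        index = token_index.get(character)
        results[i, j, index] = 1.
~~~

**Word-level one-hot encoding with Keras**

* Keras has built-in utilities for word-level or character-level one-hot encoding of raw text data
* They implement a number of important features, such as stripping special characters from strings and only considering the *N* most common words in the dataset (a common restriction that avoids dealing with very large input vector spaces)

~~~
from keras.preprocessing.text import Tokenizer

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Create a tokenizer configured to only consider the 1,000 most common words
tokenizer = Tokenizer(num_words=1000)

# Build the word index
tokenizer.fit_on_texts(samples)

# Turn the strings into lists of integer indices
sequences = tokenizer.texts_to_sequences(samples)

# You can also get the one-hot binary representations directly.
# The tokenizer supports vectorization modes other than one-hot encoding
one_hot_results = tokenizer.texts_to_matrix(samples, mode='binary')

# Recover the word index that was computed
word_index = tokenizer.word_index
~~~

**One-hot hashing trick**

* Use it when the number of unique tokens in the vocabulary is too large to handle explicitly
* Words are hashed into vectors of fixed size, typically with a very lightweight hash function
* Advantage: it does away with maintaining an explicit word index, which saves memory and allows online encoding of the data (you can generate token vectors right away, before you have seen all of the available data)
* Drawback: it is susceptible to **hash collisions**, where two different words end up with the same hash value; any machine learning model looking at these hashes then cannot tell the corresponding words apart
* The likelihood of hash collisions decreases when the dimensionality of the hashing space is much larger than the total number of unique tokens being hashed

~~~
import numpy as np

samples = ['The cat sat on the mat.', 'The dog ate my homework.']

# Store words as vectors of size 1000. If you have close to 1,000 words
# (or more), you will see many hash collisions, which will decrease the
# accuracy of this encoding method
dimensionality = 1000
max_length = 10

results = np.zeros((len(samples), max_length, dimensionality))
for i, sample in enumerate(samples):
    for j, word in list(enumerate(sample.split()))[:max_length]:
        # Hash the word into a pseudo-random integer index between 0 and 999
        index = abs(hash(word)) % dimensionality
        results[i, j, index] = 1.
~~~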
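One practical caveat with the hashing example: by default Python salts the built-in `hash()` of strings per process, so the same word can map to a different index on each run. The sketch below is one possible workaround, not part of the original notes. It assumes SHA-256 from `hashlib` as the hash function, and the helper names `stable_index` and `count_collisions` are hypothetical. It also counts collisions to illustrate how they shrink as the hashing space grows.

```python
import hashlib

def stable_index(word, dimensionality):
    """Map a word to an index in [0, dimensionality) that is stable across runs.

    Assumption: SHA-256 is used purely for illustration; any deterministic
    hash function would serve the same purpose.
    """
    digest = hashlib.sha256(word.encode('utf-8')).digest()
    # Interpret the first 8 bytes of the digest as an unsigned integer
    return int.from_bytes(digest[:8], 'big') % dimensionality

def count_collisions(words, dimensionality):
    """Count unique words that share an index with another unique word."""
    unique_words = set(words)
    used_indices = {stable_index(w, dimensionality) for w in unique_words}
    return len(unique_words) - len(used_indices)

samples = ['The cat sat on the mat.', 'The dog ate my homework.']
words = [word for sample in samples for word in sample.split()]

# A tiny hashing space forces collisions; a large one makes them unlikely
for dimensionality in (1, 5, 1000):
    print(dimensionality, count_collisions(words, dimensionality))
```

With a hashing space of size 1, every unique word lands on index 0, so all but one of them collide. In the earlier hashing loop, `stable_index(word, dimensionality)` could stand in for `abs(hash(word)) % dimensionality` if reproducible indices are needed.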