Topics and Transformations · Gensim documentation (Chinese edition)

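# Topics and Transformations

Don't forget to set

```py
>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
```

if you want to see logging events.

## [Transformation interface](https://radimrehurek.com/gensim/tut2.html#transformation-interface "Permalink to this headline")

In the previous tutorial on [Corpora and Vector Spaces](https://radimrehurek.com/gensim/tut1.html), we created a corpus of documents represented as a stream of vectors. To continue, let's fire up gensim and use that corpus:

```py
>>> import os
>>> from gensim import corpora, models, similarities
>>> if os.path.exists("/tmp/deerwester.dict"):
...     dictionary = corpora.Dictionary.load('/tmp/deerwester.dict')
...     corpus = corpora.MmCorpus('/tmp/deerwester.mm')
...     print("Used files generated from first tutorial")
... else:
...     print("Please run first tutorial to generate data set")
```

The loaded corpus is reported as `MmCorpus(9 documents, 12 features, 28 non-zero entries)`.

In this tutorial, I will show how to transform documents from one vector representation into another. This process serves two goals:

1. To bring out hidden structure in the corpus: discover relationships between words and use them to describe the documents in a new and (hopefully) more semantic way.
2. To make the document representation more compact. This improves both efficiency (the new representation consumes fewer resources) and efficacy (marginal data trends are ignored, which reduces noise).

### [Creating a transformation](https://radimrehurek.com/gensim/tut2.html#creating-a-transformation "Permalink to this headline")

The transformations are standard Python objects, typically initialized by means of a *training corpus*:

```py
>>> tfidf = models.TfidfModel(corpus)  # step 1 -- initialize a model
```

We used our old corpus from tutorial 1 to initialize (train) the transformation model. Different transformations may require different initialization parameters. In the case of TfIdf, the "training" consists simply of going through the supplied corpus once and computing the document frequencies of all its features. Training other models, such as Latent Semantic Analysis or Latent Dirichlet Allocation, is much more involved and consequently takes much more time.

Note that a transformation always converts between two specific vector spaces. The same vector space (= the same set of feature ids) must be used for training as well as for subsequent vector transformations. Failure to use the same input feature space, such as applying different string preprocessing, using different feature ids, or supplying bag-of-words vectors where TfIdf vectors are expected, will result in a feature mismatch during transformation calls, and consequently in garbage output, runtime exceptions, or both.

### [Transforming vectors](https://radimrehurek.com/gensim/tut2.html#transforming-vectors "Permalink to this headline")

From now on, `tfidf` is treated as a read-only object that can be used to convert any vector from the old representation (bag-of-words integer counts) to the new representation (TfIdf real-valued weights):

```py
>>> doc_bow = [(0, 1), (1, 1)]
>>> print(tfidf[doc_bow])  # step 2 -- use the model to transform vectors
[(0, 0.70710678), (1, 0.70710678)]
```

Or to apply a transformation to a whole corpus:

```py
>>> corpus_tfidf = tfidf[corpus]
>>> for doc in corpus_tfidf:
...     print(doc)
[(0, 0.57735026918962573), (1, 0.57735026918962573), (2, 0.57735026918962573)]
[(0, 0.44424552527467476), (3, 0.44424552527467476), (4, 0.44424552527467476), (5, 0.32448702061385548), (6, 0.44424552527467476), (7, 0.32448702061385548)]
[(2, 0.5710059809418182), (5, 0.41707573620227772), (7, 0.41707573620227772), (8, 0.5710059809418182)]
[(1, 0.49182558987264147), (5, 0.71848116070837686), (8, 0.49182558987264147)]
[(3, 0.62825804686700459), (6, 0.62825804686700459), (7, 0.45889394536615247)]
[(9, 1.0)]
[(9, 0.70710678118654746), (10, 0.70710678118654746)]
[(9, 0.50804290089167492), (10, 0.50804290089167492), (11, 0.69554641952003704)]
[(4, 0.62825804686700459), (10, 0.45889394536615247), (11, 0.62825804686700459)]
```

In this particular case, we are transforming the same corpus that we used for training, but this is only incidental. Once the transformation model has been initialized, it can be used on any vectors (provided they come from the same vector space, of course), even if they were not used in the training corpus at all. This is achieved by a process called folding-in for LSA, by topic inference for LDA, and so on.

> Note: Calling `model[corpus]` only creates a wrapper around the old `corpus` document stream. The actual conversions are done on the fly, during document iteration. We cannot convert the entire corpus at the time of calling `corpus_transformed = model[corpus]`, because that would mean storing the result in main memory, which contradicts gensim's objective of memory independence. If you will be iterating over the transformed `corpus_transformed` multiple times, and the transformation is costly, [serialize the resulting corpus to disk first](https://radimrehurek.com/gensim/tut1.html#corpus-formats) and continue using that.

Transformations can also be serialized, one on top of another, in a sort of chain:

```py
>>> lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)  # initialize an LSI transformation
>>> corpus_lsi = lsi[corpus_tfidf]  # create a double wrapper over the original corpus: bow->tfidf->fold-in-lsi
```

Here we transformed our Tf-Idf corpus via [Latent Semantic Indexing](https://en.wikipedia.org/wiki/Latent_semantic_indexing) into a latent 2-D space (2-D because we set `num_topics=2`). Now you might wonder: what do these two latent dimensions stand for? Let's inspect them with `models.LsiModel.print_topics()`:

```py
>>> lsi.print_topics(2)
topic #0(1.594): -0.703*"trees" + -0.538*"graph" + -0.402*"minors" + -0.187*"survey" + -0.061*"system" + -0.060*"response" + -0.060*"time" + -0.058*"user" + -0.049*"computer" + -0.035*"interface"
topic #1(1.476): -0.460*"system" + -0.373*"user" + -0.332*"eps" + -0.328*"interface" + -0.320*"response" + -0.320*"time" + -0.293*"computer" + -0.280*"human" + -0.171*"survey" + 0.161*"trees"
```

(The topics are printed to the log; see the note at the top of this page about activating logging.)

According to LSI, "trees", "graph" and "minors" are all related words (and contribute the most to the direction of the first topic), while the second topic practically concerns itself with all the other words. As expected, the first five documents are more strongly related to the second topic, while the remaining four documents are related to the first topic:

```py
>>> for doc in corpus_lsi:  # both bow->tfidf and tfidf->lsi transformations are actually executed here, on the fly
...     print(doc)
[(0, -0.066), (1, 0.520)]  # "Human machine interface for lab abc computer applications"
[(0, -0.197), (1, 0.761)]  # "A survey of user opinion of computer system response time"
[(0, -0.090), (1, 0.724)]  # "The EPS user interface management system"
[(0, -0.076), (1, 0.632)]  # "System and human system engineering testing of EPS"
[(0, -0.102), (1, 0.574)]  # "Relation of user perceived response time to error measurement"
[(0, -0.703), (1, -0.161)]  # "The generation of random binary unordered trees"
[(0, -0.877), (1, -0.168)]  # "The intersection graph of paths in trees"
[(0, -0.910), (1, -0.141)]  # "Graph minors IV Widths of trees and well quasi ordering"
[(0, -0.617), (1, 0.054)]  # "Graph minors A survey"
```

Model persistence is achieved with the `save()` and `load()` functions:

```py
>>> lsi.save('/tmp/model.lsi')  # same for tfidf, lda, ...
>>> lsi = models.LsiModel.load('/tmp/model.lsi')
```

The next question might be: just how exactly similar are those documents to each other? Is there a way to formalize the similarity, so that for a given input document, we can order some other set of documents according to their similarity? Similarity queries are covered in the [next tutorial](https://radimrehurek.com/gensim/tut3.html).

## [Available transformations](https://radimrehurek.com/gensim/tut2.html#available-transformations "Permalink to this headline")

gensim implements several popular Vector Space Model algorithms:

* [Term Frequency \* Inverse Document Frequency, Tf-Idf](https://en.wikipedia.org/wiki/Tf%E2%80%93idf) expects a bag-of-words (integer values) training corpus during initialization. During transformation, it takes a vector and returns another vector of the same dimensionality, except that features which were rare in the training corpus have their value increased. It therefore converts integer-valued vectors into real-valued ones, while leaving the number of dimensions intact. It can also optionally normalize the resulting vectors to (Euclidean) unit length.

  `>>> model = models.TfidfModel(corpus, normalize=True)`

* [Latent Semantic Indexing, LSI (or sometimes LSA)](https://en.wikipedia.org/wiki/Latent_semantic_indexing) transforms documents from either bag-of-words or (preferably) TfIdf-weighted space into a latent space of lower dimensionality. For the toy corpus above we used only 2 latent dimensions, but on real corpora, a target dimensionality of 200-500 is recommended as a "golden standard" [[1]](https://radimrehurek.com/gensim/tut2.html#id6).

  `>>> model = models.LsiModel(tfidf_corpus, id2word=dictionary, num_topics=300)`

  LSI training is unique in that we can continue "training" at any point, simply by providing more training documents. This is done by incremental updates to the underlying model, in a process called online training. Because of this feature, the input document stream may even be infinite: just keep feeding LSI new documents as they arrive, while using the computed transformation model as read-only in the meantime!

  ```py
  >>> model.add_documents(another_tfidf_corpus)  # now LSI has been trained on tfidf_corpus + another_tfidf_corpus
  >>> lsi_vec = model[tfidf_vec]  # convert some new document into the LSI space, without affecting the model
  >>> ...
  >>> model.add_documents(more_documents)  # tfidf_corpus + another_tfidf_corpus + more_documents
  >>> lsi_vec = model[tfidf_vec]
  >>> ...
  ```

  See the [`gensim.models.lsimodel`](https://radimrehurek.com/gensim/models/lsimodel.html#module-gensim.models.lsimodel "gensim.models.lsimodel: Latent Semantic Indexing") documentation for details on how to make LSI gradually "forget" old observations in infinite streams. If you want to get your hands dirty, there are also parameters you can tweak that affect the speed, memory footprint and numerical precision of the LSI algorithm.

  gensim uses a novel online incremental streamed distributed training algorithm (quite a mouthful!), which I published in [[5]](https://radimrehurek.com/gensim/tut2.html#id10). gensim also executes a stochastic multi-pass algorithm from Halko et al. [[4]](https://radimrehurek.com/gensim/tut2.html#id9) internally, to accelerate the in-core part of the computations. See also [Experiments on the English Wikipedia](https://radimrehurek.com/gensim/wiki.html) for further speed-ups by distributing the computation across a cluster of computers.

* [Random Projections, RP](http://www.cis.hut.fi/ella/publications/randproj_kdd.pdf) aim to reduce vector space dimensionality. This is a very efficient (both memory- and CPU-friendly) approach to approximating TfIdf distances between documents, by throwing in a little randomness. The recommended target dimensionality is again in the hundreds or thousands, depending on your dataset.

  `>>> model = models.RpModel(tfidf_corpus, num_topics=500)`

* [Latent Dirichlet Allocation, LDA](https://en.wikipedia.org/wiki/Latent_Dirichlet_allocation) is yet another transformation from bag-of-words counts into a topic space of lower dimensionality. LDA is a probabilistic extension of LSA (also called multinomial PCA), so LDA's topics can be interpreted as probability distributions over words. These distributions are, just like with LSA, inferred automatically from a training corpus. Documents are in turn interpreted as a (soft) mixture of these topics (again, just like with LSA).

  `>>> model = models.LdaModel(corpus, id2word=dictionary, num_topics=100)`

  gensim uses a fast implementation of online LDA parameter estimation based on [[2]](https://radimrehurek.com/gensim/tut2.html#id7), modified to run in [distributed mode](https://radimrehurek.com/gensim/distributed.html) on a cluster of computers.

* [Hierarchical Dirichlet Process, HDP](http://jmlr.csail.mit.edu/proceedings/papers/v15/wang11a/wang11a.pdf) is a non-parametric Bayesian method (note the missing number of requested topics):

  `>>> model = models.HdpModel(corpus, id2word=dictionary)`

  gensim uses a fast, online implementation based on [[3]](https://radimrehurek.com/gensim/tut2.html#id8). The HDP model is a new addition to gensim and is still rough around its academic edges, so use it with care.

Adding new VSM transformations (such as different weighting schemes) is rather trivial; see the [API reference](https://radimrehurek.com/gensim/apiref.html) or go directly to the [Python code](https://github.com/piskvorky/gensim/blob/develop/gensim/models/tfidfmodel.py) for more information and examples.

It is worth repeating that these are all unique, **incremental** implementations, which do not require the whole training corpus to be present in main memory all at once. With memory taken care of, I am now improving [distributed computing](https://radimrehurek.com/gensim/distributed.html), to improve CPU efficiency as well. If you feel you could contribute (by testing, providing use cases or code), please [let us know](mailto:radimrehurek%40seznam.cz).

Continue on to the next tutorial on [Similarity Queries](https://radimrehurek.com/gensim/tut3.html).

---

[[1]](https://radimrehurek.com/gensim/tut2.html#id1) Bradford. 2008. An empirical study of required dimensionality for large-scale latent semantic indexing applications.
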
[[2]](https://radimrehurek.com/gensim/tut2.html#id4) Hoffman, Blei, Bach. 2010. Online learning for Latent Dirichlet Allocation.

[[3]](https://radimrehurek.com/gensim/tut2.html#id5) Wang, Paisley, Blei. 2011. Online variational inference for the hierarchical Dirichlet process.

[[4]](https://radimrehurek.com/gensim/tut2.html#id3) Halko, Martinsson, Tropp. 2009. Finding structure with randomness.

[[5]](https://radimrehurek.com/gensim/tut2.html#id2) Řehůřek. 2011. Subspace tracking for Latent Semantic Analysis.
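
As a supplementary illustration of the TfIdf "training" step described in this tutorial (one pass over the corpus to collect document frequencies, then reweighting and normalizing each vector), here is a minimal pure-Python sketch. It is not gensim's implementation. It assumes the weighting is raw term count times `log2(N / df)` followed by Euclidean normalization, which is consistent with the numbers printed in this tutorial. The helper names `train_idf` and `to_tfidf` are hypothetical, and the bag-of-words corpus is written out inline (reconstructed from the first tutorial) so that the sketch runs without the `/tmp/deerwester.*` files.

```python
import math

# Bag-of-words corpus from the first tutorial, inlined here as an assumption
# so the sketch does not depend on files under /tmp.
corpus = [
    [(0, 1), (1, 1), (2, 1)],
    [(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)],
    [(2, 1), (5, 1), (7, 1), (8, 1)],
    [(1, 1), (5, 2), (8, 1)],
    [(3, 1), (6, 1), (7, 1)],
    [(9, 1)],
    [(9, 1), (10, 1)],
    [(9, 1), (10, 1), (11, 1)],
    [(4, 1), (10, 1), (11, 1)],
]


def train_idf(bow_corpus):
    """Single pass over the corpus: count document frequencies, return idf weights."""
    num_docs = 0
    doc_freq = {}
    for doc in bow_corpus:
        num_docs += 1
        for term_id, _count in doc:
            doc_freq[term_id] = doc_freq.get(term_id, 0) + 1
    # Assumed weighting: idf = log2(N / df).
    return {term_id: math.log2(num_docs / df) for term_id, df in doc_freq.items()}


def to_tfidf(idf, bow):
    """Reweight one bag-of-words vector and scale it to unit Euclidean length."""
    # Feature ids unseen in training are skipped; zero weights are dropped.
    weighted = [(t, count * idf[t]) for t, count in bow if t in idf]
    weighted = [(t, w) for t, w in weighted if w != 0.0]
    norm = math.sqrt(sum(w * w for _, w in weighted))
    if norm == 0.0:
        return []
    return [(t, w / norm) for t, w in weighted]


idf = train_idf(corpus)               # "step 1": train on the corpus
print(to_tfidf(idf, [(0, 1), (1, 1)]))  # "step 2": approximately [(0, 0.7071), (1, 0.7071)]
```

Keeping the trained weights (`idf`) separate from the per-vector transformation mirrors the read-only use of the model in the tutorial: the same weights can be applied to any vector from the same feature space, including vectors that were not part of the training corpus.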