Experiments on the English Wikipedia · Gensim Chinese Documentation

# Experiments on the English Wikipedia

To test gensim's performance, we run it against the English version of Wikipedia.

This page describes the process of obtaining and processing Wikipedia, so that anyone can reproduce the results. It assumes you have gensim properly [installed](https://radimrehurek.com/gensim/install.html).

## Preparing the corpus

1. First, download the dump of all Wikipedia articles from [http://download.wikimedia.org/enwiki/](https://download.wikimedia.org/enwiki/) (you want the file `enwiki-latest-pages-articles.xml.bz2`, or `enwiki-YYYYMMDD-pages-articles.xml.bz2` for a date-specific dump). This file is about 8GB in size and contains a compressed version of all articles from the English Wikipedia.
2. Convert the articles to plain text (processing the Wiki markup) and store the result as sparse TF-IDF vectors. In Python, this is easy to do on the fly, and we don't even need to uncompress the whole archive to disk. Gensim includes a script that does just that; run:

   `$ python -m gensim.scripts.make_wiki`

> **Note**
>
> * This preprocessing step makes two passes over the 8.2GB compressed wiki dump (one to extract the dictionary, one to create and store the sparse vectors) and takes about 9 hours on my laptop, so you may want to go have a coffee or two.
> * Also, you will need about 35GB of free disk space to store the sparse output vectors. I recommend compressing these files immediately, e.g. with bzip2 (down to ~13GB). Gensim can work with compressed files directly, so this saves disk space.

## Latent Semantic Analysis

First let's load the corpus iterator and dictionary created in the second step above:

```py
>>> import logging, gensim
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

>>> # load id->word mapping (the dictionary), one of the results of step 2 above
>>> id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
>>> # load corpus iterator
>>> mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
>>> # mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm.bz2') # use this if you compressed the TFIDF output (recommended)

>>> print(mm)
MmCorpus(3931787 documents, 100000 features, 756379027 non-zero entries)
```

We see that our corpus contains 3.9M documents, 100K features (distinct tokens) and about 756 million (0.76G) non-zero entries in the sparse TF-IDF matrix. The Wikipedia corpus contains about 2.24 billion tokens in total.

Now we're ready to compute LSA of the English Wikipedia:

```py
>>> # extract 400 LSI topics; use the default one-pass algorithm
>>> lsi = gensim.models.lsimodel.LsiModel(corpus=mm, id2word=id2word, num_topics=400)

>>> # print the most contributing words (both positively and negatively) for each of the first ten topics
>>> lsi.print_topics(10)
topic #0(332.762): 0.425*"utc" + 0.299*"talk" + 0.293*"page" + 0.226*"article" +
0.224*"delete" + 0.216*"discussion" + 0.205*"deletion" + 0.198*"should" + 0.146*"debate" + 0.132*"be"
topic #1(201.852): 0.282*"link" + 0.209*"he" + 0.145*"com" + 0.139*"his" + -0.137*"page" + -0.118*"delete" + 0.114*"blacklist" + -0.108*"deletion" + -0.105*"discussion" + 0.100*"diff"
topic #2(191.991): -0.565*"link" + -0.241*"com" + -0.238*"blacklist" + -0.202*"diff" + -0.193*"additions" + -0.182*"users" + -0.158*"coibot" + -0.136*"user" + 0.133*"he" + -0.130*"resolves"
topic #3(141.284): -0.476*"image" + -0.255*"copyright" + -0.245*"fair" + -0.225*"use" + -0.173*"album" + -0.163*"cover" + -0.155*"resolution" + -0.141*"licensing" + 0.137*"he" + -0.121*"copies"
topic #4(130.909): 0.264*"population" + 0.246*"age" + 0.243*"median" + 0.213*"income" + 0.195*"census" + -0.189*"he" + 0.184*"households" + 0.175*"were" + 0.167*"females" + 0.166*"males"
topic #5(120.397): 0.304*"diff" + 0.278*"utc" + 0.213*"you" + -0.171*"additions" + 0.165*"talk" + -0.159*"image" + 0.159*"undo" + 0.155*"www" + -0.152*"page" + 0.148*"contribs"
topic #6(115.414): -0.362*"diff" + -0.203*"www" + 0.197*"you" + -0.180*"undo" + -0.180*"kategori" + 0.164*"users" + 0.157*"additions" + -0.150*"contribs" + -0.139*"he" + -0.136*"image"
topic #7(111.440): 0.429*"kategori" + 0.276*"categoria" + 0.251*"category" + 0.207*"kategorija" + 0.198*"kategorie" + -0.188*"diff" + 0.163*"категория" + 0.153*"categoría" + 0.139*"kategoria" + 0.133*"categorie"
topic #8(109.907): 0.385*"album" + 0.224*"song" + 0.209*"chart" + 0.204*"band" + 0.169*"released" + 0.151*"music" + 0.142*"diff" + 0.141*"vocals" + 0.138*"she" + 0.132*"guitar"
topic #9(102.599): -0.237*"league" + -0.214*"he" + -0.180*"season" + -0.174*"football" + -0.166*"team" + 0.159*"station" + -0.137*"played" + -0.131*"cup" + 0.131*"she" + -0.128*"utc"
```

Creating the LSI model of Wikipedia takes about 4 hours and 9 minutes on my laptop [[1]](https://radimrehurek.com/gensim/wiki.html#id6). That is about **16,000 documents per minute, including all I/O**.

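
The throughput figure can be checked with a little arithmetic. The sketch below only re-derives the rate from two numbers quoted in this section (the document count printed for `MmCorpus` and the runtime of 4 hours 9 minutes); it measures nothing itself, and the helper name `docs_per_minute` is introduced here for illustration and is not part of gensim.

```python
# Figures quoted in the text above; this snippet does not time anything.
NUM_DOCUMENTS = 3_931_787      # from the printed MmCorpus summary
LSI_MINUTES = 4 * 60 + 9       # "4 hours and 9 minutes" for the LSI run


def docs_per_minute(num_documents: int, minutes: float) -> float:
    """Return the average number of documents processed per minute."""
    return num_documents / minutes


lsi_rate = docs_per_minute(NUM_DOCUMENTS, LSI_MINUTES)
print(f"LSI: about {lsi_rate:,.0f} documents per minute")
# prints "LSI: about 15,790 documents per minute"
```

The result is a little under 16,000, which agrees with the rounded figure quoted above.
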
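> **Note**
>
> If you need your results faster, see the [Distributed Computing](https://radimrehurek.com/gensim/distributed.html) tutorial. Note that the BLAS libraries that gensim relies on make use of multiple cores transparently, so the same data will be processed faster on a multicore machine "for free", without any distributed setup.

We see that the total processing time is dominated by the preprocessing step of preparing the TF-IDF corpus from the raw Wikipedia XML dump, which took 9 hours. [[2]](https://radimrehurek.com/gensim/wiki.html#id7)

The algorithm used in gensim only needs to see each input document once, so it is suitable for environments where the documents come as a non-repeatable stream, or where the cost of storing or iterating over the corpus multiple times is too high.

## Latent Dirichlet Allocation

As with Latent Semantic Analysis above, first load the corpus iterator and dictionary:

```py
>>> import logging, gensim
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

>>> # load id->word mapping (the dictionary), one of the results of step 2 above
>>> id2word = gensim.corpora.Dictionary.load_from_text('wiki_en_wordids.txt')
>>> # load corpus iterator
>>> mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm')
>>> # mm = gensim.corpora.MmCorpus('wiki_en_tfidf.mm.bz2') # use this if you compressed the TFIDF output

>>> print(mm)
MmCorpus(3931787 documents, 100000 features, 756379027 non-zero entries)
```

We will run online LDA (see Hoffman et al. [[3]](https://radimrehurek.com/gensim/wiki.html#id8)), an algorithm that takes a chunk of documents, updates the LDA model, takes another chunk, updates the model again, and so on. Online LDA can be contrasted with batch LDA, which processes the whole corpus (one full pass), then updates the model, then makes another pass, another update, and so on.

The difference is that, given a reasonably stationary document stream (not much topic drift), the online updates over the smaller chunks (subcorpora) are pretty good in themselves, so the model estimation converges faster. As a result, we may need only a single full pass over the corpus: if the corpus has 3 million articles and we update once after every 10,000 articles, we will have done 300 updates in one pass, quite likely enough for a very accurate estimate of the topics:

```py
>>> # extract 100 LDA topics, using 1 pass and updating once every 1 chunk (10,000 documents)
>>> lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=1, chunksize=10000, passes=1)
using serial LDA version on this node
running online LDA training, 100 topics, 1 passes over the supplied corpus of 3931787 documents, updating model once every 10000 documents
...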
```

Unlike LSA, the topics coming from LDA are easier to interpret:

```py
>>> # print the most contributing words for 20 randomly selected topics
>>> lda.print_topics(20)
topic #0: 0.009*river + 0.008*lake + 0.006*island + 0.005*mountain + 0.004*area + 0.004*park + 0.004*antarctic + 0.004*south + 0.004*mountains + 0.004*dam
topic #1: 0.026*relay + 0.026*athletics + 0.025*metres + 0.023*freestyle + 0.022*hurdles + 0.020*ret + 0.017*divisão + 0.017*athletes + 0.016*bundesliga + 0.014*medals
topic #2: 0.002*were + 0.002*he + 0.002*court + 0.002*his + 0.002*had + 0.002*law + 0.002*government + 0.002*police + 0.002*patrolling + 0.002*their
topic #3: 0.040*courcelles + 0.035*centimeters + 0.023*mattythewhite + 0.021*wine + 0.019*stamps + 0.018*oko + 0.017*perennial + 0.014*stubs + 0.012*ovate + 0.011*greyish
topic #4: 0.039*al + 0.029*sysop + 0.019*iran + 0.015*pakistan + 0.014*ali + 0.013*arab + 0.010*islamic + 0.010*arabic + 0.010*saudi + 0.010*muhammad
topic #5: 0.020*copyrighted + 0.020*northamerica + 0.014*uncopyrighted + 0.007*rihanna + 0.005*cloudz + 0.005*knowles + 0.004*gaga + 0.004*zombie + 0.004*wigan + 0.003*maccabi
topic #6: 0.061*israel + 0.056*israeli + 0.030*sockpuppet + 0.025*jerusalem + 0.025*tel + 0.023*aviv + 0.022*palestinian + 0.019*ifk + 0.016*palestine + 0.014*hebrew
topic #7: 0.015*melbourne + 0.014*rovers + 0.013*vfl + 0.012*australian + 0.012*wanderers + 0.011*afl + 0.008*dinamo + 0.008*queensland + 0.008*tracklist + 0.008*brisbane
topic #8: 0.011*film + 0.007*her + 0.007*she + 0.004*he + 0.004*series + 0.004*his + 0.004*episode + 0.003*films + 0.003*television + 0.003*best
topic #9: 0.019*wrestling + 0.013*chateau + 0.013*ligue + 0.012*discus + 0.012*estonian + 0.009*uci + 0.008*hockeyarchives + 0.008*wwe + 0.008*estonia + 0.007*reign
topic #10: 0.078*edits + 0.059*notability + 0.035*archived + 0.025*clearer + 0.022*speedy + 0.021*deleted + 0.016*hook + 0.015*checkuser + 0.014*ron + 0.011*nominator
topic #11: 0.013*admins + 0.009*acid + 0.009*molniya + 0.009*chemical + 0.007*ch +
0.007*chemistry + 0.007*compound + 0.007*anemone + 0.006*mg + 0.006*reaction
topic #12: 0.018*india + 0.013*indian + 0.010*tamil + 0.009*singh + 0.008*film + 0.008*temple + 0.006*kumar + 0.006*hindi + 0.006*delhi + 0.005*bengal
topic #13: 0.047*bwebs + 0.024*malta + 0.020*hobart + 0.019*basa + 0.019*columella + 0.019*huon + 0.018*tasmania + 0.016*popups + 0.014*tasmanian + 0.014*modèle
topic #14: 0.014*jewish + 0.011*rabbi + 0.008*bgwhite + 0.008*lebanese + 0.007*lebanon + 0.006*homs + 0.005*beirut + 0.004*jews + 0.004*hebrew + 0.004*caligari
topic #15: 0.025*german + 0.020*der + 0.017*von + 0.015*und + 0.014*berlin + 0.012*germany + 0.012*die + 0.010*des + 0.008*kategorie + 0.007*cross
topic #16: 0.003*can + 0.003*system + 0.003*power + 0.003*are + 0.003*energy + 0.002*data + 0.002*be + 0.002*used + 0.002*or + 0.002*using
topic #17: 0.049*indonesia + 0.042*indonesian + 0.031*malaysia + 0.024*singapore + 0.022*greek + 0.021*jakarta + 0.016*greece + 0.015*dord + 0.014*athens + 0.011*malaysian
topic #18: 0.031*stakes + 0.029*webs + 0.018*futsal + 0.014*whitish + 0.013*hyun + 0.012*thoroughbred + 0.012*dnf + 0.012*jockey + 0.011*medalists + 0.011*racehorse
topic #19: 0.119*oblast + 0.034*uploaded + 0.034*uploads + 0.033*nordland + 0.025*selsoviet + 0.023*raion + 0.022*krai + 0.018*okrug + 0.015*hålogaland + 0.015*russiae + 0.020*manga + 0.017*dragon + 0.012*theme + 0.011*dvd + 0.011*super + 0.011*hunter + 0.009*ash + 0.009*dream + 0.009*angel
```

Creating this LDA model of Wikipedia takes about 6 hours and 20 minutes on my laptop [[1]](https://radimrehurek.com/gensim/wiki.html#id6). If you need your results faster, consider running [Distributed Latent Dirichlet
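Allocation](https://radimrehurek.com/gensim/dist_lda.html) on a cluster of computers.

Note two differences between the LDA and LSA runs. First, we asked LSA to extract 400 topics but LDA only 100 topics, so the difference in speed is in fact even greater. Second, the LSA implementation in gensim is truly online: if the nature of the input stream changes over time, LSA will re-orient itself to reflect these changes in a reasonably small number of updates. In contrast, LDA is not truly online (the name of the [[3]](https://radimrehurek.com/gensim/wiki.html#id8) article notwithstanding), because the impact of later updates on the model gradually diminishes. If there is topic drift in the input document stream, LDA will get confused and become increasingly slow at adjusting itself to the new state of affairs.

In short, be careful when using LDA to incrementally add new documents to the model over time. **Batch usage of LDA**, where the entire training corpus is either known beforehand or does not exhibit topic drift, **is fine and not affected**.

To run batch LDA (not online), train `LdaModel` with:

```py
>>> # extract 100 LDA topics, using 20 full passes, no online updates
>>> lda = gensim.models.ldamodel.LdaModel(corpus=mm, id2word=id2word, num_topics=100, update_every=0, passes=20)
```

As usual, a trained model can be used to transform new, unseen documents (plain bag-of-words count vectors) into LDA topic distributions:

```py
>>> doc_lda = lda[doc_bow]
```

---

[1] *([1](https://radimrehurek.com/gensim/wiki.html#id1), [2](https://radimrehurek.com/gensim/wiki.html#id4))* My laptop = MacBook Pro, Intel Core i7 2.3GHz, 16GB DDR3 RAM, OS X with libVec.

[[2]](https://radimrehurek.com/gensim/wiki.html#id2) Here we are mostly interested in performance, but it is interesting to look at the retrieved LSA concepts, too. I am no Wikipedia expert and don't see into Wikipedia's inner workings, but Brian Mingus had this to say about the result:

```text
There appears to be a lot of noise in your dataset. The first three topics
in your list appear to be meta topics, concerning the administration and
cleanup of Wikipedia. These show up because you didn't exclude templates
such as these, some of which are included in most articles for quality
control: http://en.wikipedia.org/wiki/Wikipedia:Template_messages/Cleanup

The fourth and fifth topics clearly shows the influence of bots that import
massive databases of cities, countries, etc. and their statistics such as
population, capita, etc.

The sixth shows the influence of sports bots, and the seventh of music bots.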
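```

So the top ten concepts are apparently dominated by Wikipedia robots and expanded templates; this is a good reminder that LSA is a powerful tool for data analysis, but no silver bullet. As always, it's [garbage in, garbage out](https://en.wikipedia.org/wiki/Garbage_In,_Garbage_Out)... By the way, improvements to the Wiki markup parsing code are welcome :-)

[3] *([1](https://radimrehurek.com/gensim/wiki.html#id3), [2](https://radimrehurek.com/gensim/wiki.html#id5))* Hoffman, Blei, Bach. 2010. Online learning for Latent Dirichlet Allocation [[pdf](https://www.cs.princeton.edu/~blei/papers/HoffmanBleiBach2010b.pdf)] [[code](https://www.cs.princeton.edu/~mdhoffma/)]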