# Corpora and Vector Spaces

This tutorial is also available as a [Jupyter Notebook](https://github.com/piskvorky/gensim/blob/develop/docs/notebooks/Corpora_and_Vector_Spaces.ipynb).

Don't forget to set

```py
>>> import logging
>>> logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)
```

if you want to see logging events.

## [From Strings to Vectors](https://radimrehurek.com/gensim/tut1.html#from-strings-to-vectors "Permalink to this headline")

This time, let's start from documents represented as strings:

```py
>>> from gensim import corpora
>>>
>>> documents = ["Human machine interface for lab abc computer applications",
>>>              "A survey of user opinion of computer system response time",
>>>              "The EPS user interface management system",
>>>              "System and human system engineering testing of EPS",
>>>              "Relation of user perceived response time to error measurement",
>>>              "The generation of random binary unordered trees",
>>>              "The intersection graph of paths in trees",
>>>              "Graph minors IV Widths of trees and well quasi ordering",
>>>              "Graph minors A survey"]
```

This is a tiny corpus of nine documents, each consisting of a single sentence.

First, let's tokenize the documents, removing common words (using a toy stoplist) as well as words that appear only once in the corpus:

```py
>>> # remove common words and tokenize
>>> stoplist = set('for a of the and to in'.split())
>>> texts = [[word for word in document.lower().split() if word not in stoplist]
>>>          for document in documents]
>>>
>>> # remove words that appear only once
>>> from collections import defaultdict
>>> frequency = defaultdict(int)
>>> for text in texts:
>>>     for token in text:
>>>         frequency[token] += 1
>>>
>>> texts = [[token for token in text if frequency[token] > 1]
>>>          for text in texts]
>>>
>>> from pprint import pprint  # pretty-printer
>>> pprint(texts)
[['human', 'interface', 'computer'],
 ['survey', 'user', 'computer', 'system', 'response', 'time'],
 ['eps', 'user', 'interface', 'system'],
 ['system', 'human', 'system', 'eps'],
 ['user', 'response', 'time'],
 ['trees'],
 ['graph', 'trees'],
 ['graph', 'minors', 'trees'],
 ['graph', 'minors', 'survey']]
```
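As a side note, the same preprocessing can be written a little more idiomatically with the standard library's `collections.Counter` instead of a `defaultdict` loop. This is a minimal standalone sketch of the steps above, producing the same filtered token lists:

```python
from collections import Counter

documents = [
    "Human machine interface for lab abc computer applications",
    "A survey of user opinion of computer system response time",
    "The EPS user interface management system",
    "System and human system engineering testing of EPS",
    "Relation of user perceived response time to error measurement",
    "The generation of random binary unordered trees",
    "The intersection graph of paths in trees",
    "Graph minors IV Widths of trees and well quasi ordering",
    "Graph minors A survey",
]
stoplist = set('for a of the and to in'.split())

# tokenize on whitespace, lowercase, and drop stop words
texts = [[w for w in doc.lower().split() if w not in stoplist] for doc in documents]

# count total occurrences of each token across the whole corpus
frequency = Counter(tok for text in texts for tok in text)

# keep only tokens that occur more than once
texts = [[tok for tok in text if frequency[tok] > 1] for text in texts]

print(texts[0])  # ['human', 'interface', 'computer']
```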
The way you process your documents will likely vary; here, I only split on whitespace to tokenize, then lowercase each word. In fact, I use this particular (simple and inefficient) setup to mimic the experiment done in Deerwester et al.'s original LSA article [[1]](https://radimrehurek.com/gensim/tut1.html#id3).

The ways to process documents are so varied and application- and language-dependent that I decided to *not* constrain them by any interface. Instead, a document is represented by the features extracted from it, not by its "surface" string form: how you get to the features is up to you. Below I describe one common, general-purpose approach (called *bag-of-words*), but keep in mind that different application domains call for different features, and, as always, it's [garbage in, garbage out](https://en.wikipedia.org/wiki/Garbage_In,_Garbage_Out)...

To convert documents to vectors, we'll use a document representation called [bag-of-words](https://en.wikipedia.org/wiki/Bag_of_words). In this representation, each document is represented by one vector where each vector element represents a question-answer pair, in the style of:

> "How many times does the word *system* appear in the document? Once."

It is advantageous to represent the questions only by their (integer) ids. The mapping between the questions and ids is called a dictionary:

```py
>>> dictionary = corpora.Dictionary(texts)
>>> dictionary.save('/tmp/deerwester.dict')  # store the dictionary, for future reference
>>> print(dictionary)
Dictionary(12 unique tokens)
```

Here we assigned a unique integer id to all words appearing in the corpus with the [`gensim.corpora.dictionary.Dictionary`](https://radimrehurek.com/gensim/corpora/dictionary.html#gensim.corpora.dictionary.Dictionary "gensim.corpora.dictionary.Dictionary") class. This sweeps across the texts, collecting word counts and relevant statistics. In the end, we see there are 12 distinct words in the processed corpus, which means each document will be represented by 12 numbers (i.e., by a 12-D vector). To see the mapping between words and their ids:

```py
>>> print(dictionary.token2id)
{'minors': 11, 'graph': 10, 'system': 5, 'trees': 9, 'eps': 8, 'computer': 0,
 'survey': 4, 'user': 7, 'human': 1, 'time': 6, 'interface': 2, 'response': 3}
```

To actually convert tokenized documents to vectors:

```py
>>> new_doc = "Human computer interaction"
>>> new_vec = dictionary.doc2bow(new_doc.lower().split())
>>> print(new_vec)  # the word "interaction" does not appear in the dictionary and is ignored
[(0, 1), (1, 1)]
```

The function `doc2bow()` simply counts the number of occurrences of each distinct word, converts the word to its integer word id, and returns the result as a sparse vector. The sparse vector `[(0, 1), (1, 1)]` therefore reads: in the document "Human computer interaction", the words *computer* (id 0) and *human* (id 1) appear once; the other ten dictionary words appear (implicitly) zero times.

```py
>>> corpus = [dictionary.doc2bow(text) for text in texts]
>>> corpora.MmCorpus.serialize('/tmp/deerwester.mm', corpus)  # store to disk, for later use
>>> print(corpus)
[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]
```

By now it should be clear that the vector feature with `id=10` stands for the question "How many times does the word *graph* appear in the document?", with the answer being "zero" for the first six documents and "one" for the remaining three. As a matter of fact, we have arrived at exactly the same corpus of vectors as in the [Quick Example](https://radimrehurek.com/gensim/tutorial.html#first-example).

## [Corpus Streaming — One Document at a Time](https://radimrehurek.com/gensim/tut1.html#corpus-streaming-one-document-at-a-time "Permalink to this headline")

Note that the corpus above resides fully in memory, as a plain Python list. In this simple example it doesn't matter much, but just to make things clear, let's assume there are millions of documents in the corpus. Storing all of them in RAM won't do. Instead, let's assume the documents are stored in a file on disk, one document per line. gensim only requires that a corpus must be able to return one document vector at a time:

```py
>>> class MyCorpus(object):
>>>     def __iter__(self):
>>>         for line in open('mycorpus.txt'):
>>>             # assume there's one document per line, tokens separated by whitespace
>>>             yield dictionary.doc2bow(line.lower().split())
```

Download the sample [mycorpus.txt file here](https://radimrehurek.com/gensim/mycorpus.txt). The assumption that each document occupies one line in a single file is not important; you can mold the `__iter__` function to fit your input format, whatever it is: walking directories, parsing XML, accessing the network... Just parse your input to retrieve a clean list of tokens in each document, then convert the tokens via the dictionary to their ids and yield the resulting sparse vector inside `__iter__`.

```py
>>> corpus_memory_friendly = MyCorpus()  # doesn't load the corpus into memory!
>>> print(corpus_memory_friendly)
```

The corpus is now an object. We didn't define any way to print it, so `print` just outputs the address of the object in memory. Not very useful. To see the constituent vectors, let's iterate over the corpus and print each document vector (one at a time):

```py
>>> for vector in corpus_memory_friendly:  # load one vector into memory at a time
...     print(vector)
[(0, 1), (1, 1), (2, 1)]
[(0, 1), (3, 1), (4, 1), (5, 1), (6, 1), (7, 1)]
[(2, 1), (5, 1), (7, 1), (8, 1)]
[(1, 1), (5, 2), (8, 1)]
[(3, 1), (6, 1), (7, 1)]
[(9, 1)]
[(9, 1), (10, 1)]
[(9, 1), (10, 1), (11, 1)]
[(4, 1), (10, 1), (11, 1)]
```

Although the output is the same as for the plain Python list, the corpus is now much more memory friendly, because at most one vector resides in RAM at a time. Your corpus can now be as large as you want.

Similarly, to construct the dictionary without loading all the texts into memory:

```py
>>> from six import iteritems
>>> # collect statistics about all tokens
>>> dictionary = corpora.Dictionary(line.lower().split() for line in open('mycorpus.txt'))
>>> # remove stop words and words that appear only once
>>> stop_ids = [dictionary.token2id[stopword] for stopword in stoplist
>>>             if stopword in dictionary.token2id]
>>> once_ids = [tokenid for tokenid, docfreq in iteritems(dictionary.dfs) if docfreq == 1]
>>> dictionary.filter_tokens(stop_ids + once_ids)  # remove stop words and words that appear only once
>>> dictionary.compactify()  # remove gaps in id sequence after words that were removed
>>> print(dictionary)
Dictionary(12 unique tokens)
```

And that is all there is to it! At least as far as the bag-of-words representation is concerned. Of course, what we do with such a corpus is another question; it is not at all clear how counting the frequencies of distinct words could be useful. As it turns out, it isn't, and we will need to apply a transformation to this simple representation first, before we can use it to compute any meaningful document vs. document similarities. Transformations are covered in the [next tutorial](https://radimrehurek.com/gensim/tut2.html), but before that, let's briefly turn our attention to *corpus persistency*.

## [Corpus Formats](https://radimrehurek.com/gensim/tut1.html#corpus-formats "Permalink to this headline")

There exist several file formats for serializing a Vector Space corpus (a sequence of vectors) to disk. gensim implements them via the *streaming corpus interface* mentioned earlier: documents are read from (resp. stored to) disk in a lazy fashion, one document at a time, without the whole corpus being read into main memory at once.

One of the more notable file formats is the [Matrix Market format](http://math.nist.gov/MatrixMarket/formats.html). To save a corpus in the Matrix Market format:

```py
>>> # create a toy corpus of 2 documents, as a plain Python list
>>> corpus = [[(1, 0.5)], []]  # make one document empty, for the heck of it
>>>
>>> corpora.MmCorpus.serialize('/tmp/corpus.mm', corpus)
```

Other formats include [Joachim's SVMlight format](http://svmlight.joachims.org/), [Blei's LDA-C format](https://www.cs.princeton.edu/~blei/lda-c/) and [GibbsLDA++ format](http://gibbslda.sourceforge.net/).

```py
>>> corpora.SvmLightCorpus.serialize('/tmp/corpus.svmlight', corpus)
>>> corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)
>>> corpora.LowCorpus.serialize('/tmp/corpus.low', corpus)
```

Conversely, to load a corpus iterator from a Matrix Market file:

```py
>>> corpus = corpora.MmCorpus('/tmp/corpus.mm')
```

Corpus objects are streams, so typically you won't be able to print them directly:

```py
>>> print(corpus)
MmCorpus(2 documents, 2 features, 1 non-zero entries)
```

Instead, to view the contents of a corpus:

```py
>>> # one way of printing a corpus: load it entirely into memory
>>> print(list(corpus))  # calling list() will convert any sequence to a plain Python list
[[(1, 0.5)], []]
```

or

```py
>>> # another way of doing it: print one document at a time, making use of the streaming interface
>>> for doc in corpus:
...     print(doc)
[(1, 0.5)]
[]
```

The second way is obviously more memory-friendly, but for testing and development purposes, nothing beats the simplicity of calling `list(corpus)`.

To save the same Matrix Market document stream in Blei's LDA-C format,

```py
>>> corpora.BleiCorpus.serialize('/tmp/corpus.lda-c', corpus)
```

In this way, gensim can also be used as a memory-efficient **I/O format conversion tool**: just load a document stream using one format and immediately save it in another format. Adding new formats is dead easy; check out the [code for the SVMlight corpus](https://github.com/piskvorky/gensim/blob/develop/gensim/corpora/svmlightcorpus.py) for an example.

## [Compatibility with NumPy and SciPy](https://radimrehurek.com/gensim/tut1.html#compatibility-with-numpy-and-scipy "Permalink to this headline")

gensim also contains [efficient utility functions](https://radimrehurek.com/gensim/matutils.html) to help converting from/to numpy matrices:

```py
>>> import gensim
>>> import numpy as np
>>> numpy_matrix = np.random.randint(10, size=[5, 2])  # random matrix as an example
>>> corpus = gensim.matutils.Dense2Corpus(numpy_matrix)
>>> numpy_matrix = gensim.matutils.corpus2dense(corpus, num_terms=number_of_corpus_features)
```

and from/to scipy.sparse matrices:

```py
>>> import scipy.sparse
>>> scipy_sparse_matrix = scipy.sparse.random(5, 2)  # random sparse matrix as example
>>> corpus = gensim.matutils.Sparse2Corpus(scipy_sparse_matrix)
>>> scipy_csc_matrix = gensim.matutils.corpus2csc(corpus)
```

---

For a complete reference (want to prune the dictionary to a smaller size? optimize the conversion between corpora and NumPy/SciPy arrays?), see the [API documentation](https://radimrehurek.com/gensim/apiref.html). Or continue to the next tutorial on [Topics and Transformations](https://radimrehurek.com/gensim/tut2.html).

[[1]](https://radimrehurek.com/gensim/tut1.html#id1) This is the same corpus as used in [Deerwester et al. (1990): Indexing by Latent Semantic Analysis](http://www.cs.bham.ac.uk/~pxt/IDA/lsa_ind.pdf), Table 2.
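For intuition about what the dense/sparse conversion utilities do, here is a dependency-free sketch in plain Python. The helper names `dense2corpus` and `corpus2dense_py` are hypothetical stand-ins; the real implementations live in `gensim.matutils` and operate on NumPy arrays. The idea is that each column of a terms × documents matrix becomes one sparse `(term_id, value)` document, and the inverse conversion rebuilds the dense matrix:

```python
def dense2corpus(matrix):
    """Columns of `matrix` (terms x documents) -> list of sparse (term_id, value) docs."""
    num_terms = len(matrix)
    num_docs = len(matrix[0])
    return [[(t, matrix[t][d]) for t in range(num_terms) if matrix[t][d] != 0]
            for d in range(num_docs)]

def corpus2dense_py(corpus, num_terms):
    """Sparse (term_id, value) docs -> dense terms x documents matrix (lists of lists)."""
    dense = [[0] * len(corpus) for _ in range(num_terms)]
    for d, doc in enumerate(corpus):
        for t, value in doc:
            dense[t][d] = value
    return dense

matrix = [[1, 0],
          [0, 2],
          [3, 0]]  # 3 terms, 2 documents
corpus = dense2corpus(matrix)
print(corpus)  # [[(0, 1), (2, 3)], [(1, 2)]]
assert corpus2dense_py(corpus, num_terms=3) == matrix  # round-trip recovers the matrix
```

Note the zero entries are simply omitted in the sparse form, which is exactly why the streamed corpora above stay small for large, mostly-zero term-document matrices.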