n-gram · Deep Learning with Python

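An n-gram is a group of *N* (or fewer) consecutive words extracted from a sentence. The "words" in this definition can also be replaced by "characters".

> "The cat sat on the mat."

**2-grams:**
~~~
{"The", "The cat", "cat", "cat sat", "sat", "sat on", "on", "on the", "the", "the mat", "mat"}
~~~
**3-grams:**
~~~
{"The", "The cat", "cat", "cat sat", "The cat sat", "sat", "sat on", "on", "cat sat on", "on the", "the", "sat on the", "the mat", "mat", "on the mat"}
~~~

* Sets like these are called a **bag-of-2-grams** and a **bag-of-3-grams**, respectively.
* The term **bag** means that we are dealing with a set of tokens rather than a list or sequence: the tokens have **no particular order**.
* Bag-of-words is a tokenization method that **does not preserve order**: the generated tokens form a set, not a sequence, so the overall structure of the sentence is discarded.

*****

* With **lightweight**, shallow text-processing models (such as logistic regression and random forests), n-grams are a powerful and indispensable feature-engineering tool.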
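
The two bags listed above can be reproduced with a few lines of Python. This is only a minimal sketch: the function name `bag_of_ngrams` is hypothetical, and the tokenization (strip the trailing period, split on whitespace, keep case so that "The" and "the" stay distinct) is an assumption chosen to match the example sentence, not a general-purpose tokenizer.

```python
def bag_of_ngrams(sentence, n):
    """Return the set of all word n-grams of length 1..n in `sentence`.

    Hypothetical helper for illustration only.
    """
    # Assumed, naive tokenization: drop a trailing period, split on whitespace.
    tokens = sentence.rstrip(".").split()
    bag = set()
    # "N or fewer" consecutive words: collect every window size from 1 up to n.
    for size in range(1, n + 1):
        for start in range(len(tokens) - size + 1):
            bag.add(" ".join(tokens[start:start + size]))
    # A set is returned, so the order of the tokens is deliberately lost.
    return bag


sentence = "The cat sat on the mat."
bag_2 = bag_of_ngrams(sentence, 2)  # bag-of-2-grams
bag_3 = bag_of_ngrams(sentence, 3)  # bag-of-3-grams
```

Because the result is a `set`, comparing `bag_2` with the 2-gram listing above succeeds regardless of the order in which the elements are written, which is exactly the "bag" property described in the bullets.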