利用馬爾可夫鏈生成隨機文本 · 程序算法藝術與實踐

### 問題描述二階馬爾可夫鏈：**例如：of the people, by thepeople, for the people** ### 分析 Prefix(后綴數組)???? Suffix of the?????????? ? ? ? ? ? ? people??????????????????????????????????? 比如 thepeople后面可以跟by for 空，可根據概率選擇一個如果選擇for，則過度到狀態people for，從 the people??????????????? by???????????????????????????????????????? 而可以知道后續為the依次類推，給定一個前綴，可以一直往下找，直到”空”為止 people, by???????????????? the????????????????????????????????????????**幾種實現方法：(note:C++ stl中可用map>實現)** by ?the? ??????? people???????? 1)普通的hash表，記錄前綴與后綴(比如前綴thepeople的后綴包括by, for,空)，給定前綴可以通過hash the people??? for?????????表很快查找到后綴，然后根據概率選擇一個后綴(根據在后綴中出現次數)，過渡到下一個狀態 people for?the?????? ?2)使用后綴數組，數組中每個元素指向一個單詞的開始的位置，**先對后綴數組排序**，然后**用二分查找** for the????????? people????????**得到prefix第一次出現的位置?，**最后往后遍歷根據概率選擇一個后綴。 the people??????????????????????? 空????????3) hash表與后綴數組相結合，使用后綴數組構造hash表 **首先解決一個問題** 當有多個suffix時，如何按照概率選擇一個，比如the people by for 空， ~~~ int nmatch=0; for everyone in suffix if( rand()%++nmatch==0 ) select=this_suffix; ~~~ 對每一個后綴都執行上述的判斷，可知第一個suffix一定被選中，第二個suffix以1/2的概率替換，第三個以1/3的概率替換 ~~~ #include <stdio.h> #include <string.h> #include <stdlib.h> #define NHASH 49979 #define MULT 31 #define MAXWORDS 80000 char inputchars[4300000];//存儲輸入數據 char *word[MAXWORDS];//后綴數組 int nword=0;//記錄單詞數 int k=2;//2階 int next[MAXWORDS];//用于構建hash表 int bin[NHASH]; //以k個單詞為單位，進行hash unsigned int hash(char* str){ int n; unsigned int h=0; char* p=str; for(n=k;n>0;++p){ h=MULT*h+*p; if(*p=='\0') --n; } return h%NHASH; } //比較前k個單詞的大小 int wordncmp(char* p,char *q){ int n; for(n=k;*p==*q;++p,++q){ if(*p=='\0'&&(--n)==0) return 0; } return *p-*q; } //從當前單詞出發，跳過前n個單詞 char* skip(char* p,int n){ for(;n>0;++p){ if(*p=='\0') --n; } return p; } int main(){ int i,j; //步驟1：構建后綴數組 word[0]=inputchars; //scanf以空格作為分隔符, 并且自動加上'\0' while((scanf("%s",word[nword]))!=EOF){ word[nword+1]=word[nword]+strlen(word[nword])+1; ++nword; } //附加k個空字符,保證wordncmp()正確（感覺不需要這個） for(i=0;i<k;++i) word[nword][i]='\0'; //步驟2：構建hash table //初始化hash table for(i=0;i<NHASH;++i) bin[i]=-1; //hash表采用前插的方式。例如：word[0], word[1], word[5]擁有相同的hash值15 //則： bin[15](5)->next[5](1)->next[1](0)->next[0](-1) for(i=0;i<=nword-k;++i) { j=hash(word[i]); next[i]=bin[j]; bin[j]=i; } //步驟3：生成隨機文本 int wordsleft;//生成單詞數 int psofar; char *phrase,*p; phrase=inputchars; for(wordsleft=10000;wordsleft>0;--wordsleft){ psofar=0; for(j=bin[hash(phrase)];j>=0;j=next[j]) //在hash值相同的項中找出字符串值相同的后綴數組表項，根據概率選擇一個 if(wordncmp(phrase,word[j])==0&&rand()%(++psofar)==0) p=word[j]; //將phrase重新設置 phrase=skip(p,1); //輸出符合要求單詞的后面第k個單詞 if(strlen(skip(phrase,k-1))==0) break; printf("%s\n",skip(phrase,k-1)); } return 0; } ~~~ **轉載請注明出處：**[http://blog.csdn.net/utimes/article/details/8864122](http://blog.csdn.net/utimes/article/details/8864122)