This tutorial shows how to use Nutch's readdb, readlinkdb, and readseg commands to analyze the data Nutch has collected.
## 1 readdb
readdb reads or exports Nutch's crawl database (crawldb) and is typically used to check the state of that database. To view the usage of readdb:
~~~
bin/nutch readdb
Usage: CrawlDbReader <crawldb> (-stats | -dump <out_dir> | -topN <nnnn> <out_dir> [<min>] | -url <url>)
    <crawldb>    directory name where crawldb is located
    -stats [-sort]    print overall statistics to System.out
        [-sort]    list status sorted by host
    -dump <out_dir> [-format normal|csv|crawldb]    dump the whole db to a text file in <out_dir>
        [-format csv]    dump in Csv format
        [-format normal]    dump in standard format (default option)
        [-format crawldb]    dump as CrawlDB
        [-regex <expr>]    filter records with expression
        [-retry <num>]    minimum retry count
        [-status <status>]    filter records by CrawlDatum status
    -url <url>    print information on <url> to System.out
    -topN <nnnn> <out_dir> [<min>]    dump top <nnnn> urls sorted by score to <out_dir>
        [<min>]    skip records with scores below this value.
            This can significantly improve performance.
~~~
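Here, crawldb is the database that stores URL information; for details, see http://www.sanesee.com/article/step-by-step-nutch-crawl-by-step (Nutch 1.10 Getting Started Tutorial (5): Step-by-Step Crawling). The -stats option prints summary statistics, -dump exports the whole database as text, and -url prints the record for a single URL. To view the database statistics: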
~~~
bin/nutch readdb data/crawldb -stats
~~~
The resulting statistics look like this:
~~~
Statistics for CrawlDb: data/crawldb
TOTAL urls: 59
retry 0: 59
min score: 0.001
avg score: 0.049677964
max score: 1.124
status 1 (db_unfetched): 34
status 2 (db_fetched): 25
CrawlDb statistics: done
~~~
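TOTAL urls is the total number of URLs; each retry N line counts the URLs that have been retried N times; min score, avg score, and max score are the lowest, average, and highest scores; status 1 (db_unfetched) is the number of URLs not yet fetched, and status 2 (db_fetched) is the number already fetched.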
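To turn these counters into a single progress figure, the statistics can be post-processed with awk. The following is only a minimal sketch: `fetched_ratio` is a hypothetical helper name, and it assumes the line format shown above (the count is the last field on each line).

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): report what fraction of the URLs
# in the crawldb has been fetched. Reads `readdb -stats` output on stdin and
# assumes the count is the last whitespace-separated field of each line.
fetched_ratio() {
  awk '
    /TOTAL urls:/      { total = $NF }
    /\(db_fetched\):/  { fetched = $NF }
    END {
      if (total > 0)
        printf "fetched %d of %d URLs (%.1f%%)\n", fetched, total, 100 * fetched / total
      else
        print "no URLs found in input"
    }'
}

# Sample input copied from the statistics above.
fetched_ratio <<'EOF'
Statistics for CrawlDb: data/crawldb
TOTAL urls: 59
retry 0: 59
min score: 0.001
avg score: 0.049677964
max score: 1.124
status 1 (db_unfetched): 34
status 2 (db_fetched): 25
CrawlDb statistics: done
EOF
```

With a live crawl, the output of `bin/nutch readdb data/crawldb -stats` could be piped into `fetched_ratio` instead of the sample text, provided the statistics reach standard output in your setup.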
To export the contents of the crawldb:
~~~
bin/nutch readdb data/crawldb -dump crawldb_dump
~~~
This writes the data to the crawldb_dump directory. To view the exported data:
~~~
cat crawldb_dump/*
~~~
The exported records look similar to the following:
~~~
http://www.sanesee.com/psy/pdp Version: 7
Status: 2 (db_fetched)
Fetch time: Fri Aug 14 12:47:10 CST 2015
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.082285136
Signature: e567e99a1d008ae29266a7ef9ea43414
Metadata:
    _pst_=success(1), lastModified=0
    _rs_=205
    Content-Type=text/html
~~~
This shows clearly how the crawldb stores each of our URLs.
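The multi-line records are easy to read but awkward to sort or filter. As a rough sketch, the dump can be flattened to one tab-separated line per URL with awk. `dump_to_tsv` is a hypothetical helper name, and the script assumes the record layout shown above: the URL on the same line as `Version:`, followed later by `Status:` and `Score:` lines.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): flatten a crawldb text dump into
# "url<TAB>status<TAB>score" lines. Assumes each record starts with a line
# ending in "Version: N" and that Status precedes Score within a record.
dump_to_tsv() {
  awk '
    /[ \t]Version: [0-9]+$/ { url = $1; next }             # first line of a record
    /^Status: /  { status = $3; gsub(/[()]/, "", status) }  # "(db_fetched)" -> "db_fetched"
    /^Score: /   { print url "\t" status "\t" $2 }          # emit once the score is known
  '
}

# Sample record copied from the dump above (metadata lines omitted).
dump_to_tsv <<'EOF'
http://www.sanesee.com/psy/pdp Version: 7
Status: 2 (db_fetched)
Fetch time: Fri Aug 14 12:47:10 CST 2015
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 0.082285136
Signature: e567e99a1d008ae29266a7ef9ea43414
EOF
```

On a real dump this might be used as `cat crawldb_dump/* | dump_to_tsv | sort -t "$(printf '\t')" -k3,3nr` to list URLs by descending score. The usage text also lists a built-in `-format csv` option for `-dump`, which may be the better choice when CSV is all that is needed.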
## 2 readlinkdb
readlinkdb is used to export all URLs in the link database together with their anchor text. To view its usage:
~~~
bin/nutch readlinkdb
Usage: LinkDbReader <linkdb> (-dump <out_dir> [-regex <regex>]) | -url <url>
    -dump <out_dir>    dump whole link db to a text file in <out_dir>
        -regex <regex>    restrict to url's matching expression
    -url <url>    print information about <url> to System.out
~~~
The -dump and -url options work the same way as in readdb. To export the data:
~~~
bin/nutch readlinkdb data/linkdb -dump linkdb_dump
~~~
This writes the data to the linkdb_dump directory. To view the exported data:
~~~
cat linkdb_dump/*
~~~
The exported records look similar to the following:
~~~
http://archive.apache.org/dist/nutch/ Inlinks:
 fromUrl: http://www.sanesee.com/article/step-by-step-nutch-introduction anchor: http://archive.apache.org/dist/nutch/
~~~
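In other words, for each URL the linkdb records the source URLs that link to it (its inlinks), along with the anchor text of each link.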
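Because each target URL is followed by one `fromUrl:` line per inlink, counting those lines gives a quick picture of which pages are linked to most. The sketch below uses a hypothetical helper, `count_inlinks`, and assumes the dump layout shown above: a line ending in `Inlinks:` followed by indented `fromUrl:` lines.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): count inlinks per target URL in a
# linkdb text dump. Prints "count<TAB>url" in the order the URLs appear.
count_inlinks() {
  awk '
    /Inlinks:[ \t]*$/   { url = $1; order[++n] = url; count[url] = 0; next }
    /^[ \t]*fromUrl: /  { count[url]++ }
    END { for (i = 1; i <= n; i++) print count[order[i]] "\t" order[i] }
  '
}

# Sample record copied from the dump above.
count_inlinks <<'EOF'
http://archive.apache.org/dist/nutch/ Inlinks:
 fromUrl: http://www.sanesee.com/article/step-by-step-nutch-introduction anchor: http://archive.apache.org/dist/nutch/
EOF
```

On a real dump, `cat linkdb_dump/* | count_inlinks | sort -rn | head` would list the most linked-to URLs first.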
## 3 readseg
readseg is used to view or export the data inside a segment. To view its usage:
~~~
bin/nutch readseg
Usage: SegmentReader (-dump ... | -list ... | -get ...) [general options]

* General options:
    -nocontent    ignore content directory
    -nofetch    ignore crawl_fetch directory
    -nogenerate    ignore crawl_generate directory
    -noparse    ignore crawl_parse directory
    -noparsedata    ignore parse_data directory
    -noparsetext    ignore parse_text directory

* SegmentReader -dump <segment_dir> <output> [general options]
  Dumps content of a <segment_dir> as a text file to <output>.

    <segment_dir>    name of the segment directory.
    <output>    name of the (non-existent) output directory.

* SegmentReader -list (<segment_dir1> ... | -dir <segments>) [general options]
  List a synopsis of segments in specified directories, or all segments in
  a directory <segments>, and print it on System.out

    <segment_dir1> ...    list of segment directories to process
    -dir <segments>    directory that contains multiple segments

* SegmentReader -get <segment_dir> <keyValue> [general options]
  Get a specified record from a segment, and print it on System.out.

    <segment_dir>    name of the segment directory.
    <keyValue>    value of the key (url).
        Note: put double-quotes around strings with spaces.
~~~
To export the segment data:
~~~
bin/nutch readseg -dump data/segments/20150715124521 segment_dump
~~~
This writes the data to the segment_dump directory. To view the exported data:
~~~
cat segment_dump/*
~~~
As you can see, the dump contains very detailed information about each fetched page.
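A full dump can be large, so it is often handy to combine the general options from the usage text with -dump, or to use -list and -get instead. The script below is a sketch only: `run`, `dump_parse_text`, `list_segment`, and `get_record` are hypothetical wrapper names, the `NUTCH` and `SEGMENT` variables are assumptions about your layout, and the example URL is taken from the crawldb dump earlier, so it may or may not be present in this particular segment. When the Nutch launcher is not found, the script prints each command instead of running it.

```shell
#!/bin/sh
# Assumed locations; override via the environment to match your installation.
NUTCH=${NUTCH:-bin/nutch}
SEGMENT=${SEGMENT:-data/segments/20150715124521}

# Run a Nutch command if the launcher is executable; otherwise print it (dry run).
run() {
  if [ -x "$NUTCH" ]; then
    "$NUTCH" "$@"
  else
    echo "$NUTCH $*"
  fi
}

# Dump only the parsed text by telling readseg to ignore every other directory.
dump_parse_text() {
  run readseg -dump "$SEGMENT" "$1" -nocontent -nofetch -nogenerate -noparse -noparsedata
}

# Print a synopsis of the segment.
list_segment() {
  run readseg -list "$SEGMENT"
}

# Print the record stored for a single URL.
get_record() {
  run readseg -get "$SEGMENT" "$1"
}

dump_parse_text segment_text_dump
list_segment
get_record "http://www.sanesee.com/psy/pdp"
```

Every flag used here appears in the usage output above; only the wrapper functions and variable names are invented for the example.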
That concludes this tutorial's introduction to the most important Nutch commands; readers can explore the remaining commands on their own.