Please credit the source when reposting: [http://blog.csdn.net/xiaojimanman/article/details/46785223](http://blog.csdn.net/xiaojimanman/article/details/46785223)
[http://www.llwjy.com/blogdetail/efda32f346445dd8423a942aa4c8c2cd.html](http://www.llwjy.com/blogdetail/efda32f346445dd8423a942aa4c8c2cd.html)
My personal blog is now online at [www.llwjy.com](http://www.llwjy.com). Comments and criticism are welcome.
-------------------------------------------------------------------------------------------------
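First, an apology: I have been travelling for work a lot lately, so the blog went without updates for a long time. Barring surprises, regular updates will resume from here.
The previous post covered the database table structure for the Zongheng novel data. One thing to note: when designing the tables I dropped the foreign keys between them. I won't go into the reasons here; if you are curious, a quick web search will turn them up. Now for today's topic: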
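**Model classes**
Before getting to the database operations, let's look at the model classes (JavaBeans) we define. There are four: the crawl entry-point model, the novel introduction page model, the novel chapter list model, and the novel reading page model. Each class contains only simple getter and setter methods. The code is as follows: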
1.CrawlListInfo
~~~
/**
 * @Description: Model for a crawl entry point (a monitored update-list page)
 */
package com.lulei.crawl.novel.zongheng.model;

public class CrawlListInfo {
    private String url;
    private String info;
    private int frequency;

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public String getInfo() {
        return info;
    }

    public void setInfo(String info) {
        this.info = info;
    }

    public int getFrequency() {
        return frequency;
    }

    public void setFrequency(int frequency) {
        this.frequency = frequency;
    }
}
~~~
2.NovelIntroModel
~~~
/**
 * @Description: Model for a novel introduction page
 */
package com.lulei.crawl.novel.zongheng.model;

public class NovelIntroModel {
    private String md5Id;
    private String name;
    private String author;
    private String description;
    private String type;
    private String lastChapter;
    private String chapterlisturl;
    private int wordCount;
    private String keyWords;
    private int chapterCount;

    public String getMd5Id() {
        return md5Id;
    }

    public void setMd5Id(String md5Id) {
        this.md5Id = md5Id;
    }

    public String getName() {
        return name;
    }

    public void setName(String name) {
        this.name = name;
    }

    public String getAuthor() {
        return author;
    }

    public void setAuthor(String author) {
        this.author = author;
    }

    public String getDescription() {
        return description;
    }

    public void setDescription(String description) {
        this.description = description;
    }

    public String getType() {
        return type;
    }

    public void setType(String type) {
        this.type = type;
    }

    public String getLastChapter() {
        return lastChapter;
    }

    public void setLastChapter(String lastChapter) {
        this.lastChapter = lastChapter;
    }

    public String getChapterlisturl() {
        return chapterlisturl;
    }

    public void setChapterlisturl(String chapterlisturl) {
        this.chapterlisturl = chapterlisturl;
    }

    public int getWordCount() {
        return wordCount;
    }

    public void setWordCount(int wordCount) {
        this.wordCount = wordCount;
    }

    public String getKeyWords() {
        return keyWords;
    }

    public void setKeyWords(String keyWords) {
        this.keyWords = keyWords;
    }

    public int getChapterCount() {
        return chapterCount;
    }

    public void setChapterCount(int chapterCount) {
        this.chapterCount = chapterCount;
    }
}
~~~
3.NovelChapterModel
~~~
/**
 * @Description: Model for one entry of a novel's chapter list
 */
package com.lulei.crawl.novel.zongheng.model;

public class NovelChapterModel {
    private String url;
    private int chapterId;
    private long time;

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public int getChapterId() {
        return chapterId;
    }

    public void setChapterId(int chapterId) {
        this.chapterId = chapterId;
    }

    public long getTime() {
        return time;
    }

    public void setTime(long time) {
        this.time = time;
    }
}
~~~
4.NovelReadModel
~~~
/**
 * @Description: Model for a novel reading page (extends the chapter model)
 */
package com.lulei.crawl.novel.zongheng.model;

public class NovelReadModel extends NovelChapterModel {
    private String title;
    private int wordCount;
    private String content;

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public int getWordCount() {
        return wordCount;
    }

    public void setWordCount(int wordCount) {
        this.wordCount = wordCount;
    }

    public String getContent() {
        return content;
    }

    public void setContent(String content) {
        this.content = content;
    }
}
~~~
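**Database operations**
For database access we use the connection pool introduced in the post [Lucene-based case development: database connection pool](http://www.llwjy.com/blogdetail/9f4d773be6ae1408b4b70ddd789360f4.html). The crawling workflow mostly inserts and queries, plus updates to a record's state value. Below, one method is shown for each kind of operation, to make it easier to see how our own connection pool is used to create, read, update, and delete records.
1. Querying a table: fetch a random record
We want the crawler to support distributed collection later, so when fetching an introduction-page URL we pick a random one each time. This greatly reduces the chance of several threads crawling the same URL at once. In MySQL, `order by rand() limit n` returns n randomly chosen rows; other databases do this slightly differently.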
~~~
/**
 * @param state
 * @return
 * @Author:lulei
 * @Description: Fetch a random introduction-page URL
 */
public String getRandIntroPageUrl(int state) {
    DBServer dbServer = new DBServer(POOLNAME);
    try {
        String sql = "select * from novelinfo where state = '" + state + "' order by rand() limit 1";
        ResultSet rs = dbServer.select(sql);
        // The query returns at most one row
        if (rs.next()) {
            return rs.getString("url");
        }
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        dbServer.close();
    }
    return null;
}
~~~
This method simply calls DBServer's select(String sql) method to run the SQL statement; its return value is the query's result set.
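To make the effect of `order by rand() limit n` more concrete, here is a minimal in-memory sketch of the same idea: shuffle the pending URLs, then take the first n. `RandomPickSketch` and `pickRandom` are hypothetical names used only for illustration, and the example URLs are made up; in the real crawler the selection happens inside MySQL.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

// Hypothetical in-memory analogue of "order by rand() limit n"; not part of the project.
class RandomPickSketch {

    // Returns up to n elements of pending in random order, leaving pending untouched.
    static List<String> pickRandom(List<String> pending, int n, Random random) {
        List<String> copy = new ArrayList<String>(pending);
        Collections.shuffle(copy, random);                         // "order by rand()"
        int limit = Math.min(Math.max(n, 0), copy.size());
        return new ArrayList<String>(copy.subList(0, limit));      // "limit n"
    }

    public static void main(String[] args) {
        List<String> pending = new ArrayList<String>();
        for (int i = 0; i < 100; i++) {
            pending.add("http://example.com/book/" + i);           // made-up URLs
        }
        // Simulate two independent workers that each pick one random URL.
        Random random = new Random();
        int trials = 100000;
        int collisions = 0;
        for (int t = 0; t < trials; t++) {
            String a = pickRandom(pending, 1, random).get(0);
            String b = pickRandom(pending, 1, random).get(0);
            if (a.equals(b)) {
                collisions++;
            }
        }
        System.out.println("collision rate: " + (double) collisions / trials);
    }
}
```

With 100 pending URLs, two independent picks land on the same URL about 1 time in 100. Random selection therefore makes duplicates rarer but does not rule them out.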
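2. Updating a table: change the crawl state of an introduction page
After an introduction page has been crawled, or when the update-list page detects that it has changed, we need to update the page's crawl state to mark it as crawled or as needing a crawl. DBServer's update(String sql) method runs the corresponding SQL statement directly.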
~~~
/**
 * @param md5Id
 * @param state
 * @Author:lulei
 * @Description: Change the crawl state of an introduction page
 */
public void updateInfoState(String md5Id, int state) {
    DBServer dbServer = new DBServer(POOLNAME);
    try {
        String sql = "update novelinfo set state = '" + state + "' where id = '" + md5Id + "'";
        dbServer.update(sql);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        dbServer.close();
    }
}
~~~
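Because the update statement is assembled by string concatenation, it is only safe as long as the id really is an MD5 hex string. The following is a minimal sketch of a guard for that, assuming ids are always 32 hexadecimal characters (which is what the name `parseStrToMd5L32` suggests, though the source does not show that method). `UpdateSqlSketch` and `buildStateUpdate` are hypothetical helpers, not part of DBServer.

```java
import java.util.regex.Pattern;

// Hypothetical helper: builds the same update statement as above, but rejects
// ids that are not 32 hex characters before they reach the SQL text.
class UpdateSqlSketch {

    // Assumption: every id is an MD5 digest written as 32 hexadecimal characters.
    private static final Pattern MD5_HEX = Pattern.compile("[0-9a-fA-F]{32}");

    // The table name is expected to be a trusted constant supplied by our own code.
    static String buildStateUpdate(String table, String md5Id, int state) {
        if (md5Id == null || !MD5_HEX.matcher(md5Id).matches()) {
            throw new IllegalArgumentException("not a 32-character hex id: " + md5Id);
        }
        return "update " + table + " set state = '" + state + "' where id = '" + md5Id + "'";
    }

    public static void main(String[] args) {
        System.out.println(buildStateUpdate("novelinfo", "efda32f346445dd8423a942aa4c8c2cd", 0));
    }
}
```

The resulting string could then be passed to `dbServer.update(sql)` exactly as in the method above.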
3. Inserting into a table: save the novel reading page
After parsing a novel reading page, we need to persist the parsed data to the database. DBServer's insert(String table, String columns, HashMap<Integer, Object> params) method performs the insert.
~~~
/**
 * @param novel
 * @Author:lulei
 * @Description: Save the novel reading page
 */
public void saveNovelRead(NovelReadModel novel) {
    if (novel == null) {
        return;
    }
    DBServer dbServer = new DBServer(POOLNAME);
    try {
        HashMap<Integer, Object> params = new HashMap<Integer, Object>();
        int i = 1;
        String md5Id = ParseMD5.parseStrToMd5L32(novel.getUrl());
        // If the record already exists, return without inserting
        if (haveReadUrl(md5Id)) {
            return;
        }
        long now = System.currentTimeMillis();
        // Parameters are keyed 1..n in the same order as the column list below
        params.put(i++, md5Id);
        params.put(i++, novel.getUrl());
        params.put(i++, novel.getTitle());
        params.put(i++, novel.getWordCount());
        params.put(i++, novel.getChapterId());
        params.put(i++, novel.getContent());
        params.put(i++, novel.getTime());
        params.put(i++, now);
        params.put(i++, now);
        dbServer.insert("novelchapterdetail", "id,url,title,wordcount,chapterid,content,chaptertime,createtime,updatetime", params);
    } catch (Exception e) {
        e.printStackTrace();
    } finally {
        dbServer.close();
    }
}
~~~
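The internals of `DBServer.insert` are not shown in this post, so the following is only a guess at how a column list and a position-keyed parameter map could line up, assuming a JDBC `PreparedStatement` style with one `?` placeholder per column. `InsertSqlSketch` and `buildInsert` are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: turns a table name and column list into a placeholder
// statement and checks that the map holds exactly the keys 1..n.
class InsertSqlSketch {

    static String buildInsert(String table, String columns, Map<Integer, Object> params) {
        String[] cols = columns.split(",");
        if (params.size() != cols.length) {
            throw new IllegalArgumentException(
                    "expected " + cols.length + " parameters but got " + params.size());
        }
        for (int i = 1; i <= cols.length; i++) {
            if (!params.containsKey(i)) {
                throw new IllegalArgumentException(
                        "missing parameter " + i + " for column " + cols[i - 1].trim());
            }
        }
        StringBuilder sql = new StringBuilder();
        sql.append("insert into ").append(table)
           .append(" (").append(columns).append(") values (");
        for (int i = 0; i < cols.length; i++) {
            sql.append(i == 0 ? "?" : ", ?");   // one placeholder per column
        }
        return sql.append(")").toString();
    }

    public static void main(String[] args) {
        Map<Integer, Object> params = new HashMap<Integer, Object>();
        int i = 1;
        params.put(i++, "efda32f346445dd8423a942aa4c8c2cd");
        params.put(i++, "http://example.com/chapter/1");   // made-up URL
        // prints: insert into novelinfo (id, url) values (?, ?)
        System.out.println(buildInsert("novelinfo", "id, url", params));
    }
}
```

The check on keys 1..n mirrors the `params.put(i++, ...)` pattern used above: a value that is added out of order or skipped would otherwise be bound to the wrong column.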
**The complete picture**
The complete database operations class for Zongheng novels is as follows:
~~~
/**
 * @Description: Database operations for Zongheng Chinese novels
 */
package com.lulei.db.novel.zongheng;

import java.sql.ResultSet;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;

import com.lulei.crawl.novel.zongheng.model.CrawlListInfo;
import com.lulei.crawl.novel.zongheng.model.NovelChapterModel;
import com.lulei.crawl.novel.zongheng.model.NovelIntroModel;
import com.lulei.crawl.novel.zongheng.model.NovelReadModel;
import com.lulei.db.manager.DBServer;
import com.lulei.util.ParseMD5;

public class ZonghengDb {
    private static final String POOLNAME = "proxool.test";

    /**
     * @param urls
     * @Author:lulei
     * @Description: Save the URLs collected from the update-list page
     */
    public void saveInfoUrls(List<String> urls) {
        if (urls == null || urls.size() < 1) {
            return;
        }
        for (String url : urls) {
            String md5Id = ParseMD5.parseStrToMd5L32(url);
            if (haveInfoUrl(md5Id)) {
                // Already recorded: set its state back to 1
                updateInfoState(md5Id, 1);
            } else {
                insertInfoUrl(md5Id, url);
            }
        }
    }
    /**
     * @param state
     * @return
     * @Author:lulei
     * @Description: Fetch a random introduction-page URL
     */
    public String getRandIntroPageUrl(int state) {
        DBServer dbServer = new DBServer(POOLNAME);
        try {
            String sql = "select * from novelinfo where state = '" + state + "' order by rand() limit 1";
            ResultSet rs = dbServer.select(sql);
            // The query returns at most one row
            if (rs.next()) {
                return rs.getString("url");
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            dbServer.close();
        }
        return null;
    }

    /**
     * @param state
     * @return
     * @Author:lulei
     * @Description: Fetch a random chapter record
     */
    public NovelChapterModel getRandReadPageUrl(int state) {
        DBServer dbServer = new DBServer(POOLNAME);
        try {
            String sql = "select * from novelchapter where state = '" + state + "' order by rand() limit 1";
            ResultSet rs = dbServer.select(sql);
            // The query returns at most one row
            if (rs.next()) {
                NovelChapterModel chapter = new NovelChapterModel();
                chapter.setChapterId(rs.getInt("chapterid"));
                chapter.setTime(rs.getLong("chaptertime"));
                chapter.setUrl(rs.getString("url"));
                return chapter;
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            dbServer.close();
        }
        return null;
    }
    /**
     * @param novel
     * @Author:lulei
     * @Description: Save the novel reading page
     */
    public void saveNovelRead(NovelReadModel novel) {
        if (novel == null) {
            return;
        }
        DBServer dbServer = new DBServer(POOLNAME);
        try {
            HashMap<Integer, Object> params = new HashMap<Integer, Object>();
            int i = 1;
            String md5Id = ParseMD5.parseStrToMd5L32(novel.getUrl());
            // If the record already exists, return without inserting
            if (haveReadUrl(md5Id)) {
                return;
            }
            long now = System.currentTimeMillis();
            // Parameters are keyed 1..n in the same order as the column list below
            params.put(i++, md5Id);
            params.put(i++, novel.getUrl());
            params.put(i++, novel.getTitle());
            params.put(i++, novel.getWordCount());
            params.put(i++, novel.getChapterId());
            params.put(i++, novel.getContent());
            params.put(i++, novel.getTime());
            params.put(i++, now);
            params.put(i++, now);
            dbServer.insert("novelchapterdetail", "id,url,title,wordcount,chapterid,content,chaptertime,createtime,updatetime", params);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            dbServer.close();
        }
    }
    /**
     * @return
     * @Author:lulei
     * @Description: Get the monitored update-list pages
     */
    public List<CrawlListInfo> getCrawlListInfos() {
        List<CrawlListInfo> infos = new ArrayList<CrawlListInfo>();
        DBServer dbServer = new DBServer(POOLNAME);
        try {
            String sql = "select * from crawllist where state = '1'";
            ResultSet rs = dbServer.select(sql);
            while (rs.next()) {
                CrawlListInfo info = new CrawlListInfo();
                infos.add(info);
                info.setFrequency(rs.getInt("frequency"));
                info.setInfo(rs.getString("info"));
                info.setUrl(rs.getString("url"));
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            dbServer.close();
        }
        return infos;
    }

    /**
     * @param bean
     * @Author:lulei
     * @Description: Update an introduction-page record
     */
    public void updateInfo(NovelIntroModel bean) {
        if (bean == null) {
            return;
        }
        DBServer dbServer = new DBServer(POOLNAME);
        try {
            HashMap<Integer, Object> params = new HashMap<Integer, Object>();
            int i = 1;
            // Parameters are keyed 1..n in the same order as the column list below
            params.put(i++, bean.getName());
            params.put(i++, bean.getAuthor());
            params.put(i++, bean.getDescription());
            params.put(i++, bean.getType());
            params.put(i++, bean.getLastChapter());
            params.put(i++, bean.getChapterCount());
            params.put(i++, bean.getChapterlisturl());
            params.put(i++, bean.getWordCount());
            params.put(i++, bean.getKeyWords());
            long now = System.currentTimeMillis();
            params.put(i++, now);
            params.put(i++, "0");
            String columns = "name, author, description, type, lastchapter, chaptercount, chapterlisturl, wordcount, keywords, updatetime, state";
            String condition = "where id = '" + bean.getMd5Id() + "'";
            dbServer.update("novelinfo", columns, condition, params);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            dbServer.close();
        }
    }
    /**
     * @param chapters
     * @Author:lulei
     * @Description: Save the chapter list
     */
    public void saveChapters(List<String[]> chapters) {
        if (chapters == null) {
            return;
        }
        DBServer dbServer = new DBServer(POOLNAME);
        try {
            for (int i = 0; i < chapters.size(); i++) {
                String[] chapter = chapters.get(i);
                if (chapter.length != 4) {
                    continue;
                }
                // Array layout: name, wordcount, time, url
                String md5Id = ParseMD5.parseStrToMd5L32(chapter[3]);
                if (!haveChapterUrl(md5Id)) {
                    insertChapterUrl(chapter, i);
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            dbServer.close();
        }
    }

    /**
     * @param md5Id
     * @param state
     * @Author:lulei
     * @Description: Change the crawl state of an introduction page
     */
    public void updateInfoState(String md5Id, int state) {
        DBServer dbServer = new DBServer(POOLNAME);
        try {
            String sql = "update novelinfo set state = '" + state + "' where id = '" + md5Id + "'";
            dbServer.update(sql);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            dbServer.close();
        }
    }

    /**
     * @param md5Id
     * @param state
     * @Author:lulei
     * @Description: Update the crawl state of a chapter-list record
     */
    public void updateChapterState(String md5Id, int state) {
        DBServer dbServer = new DBServer(POOLNAME);
        try {
            String sql = "update novelchapter set state = '" + state + "' where id = '" + md5Id + "'";
            dbServer.update(sql);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            dbServer.close();
        }
    }
    /**
     * @param md5Id
     * @param url
     * @Author:lulei
     * @Description: Add a new introduction page to be crawled
     */
    private void insertInfoUrl(String md5Id, String url) {
        DBServer dbServer = new DBServer(POOLNAME);
        try {
            HashMap<Integer, Object> params = new HashMap<Integer, Object>();
            int i = 1;
            params.put(i++, md5Id);
            params.put(i++, url);
            long now = System.currentTimeMillis();
            params.put(i++, now);
            params.put(i++, now);
            params.put(i++, "1");
            dbServer.insert("novelinfo", "id, url, createtime, updatetime, state", params);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            dbServer.close();
        }
    }

    /**
     * @param md5Id
     * @return
     * @Author:lulei
     * @Description: Check whether the introduction page already exists
     */
    private boolean haveInfoUrl(String md5Id) {
        DBServer dbServer = new DBServer(POOLNAME);
        try {
            ResultSet rs = dbServer.select("select sum(1) as count from novelinfo where id = '" + md5Id + "'");
            if (rs.next()) {
                int count = rs.getInt("count");
                return count > 0;
            }
            return false;
        } catch (Exception e) {
            e.printStackTrace();
            // On error, report the record as existing so the caller does not insert it
            return true;
        } finally {
            dbServer.close();
        }
    }

    /**
     * @param md5Id
     * @return
     * @Author:lulei
     * @Description: Check whether the reading page already exists
     */
    private boolean haveReadUrl(String md5Id) {
        DBServer dbServer = new DBServer(POOLNAME);
        try {
            ResultSet rs = dbServer.select("select sum(1) as count from novelchapterdetail where id = '" + md5Id + "'");
            if (rs.next()) {
                int count = rs.getInt("count");
                return count > 0;
            }
            return false;
        } catch (Exception e) {
            e.printStackTrace();
            // On error, report the record as existing so the caller does not insert it
            return true;
        } finally {
            dbServer.close();
        }
    }
    /**
     * @param chapter
     * @param chapterId
     * @Author:lulei
     * @Description: Insert a chapter-list record
     */
    private void insertChapterUrl(String[] chapter, int chapterId) {
        // Array layout: name, wordcount, time, url
        DBServer dbServer = new DBServer(POOLNAME);
        try {
            HashMap<Integer, Object> params = new HashMap<Integer, Object>();
            int i = 1;
            params.put(i++, ParseMD5.parseStrToMd5L32(chapter[3]));
            params.put(i++, chapter[3]);
            params.put(i++, chapter[0]);
            params.put(i++, chapter[1]);
            params.put(i++, chapterId);
            params.put(i++, chapter[2]);
            long now = System.currentTimeMillis();
            params.put(i++, now);
            params.put(i++, "1");
            dbServer.insert("novelchapter", "id, url, title, wordcount, chapterid, chaptertime, createtime, state", params);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            dbServer.close();
        }
    }

    /**
     * @param md5Id
     * @return
     * @Author:lulei
     * @Description: Check whether the chapter record already exists
     */
    private boolean haveChapterUrl(String md5Id) {
        DBServer dbServer = new DBServer(POOLNAME);
        try {
            ResultSet rs = dbServer.select("select sum(1) as count from novelchapter where id = '" + md5Id + "'");
            if (rs.next()) {
                int count = rs.getInt("count");
                return count > 0;
            }
            return false;
        } catch (Exception e) {
            e.printStackTrace();
            // On error, report the record as existing so the caller does not insert it
            return true;
        } finally {
            dbServer.close();
        }
    }

    public static void main(String[] args) {
        // TODO Auto-generated method stub
    }
}
~~~
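Please read the code above carefully; it contains some simple deduplication logic. The next post will show how to build distributed crawling on top of these database operations.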
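The deduplication in the class above rests on using the MD5 of a URL as the record id. Here is a minimal, self-contained sketch of that idea, with a `HashMap` standing in for the `novelinfo` table. `DedupSketch` and `md5Hex` are hypothetical; `md5Hex` is only an assumed equivalent of `ParseMD5.parseStrToMd5L32`, whose implementation is not shown in this post. The state values (1 for waiting to be crawled, 0 for crawled) are inferred from the code above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical in-memory stand-in for the novelinfo table: id -> state.
class DedupSketch {
    private final Map<String, Integer> table = new HashMap<String, Integer>();

    // Assumed equivalent of ParseMD5.parseStrToMd5L32: 32 lowercase hex characters.
    static String md5Hex(String text) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(text.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(32);
            for (byte b : digest) {
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 is not available", e);
        }
    }

    // Mirrors saveInfoUrls for one URL: insert if new, otherwise reset state to 1.
    // Returns true when the URL had not been seen before.
    boolean saveInfoUrl(String url) {
        String id = md5Hex(url);
        boolean isNew = !table.containsKey(id);
        table.put(id, 1);
        return isNew;
    }

    // Mirrors the state change made by updateInfo after a successful crawl.
    void markCrawled(String url) {
        table.put(md5Hex(url), 0);
    }

    Integer stateOf(String url) {
        return table.get(md5Hex(url));
    }

    int size() {
        return table.size();
    }

    public static void main(String[] args) {
        DedupSketch db = new DedupSketch();
        String url = "http://example.com/book/1";   // made-up URL
        System.out.println(db.saveInfoUrl(url));    // first time: new record
        db.markCrawled(url);
        System.out.println(db.saveInfoUrl(url));    // second time: existing record, state reset
        System.out.println(db.size() + " record(s), state " + db.stateOf(url));
    }
}
```

Because the id is derived from the URL alone, the same URL always maps to the same row, which is what lets the existence checks above avoid duplicate inserts.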
----------------------------------------------------------------------------------------------------
P.S. I recently noticed that other sites may repost this blog without a link to the source. For more posts in the [Lucene-based case development](http://blog.csdn.net/xiaojimanman/article/category/2841877) series, [click here](http://www.llwjy.com/blogtype/lucene.html), or visit http://blog.csdn.net/xiaojimanman/article/category/2841877 or http://www.llwjy.com/blogtype/lucene.html
-------------------------------------------------------------------------------------------------
A small bonus
-------------------------------------------------------------------------------------------------
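My course "Lucene Case Development" (《Lucene案例開發》) on Jikexueyuan is now online (currently up to lesson 2). Feedback is welcome.
[Lesson 1: Lucene overview](http://www.jikexueyuan.com/course/937.html)
[Lesson 2: Common Lucene features](http://www.jikexueyuan.com/course/1292.html)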