[Optional, recommended] pdftotext · WenKuOS (Baidu Wenku-style document library system) source code user guide

# pdftotext

## Purpose

Extracts the text content of PDF files.

## Installation

### Windows

`No installation is needed on Windows`, because at the time of writing I have not found a Windows build. Running without this tool does affect the program, but only slightly, since calibre can also be used to extract the text content from PDFs.

### Linux

~~~
sudo apt install poppler-utils
~~~

### Mac

The Homebrew formula is named `poppler` (it ships `pdftotext`), and Homebrew should not be run with `sudo`.

~~~
brew install poppler
~~~

## Verifying the installation

Run the following command:

~~~
pdftotext --help
~~~

Output like the following means the installation succeeded.

~~~
pdftotext --help
------
pdftotext version 0.41.0
Copyright 2005-2016 The Poppler Developers - http://poppler.freedesktop.org
Copyright 1996-2011 Glyph & Cog, LLC
Usage: pdftotext [options] <PDF-file> [<text-file>]
  -f <int>          : first page to convert
  -l <int>          : last page to convert
  -r <fp>           : resolution, in DPI (default is 72)
  -x <int>          : x-coordinate of the crop area top left corner
  -y <int>          : y-coordinate of the crop area top left corner
  -W <int>          : width of crop area in pixels (default is 0)
  -H <int>          : height of crop area in pixels (default is 0)
  -layout           : maintain original physical layout
  -fixed <fp>       : assume fixed-pitch (or tabular) text
  -raw              : keep strings in content stream order
  -htmlmeta         : generate a simple HTML file, including the meta information
  -enc <string>     : output text encoding name
  -listenc          : list available encodings
  -eol <string>     : output end-of-line convention (unix, dos, or mac)
  -nopgbrk          : don't insert page breaks between pages
  -bbox             : output bounding box for each word and page size to html. Sets -htmlmeta
  -bbox-layout      : like -bbox but with extra layout bounding box data. Sets -htmlmeta
  -opw <string>     : owner password (for encrypted files)
  -upw <string>     : user password (for encrypted files)
  -q                : don't print any messages or errors
  -v                : print copyright and version info
  -h                : print usage information
  -help             : print usage information
  --help            : print usage information
  -?                : print usage information
~~~

## Testing

Use the following command to test text extraction. It converts pages 1 through 5 of `example.pdf` into `example.txt`.

~~~
pdftotext -f 1 -l 5 example.pdf example.txt
~~~

If the text in the resulting txt file is not garbled, the extraction succeeded. If it is garbled, check the character encoding and the Chinese fonts.
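
The check described above can also be scripted. The following is a minimal Python sketch, not part of WenKuOS itself. It assumes `pdftotext` is on the `PATH` and uses only the `-f`, `-l`, and `-enc` options listed in the help output. The function names (`build_command`, `looks_garbled`, `extract_text`) and the 5% threshold are hypothetical choices made for illustration. Counting U+FFFD replacement characters is only a rough heuristic for garbled output, so a manual look at the txt file is still worthwhile.

~~~python
import shutil
import subprocess
from pathlib import Path


def build_command(pdf_path, txt_path, first_page=1, last_page=5, encoding="UTF-8"):
    """Build the pdftotext argument list for a page range (hypothetical helper)."""
    return [
        "pdftotext",
        "-f", str(first_page),   # first page to convert
        "-l", str(last_page),    # last page to convert
        "-enc", encoding,        # output text encoding name
        str(pdf_path),
        str(txt_path),
    ]


def looks_garbled(text, threshold=0.05):
    """Rough heuristic: True if more than `threshold` of the characters are U+FFFD.

    The threshold is an assumed value, not something pdftotext defines.
    """
    if not text:
        return False
    return text.count("\ufffd") / len(text) > threshold


def extract_text(pdf_path, txt_path, first_page=1, last_page=5):
    """Run pdftotext and return the extracted text, or None if the tool is missing."""
    if shutil.which("pdftotext") is None:
        return None
    subprocess.run(build_command(pdf_path, txt_path, first_page, last_page), check=True)
    # Undecodable bytes become U+FFFD so that looks_garbled can count them.
    return Path(txt_path).read_text(encoding="utf-8", errors="replace")


if __name__ == "__main__":
    try:
        text = extract_text("example.pdf", "example.txt")
    except subprocess.CalledProcessError as err:
        print(f"pdftotext failed with exit code {err.returncode}")
    else:
        if text is None:
            print("pdftotext not found; a fallback such as calibre would be needed")
        elif looks_garbled(text):
            print("output looks garbled; check the encoding and the Chinese fonts")
        else:
            print("extraction looks fine")
~~~

Building the argument list as a Python list, rather than a single shell string, avoids quoting problems when file names contain spaces or non-ASCII characters.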