1.3.4 Installing tesserocr · Python 3 Web Crawler Notes

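# 1.3.4 Installing tesserocr

## 1. Related links {#2-相關鏈接}

* Tesserocr GitHub: [https://github.com/sirfz/tesserocr](https://github.com/sirfz/tesserocr)
* Tesserocr PyPI: [https://pypi.python.org/pypi/tesserocr](https://pypi.python.org/pypi/tesserocr)
* Tesseract downloads: [http://digi.bib.uni-mannheim.de/tesseract](http://digi.bib.uni-mannheim.de/tesseract)
* Tesseract GitHub: [https://github.com/tesseract-ocr/tesseract](https://github.com/tesseract-ocr/tesseract)
* Tesseract language packs: [https://github.com/tesseract-ocr/tessdata](https://github.com/tesseract-ocr/tessdata)
* Tesseract documentation: [https://github.com/tesseract-ocr/tesseract/wiki/Documentation](https://github.com/tesseract-ocr/tesseract/wiki/Documentation)
* Tesserocr wheels for Windows: [https://github.com/simonflueckiger/tesserocr-windows\_build/releases](https://github.com/simonflueckiger/tesserocr-windows_build/releases)

## 2. Installing on Windows

Download the Tesseract installer from the download page listed above:

![](https://box.kancloud.cn/16ba6d8277e18fa69acb972b21f21540_818x802.png)

File names that contain `dev` are development builds; those without `dev` are stable releases. Pick the latest release without `dev`, for example `tesseract-ocr-setup-3.05.01.exe`.

Next, install Tesserocr directly with pip:

```text
pip3 install tesserocr pillow
```

If installing tesserocr fails, install it from a prebuilt wheel (`.whl`) instead. Microsoft Visual C++ 14.0 must be installed beforehand ([download link](https://pan.baidu.com/s/1lL3WVCE2T-4zQJbjloP6-w)). Then change to the directory that contains the wheel and install it with pip. The `cp36` and `win_amd64` tags in the file name must match your Python version (3.6 here) and architecture (64-bit).

```text
pip install tesserocr-2.2.2-cp36-cp36m-win_amd64.whl
```

## 3. Installing on Linux {#4-linux下的安裝}

Most Linux distributions already ship a Tesseract package. It may be called `tesseract-ocr` or `tesseract`, and it can be installed with the distribution's package manager.

### Ubuntu, Debian, Deepin {#ubuntu、debian、deepin}

The install command is:

```text
sudo apt-get install -y tesseract-ocr libtesseract-dev libleptonica-dev
```

### CentOS, RedHat {#centos、redhat}

The install command is:

```text
yum install -y tesseract
```

Running the command for your distribution completes the Tesseract installation, after which the `tesseract` command is available. Check which languages it supports:

```text
tesseract --list-langs
```

Example output:

```text
List of available languages (3):
eng
osd
equ
```

Only a few languages are supported out of the box. To add more, install the official language packs, called tessdata, from [https://github.com/tesseract-ocr/tessdata](https://github.com/tesseract-ocr/tessdata). Clone the repository with Git and move its contents into the tessdata directory of your installation. The commands differ by distribution, and the target directory can also vary with the packaged Tesseract version, so check where the existing `eng.traineddata` file lives if the paths below do not exist.

### Ubuntu, Debian, Deepin {#ubuntu、debian、deepin}

```text
git clone https://github.com/tesseract-ocr/tessdata.git
sudo mv tessdata/* /usr/share/tesseract-ocr/tessdata
```

### CentOS, RedHat {#centos、redhat}

```text
git clone https://github.com/tesseract-ocr/tessdata.git
sudo mv tessdata/* /usr/share/tesseract/tessdata
```

This installs every downloaded language pack. Run the language listing command again:

```text
tesseract --list-langs
```

The result is:

```text
List of available languages (107):
afr
amh
ara
asm
aze
aze_cyrl
bel
ben
bod
bos
bul
cat
ceb
ces
chi_sim
chi_tra
...
```

Many more languages are now listed. For example, `chi_sim` stands for Simplified Chinese, which confirms that the language packs were installed. Then install Tesserocr with pip:

```text
pip3 install tesserocr pillow
```

## 4. Installing on Mac {#5-mac下的安裝}

On a Mac, first install the ImageMagick and Tesseract libraries with Homebrew:

```text
brew install imagemagick
brew install tesseract --all-languages
```

Recent Homebrew versions no longer accept the `--all-languages` option; there the additional language data is provided by the separate `tesseract-lang` formula.

Then install Tesserocr:

```text
pip3 install tesserocr pillow
```

This completes the Tesserocr installation.

## 5. Verifying the installation

Test Tesseract and Tesserocr separately, using this [test image](https://raw.githubusercontent.com/Python3WebSpider/TestTess/master/image.png):

![](https://box.kancloud.cn/b95c063e6f856598a1e2e477e97463b5_608x110.png)

First test from the command line with Tesseract:

```text
tesseract image.png result -l eng && type result.txt
```

Output:

```text
Tesseract Open Source OCR Engine v3.05.01 with Leptonica
Python3WebSpider
```

Notes for Windows:

* The Tesseract-OCR installation directory must be added to the environment variables so that the `tesseract` command can be found.
* The tessdata directory inside the Tesseract-OCR directory must also be configured in the environment variables (Tesseract looks it up through the `TESSDATA_PREFIX` variable).

![](https://box.kancloud.cn/718831ebfcb7aedd9b03a28edf519fd5_837x178.png)

About the arguments of the `tesseract` command: the first argument is the image file name, the second argument `result` is the base name of the output file (Tesseract appends `.txt`), and `-l` selects the language pack, here `eng` for English. The `type` command then prints the result; `type` is the Windows command, so on Linux or Mac use `cat result.txt` instead.

Next, test from Python code with the Tesserocr library:

```text
import tesserocr
from PIL import Image

image = Image.open("D:/image.png")
print(tesserocr.image_to_text(image))
```

If it fails with an error like this:

![](https://box.kancloud.cn/7dfde8da64cafd5abde2f6f3828503e3_667x91.png)

copy the tessdata directory from the Tesseract-OCR directory into the python36 directory (the Python installation directory) and run the code again; it should then succeed:

![](https://box.kancloud.cn/6abd6d7b1b81d481578fef7ec791da39_539x84.png)

The code first reads the image file with `Image`, then calls tesserocr's `image_to_text()` method and prints the recognized text. Output:

```text
Python3WebSpider
```

Alternatively, call the `file_to_text()` method directly for the same result. Use forward slashes (or a raw string) in the Windows path, because backslashes in an ordinary Python string are treated as escape characters:

```text
import tesserocr

print(tesserocr.file_to_text("D:/image.png"))
```

Output:

```text
Python3WebSpider
```

If the result is printed successfully, both Tesseract and Tesserocr are installed correctly.

## 6. The pytesseract library

pytesseract does the same job as tesserocr: both recognize text in images, and both use Tesseract-OCR underneath. If tesserocr refuses to install, there is no need to keep fighting with it; install pytesseract (`pip3 install pytesseract`) and use that instead. It still requires the `tesseract` command itself to be installed.

```
>>> import pytesseract
>>> from PIL import Image
>>>
>>> image = Image.open('image.png')
>>> code = pytesseract.image_to_string(image)
>>> print(code)
Python3WebSpider
```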
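
The `tesseract --list-langs` check from the Linux section can also be scripted, which is handy before running OCR code that needs a particular language such as `chi_sim`. The following is only a minimal sketch: the helper names `parse_langs` and `installed_langs` are made up for illustration, and it assumes the output has the shape shown earlier (a `List of available languages (N):` header followed by one language code per line). Because some Tesseract versions print that list on stderr rather than stdout, the sketch merges both streams.

```python
import shutil
import subprocess


def parse_langs(output):
    """Extract language codes from `tesseract --list-langs` output.

    Hypothetical helper: assumes one language code per line after the
    "List of available languages (N):" header line.
    """
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the header line.
        if not line or line.lower().startswith("list of available languages"):
            continue
        langs.append(line)
    return langs


def installed_langs(binary="tesseract"):
    """Return the language codes reported by the local tesseract binary.

    Returns None when the binary cannot be found on PATH.
    """
    if shutil.which(binary) is None:
        return None
    result = subprocess.run(
        [binary, "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # assumption: the list may arrive on stderr
        universal_newlines=True,
    )
    return parse_langs(result.stdout)
```

Calling `installed_langs()` returns `None` when the `tesseract` command is not on the PATH, which points to the environment variable setup described in section 5. Otherwise, a test such as `"chi_sim" in installed_langs()` shows whether the language packs were copied to the right place.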