Optical Character Recognition with Pytesseract

2022-02-18 15:21

磐創AI


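Overview

In this article we use computer vision techniques to extract text from images. We then apply some basic OpenCV operations to the image to enhance it and obtain more accurate results. This project is useful because it saves the time and effort of retyping text from images.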

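Scope

· For large organizations that need to pull text out of images, this application can save time.

· It opens the door to "paperless documents", which also helps modernize storage.

· It helps automate processes, because the text is obtained directly from the image.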

We import the requests library to fetch the git file and the image from their URLs.

# import requests to download the tesseract data file

import requests

Note: to download the tesseract data file you can simply open the link that is passed to the function here. The code just shows another way to download it.

# Downloading tesseract-ocr file
r = requests.get("https://raw.githubusercontent.com/tesseract-ocr/tessdata/4.00/ind.traineddata", stream=True)
# Write the data to a file to avoid path problems
with open("ind.traineddata", "wb") as file:
    for block in r.iter_content(chunk_size=1024):
        if block:
            file.write(block)

The code above downloads the tesseract data file that the Pytesseract library needs at run time and saves it at the path passed to open().
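The chunked write can be isolated into a small helper, which makes it easy to try without a network connection. This is a minimal sketch: `write_chunks` is a hypothetical name, not part of requests, and the list of byte blocks stands in for `r.iter_content(chunk_size=1024)`.

```python
import os
import tempfile


def write_chunks(chunks, path):
    """Write an iterable of byte blocks to `path`, skipping empty blocks.

    Hypothetical helper for illustration; returns the number of bytes written.
    """
    written = 0
    with open(path, "wb") as file:
        for block in chunks:
            if block:  # streamed responses can yield empty keep-alive blocks
                file.write(block)
                written += len(block)
    return written


# Stand-in for r.iter_content(chunk_size=1024): any iterable of bytes works.
fake_blocks = [b"abc", b"", b"defg"]
target = os.path.join(tempfile.mkdtemp(), "demo.traineddata")
size = write_chunks(fake_blocks, target)
print(size)
```

With a real response you would pass `r.iter_content(chunk_size=1024)` as `chunks`, and it is usually worth calling `r.raise_for_status()` first so that an error page is not saved as a data file.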

!pip install pytesseract

If you are working in a notebook, this command installs the Pytesseract module.

Requirement already satisfied: pytesseract in c:\programdata\anaconda3\lib\site-packages (0.3.8)

Requirement already satisfied: Pillow in c:\programdata\anaconda3\lib\site-packages (from pytesseract) (8.0.1)

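In this step we install the libraries that OCR requires. We also import an IPython function to clear output we do not need.

# Install the libraries required for optical character recognition
! apt install tesseract-ocr libtesseract-dev libmagickwand-dev

# Import IPython to clear unimportant output
from IPython.display import HTML, clear_output
clear_output()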

Now we install the Pytesseract and OpenCV libraries, which are the heart of our text recognition.

# Install Pytesseract and OpenCV
!pip install pytesseract wand opencv-python
clear_output()

Import the required libraries.

# Import libraries

from PIL import Image

import pytesseract

import cv2

import numpy as np

from pytesseract import Output

import re

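In this step we open an image, resize it, and save it again for later use and visualization.

# Read the image from a URL
image = Image.open(requests.get('https://i.stack.imgur.com/pbIdS.png', stream=True).raw)
image = image.resize((300, 150))
image.save('sample.png')
image

Output: (the resized sample image)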

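# Set the path to the tesseract executable
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'

Note: the line above tells pytesseract where the tesseract executable is installed. If the path does not match your system configuration, an error is raised even when tesseract is installed.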
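Hard-coding the path only works on machines laid out the same way. One way to make the setup more portable is to look for the executable on the system PATH first and fall back to a fixed location only when that fails. This is a sketch, and `find_tesseract` is a hypothetical helper name rather than anything pytesseract provides.

```python
import shutil


def find_tesseract(fallback):
    """Return the tesseract executable found on PATH, or `fallback` if none is found."""
    return shutil.which("tesseract") or fallback


# The fallback is the Windows path used in this article; adjust it for your system.
tesseract_path = find_tesseract(r"C:\Program Files\Tesseract-OCR\tesseract.exe")
print(tesseract_path)
# Then assign it: pytesseract.pytesseract.tesseract_cmd = tesseract_path
```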

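Here we extract text from the image using a custom configuration.

# Simply extracting text from image
custom_config = r'-l eng --oem 3 --psm 6'
text = pytesseract.image_to_string(image, config=custom_config)
print(text)

Output: (the text recognized in the image)

In the custom configuration, "eng" selects English, so English letters are recognized; several languages can be combined with "+". "PSM" stands for page segmentation mode, which controls how Tesseract splits the page into blocks of text; mode 6 assumes a single uniform block of text. "OEM" stands for OCR engine mode, and the value 3 selects the default engine.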
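Because the configuration is just a string of command-line options, it can be assembled from its parts. The sketch here uses a hypothetical helper, `build_config`, to show how the language list, OEM and PSM values fit together; "fra" is only an example and would need the matching traineddata file installed.

```python
def build_config(languages, oem=3, psm=6):
    """Build a Tesseract option string such as '-l eng --oem 3 --psm 6'."""
    lang = "+".join(languages)  # Tesseract joins several languages with '+'
    return f"-l {lang} --oem {oem} --psm {psm}"


custom_config = build_config(["eng"])
print(custom_config)
# English plus French, treating the image as a single line of text (PSM 7)
print(build_config(["eng", "fra"], psm=7))
```

The resulting string can be passed as the `config` argument of `pytesseract.image_to_string()`, exactly like the hand-written one.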

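Now we remove unwanted symbols from the extracted text by replacing each of them with an empty string.

# Extracting text from image and removing irrelevant symbols from characters
try:
    text = pytesseract.image_to_string(image, lang="eng")
    characters_to_remove = "!()@—*“>+-/,'|?#%$&^_~"
    new_string = text
    for character in characters_to_remove:
        new_string = new_string.replace(character, "")
    print(new_string)
except IOError as e:
    # The handler body is missing in the source; printing the error is an assumption
    print("Error:", e)

Output: (the recognized text with the symbols removed)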
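The replace loop scans the whole string once per symbol. As an alternative sketch, Python's `str.translate` can delete all the symbols in a single pass. `strip_symbols` is a hypothetical helper, and its default symbol set is an ASCII-only subset chosen for illustration, not the article's exact list.

```python
def strip_symbols(text, symbols="!()@*>+-/,'|#%$&^_~"):
    """Remove every character in `symbols` from `text` in one pass."""
    table = str.maketrans("", "", symbols)  # third argument lists characters to delete
    return text.translate(table)


cleaned = strip_symbols("He!llo, wo#rld")
print(cleaned)
```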

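In the cell below we read the image in OpenCV format for further processing. This is required when we need to extract text from complex images.

Now we perform OpenCV operations to get text out of complex images.

image = cv2.imread('sample.png')  # reads the image as a NumPy array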

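We convert the image to grayscale so that it is simpler to process: a grayscale image has a single channel of intensity values from 0 to 255 instead of three color channels. Here we use the cv2.cvtColor() method to convert the color image to grayscale; cv2.cvtColor() supports more than 150 color-space conversions.

# Grayscale image
def get_grayscale(image):
    return cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)

gray = get_grayscale(image)
Image.fromarray(gray)

Output: (the grayscale image)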
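To see what the conversion does to a single pixel, here is a pure-Python sketch of the weighted sum that the OpenCV documentation gives for color-to-gray conversion (0.299 R + 0.587 G + 0.114 B). `bgr_to_gray` is a hypothetical name, and OpenCV's own fixed-point arithmetic may differ from this by one gray level.

```python
def bgr_to_gray(pixel):
    """Convert one (B, G, R) pixel to a gray level with the standard luma weights."""
    b, g, r = pixel  # OpenCV stores channels in blue, green, red order
    return round(0.299 * r + 0.587 * g + 0.114 * b)


print(bgr_to_gray((255, 255, 255)))  # white
print(bgr_to_gray((0, 0, 0)))        # black
print(bgr_to_gray((255, 0, 0)))      # pure blue in BGR order
```

Green contributes the most and blue the least, which roughly follows how bright each color appears to the eye.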

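Now we blur the image to remove noise. Here we use the cv2.medianBlur() function to reduce the noise in the image. Blurring is a technique that smooths an image by applying a smoothing filter, and it is one of the most widely used methods in image processing.

# Noise removal
def remove_noise(image):
    return cv2.medianBlur(image, 5)

noise = remove_noise(gray)
Image.fromarray(noise)

Output: (the denoised image)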
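The idea behind a median filter fits in a few lines of plain Python. This sketch is for illustration only and is far slower than cv2.medianBlur(): `median_filter` is a hypothetical name, the image is a small nested list, and border pixels are handled by repeating the nearest edge pixel.

```python
from statistics import median


def median_filter(img, k=3):
    """Apply a k x k median filter to a 2-D list of gray levels (k odd)."""
    h, w = len(img), len(img[0])
    r = k // 2
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # Clamp coordinates so that border pixels reuse the nearest edge value
            window = [
                img[min(max(y + dy, 0), h - 1)][min(max(x + dx, 0), w - 1)]
                for dy in range(-r, r + 1)
                for dx in range(-r, r + 1)
            ]
            out[y][x] = median(window)
    return out


noisy = [
    [0, 0, 0],
    [0, 255, 0],  # a single bright "salt" pixel
    [0, 0, 0],
]
print(median_filter(noisy))
```

A lone outlier never reaches the middle of the sorted window, so it disappears, while an averaging filter would smear it over its neighbors instead.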

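Here we apply thresholding. The idea is simple: with cv2.THRESH_BINARY, a pixel whose value is above the threshold becomes white (255), and every other pixel becomes black (0). The function used is cv2.threshold(); because cv2.THRESH_OTSU is added, the threshold is chosen automatically from the image histogram and the 0 passed as the threshold is ignored.

# Thresholding
def thresholding(image):
    # source image must be a grayscale image
    return cv2.threshold(image, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

thresh = thresholding(gray)
Image.fromarray(thresh)

Output: (the binarized image)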
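Otsu's method picks the threshold that best separates dark pixels from bright ones by maximizing the between-class variance. The pure-Python sketch here shows the idea on a flat list of 8-bit values. The function names are hypothetical, and on ties OpenCV's implementation may settle on a different threshold.

```python
def otsu_threshold(pixels):
    """Return the threshold that maximizes between-class variance for 8-bit values."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(i * count for i, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels with value <= t
    sum0 = 0  # sum of those pixel values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        w1 = total - w0
        if w0 == 0 or w1 == 0:
            continue  # one class is empty, so there is nothing to separate
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        var_between = w0 * w1 * (mean0 - mean1) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(pixels, t):
    """Binary threshold: values above t become 255, the rest become 0."""
    return [255 if p > t else 0 for p in pixels]


pixels = [10] * 5 + [200] * 5  # dark ink and bright paper
t = otsu_threshold(pixels)
print(binarize(pixels, t))
```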

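Here we apply erosion, one of the most basic and important morphological operations. Erosion replaces each pixel with the minimum value in the neighborhood covered by the kernel, so bright regions shrink and dark regions grow. For dark text on a light background this thickens the strokes, which can help recognize characters that are slightly blurred or broken. We use the erode() function from the cv2 library.

# Erosion
def erode(image):
    kernel = np.ones((5, 5), np.uint8)
    return cv2.erode(image, kernel, iterations=1)

eroded = erode(gray)  # a separate name keeps the erode() function usable
Image.fromarray(eroded)

Output: (the eroded image)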
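Erosion is a minimum filter, which a few lines of plain Python can show on a tiny image. This is an illustrative sketch, not OpenCV's implementation; the `erode` function here is a stand-alone toy that works on nested lists and ignores neighbors that fall outside the image.

```python
def erode(img, k=3):
    """Replace each pixel by the minimum of its k x k neighborhood (k odd)."""
    h, w = len(img), len(img[0])
    r = k // 2
    return [
        [
            min(
                img[yy][xx]
                for yy in range(max(y - r, 0), min(y + r + 1, h))
                for xx in range(max(x - r, 0), min(x + r + 1, w))
            )
            for x in range(w)
        ]
        for y in range(h)
    ]


page = [
    [255, 255, 255, 255, 255],
    [255, 255, 255, 255, 255],
    [255, 255, 0, 255, 255],  # one dark "ink" pixel on a white page
    [255, 255, 255, 255, 255],
    [255, 255, 255, 255, 255],
]
for row in erode(page):
    print(row)
```

The single dark pixel grows into a 3 x 3 dark block, which is the stroke-thickening effect on dark text.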

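Here we apply a morphological transformation. Morphological transformations are best suited to binary images; they change the shapes in an image by comparing each pixel with its neighbors under a kernel rather than with a fixed threshold. The operation used here is opening, an erosion followed by a dilation, which removes small bright specks while keeping larger shapes close to their original size.

# Morphological transformation (opening)
def opening(image):
    kernel = np.ones((5, 5), np.uint8)
    return cv2.morphologyEx(image, cv2.MORPH_OPEN, kernel)

opened = opening(gray)  # a separate name keeps the opening() function usable
Image.fromarray(opened)

Output: (the image after opening)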
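Opening is an erosion (minimum filter) followed by a dilation (maximum filter). The pure-Python sketch here applies it to a small binary image so that the effect is visible in the printed rows. Both function names are hypothetical, and neighbors outside the image are simply ignored.

```python
def neighborhood_filter(img, k, pick):
    """Apply `pick` (min or max) over each k x k neighborhood of a 2-D list."""
    h, w = len(img), len(img[0])
    r = k // 2
    return [
        [
            pick(
                img[yy][xx]
                for yy in range(max(y - r, 0), min(y + r + 1, h))
                for xx in range(max(x - r, 0), min(x + r + 1, w))
            )
            for x in range(w)
        ]
        for y in range(h)
    ]


def opening(img, k=3):
    """Erode, then dilate, with the same k x k square kernel."""
    eroded = neighborhood_filter(img, k, min)   # erosion wipes out small bright specks
    return neighborhood_filter(eroded, k, max)  # dilation grows the survivors back


binary = [
    [0, 0, 0, 0, 0, 0, 0],
    [0, 255, 255, 255, 0, 0, 0],
    [0, 255, 255, 255, 0, 255, 0],  # the lone 255 on the right is noise
    [0, 255, 255, 255, 0, 0, 0],
    [0, 0, 0, 0, 0, 0, 0],
]
cleaned = opening(binary)
for row in cleaned:
    print(row)
```

The 3 x 3 block survives because the erosion leaves its center pixel for the dilation to grow back, while the isolated pixel is gone after the erosion step.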

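Here we try template matching, a method for searching a larger image for the location of a template image. When we pass the same image as both the image and the template, the similarity is essentially 100%: the score in the output is 1, a perfect match. For template matching we use the matchTemplate() function from the cv2 library.

# Template matching
def match_template(image, template):
    return cv2.matchTemplate(image, template, cv2.TM_CCOEFF_NORMED)

match = match_template(gray, gray)
match

Output:

array([[1.]], dtype=float32)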
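The score comes from a normalized correlation coefficient: both patches have their mean subtracted, and the dot product of the results is divided by the product of their magnitudes. This is a sketch of that formula for two equally sized patches given as flat lists. `ccoeff_normed` is a hypothetical name, and returning 0.0 for a completely flat patch is a choice made here, not necessarily what OpenCV does.

```python
import math


def ccoeff_normed(patch, template):
    """Normalized correlation coefficient of two equally sized flat pixel lists."""
    n = len(template)
    mean_p = sum(patch) / n
    mean_t = sum(template) / n
    dp = [p - mean_p for p in patch]
    dt = [t - mean_t for t in template]
    denom = math.sqrt(sum(v * v for v in dp) * sum(v * v for v in dt))
    if denom == 0:  # a flat patch has no structure to correlate
        return 0.0
    return sum(a * b for a, b in zip(dp, dt)) / denom


template = [10, 20, 30, 40]
print(ccoeff_normed(template, template))        # identical patches
print(ccoeff_normed(template[::-1], template))  # reversed gradient
```

Identical patches score 1 and a reversed gradient scores -1. cv2.matchTemplate() evaluates this kind of score at every position where the template fits, which is why matching an image against itself yields a 1 x 1 result.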

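Now we separate the characters in the text by drawing a rectangle around each one.

# Drawing rectangle around text
img = cv2.imread('sample.png')
h, w, c = img.shape
boxes = pytesseract.image_to_boxes(img)
for b in boxes.splitlines():
    b = b.split(' ')
    # box coordinates are measured from the bottom-left corner, so flip y
    img = cv2.rectangle(img, (int(b[1]), h - int(b[2])), (int(b[3]), h - int(b[4])), (0, 255, 0), 2)

Image.fromarray(img)

Output: (the image with a box around each character)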
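Each line returned by image_to_boxes() has the form "character left bottom right top page", with the origin at the bottom-left corner of the image, whereas OpenCV draws from the top-left corner. The sketch here isolates that coordinate flip in a hypothetical helper, `parse_box_line`, and runs it on a made-up line.

```python
def parse_box_line(line, img_height):
    """Turn one image_to_boxes() line into (char, corner_1, corner_2) in OpenCV coordinates."""
    char, left, bottom, right, top = line.split(" ")[:5]
    # Flip y because Tesseract measures it from the bottom of the image
    corner_1 = (int(left), img_height - int(bottom))
    corner_2 = (int(right), img_height - int(top))
    return char, corner_1, corner_2


# A made-up line for a 100-pixel-high image
print(parse_box_line("T 10 20 30 40 0", 100))
```

The two corners are exactly what the loop above passes to cv2.rectangle().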

Finally, we can draw rectangles around a specific pattern or word.

# Drawing rectangle around a specific pattern or word
img = cv2.imread('sample.png')
d = pytesseract.image_to_data(img, output_type=Output.DICT)
keys = list(d.keys())
date_pattern = 'artificially'
n_boxes = len(d['text'])
for i in range(n_boxes):
    if float(d['conf'][i]) > 60:
        if re.match(date_pattern, d['text'][i]):
            (x, y, w, h) = (d['left'][i], d['top'][i], d['width'][i], d['height'][i])
            img = cv2.rectangle(img, (x, y), (x + w, y + h), (0, 255, 0), 2)

Image.fromarray(img)

Output: (the image with a box around each matching word)
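The filtering step can be tried without running OCR by feeding it a hand-made dictionary shaped like the image_to_data() result. This is a sketch under assumptions: `matching_boxes` is a hypothetical helper, the sample values are invented, and it uses re.fullmatch instead of the article's re.match so that only whole words match.

```python
import re


def matching_boxes(data, pattern, min_conf=60):
    """Return (x, y, w, h) for every confident word that fully matches `pattern`."""
    boxes = []
    for i, word in enumerate(data["text"]):
        # conf may arrive as a string or a number, so convert before comparing
        if float(data["conf"][i]) > min_conf and re.fullmatch(pattern, word):
            boxes.append(
                (data["left"][i], data["top"][i], data["width"][i], data["height"][i])
            )
    return boxes


# Invented stand-in for pytesseract.image_to_data(img, output_type=Output.DICT)
sample = {
    "text": ["", "artificially", "artificial", "artificially"],
    "conf": ["-1", "96", "91", "40"],
    "left": [0, 10, 120, 200],
    "top": [0, 5, 5, 5],
    "width": [300, 90, 80, 90],
    "height": [150, 20, 20, 20],
}
print(matching_boxes(sample, "artificially"))
```

Only the second entry passes both checks: the first has no text and a confidence of -1, the third is a different word, and the fourth falls below the confidence cutoff.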

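Conclusion

We started by learning how to install tesseract for text extraction. Next, we took an image and extracted the text from it. We learned that extracting text from complex images requires some of OpenCV's image transformation functions.

Endnotes

I hope you enjoy this step-by-step approach to learning optical character recognition with Pytesseract.

Original title: Optical Character Recognition with Pytesseract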
