from paddleocr import PaddleOCR
# Initialize the OCR
ocr = PaddleOCR(use_angle_cls=True, lang='en')
# Perform OCR on an image
result = ocr.ocr('image_sample.png', cls=True)
# Print each recognized (text, confidence) pair
for line in result[0]:
    print(line[1])
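Each entry in `result[0]` pairs a bounding box with a `(text, confidence)` tuple, so `line[1]` is a tuple rather than a plain string. A minimal sketch of pulling out only the text, assuming that PaddleOCR 2.x-style layout. The helper name `extract_text`, the `min_confidence` threshold, and the sample data are illustrative and not part of PaddleOCR:

```python
def extract_text(page, min_confidence=0.5):
    """Return recognized strings from one page of PaddleOCR-style output.

    Assumes each entry looks like [box, (text, confidence)].
    """
    if not page:  # some versions return None for a page with no text
        return []
    return [text for _box, (text, conf) in page if conf >= min_confidence]


# Hand-written sample in the assumed layout (not real OCR output)
sample_page = [
    [[[0, 0], [90, 0], [90, 20], [0, 20]], ("Hello", 0.98)],
    [[[0, 30], [90, 30], [90, 50], [0, 50]], ("w0rld", 0.31)],
]
print(extract_text(sample_page))
```

With real output, the same helper could be called as `extract_text(result[0])`. Dropping low-confidence entries is a design choice here; pass `min_confidence=0.0` to keep everything.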
6. Kraken
import pytesseract
from PIL import Image
# Load an image
img = Image.open("image_sample.png")
# Use Tesseract to extract text
text = pytesseract.image_to_string(img)
# Print the extracted text
print(text)
2. EasyOCR
Tesseract is the most widely used open-source OCR engine in the Python ecosystem, where it is typically called through the pytesseract wrapper. Originally developed by HP and later sponsored by Google, it is now maintained as an open-source project and provides high-quality OCR for over 100 languages.
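Because Tesseract uses separate trained data for each language, the language is chosen at call time. A minimal sketch using the `lang` argument of `pytesseract.image_to_string`, which accepts several codes joined with `+`. The helper names `build_lang_arg` and `ocr_image` are hypothetical, and the matching Tesseract language packs are assumed to be installed:

```python
def build_lang_arg(codes):
    """Join Tesseract language codes into one string, e.g. 'eng+deu'."""
    cleaned = [c.strip() for c in codes if c.strip()]
    if not cleaned:
        raise ValueError("at least one language code is required")
    return "+".join(cleaned)


def ocr_image(path, codes=("eng",)):
    """Run Tesseract on an image file with the given languages."""
    # Imported here so build_lang_arg works without the OCR dependencies
    import pytesseract
    from PIL import Image

    with Image.open(path) as img:
        return pytesseract.image_to_string(img, lang=build_lang_arg(codes))


print(build_lang_arg(["eng", "deu"]))
```

A call such as `ocr_image("image_sample.png", ["eng", "deu"])` would then recognize English and German text in the same image, provided both language packs are present.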
import pyocr
from PIL import Image
# Choose the OCR tool (Tesseract or CuneiForm)
tool = pyocr.get_available_tools()[0]
# Load the image
img = Image.open('image_sample.png')
# Extract text from the image
text = tool.image_to_string(img)
# Print the extracted text
print(text)
5. PaddleOCR
sudo apt install tesseract-ocr [On Debian, Ubuntu and Mint]
sudo yum install tesseract [On RHEL/CentOS/Fedora and Rocky/AlmaLinux]
sudo emerge -a app-text/tesseract [On Gentoo Linux]
sudo apk add tesseract-ocr [On Alpine Linux]
sudo pacman -S tesseract [On Arch Linux]
sudo zypper install tesseract-ocr [On OpenSUSE]
sudo pkg install tesseract [On FreeBSD]
pip3 install kraken
OR
pip install kraken
Key Features:
import boto3
# Initialize a Textract client
client = boto3.client('textract')
# Path to the image or PDF file you want to analyze
file_path = 'path_to_your_file.png'  # Replace with your file path
# Open the file in binary mode
with open(file_path, 'rb') as document:
    # Call Textract to analyze the document
    response = client.detect_document_text(Document={'Bytes': document.read()})
# Print the extracted text
for item in response['Blocks']:
    if item['BlockType'] == 'LINE':
        print(item['Text'])
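`detect_document_text` returns blocks of several types (pages, lines, words), which is why the loop filters on `BlockType`. A minimal sketch of that filtering as a reusable function that can be checked without calling AWS. The name `lines_from_response` is a hypothetical helper, and the sample response is hand-written in the shape of a Textract reply, not real service output:

```python
def lines_from_response(response):
    """Collect the text of every LINE block in a Textract-style response."""
    return [
        block["Text"]
        for block in response.get("Blocks", [])
        if block.get("BlockType") == "LINE"
    ]


# Hand-written stand-in for a Textract response (illustrative only)
sample_response = {
    "Blocks": [
        {"BlockType": "PAGE"},
        {"BlockType": "LINE", "Text": "Invoice 2024"},
        {"BlockType": "WORD", "Text": "Invoice"},
        {"BlockType": "LINE", "Text": "Total due"},
    ]
}
print("\n".join(lines_from_response(sample_response)))
```

Keeping the parsing separate from the network call makes it possible to test the text extraction against saved responses.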
Conclusion
Choosing the right OCR library in Python depends on the specific use case, the language requirements, and the complexity of the documents you’re processing. Whether you’re working on historical documents, multilingual texts, or simple scanned PDFs, these libraries provide powerful tools for text extraction.