Multilingual OCR: How to Extract Text From PDFs in Any Language
Guide to running OCR on non-English documents — Arabic, Chinese, Japanese, Russian, and more — with the best free and paid tools.
AltoUnlockPDF Team
PDF Tools Expert
The majority of documents in the world are not in English. If you’re working with German legal contracts, French medical records, Chinese business documents, or Arabic government forms, you need an OCR tool that truly handles that language.
This guide covers multilingual OCR across the major language families.
Why Language Matters for OCR
OCR engines work by matching pixel patterns against trained character models. For this to work:
- The engine must have training data for that language’s alphabet/characters
- It must understand how characters combine into words (language model)
- For complex scripts (Arabic, Thai, Devanagari), it needs special handling of ligatures and diacritics
Using an English OCR engine on French text: passable (mostly the same alphabet, but accented characters such as é, à, and ç are often misread)
Using an English OCR engine on Arabic: completely unusable (the script is entirely different, so no character matches a trained model)
Supported Languages by Tool
Tesseract (100+ Languages)
Tesseract is a free OCR engine with trained data for more than 100 languages. The language packs must be installed separately:
# Install language packs (Ubuntu)
sudo apt install tesseract-ocr-deu # German
sudo apt install tesseract-ocr-fra # French
sudo apt install tesseract-ocr-ara # Arabic
sudo apt install tesseract-ocr-chi-sim # Chinese Simplified
sudo apt install tesseract-ocr-jpn # Japanese
# macOS (Homebrew)
brew install tesseract-lang
# List installed languages
tesseract --list-langs
# Run OCR with specific language
tesseract arabic_doc.jpg output -l ara
tesseract chinese_doc.jpg output -l chi_sim
tesseract multilingual.jpg output -l fra+eng
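As a rough sketch of how the commands above might be wrapped in a script, the helpers below join language codes with `+` (the form that `-l fra+eng` uses) and check them against the text printed by `tesseract --list-langs`. The function names are hypothetical, and the parser assumes the output is a header line ending in a colon followed by one code per line, which may vary between Tesseract versions.

```python
import subprocess


def parse_list_langs(output: str) -> set:
    """Parse the text printed by `tesseract --list-langs` into a set of codes.

    Assumption: one code per line, after a header line that ends with ':'.
    """
    codes = set()
    for line in output.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.endswith(":"):
            continue
        codes.add(line)
    return codes


def build_lang_arg(wanted: list, installed: set) -> str:
    """Return the value for `-l`, e.g. 'fra+eng', or fail if a pack is missing."""
    missing = [code for code in wanted if code not in installed]
    if missing:
        raise ValueError("Language packs not installed: " + ", ".join(missing))
    return "+".join(wanted)


def installed_languages() -> set:
    """Ask the local tesseract binary which language packs it can see."""
    proc = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True, text=True, check=True,
    )
    # Some versions print the list to stderr, so read both streams.
    return parse_list_langs(proc.stdout + "\n" + proc.stderr)
```

A call such as `build_lang_arg(["fra", "eng"], installed_languages())` would then fail early with a clear message instead of letting Tesseract stop partway through a batch.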
Languages by Difficulty
Latin Script Languages (Easy)
English, French, German, Spanish, Italian, Portuguese — all use the same basic alphabet with minor variations. Tesseract handles these excellently.
Cyrillic Script (Moderate)
Russian, Ukrainian, Bulgarian, Serbian: all are well supported by Tesseract and most OCR tools. Key Tesseract codes: rus, ukr, bul, srp.
Arabic / Hebrew (Challenging — RTL)
Arabic and Hebrew are right-to-left scripts with complex joining rules. Dedicated models are needed:
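import pytesseract
from PIL import Image
# Arabic OCR with the default settings
image = Image.open('arabic.jpg')
text = pytesseract.image_to_string(image, lang='ara')
# Optional engine settings: --oem 3 selects the default engine mode,
# --psm 6 assumes a single uniform block of text.
# Neither flag sets text direction; right-to-left handling comes from the 'ara' model.
custom_config = r'--oem 3 --psm 6'
text = pytesseract.image_to_string(image, lang='ara', config=custom_config)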
For Arabic documents, commercial engines such as ABBYY FineReader often produce noticeably more accurate results than Tesseract.
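Whichever engine produces the Arabic text, the output may still contain diacritics, ligature presentation forms, or elongation characters that get in the way of search and comparison. The following is a minimal sketch of one way to normalize it using only Python's standard `unicodedata` module. The function name `normalize_arabic` is hypothetical, and the normalization is lossy by design, so it is better applied to a search index than to the archived text.

```python
import unicodedata

TATWEEL = "\u0640"  # Arabic elongation character, purely decorative


def normalize_arabic(text: str) -> str:
    """Fold Arabic OCR output into a plain form for search or comparison."""
    # NFKC folds presentation-form ligatures (e.g. U+FEFB) into base letters.
    text = unicodedata.normalize("NFKC", text)
    # Drop combining marks such as fatha, kasra, and sukun.
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Remove elongation characters used for justification.
    return text.replace(TATWEEL, "")
```

Note that the combining-mark filter is not Arabic-specific: it also removes decomposed accents from any other script in the same string.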
Chinese / Japanese / Korean (CJK — Complex)
CJK scripts have thousands of characters. Dedicated models are required:
- Tesseract: chi_sim (Simplified Chinese), chi_tra (Traditional Chinese), jpn (Japanese), kor (Korean)
- PaddleOCR (by Baidu) is generally better for CJK than Tesseract:
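from paddleocr import PaddleOCR
# Arguments as in PaddleOCR 2.x releases; lang='ch' loads the Chinese model
ocr = PaddleOCR(use_angle_cls=True, lang='ch')
result = ocr.ocr('chinese_document.jpg', cls=True)
# result[0] holds the detections for the first image (None if nothing was found)
for line in result[0] or []:
    print(line[1][0])  # extracted text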
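The loop above prints every detected line regardless of how sure the model was. As a hedged sketch, the helper below keeps only lines above a confidence threshold. It assumes the PaddleOCR 2.x result layout, where each entry is `[box, (text, confidence)]`. The name `extract_lines` and the 0.8 default are illustrative choices, not part of PaddleOCR.

```python
def extract_lines(result, min_confidence=0.8):
    """Return recognized text lines at or above `min_confidence`.

    Assumes the PaddleOCR 2.x layout: result[0] is a list of
    [box, (text, confidence)] entries, or None when nothing was detected.
    """
    if not result or result[0] is None:
        return []
    lines = []
    for _box, (text, confidence) in result[0]:
        if confidence >= min_confidence:
            lines.append(text)
    return lines
```

Lines that fall below the threshold are good candidates for manual review rather than silent deletion.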
Cloud OCR APIs for Multilingual Documents
When accuracy is critical and volume justifies cost:
Google Cloud Vision API
- Supports 50+ languages
- Excellent for CJK and Arabic
- $1.50 per 1,000 pages
from google.cloud import vision
client = vision.ImageAnnotatorClient()
# Read the image bytes
with open('document.jpg', 'rb') as f:
    image = vision.Image(content=f.read())
# Specify language hints
image_context = vision.ImageContext(language_hints=['zh', 'en'])
response = client.text_detection(image=image, image_context=image_context)
# The API reports failures inside the response body
if response.error.message:
    raise RuntimeError(response.error.message)
print(response.full_text_annotation.text)
AWS Textract
- Supports a small set of Latin-script languages (English, French, German, Italian, Portuguese, Spanish); no Arabic or CJK
- Best for forms and tables
- $1.50 per 1,000 pages for plain text detection; form and table analysis is priced higher
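To judge whether volume justifies the cost, the $1.50 per 1,000 pages figure quoted above can be turned into a quick estimate. This is a back-of-the-envelope sketch only: it assumes a flat rate and ignores free tiers, volume discounts, and feature-specific pricing. The function name is hypothetical.

```python
def estimate_ocr_cost(pages: int, price_per_thousand: float = 1.50) -> float:
    """Estimate the cost in dollars at a flat per-1,000-pages rate."""
    if pages < 0:
        raise ValueError("pages must be non-negative")
    return round(pages / 1000 * price_per_thousand, 2)


# Example: a 25,000-page archive at the quoted rate
print(estimate_ocr_cost(25_000))  # 37.5
```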
AltoUnlockPDF Language Support
Our OCR tool supports 35 languages including:
- All major European languages
- Russian and other Cyrillic scripts
- Arabic and Hebrew (RTL support)
- Chinese (Simplified and Traditional)
- Japanese and Korean
Select your language from the dropdown before converting.
Tips for Non-Latin Script OCR
- Ensure correct text direction is set (RTL for Arabic/Hebrew)
- Avoid heavily compressed JPEG; use PNG or TIFF for sharper character edges
- Font clarity matters more than it does for Latin text, because many non-Latin scripts have more complex strokes
- Post-process with a native spell checker for the specific language
- Use script-specific tools (PaddleOCR for CJK, dedicated Arabic OCR for Arabic business documents)
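One inexpensive check that complements the tips above is to confirm that the OCR output is actually in the script you expected. Output that is mostly Latin letters from an Arabic scan usually means the wrong language was selected. The sketch below uses only Python's standard library and takes the first word of each character's Unicode name as its script label. That is a heuristic rather than a full script-detection method, and the function name is hypothetical.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return the most common script label among letters, or None if no letters.

    The label is the first word of the Unicode character name,
    e.g. 'LATIN', 'CYRILLIC', 'ARABIC', 'CJK', 'HIRAGANA', 'HANGUL'.
    """
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, punctuation, whitespace
        name = unicodedata.name(ch, "")
        if name:
            counts[name.split()[0]] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]
```

For Japanese, expect a mix of 'CJK', 'HIRAGANA', and 'KATAKANA' labels, so a real check would accept any of the three.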
For mission-critical multilingual document processing, a common approach in enterprise settings is to combine a cloud service such as the Google Cloud Vision API with human review of the output.