10 Tips to Improve OCR Accuracy for Better PDF Text Recognition
Practical tips to get better OCR results from scanned documents — covering scan settings, image preprocessing, and tool configuration.
AltoUnlockPDF Team
PDF Tools Expert
Even the best OCR software produces poor results when given a bad input image. The most important factor in OCR accuracy isn’t which software you choose — it’s the quality of your scanned document.
Here are ten proven techniques to maximize OCR accuracy.
1. Scan at the Right Resolution
Resolution is the single most important factor. Always scan at:
- Minimum: 200 DPI (barely acceptable for clean documents)
- Recommended: 300 DPI (good balance of quality and file size)
- Best: 600 DPI (for faded, small-font, or degraded documents)
Higher DPI means more pixels per character, giving the OCR engine more information to work with.
Most modern smartphone cameras exceed the equivalent of 300 DPI when held 30–40 cm from a document, provided the page fills most of the frame.
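To check whether a scan or phone photo meets these thresholds, the effective resolution can be estimated from the pixel count and the physical page size. The sketch below is illustrative only: the function names and the A4 example are assumptions, and it presumes the page fills the frame along its long edge.

```python
MM_PER_INCH = 25.4

def effective_dpi(pixels_long_edge: int, page_long_edge_mm: float) -> float:
    """Pixels per inch along the page's long edge, assuming the page fills the frame."""
    return pixels_long_edge / (page_long_edge_mm / MM_PER_INCH)

def rate_resolution(dpi: float) -> str:
    """Map a DPI value onto the 200 / 300 / 600 DPI thresholds from this tip."""
    if dpi < 200:
        return "too low"
    if dpi < 300:
        return "minimum"
    if dpi < 600:
        return "recommended"
    return "best"

# Example: a photo 4000 pixels tall showing a full A4 page (297 mm long edge)
dpi = effective_dpi(4000, 297)
print(f"{dpi:.0f} DPI -> {rate_resolution(dpi)}")
```

For the A4 example this works out to roughly 342 DPI, which falls in the recommended band.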
2. Use Grayscale or Black & White Scans
For text documents, color scanning adds noise without benefit. The OCR engine only needs to distinguish dark text from light background.
- Grayscale (8-bit): Best for most documents — preserves subtle contrast
- Black & White (1-bit): Fastest processing; best for high-contrast printed text
- Color: Use only when the document has meaningful color content (charts, forms with colored fields)
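If a scan was already saved in color, it can be converted afterwards. The following is a minimal sketch using Pillow; the helper names and the fixed threshold of 128 are assumptions chosen for illustration, and a real document may need a different cut-off.

```python
from PIL import Image

def to_grayscale(img: Image.Image) -> Image.Image:
    """8-bit grayscale: one channel with 256 levels."""
    return img.convert("L")

def to_black_and_white(img: Image.Image, threshold: int = 128) -> Image.Image:
    """1-bit image using a fixed cut-off (the default value is an assumption)."""
    gray = img.convert("L")
    # Apply the cut-off explicitly; a bare convert("1") would dither instead
    binary = gray.point(lambda p: 255 if p >= threshold else 0)
    return binary.convert("1")

# Tiny synthetic page: dark left half, light right half
page = Image.new("RGB", (4, 2), (240, 240, 240))
page.paste((20, 20, 20), (0, 0, 2, 2))
print(to_grayscale(page).mode, to_black_and_white(page).mode)
```

The explicit threshold step matters because dithering scatters isolated dots across the page, which is the kind of noise OCR handles badly.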
3. Ensure Proper Lighting (For Camera Scans)
When photographing documents with a phone:
- Use bright, even lighting — stand near a window during the day
- Avoid flash — it creates hot spots and glare
- No shadows — don’t hold the phone over the document; use a stand or lean it against a wall
- Diffuse light is best — overcast sky or multiple light sources
The Google PhotoScan app uses multiple shots to eliminate glare automatically.
4. Deskew and Straighten Documents
Even a 1–2° tilt significantly reduces OCR accuracy. Most OCR tools have automatic deskew, but you can also:
- Use a flatbed scanner, or a document feeder with the paper guides snug against the page, so each sheet is captured flat and straight
- Use OCRmyPDF’s --deskew option for automatic correction
- In Photoshop/GIMP, use the Straighten tool before OCR
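For readers who want to see how automatic deskew can work, here is one common approach, sketched with Pillow and NumPy: try a range of small rotations and keep the one whose row sums show the sharpest contrast between text lines and the gaps between them. This is not how OCRmyPDF or any particular tool implements deskew; the function names, the ±5° search range, and the 0.5° step are all assumptions.

```python
import numpy as np
from PIL import Image, ImageOps

def estimate_deskew_angle(gray: Image.Image, max_angle: float = 5.0,
                          step: float = 0.5) -> float:
    """Return the rotation in degrees that best aligns text rows horizontally."""
    ink = ImageOps.invert(gray.convert("L"))  # text becomes bright, paper dark
    best_angle, best_score = 0.0, -1.0
    for angle in np.arange(-max_angle, max_angle + step, step):
        rotated = ink.rotate(float(angle), resample=Image.BILINEAR, fillcolor=0)
        # Row sums: aligned text gives sharp peaks (lines) and valleys (gaps)
        profile = np.asarray(rotated, dtype=np.float64).sum(axis=1)
        score = float(profile.var())
        if score > best_score:
            best_angle, best_score = float(angle), score
    return best_angle

def deskew(gray: Image.Image) -> Image.Image:
    """Rotate the page by the estimated angle, filling exposed corners with white."""
    page = gray.convert("L")
    angle = estimate_deskew_angle(page)
    return page.rotate(angle, resample=Image.BICUBIC, fillcolor=255)
```

Calling `deskew(Image.open("scan.png"))` (a hypothetical file name) returns a straightened grayscale copy. The brute-force search is slow on large scans, so downscaling the image before estimating the angle is a reasonable refinement.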
5. Increase Contrast Before OCR
Low contrast between text and background is the number one cause of OCR errors on old or faded documents. Fix this with image preprocessing:
from PIL import Image, ImageEnhance
img = Image.open('faded_document.jpg').convert('L') # grayscale
enhancer = ImageEnhance.Contrast(img)
img_high_contrast = enhancer.enhance(2.5) # increase contrast by 2.5x
img_high_contrast.save('enhanced.jpg')
Or use free tools like GIMP: Colors → Levels → drag the black/white input points inward.
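Dragging the Levels input points inward amounts to a linear remapping: everything at or below the black point becomes 0, everything at or above the white point becomes 255, and values in between are stretched across the full range. A minimal sketch of that mapping follows; the function name and the example points 60 and 200 are hypothetical values for a faded page, not settings taken from GIMP.

```python
def levels_lut(black: int, white: int) -> list[int]:
    """256-entry lookup table: values <= black map to 0, values >= white map to 255."""
    if not 0 <= black < white <= 255:
        raise ValueError("expected 0 <= black < white <= 255")
    span = white - black
    return [min(255, max(0, round((v - black) * 255 / span))) for v in range(256)]

# Hypothetical input points: darkest ink around 60, paper around 200
lut = levels_lut(60, 200)
print(lut[60], lut[130], lut[200])
```

With Pillow, the table can be applied to a grayscale image through `img.point(levels_lut(60, 200))`.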
6. Remove Background Noise
Spotted or textured backgrounds (recycled paper, watermarked letterheads) confuse OCR. Preprocessing:
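import cv2
img = cv2.imread('noisy.jpg', cv2.IMREAD_GRAYSCALE)
if img is None:  # imread returns None instead of raising when the file cannot be read
    raise FileNotFoundError('noisy.jpg could not be read')
# Threshold to binary (Otsu's method: automatic threshold selection)
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)
# Save as PNG (lossless) so JPEG compression does not blur the black/white edges
cv2.imwrite('cleaned.png', binary)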
OCRmyPDF’s --clean flag uses unpaper to remove background artifacts automatically.
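Isolated specks are another kind of background noise. One standard remedy, shown here as a sketch rather than as what unpaper does internally, is a median filter: each pixel is replaced by the median of its neighbourhood, so single stray dots vanish while larger strokes survive. The helper name and the 3×3 window are assumptions.

```python
from PIL import Image, ImageFilter

def remove_speckle(gray: Image.Image, size: int = 3) -> Image.Image:
    """Replace each pixel with the median of its size x size neighbourhood."""
    return gray.convert("L").filter(ImageFilter.MedianFilter(size=size))

# Synthetic check: one isolated dark speck on a white page
page = Image.new("L", (9, 9), 255)
page.putpixel((4, 4), 0)
cleaned = remove_speckle(page)
print(page.getpixel((4, 4)), cleaned.getpixel((4, 4)))
```

A larger window removes bigger blotches but also starts to erode thin strokes and punctuation, so 3 is a cautious starting point for small fonts.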
7. Select the Correct Language
Always specify the document language. An OCR engine using an English dictionary to read French will produce errors on accented characters.
# Tesseract reads image files (PNG, TIFF, JPEG), not PDFs; convert scanned PDF pages to images first
# Specify the language
tesseract page.png output -l fra
# Multi-language document
tesseract page.png output -l fra+eng
Most online OCR tools have a language dropdown — don’t leave it on “Auto” if you know the language.
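Tesseract can only use language data that is actually installed, and `tesseract --list-langs` prints what is available. The sketch below parses that listing and reports which requested codes are missing. The helper names are hypothetical, and the sample text is illustrative: the real header line and language list depend on the installation.

```python
def installed_languages(list_langs_output: str) -> set[str]:
    """Parse the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in list_langs_output.splitlines() if line.strip()]
    return set(lines[1:])  # the first line is a header, not a language code

def missing_languages(requested: str, list_langs_output: str) -> list[str]:
    """Return the codes in a spec such as 'fra+eng' that are not installed."""
    available = installed_languages(list_langs_output)
    return [code for code in requested.split("+") if code not in available]

# Illustrative listing only; not captured from a real system
sample = 'List of available languages in "/usr/share/tessdata/" (2):\neng\nosd\n'
print(missing_languages("fra+eng", sample))
```

In practice the listing would come from something like `subprocess.run(["tesseract", "--list-langs"], capture_output=True, text=True)`. Depending on the Tesseract version, the list may appear on stdout or stderr, so checking both is a safe assumption.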
8. Use the Right OCR Mode for Your Document
Tesseract has multiple “page segmentation modes” (PSM):
# Default (auto segment)
tesseract doc.png out -l eng --psm 3
# Single column of text
tesseract doc.png out --psm 4
# Single line (good for forms)
tesseract doc.png out --psm 7
# Single word (good for labels)
tesseract doc.png out --psm 8
For most documents, the default (auto) works well. For forms, labels, and single-column text, specifying the PSM improves accuracy.
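When OCR runs from a script, the mode can be chosen from the kind of document being processed. This is a small sketch that builds the command line for the four modes listed above; the layout labels and the function name are made up for illustration.

```python
# Page segmentation modes covered in this tip, keyed by hypothetical layout labels
PSM_BY_LAYOUT = {
    "auto": 3,           # automatic page segmentation (the default)
    "single_column": 4,
    "single_line": 7,
    "single_word": 8,
}

def tesseract_command(image: str, output_base: str,
                      layout: str = "auto", lang: str = "eng") -> list[str]:
    """Build the argument list for one Tesseract run."""
    if layout not in PSM_BY_LAYOUT:
        raise ValueError(f"unknown layout: {layout!r}")
    return ["tesseract", image, output_base,
            "-l", lang, "--psm", str(PSM_BY_LAYOUT[layout])]

print(tesseract_command("label.png", "out", layout="single_word"))
```

The returned list can be passed to `subprocess.run(..., check=True)`. Building it as a list rather than a single string avoids shell quoting problems with file names that contain spaces.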
9. Use Post-Processing Spell Check
OCR output almost always contains some errors. Running a spell checker on the output catches many of them:
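import re
from spellchecker import SpellChecker

spell = SpellChecker()
ocr_text = "Tbe quick brown fox"  # placeholder: substitute your OCR output

def fix_word(match):
    word = match.group(0)
    if not spell.unknown([word]):
        return word  # already a known word
    # correction() returns None when it has no suggestion
    return spell.correction(word) or word

# Replace whole words only, so a fix never rewrites part of a longer word
ocr_text = re.sub(r"[A-Za-z]+", fix_word, ocr_text)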
For domain-specific documents (legal, medical), add a custom dictionary.
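One way to apply a domain vocabulary, sketched here with the standard library's `difflib` rather than with any spell-checking package, is to snap words that closely resemble a known term onto that term. The term list, the function name, and the 0.85 similarity cut-off are all assumptions for illustration.

```python
import difflib
import re

def correct_with_terms(text: str, terms: list[str], cutoff: float = 0.85) -> str:
    """Replace words that closely resemble a known domain term."""
    lookup = {term.lower(): term for term in terms}

    def fix(match: re.Match) -> str:
        word = match.group(0)
        close = difflib.get_close_matches(word.lower(), list(lookup), n=1, cutoff=cutoff)
        return lookup[close[0]] if close else word

    # Match alphabetic runs only, so punctuation and spacing are left untouched
    return re.sub(r"[A-Za-z]+", fix, text)

# Hypothetical legal vocabulary
terms = ["plaintiff", "defendant", "subpoena"]
print(correct_with_terms("The plaintifl, served a subpcena.", terms))
```

A high cut-off keeps ordinary words from being pulled toward a domain term. Note that a matched word takes the casing stored in the term list, so a capitalized match at the start of a sentence would be lowercased.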
10. Split Large Documents
For large multi-page PDFs, processing page by page can be more reliable than processing the whole document at once: a failure on one page does not stop the rest, and problem pages can be rerun individually.
# Split with pdftk
pdftk input.pdf burst output page_%04d.pdf
# Then OCR each page
for file in page_*.pdf; do
  ocrmypdf --force-ocr "$file" "ocr_${file}"
done
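After the per-page run, the pages usually need to be joined back into one file. The sketch below builds a pdftk `cat` command for that step. It assumes the zero-padded `ocr_page_0001.pdf` style names produced by the loop above, and the function name and default output name are hypothetical.

```python
def merge_command(ocr_pages, output: str = "combined_ocr.pdf") -> list[str]:
    """Build a pdftk call that concatenates the OCR'd pages in page order."""
    ordered = sorted(str(page) for page in ocr_pages)  # zero-padded names sort correctly
    if not ordered:
        raise ValueError("no pages to merge")
    return ["pdftk", *ordered, "cat", "output", output]

print(merge_command(["ocr_page_0002.pdf", "ocr_page_0001.pdf"]))
```

To run it, pass the result to `subprocess.run(..., check=True)`, collecting the inputs with something like `pathlib.Path(".").glob("ocr_page_*.pdf")`.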
Quick Reference Checklist
- 300 DPI scan resolution
- Grayscale mode for text documents
- Document flat and straight
- Even lighting, no shadows
- Correct language selected
- Background noise removed if present
- Contrast enhanced for faded documents
Follow these tips and OCR accuracy on most standard documents can rise from around 85–90% to 97–99%.
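Accuracy figures like these are typically character-level: compare the OCR output with a hand-checked transcription and count the edits needed to turn one into the other. The following is a minimal sketch of that measurement using Levenshtein distance. The function names are illustrative, and the sample strings are invented rather than taken from a real scan.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]

def character_accuracy(reference: str, ocr_output: str) -> float:
    """Share of reference characters recovered correctly, floored at 0."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return max(0.0, 1 - edit_distance(reference, ocr_output) / len(reference))

# Invented example: three misread characters in a 15-character line
print(f"{character_accuracy('Scan at 300 DPI', 'Scan at 3OO DPl'):.0%}")
```

Measuring a few representative pages before and after changing scan settings shows which of the tips actually help for a given set of documents.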