How to Extract Text From a PDF Image (Scanned Document)
Step-by-step guide to extracting selectable, copyable text from image-based PDFs and scanned documents using free online and offline tools.
AltoUnlockPDF Team
PDF Tools Expert
When you receive a PDF that’s actually a scanned image, you can’t simply click and drag to select text. The document is a photograph of a page, so the text is “baked in” to the pixels. To get it out, you need OCR (optical character recognition), which converts those pixels back into characters.
Here are five methods, from simplest to most powerful.
Why Can’t You Copy Text From Some PDFs?
There are two types of PDFs:
- Native/digital PDFs — created from Word, InDesign, etc. Text is stored as actual characters. You can search and copy freely.
- Image-based/scanned PDFs — the page is stored as a raster image. No text data exists; just pixels.
If you press Ctrl+A (Cmd+A on a Mac) in a PDF and no text gets selected, or the whole page is selected as a single picture, you have an image-based PDF.
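The same check can be done in code when you have many files to sort. The sketch below assumes the third-party pypdf library, which this article does not otherwise use, and a hypothetical threshold of 20 characters per page. It treats a page as having a text layer only if that many characters can be extracted from it.

```python
def classify_pdf(page_texts, min_chars_per_page=20):
    """Label a PDF as 'native', 'scanned', 'mixed', or 'empty'.

    page_texts is a list with the text extracted from each page.
    min_chars_per_page is an illustrative threshold, not a standard value.
    """
    if not page_texts:
        return "empty"
    pages_with_text = sum(
        1 for text in page_texts
        if len((text or "").strip()) >= min_chars_per_page
    )
    if pages_with_text == 0:
        return "scanned"   # no usable text layer on any page
    if pages_with_text == len(page_texts):
        return "native"    # every page has extractable text
    return "mixed"         # some pages are images, some are text


def read_page_texts(pdf_path):
    """Extract the text layer of each page (assumes: pip install pypdf)."""
    from pypdf import PdfReader  # imported here so classify_pdf works without it
    reader = PdfReader(pdf_path)
    return [page.extract_text() or "" for page in reader.pages]
```

A call such as `classify_pdf(read_page_texts('contract.pdf'))` would then tell you whether OCR is needed at all. A "mixed" result usually means only some pages were scanned.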
Method 1: AltoUnlockPDF (Fast, Free, Online)
- Visit our OCR tool
- Upload your PDF
- Select output format: Searchable PDF (keeps original appearance + adds text layer) or Plain Text (.txt)
- Choose language
- Download output
Takes about 5–30 seconds per page. No signup required.
Method 2: Google Drive (Free, Highly Accurate)
- Upload PDF to Google Drive
- Right-click → Open with → Google Docs
- Wait 30–60 seconds for OCR to complete
- The document opens with extracted text above/below each page image
- Select all → Copy → paste wherever needed
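Works well for short documents of roughly 1–20 pages. It is free with a Google account, but Google recommends files of 2 MB or less for conversion, so split or compress larger scans first.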
Method 3: Python — Programmatic Extraction
For developers or bulk processing:
import pdf2image
import pytesseract

def extract_text_from_scanned_pdf(pdf_path):
    # Convert PDF pages to images (300 DPI is a common choice for OCR)
    images = pdf2image.convert_from_path(pdf_path, dpi=300)
    text_pages = []
    for i, image in enumerate(images):
        # Run OCR on each page
        text = pytesseract.image_to_string(image, lang='eng')
        text_pages.append(f"--- Page {i+1} ---\n{text}")
        print(f"Processed page {i+1}/{len(images)}")
    return '\n\n'.join(text_pages)

# Usage
text = extract_text_from_scanned_pdf('contract.pdf')
# Write as UTF-8 so non-ASCII characters in the OCR output are saved correctly
with open('contract_text.txt', 'w', encoding='utf-8') as f:
    f.write(text)
Dependencies:
pip install pdf2image pytesseract Pillow
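# Also install the system tools: Poppler (for pdf2image) and Tesseract (for pytesseract)
# e.g. macOS: brew install poppler tesseract
# e.g. Debian/Ubuntu: sudo apt install poppler-utils tesseract-ocr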
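For bulk processing, the single-file function can be wrapped in a loop over a folder. The sketch below is one possible way to do that, not a definitive implementation. The name `batch_extract` and the folder layout are hypothetical, and the OCR function is passed in as a parameter so that one bad scan does not stop the whole run.

```python
from pathlib import Path


def batch_extract(input_dir, output_dir, extract):
    """Run `extract` on every .pdf in input_dir and save one .txt per file.

    `extract` is any function that takes a PDF path (str) and returns text,
    for example extract_text_from_scanned_pdf from this article.
    Returns (written, failed): output paths, and (filename, error) pairs.
    """
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written, failed = [], []
    # Note: this pattern matches lowercase ".pdf" only on case-sensitive systems
    for pdf_path in sorted(input_dir.glob("*.pdf")):
        try:
            text = extract(str(pdf_path))
        except Exception as exc:
            # Record the failure and continue with the next file
            failed.append((pdf_path.name, str(exc)))
            continue
        out_path = output_dir / f"{pdf_path.stem}.txt"
        out_path.write_text(text, encoding="utf-8")
        written.append(out_path)
    return written, failed
```

With the function from this section, usage might look like `written, failed = batch_extract('scans', 'scans_text', extract_text_from_scanned_pdf)`, where `scans` and `scans_text` are placeholder folder names.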
Method 4: Adobe Acrobat (Recognize Text)
Adobe Acrobat can recognize text in scanned PDFs, but the feature belongs to the paid Acrobat Standard and Pro editions. The free Acrobat Reader can open a scanned PDF but cannot run OCR on it.
- Open the scanned PDF in Adobe Acrobat
- Open the Scan & OCR tool
- Click “Recognize Text” and run it on the current file
- Wait for processing
- Now Ctrl+F search and text selection work
Limitations: requires a paid subscription or trial. If you only have the free Reader, use Method 1, 2, or 3 instead.
Method 5: macOS Preview (Built-In on Mac)
Preview on recent macOS versions includes basic OCR through Apple’s Live Text feature:
- Open the scanned PDF in Preview
- Choose Tools → Text Selection
- Click and drag over the text you want, then copy it with Cmd+C
- If the text cannot be selected, Live Text is probably not available for that file on your Mac or macOS version; use one of the other methods instead
Live Text, available in macOS Monterey and later on supported Macs, recognizes text in images automatically when you use the selection tool.
Extracting Text While Preserving Formatting
Sometimes you need the text with its original structure (columns, tables, headings). Tools for this:
- ABBYY FineReader (paid) — best structure preservation
- Adobe Acrobat Pro (paid) — good table extraction
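- Camelot (Python, free): built specifically for tables in PDFs. It reads only text-based PDFs, so run OCR on a scan first (for example, save it as a searchable PDF):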
import camelot
tables = camelot.read_pdf('annual_report.pdf', pages='1-5')
tables[0].df # Returns a pandas DataFrame
tables.export('tables.csv', f='csv')
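Because Camelot reads the text layer of a PDF, a scanned file generally needs OCR before table extraction. The sketch below shows one possible pipeline. It assumes the third-party ocrmypdf package, which this article does not cover and which needs Tesseract and Ghostscript installed, and it uses hypothetical helper names. Treat it as an illustration rather than a tested recipe for every document.

```python
from pathlib import Path


def ocr_output_path(pdf_path):
    """Return a sibling path for the searchable copy, e.g. report_ocr.pdf."""
    p = Path(pdf_path)
    return p.with_name(f"{p.stem}_ocr{p.suffix}")


def extract_tables_from_scan(pdf_path, pages="1"):
    """OCR a scanned PDF, then return its tables as pandas DataFrames.

    Assumes: pip install ocrmypdf camelot-py
    """
    import ocrmypdf  # imported lazily; only needed when this function runs
    import camelot

    searchable = ocr_output_path(pdf_path)
    # skip_text=True leaves pages that already have a text layer untouched
    ocrmypdf.ocr(str(pdf_path), str(searchable), language=["eng"], skip_text=True)
    tables = camelot.read_pdf(str(searchable), pages=pages)
    return [tables[i].df for i in range(len(tables))]
```

Table quality after OCR depends heavily on scan resolution, so inspect the returned DataFrames before relying on them.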
Which Method Should You Use?
| Scenario | Best Method |
|---|---|
| Quick, one-time extraction | AltoUnlockPDF or Google Drive |
| Bulk processing (100+ PDFs) | Python + pytesseract |
| Mac user | macOS Live Text + Preview |
| Need tables preserved | ABBYY FineReader, or Camelot after OCR |
| Developer building a product | Tesseract via pytesseract, or a cloud OCR API |
The best free combination for most people: Google Drive for single documents, and a Python script for batch processing.