OCR February 14, 2026 · 6 min read

How to Extract Text From a PDF Image (Scanned Document)

Step-by-step guide to extracting selectable, copyable text from image-based PDFs and scanned documents using free online and offline tools.


AltoUnlockPDF Team

PDF Tools Expert

When you receive a PDF that’s actually a scanned image, you can’t simply click and drag to select text. The document is a photograph of a page — the text is “baked in.” To get the text out, you need OCR.

Here’s every method, from simplest to most powerful.


Why Can’t You Copy Text From Some PDFs?

There are two types of PDFs:

  1. Native/digital PDFs — created from Word, InDesign, etc. Text is stored as actual characters. You can search and copy freely.
  2. Image-based/scanned PDFs — the page is stored as a raster image. No text data exists; just pixels.

If you press Ctrl+A (Cmd+A on a Mac) in a scanned PDF and no text is selected, you have an image-based PDF.
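
The same check can be automated when you have many files to sort. The following is a minimal sketch, assuming the third-party pypdf package (not part of the original article) is installed; the helper names and the 20-character threshold are illustrative assumptions, not an established rule.

```python
def find_image_only_pages(page_texts, min_chars=20):
    """Return 1-based numbers of pages whose extracted text is (nearly) empty.

    page_texts: one string (or None) per page, as produced by a PDF text extractor.
    min_chars:  assumed threshold; pages with fewer non-whitespace characters
                are treated as image-only.
    """
    suspect = []
    for number, text in enumerate(page_texts, start=1):
        # Drop all whitespace so stray newlines do not count as text
        visible = "".join((text or "").split())
        if len(visible) < min_chars:
            suspect.append(number)
    return suspect


def image_only_pages_in_pdf(pdf_path, min_chars=20):
    """Read a PDF with pypdf and report the pages that probably need OCR."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(pdf_path)
    texts = [page.extract_text() for page in reader.pages]
    return find_image_only_pages(texts, min_chars)
```

If image_only_pages_in_pdf('contract.pdf') returns every page number, the file is most likely a pure scan; an empty list suggests a native PDF whose text can be copied directly.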


Method 1: AltoUnlockPDF (Fast, Free, Online)

  1. Visit our OCR tool
  2. Upload your PDF
  3. Select output format: Searchable PDF (keeps original appearance + adds text layer) or Plain Text (.txt)
  4. Choose language
  5. Download output

Takes about 5–30 seconds per page. No signup required.


Method 2: Google Drive (Free, Highly Accurate)

  1. Upload PDF to Google Drive
  2. Right-click → Open with → Google Docs
  3. Wait 30–60 seconds for OCR to complete
  4. The document opens with extracted text above/below each page image
  5. Select all → Copy → paste wherever needed

Works well for documents of roughly 1–20 pages. Free with a Google account, though Drive applies its own file-size and storage limits.


Method 3: Python — Programmatic Extraction

For developers or bulk processing:

import pdf2image
import pytesseract


def extract_text_from_scanned_pdf(pdf_path, lang='eng', dpi=300):
    """OCR every page of a scanned PDF and return the text as one string."""
    # Convert PDF pages to PIL images (requires poppler)
    images = pdf2image.convert_from_path(pdf_path, dpi=dpi)

    text_pages = []
    for i, image in enumerate(images, start=1):
        # Run OCR on each page (requires the tesseract binary)
        text = pytesseract.image_to_string(image, lang=lang)
        text_pages.append(f"--- Page {i} ---\n{text}")
        print(f"Processed page {i}/{len(images)}")

    return '\n\n'.join(text_pages)


# Usage
text = extract_text_from_scanned_pdf('contract.pdf')
# Write as UTF-8 so non-ASCII characters survive on every platform
with open('contract_text.txt', 'w', encoding='utf-8') as f:
    f.write(text)
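
For bulk processing, the same function can be applied to a whole folder. This is a rough sketch rather than a finished tool: the ocr_folder name, the folder layout, and the choice to skip failed files instead of stopping are all assumptions made for illustration.

```python
from pathlib import Path


def ocr_folder(input_dir, output_dir, extract_fn):
    """Run extract_fn on every PDF in input_dir and save one .txt per PDF.

    extract_fn: a callable that takes a PDF path and returns text, such as
                extract_text_from_scanned_pdf from this article.
    Returns (written, failed): output paths, and (filename, error) pairs.
    """
    out = Path(output_dir)
    out.mkdir(parents=True, exist_ok=True)

    written, failed = [], []
    for pdf in sorted(Path(input_dir).glob("*.pdf")):
        try:
            text = extract_fn(str(pdf))
        except Exception as exc:
            # Record the failure and keep going with the remaining files
            failed.append((pdf.name, str(exc)))
            continue
        target = out / (pdf.stem + ".txt")
        target.write_text(text, encoding="utf-8")
        written.append(target)
    return written, failed
```

A call such as ocr_folder('scans', 'scans_text', extract_text_from_scanned_pdf) would write scans_text/contract.txt for scans/contract.pdf and return the list of files that could not be processed.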

Dependencies:

pip install pdf2image pytesseract Pillow
# Also install: poppler (for pdf2image) and tesseract (for pytesseract)
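
OCR output is rarely perfect, so it can help to flag words the engine was unsure about. pytesseract's image_to_data returns a per-word confidence alongside the text; the sketch below filters on it. The function names and the cut-off of 60 are assumptions for illustration, and a sensible threshold depends on your scans.

```python
def low_confidence_words(ocr_data, threshold=60.0):
    """List (word, confidence) pairs that Tesseract was unsure about.

    ocr_data:  dict with parallel 'text' and 'conf' lists, the shape returned by
               pytesseract.image_to_data(..., output_type=pytesseract.Output.DICT).
    threshold: assumed cut-off on Tesseract's 0-100 confidence scale.
    """
    flagged = []
    for word, conf in zip(ocr_data["text"], ocr_data["conf"]):
        score = float(conf)  # conf may arrive as str, int, or float
        # Entries with conf == -1 are layout boxes, not recognized words
        if word.strip() and 0 <= score < threshold:
            flagged.append((word, score))
    return flagged


def review_page(image, lang="eng", threshold=60.0):
    """Run Tesseract on one PIL image and return its low-confidence words."""
    import pytesseract  # third-party, same dependency as the script above

    data = pytesseract.image_to_data(
        image, lang=lang, output_type=pytesseract.Output.DICT
    )
    return low_confidence_words(data, threshold)
```

Pages that return many flagged words are good candidates for a rescan at higher resolution or a manual check.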

Method 4: Adobe Acrobat Reader (Recognize Text)

Even the free Adobe Acrobat Reader can recognize text in scanned PDFs:

  1. Open the scanned PDF in Adobe Acrobat Reader
  2. Look for the notification bar: “This document contains only images”
  3. Click “Recognize Text” (appears in the right panel or notification)
  4. Wait for processing
  5. Now Ctrl+F search and text selection work

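Limitations: text recognition depends on your Acrobat version and plan. Adobe's full Scan & OCR tool belongs to the paid Acrobat tiers, so the free Reader may prompt you to upgrade or may not let you export the text layer.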


Method 5: macOS Preview (Built-In on Mac)

macOS Preview has improved significantly and now includes basic OCR:

  1. Open the scanned PDF in Preview
  2. Choose Tools → Text Selection
  3. Click and drag over the text you want to copy
  4. If nothing can be selected, check that Live Text is enabled in your Mac's Language & Region settings, then reopen the PDF

Apple’s Live Text feature in macOS Monterey+ recognizes text in images automatically when you use the selection tool.


Extracting Text While Preserving Formatting

Sometimes you need the text with its original structure (columns, tables, headings). Tools for this:

  • ABBYY FineReader (paid) — best structure preservation
  • Adobe Acrobat Pro (paid) — good table extraction
  • Camelot (Python, free) — specifically for tables in PDFs:
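
import camelot

# Camelot only works on text-based PDFs, so a scanned document
# needs an OCR text layer first (e.g. a searchable PDF)
tables = camelot.read_pdf('annual_report.pdf', pages='1-5')
print(tables[0].df)  # First table as a pandas DataFrame
tables.export('tables.csv', f='csv')  # Writes one CSV file per table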
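
Table extraction can fail quietly, so it is worth checking the quality of each table before using it. Camelot attaches a parsing_report to every table with accuracy and whitespace figures; the sketch below keeps only tables that pass two thresholds. The function names and the threshold values are assumptions chosen for illustration.

```python
def select_reliable_tables(reports, min_accuracy=90.0, max_whitespace=50.0):
    """Return indexes of tables whose parsing report looks trustworthy.

    reports: list of dicts shaped like Camelot's table.parsing_report,
             e.g. {'accuracy': 99.02, 'whitespace': 12.24, 'order': 1, 'page': 1}.
    """
    keep = []
    for index, report in enumerate(reports):
        accurate = report["accuracy"] >= min_accuracy
        dense = report["whitespace"] <= max_whitespace
        if accurate and dense:
            keep.append(index)
    return keep


def export_reliable_tables(pdf_path, pages="1-5"):
    """Extract tables with Camelot and write only the reliable ones to CSV."""
    import camelot  # third-party: pip install camelot-py

    tables = camelot.read_pdf(pdf_path, pages=pages)
    reports = [table.parsing_report for table in tables]
    for index in select_reliable_tables(reports):
        tables[index].to_csv(f"table_{index + 1}.csv")
    return reports
```

Returning the reports lets you inspect the tables that were skipped and decide whether to retry them with different settings.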

Which Method Should You Use?

| Scenario | Best Method |
|---|---|
| Quick, one-time extraction | AltoUnlockPDF or Google Drive |
| Bulk processing (100+ PDFs) | Python + pytesseract |
| Mac user | macOS Live Text + Preview |
| Need tables preserved | ABBYY FineReader or Camelot |
| Developer building a product | pytesseract API or cloud OCR API |

The best free combination for most people: Google Drive for single documents, Python script for batch processing.
