How to Extract Tables From PDF With OCR (Free Methods)
Extract structured table data from PDF documents — both native PDFs and scanned images — using free tools and Python libraries.
AltoUnlockPDF Team
PDF Tools Expert
Extracting table data from PDFs is one of the most frustrating tasks in data work. Whether you’re a researcher pulling numbers from published reports, an analyst importing financial data, or an accountant reconciling invoices, getting structured data out of PDFs without manual retyping is a huge time saver.
Types of PDF Tables
Before choosing a tool, identify what type of PDF you’re dealing with:
- Native digital PDF: the PDF was generated directly by software, so the text is stored as real characters. Table lines may or may not be present as vector graphics.
- Scanned/image PDF: each page is an image, so the “table” is just pixels. OCR is required before any data can be extracted.
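If you are not sure which type you have, a quick check is whether any text can be extracted from each page. The sketch below is one way to do that with pdfplumber; the function names and the 20-character threshold are assumptions for illustration, not part of any library.

```python
def classify_pages(page_texts, min_chars=20):
    """Label each page 'native' or 'scanned' from its extracted text.

    min_chars is an arbitrary threshold (an assumption): pages with
    fewer extractable characters are treated as images needing OCR.
    """
    labels = []
    for text in page_texts:
        chars = len((text or "").strip())  # extract_text() may yield None or ""
        labels.append("native" if chars >= min_chars else "scanned")
    return labels


def classify_pdf(pdf_path, min_chars=20):
    """Open a PDF with pdfplumber and classify every page."""
    import pdfplumber  # imported here so classify_pages works without it

    with pdfplumber.open(pdf_path) as pdf:
        texts = [page.extract_text() for page in pdf.pages]
    return classify_pages(texts, min_chars)
```

Note that a scanned PDF that already carries an OCR text layer will be labeled "native" by this check, which is usually what you want, since its text can be read directly.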
Method 1: Camelot (Python — Best for Native PDFs)
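Camelot is purpose-built for table extraction from native PDFs and is often the most accurate free option for them. It offers two parsing modes: lattice, for tables with visible ruling lines, and stream, for tables whose columns are separated only by whitespace.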
import camelot
# Lattice mode — for tables with visible borders/lines
tables = camelot.read_pdf('annual_report.pdf', flavor='lattice', pages='1-3')
# Stream mode — for tables without borders (space-separated columns)
tables = camelot.read_pdf('financial_data.pdf', flavor='stream', pages='all')
# Check accuracy score
print(tables[0].parsing_report)
# Export to various formats
tables.export('output.csv', f='csv')
tables.export('output.xlsx', f='excel')
tables.export('output.json', f='json')
# Access as pandas DataFrame
df = tables[0].df
print(df.head())
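When it is unclear which flavor suits a file, one heuristic is to run both and keep the one with the higher mean accuracy in parsing_report. This is a sketch only: best_flavor and extract_with_best_flavor are illustrative names, not Camelot functions, and the accuracy score is a rough signal rather than a guarantee, so spot-check the winner.

```python
def best_flavor(reports_by_flavor):
    """Return the flavor whose parsing reports have the highest mean accuracy.

    reports_by_flavor maps a flavor name to a list of parsing_report
    dicts; a flavor that found no tables scores 0.
    """
    def mean_accuracy(reports):
        if not reports:
            return 0.0
        return sum(r["accuracy"] for r in reports) / len(reports)

    return max(reports_by_flavor, key=lambda f: mean_accuracy(reports_by_flavor[f]))


def extract_with_best_flavor(pdf_path, pages="1"):
    """Run Camelot in both modes and return the better-scoring TableList."""
    import camelot  # imported here so best_flavor can be used on its own

    results = {
        flavor: camelot.read_pdf(pdf_path, flavor=flavor, pages=pages)
        for flavor in ("lattice", "stream")
    }
    reports = {
        flavor: [table.parsing_report for table in tables]
        for flavor, tables in results.items()
    }
    return results[best_flavor(reports)]
```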
Method 2: Tabula (GUI + Python)
Tabula is a desktop app (free, open-source) that lets you draw boxes around tables visually and extract them.
- Download and open Tabula
- Upload your PDF
- Draw a selection box around the table
- Click Preview & Export Extracted Data, then export as CSV, TSV, or JSON (CSV opens directly in Excel)
Tabula is also available as a Python library, tabula-py, which wraps the same engine and requires Java to be installed:
import tabula
# Extract all tables from a page
tables = tabula.read_pdf('report.pdf', pages='2', multiple_tables=True)
# Convert every table to CSV (convert_into supports csv, tsv, and json output, not xlsx)
tabula.convert_into('report.pdf', 'output.csv', output_format='csv', pages='all')
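The selection box you draw in the Tabula app has a counterpart in the Python library: the area argument of read_pdf, given as top, left, bottom, right in PDF points (72 per inch). The helpers below are a minimal sketch; inches_to_area and read_region are illustrative names, not part of tabula-py.

```python
POINTS_PER_INCH = 72  # PDF coordinates are measured in points


def inches_to_area(top, left, bottom, right):
    """Convert a box measured in inches from the page's top-left corner
    into the [top, left, bottom, right] list (in points) that tabula expects."""
    return [value * POINTS_PER_INCH for value in (top, left, bottom, right)]


def read_region(pdf_path, page, area):
    """Extract tables from one rectangular region of one page."""
    import tabula  # imported here so inches_to_area works without Java/tabula

    return tabula.read_pdf(pdf_path, pages=page, area=area)
```

For example, read_region('report.pdf', 2, inches_to_area(1, 0.5, 5, 8)) would limit extraction to a box starting one inch from the top and half an inch from the left of page 2.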
Method 3: pdfplumber (Python — Precise Positioning)
pdfplumber gives you detailed control over PDF parsing and is excellent for complex multi-column layouts.
import pdfplumber

with pdfplumber.open('document.pdf') as pdf:
    for page_num, page in enumerate(pdf.pages, 1):
        # Each table is a list of rows; each row is a list of cell strings (or None)
        tables = page.extract_tables()
        for table_num, table in enumerate(tables, 1):
            print(f"Page {page_num}, Table {table_num}:")
            for row in table:
                print(row)
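extract_tables() returns plain lists, with None for empty cells and embedded newlines for wrapped text, so a little cleanup is usually needed before saving. The sketch below uses only the standard library and assumes the first row of the table is the header; table_to_records and write_csv are illustrative names.

```python
import csv


def table_to_records(table):
    """Turn a pdfplumber table (list of rows) into a list of dicts.

    Assumes the first row holds the column names. None cells become
    empty strings and line breaks inside a cell become spaces.
    """
    header = [(cell or "").strip() for cell in table[0]]
    records = []
    for row in table[1:]:
        cells = [(cell or "").replace("\n", " ").strip() for cell in row]
        records.append(dict(zip(header, cells)))
    return records


def write_csv(records, csv_path):
    """Write the records to a CSV file with a header row."""
    with open(csv_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=list(records[0].keys()))
        writer.writeheader()
        writer.writerows(records)
```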
Method 4: Google Sheets Import (No-Code)
For one-off extractions without any coding:
- Upload the PDF to Google Drive
- Right-click → Open with Google Docs
- Google’s OCR extracts the content including table structure
- Copy the table → paste into Google Sheets
- Clean up any formatting issues
Alternatively: Google Sheets → File → Import → select PDF (works on some digital PDFs).
Method 5: Scanned PDFs — OCR + Table Extraction
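Scanned PDFs need OCR first. The example below uses pdf2image (which requires Poppler) to render pages and pytesseract (which requires the Tesseract binary) to read them: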
import pdf2image
import pytesseract  # Output.DATAFRAME below also requires pandas to be installed

def extract_table_from_scanned_pdf(pdf_path, page_num=1):
    # Render only the requested page to an image (300 dpi works well for OCR)
    images = pdf2image.convert_from_path(
        pdf_path, dpi=300, first_page=page_num, last_page=page_num)
    image = images[0]
    # OCR to a DataFrame with one row per word, including position and confidence
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DATAFRAME)
    # Keep confident results and drop rows with no text
    data = data[data.conf > 60].dropna(subset=['text'])
    return data

# Alternative: hOCR output (returned as bytes) keeps layout markup for table reconstruction
image = pdf2image.convert_from_path('scanned.pdf', dpi=300)[0]  # 'scanned.pdf' is a placeholder
hocr = pytesseract.image_to_pdf_or_hocr(image, extension='hocr')
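The OCR output above is a flat list of words with pixel positions, not a table. One simple way to rebuild rows and cells is to group words by vertical position and split cells at large horizontal gaps. This is a rough sketch, not a Tesseract feature: words_to_rows and both thresholds are assumptions that need tuning per document, and skewed scans or multi-line cells will defeat it.

```python
def words_to_rows(words, row_tol=10, col_gap=40):
    """Group OCR word boxes into rows of cell strings.

    words: dicts with 'text', 'left', 'top', 'width' (pixel units),
    the same fields pytesseract.image_to_data reports.
    row_tol: max difference in 'top' for words to share a row.
    col_gap: min horizontal gap that starts a new cell.
    """
    rows = []
    for word in sorted(words, key=lambda w: (w["top"], w["left"])):
        # Compare against the first word placed in the current row
        if rows and abs(word["top"] - rows[-1][0]["top"]) <= row_tol:
            rows[-1].append(word)
        else:
            rows.append([word])

    table = []
    for row in rows:
        row.sort(key=lambda w: w["left"])
        cells, prev_right = [], None
        for word in row:
            if prev_right is not None and word["left"] - prev_right < col_gap:
                cells[-1] += " " + str(word["text"])  # same cell, next word
            else:
                cells.append(str(word["text"]))  # gap is wide: new cell
            prev_right = word["left"] + word["width"]
        table.append(cells)
    return table
```

With the function from the example above, the call would look like words_to_rows(extract_table_from_scanned_pdf('scan.pdf').to_dict('records')), where 'scan.pdf' is a placeholder file name.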
For complex scanned tables, ABBYY FineReader (paid) significantly outperforms free tools.
Comparison of Table Extraction Tools
| Tool | Native PDFs | Scanned PDFs | Borders Required | Output Formats |
|---|---|---|---|---|
| Camelot | ★★★★★ | ✗ | Lattice only | CSV, Excel, JSON |
| Tabula | ★★★★☆ | ✗ | No | CSV, TSV, JSON |
| pdfplumber | ★★★★☆ | ✗ | No | Raw Python objects |
| Google Sheets | ★★★☆☆ | ★★★☆☆ | No | Google Sheets |
| ABBYY FineReader | ★★★★★ | ★★★★★ | No | All formats |
Troubleshooting Common Issues
Tables spanning multiple pages: Use pages='all' in Camelot/Tabula and merge DataFrames with pd.concat().
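A table that continues across pages often repeats its header row on each page, so a plain pd.concat() leaves stray header rows in the data. The sketch below assumes every page's DataFrame has the same columns and that the first row of the first page is the header (as with Camelot's .df, where the header arrives as an ordinary row); merge_page_tables is an illustrative name.

```python
import pandas as pd


def merge_page_tables(frames):
    """Concatenate per-page tables into one DataFrame.

    The first row of the first frame becomes the column names, and any
    row that repeats that header on later pages is dropped.
    """
    header = [str(value) for value in frames[0].iloc[0]]
    parts = []
    for frame in frames:
        frame = frame.copy()
        frame.columns = header
        # Keep only rows that are not a copy of the header
        keep = [
            [str(value) for value in row] != header
            for row in frame.itertuples(index=False)
        ]
        parts.append(frame.loc[keep])
    return pd.concat(parts, ignore_index=True)
```

With Camelot this could be called as merge_page_tables([t.df for t in tables]).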
Merged cells: These are the hardest to handle automatically. In lattice mode, Camelot’s copy_text=['v'] option copies the text of a vertically spanning cell into every row it covers (use 'h' for horizontal spans).
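For tables extracted with other tools, a vertically merged cell usually shows up as one filled cell followed by blanks. A pandas forward fill is a common substitute (a different technique from Camelot's option, and only safe for columns where a blank really means "same as above"). fill_merged_cells is an illustrative name.

```python
import pandas as pd


def fill_merged_cells(df, columns):
    """Forward-fill blank cells in the given columns.

    Empty strings are treated as missing, then each gap is filled with
    the last non-empty value above it. The input frame is not modified.
    """
    out = df.copy()
    blanks = out[columns].eq("")  # True where the cell is an empty string
    out[columns] = out[columns].mask(blanks).ffill()
    return out
```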
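Mixed text and numbers: Extracted cells are strings, so always cast numeric columns after extraction with pd.to_numeric(col, errors='coerce'), which turns values that cannot be parsed into NaN instead of raising an error.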
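In financial tables the cast often fails on formatting rather than content: thousands separators, currency symbols, or negatives written in parentheses. The sketch below strips those before calling pd.to_numeric; which characters to strip and the accounting-style parentheses rule are assumptions about your data, and to_number is an illustrative name.

```python
import pandas as pd


def to_number(series):
    """Convert a column of extracted strings to floats.

    Handles '1,234', '$56.70', '12%' and accounting negatives like '(89)'.
    Anything still unparseable becomes NaN.
    """
    cleaned = (
        series.astype(str)
        .str.strip()
        .str.replace(r"^\((.*)\)$", r"-\1", regex=True)  # (89) -> -89
        .str.replace(r"[,$%\s]", "", regex=True)  # drop separators and symbols
    )
    return pd.to_numeric(cleaned, errors="coerce")
```

Note that the comma rule assumes commas are thousands separators; for locales that use a comma as the decimal mark, the pattern would need to change.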
For a comprehensive guide to PDF data extraction, the PyPDF community documentation is an excellent reference.