OCR February 20, 2026 · 8 min read

How to Extract Tables From PDF With OCR (Free Methods)

Extract structured table data from PDF documents — both native PDFs and scanned images — using free tools and Python libraries.


AltoUnlockPDF Team

PDF Tools Expert

Extracting table data from PDFs is one of the most frustrating tasks in data work. Whether you’re a researcher pulling numbers from published reports, an analyst importing financial data, or an accountant reconciling invoices, getting structured data out of PDFs without manual retyping is a huge time saver.


Types of PDF Tables

Before choosing a tool, identify what type of PDF you’re dealing with:

  1. Native digital PDF — the PDF was created from software. Text is real characters. Table lines may or may not be present as vector graphics.

  2. Scanned/Image PDF — the page is an image. The “table” is just pixels. Requires OCR before data can be extracted.
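
If it is not obvious which type you have, a quick heuristic is to check whether each page exposes any real text characters. The sketch below assumes pdfplumber (covered in Method 3) is installed. The function names and the 20-character threshold are illustrative choices, not part of any library.

```python
def looks_scanned(char_count, image_count, min_chars=20):
    """Heuristic: almost no extractable text plus at least one image suggests a scan."""
    return char_count < min_chars and image_count > 0


def scanned_pages(pdf_path):
    """Return the 1-based numbers of pages that look like scans."""
    import pdfplumber  # third-party; imported here so the heuristic above stays dependency-free

    with pdfplumber.open(pdf_path) as pdf:
        return [
            number
            for number, page in enumerate(pdf.pages, 1)
            if looks_scanned(len(page.chars), len(page.images))
        ]
```

Pages reported by scanned_pages('report.pdf') would go through the OCR route in Method 5. The remaining pages can be handled with Methods 1 to 3.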


Method 1: Camelot (Python — Best for Native PDFs)

Camelot is purpose-built for table extraction from native PDFs. For tables with a clear structure, it is often the most accurate free option, and it reports an accuracy score for every table it extracts.

import camelot

# Lattice mode — for tables with visible borders/lines
tables = camelot.read_pdf('annual_report.pdf', flavor='lattice', pages='1-3')

# Stream mode — for tables without borders (space-separated columns)
tables = camelot.read_pdf('financial_data.pdf', flavor='stream', pages='all')

# Check accuracy score
print(tables[0].parsing_report)

# Export to various formats
tables.export('output.csv', f='csv')
tables.export('output.xlsx', f='excel')
tables.export('output.json', f='json')

# Access as pandas DataFrame
df = tables[0].df
print(df.head())
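
The parsing report is a dictionary that includes accuracy and whitespace percentages, so it can be used to flag tables that need a manual check. A minimal sketch follows. The helper name is hypothetical, the thresholds of 90 and 50 are assumptions to tune for your own documents, and the sample reports are made-up values rather than real output.

```python
def reliable_table_indexes(reports, min_accuracy=90.0, max_whitespace=50.0):
    """Return the positions of tables whose parsing report passes both thresholds."""
    return [
        index
        for index, report in enumerate(reports)
        if report["accuracy"] >= min_accuracy and report["whitespace"] <= max_whitespace
    ]


# With Camelot: reports = [table.parsing_report for table in tables]
# Illustrative values only
sample_reports = [
    {"accuracy": 99.1, "whitespace": 12.5, "order": 1, "page": 1},
    {"accuracy": 61.0, "whitespace": 70.2, "order": 2, "page": 1},
]
print(reliable_table_indexes(sample_reports))  # prints [0]
```

Tables that fail the check are good candidates for retrying with the other flavor (lattice vs. stream).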

Method 2: Tabula (GUI + Python)

Tabula is a desktop app (free, open-source) that lets you draw boxes around tables visually and extract them.

  1. Download and open Tabula
  2. Upload your PDF
  3. Draw a selection box around the table
  4. Click Preview & Export Extracted Data, then export as CSV (which opens directly in Excel)

Tabula is also available as a Python library, tabula-py, which requires a Java runtime:

import tabula

# Extract all tables from a page
tables = tabula.read_pdf('report.pdf', pages='2', multiple_tables=True)

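# Convert every table in the PDF to a CSV file
# (convert_into supports csv, tsv and json output, not xlsx)
tabula.convert_into('report.pdf', 'output.csv', output_format='csv', pages='all')

# For Excel, write an extracted DataFrame with pandas (requires openpyxl)
tables[0].to_excel('output.xlsx', index=False)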

Method 3: pdfplumber (Python — Precise Positioning)

pdfplumber gives you detailed control over PDF parsing and is excellent for complex multi-column layouts.

import pdfplumber

with pdfplumber.open('document.pdf') as pdf:
    for page_num, page in enumerate(pdf.pages, 1):
        tables = page.extract_tables()
        for table_num, table in enumerate(tables, 1):
            print(f"Page {page_num}, Table {table_num}:")
            for row in table:
                print(row)
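
extract_tables() returns each table as a list of rows, where each row is a list of strings and empty cells are None. One way to make that easier to work with is a small helper like the hypothetical table_to_records below. It assumes the first row holds the column headers, which is not always true.

```python
def table_to_records(table):
    """Turn a pdfplumber-style table (list of rows) into a list of dicts keyed by header."""
    if not table:
        return []

    def clean(cell):
        # Empty cells come back as None; collapse line breaks inside a cell
        return " ".join(cell.split()) if cell else ""

    header = [clean(cell) for cell in table[0]]
    return [dict(zip(header, (clean(cell) for cell in row))) for row in table[1:]]


rows = [["Item", "Total"], ["Office\nrent", "1200"], ["Power", None]]
print(table_to_records(rows))
```

The resulting list can be passed to pandas.DataFrame(...) or written out with csv.DictWriter.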

Method 4: Google Sheets Import (No-Code)

For one-off extractions without any coding:

  1. Upload the PDF to Google Drive
  2. Right-click → Open with Google Docs
  3. Google’s OCR extracts the text and tries to preserve the table structure
  4. Copy the table → paste into Google Sheets
  5. Clean up any formatting issues

Alternatively: Google Sheets → File → Import → select PDF (works on some digital PDFs).


Method 5: Scanned PDFs — OCR + Table Extraction

For scanned PDFs, you need OCR first:

import pdf2image
import pytesseract  # Output.DATAFRAME requires pandas to be installed

def extract_table_from_scanned_pdf(pdf_path, page_num=1):
    # Render only the requested page (1-indexed) to an image
    images = pdf2image.convert_from_path(
        pdf_path, dpi=300, first_page=page_num, last_page=page_num
    )
    image = images[0]

    # OCR the page; word rows carry the text, bounding box and confidence
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DATAFRAME)

    # Keep only words recognized with confidence above 60
    data = data[data.conf > 60].dropna(subset=['text'])
    return data

# Better for table reconstruction: hOCR output preserves layout and bounding boxes
image = pdf2image.convert_from_path('scanned.pdf', dpi=300)[0]
hocr = pytesseract.image_to_pdf_or_hocr(image, extension='hocr')  # hOCR markup as bytes
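
The word-level output above still has to be turned back into rows. One simple approach, sketched here under assumptions, is to group words whose vertical centers are close together and then order each group from left to right. The function name and the 10-pixel tolerance are illustrative (roughly suited to 300 DPI scans with normal-size text). The sketch recovers rows only and does not detect column boundaries.

```python
def group_words_into_rows(words, y_tolerance=10):
    """Group OCR words into rows by vertical position, then sort each row left to right.

    `words` is an iterable of mappings with 'left', 'top', 'height' and 'text' keys,
    matching the column names in pytesseract's image_to_data output.
    """
    def center(word):
        return word["top"] + word["height"] / 2

    rows = []          # each entry is a list of words on the same line
    row_centers = []   # vertical center of the first word in each row
    for word in sorted(words, key=center):
        if rows and abs(center(word) - row_centers[-1]) <= y_tolerance:
            rows[-1].append(word)
        else:
            rows.append([word])
            row_centers.append(center(word))

    return [
        [str(word["text"]) for word in sorted(row, key=lambda w: w["left"])]
        for row in rows
    ]


# With the function above: rows = group_words_into_rows(data.to_dict("records"))
sample_words = [
    {"left": 200, "top": 12, "height": 20, "text": "Total"},
    {"left": 10, "top": 10, "height": 20, "text": "Item"},
    {"left": 10, "top": 50, "height": 20, "text": "Rent"},
    {"left": 200, "top": 48, "height": 20, "text": "1200"},
]
print(group_words_into_rows(sample_words))  # prints [['Item', 'Total'], ['Rent', '1200']]
```

Multi-word cells end up as separate entries in a row, so a real table usually needs a second pass that merges words by horizontal gaps.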

For complex scanned tables, ABBYY FineReader (paid) significantly outperforms free tools.


Comparison of Table Extraction Tools

Tool              | Native PDFs | Scanned PDFs | Borders Required | Output Formats
Camelot           | ★★★★★       | -            | Lattice only     | CSV, Excel, JSON
Tabula            | ★★★★☆       | -            | No               | CSV, Excel
pdfplumber        | ★★★★☆       | -            | No               | Raw Python objects
Google Sheets     | ★★★☆☆       | ★★★☆☆        | No               | Google Sheets
ABBYY FineReader  | ★★★★★       | ★★★★★        | No               | All formats

Troubleshooting Common Issues

Tables spanning multiple pages: Use pages='all' in Camelot/Tabula and merge DataFrames with pd.concat().
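
When a table continues across pages, each page usually repeats the header row. A minimal sketch of the merge step, assuming every per-page DataFrame has the same column count and that the first row of the first page is the header (the function name is hypothetical):

```python
import pandas as pd


def merge_page_tables(frames):
    """Stack per-page tables and drop the header row repeated on each page."""
    header = list(frames[0].iloc[0])
    merged = pd.concat(frames, ignore_index=True)

    # Mark every row that is identical to the header, including the first one
    is_header = merged.apply(lambda row: list(row) == header, axis=1)

    body = merged[~is_header].reset_index(drop=True)
    body.columns = header
    return body


page_1 = pd.DataFrame([["Item", "Total"], ["Rent", "1200"]])
page_2 = pd.DataFrame([["Item", "Total"], ["Power", "90"]])
print(merge_page_tables([page_1, page_2]))
```

With Camelot, the input list would be [table.df for table in tables].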

Merged cells: These are the hardest to handle automatically. In Camelot’s lattice mode, pass copy_text=['v'] to copy the text of a vertically spanning cell into every cell it covers.
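
copy_text only helps at extraction time in Camelot. For tables extracted with another tool, a common workaround (a different technique, shown here for illustration) is to treat blank cells in a label column as a continuation of the cell above and forward-fill them with pandas. This assumes a blank always means "same as above", which you should verify for your table. The helper name is hypothetical.

```python
import pandas as pd


def fill_merged_cells(df, columns):
    """Forward-fill blank cells in the given columns; other columns are untouched."""
    out = df.copy()
    # True where a cell is empty or whitespace-only
    blank = out[columns].apply(lambda col: col.astype(str).str.strip() == "")
    out[columns] = out[columns].mask(blank).ffill()
    return out


sample = pd.DataFrame(
    {"Region": ["North", "", "South", " "], "Quarter": ["Q1", "Q2", "Q1", "Q2"]}
)
print(fill_merged_cells(sample, ["Region"]))
```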

Mixed text and numbers: Always cast numeric columns after extraction with pd.to_numeric(col, errors='coerce').
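
Extracted numbers often carry thousands separators, currency symbols, or accounting-style parentheses, all of which pd.to_numeric would turn into NaN. A small sketch of a cleaning step, assuming commas are thousands separators and parentheses mean negative values (both are assumptions about your data, and the function name is hypothetical):

```python
import pandas as pd


def clean_numeric(col):
    """Strip common formatting from a text column, then convert it to numbers."""
    cleaned = (
        col.astype(str)
        .str.replace(r"[,$€£%\s]", "", regex=True)          # separators, symbols, spaces
        .str.replace(r"^\((.*)\)$", r"-\1", regex=True)     # (45) becomes -45
    )
    # Anything still non-numeric, such as "n/a", becomes NaN
    return pd.to_numeric(cleaned, errors="coerce")


amounts = pd.Series(["1,200", "$90.50", "(45)", "n/a"])
print(clean_numeric(amounts))
```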

For a comprehensive guide to PDF data extraction, the PyPDF community documentation is an excellent reference.
