Tables are one of the most common and most valuable structures in business documents. Invoices have line item tables. Financial reports have balance sheets. Medical records have lab results. Contracts have fee schedules.
Extracting these tables from PDFs into usable spreadsheet data is also one of the hardest document processing tasks. Here is why — and how to solve it.
Why table extraction from PDFs is so hard
PDFs were designed for visual presentation, not data extraction. A table that looks perfectly organized on screen usually has no actual "table" structure in the PDF file (tagged PDFs can carry one, but many business documents are not tagged). Instead, the PDF contains:
- Individual text elements positioned at specific coordinates
- Lines drawn as vector graphics (or not drawn at all — many tables use whitespace instead of borders)
- No semantic information about which text belongs to which cell
When you try to select and copy a PDF table, you get jumbled text because the PDF viewer emits text in the order it is stored or in a guessed reading order, with no knowledge of which fragments share a row or a cell. A three-column table becomes an unreadable stream of interleaved values.
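To make the problem concrete, here is a minimal Python sketch. The fragment coordinates and their stored order are invented for illustration; real files hold far more fragments, and the order depends on the software that produced the PDF.

```python
# Each fragment is (x, y, text); y grows down the page. All values are invented.
fragments = [
    (50, 100, "Widget"), (50, 120, "Gadget"),    # description column, stored first
    (200, 100, "2"), (200, 120, "5"),            # quantity column
    (300, 100, "$10.00"), (300, 120, "$4.00"),   # price column
]

# Naive copy: emit text in stored order, ignoring position.
naive = " ".join(text for _, _, text in fragments)


def rebuild_rows(frags, y_tolerance=3):
    """Group fragments with nearly equal y into rows, then order each row by x."""
    rows = []
    for x, y, text in sorted(frags, key=lambda f: (f[1], f[0])):
        if rows and abs(rows[-1]["y"] - y) <= y_tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    return [[text for _, text in sorted(row["cells"])] for row in rows]


table = rebuild_rows(fragments)
```

Here `naive` is the interleaved stream you get from copy-paste, while `table` recovers the rows. The recovery only works because this toy layout is perfectly aligned, which real documents rarely are.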
Common failure modes
Merged cells cause column alignment to break. A header that spans two columns confuses rule-based extractors.
Multi-page tables are treated as separate tables by most tools. The header row appears on page 1 but not page 2, and the continuation is not linked.
Borderless tables that rely on whitespace alignment have no visual cues for column boundaries.
Mixed content where tables sit alongside paragraphs, charts, or images requires the extractor to identify where the table begins and ends.
Traditional approaches (and their limitations)
Copy-paste from PDF viewer
The simplest approach — and the most unreliable. Column alignment breaks, merged cells collapse, and multi-page tables require manual stitching.
Rule-based extraction tools
Tools that use rules like "find horizontal and vertical lines to identify cell boundaries" work for simple, bordered tables. They fail on borderless tables, tables with merged cells, and tables in scanned documents.
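The line-finding rule can be sketched in a few lines of Python. This is an illustration of the idea, not the code of any particular tool; the function name and the sample coordinates are made up.

```python
from bisect import bisect_right


def cells_from_lines(v_lines, h_lines, words):
    """Place words into a grid defined by vertical and horizontal ruling lines.

    v_lines, h_lines: sorted x and y positions of the drawn lines.
    words: (x, y, text) tuples. Words outside the outer border are skipped.
    """
    n_cols, n_rows = len(v_lines) - 1, len(h_lines) - 1
    grid = [["" for _ in range(n_cols)] for _ in range(n_rows)]
    for x, y, text in words:
        col = bisect_right(v_lines, x) - 1
        row = bisect_right(h_lines, y) - 1
        if 0 <= col < n_cols and 0 <= row < n_rows:
            grid[row][col] = (grid[row][col] + " " + text).strip()
    return grid


v_lines = [40, 180, 260]   # three vertical lines -> two columns
h_lines = [90, 110, 130]   # three horizontal lines -> two rows
words = [(50, 100, "Widget"), (200, 100, "2"), (50, 120, "Gadget"), (200, 120, "5")]

bordered = cells_from_lines(v_lines, h_lines, words)
borderless = cells_from_lines([], [], words)  # no lines drawn, so no cells found
```

With borders, every word lands in a cell. With no drawn lines, the same rule returns an empty grid, which is exactly the borderless-table failure.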
Coordinate-based extraction
Developer tools that extract text by page coordinates (like Tabula or Camelot) work for native PDFs with consistent layouts. They typically require per-document or per-template configuration, and they fail on scanned PDFs because an image-only page has no text layer to read.
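A rough Python sketch shows why the configuration is tied to one template. It is not the API of Tabula or Camelot; `assign_columns` and the boundary values are hypothetical.

```python
def assign_columns(words, boundaries, y_tolerance=3):
    """Split words into rows by y, then into columns by fixed x boundaries.

    boundaries: x positions separating columns, e.g. [150, 250] gives 3 columns.
    These numbers are the per-template configuration: they fit one layout only.
    """
    rows = {}
    for x, y, text in words:
        # Snap y to an existing row if it is close enough, else start a new row.
        key = next((k for k in rows if abs(k - y) <= y_tolerance), y)
        col = sum(1 for b in boundaries if x >= b)
        row = rows.setdefault(key, [""] * (len(boundaries) + 1))
        row[col] = (row[col] + " " + text).strip()
    return [rows[k] for k in sorted(rows)]


words = [
    (50, 100, "Widget"), (200, 100, "2"), (300, 100, "$10.00"),
    (50, 121, "Gadget"), (200, 120, "5"), (300, 120, "$4.00"),
]
table = assign_columns(words, boundaries=[150, 250])

# The same table shifted 120 units to the right, as on a different template.
shifted_words = [(x + 120, y, text) for x, y, text in words]
shifted_table = assign_columns(shifted_words, boundaries=[150, 250])
```

The first call produces clean rows. The second reuses the old boundaries on a shifted layout and pushes quantity and price into the same column, so the boundaries would have to be set again for that template.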
The AI approach: visual table understanding
AI vision models process the PDF page as an image and identify tables the same way a human does — by recognizing the spatial arrangement of text into rows and columns.
This means:
- Borderless tables are recognized by whitespace patterns and alignment
- Merged cells are understood from the visual layout
- Multi-page tables can be continued when the AI recognizes the same column structure
- Scanned PDFs work because the AI reads the image, not the PDF text layer
The output is a structured table with correct row/column relationships — ready to export as CSV or include in Markdown.
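Once each page yields a structured table, linking continuation pages becomes a small post-processing step. The Python sketch below uses matching column counts as a crude stand-in for "same column structure"; that heuristic and the function name are assumptions made for illustration, not a description of how any model works internally.

```python
def stitch_tables(page_tables):
    """Merge consecutive page tables that share a column count.

    page_tables: one table per page, each a list of rows (lists of strings).
    The first row of a merged table is treated as its header. A continuation
    page that repeats the header has the repeated row dropped.
    """
    merged = []
    for table in page_tables:
        if not table:
            continue
        if merged and len(table[0]) == len(merged[-1][0]):
            header = merged[-1][0]
            rows = table[1:] if table[0] == header else table
            merged[-1].extend(rows)
        else:
            merged.append([list(row) for row in table])
    return merged


page1 = [["Item", "Qty"], ["Widget", "2"]]
page2 = [["Gadget", "5"]]                      # continuation without a header
page3 = [["Item", "Qty"], ["Bolt", "9"]]       # continuation that repeats the header
other = [["Name", "Date", "Amount"], ["A", "2024-01-01", "3"]]

stitched = stitch_tables([page1, page2, page3, other])
```

The three two-column pages collapse into one table, and the three-column table stays separate.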
How to extract tables with PaperAI
Step 1: Upload your PDF
Upload the PDF containing tables to PaperAI. Both native and scanned PDFs are supported.
Step 2: Convert with an appropriate AI model
For clean, typed PDFs with simple tables, a standard model (2-5 credits/page) is sufficient. For scanned documents, complex multi-page tables, or tables with handwritten entries, use a premium model (8-10 credits/page).
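To budget a job before you start, multiply the page count by the per-page range. A small Python sketch, using the credit ranges quoted above (the function name and the 40-page example are illustrative):

```python
# Per-page credit ranges as quoted in the text above.
CREDITS_PER_PAGE = {"standard": (2, 5), "premium": (8, 10)}


def estimate_credits(pages, tier):
    """Return the (low, high) credit estimate for converting `pages` pages."""
    low, high = CREDITS_PER_PAGE[tier]
    return pages * low, pages * high


standard_cost = estimate_credits(40, "standard")  # (80, 200)
premium_cost = estimate_credits(40, "premium")    # (320, 400)
```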
Step 3: Review the table output
In the side-by-side review, check that:
- All rows are captured (no missing data)
- Columns are correctly aligned (amounts in the amount column, dates in the date column)
- Headers are identified correctly
- Merged cells are handled properly
The Markdown output represents tables using pipe-table syntax (as in GitHub Flavored Markdown), which renders in most Markdown viewers and converts cleanly to HTML or CSV.
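Part of the review can be automated. The Python sketch below parses a Markdown table, flags rows whose column count differs from the header, and converts the result to CSV. It assumes a simple table with the separator on the second line and no escaped pipe characters inside cells; the sample table is invented.

```python
import csv
import io


def parse_markdown_table(md):
    """Parse a simple pipe table into rows, skipping the separator on line 2."""
    rows = []
    for i, line in enumerate(md.strip().splitlines()):
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if i == 1 and all(set(c) <= set("-: ") for c in cells):
            continue  # the |---|---| row between header and body
        rows.append(cells)
    return rows


def to_csv(rows):
    """Serialize rows to CSV text; the csv module handles quoting."""
    buf = io.StringIO()
    csv.writer(buf, lineterminator="\n").writerows(rows)
    return buf.getvalue()


md = """
| Description | Qty | Total |
|-------------|----:|------:|
| Widget, large | 2 | $20.00 |
| Gadget | 5 | $20.00 |
"""

rows = parse_markdown_table(md)
ragged = [r for r in rows if len(r) != len(rows[0])]  # rows with a column mismatch
csv_text = to_csv(rows)
```

An empty `ragged` list means every row has the same number of columns as the header. It does not prove the values sit in the right columns, so the visual check still matters.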
Step 4: Extract specific table fields
If you need specific data from the table (rather than the entire table), set up extraction fields in a Smart Flow. For an invoice line items table, you might extract:
- Line item description (text)
- Quantity (number)
- Unit price (currency)
- Line total (currency)
The AI extracts these as a structured array — each line item becomes an object with named fields.
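As a rough picture of that structure, here is a Python sketch. The field names and sample values are hypothetical, not PaperAI's actual output schema. It also shows a cheap sanity check you can run on any line item array: quantity times unit price should equal the line total.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class LineItem:
    description: str
    quantity: int
    unit_price: Decimal
    line_total: Decimal


# Hypothetical extraction result: one object per table row.
extracted = [
    {"description": "Widget", "quantity": 2, "unit_price": "10.00", "line_total": "20.00"},
    {"description": "Gadget", "quantity": 5, "unit_price": "4.00", "line_total": "25.00"},
]

items = [
    LineItem(r["description"], int(r["quantity"]),
             Decimal(r["unit_price"]), Decimal(r["line_total"]))
    for r in extracted
]

# Flag rows where quantity x unit price does not match the extracted total.
mismatches = [i.description for i in items if i.quantity * i.unit_price != i.line_total]
```

The second sample row is wrong on purpose (5 x 4.00 is 20.00, not 25.00), so it ends up in `mismatches`. `Decimal` is used instead of `float` to avoid rounding surprises with currency.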
Step 5: Export
Export the structured data as:
- CSV: Each table row becomes a spreadsheet row. Import directly into Excel or Google Sheets.
- JSON: Each table becomes an array of objects with named fields. Ready for your database or API.
- Markdown: Tables stay as formatted Markdown tables for documentation.
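If you ever need to produce these formats yourself from a list of records, Python's standard library covers all three. The records below are invented, and the helper names are illustrative.

```python
import csv
import io
import json

line_items = [
    {"description": "Widget", "quantity": 2, "unit_price": "10.00", "line_total": "20.00"},
    {"description": "Gadget", "quantity": 5, "unit_price": "4.00", "line_total": "20.00"},
]


def records_to_csv(records):
    """One header row, then one CSV row per record."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=list(records[0]), lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def records_to_markdown(records):
    """Render records as a pipe table with a header and separator row."""
    headers = list(records[0])
    lines = [
        "| " + " | ".join(headers) + " |",
        "| " + " | ".join("---" for _ in headers) + " |",
    ]
    for r in records:
        lines.append("| " + " | ".join(str(r[h]) for h in headers) + " |")
    return "\n".join(lines)


as_csv = records_to_csv(line_items)
as_json = json.dumps(line_items, indent=2)
as_markdown = records_to_markdown(line_items)
```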
Tips for better table extraction
Use premium models for complex tables. Tables with dense data, small fonts, or many columns benefit from the higher accuracy of premium AI models.
Process similar documents together. If you have 100 invoices with the same table structure, set up a Smart Flow once and batch-process all of them.
Check the confidence score. Tables with low confidence scores are more likely to have alignment or data errors. Focus your review time on these.
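In a batch, this amounts to sorting by score and reviewing from the bottom up. A Python sketch, assuming each result carries a score between 0 and 1 under a `confidence` key; the key name, the scale, and the 0.90 cutoff are all assumptions, so adapt them to whatever your export contains.

```python
REVIEW_THRESHOLD = 0.90  # assumed cutoff; tune it to your tolerance for errors

# Hypothetical batch results; field names and scale are assumptions.
results = [
    {"file": "invoice_001.pdf", "confidence": 0.98},
    {"file": "invoice_002.pdf", "confidence": 0.71},
    {"file": "invoice_003.pdf", "confidence": 0.88},
]

# Lowest confidence first, so the riskiest tables are reviewed first.
review_queue = sorted(
    (r for r in results if r["confidence"] < REVIEW_THRESHOLD),
    key=lambda r: r["confidence"],
)
```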
For multi-page tables, ensure the entire document is uploaded as a single file. The AI can recognize table continuation across pages within one document but not across separate files.
When you need more than table extraction
If you need not just the table data but also specific fields from outside the table (like the invoice number in the header or the total in the footer), use PaperAI's extraction fields. Define all the data points you need — both table data and standalone fields — and the AI extracts everything in one pass.
Start free with 100 credits and try extracting tables from your own documents.