
Extracting Tables from PDF Files in Your Browser

PDF files are the default format for financial reports, government filings, invoices, and regulatory documents. The data is right there on the page, but getting it into a spreadsheet usually means copying cells one by one or paying for a SaaS extraction tool that uploads your files to a remote server.

ExploreMyData can extract tabular data from text-based PDFs directly in your browser. No upload, no server, no account. The file never leaves your machine.

How the parser works

The extraction pipeline has five stages:

  1. Text extraction. The parser reads every text item on each page, recording its content and its x,y position on the page. For scanned PDFs, browser-based OCR recognizes text from the page images instead.
  2. Fragment merging. PDF text is often split into many small fragments. The parser merges adjacent fragments on the same line into complete words and cell values, tracking character widths to distinguish cell boundaries from word spacing.
  3. Row clustering. Merged text chunks are grouped into rows based on their vertical position, with an adaptive tolerance calculated from the actual text height.
  4. Column detection. Column boundaries are detected using a region-growing algorithm inspired by Tabula. The parser seeds column regions from the most populated row, then grows each region by merging horizontally overlapping chunks from other rows. This handles right-aligned numbers and varying column widths naturally.
  5. Structured output. Each chunk is assigned to its overlapping column region and the result is a proper table ready for filtering, aggregation, or export.
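
The clustering and region-growing stages above can be sketched in a few dozen lines. This is an illustrative reconstruction, not ExploreMyData's actual code: the names (TextChunk, clusterRows, growColumns) and the median-height tolerance are assumptions based on the description.

```typescript
interface TextChunk {
  text: string;
  x: number;      // left edge on the page
  width: number;  // horizontal extent
  y: number;      // vertical position
  height: number; // text height
}

// Stage 3: group chunks into rows by vertical position, with a
// tolerance derived from the median text height.
function clusterRows(chunks: TextChunk[]): TextChunk[][] {
  const sorted = [...chunks].sort((a, b) => a.y - b.y);
  const heights = sorted.map(c => c.height).sort((a, b) => a - b);
  const tolerance = (heights[Math.floor(heights.length / 2)] ?? 1) / 2;

  const rows: TextChunk[][] = [];
  for (const chunk of sorted) {
    const last = rows[rows.length - 1];
    if (last && Math.abs(chunk.y - last[0].y) <= tolerance) {
      last.push(chunk);
    } else {
      rows.push([chunk]);
    }
  }
  return rows;
}

// Stage 4: seed column regions [start, end] from the most populated row,
// then widen each region with horizontally overlapping chunks from other rows.
function growColumns(rows: TextChunk[][]): Array<[number, number]> {
  const seedRow = rows.reduce((a, b) => (b.length > a.length ? b : a));
  const regions: Array<[number, number]> = seedRow
    .map(c => [c.x, c.x + c.width] as [number, number])
    .sort((a, b) => a[0] - b[0]);

  for (const row of rows) {
    for (const chunk of row) {
      const region = regions.find(
        ([s, e]) => chunk.x <= e && chunk.x + chunk.width >= s,
      );
      if (region) {
        region[0] = Math.min(region[0], chunk.x);
        region[1] = Math.max(region[1], chunk.x + chunk.width);
      }
    }
  }
  return regions;
}
```

Because regions widen to cover every overlapping chunk, a right-aligned number that starts left of its header still lands in the same column.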

For example, a quarterly revenue report with this layout in the PDF:

Quarter    Revenue      Expenses     Net Income
Q1 2025    1,245,000    892,000      353,000
Q2 2025    1,380,000    945,000      435,000
Q3 2025    1,512,000    1,023,000    489,000
Q4 2025    1,688,000    1,101,000    587,000

...becomes exactly that table in ExploreMyData, ready for aggregation, charting, or export.

What works well

The parser performs best on PDFs with clearly structured, machine-readable text:

  • Financial statements with consistent column headers and numeric rows
  • Invoice line items where each row has a description, quantity, and amount
  • Government statistical tables published as PDF reports
  • Any PDF where you can select and copy text (this means it's text-based, not an image)

A quick test: open the PDF and try to highlight some text with your cursor. If you can select individual words, the parser can read it.
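
The same check can be automated with pdf.js by counting text items per page. This is a sketch under assumptions: the pdfjs-dist API calls (getDocument, getPage, getTextContent) are real, but the function names and the five-items-per-page threshold are illustrative guesses, not ExploreMyData's internals.

```typescript
// A document with almost no text items is treated as scanned.
function looksScanned(textItemsPerPage: number[], minItemsPerPage = 5): boolean {
  const total = textItemsPerPage.reduce((a, b) => a + b, 0);
  return total < minItemsPerPage * textItemsPerPage.length;
}

// Count text items on the first few pages with pdf.js.
async function countTextItems(data: ArrayBuffer, maxPages = 3): Promise<number[]> {
  const mod = "pdfjs-dist";
  const pdfjs = await import(mod); // loaded lazily; requires pdfjs-dist
  const doc = await pdfjs.getDocument({ data }).promise;
  const counts: number[] = [];
  for (let i = 1; i <= Math.min(doc.numPages, maxPages); i++) {
    const page = await doc.getPage(i);
    const content = await page.getTextContent();
    counts.push(content.items.length);
  }
  return counts;
}
```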

Scanned PDFs and OCR

If the PDF is a scanned image (no selectable text), ExploreMyData automatically falls back to browser-based OCR using tesseract.js. The OCR engine renders each page as an image and recognizes text character by character. The recognized text then goes through the same table extraction pipeline.

OCR works best with clear, high-resolution scans. Blurry pages, handwriting, or unusual fonts will reduce accuracy. OCR is limited to the first 10 pages (it takes several seconds per page). A warning is shown when OCR is used, since recognition errors are possible.
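
The fallback can be sketched with tesseract.js's createWorker/recognize API. Page rendering to an image is elided, and selectOcrPages and OCR_PAGE_LIMIT are illustrative names mirroring the 10-page cap described above, not the product's actual code.

```typescript
const OCR_PAGE_LIMIT = 10;

// Decide which pages get OCR'd: the first ten, at most.
function selectOcrPages(numPages: number, limit = OCR_PAGE_LIMIT): number[] {
  return Array.from({ length: Math.min(numPages, limit) }, (_, i) => i + 1);
}

// Recognize text on one rendered page image (e.g. a canvas data URL).
async function ocrPage(imageDataUrl: string): Promise<string> {
  const mod = "tesseract.js";
  const { createWorker } = await import(mod); // loaded lazily; requires tesseract.js
  const worker = await createWorker("eng");
  const { data } = await worker.recognize(imageDataUrl);
  await worker.terminate();
  return data.text;
}
```

The recognized text from each page is then fed into the same merge/cluster/column pipeline as native PDF text.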

Password-protected PDFs

If your PDF is password protected, ExploreMyData will prompt you to enter the password. The password is used locally in your browser to decrypt the file. It is never transmitted to any server. You get up to three attempts before the import is cancelled.
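
The three-attempt flow maps naturally onto pdf.js's onPassword callback, which fires whenever the current password is missing or wrong. A minimal sketch, assuming that API; promptFn stands in for the UI dialog and the helper name is illustrative:

```typescript
const MAX_ATTEMPTS = 3;

// Build an onPassword-style handler: pdf.js invokes it with an
// updatePassword callback each time a password is needed or rejected.
function makePasswordHandler(
  promptFn: () => string | null,
  onCancel: () => void,
  maxAttempts = MAX_ATTEMPTS,
): (updatePassword: (pw: string) => void) => void {
  let attempts = 0;
  return (updatePassword) => {
    attempts += 1;
    const password = attempts <= maxAttempts ? promptFn() : null;
    if (password === null) {
      onCancel();               // user cancelled or attempts exhausted
    } else {
      updatePassword(password); // decryption happens locally inside pdf.js
    }
  };
}
```

Since pdf.js decrypts in-process, the password never needs to travel anywhere: it is handed straight to the parser running in the browser.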

Known limitations

  • Merged cells and nested tables. The column detection assumes a regular grid. PDFs with cells spanning multiple columns or tables nested inside other tables may produce misaligned output.
  • Rotated or vertical text. The row-clustering algorithm groups items by vertical position. Text rotated 90 degrees breaks this assumption.
  • OCR accuracy. OCR on scanned PDFs will contain some recognition errors, especially with poor scan quality. Always review the extracted data.

Tips for better results

  • If a PDF contains multiple table formats across different pages (for example, a summary table on page 1 and a detail table on pages 2-10), extract them in separate steps and clean each one individually.
  • After extraction, use the pipeline to rename columns if the headers came through with extra whitespace or concatenated text. The Rename Column operation handles this.
  • Numeric columns may extract as text if they contain currency symbols or commas. Use Convert Type to cast them to numeric after stripping formatting characters.
  • For multi-page tables, check whether headers repeat on each page. If they do, the repeated header rows will appear as data rows. Use a Filter to remove them (e.g., filter out rows where the first column equals the header name).
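
The repeated-header cleanup from the last tip can be done in one pass. This sketch compares whole rows rather than just the first column, which is a slightly stricter variant; the function name is illustrative:

```typescript
// Drop any data row that is identical (after trimming) to the header row,
// as happens when a multi-page table repeats its header on every page.
function dropRepeatedHeaders(header: string[], rows: string[][]): string[][] {
  const key = JSON.stringify(header.map(h => h.trim()));
  return rows.filter(row => JSON.stringify(row.map(c => c.trim())) !== key);
}
```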

Page limits

The parser processes a maximum of 50 pages. For longer documents, split the PDF before loading it. Multi-page tables are processed page by page, so if a table spans pages 3 through 7, each page is parsed independently and the results are concatenated.

After extraction

Once the table is loaded, it behaves like any other dataset in ExploreMyData. You can filter, aggregate, join it with other files, add calculated columns, or export it. Common next steps:

  • Export as CSV for use in other tools: PDF to CSV
  • Export as Excel for sharing with colleagues: PDF to Excel
  • Build a pipeline to clean and transform the extracted data before exporting

Extract a PDF table now →

Try it yourself

No sign-up, no upload, no tracking.

Open ExploreMyData