Start from the statement structure
A bank statement is not just a table. It usually has metadata, summary balances, one or more transaction sections, footers, and repeated page headers.
Reliable extraction identifies the transaction region first, then maps columns and continuation lines. Pulling every date and amount from the page creates false rows, because dates and amounts also appear in metadata, summary balances, and footers.
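As a rough illustration of finding the region first, the sketch below scans text lines for a column header and stops at the first end marker, so only lines in between are considered as rows. The regular expressions, the DD/MM/YYYY date format, and the function names are assumptions made for this example; real statements vary by bank and need their own markers.

```python
import re

# Hypothetical markers for illustration only; each statement format needs its own.
HEADER_RE = re.compile(r"^\s*date\b.*\bdescription\b.*\b(amount|balance)\b", re.IGNORECASE)
END_RE = re.compile(r"^\s*(closing balance|total|page \d+)", re.IGNORECASE)
ROW_START_RE = re.compile(r"^\s*\d{2}/\d{2}/\d{4}\b")  # assumes DD/MM/YYYY dates


def transaction_region(lines):
    """Return (start, end) line indices between the column header and the first end marker.

    Returns None when no column header is found.
    """
    start = None
    for i, line in enumerate(lines):
        if start is None:
            if HEADER_RE.search(line):
                start = i + 1  # rows begin on the line after the header
        elif END_RE.search(line):
            return start, i
    return (start, len(lines)) if start is not None else None


def candidate_rows(lines):
    """Return (index, text) for lines inside the region that begin with a date."""
    region = transaction_region(lines)
    if region is None:
        return []
    start, end = region
    return [(n, lines[n]) for n in range(start, end) if ROW_START_RE.match(lines[n])]
```

In a page where the statement period and a print date sit outside the region, those dates never become rows, which is the point of locating the region before matching anything.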
Fields that matter
Transaction date, processed date, description, reference, debit, credit, amount, balance, currency, and section are the core fields. The section field matters most for credit card statements, separate fee and payment sections, and PDFs that cover more than one account.
Keep original evidence for review. When a row looks suspicious, a user or operator needs to know which line of the PDF produced it.
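One way to carry these fields together with their evidence is a small record type, sketched below. The class names, field names, and the page/line addressing scheme are assumptions for illustration, not a prescribed schema.

```python
from dataclasses import dataclass
from datetime import date
from decimal import Decimal
from typing import List, Optional


@dataclass(frozen=True)
class Evidence:
    """Hypothetical pointer back to the PDF text that produced a row."""
    page: int      # 1-based page number
    line: int      # 1-based line number on that page
    raw_text: str  # text exactly as extracted, before any cleanup


@dataclass
class Transaction:
    transaction_date: date
    description: str
    amount: Decimal
    currency: str
    section: str
    evidence: List[Evidence]
    processed_date: Optional[date] = None
    reference: Optional[str] = None
    debit: Optional[Decimal] = None
    credit: Optional[Decimal] = None
    balance: Optional[Decimal] = None

    def source_label(self) -> str:
        """Short label a reviewer can use to find the originating lines."""
        return ", ".join(f"p{e.page}:l{e.line}" for e in self.evidence)
```

Keeping a list of evidence entries rather than a single one lets a row that spans several physical lines point at all of them.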
Quality gates
Validate every extraction against transaction count, balance reconciliation, amount totals, date range checks, and description coverage. A parser that returns rows is not necessarily correct.
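Balance reconciliation is the easiest of these gates to show in code. The sketch below assumes signed amounts (debits negative, credits positive) and exact decimal values; the function name and the shape of its return value are hypothetical, and some statements, credit cards in particular, use the opposite sign convention.

```python
from decimal import Decimal


def reconcile(opening, closing, amounts, running_balances=None):
    """Return a list of problems found; an empty list means the checks passed.

    amounts: signed Decimal values in statement order.
    running_balances: optional per-row printed balances (None where a row has none).
    """
    problems = []
    expected = opening + sum(amounts, Decimal("0"))
    if expected != closing:
        problems.append(
            f"closing balance mismatch: expected {expected}, statement says {closing}"
        )
    if running_balances is not None:
        running = opening
        for i, (amount, balance) in enumerate(zip(amounts, running_balances)):
            running += amount
            if balance is not None and balance != running:
                problems.append(
                    f"row {i}: printed balance {balance} does not match computed {running}"
                )
                running = balance  # resynchronise so one bad row is reported once
    return problems
```

A dropped row shows up as a closing balance mismatch even though every remaining row parsed cleanly, which is why row output alone is not evidence of a correct parse.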
Unknown formats should enter a controlled learning path with a cost budget. If the system cannot reach the required confidence within that budget, mark the result as review-required instead of presenting it as final.
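The budget and the review-required fallback can be expressed as a small decision loop. This is a minimal sketch under stated assumptions: the Attempt type, the 0.9 confidence threshold, the budget unit, and the status strings are all hypothetical, and how confidence is scored is left open.

```python
from dataclasses import dataclass


@dataclass
class Attempt:
    confidence: float  # 0.0 to 1.0, however the system scores a parse
    cost: float        # spend for this attempt, in the same unit as the budget


def resolve_status(attempts, min_confidence=0.9, budget=1.0):
    """Consume attempts in order until confidence is reached or the budget would be exceeded.

    Returns (status, spent), where status is "final" or "review_required".
    """
    spent = 0.0
    best = 0.0
    for attempt in attempts:
        if spent + attempt.cost > budget:
            break  # stop before overspending
        spent += attempt.cost
        best = max(best, attempt.confidence)
        if best >= min_confidence:
            return "final", spent
    return "review_required", spent
```

Passing a generator as attempts keeps later, more expensive attempts from running at all once the budget is exhausted.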
FAQ
Why do some statements extract with missing descriptions?
Many descriptions wrap onto separate physical lines. If the extractor treats each line as a separate row, references and merchant names get dropped.
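A minimal sketch of merging wrapped lines follows. It assumes the input is already limited to the transaction region and that a new row always begins with a DD/MM/YYYY date; the function name and the row dictionary layout are hypothetical.

```python
import re

# Assumption: a new transaction row starts with a DD/MM/YYYY date.
ROW_START_RE = re.compile(r"^\d{2}/\d{2}/\d{4}\b")


def merge_continuations(lines):
    """Fold lines that do not start a new row into the row above them.

    Each row records the 1-based line numbers it was built from, so the
    original evidence stays available for review.
    """
    rows = []
    for number, text in enumerate(lines, start=1):
        stripped = text.strip()
        if not stripped:
            continue
        if ROW_START_RE.match(stripped):
            rows.append({"text": stripped, "source_lines": [number]})
        elif rows:
            rows[-1]["text"] += " " + stripped
            rows[-1]["source_lines"].append(number)
        # A non-date line before any row has nothing to attach to, so it is skipped.
    return rows
```

With this approach a merchant name or reference printed on the second or third physical line stays attached to its transaction instead of becoming a separate, amountless row.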
What should happen when extraction confidence is low?
The result should be marked for review, with a correction interface and a way to use the correction to improve future parsing.