Optical Character Recognition

OCR

Reviewed by Paul Hanke · Co-Founder, Transformance

May 29, 2026

Optical Character Recognition (OCR) is software that converts images of typed or handwritten text (scanned invoices, remittance PDFs, cheques, proofs of delivery) into machine-readable data. In AR and O2C, it sits at the front of capture workflows but is increasingly paired with, or replaced by, vision language models for variable-layout documents.

Key Takeaways

OCR turns pixels into text, which is the first step before any structured extraction (invoice number, amount, remit reference) can happen.
Pure OCR caps out at roughly 75 to 85 percent field-level accuracy on variable AR documents because it relies on templates and pattern matching, not context.
Vision language models (VLMs) now hit 95 to 99 percent on the same documents by reasoning about layout, spatial relationships, and meaning.
The production pattern in modern cash application is hybrid: OCR handles clean structured pages, VLMs handle handwritten, multi-page, or unseen layouts.
Buyers should ask vendors for straight-through-processing rates by document type (clean PDF vs scanned vs handwritten), not headline OCR accuracy numbers.

What OCR is and why it mattered for finance

Optical Character Recognition is the technology that lets a computer read printed or handwritten characters from an image. In an AR or O2C context, that image is almost always a document: a supplier invoice, a remittance advice attached to a wire, a cheque stub, a proof of delivery scan, a deduction backup PDF. Before OCR, those documents were keyed by hand. A shared services team in Manila or Krakow opening 4,000 emails a day and typing invoice numbers into an ERP was, until roughly 2015, the default operating model for global AR.

OCR changed the unit economics. A document that cost 1.20 euros to key manually could be captured for 0.10 to 0.25 euros. That gap is what funded the first wave of cash application automation, AP invoice capture tools, and lockbox digitisation. OCR did not make documents disappear; it made them cheap enough that finance teams stopped feeling them as a per-transaction cost.
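As a rough illustration of that gap, the arithmetic below applies the per-document costs quoted above to a daily volume. The 4,000 documents per day figure borrows the email volume mentioned earlier and assumes one document per email, which is a simplification. The function name is hypothetical.

```python
def daily_capture_cost(docs_per_day: int, cost_per_doc_eur: float) -> float:
    """Total daily capture cost in euros at a given per-document cost."""
    return docs_per_day * cost_per_doc_eur


DOCS_PER_DAY = 4_000  # assumption: one document per email

manual = daily_capture_cost(DOCS_PER_DAY, 1.20)    # manual keying
ocr_low = daily_capture_cost(DOCS_PER_DAY, 0.10)   # OCR, low end of the range
ocr_high = daily_capture_cost(DOCS_PER_DAY, 0.25)  # OCR, high end of the range

print(f"manual keying: {manual:,.0f} EUR/day")
print(f"OCR capture:   {ocr_low:,.0f} to {ocr_high:,.0f} EUR/day")
print(f"saving:        {manual - ocr_high:,.0f} to {manual - ocr_low:,.0f} EUR/day")
```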

How OCR works

Classical OCR is a pipeline of narrow steps. The image is binarised (converted to pure black and white), deskewed, and segmented into lines and characters. A pattern-matching engine then compares each character shape against a trained library of glyphs and returns the most likely letter or digit. Post-processing layers apply dictionaries, regular expressions, and template maps (for example: the invoice number always sits in a 60 by 20 pixel zone in the top right of this supplier's layout) to clean up the raw text and produce structured fields.
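The steps above can be sketched in a few lines of Python. This is a toy illustration rather than how any production engine is built: the 3 by 3 glyph library, the function names, and the `INV` label pattern are all assumptions made for the example, and deskewing and segmentation are skipped by starting from pre-cut character cells.

```python
import re

# Hypothetical 3x3 glyph library; real engines use far larger trained sets.
GLYPHS = {
    "1": ((0, 1, 0), (0, 1, 0), (0, 1, 0)),
    "7": ((1, 1, 1), (0, 0, 1), (0, 0, 1)),
    "0": ((1, 1, 1), (1, 0, 1), (1, 1, 1)),
}


def binarise(gray, threshold=128):
    """Map grayscale pixels (0-255) to 1 for ink (dark) and 0 for paper."""
    return tuple(tuple(1 if px < threshold else 0 for px in row) for row in gray)


def match_glyph(bitmap):
    """Return the library character whose bitmap differs in the fewest pixels."""
    def distance(candidate):
        return sum(
            a != b
            for row_a, row_b in zip(bitmap, candidate)
            for a, b in zip(row_a, row_b)
        )
    return min(GLYPHS, key=lambda ch: distance(GLYPHS[ch]))


def read_characters(gray_cells):
    """Binarise and classify a sequence of pre-segmented character cells."""
    return "".join(match_glyph(binarise(cell)) for cell in gray_cells)


def extract_invoice_number(raw_text):
    """Post-processing: pull a 7-digit number that follows an 'INV' label."""
    found = re.search(r"INV[\s:#-]*(\d{7})", raw_text)
    return found.group(1) if found else None


# One noisy cell: a "7" with a stray dark pixel in the bottom row.
noisy_seven = ((30, 20, 40), (250, 240, 10), (255, 120, 35))
print(match_glyph(binarise(noisy_seven)))
print(extract_invoice_number("Payment ref INV: 1234567"))
```

The regular expression plays the role of the template layer: it only finds the field when the label and format match what it was written for, which is the dependency discussed next.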

The template dependency is the critical limitation. A classical OCR stack works brilliantly when it has seen the layout before and the scan quality is high. It works poorly when either of those assumptions breaks.

OCR use cases in O2C and AR

Inside the order-to-cash function, OCR shows up in five places. Invoice capture on the AP side, where supplier invoices are read into the ERP. Remittance extraction on the AR side, where customer remittance advices (PDFs, emails, portal screenshots) are parsed for invoice numbers and amounts before being passed to a cash application engine. Cheque processing in markets like the US, France, and Canada where physical cheques still arrive at lockboxes. Proof of delivery digitisation, where signed PODs are captured and indexed so the collections team can answer disputes within minutes instead of days. Deduction documentation, where customer claim packets (often 30+ page bundles of bills of lading, marketing claims, and pricing letters) are read so deduction analysts have searchable text rather than image PDFs.

Where OCR breaks down

OCR was built for a world of standardised forms. Modern AR documents are nothing like that. A single mid-market manufacturer might receive remittances in 40 different formats: PDF tables, scanned cheque stubs, screenshots of customer portals, handwritten notes faxed from a hospital AP clerk, multi-page bank reports.

Pure OCR breaks on four things. Variable layouts, where the same customer sends a different remittance format each month. Multi-page documents, where context spans pages (the invoice number is on page 1, the payment amount is on page 3). Contextual disambiguation, for example telling a 7-digit PO number apart from a 7-digit invoice number when both appear in the same block. Handwriting and low-quality scans, where character-level accuracy can fall below 70 percent and the errors propagate into every downstream field. Practitioners running classical OCR at scale report field-level accuracy in the 75 to 85 percent range on real-world AR document mixes, which means roughly one in five extracted fields is wrong and the remittances that carry them still need a human touch.
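A back-of-envelope way to see how character errors propagate into fields: a field is only correct if every character in it is correct. The sketch below assumes each character is recognised independently, which is a simplification (real errors cluster on bad scans), and the function name is hypothetical.

```python
def field_accuracy(char_accuracy: float, field_length: int) -> float:
    """Probability that every character in a field is read correctly,
    assuming independent per-character errors (a simplifying assumption)."""
    return char_accuracy ** field_length


# A 7-digit invoice number at three character-level accuracy levels.
for char_acc in (0.99, 0.95, 0.70):
    print(char_acc, round(field_accuracy(char_acc, 7), 3))
```

Under that assumption, 95 percent per character leaves only about 70 percent of 7-digit fields fully correct, and 70 percent per character leaves under 10 percent.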

OCR vs Vision Language Models

The shift in the last 18 months is that vision language models can read a document the way a human does: they look at the whole page, reason about where labels and values sit relative to each other, and use prior knowledge of what an invoice or remittance is supposed to look like. They do not need a template. They do not even need to have seen the customer's layout before.

On the same variable AR document mix where classical OCR plateaus at 75 to 85 percent, modern VLMs are now hitting 95 to 99 percent field-level accuracy, with line-item and multi-row table extraction landing at 95 to 97 percent. The gap is biggest exactly where AR pain lives: handwriting, multi-page packets, unseen layouts, and tables with merged cells.

The other shift is semantic. OCR gives you a string. A VLM gives you a string plus an opinion about what that string means in context (is this the remit-to address or the bill-to address, is this an invoice number or a credit memo reference). For cash application, that contextual layer is what removes the second human touch.

The hybrid stack in modern cash application

The production pattern in 2025 is not VLM-only. It is a routed hybrid. Clean, structured electronic documents (an EDI 820, a BAI2 file, a well-formed PDF remittance from a Tier 1 customer) flow through a classical OCR or direct-parse path because it is faster and cheaper per document. Anything that triggers a confidence drop (handwriting detected, unseen layout, multi-page, low scan quality) gets escalated to a VLM. Outputs from both paths feed the same downstream matching engine.
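A minimal sketch of that routing decision, assuming the capture layer already exposes a few signals per document. The `Document` fields, the `route` function, and the 0.90 confidence floor are hypothetical names and values chosen for illustration, not any vendor's API.

```python
from dataclasses import dataclass


@dataclass
class Document:
    doc_id: str
    ocr_confidence: float          # 0.0 to 1.0, as reported by the OCR engine
    page_count: int = 1
    has_handwriting: bool = False
    layout_seen_before: bool = True


CONFIDENCE_FLOOR = 0.90  # hypothetical threshold; tune on your own document mix


def route(doc: Document) -> str:
    """Return 'ocr' for clean documents and 'vlm' if any escalation trigger fires."""
    triggers = (
        doc.has_handwriting,
        not doc.layout_seen_before,
        doc.page_count > 1,
        doc.ocr_confidence < CONFIDENCE_FLOOR,
    )
    return "vlm" if any(triggers) else "ocr"


print(route(Document("clean-pdf", ocr_confidence=0.98)))
print(route(Document("fax-note", ocr_confidence=0.98, has_handwriting=True)))
```

Both branches return only a path label here; in a real stack each path would emit the same structured fields so the downstream matching engine does not need to know which one ran.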

For finance buyers, the practical implication is this: do not evaluate capture vendors on a headline OCR accuracy number. Ask for straight-through-processing rates broken down by document type (structured PDF, scanned PDF, handwritten, multi-page packet). That breakdown reveals whether the vendor is genuinely hybrid or whether they are quietly relying on offshore keyers to clean up OCR's misses before the numbers are reported.
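To make the requested breakdown concrete, here is one way to compute straight-through-processing rates per document type from a processing log. The record shape (a document type plus a flag for human touch) and the sample data are invented for illustration.

```python
from collections import defaultdict


def stp_rate_by_type(records):
    """records: iterable of (doc_type, touched_by_human) pairs.
    Returns {doc_type: share of documents processed with no human touch}."""
    totals = defaultdict(int)
    untouched = defaultdict(int)
    for doc_type, touched_by_human in records:
        totals[doc_type] += 1
        if not touched_by_human:
            untouched[doc_type] += 1
    return {doc_type: untouched[doc_type] / totals[doc_type] for doc_type in totals}


# Invented sample: 10 structured PDFs (1 touched), 4 handwritten (3 touched).
sample = (
    [("structured_pdf", False)] * 9
    + [("structured_pdf", True)]
    + [("handwritten", False)]
    + [("handwritten", True)] * 3
)
rates = stp_rate_by_type(sample)
blended = sum(1 for _, touched in sample if not touched) / len(sample)
print(rates)
print(round(blended, 3))
```

In this made-up mix the blended rate looks respectable while the handwritten rate is poor, which is the kind of gap a single headline number hides.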

Frequently asked questions

Is OCR still relevant for cash application in 2025?

Yes, but as one component in a hybrid stack rather than the primary engine. OCR remains the cheapest and fastest path for clean, structured documents (well-formed PDF remittances, machine-printed invoices) and still handles a large share of daily volume. The change is that variable, handwritten, or multi-page documents now route to a vision language model instead of dropping to a human queue.

What accuracy should I expect from pure OCR on real AR documents?

On a representative mix of customer remittances, lockbox images, and POD scans, classical OCR delivers roughly 75 to 85 percent field-level accuracy. Clean machine-printed PDFs sit at the top of that range; handwritten or low-quality scans pull the average down. A 95 percent number quoted by a vendor usually refers to character-level accuracy on benchmark datasets, not field-level accuracy on your actual document mix.

How is a vision language model different from OCR?

OCR converts pixels into characters using pattern matching and templates. A vision language model looks at the whole page and reasons about structure, context, and meaning, the way a human reader does. The VLM can identify that a 7-digit number in the top right is the invoice number and not the PO number, even on a layout it has never seen before, because it understands what an invoice is.

Can I replace my OCR stack entirely with a VLM?

Technically yes, but it is rarely the right commercial decision. VLM inference is more expensive per page than classical OCR, and for structured high-volume documents the accuracy gain is small. The economic answer is to route by confidence: cheap OCR for the easy 70 to 80 percent of volume, VLM for the long tail where OCR would have failed.

Where does OCR fit in the cash application workflow specifically?

OCR sits between document arrival and the matching engine. A remittance email lands, attachments are extracted, OCR (or a VLM) converts the image into structured fields (customer, invoice numbers, amounts, deductions), and those fields are passed to the matching engine that pairs them against open invoices and the bank statement. If OCR misses, the cash application AI either escalates to a VLM or queues the item for a human analyst.

How should I evaluate an OCR or document capture vendor?

Ignore the headline accuracy percentage and ask three questions. What is your straight-through-processing rate by document type (structured PDF, scanned PDF, handwritten, multi-page packet)? How do you route documents between classical OCR and VLM extraction? And how much of your reported automation rate depends on offshore human review hidden inside the platform? The answers separate genuinely AI-native capture stacks from rebranded OCR with a keyer pool behind it.
