OCR
Optical Character Recognition (OCR) is software that converts images of typed or handwritten text (scanned invoices, remittance PDFs, cheques, proofs of delivery) into machine-readable data. In AR and O2C, it sits at the front of capture workflows but is increasingly paired with, or replaced by, vision language models for variable-layout documents.
Optical Character Recognition is the technology that lets a computer read printed or handwritten characters from an image. In an AR or O2C context, that image is almost always a document: a supplier invoice, a remittance advice attached to a wire, a cheque stub, a proof of delivery scan, a deduction backup PDF. Before OCR, those documents were keyed by hand. A shared services team in Manila or Krakow opening 4,000 emails a day and typing invoice numbers into an ERP was, until roughly 2015, the default operating model for global AR.
OCR changed the unit economics. A document that cost 1.20 euros to key manually could be captured for 0.10 to 0.25 euros. That gap is what funded the first wave of cash application automation, AP invoice capture tools, and lockbox digitisation. OCR did not make documents disappear; it made them cheap enough that finance teams stopped feeling them as a per-transaction cost.
Classical OCR is a pipeline of narrow steps. The image is binarised (converted to pure black and white), deskewed, and segmented into lines and characters. A pattern-matching engine then compares each character shape against a trained library of glyphs and returns the most likely letter or digit. Post-processing layers apply dictionaries, regular expressions, and template maps (for example: the invoice number always sits in a 60 by 20 pixel zone in the top right of this supplier's layout) to clean up the raw text and produce structured fields.
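The steps above can be sketched in a few lines of plain Python. This is a toy illustration, not a production OCR engine: the glyph library, the zone coordinates, and the `INV-` plus seven digits pattern are all hypothetical, and deskewing and line segmentation are left out for brevity.

```python
import re

# Hypothetical 3x5 glyph library; a real engine holds thousands of trained shapes.
GLYPHS = {
    "1": ((0, 1, 0), (1, 1, 0), (0, 1, 0), (0, 1, 0), (1, 1, 1)),
    "7": ((1, 1, 1), (0, 0, 1), (0, 1, 0), (0, 1, 0), (0, 1, 0)),
}

def binarise(gray, threshold=128):
    """Convert a grayscale grid (0-255) to pure black (1) and white (0)."""
    return [[1 if px < threshold else 0 for px in row] for row in gray]

def crop_zone(image, zone):
    """Cut a template zone (top, left, height, width) out of the page image."""
    top, left, height, width = zone
    return [row[left:left + width] for row in image[top:top + height]]

def match_glyph(bitmap):
    """Return the library character whose shape differs in the fewest pixels."""
    def distance(a, b):
        return sum(pa != pb for ra, rb in zip(a, b) for pa, pb in zip(ra, rb))
    return min(GLYPHS, key=lambda ch: distance(bitmap, GLYPHS[ch]))

def clean_invoice_number(raw_text):
    """Post-process raw OCR text: validate by regex, then fix common
    letter/digit confusions (O->0, I->1, l->1, S->5) in the numeric part."""
    match = re.search(r"INV[-\s]?([0-9OIlS]{7})", raw_text)
    if match is None:
        return None  # the template assumption failed; a human has to look
    digits = match.group(1).translate(str.maketrans("OIlS", "0115"))
    return "INV-" + digits

# A noisy "7" (bottom pixel missing) still matches the right glyph.
noisy_seven = [[1, 1, 1], [0, 0, 1], [0, 1, 0], [0, 1, 0], [0, 0, 0]]
recognised = match_glyph(noisy_seven)

# Raw text with two character confusions is repaired by post-processing.
invoice_no = clean_invoice_number("Invoice no: INV 2O4I857")
```

Note how much of the result depends on the regular expression and the zone being right for this particular layout: a different supplier format returns `None` rather than a field.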
The template dependency is the critical limitation. A classical OCR stack works brilliantly when it has seen the layout before and the scan quality is high. It works poorly when either of those assumptions breaks.
Inside the order-to-cash function, OCR shows up in five places. Invoice capture on the AP side, where supplier invoices are read into the ERP. Remittance extraction on the AR side, where customer remittance advices (PDFs, emails, portal screenshots) are parsed for invoice numbers and amounts before being passed to a cash application engine. Cheque processing in markets like the US, France, and Canada where physical cheques still arrive at lockboxes. Proof of delivery digitisation, where signed PODs are captured and indexed so the collections team can answer disputes within minutes instead of days. Deduction documentation, where customer claim packets (often 30+ page bundles of bills of lading, marketing claims, and pricing letters) are read so deduction analysts have searchable text rather than image PDFs.
OCR was built for a world of standardised forms. Modern AR documents are nothing like that. A single mid-market manufacturer might receive remittances in 40 different formats: PDF tables, scanned cheque stubs, screenshots of customer portals, handwritten notes faxed from a hospital AP clerk, multi-page bank reports.
Pure OCR breaks on four things. Variable layouts, where the same customer sends a different remittance format each month. Multi-page documents, where context spans pages (the invoice number is on page 1, the payment amount is on page 3). Contextual disambiguation, for example telling a 7-digit PO number apart from a 7-digit invoice number when both appear in the same block. Handwriting and low-quality scans, where character-level accuracy collapses below 70 percent and propagates errors into every downstream field. Practitioners running classical OCR at scale report field-level accuracy in the 75 to 85 percent range on real-world AR document mixes, which means roughly one extracted field in five is wrong or missing, and any remittance carrying such a field still needs a human touch.
The shift in the last 18 months is that vision language models can read a document the way a human does: they look at the whole page, reason about where labels and values sit relative to each other, and use prior knowledge of what an invoice or remittance is supposed to look like. They do not need a template. They do not even need to have seen the customer's layout before.
On the same variable AR document mix where classical OCR plateaus at 75 to 85 percent, modern VLMs are now hitting 95 to 99 percent field-level accuracy, with line-item and multi-row table extraction landing at 95 to 97 percent. The gap is biggest exactly where AR pain lives: handwriting, multi-page packets, unseen layouts, and tables with merged cells.
The other shift is semantic. OCR gives you a string. A VLM gives you a string plus an opinion about what that string means in context (is this the remit-to address or the bill-to address, is this an invoice number or a credit memo reference). For cash application, that contextual layer is what removes the second human touch.
The production pattern in 2025 is not VLM-only. It is a routed hybrid. Clean, structured electronic documents (an EDI 820, a BAI2 file, a well-formed PDF remittance from a Tier 1 customer) flow through a classical OCR or direct-parse path because it is faster and cheaper per document. Anything that triggers a confidence drop (handwriting detected, unseen layout, multi-page, low scan quality) gets escalated to a VLM. Outputs from both paths feed the same downstream matching engine.
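A minimal sketch of that routing decision, assuming the capture layer exposes a few signals per document. The `Document` fields, the 0.90 confidence floor, and the path names are illustrative assumptions, not any vendor's actual API.

```python
from dataclasses import dataclass

@dataclass
class Document:
    doc_id: str
    fmt: str                    # e.g. "edi820", "bai2", "pdf", "image"
    ocr_confidence: float       # hypothetical 0.0-1.0 score from the OCR pass
    page_count: int = 1
    has_handwriting: bool = False
    known_layout: bool = True

# Structured electronic formats need no image recognition at all.
DIRECT_PARSE_FORMATS = {"edi820", "bai2"}
CONFIDENCE_FLOOR = 0.90  # assumed threshold; tune on your own document mix

def route(doc: Document) -> str:
    """Pick the cheapest extraction path that is likely to succeed."""
    if doc.fmt in DIRECT_PARSE_FORMATS:
        return "direct_parse"
    escalate = (
        doc.has_handwriting
        or not doc.known_layout
        or doc.page_count > 1
        or doc.ocr_confidence < CONFIDENCE_FLOOR
    )
    return "vlm" if escalate else "ocr"

paths = [
    route(Document("r1", "edi820", 1.0)),
    route(Document("r2", "pdf", 0.97)),
    route(Document("r3", "image", 0.95, has_handwriting=True)),
    route(Document("r4", "pdf", 0.60)),
]
```

Whatever path a document takes, the extracted fields should land in one common schema so the downstream matching engine does not need to know which engine produced them.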
For finance buyers, the practical implication is this: do not evaluate capture vendors on a headline OCR accuracy number. Ask for straight-through-processing rates broken down by document type (structured PDF, scanned PDF, handwritten, multi-page packet). That breakdown reveals whether the vendor is genuinely hybrid or whether they are quietly relying on offshore keyers to clean up OCR's misses before the numbers are reported.
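The breakdown is simple to compute from a processing log. The sketch below assumes each record is a `(document_type, human_touched)` pair, where `human_touched` is true for any manual intervention, including review by keyers inside the vendor's platform; the sample data is invented purely for illustration.

```python
from collections import defaultdict

def stp_by_type(records):
    """Return the straight-through-processing rate per document type.

    records: iterable of (document_type, human_touched) pairs.
    """
    totals = defaultdict(int)
    touchless = defaultdict(int)
    for doc_type, human_touched in records:
        totals[doc_type] += 1
        if not human_touched:
            touchless[doc_type] += 1
    return {doc_type: touchless[doc_type] / totals[doc_type] for doc_type in totals}

# Invented sample: 10 structured PDFs (1 touched), 4 handwritten (2 touched).
sample = (
    [("structured_pdf", False)] * 9
    + [("structured_pdf", True)]
    + [("handwritten", False)] * 2
    + [("handwritten", True)] * 2
)
rates = stp_by_type(sample)
overall = sum(1 for _, touched in sample if not touched) / len(sample)
```

In this sample the overall rate is 11 of 14 documents, which hides the fact that half of the handwritten documents needed a person. That is the effect the per-type breakdown is meant to expose.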
OCR is still relevant, but as one component in a hybrid stack rather than the primary engine. It remains the cheapest and fastest path for clean, structured documents (well-formed PDF remittances, machine-printed invoices) and still handles a large share of daily volume. The change is that variable, handwritten, or multi-page documents now route to a vision language model instead of dropping to a human queue.
On a representative mix of customer remittances, lockbox images, and POD scans, classical OCR delivers roughly 75 to 85 percent field-level accuracy. Clean machine-printed PDFs sit at the top of that range; handwritten or low-quality scans pull the average down. A 95 percent number quoted by a vendor usually refers to character-level accuracy on benchmark datasets, not field-level accuracy on your actual document mix.
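The gap between the two metrics is easy to see with a back-of-the-envelope model. The sketch below assumes character errors are independent and that a field counts as correct only if every character is correct; real errors cluster, so treat the result as a rough illustration rather than a prediction.

```python
def field_accuracy(char_accuracy: float, field_length: int) -> float:
    """Probability that every character in a field is read correctly,
    assuming independent character errors (a simplifying assumption)."""
    return char_accuracy ** field_length

# A 7-digit invoice number read at 95 percent character-level accuracy.
seven_digit_field = field_accuracy(0.95, 7)
```

Under these assumptions a 95 percent character-level engine gets a 7-digit invoice number fully right only about 70 percent of the time, which is why the two numbers should never be compared directly.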
OCR converts pixels into characters using pattern matching and templates. A vision language model looks at the whole page and reasons about structure, context, and meaning, the way a human reader does. The VLM can identify that a 7-digit number in the top right is the invoice number and not the PO number, even on a layout it has never seen before, because it understands what an invoice is.
A VLM can technically replace OCR entirely, but that is rarely the right commercial decision. VLM inference is more expensive per page than classical OCR, and for structured high-volume documents the accuracy gain is small. The economic answer is to route by confidence: cheap OCR for the easy 70 to 80 percent of volume, VLM for the long tail where OCR would have failed.
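The economics can be sketched with a blended per-page cost. The prices below are placeholders chosen only to make the arithmetic visible, not market rates, and the model assumes every page gets a cheap OCR pass first (to produce a confidence score) with only the escalated share paying for a VLM call on top.

```python
def blended_cost_per_page(ocr_share: float, ocr_cost: float, vlm_cost: float) -> float:
    """Average cost per page when every page is OCR'd and the share that
    OCR cannot handle (1 - ocr_share) is additionally sent to a VLM."""
    return ocr_cost + (1.0 - ocr_share) * vlm_cost

# Placeholder prices: 0.01 per page for OCR, 0.05 per page for VLM inference.
OCR_COST = 0.01
VLM_COST = 0.05

hybrid = blended_cost_per_page(ocr_share=0.75, ocr_cost=OCR_COST, vlm_cost=VLM_COST)
vlm_only = VLM_COST
```

With these placeholder prices the routed hybrid costs less than half as much per page as sending everything to the VLM. The real ratio depends entirely on your own prices and on how large the easy share of your volume is.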
OCR sits between document arrival and the matching engine. A remittance email lands, attachments are extracted, OCR (or a VLM) converts the image into structured fields (customer, invoice numbers, amounts, deductions), and those fields are passed to the matching engine that pairs them against open invoices and the bank statement. If OCR misses, the cash application AI either escalates to a VLM or queues the item for a human analyst.
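The hand-off to the matching engine can be sketched as follows. This is a deliberately simplified illustration: the field names and exception reasons are hypothetical, and a real engine would also reconcile against the bank statement, handle partial payments, and classify deductions rather than just flag mismatches.

```python
from decimal import Decimal

def match_remittance(lines, open_invoices):
    """Pair extracted remittance lines against open invoices.

    lines: list of (invoice_number, paid_amount) from OCR or a VLM.
    open_invoices: dict mapping invoice_number -> open amount.
    Returns (matched invoice numbers, exceptions for a human analyst).
    """
    matched, exceptions = [], []
    for invoice_no, paid in lines:
        open_amount = open_invoices.get(invoice_no)
        if open_amount is None:
            # Often an extraction miss, e.g. a PO number read as an invoice number.
            exceptions.append((invoice_no, "unknown invoice"))
        elif paid != open_amount:
            # Could be a deduction or a misread amount; needs review either way.
            exceptions.append((invoice_no, "amount mismatch"))
        else:
            matched.append(invoice_no)
    return matched, exceptions

open_items = {"INV-2041857": Decimal("1200.00"), "INV-2041858": Decimal("350.50")}
extracted = [
    ("INV-2041857", Decimal("1200.00")),
    ("INV-2041858", Decimal("300.50")),
    ("INV-9999999", Decimal("75.00")),
]
matched, exceptions = match_remittance(extracted, open_items)
```

Amounts are held as `Decimal` rather than floats so that equality checks on currency values behave as an accountant would expect.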
When evaluating a capture vendor, ignore the headline accuracy percentage and ask three questions. What is your straight-through-processing rate by document type (structured PDF, scanned PDF, handwritten, multi-page packet)? How do you route documents between classical OCR and VLM extraction? And how much of your reported automation rate depends on offshore human review hidden inside the platform? The answers separate genuinely AI-native capture stacks from rebranded OCR with a keyer pool behind it.