AI Hallucination

A hallucination is when an AI model, typically a large language model, generates output that sounds fluent and confident but is factually incorrect, fabricated, or unverifiable. In finance, a hallucinated invoice number, customer name, or euro amount can post cash to the wrong account, extend credit to the wrong entity, or pollute the audit trail.

Key Takeaways

  • Hallucination is when an LLM produces plausible, confidently worded output that is factually wrong or invented, with no signal that the model is uncertain.
  • In AR and O2C, hallucinations are not cosmetic errors. A fabricated invoice reference or amount can misapply cash, misroute disputes, or trigger an incorrect credit decision.
  • Frontier models hallucinate roughly 3 to 15 percent of the time on benchmark tests; older or smaller models can exceed 30 percent. Finance-specific tasks tend to score worse because training data is sparse.
  • Mitigation stacks combine RAG for grounding, forced citations, deterministic validation against ground-truth systems, and confidence thresholds that route low-confidence output to human review.
  • Production AR systems must never auto-post amounts the model invented. The deterministic policy engine, not the LLM, is the source of truth for any euro that moves.

What hallucination is and why it is the number one AI risk for finance

Hallucination is the term for output that a language model generates with high fluency and high confidence but that is factually wrong, fabricated, or unverifiable against any real source. The model is not lying in any intentional sense. It is sampling the most probable next token given its training, and probability is not the same as truth.

For a marketing assistant, a hallucination is awkward. For an AR or O2C process, it is a control failure. Cash application, credit scoring, dispute coding, and collections outreach all run on identifiers: customer numbers, invoice references, contract terms, euro amounts, due dates. A model that invents any of these values, even occasionally, can post a 45,000 euro receipt to the wrong customer, extend credit to a lookalike entity, or send a dunning email referencing an invoice that does not exist. This is why hallucination sits at the top of the risk register for any finance leader evaluating an AI-native or agentic AR platform.

The five types of hallucination finance teams should know

Not all hallucinations look the same, and the mitigation differs by type.

  • Factual hallucination. The model states something that is not true, for example claiming the median DSO benchmark for European SaaS is 47 days when no such study exists.
  • Citation hallucination. The model invents sources: fake court cases, fake paper titles, fake regulator publications. The format looks correct, but the document does not exist.
  • Numerical hallucination. The model produces a specific figure or statistic that is fabricated. The number is often plausibly close to a real one, which makes it hard to spot.
  • Context hallucination. The model misremembers what was said earlier in the same conversation or in the supplied context, blending real and invented details.
  • Tool hallucination. In agent setups, the model invents a function call, parameter name, or API endpoint that does not exist, producing a tool call the orchestrator cannot execute.
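
Tool hallucination is the one type an orchestrator can reject mechanically before anything executes. The following is a minimal Python sketch, assuming a hypothetical tool registry; the tool names and parameter names are invented for illustration and do not come from any real platform.

```python
# Hypothetical registry: tool name -> the exact set of parameters it accepts.
# Both tool names and parameter names are illustrative assumptions.
TOOL_REGISTRY = {
    "get_open_invoices": {"customer_id"},
    "get_remittance": {"remittance_id"},
}


def validate_tool_call(name, arguments):
    """Return a list of problems; an empty list means the call can be executed."""
    if name not in TOOL_REGISTRY:
        return [f"unknown tool: {name}"]
    allowed = TOOL_REGISTRY[name]
    # Parameters the model supplied that the tool does not define.
    unknown = sorted(set(arguments) - allowed)
    # Parameters the tool requires that the model left out.
    missing = sorted(allowed - set(arguments))
    problems = [f"unknown parameter: {p}" for p in unknown]
    problems += [f"missing parameter: {p}" for p in missing]
    return problems
```

A call to an invented tool, or a real tool with an invented parameter, returns a non-empty list and is never passed to the executor.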

Root causes

Hallucinations are not a bug to be patched out. They are a structural property of how current LLMs work, and there are four main drivers.

First, training data is of mixed quality. Models learn from internet-scale text that contains both accurate and inaccurate statements, and they have no inherent way to weight one over the other. Second, without retrieval-augmented generation, the model is recalling from compressed parameters rather than reading a source document; recall from compression is lossy. Third, models are trained to produce an answer. Saying "I do not know" is rarely rewarded during training, so the model defaults to a confident response even when its internal probability is low. Fourth, confidence calibration is poor. Frontier models sound roughly as confident when they are wrong as when they are right, which removes the natural human signal that would prompt a double-check.

Hallucination rates by model class

Public benchmarks and provider-published evaluations give a rough range, although numbers shift with each model release and vary by task.

  • Frontier models such as the latest GPT, Claude, and Gemini families typically hallucinate on roughly 3 to 15 percent of benchmark questions, depending on the test set.
  • Older or smaller models, including many open-weight models under 30 billion parameters, sit in the 20 to 40 percent range on the same tests.
  • Domain-specific tasks such as finance, legal, and medicine tend to score worse than general knowledge tests because the training data is sparser and more specialised.

The honest read for a finance buyer is that even the best models hallucinate often enough that no production AR workflow can rely on raw model output for any decision that moves money.

Mitigation strategies that actually work

Mitigation is a stack, not a single switch.

Retrieval-augmented generation. Ground the model in retrieved documents from your AR system, ERP, and contract repository. The model summarises what it reads rather than recalls from parameters. RAG reduces hallucination materially but does not eliminate it.

Forced citations. Require every factual claim or numerical value to be linked to the source document it came from. Then verify, programmatically, that the cited document exists and contains the value claimed.
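
The programmatic check can be very small. The following is a minimal Python sketch, assuming a hypothetical in-memory document store keyed by document ID; the document ID and text are invented for illustration, and a production system would read from its own repository.

```python
# Hypothetical document store: document ID -> extracted text.
# The ID and contents are illustrative assumptions, not real data.
DOCUMENTS = {
    "INV-2024-0017.pdf": "Invoice INV-2024-0017 Total due: 45,000.00 EUR",
}


def citation_supports_claim(doc_id, claimed_value):
    """True only if the cited document exists and contains the claimed value verbatim."""
    text = DOCUMENTS.get(doc_id)
    if text is None:
        # The model cited a document that does not exist.
        return False
    # Verbatim substring match: deliberately strict, so a near-miss fails.
    return claimed_value in text
```

Verbatim matching is a deliberately strict assumption here. It will reject values that are correct but formatted differently, which is usually the safer failure mode for a finance workflow.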

Deterministic validation. Every invoice number, customer ID, and euro amount the model emits must be validated against the open AR ledger, customer master, or remittance file. If the value does not match a real record, it is rejected before any downstream action.
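
As an illustration, the check can be sketched in Python as an exact lookup on customer, invoice, and amount. The ledger below is a hypothetical in-memory stand-in for the open AR ledger, and the identifiers are invented for the example.

```python
from decimal import Decimal, InvalidOperation

# Hypothetical open AR ledger: (customer_id, invoice_number) -> open amount in EUR.
OPEN_AR = {
    ("C-1001", "INV-2024-0017"): Decimal("45000.00"),
}


def is_valid_proposal(customer_id, invoice_number, amount):
    """Accept a model-proposed match only if all three values match a real open item."""
    try:
        # Decimal avoids the rounding surprises of binary floats on currency.
        proposed = Decimal(amount)
    except InvalidOperation:
        # The model emitted something that is not a number at all.
        return False
    open_amount = OPEN_AR.get((customer_id, invoice_number))
    return open_amount is not None and proposed == open_amount
```

Anything that fails the lookup, including an amount that is off by one cent, is rejected rather than corrected, so the model's output never overrides the ledger.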

Confidence thresholds. Use the model's token log-probabilities where the provider exposes them, or a separately calibrated classifier, to route low-confidence outputs to a human reviewer rather than auto-executing them.
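
One simple way to turn token log-probabilities into a routing decision is sketched below in Python. The 0.90 threshold and the route names are assumptions chosen for illustration; a real threshold would have to be tuned on labelled outputs from the workflow in question.

```python
import math
from typing import Sequence


def mean_token_probability(token_logprobs: Sequence[float]) -> float:
    """Geometric-mean probability of the generated tokens (exp of the mean log-prob)."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def route(token_logprobs: Sequence[float], threshold: float = 0.90) -> str:
    """Send low-confidence output to a person; everything else continues to validation."""
    if not token_logprobs:
        # No confidence signal available: fail safe.
        return "human_review"
    if mean_token_probability(token_logprobs) >= threshold:
        # High confidence does not mean correct; validation still runs next.
        return "continue_to_validation"
    return "human_review"
```

Passing the threshold only earns the output a place in the validation queue. It is not treated as evidence that the value is right.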

Tool use over recall. Where possible, instruct the model to call a database query rather than recall a fact. A SQL result is verifiable; a remembered figure is not.
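
A minimal Python sketch of such a lookup tool follows, using the standard-library sqlite3 module with an in-memory database. The table name, column names, and sample row are hypothetical and stand in for a real AR data store.

```python
import sqlite3

# In-memory database standing in for the AR system; schema and row are illustrative.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE open_ar (customer_id TEXT, invoice_number TEXT, amount_cents INTEGER)"
)
conn.execute("INSERT INTO open_ar VALUES ('C-1001', 'INV-2024-0017', 4500000)")


def get_open_amount_cents(customer_id, invoice_number):
    """Tool the model calls instead of recalling a figure; returns None if no record exists."""
    row = conn.execute(
        # Parameterised query: model-supplied values are never spliced into the SQL text.
        "SELECT amount_cents FROM open_ar WHERE customer_id = ? AND invoice_number = ?",
        (customer_id, invoice_number),
    ).fetchone()
    return row[0] if row else None
```

The model supplies only the lookup keys. The amount comes from the database, and a missing record surfaces as None rather than as a plausible guess.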

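Chain-of-thought reasoning. Ask the model to show its working. Reasoning traces can make it easier for a reviewer to spot where a step went wrong, but the trace is itself model output and can contain errors, so it supports review rather than replacing validation.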

Production guardrails for AR systems

Beyond mitigation, AR systems need hard guardrails at the points where money or commitment is at stake.

  • Cash application. Never auto-post an invoice reference or amount that the model produced. Validate against open AR; require an exact match on customer plus invoice plus amount before posting.
  • Credit decisions. The model can recommend a credit limit or risk band. A deterministic policy engine, not the LLM, makes the binding decision and writes it to the ledger.
  • Customer communication. For material decisions such as escalations, payment plans, or legal notices, model output is a draft only. A human approves before send.
  • Audit trail. Log the model input, the raw output, the validation result, and the human approval if any. Auditors will ask, and the trail must reconstruct every euro.
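
The credit and audit guardrails can be sketched together in Python. The policy table, risk bands, and field names below are hypothetical; the point is only that the binding figure comes from a rule table and that all four logged items land in one record.

```python
import json
from decimal import Decimal
from typing import Optional

# Hypothetical policy table: risk band -> maximum credit limit in EUR.
MAX_LIMIT_BY_BAND = {
    "A": Decimal("250000"),
    "B": Decimal("100000"),
    "C": Decimal("25000"),
}


def binding_credit_limit(risk_band: str, recommended: Decimal) -> Decimal:
    """Policy engine decision: the model's recommendation is capped by the rule table."""
    cap = MAX_LIMIT_BY_BAND[risk_band]  # an unknown band raises KeyError on purpose
    # Clamp to the range [0, cap]; the model can lower the limit but never raise it.
    return max(Decimal("0"), min(recommended, cap))


def audit_record(model_input: str, raw_output: str,
                 validation_result: str, approver: Optional[str]) -> str:
    """Serialise the model input, raw output, validation result, and approval as one JSON line."""
    return json.dumps(
        {
            "model_input": model_input,
            "raw_output": raw_output,
            "validation_result": validation_result,
            "human_approval": approver,  # None when no human was involved
        },
        sort_keys=True,
    )
```

In this sketch a recommendation above the cap is silently reduced. A stricter design might reject it and route it to review instead, which is a policy choice rather than a technical one.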

The honest stance for a CFO is straightforward. Treat the LLM as a confident junior analyst. Useful, fast, often right, but never the final signature on anything that moves cash or commits credit.

Frequently asked questions

Does using a frontier model like GPT or Claude eliminate hallucinations?

No. Frontier models reduce hallucination rates compared with older or smaller models, but published benchmarks still show roughly 3 to 15 percent error rates depending on the task, and finance-specific tasks tend to score worse. For any AR workflow that moves money, the model output must still be validated against ground truth before it takes effect.

Does RAG fully solve hallucination?

No. Retrieval-augmented generation reduces hallucination significantly because the model summarises retrieved documents rather than recalling from memory, but it does not eliminate it. The model can still misread a retrieved document, blend two sources, or invent a detail that was not in the retrieved text. RAG must be combined with citation checking and deterministic validation.

What is the difference between a hallucination and a normal model error?

A normal error is usually a misclassification or a misranking that the model itself signals as uncertain. A hallucination is output that is fluent, confidently phrased, and indistinguishable in tone from a correct answer, but factually wrong or fabricated. The danger is not the error rate alone; it is the absence of any signal that something is wrong.

How do we catch the hallucinations that are subtle, such as a number that is plausibly close to the real one?

Subtle numerical hallucinations are the hardest to catch by reading. The reliable approach is to never accept a number from the model as authoritative. Any euro amount, invoice number, or customer identifier must be cross-checked against the source system, and any value that does not exactly match a real record is rejected or flagged for review.

Are hallucinations a deal-breaker for using AI in cash application or credit?

No, provided the system is architected correctly. The LLM should propose matches, codings, or recommendations, and a deterministic policy engine should make the binding decision. With RAG, validation against open AR, and a human in the loop for low-confidence cases, hallucination becomes a manageable risk rather than a blocker.

How do we explain hallucination risk to an audit committee?

Frame it in three layers. First, acknowledge that LLMs can produce confident but incorrect output, and quantify the residual rate from vendor evaluations. Second, describe the control stack: retrieval grounding, citation enforcement, deterministic validation against the ledger, and confidence thresholds. Third, show the audit trail: every model output, every validation result, and every human approval is logged and reviewable.
