Fine-Tuning

Fine-tuning is the process of further training a pre-trained foundation model on a curated dataset so it adapts its behaviour, style, or task performance to a specific domain. In finance, it is one of three main ways, alongside retrieval-augmented generation (RAG) and prompt engineering, to customise large language models for accounts receivable (AR) and order-to-cash (O2C) workflows.

Key Takeaways

  • Fine-tuning updates a foundation model's weights using a domain-specific dataset, changing how the model behaves rather than what it knows at runtime.
  • Parameter-Efficient Fine-Tuning (PEFT), particularly LoRA, is now the dominant enterprise approach: a run typically costs 100 to 1,000 euros, versus 10,000 euros or more for full fine-tuning.
  • Fine-tuning is best for narrow, stable tasks like output format normalisation, classification accuracy, and brand voice. RAG is better for anything involving current company data.
  • For finance specifically, RAG is almost always preferred because customer and contract data changes constantly, audit requires explainability, and baking data into weights creates data residency risk.
  • Common pitfalls include training data leakage, catastrophic forgetting of general capability, overfitting, and shipping a fine-tuned model without an evaluation harness.

What fine-tuning is and how it differs from pre-training, RAG, and prompting

Fine-tuning takes a pre-trained foundation model and continues training it on a smaller, curated dataset so it adapts its behaviour, style, or task performance to a specific domain. The model's weights change, so after fine-tuning the same prompt can produce a different response than it would from the base model.

It is useful to place fine-tuning next to three related but distinct techniques:

  • Pre-training is the initial training of a foundation model on massive general corpora (trillions of tokens of web text, books, code). This costs tens to hundreds of millions of euros and is done by foundation model providers, not enterprises.
  • RAG (Retrieval-Augmented Generation) retrieves relevant documents at runtime and adds them to the prompt. Model weights never change. Knowledge can be updated by changing the underlying documents.
  • Prompt engineering changes the input text to the model without changing weights. System prompting is the persistent-instruction version of this: a long block of guidance injected into every conversation.

The mental model: pre-training builds the brain, fine-tuning shapes habits, RAG hands the model a reference document, and prompt engineering tells it what job to do today.
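To make the contrast concrete, here is a minimal sketch of the RAG and prompt-engineering half of that picture: documents are retrieved at runtime and placed in the prompt next to a fixed system prompt, and no weights are touched. The keyword-overlap scoring, the function names, and the sample documents are all illustrative assumptions; a production system would more likely use embeddings and a vector store.

```python
import re

def tokenize(text):
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, top_k=1):
    """Rank documents by how many tokens they share with the question."""
    question_tokens = tokenize(question)
    ranked = sorted(
        documents,
        key=lambda doc: len(question_tokens & tokenize(doc)),
        reverse=True,
    )
    return ranked[:top_k]

def build_prompt(system_prompt, question, documents):
    """Combine persistent instructions with the documents retrieved at runtime."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents))
    return f"{system_prompt}\n\nContext:\n{context}\n\nQuestion: {question}"

# Hypothetical knowledge base: editing these strings changes what the
# model can see, without any retraining.
DOCS = [
    "Invoice INV-001 for Acme Ltd is 30 days overdue.",
    "Payment terms for Globex are net 60.",
]

prompt = build_prompt(
    "You are an AR assistant. Answer only from the context.",
    "What are the payment terms for Globex?",
    DOCS,
)
print(prompt)
```

Updating the knowledge here means editing `DOCS`. Changing the habits of the model that reads the prompt is what fine-tuning is for.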

The three main fine-tuning approaches

Full fine-tuning updates every parameter in the model. For a modern 70-billion-parameter model this requires significant GPU infrastructure and produces a complete new model checkpoint. Costs typically run from 10,000 to 100,000 euros or more per training run, and the resulting model must be hosted as a custom deployment.

Parameter-Efficient Fine-Tuning (PEFT) updates only a small subset of parameters, leaving the bulk of the model frozen. LoRA (Low-Rank Adaptation) is the dominant PEFT method: it inserts small trainable low-rank matrices alongside the original weights and trains only those. A LoRA adapter is a small fraction of the base model's size, often only a few megabytes (larger for higher ranks and bigger base models), and can be swapped in and out of a shared base model. Training takes hours of GPU time and typically costs 100 to 1,000 euros per run, putting it within reach of most finance IT teams.
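The following sketch shows why LoRA adapters are so small. For a frozen weight matrix W of shape d_out x d_in, LoRA trains two matrices, B (d_out x r) and A (r x d_in), and computes W x + (alpha / r) B A x. The layer size, rank, and alpha values below are illustrative assumptions, and the pure-Python matrix code is for clarity, not speed.

```python
def lora_param_counts(d_out, d_in, rank):
    """Return (full, lora) trainable parameter counts for one weight matrix."""
    full = d_out * d_in            # every entry of W is trainable
    lora = rank * (d_out + d_in)   # entries of B (d_out x r) plus A (r x d_in)
    return full, lora

def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of numbers)."""
    return [sum(a * b for a, b in zip(row, vector)) for row in matrix]

def lora_forward(W, A, B, x, alpha, rank):
    """Compute W x + (alpha / rank) * B (A x); W stays frozen."""
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    scale = alpha / rank
    return [b + scale * u for b, u in zip(base, update)]

# Illustrative sizes: one 4096 x 4096 projection adapted at rank 8.
full, lora = lora_param_counts(4096, 4096, 8)
print(full, lora, full // lora)  # 16777216 65536 256

# B is initialised to zero in LoRA, so before any training the adapted
# layer reproduces the base layer exactly.
W = [[1.0, 0.0], [0.0, 1.0]]
A = [[1.0, 1.0]]          # rank 1
B_zero = [[0.0], [0.0]]
print(lora_forward(W, A, B_zero, [2.0, 3.0], alpha=2, rank=1))  # [2.0, 3.0]
```

In this example the adapter trains 256 times fewer parameters than the full matrix, which is where the gap in training cost and artefact size comes from.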

Instruction tuning and RLHF (Reinforcement Learning from Human Feedback) align a model to human preferences across many tasks. This is what turns a raw language model into a chat assistant. It is done by foundation model providers and is rarely the right tool for an individual enterprise deployment.

Where fine-tuning adds value in finance

In AR and O2C, fine-tuning earns its place when the same narrow behaviour must be repeated thousands of times with high consistency. Useful examples include:

  • House terminology and style: always use "remit-to" rather than "billing address", always sign off draft dunning emails in a specific tone, always format customer names in title case.
  • Strict output formats: always return JSON with an exact set of fields, always classify into one of seven dispute reason codes, never include explanatory prose.
  • Classification accuracy on narrow categories: when your business has dispute reason codes, deduction types, or product-line distinctions that a general model cannot reliably tell apart from prompt instructions alone.
  • Customer-facing tone: drafting collections messages that consistently match brand voice across thousands of customers.

Notice what is missing: customer data, contract terms, current balances, dispute history. None of that should be fine-tuned in. It belongs in RAG.
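The strict-output-format requirement from the list above can be checked mechanically, whether the behaviour comes from a prompt or from a fine-tuned model. The sketch below validates one model response. The field names and the seven reason codes are hypothetical placeholders, not a standard taxonomy.

```python
import json

# Hypothetical schema: adjust to your own dispute workflow.
REQUIRED_FIELDS = {"invoice_id", "reason_code", "confidence"}
REASON_CODES = {
    "PRICING", "SHORT_SHIPMENT", "DAMAGED_GOODS", "DUPLICATE_INVOICE",
    "LATE_DELIVERY", "TAX", "OTHER",
}

def validate_output(raw):
    """Return a list of problems with one model response (empty means valid)."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        # Explanatory prose around the JSON lands here.
        return ["not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not an object"]
    errors = []
    if set(data) != REQUIRED_FIELDS:
        errors.append("fields do not match the required set")
    if data.get("reason_code") not in REASON_CODES:
        errors.append("reason_code is not one of the seven allowed codes")
    return errors

good = '{"invoice_id": "INV-001", "reason_code": "PRICING", "confidence": 0.92}'
chatty = 'Sure! Here is the classification: {"reason_code": "PRICING"}'
print(validate_output(good))    # []
print(validate_output(chatty))  # ['not valid JSON']
```

The share of responses that fail a check like this is a useful signal: if it stays high despite a careful system prompt, that is the kind of quality ceiling fine-tuning is meant to break.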

Fine-tune, RAG, or prompt engineering: a decision framework

The three techniques are not competitors. They solve different problems and are usually combined.

  • Use prompt engineering first. If the desired behaviour can be specified in fewer than 100 lines of instruction, a well-written system prompt on a capable model will usually deliver. It is the cheapest and most auditable option.
  • Use RAG when the model needs knowledge that changes over time, when you need source attribution for audit, or when access controls matter. Almost all finance knowledge falls here: customer master data, contracts, invoice history, payment behaviour, dispute notes.
  • Use fine-tuning when behaviour or style must be consistent across thousands of calls, the task is narrow and well-defined, and prompt engineering has hit a quality ceiling. Fine-tuning is also justified when token budgets explode because the system prompt has grown to thousands of lines.

For finance specifically, the order of preference is prompt engineering, then RAG, then fine-tuning. RAG is almost always preferred over fine-tuning because customer data changes constantly, audit requires explainability through citations, and baking data into model weights creates data residency complications under GDPR.
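As a rough illustration, the framework above can be written as a small function that returns techniques in order of preference. The parameter names are invented for this sketch, and the 1,000-line threshold is an assumed stand-in for "thousands of lines"; treat the result as a conversation starter rather than a rule.

```python
# Assumption: a stand-in for a system prompt that has grown to "thousands of lines".
LARGE_PROMPT_LINES = 1000

def recommend(
    *,
    instruction_lines,
    needs_changing_knowledge=False,
    needs_source_attribution=False,
    needs_access_controls=False,
    narrow_and_stable=False,
    prompt_quality_ceiling_hit=False,
):
    """Return the techniques to combine, in order of preference."""
    # Prompt engineering is always the starting point.
    techniques = ["prompt engineering"]
    # Knowledge that changes, must be cited, or must be access-controlled goes to RAG.
    if needs_changing_knowledge or needs_source_attribution or needs_access_controls:
        techniques.append("RAG")
    # Fine-tuning only for narrow, stable tasks where prompting has run out of road.
    prompt_too_large = instruction_lines >= LARGE_PROMPT_LINES
    if narrow_and_stable and (prompt_quality_ceiling_hit or prompt_too_large):
        techniques.append("fine-tuning")
    return techniques

# Answering questions about a customer's current balance.
print(recommend(instruction_lines=40, needs_changing_knowledge=True))
# Classifying disputes into a fixed taxonomy after prompting has plateaued.
print(recommend(instruction_lines=120, narrow_and_stable=True,
                prompt_quality_ceiling_hit=True))
```

The function never returns fine-tuning on its own, which mirrors the point that the techniques are combined, not traded off against each other.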

Costs, infrastructure, and common pitfalls

PEFT and LoRA training runs typically cost 100 to 1,000 euros in GPU time and complete in hours. Full fine-tuning costs 10,000 to 100,000 euros or more and can take days to weeks. Inference cost is often higher for fine-tuned models: API providers commonly charge a premium per token on custom-tuned endpoints, and self-hosted custom models carry ongoing GPU hosting cost.
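A back-of-the-envelope calculation helps weigh a per-token premium against the savings from a shorter system prompt. In the sketch below every number (call volume, token counts, prices, training cost) is a made-up assumption, and only input tokens are counted to keep the arithmetic simple.

```python
def monthly_inference_cost(calls, system_tokens, task_tokens, eur_per_million_tokens):
    """Input-token cost per month, ignoring output tokens for simplicity."""
    return calls * (system_tokens + task_tokens) * eur_per_million_tokens / 1_000_000

def break_even_months(training_cost, prompted_monthly, tuned_monthly):
    """Months until the training spend is recovered, or None if it never is."""
    saving = prompted_monthly - tuned_monthly
    if saving <= 0:
        return None
    return training_cost / saving

# Hypothetical scenario: a long system prompt on a base model versus a
# short prompt on a tuned endpoint that costs twice as much per token.
prompted = monthly_inference_cost(100_000, 6_000, 500, eur_per_million_tokens=1.0)
tuned = monthly_inference_cost(100_000, 200, 500, eur_per_million_tokens=2.0)
months = break_even_months(500, prompted, tuned)
print(prompted, tuned, months)
```

With a short system prompt the saving disappears and the function returns `None`, which is the usual case for prompts under 100 lines.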

The pitfalls that bite enterprise teams:

  • Training data leakage: private customer data baked into weights cannot be deleted on request. This is a serious data subject rights problem.
  • Catastrophic forgetting: the model loses general reasoning capability while learning the narrow task.
  • Overfitting: the model memorises training examples and fails on slightly different real-world inputs.
  • No evaluation harness: teams ship a fine-tuned model without a held-out test set and cannot prove it is better than the base model with a good prompt.
  • Drift: the fine-tuned model becomes stale as foundation models improve, and the team is reluctant to retrain.

Production guidance for AR and finance teams

Reserve fine-tuning for stable, narrow tasks: output format normalisation, classification into a fixed taxonomy, brand voice on drafted communications. Use RAG for everything involving current company-specific knowledge, which is most of what AR and O2C actually need.

Before any fine-tuning project, build an evaluation harness: a held-out set of representative inputs with expected outputs, scored by humans or a strong judge model. Without it you cannot tell whether fine-tuning helped, hurt, or did nothing.
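A minimal version of such a harness might look like the sketch below. The held-out examples, the reason codes, and the two stub functions standing in for the base and fine-tuned models are all invented for illustration; in practice each stub would be replaced by a call to the real model, and the held-out set would be far larger.

```python
def accuracy(predict, held_out):
    """Fraction of held-out examples where the prediction matches the label."""
    correct = sum(1 for text, expected in held_out if predict(text) == expected)
    return correct / len(held_out)

def compare_models(base_predict, tuned_predict, held_out, min_gain=0.05):
    """Score both models on the same data and decide whether tuning paid off."""
    base = accuracy(base_predict, held_out)
    tuned = accuracy(tuned_predict, held_out)
    return {"base": base, "tuned": tuned, "ship_tuned": tuned - base >= min_gain}

# Hypothetical held-out set: never used for training.
HELD_OUT = [
    ("Invoice price does not match the quote", "PRICING"),
    ("Only 8 of 10 pallets arrived", "SHORT_SHIPMENT"),
    ("We were billed twice for the same order", "DUPLICATE_INVOICE"),
    ("Goods arrived broken", "DAMAGED_GOODS"),
]

def base_stub(text):
    """Stand-in for the base model with a good prompt."""
    return "PRICING"

def tuned_stub(text):
    """Stand-in for the fine-tuned model."""
    lowered = text.lower()
    if "price" in lowered:
        return "PRICING"
    if "pallets" in lowered:
        return "SHORT_SHIPMENT"
    if "twice" in lowered:
        return "DUPLICATE_INVOICE"
    return "OTHER"

report = compare_models(base_stub, tuned_stub, HELD_OUT)
print(report)
```

The `min_gain` threshold is a judgment call: it encodes how much improvement justifies the extra hosting and maintenance that a fine-tuned model brings.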

Strip personally identifiable information and customer-specific data from training datasets. If a piece of information might need to be deleted, updated, or access-controlled later, it does not belong in weights.

Treat the fine-tuned adapter as a versioned artefact: train, evaluate, deploy, monitor, and retrain on a schedule as base models evolve.

For most AR teams, an AI-native platform with strong RAG and prompt engineering will deliver 90 percent of the value of fine-tuning without the infrastructure burden or audit risk.
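The data-hygiene rule in this section can be partly automated as a preprocessing step. The sketch below redacts email addresses and IBAN-like strings from training examples. The two patterns and the prompt/completion record layout are assumptions, and pattern matching alone will miss names, addresses, and free-text identifiers, so it should complement a proper PII review and not replace one.

```python
import re

# Illustrative patterns only: real datasets need broader coverage.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
IBAN = re.compile(r"\b[A-Z]{2}\d{2}[A-Z0-9]{11,30}\b")

def redact(text):
    """Replace email addresses and IBAN-like strings with placeholders."""
    text = EMAIL.sub("[EMAIL]", text)
    return IBAN.sub("[IBAN]", text)

def clean_dataset(examples):
    """Redact both sides of each prompt/completion training pair."""
    return [
        {"prompt": redact(ex["prompt"]), "completion": redact(ex["completion"])}
        for ex in examples
    ]

raw_examples = [
    {
        "prompt": "Draft a reminder to jane.doe@example.com about the overdue invoice.",
        "completion": "Please remit payment to IBAN DE89370400440532013000.",
    }
]
print(clean_dataset(raw_examples))
```

Placeholders such as `[EMAIL]` also teach the model the shape of the output without teaching it any real value, which fits the rule that deletable data stays out of weights.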

Frequently asked questions

What is the difference between fine-tuning and RAG?

Fine-tuning changes the model's weights by training it on examples, so the model behaves differently afterwards. RAG leaves the weights untouched and instead retrieves relevant documents at runtime to add to the prompt. Fine-tuning shapes how the model responds. RAG controls what knowledge it has access to. In finance, RAG is preferred for anything involving customer data, contracts, or balances because the data changes constantly and must be auditable.

When should a finance team fine-tune a model instead of using prompt engineering?

Fine-tune when the desired behaviour must be highly consistent across thousands of calls, the task is narrow and well-defined, and the system prompt has grown so large it is hurting cost or latency. If the behaviour can be specified in under 100 lines of instruction on a capable foundation model, prompt engineering is almost always cheaper, faster to iterate on, and easier to audit.

What is LoRA and why does it matter?

LoRA stands for Low-Rank Adaptation. It is a Parameter-Efficient Fine-Tuning method that trains a small set of additional low-rank matrices alongside the frozen base model rather than updating all weights. A LoRA adapter is often only a few megabytes (larger for higher ranks and bigger base models), typically costs 100 to 1,000 euros to train, and can be swapped in and out of a shared base model. It has made fine-tuning practical for enterprise teams that previously could not afford full fine-tuning.

How much does fine-tuning cost?

PEFT and LoRA runs typically cost 100 to 1,000 euros in GPU time per training cycle and complete in hours. Full fine-tuning of a large model can cost 10,000 to 100,000 euros or more and take days to weeks. Inference is often charged at a higher per-token rate for custom-tuned models, and self-hosted custom checkpoints add ongoing GPU hosting cost.

Can fine-tuning expose private customer data?

Yes. Once data is baked into model weights it cannot be selectively deleted, which creates serious problems for GDPR data subject rights and for data residency. Any training dataset must be stripped of personally identifiable information and customer-specific records. If a piece of information might ever need to be deleted, updated, or access-controlled, it does not belong in weights. Use RAG for that data instead.

Does fine-tuning replace the need for RAG in an AR platform?

No. They solve different problems and are usually combined. Fine-tuning shapes consistent behaviour and style. RAG supplies the current customer, invoice, contract, and dispute knowledge the model needs to be useful on a specific account. An AI-native AR platform will typically rely heavily on RAG for company-specific knowledge and use fine-tuning sparingly for stable tasks like output format and brand voice.
