Transformer

A Transformer is a deep neural network architecture, introduced by Google researchers in 2017, that uses self-attention to process sequences in parallel. It is the foundation of nearly every modern large language model and of many vision and multimodal AI systems.

Key Takeaways

  • Transformers are an AI architecture (not electrical transformers) introduced in the 2017 paper Attention Is All You Need.
  • Self-attention lets the model weigh every token against every other token in parallel, replacing the slower recurrence used in RNNs and LSTMs.
  • Three main variants exist: encoder-only (BERT) for understanding, decoder-only (GPT, Claude) for generation, and encoder-decoder (T5) for translation-style tasks.
  • Transformers now power language, vision, audio, code, and time-series models because they scale predictably with data and compute.
  • In AR and O2C, transformer-based models read remittances and invoices, classify customer emails, and forecast payment timing across customer portfolios.

What a Transformer is and the 2017 origin

A Transformer is a deep neural network architecture for processing sequences of data: words, image patches, time-series points, or audio frames. It was introduced in the 2017 paper Attention Is All You Need by researchers at Google Brain and Google Research. The paper showed that a model built entirely on a mechanism called self-attention could outperform the recurrent neural networks (RNNs and LSTMs) that dominated sequence modeling at the time, while training far faster on modern hardware.

To be clear: this is an AI architecture, not an electrical transformer. The name reflects how the model transforms input sequences into output representations through layered attention. Nearly every modern AI system finance and IT leaders hear about (GPT, Claude, Gemini, Llama, BERT, Whisper, Stable Diffusion, CLIP) is a Transformer or builds on Transformer components.

The self-attention breakthrough

The central innovation is self-attention. For every token in a sequence, the model computes how much that token should attend to every other token, then mixes their representations accordingly. This lets the model capture long-range dependencies, for example linking a customer name at the top of an email to a payment amount at the bottom, in a single step.
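The score-then-mix step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not production code: the token vectors are made-up numbers, the function names are hypothetical, and the raw vectors stand in for the learned query, key, and value projections that a real Transformer would apply first.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this token against every token in the sequence.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Mix the value vectors according to the attention weights.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three toy 2-dimensional token vectors (illustrative numbers only).
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, tokens, tokens)
```

Each row of `weights` says how strongly one token attends to every token in the sequence, and each row of `outputs` is the resulting blend. The loop over queries is written out for readability; on a GPU the same arithmetic runs as a few matrix multiplications over all tokens at once.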

Before Transformers, RNNs processed sequences one element at a time, which was slow and lost information over long passages. Self-attention is fully parallel: a GPU can compute attention across all tokens at once. That parallelism is what made it practical to train models on trillions of tokens and grow them to hundreds of billions of parameters, which in turn unlocked the capabilities we now call LLMs.

Core architecture components

A Transformer is built from a small set of repeating pieces:

  • Tokenization: input text or data is split into tokens (subwords, image patches, or time steps).
  • Embeddings: each token is mapped to a vector that captures its meaning.
  • Positional encoding: because attention is order-agnostic, position information is added so the model knows token order.
  • Multi-head self-attention: multiple attention heads run in parallel, each learning different relationships (syntax, entities, numeric context).
  • Feed-forward layers: per-token neural networks that transform the attended representations.
  • Layer normalization and residual connections: engineering tricks that keep deep stacks of these blocks stable during training.

These blocks are stacked dozens or hundreds of times. Increasing the number of stacked layers, the number of attention heads, and the embedding width is what scales a Transformer from a small classifier to a frontier model.
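Of the components listed above, positional encoding is the easiest to show concretely. The sketch below uses the sinusoidal scheme from the original 2017 paper; many later models use learned or rotary position schemes instead, so treat it as one example rather than the standard. The function names and the toy embedding are assumptions made for illustration.

```python
import math

def sinusoidal_positions(seq_len, d_model):
    """Return a seq_len x d_model table of sinusoidal position vectors."""
    table = []
    for pos in range(seq_len):
        row = []
        for i in range(d_model):
            # Each pair of dimensions (2k, 2k+1) shares one frequency.
            angle = pos / (10000 ** ((i - i % 2) / d_model))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table

def add_positions(embeddings):
    """Add position vectors element-wise to token embeddings."""
    table = sinusoidal_positions(len(embeddings), len(embeddings[0]))
    return [[e + p for e, p in zip(emb, pos)]
            for emb, pos in zip(embeddings, table)]

# The same toy embedding placed at two positions ends up with different vectors.
same_token = [0.5, 0.5, 0.5, 0.5]
encoded = add_positions([same_token, same_token])
```

Because the two rows of `encoded` differ, the attention layers can tell the first occurrence of a token from the second even though attention itself ignores order.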

Variants: encoder-only, decoder-only, encoder-decoder

Three architectural variants dominate practical use:

  • Encoder-only (BERT, RoBERTa): reads the full input and produces rich representations. Best for classification, search, and embedding generation.
  • Decoder-only (GPT-4, Claude, Llama): generates tokens one at a time, attending only to previous tokens. The foundation of modern chat and generative AI.
  • Encoder-decoder (T5, the original Transformer): encodes an input sequence and decodes an output sequence. Strong for translation, summarization, and structured rewriting.
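At the attention level, the main difference between encoder-style and decoder-style models is which tokens each position is allowed to see. A minimal sketch, using a hypothetical helper named `attention_mask`: an encoder lets every token see every other token, while a decoder restricts each token to itself and the tokens before it.

```python
def attention_mask(seq_len, causal):
    """mask[i][j] is True when token i is allowed to attend to token j."""
    return [[(j <= i) if causal else True for j in range(seq_len)]
            for i in range(seq_len)]

encoder_mask = attention_mask(4, causal=False)  # every token sees every token
decoder_mask = attention_mask(4, causal=True)   # each token sees itself and earlier tokens

for row in decoder_mask:
    print("".join("x" if allowed else "." for allowed in row))
# prints:
# x...
# xx..
# xxx.
# xxxx
```

The triangular pattern is what lets a decoder generate text one token at a time: no position can depend on tokens that have not been produced yet.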

The same core architecture also powers Vision Transformers (ViT) for images, Whisper for speech, time-series transformers for forecasting, and multimodal models like CLIP, GPT-4o, and Claude that combine text with images. The architecture is general-purpose; the training data and objective determine the modality.

Why this matters for Order to Cash and Cash Flow Forecasting

Transformers are not an academic curiosity for finance teams; they are the engine inside almost every modern AR automation feature. In Order to Cash and Cash Flow Forecasting specifically, transformer-based models do four jobs that older rules-based systems could not do reliably.

  • Document understanding: Vision-Language Transformers read remittance advices, invoices, and bank statements directly from PDF or image, extracting amounts, invoice numbers, and customer references without brittle templates.
  • Semantic classification: encoder-style transformers route customer emails into categories such as dispute, payment confirmation, or copy-invoice request, even when the wording is unusual.
  • Time-series forecasting: transformer-based forecasters predict when each customer will pay, rolling up to portfolio-level cash flow forecasts that adapt as new payments arrive.
  • Dispute reasoning and summarization: decoder-style LLMs synthesize long email threads, ERP notes, and aging detail into a recommended dispute resolution, work that previously required a human analyst to read everything from scratch.

Transformance.ai uses transformer-based models across document understanding, semantic classification, and time-series forecasting in AR workflows.

Limitations and recent developments

Transformers have well-known limitations. The biggest is the quadratic cost of attention: doubling the context length roughly quadruples the compute and memory that the attention step requires. That is why very long documents, say a full year of customer correspondence, used to be hard to feed in directly.
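The quadratic growth is easy to check with arithmetic. The sketch below counts only the token-to-token scores in one attention head of one layer under full self-attention; it ignores every other cost in the model, and the function name is hypothetical.

```python
def attention_score_entries(context_length):
    """Token-to-token scores per head per layer in full self-attention."""
    return context_length * context_length

baseline = attention_score_entries(1_000)
for n in (1_000, 2_000, 4_000, 8_000):
    entries = attention_score_entries(n)
    print(f"{n} tokens: {entries:,} scores, {entries // baseline}x the baseline")
```

Each doubling of the context multiplies the count by four, so an 8,000-token input needs 64 times as many scores as a 1,000-token input.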

Recent research addresses this in several ways: sparse attention variants such as Longformer and BigBird, linear-attention approximations, and memory-efficient implementations of exact attention such as FlashAttention, which reduces memory traffic rather than changing the quadratic computation. Non-transformer alternatives, such as state-space models like Mamba, promise linear scaling with sequence length. Mixture-of-experts transformers reduce inference cost by activating only a fraction of parameters per token. In practice, frontier models in 2026 still use the transformer recipe at their core, augmented with these efficiency techniques. For finance and IT leaders, the practical takeaway is that the architecture is stable enough to build on: vendors that depend on transformers today are not betting on a fad.
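To make the mixture-of-experts idea concrete, here is a minimal sketch of top-k routing for a single token. It shows one common form of gating (a softmax over the selected experts) and is not a description of any particular model. The experts are toy functions, and the router scores are made-up numbers that a real model would learn.

```python
import math

def route_token(router_scores, top_k=2):
    """Pick the top_k experts for one token and normalise their weights."""
    ranked = sorted(range(len(router_scores)),
                    key=lambda i: router_scores[i], reverse=True)
    chosen = ranked[:top_k]
    exps = [math.exp(router_scores[i]) for i in chosen]
    total = sum(exps)
    return {i: e / total for i, e in zip(chosen, exps)}

def moe_layer(token_value, router_scores, experts, top_k=2):
    """Run only the chosen experts and blend their outputs."""
    gates = route_token(router_scores, top_k)
    return sum(weight * experts[i](token_value) for i, weight in gates.items())

# Four hypothetical experts; only two of them run for this token.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: x - 3, lambda x: x * x]
scores = [0.1, 2.0, -1.0, 2.0]
gates = route_token(scores, top_k=2)
output = moe_layer(3.0, scores, experts, top_k=2)
```

Here experts 1 and 3 tie for the highest score, so each gets a weight of 0.5 and the other two experts are never evaluated. That skipped work is where the inference saving comes from.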

Frequently asked questions

Is a Transformer the same as an LLM?

No. A Transformer is the underlying neural network architecture. An LLM is a large language model, most commonly a decoder-only Transformer trained on huge amounts of text. Nearly all modern LLMs are Transformers, but not all Transformers are LLMs. Vision Transformers, time-series Transformers, and BERT-style classifiers are also Transformers.

Who invented the Transformer architecture?

Researchers at Google Brain and Google Research published the architecture in the 2017 paper Attention Is All You Need. The eight authors, including Ashish Vaswani, Noam Shazeer, and Aidan Gomez, designed it originally for machine translation. The architecture has since become the foundation of nearly every major AI system in production today.

Why do Transformers need so much compute?

Self-attention compares every token to every other token, so its cost scales quadratically with sequence length. Training a frontier model also means processing up to trillions of tokens through as many as hundreds of billions of parameters. The compensating advantage is parallelism: unlike RNNs, every position in a training sequence can be computed at once on a GPU, which is what made large-scale pretraining possible in the first place.

Are Transformers used outside of language tasks?

Yes, extensively. Vision Transformers (ViT) classify images, Whisper transcribes speech, AlphaFold-style models predict protein structures, time-series Transformers forecast demand and payment timing, and multimodal models like Claude and GPT-4o process text and images, and in some cases audio, in a single architecture. The same core mechanism, self-attention, works across modalities.

What is the difference between encoder-only and decoder-only Transformers?

Encoder-only models like BERT read the entire input at once and produce representations used for classification, search, or embeddings. Decoder-only models like GPT and Claude generate tokens one at a time, attending only to previous tokens, which makes them suited for chat, writing, and reasoning. Encoder-decoder models like T5 combine both for translation-style tasks.

How do Transformers help an AR or finance team in practice?

Transformer-based models extract structured data from remittances and invoices without templates, classify inbound customer emails by intent, summarize dispute threads into recommended actions, and forecast when each customer will pay so that cash flow forecasts update automatically. The result is fewer manual touches per invoice and a forecast that reflects current customer behaviour, not last quarter's averages.
