Understanding the Transformer: The Architecture That Rewired AI
A deep dive into why Transformers work, not just how they work.

It's 2017. A team at Google publishes a paper with perhaps the most casually ambitious title in ML history: "Attention Is All You Need." Eight years later, we're living in a world shaped by that architecture — from GPT to BERT to Stable Diffusion. The Transformer didn't just improve on existing methods; it fundamentally changed what we thought neural networks could do with sequences.
I've spent a lot of time thinking about why Transformers work so well, and I want to share my understanding — not as a textbook walkthrough, but as a story about design decisions and the intuitions behind them.
The Problem Before Transformers
To appreciate the Transformer, you have to feel the pain of what came before it.
RNNs and LSTMs process sequences one token at a time, left to right. This is elegant in its simplicity, but it has two devastating consequences. First, it's inherently sequential — you can't parallelize training across time steps, which makes scaling painful. Second, information from early tokens has to survive a long chain of operations to influence later ones. Even with gating mechanisms like LSTMs, long-range dependencies remain fragile. The gradient either vanishes or explodes, and the model quietly forgets.
Attention mechanisms were bolted onto RNNs as a fix: let the decoder "look back" at all encoder states instead of relying on a single compressed vector. This worked remarkably well for machine translation. But the RNN backbone was still there, still sequential, still the bottleneck.
The Transformer asked a radical question: what if we just removed the recurrence entirely?
Self-Attention: The Core Mechanism
The heart of the Transformer is self-attention, and I think the best way to understand it is through the lens of information retrieval.
Given a sequence of tokens, each token generates three vectors: a Query (Q), a Key (K), and a Value (V). The analogy is surprisingly literal — it's a soft dictionary lookup. Each token's query asks "what information is relevant to me?", each token's key advertises "here's what I contain", and the value is the actual content that gets retrieved.
The attention score between two tokens is the dot product of their query and key vectors, scaled by 1/√d_k to prevent softmax saturation, then normalized via softmax to produce weights over all values:

Attention(Q, K, V) = softmax(QKᵀ / √d_k) V
What makes this powerful is that every token attends to every other token simultaneously. There is no notion of distance. Token 1 and token 500 interact just as directly as token 1 and token 2. This is a fundamentally different inductive bias from convolutions (local) or recurrence (sequential).
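To make the retrieval analogy concrete, here is a minimal NumPy sketch of scaled dot-product self-attention. All names and shapes are illustrative, not taken from any particular library:

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row max for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, W_q, W_k, W_v):
    """Scaled dot-product self-attention over a sequence X of shape (n, d_model)."""
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) pairwise relevance
    weights = softmax(scores, axis=-1)   # each row sums to 1: a soft lookup
    return weights @ V                   # weighted retrieval of the values

rng = np.random.default_rng(0)
n, d_model, d_k = 4, 8, 8
X = rng.normal(size=(n, d_model))
W_q, W_k, W_v = (rng.normal(size=(d_model, d_k)) for _ in range(3))
out = self_attention(X, W_q, W_k, W_v)
print(out.shape)  # (4, 8)
```

Each output row is a convex combination of all value vectors, which is exactly the "soft dictionary lookup" described above.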
Multi-Head Attention: Parallel Subspaces
A single attention head can only capture one "type" of relationship at a time. Multi-head attention runs several attention heads in parallel, each with its own learned Q/K/V projections, then concatenates and projects the results. Different heads learn to attend to different things — one might track syntactic dependencies, another might capture semantic similarity, another might handle coreference.
This is one of those design choices that seems obvious in retrospect but was genuinely clever. It gives the model a rich, multi-faceted view of how tokens relate to each other, without increasing computational complexity proportionally.
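A minimal sketch of the idea, with each head getting d_model / n_heads dimensions so the total cost stays comparable to a single full-width head. Real implementations fuse all heads into one batched matmul; this loop form is just easier to read:

```python
import numpy as np

def attention(X, W_q, W_k, W_v):
    # Plain scaled dot-product attention for one head.
    Q, K, V = X @ W_q, X @ W_k, X @ W_v
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    s = np.exp(s - s.max(axis=-1, keepdims=True))
    return (s / s.sum(axis=-1, keepdims=True)) @ V

def multi_head(X, heads, W_o):
    # Each head attends in its own learned subspace; the outputs are
    # concatenated back to d_model and mixed by the output projection W_o.
    return np.concatenate([attention(X, *h) for h in heads], axis=-1) @ W_o

rng = np.random.default_rng(1)
n, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads   # smaller per-head dim keeps total cost flat
heads = [tuple(rng.normal(size=(d_model, d_head)) for _ in range(3))
         for _ in range(n_heads)]
W_o = rng.normal(size=(d_model, d_model))
out = multi_head(rng.normal(size=(n, d_model)), heads, W_o)
print(out.shape)  # (5, 16)
```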
Positional Encoding: Teaching Order
Here's the catch with self-attention: on its own, it's permutation-equivariant. Shuffle the input tokens and the outputs simply shuffle with them. The attention mechanism itself has no concept of "left" or "right", "before" or "after"; any notion of order has to come from the positional encodings.
The original paper used sinusoidal positional encodings — fixed functions of position at different frequencies. The intuition is that each dimension of the encoding oscillates at a different rate, creating a unique "fingerprint" for each position. More importantly, relative positions can be expressed as linear transformations of the encodings, which lets the model learn to attend by relative distance.
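The sinusoidal scheme can be written down directly from the paper's formulas (PE[pos, 2i] = sin(pos / 10000^(2i/d)), PE[pos, 2i+1] = cos of the same angle). A small sketch that builds the encoding table:

```python
import numpy as np

def sinusoidal_encoding(n_positions, d_model):
    """Fixed sinusoidal positional encodings from 'Attention Is All You Need'.
    Each dimension pair oscillates at its own frequency, giving every
    position a unique fingerprint."""
    pos = np.arange(n_positions)[:, None]          # (n_positions, 1)
    i = np.arange(d_model // 2)[None, :]           # (1, d_model/2)
    angles = pos / (10000 ** (2 * i / d_model))    # one frequency per pair
    pe = np.zeros((n_positions, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dims: sine
    pe[:, 1::2] = np.cos(angles)   # odd dims: cosine
    return pe

pe = sinusoidal_encoding(128, 64)
print(pe.shape)  # (128, 64)
```

Because each (sin, cos) pair lies on a unit circle, a shift in position acts as a rotation of that pair, which is the linear-transformation property mentioned above.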
Modern architectures have moved to Rotary Position Embeddings (RoPE), which encode relative position directly into the attention computation through rotation matrices. This turns out to be much more natural and extends better to long sequences — it's what most current LLMs use.
The broader lesson here is interesting: Transformers don't have a built-in notion of sequence order. They have to be told about position, and how you tell them matters enormously.
The Feed-Forward Network: The Memory Bank
Between attention layers sits a position-wise feed-forward network (FFN) — two linear transformations with a nonlinearity in between. This component is often overlooked, but I think it's quietly one of the most important parts of the architecture.
Recent interpretability research suggests that the FFN layers function as a kind of key-value memory. The first linear layer's weights act as keys that match input patterns, and the second layer's weights store the associated values (knowledge). When a pattern matches a key strongly, the corresponding value gets written into the residual stream.
This means the FFN layers are where the model stores factual knowledge — "Paris is the capital of France", "water boils at 100°C", and so on. The attention layers route and compose information; the FFN layers retrieve and inject it.
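The key-value reading maps directly onto the shape of the computation. A sketch with that framing in the comments (the interpretation is a hypothesis from the interpretability literature, not something the code itself proves):

```python
import numpy as np

def ffn(x, W_in, b_in, W_out, b_out):
    """Position-wise FFN: two linear maps with a ReLU in between.
    Key-value reading: columns of W_in act as 'keys' matched against x;
    the activation pattern h selects rows of W_out (the 'values') that
    get added back into the residual stream."""
    h = np.maximum(0, x @ W_in + b_in)   # which keys fired, and how strongly
    return h @ W_out + b_out             # weighted sum of the stored values

rng = np.random.default_rng(2)
d_model, d_ff, n = 8, 32, 3
W_in, b_in = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W_out, b_out = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
X = rng.normal(size=(n, d_model))
Y = ffn(X, W_in, b_in, W_out, b_out)
# "Position-wise" means each token is transformed independently:
assert np.allclose(Y[0], ffn(X[0], W_in, b_in, W_out, b_out))
print(Y.shape)  # (3, 8)
```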
The Residual Stream: A Communication Bus
One of the most elegant aspects of the Transformer is its use of residual connections. Each sub-layer (attention or FFN) doesn't replace the representation — it adds to it. The residual stream flows through the entire network, and each layer reads from it and writes to it.
The stream acts as a shared communication bus: early layers might write syntactic features, middle layers might add semantic information, and later layers might refine the output distribution. The final representation is the sum of all these contributions.
This design also has a practical benefit: it dramatically eases gradient flow. Gradients can travel directly from the loss to any layer through the residual connection, which is why Transformers can be trained with dozens or even hundreds of layers.
Layer Normalization: The Unsung Stabilizer
Layer normalization appears twice in each Transformer block (in the Pre-LN variant that most modern models use). It normalizes the activations to have zero mean and unit variance, then applies a learned affine transformation.
This seems like a minor implementation detail, but it's actually critical for training stability. Without it, the residual stream accumulates values that grow unboundedly with depth, making optimization unstable. Pre-LN (normalizing before the sub-layer) turns out to work better than Post-LN (the original paper's approach) for deep models, which is why most modern architectures use it.
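Putting the last two sections together, a Pre-LN block is only a few lines: normalize, apply the sub-layer, add the result back to the stream. The stand-in sub-layers below are placeholders just to show the wiring:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    # Zero mean, unit variance over the feature dimension,
    # then a learned affine transformation (gamma, beta).
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

def pre_ln_block(x, attn, ffn, p):
    # Pre-LN: normalize *before* each sub-layer, then add its output to
    # the residual stream. Sub-layers write to the stream, never replace it.
    x = x + attn(layer_norm(x, p["g1"], p["b1"]))
    x = x + ffn(layer_norm(x, p["g2"], p["b2"]))
    return x

rng = np.random.default_rng(3)
x = rng.normal(size=(4, 8))
p = {"g1": np.ones(8), "b1": np.zeros(8),
     "g2": np.ones(8), "b2": np.zeros(8)}
# Toy stand-ins for the attention and FFN sub-layers.
out = pre_ln_block(x, lambda h: 0.1 * h, lambda h: 0.1 * h, p)
print(out.shape)  # (4, 8)
```

Note that the identity path from input to output never passes through a normalization, which is exactly why gradients flow so cleanly in the Pre-LN arrangement.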
Decoder-Only: The Simplification That Won
The original Transformer had an encoder-decoder structure designed for sequence-to-sequence tasks like translation. But the most impactful models today — GPT, LLaMA, Claude — use a decoder-only architecture.
The key modification is causal masking: each token can only attend to itself and previous tokens, not future ones. This turns the model into an autoregressive language model that predicts the next token given all previous tokens.
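In practice, causal masking means setting every future-position score to -inf before the softmax, which zeroes those weights exactly. A small sketch:

```python
import numpy as np

def causal_attention_weights(scores):
    """Mask future positions so token i can only attend to tokens 0..i.
    Scores set to -inf become exactly 0 after the softmax."""
    n = scores.shape[-1]
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)  # True above the diagonal
    scores = np.where(mask, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    e = np.exp(scores)
    return e / e.sum(axis=-1, keepdims=True)

w = causal_attention_weights(np.random.default_rng(4).normal(size=(5, 5)))
print(np.round(w, 2))
# The weight matrix is lower-triangular: row i puts zero weight on every j > i,
# and the first token can only attend to itself.
```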
Why did this win? A few reasons. It's simpler (one stack instead of two). It unifies all tasks under the "predict the next token" framework — translation, summarization, question answering, and even reasoning can all be framed as continuation. And scaling laws seem to favor pouring all parameters into a single deep stack rather than splitting them between encoder and decoder.
Why Transformers Scale
Perhaps the most profound thing about Transformers is that they scale. Double the parameters, double the data, and performance improves predictably along smooth power-law curves. This was not a given — most architectures hit walls.
I think several properties contribute to this:
The architecture is highly parallelizable. Unlike RNNs, all positions can be computed simultaneously during training, which maps perfectly onto modern GPU architectures.
The self-attention mechanism is expressive enough to learn increasingly complex patterns as you add more layers and heads, without running into the representation bottlenecks that plague shallower architectures.
The residual stream and layer normalization create a well-conditioned optimization landscape. Gradients flow cleanly, and the loss surface is smooth enough for large learning rates and massive batch sizes.
And perhaps most importantly, the next-token prediction objective is infinitely scalable in terms of data. The entire internet is a training set.
The Limitations We're Still Working On
Transformers aren't perfect, and their limitations reveal interesting things about the architecture.
Quadratic attention complexity is the most obvious one. Self-attention computes all pairwise interactions, which scales as O(n²) in sequence length. This is why context windows were initially limited to 512 or 2048 tokens. There's been enormous progress here — sparse attention, linear attention, ring attention, and clever KV-cache compression schemes — but the fundamental tension between expressiveness and efficiency remains.
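A back-of-the-envelope sketch of why the quadratic term bites: the n×n score matrix alone (per head, per layer, stored in fp16) quadruples every time the context doubles:

```python
# Rough memory for one attention score matrix at fp16 (2 bytes per entry).
# Doubling the context length quadruples this cost.
for n in [512, 2048, 8192, 32768]:
    matrix_bytes = n * n * 2   # one head, one layer
    print(f"n={n:6d}: {matrix_bytes / 2**20:8.1f} MiB per head-layer")
# n=   512:      0.5 MiB per head-layer
# n=  2048:      8.0 MiB per head-layer
# n=  8192:    128.0 MiB per head-layer
# n= 32768:   2048.0 MiB per head-layer
```

Multiply by dozens of heads and layers and it is clear why naive attention hits a wall long before the rest of the model does.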
Reasoning and planning is another frontier. Transformers are fundamentally one-pass computation machines (ignoring chain-of-thought). Each token gets a fixed number of FLOPs regardless of problem difficulty. For tasks that require variable-depth computation — like multi-step logical deduction or search — this is a real constraint. Chain-of-thought prompting is essentially a hack to turn depth into length, letting the model "think" by generating intermediate tokens.
Lack of explicit memory and state means Transformers can't easily update their beliefs based on new information without retraining. This is why retrieval-augmented generation (RAG) and tool use have become so important — they're prosthetic solutions for a missing capability.
Closing Thoughts
What impresses me most about the Transformer isn't any single component — it's how the components compose. Attention routes information. FFNs store and retrieve knowledge. Residual connections create a shared workspace. Layer normalization keeps everything stable. Positional encodings inject structure. Together, they create something far greater than the sum of their parts.
The Transformer is also a reminder that inductive biases matter, but less than you think at scale. RNNs had strong sequential inductive biases that matched human intuitions about language, but couldn't scale. Transformers have weaker inductive biases — they have to learn even basic things like word order — but their scalability let them blow past everything else.
We're probably not at the end of architecture innovation. But the Transformer has set an extraordinarily high bar. Whatever comes next will have to do something truly different — not just incrementally better.
If you're interested in going deeper, I'd recommend reading the original "Attention Is All You Need" paper, Anthropic's work on mechanistic interpretability (especially the residual stream perspective), and the scaling laws papers from Kaplan et al. They changed how I think about all of this.