Transformers look scary, but they are mostly matrix multiplication! Here are the must-understand linear algebra concepts and non-linear functions behind how LLMs work.
In a nutshell, modern deep learning models repeatedly apply just three kinds of operations:
When we read about the billions of parameters in a model, we are talking about its weights and biases, which are stored as a large collection of matrices and vectors.
Before the attention mechanism, an LLM first turns each token into a vector. Then most of the hard work the model performs is multiplying matrices with other matrices and vectors, which, at its core, means computing many dot products.
Sound confusing? Need a refresher? The key concepts you must understand: vectors, matrices, and dot products. Just high school math!
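To make this concrete, here is a minimal NumPy sketch (with made-up toy numbers) showing that a matrix-vector product is nothing more than one dot product per row of the matrix:

```python
import numpy as np

# A toy 3-dimensional "token embedding" and a 3x3 weight matrix.
# The values are illustrative only, not from any real model.
x = np.array([1.0, 2.0, 3.0])
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])

# Matrix-vector multiplication in one call...
y = W @ x

# ...is exactly the same as taking the dot product of x
# with each row of W, one row at a time.
row_dots = np.array([np.dot(row, x) for row in W])
```

Stacks of these products, with billions of learned weights, are where most of an LLM's compute goes.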
In each layer of the network, between matrix multiplication steps, the model applies an activation function to introduce non-linearity. Without this step, stacking multiple layers would be pointless: the whole stack would collapse into one big linear transformation. The most common choices are the softmax, ReLU, sigmoid, and tanh functions.
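The four activation functions named above can all be written in a few lines of NumPy (a simple sketch, not an optimized implementation):

```python
import numpy as np

def relu(z):
    # ReLU: zero out negative values, keep positive ones unchanged.
    return np.maximum(0.0, z)

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the max first for numerical stability,
    # then exponentiate and normalize so the outputs sum to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([-1.0, 0.0, 2.0])
# tanh is built in: np.tanh(z) squashes values into (-1, 1).
t = np.tanh(z)
```

Softmax is special: it turns a vector of raw scores into a probability distribution, which is exactly what attention weights and next-token predictions need.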
Lastly, some steps perform simple statistics. In language models, we often normalize vectors (for example, dividing a vector by its length so it becomes a unit vector), or compute a min or max (e.g., max pooling in vision models).
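Both of these statistical steps fit in a couple of lines; here is a small sketch of unit-length normalization and a max, again with toy numbers:

```python
import numpy as np

v = np.array([3.0, 4.0])

# Normalize to unit length: divide by the vector's L2 norm (here, 5).
unit = v / np.linalg.norm(v)

# Simple statistics: the max, as used in max pooling.
m = v.max()
```

Real LLMs use close relatives of this idea, such as LayerNorm and RMSNorm, which divide activations by a statistic computed over the vector.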
Once you know all this, want to see how these three operations combine in real architectures? Here are the best videos explaining the mathematics of LLMs.
When you understand these three building blocks, attention and transformers stop feeling like magic. Which part of the math felt most abstract when you first learned it?