
The Math Behind Transformers

Transformers look scary, but they are mostly matrix multiplication! Here are the must-understand linear algebra concepts and non-linear functions behind how LLMs work.

In a nutshell, modern deep learning models consist of repeatedly applying the following operations:

  1. Matrix multiplication (Linear Algebra)
  2. Activation function (Non-linear transformation)
  3. Normalization/pooling (Simple Statistics)
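The three steps above can be sketched in a few lines of NumPy. This is a toy, hand-picked example (the names `W`, `b`, and `x` and all the numbers are made up for illustration, not taken from any real model):

```python
import numpy as np

x = np.array([1.0, 2.0, -1.0])        # input vector (e.g., a token embedding)
W = np.array([[0.5, -0.2, 0.1],       # weight matrix (learned parameters)
              [0.3,  0.8, -0.5]])
b = np.array([0.1, -0.1])             # bias vector (learned parameters)

# 1. Matrix multiplication (linear algebra)
h = W @ x + b

# 2. Activation function (non-linear transformation), here ReLU
h = np.maximum(h, 0)

# 3. Normalization (simple statistics): scale the vector to length 1
y = h / np.linalg.norm(h)
```

Real layers are much wider and are stacked dozens of times, but each one is built from exactly these three moves.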

Linear Algebra

When we read about the billions of parameters a model consists of, we are talking about weights and biases, stored as a large collection of matrices and vectors.

Before the attention mechanism, an LLM first turns each token into a vector. Then most of the hard work that an AI performs is multiplying matrices with other matrices and vectors, which, at its core, performs many dot products.
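To see the "many dot products" claim concretely, here is a tiny matrix-vector product computed two ways: once with `@`, and once row by row as explicit dot products (the numbers are arbitrary, purely for illustration):

```python
import numpy as np

M = np.array([[1.0, 2.0],
              [3.0, 4.0]])
v = np.array([10.0, 1.0])

# One call to matrix multiplication...
result = M @ v

# ...is the same as one dot product per row of the matrix.
by_hand = np.array([M[0] @ v,   # row 0 · v = 1*10 + 2*1 = 12
                    M[1] @ v])  # row 1 · v = 3*10 + 4*1 = 34
```

Multiplying two matrices works the same way: every entry of the output is one dot product between a row of the first matrix and a column of the second.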

Sounds confusing, or in need of a refresher? The key concepts you must understand: vectors, matrices, and dot products. Just high school math!

📌 Resource: Visual Linear Algebra Book

Non-linear Transformation

In each layer of the network, between the matrix multiplication steps, the model applies an activation function to introduce non-linearity. Without this step, stacking multiple layers would be pointless; the whole stack would collapse into one big linear transformation. The most common choices are the softmax, ReLU, sigmoid, and tanh functions.
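All four common activations can be written in a line of NumPy each. A minimal sketch, applied to an arbitrary example vector `z`:

```python
import numpy as np

z = np.array([-1.0, 0.0, 2.0])

relu = np.maximum(z, 0)                # clips negatives to 0: [0., 0., 2.]
sigmoid = 1 / (1 + np.exp(-z))         # squashes each value into (0, 1)
tanh = np.tanh(z)                      # squashes each value into (-1, 1)
softmax = np.exp(z) / np.exp(z).sum()  # non-negative values that sum to 1
```

Softmax is the odd one out: it acts on the whole vector at once, turning scores into a probability distribution, which is why it shows up in attention and in the final next-token prediction.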

📌 Resource: Visual Guide to Activation Functions

Basic Statistics

Lastly, some steps perform simple statistics. In language models, we often normalize vectors (dividing by their length so that the result has length 1), or take a min or max (e.g., max pooling in vision models).
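Both operations fit in a couple of lines. A minimal sketch with made-up numbers (the window size of 2 for pooling is an arbitrary choice for the example):

```python
import numpy as np

# Normalization: divide a vector by its length so the result has length 1.
v = np.array([3.0, 4.0])
unit = v / np.linalg.norm(v)   # [0.6, 0.8], a vector of length 1

# Max pooling: keep only the maximum of each window (window size 2 here).
feature_map = np.array([1.0, 5.0, 2.0, 4.0])
pooled = feature_map.reshape(-1, 2).max(axis=1)   # [5.0, 4.0]
```

That is the entire "statistics" part: no calculus, just dividing, and picking the biggest number in a group.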

📌 Resource: Seeing Theory - Visual explanations of statistical concepts

Putting it all together

Once you know all these, want to see how the three operations combine in real architectures? Here are the best videos explaining the mathematics of LLMs.

When you understand these three building blocks, attention and transformers stop feeling like magic. Which part of the math felt most abstract when you first learned it?