Transformers look scary, but they are mostly matrix multiplication! Here are the must-understand linear algebra concepts and non-linear functions behind how LLMs work.
In a nutshell, modern deep learning models repeatedly apply just three kinds of operations:
When we read about the billions of parameters in a model, we are talking about its weights and biases, which are stored as a large collection of matrices and vectors.
Before the attention mechanism, an LLM first turns each token into a vector. Then most of the hard work the model performs is multiplying matrices with other matrices and vectors, which, at its core, means computing many dot products.
Sound confusing? Need a refresher? The key concepts you must understand: vectors, matrices, and dot products. Just high school math!
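To make this concrete, here is a minimal NumPy sketch (with made-up toy numbers) showing that a matrix-vector product is nothing more than one dot product per row of the matrix:

```python
import numpy as np

# A toy 3-dimensional "token embedding" and a 3x3 weight matrix.
# The values are illustrative only, not from any real model.
x = np.array([1.0, 2.0, 3.0])
W = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [1.0, 1.0, 1.0]])

# Matrix-vector multiplication in one call...
y = W @ x

# ...is exactly the same as taking the dot product of x
# with each row of W, one row at a time.
row_dots = np.array([np.dot(row, x) for row in W])
```

Stacks of these products, with billions of learned weights, are where most of an LLM's compute goes.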
In each layer of the network, between matrix multiplication steps, the model applies an activation function to introduce non-linearity. Without this step, stacking multiple layers would be pointless: the whole stack would collapse into one big linear transformation. The most common choices are the softmax, ReLU, sigmoid, and tanh functions.
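The four activation functions named above can all be written in a few lines of NumPy (a simple sketch, not an optimized implementation):

```python
import numpy as np

def relu(z):
    # ReLU: zero out negative values, keep positive ones unchanged.
    return np.maximum(0.0, z)

def sigmoid(z):
    # Squashes any real number into the open interval (0, 1).
    return 1.0 / (1.0 + np.exp(-z))

def softmax(z):
    # Subtract the max first for numerical stability,
    # then exponentiate and normalize so the outputs sum to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([-1.0, 0.0, 2.0])
# tanh is built in: np.tanh(z) squashes values into (-1, 1).
t = np.tanh(z)
```

Softmax is special: it turns a vector of raw scores into a probability distribution, which is exactly what attention weights and next-token predictions need.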
Lastly, some steps perform simple statistics. In language models, we often normalize vectors (for example, dividing a vector by its length so it becomes a unit vector), or compute a min or max (e.g., max pooling in vision models).
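Both of these statistical steps fit in a couple of lines; here is a small sketch of unit-length normalization and a max, again with toy numbers:

```python
import numpy as np

v = np.array([3.0, 4.0])

# Normalize to unit length: divide by the vector's L2 norm (here, 5).
unit = v / np.linalg.norm(v)

# Simple statistics: the max, as used in max pooling.
m = v.max()
```

Real LLMs use close relatives of this idea, such as LayerNorm and RMSNorm, which divide activations by a statistic computed over the vector.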
Once you know all this, want to see how these three operations combine in real architectures? Here are the best videos explaining the mathematics of LLMs.
When you understand these three building blocks, attention and transformers stop feeling like magic. Which part of the math felt most abstract when you first learned it?