The Intuition and Derivation of Perplexity for LLM Evaluation

The intuition. When deriving the cross-entropy loss, we showed how entropy plays a central role in the optimization of softmax models (i.e., multi-class classification models). All large language models (LLMs) are exactly that: softmax models that, for an input sequence of \(t\) tokens \(x=[x_1, x_2, \ldots, x_t]\), output a conditional probability distribution \(P(w \mid x)\) over the vocabulary \(V\) of all tokens. This distribution gives us the most likely next token(s) to continue the input sequence. ...
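A minimal sketch of the two quantities the excerpt refers to: the next-token distribution \(P(w \mid x)\) and the perplexity of a sequence, computed as the exponentiated mean negative log-likelihood. It assumes a HuggingFace causal LM; "gpt2" is used here only as a small stand-in model, not necessarily the one used in the post.

```python
# Sketch: next-token distribution P(w|x) and sequence perplexity with a causal LM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()

inputs = tokenizer("The quick brown fox", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits                # shape: (1, t, |V|)

# P(w | x): softmax over the vocabulary at the last position.
next_token_dist = torch.softmax(logits[0, -1], dim=-1)
print(tokenizer.decode([next_token_dist.argmax().item()]))   # most likely next token

# Perplexity: exp of the average negative log-likelihood of the observed tokens.
labels = inputs["input_ids"][0, 1:]                # targets are the inputs shifted by one
log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
nll = -log_probs[torch.arange(labels.numel()), labels]
print(torch.exp(nll.mean()))                       # perplexity of the sequence
```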

March 8, 2025

Deriving cross-entropy losses from first principles. From binary classification to LLM distillation

Binary logistic regression, binary cross-entropy loss. We start with binary logistic regression and define our task as follows. We're given a dataset of inputs and targets (labels): $$ \begin{array}{c} (x^{(1)}, y^{(1)})\\ (x^{(2)}, y^{(2)})\\ \vdots\\ (x^{(m)}, y^{(m)}) \end{array} $$ and a logistic model: $$\hat{y}^{(i)} = h(f(x^{(i)}; \theta)) = \frac{1}{1 + e^{-f(x^{(i)}; \theta)}}$$ Each target \(y^{(i)}\) is either 0 or 1. That is, we're doing binary classification, for example fraud/not fraud, churn/not churn, disease/not disease, cat/dog. ...
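A minimal sketch of this setup in PyTorch, assuming a linear \(f(x; \theta) = \theta^\top x + b\) and writing the binary cross-entropy loss out explicitly; the dimensions and random data are purely illustrative.

```python
# Sketch: logistic model with a linear f(x; theta) and the binary cross-entropy loss.
import torch

m, d = 8, 3                              # m examples, d features (illustrative sizes)
x = torch.randn(m, d)                    # inputs x^(i)
y = torch.randint(0, 2, (m,)).float()    # binary targets y^(i) in {0, 1}

theta = torch.zeros(d, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

f = x @ theta + b                        # f(x^(i); theta)
y_hat = torch.sigmoid(f)                 # 1 / (1 + exp(-f))

# Binary cross-entropy: -[y log(y_hat) + (1 - y) log(1 - y_hat)], averaged over examples.
loss = -(y * torch.log(y_hat) + (1 - y) * torch.log(1 - y_hat)).mean()
loss.backward()                          # gradients w.r.t. theta and b
```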

January 15, 2025

LLM Quantization From Scratch - How are empty models instantiated for low memory usage?

Say we want to do post-training quantization of an LLM. For PyTorch models, we'll usually have an implementation defaulting to bfloat16 and torch.nn layers, such as torch.nn.Linear and torch.nn.Embedding. We'll also have pretrained weights. For a HuggingFace model, they'll come as a set of .safetensors files, accompanied by model configs. To get a quantized model, we can simply: load the pretrained model into memory (CPU or GPU), doing this with the default, non-quantized dtype, usually bfloat16. ...
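A minimal sketch of one way to instantiate an "empty" model without allocating weight memory, using PyTorch's meta device; this is an assumption about the technique the title refers to, and the tiny module below is illustrative, not the post's model.

```python
# Sketch: instantiating a model skeleton with no backing weight storage via the "meta" device.
import torch
import torch.nn as nn

class TinyLM(nn.Module):
    def __init__(self, vocab_size: int = 1000, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        self.lm_head = nn.Linear(dim, vocab_size)

    def forward(self, ids: torch.Tensor) -> torch.Tensor:
        return self.lm_head(self.embed(ids))

# Parameters created under the meta device have shapes and dtypes but allocate no memory.
with torch.device("meta"):
    model = TinyLM()

print(model.lm_head.weight.device)    # meta
print(model.lm_head.weight.numel())   # parameter count is known without any allocation

# Later, materialize uninitialized tensors on a real device before loading
# pretrained (or quantized) weights into them.
model = model.to_empty(device="cpu")
```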

April 7, 2024