How Llama 2 Works: Inference and Architecture in Pure PHP
The PHP ecosystem has long been the engine of the web. When it comes to artificial intelligence, Large Language Models (LLMs), or tensor computation, the mind immediately jumps to Python, C++, or CUDA. Yet, the best way to truly understand a complex technology is to strip away its abstractions, step away from "magic" libraries like PyTorch or TensorFlow, and rewrite it from scratch.
Guided by the philosophy "if I can't build it, I don't understand it", I followed in the footsteps of Andrej Karpathy's llama2.c project, rewriting it in PHP as llama2-php: a native implementation, in pure PHP (no C extensions or external dependencies), of inference for the Llama 2 architecture.
In this article, we will explore in detail how Llama 2 works under the hood, how it was translated into PHP, and the architectural and performance challenges faced during development.
How Llama 2 Works
To understand the code, we first need to understand the architecture. Meta's Llama 2 is based on the classic Transformer architecture, a model introduced by Google in 2017 (with the famous paper "Attention Is All You Need") that literally revolutionized the world of artificial intelligence.
Before Transformers, neural networks (like RNNs or LSTMs) read text a bit like a beginner reader would: one word at a time, in strict sequence. This approach had a huge limitation: by the time the model reached the end of a long paragraph, it had "forgotten" the concepts expressed at the beginning. The Transformer eliminated this bottleneck by processing the entire text in parallel.
To give you an idea, imagine having to analyze a very complex contract. Instead of reading it line by line and losing the thread, the Transformer can look at the entire page at once, instantly tracing invisible threads that connect a pronoun on page 3 with the subject named on page 1. It manages to do this by weighing the importance of each word against all the others, simultaneously.
Building on this revolutionary foundation, Llama 2 introduces some fundamental optimizations that make it even faster and computationally more efficient. Let's look at them one by one:
1. Tokenization and Embedding
An LLM doesn't understand words, it understands numbers. The first step is therefore tokenization, where the text is fragmented into "tokens" (which can be whole words, syllables, or single characters) mapped to an integer ID.
These IDs are then converted into Embeddings, which are high-dimensional vectors.
Imagine an immense map containing all the words of the language (in reality it has thousands of dimensions, but picture it in 3D). The word "King" in this map is located at a certain coordinate, and "Queen" will be very close to it (because it shares context and meaning with "King")... but "Apple" will be in a completely different area! The Embedding takes our token and places it in this space, allowing the model to capture its semantic meaning through its vector distance from other words.
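To make this concrete, here is a toy sketch of the lookup. The table, token IDs, and vector values are entirely invented for illustration; in the real model the table has tens of thousands of rows and thousands of columns, stored flat exactly like this:

```php
<?php
// Toy embedding table: vocab_size = 4, dim = 3, stored as a flat
// one-dimensional array (row-major), as in the real implementation.
$dim = 3;
$embeddingTable = [
     0.10, 0.20, 0.30,  // token 0 (something unrelated)
     0.90, 0.80, 0.70,  // token 1: "King"
     0.85, 0.82, 0.69,  // token 2: "Queen" - near "King"
    -0.50, 0.10, -0.90, // token 3: "Apple" - far away
];

// Looking up a token is just slicing $dim floats at offset id * $dim.
function tokenEmbedding(array $table, int $id, int $dim): array {
    return array_slice($table, $id * $dim, $dim);
}

// Cosine similarity as a proxy for "semantic closeness" on the map.
function cosine(array $a, array $b): float {
    $dot = $na = $nb = 0.0;
    foreach ($a as $i => $v) {
        $dot += $v * $b[$i];
        $na  += $v * $v;
        $nb  += $b[$i] * $b[$i];
    }
    return $dot / (sqrt($na) * sqrt($nb));
}

$king  = tokenEmbedding($embeddingTable, 1, $dim);
$queen = tokenEmbedding($embeddingTable, 2, $dim);
$apple = tokenEmbedding($embeddingTable, 3, $dim);

// "King" is measurably closer to "Queen" than to "Apple".
var_dump(cosine($king, $queen) > cosine($king, $apple)); // bool(true)
```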
2. RoPE (Rotary Positional Embeddings)
Unlike our brain (and older sequential neural networks that read one word at a time), Transformers process all tokens in parallel. This guarantees brute speed, but makes them totally blind to word order: without help, the model wouldn't know how to distinguish between 'The dog bites the man' and 'The man bites the dog'.
The only way to solve this problem is to intervene on the data itself: we must "tag" the words before they enter the mathematical "blender". This is where Positional Encoding comes into play. Since the neural network lacks the concept of time or sequence, we take the Embedding of each individual token (the vector representing its meaning we talked about earlier) and add a second vector to it that encodes its position within the sentence.
This way, the token "dog" doesn't enter the model carrying only the semantic information of the animal; it carries the fused information "I am the concept of dog and I am located at position number two". It is precisely from this need that systems like Llama 2's RoPE are born, which instead of adding a value, "rotate" the vectors in multidimensional space to preserve the relative distances between words even more efficiently.
3. Self-Attention
The true beating heart of the Transformer, the magic that allows it to "understand" language, is the Self-Attention mechanism. To understand how it works, we must imagine that each word, as soon as it enters the neural network, undergoes a decomposition. Through matrix multiplications, for each individual token the model generates three distinct new vectors: Query, Key, and Value.
We can see them as the three roles a word assumes within a conversation:
- Query (Q): What am I looking for? It's the question the token asks the rest of the sentence to better understand itself.
- Key (K): What do I have to offer? It's the label the token exposes to others, indicating the information it contains.
- Value (V): What is my true meaning? It is the substance, the pure semantic essence of the token that will actually be used to build the final understanding.
Take for example a very ambiguous sentence, like "The bank closed the branch because it had no funds".
The word "funds", taken individually, doesn't have a unique meaning: it could be coffee grounds (in Italian, fondi di caffè means coffee grounds), investment funds, or the bottoms of bottles. How does the model disambiguate? When it's the turn to process the word "funds", its Query vector starts scanning the entire sentence. Mathematically, this happens by calculating the dot product between the Query of "funds" and the Keys of all the other words present. The dot product is an operation that returns a high number if two vectors are similar, and a low number if they are orthogonal (meaning they have nothing to do with each other).
The Query of "funds" "asks": "Is there anyone here dealing with finance, agriculture, or something else?". The words "bank" and "branch" expose their Keys screaming: "We talk about economics, credit institutions, and money!".
The mathematical calculation between the Query of "funds" and the Keys of "bank" and "branch" produces a very high score. Conversely, the affinity score with words like "because" or "had" will be close to zero.
At this point the fusion happens: the model takes the Value vectors of all the words and sums them together, but not in equal parts. It "weighs" them based on the scores just calculated. Since "bank" obtained a very high score, its Value (its financial meaning) will help the correct interpretation of the word "funds".
The final result? The token "funds" exits this attention block profoundly changed. It is no longer the generic dictionary token, but has become a highly contextualized vector that unequivocally means money. It has literally paid "attention" to the right context.
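The whole Q/K/V dance above can be sketched in a few lines of PHP. The vectors below are made up purely for illustration (a real model learns them and uses hundreds of dimensions), but the three steps — score, softmax, weighted sum — are the real ones:

```php
<?php
// Toy single-Query attention step for the token "funds".
// All vectors are invented (dim = 2) purely to show the mechanics.
$keys = [
    'bank'    => [1.0, 0.0], // "I talk about finance"
    'because' => [0.0, 1.0], // function word, little to offer
];
$values = [
    'bank'    => [0.9, 0.9], // the financial "substance"
    'because' => [0.1, 0.1],
];
$query = [1.0, 0.1]; // "funds" looking for financial context

function dotProduct(array $a, array $b): float {
    $s = 0.0;
    foreach ($a as $i => $v) {
        $s += $v * $b[$i];
    }
    return $s;
}

// 1. Affinity scores: Query . Key, scaled by sqrt(dim) as in the paper.
$scores = [];
foreach ($keys as $word => $k) {
    $scores[$word] = dotProduct($query, $k) / sqrt(count($query));
}

// 2. Softmax: turn raw scores into weights that sum to 1.
$max = max($scores);
$exp = array_map(fn($s) => exp($s - $max), $scores);
$sum = array_sum($exp);
$weights = array_map(fn($e) => $e / $sum, $exp);

// 3. Weighted sum of the Values: the contextualized "funds" vector.
$out = [0.0, 0.0];
foreach ($values as $word => $v) {
    foreach ($v as $i => $component) {
        $out[$i] += $weights[$word] * $component;
    }
}
// "bank" gets the larger weight, so $out leans toward its Value.
```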
In Llama 2, this process has been further optimized with a technique called Grouped-Query Attention (GQA). Instead of computing a dedicated Key and Value head for every single Query head (which consumes an enormous amount of RAM during inference, since Keys and Values must be cached), multiple Query heads are grouped together to share the same Key and Value heads, drastically reducing the memory and computational load with virtually no loss of precision.
4. RMSNorm and SwiGLU
Once the Attention mechanism has merged the meanings of the words together, the resulting vectors must pass through two crucial phases before being sent to the next layer: they must be stabilized and then processed by a "classic" neural network.
Stabilization: RMSNorm

In a deep network like Llama 2, which has dozens of stacked layers, the size of the numbers tends to get out of hand. Multiplication after multiplication, the values within the vectors can become astronomically large or microscopic, ruining future calculations.
Older Transformers used Layer Normalization, which calculated the mean of all values in the vector, subtracted it (to center the data on zero), and then divided by the standard deviation. Llama 2 cuts to the chase and uses Root Mean Square Normalization (RMSNorm): researchers realized that centering the data on the mean was a computationally superfluous step. RMSNorm simply divides the values by their root mean square, keeping the vector's overall magnitude in a manageable range while trimming roughly 10% of the normalization work in each layer. In the context of a PHP implementation, where every loop over an array has a cost, skipping the recalculation of the mean on vectors of thousands of elements makes a huge difference in performance.
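A minimal PHP sketch of RMSNorm (the epsilon value and the gain vector are illustrative): note there is no mean subtraction, just one pass to accumulate squares and one to rescale.

```php
<?php
// RMSNorm sketch: no mean subtraction, just divide by the root mean
// square and apply a learned per-element gain. Epsilon guards against
// division by zero; its exact value here is illustrative.
function rmsnorm(array $x, array $weight, float $eps = 1e-5): array {
    $ss = 0.0;
    foreach ($x as $v) {
        $ss += $v * $v;
    }
    $rms = sqrt($ss / count($x) + $eps);
    $out = [];
    foreach ($x as $i => $v) {
        $out[$i] = ($v / $rms) * $weight[$i];
    }
    return $out;
}

// Values that "got out of hand" after many multiplications:
$x = [300.0, -400.0, 500.0, -200.0];
$g = [1.0, 1.0, 1.0, 1.0]; // unit gains for the demo

$y = rmsnorm($x, $g);

// The normalized vector has RMS ~= 1, whatever the input scale was.
$rmsOut = sqrt(array_sum(array_map(fn($v) => $v * $v, $y)) / count($y));
```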
Processing: The Feed-Forward Network and SwiGLU

If Attention serves to make the tokens "talk" to each other to understand the context, the subsequent Feed-Forward network (FFN) serves to process the single token independently, extracting complex logical patterns from it.
This is where the activation function comes into play. Without a non-linear activation, a neural network collapses into one giant linear transformation, no matter how many layers you stack. Until a few years ago, the industry standard was ReLU (Rectified Linear Unit), which has a brutal rule: if the number is negative it becomes 0; if it's positive it stays as it is. This approach is fast but causes the "dead neurons" problem: negative values are literally destroyed, and a neuron stuck at zero contributes nothing to any subsequent multiplication and stops learning.
Llama 2 makes a leap in quality by using SwiGLU (Swish-Gated Linear Unit). This architecture is much more sophisticated and combines two concepts:
- Swish: Instead of the sharp cut to zero of ReLU, it uses a softer curve that attenuates negative numbers but does not zero them out completely, so they can continue to transmit nuances of useful information to the model.
- GLU (Gated Linear Unit): Instead of simply passing the data through a set of weights, SwiGLU divides the flow. One part calculates the actual activation, the other part acts as a "gate" that multiplies the result, actively deciding how much of that data should actually pass to the next level.
In short, SwiGLU acts like a dimmable switch we use for some house lights: it doesn't just turn a neuron on or off, but finely regulates its signal intensity. Although it requires more parameters (three weight matrices instead of the classic two), empirically it allows Llama 2 to learn much more abstract concepts with the same computational resources.
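Here is a small sketch of the SwiGLU core. In the real FFN the two inputs come from the projections W1·x and W3·x; here they are hard-coded as hypothetical vectors ($h1, $h3) so we can focus on the activation-times-gate step.

```php
<?php
// SwiGLU core sketch. In the real FFN, $h1 = W1 . x (activation path)
// and $h3 = W3 . x (gate path); here they are hard-coded hypothetical
// vectors so we can focus on the activation-times-gate step.

// Swish/SiLU: x * sigmoid(x). Negatives are attenuated, not zeroed.
function silu(float $x): float {
    return $x * (1.0 / (1.0 + exp(-$x)));
}

$h1 = [2.0, -1.0, 0.5]; // activation path
$h3 = [1.0,  1.0, 0.0]; // gate path ("how much may pass?")

$out = [];
foreach ($h1 as $i => $v) {
    // The gate multiplies the activated value: a dimmer, not a switch.
    $out[$i] = silu($v) * $h3[$i];
}

// Unlike ReLU, the negative input (-1.0) survives attenuated:
// silu(-1.0) ~= -0.269, so $out[1] is small but non-zero, while the
// fully closed gate (0.0) blocks the third element completely.
```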
The PHP Implementation
Porting all this to PHP required a very strict architectural approach, similar to the one I adopted in my vector database vektor. No ORMs, no frameworks, just low-level data structures and pure memory manipulation.
Reading and managing weights
The trained model (the "weights" of the neural network) is typically saved in a huge binary file (in my case generated via export from Andrej Karpathy's C project).
In PHP, accessing these files cannot happen by loading everything into memory. If we tried to do a file_get_contents() of a multi-Gigabyte model, the PHP process would immediately throw a Fatal Error by exceeding the memory_limit.
The implementation obviously uses streams:
$fp = fopen('model.bin', 'rb');
// Read the header: 7 x 32-bit ints (dim, hidden_dim, n_layers,
// n_heads, n_kv_heads, vocab_size, seq_len) = 28 bytes
$header = unpack('i*', fread($fp, 28));
Tensors are read in blocks and decoded using unpack('f*', $binary_data). For larger models, a memory-mapping-style approach is needed instead: reading parts of the file from disk exactly when the current layer requires them, computing the matrix multiplications, and then discarding the data, keeping RAM overhead close to zero.
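A sketch of what such chunked reading could look like. The file layout here (a single run of float32s) and the chunk size are illustrative; the real checkpoint interleaves many tensors after the header.

```php
<?php
// Sketch of chunked tensor reading. The fake file below contains a
// single run of 1000 float32s purely for demonstration.

// Build a small fake weights file.
$tmp = tempnam(sys_get_temp_dir(), 'llm');
$fw  = fopen($tmp, 'wb');
for ($i = 0; $i < 1000; $i++) {
    fwrite($fw, pack('f', $i / 1000));
}
fclose($fw);

// Stream it back 256 floats at a time instead of file_get_contents().
function readTensor($fp, int $count, int $chunk = 256): array {
    $tensor = [];
    while ($count > 0) {
        $n   = min($chunk, $count);
        $bin = fread($fp, $n * 4); // 4 bytes per float32
        // unpack() returns a 1-indexed array; append values in order.
        foreach (unpack('f*', $bin) as $v) {
            $tensor[] = $v;
        }
        $count -= $n;
    }
    return $tensor;
}

$fp = fopen($tmp, 'rb');
$tensor = readTensor($fp, 1000);
fclose($fp);
unlink($tmp);
// count($tensor) is 1000; $tensor[500] ~= 0.5
```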
Matrix Multiplication (MatMul)
The real bottleneck is the main mathematical operation: C = A * B.
In the Llama core, this means multiplying the input vector by the massive weight matrix. In llama2-php, the matmul function was written taking memory layout into account.
function matmul(array &$out, array &$x, array &$w, int $n, int $d) {
    // x is the input vector (dimension $d)
    // w is the weight matrix ($n x $d), flattened row-major
    for ($i = 0; $i < $n; $i++) {
        $val = 0.0;
        for ($j = 0; $j < $d; $j++) {
            $val += $w[$i * $d + $j] * $x[$j];
        }
        $out[$i] = $val;
    }
}
The matrix implementation is deliberately "flat" (made with one-dimensional arrays) to maximize access speed and avoid the disastrous overhead that PHP introduces with multidimensional arrays, which under the hood are managed as heavy hash tables.
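A quick usage example on a tiny 2x3 matrix makes the flat row-major indexing concrete:

```php
<?php
// The matmul from the article, exercised on a tiny example to make
// the flat row-major indexing concrete.
function matmul(array &$out, array &$x, array &$w, int $n, int $d) {
    for ($i = 0; $i < $n; $i++) {
        $val = 0.0;
        for ($j = 0; $j < $d; $j++) {
            $val += $w[$i * $d + $j] * $x[$j];
        }
        $out[$i] = $val;
    }
}

// A 2x3 matrix stored flat:   | 1 2 3 |
//                             | 4 5 6 |
$w = [1, 2, 3, 4, 5, 6];
$x = [1, 0, 1]; // input vector, d = 3
$out = [];

matmul($out, $x, $w, 2, 3);
// $out[0] = 1*1 + 2*0 + 3*1 = 4
// $out[1] = 4*1 + 5*0 + 6*1 = 10
```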
Floating Point Precision
In C, lighter models use the 32-bit float format (FP32). PHP, by contrast, uses the 64-bit double type for everything floating-point. When reading weights with unpack('f*'), PHP silently promotes them to 64-bit. While this theoretically increases precision, every subsequent operation (multiplications, exponentials, square roots) is then computed at 64-bit instead of 32-bit, so results drift slightly from the original C implementation. This inevitably leads to small numerical differences compared to other implementations.
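You can observe this FP32-to-double promotion directly with a tiny round-trip experiment:

```php
<?php
// Round-tripping a value through the 32-bit float format, exactly as
// happens when weights stored as FP32 are read with unpack('f*').
$original = 0.1;

$binary  = pack('f', $original);    // stored as float32 on disk
$asFloat = unpack('f', $binary)[1]; // PHP promotes it to a double

// The stored value is the nearest float32 to 0.1, not 0.1 itself:
// $asFloat ~= 0.10000000149011612
$drift = abs($asFloat - $original);
// $drift is tiny (~1.5e-9) but non-zero, and such discrepancies
// compound across millions of multiplications.
```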
In Conclusion
Transforming llama2.c into llama2-php had a single purpose: to study and understand the internal workings of the engine.
Implementing abstract concepts like Attention and RoPE allowed me to experience the mathematical elegance of these networks firsthand. I discovered that artificial intelligence is not black magic, but "only" millions of sequential multiplications, structured binary readings, and, again, millions of multiplications…