
They're Made Out of Weights — What Every Engineer Should Understand About How LLMs Actually Work

LLMs don't have a dictionary. They don't have grammar rules. They don't have a database of facts. They have weights — 80 layers of floating-point numbers multiplied together. Here's what that means for engineers who use these models every day but have never looked inside.

04 Jun 2026 · 9 min read · Ankur

Max Leiter published "They're Made Out of Weights" on June 3, 2026 — a short dialogue that captures the central cognitive dissonance of working with LLMs. A manager refuses to accept that the AI writing performance reviews and softening tone "unprompted" is just matrix multiplication. An engineer explains: there's no dictionary in there, no grammar rules, no little man. Just weights. The HN thread hit 440 points because every engineer who's used an LLM has had this conversation with themselves. Let's walk through what "made out of weights" actually means, at the level of detail that matters for practical engineering decisions.

The Fundamental Model: It's Just Guessing the Next Token

An LLM doesn't "write" text in the way a human does. It predicts the next token (a token is roughly three-quarters of an English word on average) based on all the tokens that came before it. Then it feeds that predicted token back into itself and predicts the next one. And the next. At that ratio, a 500-word response is roughly 650 to 700 sequential predictions, each one rolling the dice based on everything that came before.

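# What an LLM does, conceptually.
# (model, tokenize, sample, detokenize and EOS_TOKEN are placeholders, not a real API.)
def generate(prompt, max_tokens=512):
    tokens = tokenize(prompt)
    output = []
    for _ in range(max_tokens):
        logits = model.forward(tokens + output)  # matrix multiply through 80 layers
        next_token = sample(logits, temperature=0.7)  # pick one, weighted by probability
        if next_token == EOS_TOKEN:
            break  # stop before emitting the end-of-sequence marker
        output.append(next_token)
    return detokenize(output)

generate("Explain PostgreSQL connection pooling")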

That's it. There's no consciousness loop, no reasoning module, no "understanding" in the philosophical sense. There's a forward pass through stacked transformer layers that assigns a probability to every possible next token, and a sampling strategy that picks one. The "intelligence" is an emergent property of the probabilities being really well-calibrated.
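To make the loop concrete, here is a minimal runnable sketch. It swaps the transformer for a hypothetical bigram counter over a made-up three-sentence corpus, so the "model" is nothing like a real LLM, but the predict, sample, append cycle is the same shape. All names here (`CORPUS`, `build_bigram_counts`, `generate`) are illustrative assumptions.

```python
import random

# Made-up corpus; a real model is trained on trillions of tokens.
CORPUS = "the cat sat on the mat . the cat ate . the dog sat on the rug .".split()

def build_bigram_counts(tokens):
    """Count how often each token follows each other token."""
    counts = {}
    for prev, nxt in zip(tokens, tokens[1:]):
        counts.setdefault(prev, {})
        counts[prev][nxt] = counts[prev].get(nxt, 0) + 1
    return counts

def next_token_distribution(counts, context):
    """Stand-in for the forward pass: {token: probability} given the last token."""
    followers = counts.get(context[-1], {})
    total = sum(followers.values())
    return {tok: c / total for tok, c in followers.items()}

def generate(counts, prompt, max_tokens=10, seed=0):
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    output = list(prompt)
    for _ in range(max_tokens):
        dist = next_token_distribution(counts, output)
        if not dist:
            break  # nothing ever followed this token in the corpus
        tokens, probs = zip(*dist.items())
        nxt = rng.choices(tokens, weights=probs, k=1)[0]  # weighted dice roll
        output.append(nxt)
        if nxt == ".":  # "." plays the role of the end-of-sequence token
            break
    return output

counts = build_bigram_counts(CORPUS)
print(" ".join(generate(counts, ["the"])))
```

The toy only looks at the previous token, where a real LLM conditions on the whole context. Everything else about the loop carries over.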

💡 Key Insight: When Claude writes a thoughtful code review, it's not "thinking" about your code. It's predicting the most probable sequence of tokens that follows the prompt "Review this pull request" given the patterns it absorbed during training. The fact that those predictions are useful is the miracle.

Weights: What They Are, Where They Come From

Every LLM is a set of matrices. A 12B-parameter model like Gemma 4 12B has approximately 12 billion numbers stored as 16-bit floating point values (about 24 GB uncompressed). These numbers are the "weights" — the coefficients in the matrix multiplications that transform input tokens into output token probabilities.
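The 24 GB figure is just parameter count times bytes per parameter. A quick sketch of that arithmetic, assuming the usual storage sizes per precision (the `BYTES_PER_PARAM` table and function name are illustrative, and this counts weights only, not activations or caches):

```python
# Bytes needed to store one parameter at common precisions.
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "int8": 1, "int4": 0.5}

def weight_memory_gb(num_params, dtype):
    """Approximate size of the weights alone, in decimal gigabytes."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9

for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weight_memory_gb(12e9, dtype):.0f} GB")
```

This is also why quantization matters in practice: the same 12B parameters take about a quarter of the FP16 footprint at 4 bits.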

Weights start random. During training, the model is shown trillions of tokens of text (a large slice of the public web, books, code repositories, academic papers). For each token, the model predicts what comes next, compares its prediction to what actually came next, and adjusts its weights to reduce the error. After weeks to months of training on thousands of GPUs, the weights settle into values that encode the patterns in that text. The numbers at a glance:

Training tokens: trillions
Parameters: 12B (Gemma 4 12B)
Transformer layers: 80 (typical)
Size uncompressed (FP16): ~24 GB

The weights don't store facts the way a database does. They store statistical relationships. The number 0.00423 in layer 37, position 8,241 doesn't "mean" anything by itself. It's one coefficient in a long chain of matrix multiplications (several per layer, across all 80 layers) that, collectively, make the model slightly more likely to complete "The capital of France is" with "Paris" than with "Mumbai."
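Here is a toy sketch of how "adjusting weights to reduce the error" nudges those probabilities. It assumes a deliberately tiny stand-in model: three logits, one per candidate completion, trained with plain gradient descent on cross-entropy loss. The vocabulary, learning rate, and step count are made up for illustration.

```python
import math

VOCAB = ["Paris", "Mumbai", "London"]  # made-up candidate completions

def softmax(logits):
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_step(logits, target_index, lr=0.5):
    """One gradient-descent step on cross-entropy loss.

    For softmax + cross-entropy, d(loss)/d(logit_i) = p_i - 1[i == target].
    """
    probs = softmax(logits)
    grads = [p - (1.0 if i == target_index else 0.0) for i, p in enumerate(probs)]
    return [w - lr * g for w, g in zip(logits, grads)]

weights = [0.0, 0.0, 0.0]  # start with no preference at all
target = VOCAB.index("Paris")
before = softmax(weights)[target]
for _ in range(20):  # pretend the training text said "Paris" 20 times
    weights = train_step(weights, target)
after = softmax(weights)[target]
print(f"P(Paris) before: {before:.3f}, after: {after:.3f}")
```

No single number in `weights` "means" Paris. The preference only shows up when the numbers are pushed through softmax together, which is the same point at a much smaller scale.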

The Transformer: Why 80 Layers of Multiplication Does Anything Useful

The architecture that makes this work is the transformer, introduced in the 2017 paper "Attention Is All You Need." A transformer layer has two key mechanisms:

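Attention lets each token "look at" the other tokens in the context (in a decoder-style LLM, the tokens that came before it) and decide which ones are relevant. When processing the sentence "The cat sat on the mat because it was tired," a trained attention mechanism can give "cat" more weight than "mat" when building the representation of "it". The output is a weighted sum of token representations. The mixing weights are computed fresh for every input from query and key projections, and it is those projection matrices that are learned during training.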
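A minimal sketch of that weighted sum, assuming a single head and a causal mask so each position only sees itself and earlier positions. The input vectors are made up, and the learned query/key/value projections are skipped (the same vectors are reused for all three), so this shows the mechanics rather than a trained model.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    Position i may only attend to positions 0..i.
    """
    d = len(queries[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Similarity of this query to every key it is allowed to see.
        scores = [dot(q, keys[j]) / math.sqrt(d) for j in range(i + 1)]
        weights = softmax(scores)
        # Weighted sum of the visible value vectors.
        mixed = [sum(w * values[j][k] for j, w in enumerate(weights))
                 for k in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Three made-up 2-d token vectors standing in for token representations.
x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
out, weights = causal_attention(x, x, x)
for i, w in enumerate(weights):
    print(i, [round(v, 3) for v in w])
```

Each printed row is one token's attention distribution over the tokens it can see. Rows sum to 1, and the first token can only attend to itself.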

Feed-forward networks (the MLP layers) do the heavy computation. After attention aggregates context, the FFN transforms each token's representation through two matrix multiplications with a nonlinearity in between. This is where much of the model's "knowledge" is thought to live: the FFN weights are a large part of what encodes that Paris, not Mumbai, is the capital of France.
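The "two matrix multiplications with a nonlinearity in between" can be sketched in a few lines. This assumes a GELU activation, tiny hand-picked matrices, and no bias terms. Real models use learned matrices thousands of columns wide, and many use gated variants instead of this plain form.

```python
import math

def gelu(x):
    # Common tanh approximation of the GELU activation.
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

def feed_forward(x, w_up, w_down):
    """Expand to a wider hidden layer, apply the nonlinearity, project back."""
    hidden = [gelu(h) for h in matvec(w_up, x)]
    return matvec(w_down, hidden)

# Made-up weights: model width 2, hidden width 4.
W_UP = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]]
W_DOWN = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]

tokens = [[1.0, 0.0], [0.0, 1.0]]
# Each token is transformed on its own; no information moves between tokens here.
outputs = [feed_forward(t, W_UP, W_DOWN) for t in tokens]
print(outputs)
```

Note the division of labor: attention moves information between positions, while the FFN only ever sees one position at a time.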

# Simplified transformer layer (one of ~80):
def transformer_layer(x):
    # 1. Self-attention: figure out what's relevant
    attn_out = multi_head_attention(x)  # each token looks at all others
    x = layer_norm(x + attn_out)        # add & normalize (residual connection)
    
    # 2. Feed-forward: transform each token independently
    ffn_out = feed_forward(x)           # two matrix multiplies + activation
    x = layer_norm(x + ffn_out)         # add & normalize
    
    return x  # pass to next layer

The "thinking" is just 80 iterations of this. No branching, no loops, no variable state beyond the accumulated token representations. The same computation runs for every token, every time. The weights are fixed after training — the model doesn't learn from your conversation. It's a pure function from token sequence to probability distribution.

What This Means for Engineering Decisions

Understanding that an LLM's forward pass is a deterministic function of its weights and inputs, with randomness entering only at the sampling step, has practical implications:

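1. Temperature is your main sampling knob. The model doesn't "try harder" on complex questions. It runs the same forward pass regardless. Temperature (typically 0.0 to 1.0, though some APIs accept higher values) rescales the logits before a token is sampled. Temperature 0 = always pick the most probable token (greedy and repeatable, good for code). Temperature 1.0 = sample in proportion to the model's own probabilities. Temperature 0.7 = a somewhat sharper distribution that still leaves room for variety (a common choice for writing). Temperature 1.5 = the distribution flattens, low-probability tokens get picked more often, and the output tends to drift into nonsense. Related controls such as top-k and top-p truncate the distribution instead of rescaling it.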

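2. Context window is RAM, not disk. The 128K tokens you can feed into a model aren't "stored" anywhere permanent. Every new token has to attend over all of them. Implementations cache per-token keys and values so earlier tokens aren't recomputed from scratch, but that cache grows with the context. This is why long contexts are expensive: processing a prompt with standard attention is O(n²) in sequence length, so the attention part of a 128K context costs roughly 4x that of a 64K context, not 2x.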

3. Fine-tuning changes weights, RAG changes inputs. When you fine-tune a model, you're adjusting the weights to shift the probability distribution toward your domain — legal contracts, medical notes, textile specifications. When you use RAG (retrieval-augmented generation), you're adding relevant documents to the input context without changing weights. Fine-tuning is surgery. RAG is a briefing document. For most Indian business applications, RAG is the right first step — it's cheaper, faster, and doesn't require a dataset of 10,000+ examples.

4. "Hallucination" isn't a bug — it's how the model works. The model doesn't "know" it's making something up. It's always predicting the most probable next token given the training data. When the training data doesn't contain the answer, the model still predicts — it just predicts wrong. This is why RAG is essential for factual applications: you feed the model the facts as context so its next-token predictions are anchored in ground truth.
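Point 1 is easy to see numerically. The sketch below assumes the standard formulation, where logits are divided by the temperature before softmax. The three logits are made up, and temperature 0 is treated as a special case (argmax) because dividing by zero is undefined.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    if temperature <= 0:
        raise ValueError("temperature 0 means greedy argmax, not sampling")
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 3.0, 1.0]  # made-up scores for three candidate tokens
for t in (0.2, 0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

Low temperatures pile almost all the probability onto the top token, and high temperatures spread it toward the unlikely ones. The ranking of the tokens never changes, only how often the lower-ranked ones get picked.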

An LLM is a function that maps (prompt + context) → (probability distribution over next tokens). Everything else — reasoning, creativity, hallucination — is a property of how well those probabilities match what a human would write.
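Seen that way, RAG is just a way of changing the "context" half of the input. A minimal sketch, assuming a crude word-overlap scorer in place of a real embedding search and a hypothetical prompt template. The documents, function names, and template wording are all made up for illustration.

```python
def score(query, document):
    """Crude relevance score: number of lowercase words the two texts share."""
    return len(set(query.lower().split()) & set(document.lower().split()))

def retrieve(query, documents, k=2):
    """Return the k documents that share the most words with the query."""
    ranked = sorted(documents, key=lambda d: score(query, d), reverse=True)
    return ranked[:k]

def build_prompt(query, documents, k=2):
    """Put the retrieved text in front of the question. The weights never change."""
    context = "\n".join(f"- {d}" for d in retrieve(query, documents, k))
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n"
        f"Question: {query}\n"
    )

# Made-up knowledge base.
DOCS = [
    "PgBouncer pools PostgreSQL connections in transaction mode.",
    "The office cafeteria opens at 9 am.",
    "PostgreSQL max_connections defaults to 100.",
]

prompt = build_prompt("How does PostgreSQL connection pooling work?", DOCS)
print(prompt)
```

The string that comes out is all the model ever sees. Production systems replace `score` with vector similarity over embeddings, but the shape of the pipeline (retrieve, then prepend) stays the same.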

The Practical Takeaway

You don't need to implement a transformer to use LLMs effectively. But you do need to internalize what they are and aren't:

They aren't databases — don't expect exact recall without RAG
They aren't reasoning engines — don't expect multi-step logic to always converge
They aren't deterministic end to end: sampling adds randomness, so don't expect the same output twice unless temperature=0
They are pattern matchers — and patterns at web scale turn out to be useful for a lot of things

The manager in Max Leiter's dialogue asks, "So what do these weights have in mind?" The answer is: nothing. They don't have minds. They have a probability distribution over the next token, calibrated across trillions of examples, and optimized to produce outputs that look like they were written by someone who does have a mind.

That's the paradox. And it's why every engineer who uses these tools should understand the weights behind them — not to build their own model, but to know when to trust the output and when to reach for a deterministic alternative.
