They're Made Out of Weights — What Every Engineer Should Understand About How LLMs Actually Work
LLMs don't have a dictionary. They don't have grammar rules. They don't have a database of facts. They have weights — 80 layers of floating-point numbers multiplied together. Here's what that means for engineers who use these models every day but have never looked inside.
Max Leiter published "They're Made Out of Weights" on June 3, 2026 — a short dialogue that captures the central cognitive dissonance of working with LLMs. A manager refuses to accept that the AI writing performance reviews and softening tone "unprompted" is just matrix multiplication. An engineer explains: there's no dictionary in there, no grammar rules, no little man. Just weights. The HN thread hit 440 points because every engineer who's used an LLM has had this conversation with themselves. Let's walk through what "made out of weights" actually means, at the level of detail that matters for practical engineering decisions.
The Fundamental Model: It's Just Guessing the Next Token
An LLM doesn't "write" text in the way a human does. It predicts the next token (a token being roughly 0.75 of a word) based on all the tokens that came before it. Then it feeds that predicted token back into itself and predicts the next one. And the next. At that ratio, a 500-word response is roughly 650 to 700 sequential predictions, each one rolling the dice based on everything that came before.
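# What an LLM does, conceptually (model, tokenize, sample, detokenize and
# EOS_TOKEN are placeholders for a real model and tokenizer):
def generate(prompt, max_tokens=256):
    tokens = tokenize(prompt)
    output = []
    for _ in range(max_tokens):
        logits = model.forward(tokens + output)       # matrix multiply through 80 layers
        next_token = sample(logits, temperature=0.7)  # pick one, weighted by probability
        if next_token == EOS_TOKEN:                   # stop before emitting end-of-sequence
            break
        output.append(next_token)
    return detokenize(output)

generate("Explain PostgreSQL connection pooling")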
That's it. There's no consciousness loop, no reasoning module, no "understanding" in the philosophical sense. There's a forward pass through stacked transformer layers that assigns a probability to every possible next token, and a sampling strategy that picks one. The "intelligence" is an emergent property of the probabilities being really well-calibrated.
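The "assign a probability to every token, then pick one" step can be sketched in a few lines. This is a minimal illustration, not how any particular model implements it: the four-token vocabulary and the logit values are made up, and a real vocabulary has tens of thousands of entries.

```python
import math
import random

def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def sample_next_token(vocab, logits, rng):
    """Pick one token at random, weighted by its probability."""
    probs = softmax(logits)
    return rng.choices(vocab, weights=probs, k=1)[0]

# Hypothetical vocabulary and logits for the prompt "The capital of France is"
vocab = ["Paris", "Lyon", "Mumbai", "the"]
logits = [6.0, 2.5, 0.5, 1.0]

probs = softmax(logits)
rng = random.Random(0)  # seeded so the run is repeatable
picked = sample_next_token(vocab, logits, rng)
print(dict(zip(vocab, (round(p, 3) for p in probs))), "->", picked)
```

"Paris" gets most of the probability mass, but every other token keeps a nonzero chance of being picked, which is where run-to-run variation comes from.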
Weights: What They Are, Where They Come From
Every LLM is a set of matrices. A 12B-parameter model like Gemma 4 12B has approximately 12 billion numbers stored as 16-bit floating point values (about 24 GB uncompressed). These numbers are the "weights" — the coefficients in the matrix multiplications that transform input tokens into output token probabilities.
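The 24 GB figure is plain arithmetic: parameter count times bytes per parameter. A small sketch, assuming decimal gigabytes and counting the weights only (activations and any inference-time caches come on top); the 8-bit and 4-bit rows are included for comparison and are not from the text above.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate storage for the weights alone, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9

params = 12e9  # 12 billion parameters, as in the example above
for label, bits in [("16-bit", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(f"{label}: {weight_memory_gb(params, bits):.0f} GB")
```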
Weights start random. During training, the model is shown trillions of tokens of text (large parts of the public web, books, code repositories, academic papers). For each token, the model predicts what comes next, compares its prediction to what actually came next, and adjusts its weights to reduce the error. After months of training on thousands of GPUs, the weights settle into values that encode the statistical patterns of that text.
The weights don't store facts in the way a database does. They store statistical relationships. The number 0.00423 in layer 37, position 8,241 doesn't "mean" anything by itself. It's one coefficient in a chain of matrix multiplications running through all 80 layers that, collectively, make the model slightly more likely to complete "The capital of France is" with "Paris" than "Mumbai."
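The predict, compare, adjust loop can be shown on a toy scale. In this sketch the "model" is just three numbers, one logit per candidate token, and the update rule is plain gradient descent on cross-entropy loss. The vocabulary, learning rate and step count are arbitrary choices for illustration; real training uses far larger models and more elaborate optimizers.

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

vocab = ["Paris", "Mumbai", "banana"]
target = vocab.index("Paris")  # what actually came next in the training text

rng = random.Random(0)
weights = [rng.uniform(-0.1, 0.1) for _ in vocab]  # weights start (nearly) random
learning_rate = 0.5

before = softmax(weights)[target]
for step in range(20):
    probs = softmax(weights)
    # Gradient of cross-entropy loss w.r.t. each logit: prob - (1 if target else 0)
    grads = [p - (1.0 if i == target else 0.0) for i, p in enumerate(probs)]
    weights = [w - learning_rate * g for w, g in zip(weights, grads)]
after = softmax(weights)[target]

print(f"P(Paris) before: {before:.2f}, after: {after:.2f}")
```

No fact was written anywhere. The numbers simply moved until "Paris" became the likelier completion.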
The Transformer: Why 80 Layers of Multiplication Does Anything Useful
The architecture that makes this work is the transformer, introduced in the 2017 paper "Attention Is All You Need." A transformer layer has two key mechanisms:
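Attention lets each token "look at" the other tokens in the context (in a decoder-only LLM, the ones that came before it) and decide which ones are relevant. When processing the sentence "The cat sat on the mat because it was tired," the attention mechanism learns that "it" probably refers to "cat" (not "mat"). This is computed as a weighted sum of token representations. The mixing weights themselves are not stored: they are recomputed for every input from query and key vectors, and it is the projection matrices producing those vectors that are learned during training.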
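A minimal sketch of single-head scaled dot-product attention with a causal mask, written in plain Python for readability rather than speed. The query, key and value vectors are hard-coded toy values; in a real model they come from multiplying token representations by learned projection matrices, and many heads run in parallel.

```python
import math

def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def causal_attention(queries, keys, values):
    """Position i may only attend to positions 0..i, as in a decoder-only LLM."""
    dim = len(queries[0])
    outputs, all_weights = [], []
    for i, q in enumerate(queries):
        # Score this token's query against every visible key.
        scores = [dot(q, keys[j]) / math.sqrt(dim) for j in range(i + 1)]
        weights = softmax(scores)  # computed from the input; sums to 1
        # Output is the attention-weighted sum of the visible value vectors.
        out = [sum(w * values[j][d] for j, w in enumerate(weights))
               for d in range(len(values[0]))]
        outputs.append(out)
        all_weights.append(weights)
    return outputs, all_weights

# Toy 2-dimensional vectors for three tokens (illustrative values only).
q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

outputs, weights = causal_attention(q, k, v)
for i, row in enumerate(weights):
    print(f"token {i} attends to {len(row)} position(s):", [round(w, 2) for w in row])
```

The first token can only see itself, so its output is exactly its own value vector. Later tokens blend several value vectors according to how well their query matches each key.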
Feed-forward networks (the MLP layers) do the heavy computation. After attention aggregates context, the FFN transforms each token's representation through two matrix multiplications with a nonlinearity in between. This is where most of the model's "knowledge" lives — the FFN weights are what encode that Paris is the capital of France, not Mumbai.
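The "two matrix multiplications with a nonlinearity in between" shape looks like this in a minimal sketch. The weight matrices here are tiny made-up values, and GELU is used as one common choice of activation; many current models use gated variants of this block instead.

```python
import math

def gelu(x):
    """Tanh approximation of the GELU activation."""
    return 0.5 * x * (1.0 + math.tanh(math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)))

def matvec(matrix, vector):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def feed_forward(x, w_up, w_down):
    """Expand to a wider hidden size, apply the nonlinearity, project back."""
    hidden = [gelu(h) for h in matvec(w_up, x)]
    return matvec(w_down, hidden)

# Hypothetical tiny weights: model width 2, hidden width 4.
w_up = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 1.0]]
w_down = [[0.5, 0.0, 0.5, 0.0], [0.0, 0.5, 0.0, 0.5]]

token = [1.0, -2.0]
print(feed_forward(token, w_up, w_down))
```

Each token's vector goes through this function on its own. Unlike attention, nothing here looks at neighbouring tokens.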
# Simplified transformer layer (one of ~80):
def transformer_layer(x):
    # 1. Self-attention: figure out what's relevant
    attn_out = multi_head_attention(x)  # each token looks at all others
    x = layer_norm(x + attn_out)        # add & normalize (residual connection)
    # 2. Feed-forward: transform each token independently
    ffn_out = feed_forward(x)           # two matrix multiplies + activation
    x = layer_norm(x + ffn_out)         # add & normalize
    return x                            # pass to next layer
The "thinking" is just 80 iterations of this. No branching, no loops, no variable state beyond the accumulated token representations. The same computation runs for every token, every time. The weights are fixed after training — the model doesn't learn from your conversation. It's a pure function from token sequence to probability distribution.
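The "pure function" claim can be checked on a toy stack. The sketch below is an assumption-laden stand-in, not a real transformer: each layer is a fixed linear map plus a residual add and a layer norm without learned gain or bias, and the same made-up 3x3 matrix is reused for all 80 layers. The point is only that a fixed sequence of layers with frozen weights returns the same output for the same input.

```python
import math

def layer_norm(vec, eps=1e-5):
    """Rescale a vector to zero mean and unit variance (no learned gain or bias)."""
    mean = sum(vec) / len(vec)
    var = sum((v - mean) ** 2 for v in vec) / len(vec)
    return [(v - mean) / math.sqrt(var + eps) for v in vec]

def toy_layer(x, weight):
    """Stand-in for one layer: a fixed linear map, a residual add, then normalize."""
    update = [sum(w * v for w, v in zip(row, x)) for row in weight]
    return layer_norm([a + b for a, b in zip(x, update)])

def run_stack(x, layers):
    """Apply every layer in order; nothing is remembered between calls."""
    for weight in layers:
        x = toy_layer(x, weight)
    return x

# Hypothetical fixed weights, reused for every layer.
weight = [[0.1, 0.2, 0.0], [0.0, 0.1, 0.2], [0.2, 0.0, 0.1]]
layers = [weight] * 80

first = run_stack([1.0, 2.0, 3.0], layers)
second = run_stack([1.0, 2.0, 3.0], layers)
print(first == second)
```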
What This Means for Engineering Decisions
Understanding that an LLM's forward pass is a fixed function of its weights and inputs, with randomness entering only at the sampling step, has practical implications:
1. Temperature is a sampling knob, not an effort knob. The model doesn't "try harder" on complex questions. It runs the same forward pass for every token regardless. Temperature (typically 0.0-1.0, though some APIs accept higher values) rescales the probabilities before a token is sampled; related settings such as top-k and top-p trim the list of candidates. Temperature 0 = always pick the most probable token (greedy and repeatable, good for code). Temperature 1.0 = sample in proportion to the model's own probabilities. Temperature 0.7 = a sharper distribution that still leaves room for variety (good for writing). Temperature 1.5 = the output drifts toward nonsense because you're amplifying low-probability tokens.
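2. Context window is RAM, not disk. The 128K tokens you can feed into a model aren't "stored" anywhere in the weights: every newly generated token attends over all of them. Implementations cache intermediate results, but the work per token still grows with the context. This is why long contexts are expensive: attention is O(n²) in sequence length, so doubling the context quadruples the attention work, since (128K / 64K)² = 4. The attention cost of a 128K context is roughly 4x that of a 64K context, not 2x.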
3. Fine-tuning changes weights, RAG changes inputs. When you fine-tune a model, you're adjusting the weights to shift the probability distribution toward your domain — legal contracts, medical notes, textile specifications. When you use RAG (retrieval-augmented generation), you're adding relevant documents to the input context without changing weights. Fine-tuning is surgery. RAG is a briefing document. For most Indian business applications, RAG is the right first step — it's cheaper, faster, and doesn't require a dataset of 10,000+ examples.
4. "Hallucination" isn't a bug: it's how the model works. The model doesn't "know" it's making something up. It's always producing a plausible next token given its weights and the context in front of it. When the training data didn't contain the answer, the model still predicts, and the prediction is fluent but wrong. This is why RAG matters so much for factual applications: you feed the model the facts as context so its next-token predictions are anchored in the supplied text rather than in whatever the weights happen to encode.
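The "RAG changes inputs" idea from points 3 and 4 can be sketched without any model at all, because the whole technique happens before the forward pass. Everything here is hypothetical: the documents, the `retrieve` and `build_prompt` helpers, and the word-overlap scoring, which stands in for the embedding search a production system would more likely use.

```python
import re

def tokenize(text):
    """Lowercase word set; a crude stand-in for a real embedding model."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question, documents, top_k=1):
    """Rank documents by how many words they share with the question."""
    q_words = tokenize(question)
    ranked = sorted(documents, key=lambda d: len(q_words & tokenize(d)), reverse=True)
    return ranked[:top_k]

def build_prompt(question, documents):
    """Put the retrieved text into the input; the weights are never touched."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents))
    return (
        "Answer using only the context below. "
        "If the answer is not there, say you don't know.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )

# Hypothetical knowledge base for illustration.
docs = [
    "Refunds are processed within 7 working days of approval.",
    "Invoices are due within 30 days of the billing date.",
    "The office is closed on public holidays.",
]
prompt = build_prompt("When are invoices due?", docs)
print(prompt)
```

The resulting string is what gets tokenized and sent to the model. Swapping the documents changes the answer without any retraining, which is the practical difference from fine-tuning.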
The Practical Takeaway
You don't need to implement a transformer to use LLMs effectively. But you do need to internalize what they are and aren't:
- They aren't databases — don't expect exact recall without RAG
- They aren't reasoning engines — don't expect multi-step logic to always converge
- Their sampled output isn't deterministic (the forward pass is, the sampling step is not), so don't expect the same output twice unless temperature=0
- They are pattern matchers — and patterns at web scale turn out to be useful for a lot of things
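The temperature=0 caveat in the list above comes down to one division before the softmax. A minimal sketch with made-up logits for four candidate tokens: low temperatures concentrate probability on the top token, high temperatures spread it out, and temperature 0 is treated as a plain argmax rather than a division by zero.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a softmax."""
    if temperature <= 0:
        raise ValueError("temperature 0 is handled as greedy argmax, not softmax")
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0, 0.0]  # hypothetical scores for four candidate tokens
for t in (0.2, 0.7, 1.0, 1.5):
    probs = softmax_with_temperature(logits, t)
    print(f"T={t}: " + ", ".join(f"{p:.3f}" for p in probs))

# The T=0 limit: always take the highest-scoring token, so no randomness remains.
greedy = max(range(len(logits)), key=lambda i: logits[i])
print("greedy pick: index", greedy)
```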
The manager in Max Leiter's dialogue asks, "So what do these weights have in mind?" The answer is: nothing. They don't have minds. They have a probability distribution over the next token, calibrated across trillions of examples, and optimized to produce outputs that look like they were written by someone who does have a mind.
That's the paradox. And it's why every engineer who uses these tools should understand the weights behind them — not to build their own model, but to know when to trust the output and when to reach for a deterministic alternative.