
As LLMs get better, do we still need RAG?

Julian Seidenberg
Published 28 May 2025

🤯 Meta's Llama 4 Scout Large Language Model (LLM) claims a 10 Million token context window!  Is this real? How did they pull that off?  If so, is Retrieval-Augmented Generation (RAG) dead? Will the AI just be able to perfectly remember everything all at once? I did a deep dive on this. Here's what I learnt:

Just How Big is 10 Million Tokens?

Let’s put that in perspective. The average English word is ~1.3 tokens. This means a 10M token context could theoretically hold:

  • Robert Jordan’s "Wheel of Time" book series (4.1M tokens)
  • George R.R. Martin's "A Song of Ice and Fire" series (2.2M tokens)
  • J.K. Rowling's "Harry Potter" series (1.3M tokens)
  • William Shakespeare’s complete works (1.2M tokens)
  • J.R.R. Tolkien's "Lord of the Rings" series (0.8M tokens)

Total: 9.6M tokens.
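
For those who like to see the arithmetic, here is a quick sketch in Python; the per-series token estimates are the figures listed above, and the 1.3 tokens-per-word ratio is the same rough rule of thumb.

TOKENS_PER_WORD = 1.3  # rough average for English text

token_estimates_m = {  # millions of tokens, per the estimates above
    "Wheel of Time": 4.1,
    "A Song of Ice and Fire": 2.2,
    "Harry Potter": 1.3,
    "Complete works of Shakespeare": 1.2,
    "Lord of the Rings": 0.8,
}

total_m = sum(token_estimates_m.values())
print(f"Total: {total_m:.1f}M tokens")                    # 9.6M tokens
print(f"Roughly {total_m / TOKENS_PER_WORD:.1f}M words")  # ~7.4M words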

Why stop at 10M? Many big-tech CEOs are talking about building models with infinite context windows. Mark Zuckerberg has boasted about a nearly infinite context window, and Sergey Brin has responded to rumours that Google Gemini will offer an infinite context window. With such big context windows, who needs RAG and a vector database? You can just feed in all your context and ask questions, right?

Well, not so fast!

What’s the catch?

There are trade-offs with increasing the context window size:

  1. Computational Cost & Memory: While a model like Llama 4 Scout can run on a single NVIDIA H100 GPU for certain tasks, efficiently using its super-long context requires serious horsepower. Meta benchmarked the 10M-token window model on a cluster of 512 H100 GPUs. Renting 64 × p5.48xlarge instances (each with 8 H100s) on AWS to match that setup costs roughly USD $150,000 per day! (A rough sketch of this arithmetic follows this list.)
  2. The "Needle in a Haystack" Problem: Long context window models like Llama 4 can successfully find a specific piece of information (the "needle") within the vast "haystack" (often text). For instance, if we insert "Hamlet said: I'm Batman" somewhere into the 9.6M token text, then ask the LLM “What’s the Dark Knight’s secret identify?”, it likely would do an impressive job of finding the “correct” answer. However, if we then try and ask a question that requires reading and understanding the full context, that question would likely result in a bad answer. Something like: “Analyze every major character across all the books given and give me a summary of which ones have similar personalities”. The model would likely forget key details, hallucinate, and forget the original instruction midway through the response. Maintaining coherent understanding across such a massive volume of information is incredibly difficult. (For related benchmarks showing this, see: https://fiction.live/stories/Fiction-liveBench-Mar-25-2025/oQdzQvKHw8JyXbN87)
  3. Model Size vs. True Understanding: Llama 4 Scout is a relatively small model with 17 billion active parameters. That small size means it will likely struggle to remember long prompts and maintain a nuanced understanding of the given context. Larger models and reasoning models generally have a better capacity for detailed understanding and interpretation. However, these larger models also demand more memory and more compute, which makes extremely long context windows challenging to implement in practice. This is why the larger Llama 4 models have shorter context windows: Llama 4 Maverick supports 1M tokens, and Llama 4 Behemoth was likely used only for Knowledge Distillation and is therefore likely to be deployed mainly for specialised use cases.
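
To put point 1 in concrete terms, here is a rough sketch of the cost arithmetic in Python. The hourly rate is an assumed on-demand list price for a p5.48xlarge instance; actual AWS pricing varies by region, reservation type, and over time.

GPUS_PER_INSTANCE = 8      # a p5.48xlarge bundles 8 NVIDIA H100 GPUs
TOTAL_GPUS = 512           # the cluster size Meta benchmarked against
HOURLY_RATE_USD = 98.32    # assumed on-demand price per instance-hour

instances = TOTAL_GPUS // GPUS_PER_INSTANCE                   # 64 instances
daily_cost = instances * HOURLY_RATE_USD * 24
print(f"{instances} instances ≈ ${daily_cost:,.0f} per day")  # ≈ $151,000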

So, in summary, if we want to cost-effectively ask non-trivial questions of a large text, RAG using a Vector database is almost certainly the better solution to investigate. Even better is using some kind of LLM memory or knowledge graph technology to intelligently retrieve the most relevant text for the LLM to reason about. For example, at Datch, our team has developed Cortex, a system that combines a knowledge graph with a sophisticated data ingestion and enrichment pipeline to retrieve the most relevant information for LLM reasoning.

That said, a 10M context window is a very impressive achievement. I was very curious to understand how Meta managed to build such a model. I doubt there are many coherent 10M token long training examples the model can learn from. What’s the trick?

How Does the Underlying Technology Work?

(This section is a bit technical. If you are a less technical reader, please feel free to skip ahead to the “Key Implications” section below)

The magic behind these extended context windows lies in how the position of each token is encoded when the context is passed to the model. Here’s how it works:

Positional Encoding: Classic Sinusoidal vs. RoPE + PI

To understand a 10M token model, we first need to understand how Sinusoidal Positional Encoding (SPE) works. When we pass text into an LLM, the text is first encoded into numbers (tokens), each representing a word or word fragment. These tokens are then embedded into vectors that represent each word’s meaning in isolation. However, the LLM also needs to understand where each embedded token is positioned in the input text. Without that understanding, the meaning of the words is potentially unclear. Unordered text has no meaning. No meaning text unordered has. Meaning text no has unordered (see what I did there).

Enter the sinusoidal positional encoding. Here we take the position of each token in the input (0, 1, 2, 3, etc.) and encode it as a pattern of values between -1 and +1 using sine and cosine functions, then add those values to the dimensions of each embedded token’s vector. This allows the later multi-headed attention transformer layers that underpin every LLM to see correlations between word meanings and the tokens’ relative positions.

Classic Sinusoidal Positional Encoding (SPE)

  • How it works: SPE adds a deterministic vector, generated using sine and cosine functions of a token's absolute position, to each token embedding. Determinism ensures that the transformer neural network can learn correlations between word order and word meaning.
  • Limitations: The periodic nature of the sine functions can make it difficult for the model to distinguish between close and far positions, and also makes it difficult for the model to generalize when sequence lengths exceed those seen during training. This is because the fixed oscillating sine wave patterns do not extrapolate well to very distant, unseen positions.
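
To make the classic approach concrete, here is a minimal NumPy sketch of sinusoidal positional encoding in the style of the original Transformer paper; it is an illustration of the idea, not any particular model’s implementation.

import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Classic fixed positional encoding: sine on even dims, cosine on odd dims."""
    positions = np.arange(seq_len)[:, None]                # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]               # (1, d_model/2)
    angles = positions / np.power(10000, dims / d_model)   # (seq_len, d_model/2)

    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)
    encoding[:, 1::2] = np.cos(angles)
    return encoding

# The encoding is simply added to the token embeddings before the first
# attention layer, so word order becomes visible to the model.
embeddings = np.random.randn(16, 512)   # 16 tokens, 512-dim embeddings
inputs = embeddings + sinusoidal_positional_encoding(16, 512)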

Rotary Position Embedding (RoPE) + Positional Interpolation (PI)

  • How RoPE works: Instead of adding, RoPE rotates each token’s embedding dimensions by an angle θ that varies with the token's position in the input text. The rotation deterministically scales with the distance between tokens: words that are close together end up close in rotational space, while words that are further apart are rotated further apart.
  • Why it matters: This property of preserving relative rotational position holds for any relative position in the input text. That means that, unlike with sinusoidal positional embeddings, a transformer with RoPE positional embeddings can learn correlations between relative positions that are unaffected by adding or removing surrounding context. An LLM sees the world as correlation. So, from the RoPE LLM’s point of view, “Hamlet said: I’m Batman” and “[All of Harry Potter] Hamlet said: I’m Batman [All of Lord of the Rings]” look the same.
  • How this results in extra-long context windows: Meta researchers introduced the idea of Positional Interpolation (PI). This linearly scales down the position indices when dealing with contexts longer than the original training length. Essentially, it "squeezes" more positions into the original range of angular rotations RoPE was trained on, allowing the model to handle much longer sequences. Imagine a model trained on texts of up to 16k tokens. Each position from 0 to 16k is encoded into a certain rotation angle from 0 to 360 degrees (each subsequent input token is rotated by a further 0.0225 degrees). Then imagine we scale the input size to 10M tokens. Now we squeeze the huge input into the same 0 to 360 degrees of rotation (each subsequent input token is now rotated by only 0.000036 degrees). We have interpolated the huge input context into the same relative-distance rotational space that the LLM is already familiar with. All it takes is a bit of fine-tuning to make sure the LLM can pay attention to the finer-grained distinctions between each rotation. (A simplified sketch of the rotation and interpolation follows this list.)
  • Advantages: Relative positions remain stable regardless of where text is inserted or how long the sequence becomes. This makes RoPE + PI ideal for extrapolating far beyond the trained window and is key to achieving these massive context lengths. RoPE also has great benefits when adding a Key-Value (KV) cache to an LLM. LLMs generate answers token by token, and when generating a multi-token answer, most of the context for each subsequent token has already been interpreted during a previous pass through the model. RoPE makes caching that information easier to do.
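
To make the rotation and interpolation ideas concrete, here is a simplified NumPy sketch: it rotates consecutive pairs of embedding dimensions by a position-dependent angle, then rescales the position indices for Positional Interpolation. This is an illustration of the technique under simplifying assumptions, not Llama 4's actual implementation.

import numpy as np

def rope_rotate(x: np.ndarray, positions: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate consecutive pairs of dimensions by an angle that grows with position."""
    seq_len, d_model = x.shape
    # One frequency per dimension pair, as in standard RoPE.
    freqs = 1.0 / (base ** (np.arange(0, d_model, 2) / d_model))  # (d_model/2,)
    angles = positions[:, None] * freqs[None, :]                  # (seq_len, d_model/2)

    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    rotated = np.empty_like(x)
    rotated[:, 0::2] = x_even * np.cos(angles) - x_odd * np.sin(angles)
    rotated[:, 1::2] = x_even * np.sin(angles) + x_odd * np.cos(angles)
    return rotated

# Positional Interpolation: squeeze a longer sequence into the trained range
# by scaling the position indices down before applying the rotation.
trained_len, new_len = 16_000, 10_000_000
positions = np.arange(new_len, dtype=np.float64)
interpolated_positions = positions * (trained_len / new_len)

x = np.random.randn(8, 64)   # 8 tokens, 64-dim embeddings
rotated = rope_rotate(x, interpolated_positions[:8])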

Key Implications

Long context windows like Meta’s Llama 4 Scout’s 10M-token window are real and excel at "needle in a haystack" type questions. However, they are less effective for questions requiring deep, holistic reasoning across the entire context. They also require a significant amount of GPU memory to work, and that memory comes at a very high price.

Therefore, Knowledge Graphs, Vector DBs and Retrieval Augmented Generation (RAG) are far from dead. They remain very much alive and helpful for many practical uses of LLMs. A good solution searches and filters a large context space to provide only the most relevant context. That relevant context can then fit comfortably within an LLM's native context length, which is, in turn, critical for obtaining well-reasoned, accurate answers to complex questions.
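
As a minimal sketch of what “search and filter, then hand the LLM only the relevant context” can look like, here is a simplified retrieval step using cosine similarity over pre-computed chunk embeddings. The embed() function and the chunk store are hypothetical placeholders for whatever embedding model and vector database a real system would use; this is not a description of Datch’s Cortex.

import numpy as np

def top_k_chunks(query_vec, chunk_vecs, chunks, k=5):
    """Return the k chunks whose embeddings are most similar to the query."""
    chunk_matrix = np.array(chunk_vecs)
    sims = chunk_matrix @ query_vec / (
        np.linalg.norm(chunk_matrix, axis=1) * np.linalg.norm(query_vec)
    )
    best = np.argsort(sims)[::-1][:k]
    return [chunks[i] for i in best]

# Hypothetical usage: embed() stands in for any embedding model, and the
# retrieved chunks become the small, relevant context passed to the LLM.
# context = "\n".join(top_k_chunks(embed(question), vectors, chunks))
# prompt = f"Context:\n{context}\n\nQuestion: {question}"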

Please get in touch with us at Datch if you would like to learn more about how we use this kind of technology to navigate extremely large amounts of context to give industrial workers highly relevant information when they need it most.
