Every once in a while, we find ourselves led down a rabbit hole. Recently, I came across a video about RAG (Retrieval Augmented Generation) for large context LLMs. RAG is a strategy for loading data from external sources during processing or inference. The video was good, but it surprised me by leading me to something tangential, something I didn't expect.
In the video, the speakers described an interesting behavior they observed in large context LLMs. They conducted the "Needle in a Haystack" test, where they randomly scattered semantically out-of-place tokens (in their case, the names of pizza ingredients) throughout a large set of documents and used RAG to load the documents.
The best-performing LLMs were able to find these tokens quickly. However, there was a curious pattern: their accuracy was higher for documents loaded last, at the end of the context window. Initially, this was thought to be a form of recency bias, the tendency to give more weight to recent events, but there's more to the story. The LLMs failed the test more often when the prompt included an additional step, such as doing something with the ingredients once they were found. That doesn't align with recency bias. To me it seems more like a multitasking limitation.
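To make the setup concrete, here is a rough sketch of what a test like that might look like. The filler text, the needle sentences, and the prompt wording are all invented for illustration; the actual test in the video ran over a large document set loaded through RAG.

```python
import random

needles = [
    "The secret ingredient is prosciutto.",
    "The secret ingredient is goat cheese.",
    "The secret ingredient is figs.",
]

def build_haystack(filler_paragraphs, needle_sentences):
    """Scatter the out-of-place sentences at random positions in the filler text."""
    paragraphs = list(filler_paragraphs)
    for needle in needle_sentences:
        paragraphs.insert(random.randint(0, len(paragraphs)), needle)
    return "\n\n".join(paragraphs)

filler = [f"Filler paragraph {i} about something unrelated." for i in range(200)]
context = build_haystack(filler, needles)

# One-step prompt: just find the ingredients.
find_prompt = context + "\n\nList every pizza ingredient mentioned above."

# Two-step prompt: find them and then do something with them. This is the
# variant that reportedly tripped the models up more often.
find_and_use_prompt = (
    context
    + "\n\nList every pizza ingredient mentioned above, "
    + "then write a one-line recipe that uses all of them."
)
```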
Why was this happening? Although anthropomorphizing LLMs is often frowned upon, it's worth noting the similarity between their behavior and the way our minds work. I don't think it is a coincidence. The neural networks in our brains and those in software are similar enough to produce comparable emergent behaviors.
One possible explanation is the concept of finite focus. Just as humans can typically hold only 7 ± 2 items in mind at once (Miller's Law), LLMs may have limits on what they can attend to or process simultaneously. When attention moves, other things fall out of focus, much like how we forget where we placed our keys when we are distracted by other tasks.
Is this a good model? I don't know, but the evidence does seem to support it. Let's see if it can explain some other behaviors.
The Chain of Thought (CoT) prompting technique helps LLMs approach problem-solving in a step-by-step manner. For example, you can ask an LLM to break down a large code refactoring into a series of steps. It may require an example to understand the task, but sometimes simply appending "please do this step by step" to the prompt is enough to switch the LLM into a procedural mode. Our model seems to explain this. LLMs encounter examples of procedural reasoning many times during training. They may just need a reminder to engage in that style of processing.
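As a rough illustration, here is what that nudge can look like in a prompt. The code, the sample function, and the prompt wording are my own sketch, not tied to the video or to any particular API.

```python
# Illustrative only: two ways of asking for the same refactoring.
source_code = '''
def report(orders):
    total = sum(o.price * o.qty for o in orders)
    print(f"Total: {total}")
'''

# Direct request: the model may jump straight to a rewritten function.
direct_prompt = (
    "Refactor this function to separate I/O from calculation:\n" + source_code
)

# CoT-style request: the extra instruction nudges the model into a
# procedural mode, listing steps and applying them one at a time.
cot_prompt = (
    "Refactor this function to separate I/O from calculation. "
    "Please do this step by step: list the refactoring steps first, "
    "then apply them one at a time.\n" + source_code
)
```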
Another example is the Directional Stimulus prompting technique. It involves dropping hints about an answer or a related area of thought. Though it has a fancy name, it's essentially the same as giving clues in a guessing game without directly revealing the answer.
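In prompt form, the stimulus is just a few steering keywords added to the request. The passage placeholder, the hint words, and the variable names below are made up for illustration.

```python
passage = "..."  # the text you want the model to work with

# The hint keywords nudge the model toward the intended area
# without giving the answer away.
hints = "attention, context window, recency"

prompt = (
    "Summarize the following passage.\n\n"
    + passage
    + "\n\nHint: " + hints
)
```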
The model seems to work to some degree, but what can it do for us? For me, at least, it has led to a different way of looking at LLM behavior. I call it surfacing.
Imagine that your mind (or an LLM) is a set of connections among ideas. If there is an idea you want to work with, you can pull it toward you and use it, but because of the connections, other related ideas move toward you too. They are closer to your awareness and more likely to surface during an interaction or (in the case of an LLM) a response to a prompt. Awareness is constantly moving; the ideas you had a few minutes ago become a little less retrievable, and new ones take their place.
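Here is a toy sketch of that picture, a little spreading-activation model. It is purely illustrative: the ideas, the numbers, and the update rule are invented, and nothing about it reflects how a real LLM's attention is implemented.

```python
# Toy "surfacing" model: pulling on one idea boosts it, some of the boost
# spreads to connected ideas, and everything else slowly fades.
connections = {
    "keys": ["door", "car"],
    "door": ["keys", "house"],
    "car": ["keys", "garage"],
    "house": ["door"],
    "garage": ["car"],
}
activation = {idea: 0.0 for idea in connections}

def focus_on(idea, boost=1.0, spread=0.5, decay=0.8):
    """Attend to one idea; related ideas surface a little, older ones fade."""
    for other in activation:
        activation[other] *= decay               # ideas from a few minutes ago recede
    activation[idea] += boost                    # the idea you pulled toward you
    for neighbor in connections[idea]:
        activation[neighbor] += boost * spread   # connected ideas rise too

focus_on("keys")
focus_on("garage")
print(sorted(activation.items(), key=lambda kv: -kv[1]))
```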
I wonder whether there is a way to quantify the strength and capacity of this focus. Some LLMs are able to handle more at once than others. No doubt it is a function of the architecture and the design of the attention mechanism. Regardless, thinking about focus and surfacing helps me in my interactions with LLMs. It might help you too.
Over the past year, I've been grappling with the issue of LLMs and software development, particularly in the areas of refactoring and testing. The landscape is changing rapidly, and companies have adopted a wide range of stances toward AI. Some have banned it completely, while others allow selective use.
Fortunately, there are ways of using AI with code that may overcome many of the concerns.
Soon I’ll be running a seminar about what I’ve learned.
Subscribe to get on the announcement list. Alternatively, send me an email or a DM on your favorite platform.