Today’s large language models (LLMs) limit how much information you can input before they return a result. Google has unveiled a way to change that: a method that allows LLMs to accept an infinite amount of text. The technique, called Infini-attention, keeps memory and computational requirements bounded as the input grows, which could make LLMs both more efficient and more capable.
“An effective memory system is crucial not just for comprehending long contexts with LLMs, but also for reasoning, planning, continual adaptation for fresh knowledge, and even for learning how to learn,” the authors wrote in a research paper accompanying their announcement.
Context windows play a central role in how LLMs operate, and as of this writing, all popular AI models, including OpenAI’s GPT-4 and Anthropic’s Claude 3, have a finite context window. Claude 3, for example, allows for up to 200,000 tokens, the words and word fragments a model breaks text into, in a single query. The GPT-4 Turbo version of GPT-4 allows for 128,000 tokens.
The more tokens a context window allows, the more data users can input to generate their desired result. LLM creators therefore try to increase the number of tokens with each new iteration to make their models more effective at learning, understanding, and delivering results.
To do so, however, tech companies need to account for memory and computing requirements. With every doubling of an LLM’s context window, the memory and computational requirements increase by a factor of four, the Google researchers wrote, because standard attention compares every token with every other token. Each such increase is not just resource intensive, but exceedingly expensive.
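The factor-of-four figure follows from simple arithmetic, sketched below. The function name is hypothetical, and the count assumes standard dense attention in which every token is scored against every other token; it ignores constants such as the number of layers and attention heads.

```python
def attention_score_entries(context_tokens):
    # Assumption: dense self-attention scores every token against every
    # other token, so the number of scores grows with the square of the
    # context length.
    return context_tokens * context_tokens


for tokens in (32_000, 64_000, 128_000):
    entries = attention_score_entries(tokens)
    # Ratio of the cost at double the context length to the cost at this length.
    ratio = attention_score_entries(2 * tokens) / entries
    print(f"{tokens:>7} tokens -> {entries:>17,} scores; doubling costs {ratio:.0f}x")
```

Whatever the starting length, doubling the window multiplies the count by four, which is why longer context windows get expensive so quickly.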
Google’s Infini-attention addresses this problem by keeping memory and computational requirements fixed. When the researchers fed a model more input than its context window could hold, they moved the data that had filled the window into what’s called “compressive memory” and removed it from active memory, which was then freed up for the additional context. Once all of the data was entered, the model was able to pair the compressive memory with the input in its active memory to deliver a response. This technique enables “a natural extension of existing LLMs to infinitely long contexts via continual pre-training and finetuning,” the researchers wrote.
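To make the idea of a fixed-size compressive memory concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not Google’s implementation: the class and function names are hypothetical, the memory is modeled as a small associative matrix updated with a linear-attention-style rule, and the learned parts of a real model (projections, training, the local attention itself) are left out.

```python
import math


def elu_plus_one(x):
    # Assumed feature map: ELU(x) + 1, which keeps every feature positive.
    return x + 1.0 if x > 0 else math.exp(x)


class CompressiveMemory:
    """Hypothetical fixed-size memory: a d_key x d_value matrix plus a normalizer."""

    def __init__(self, d_key, d_value):
        self.d_key = d_key
        self.d_value = d_value
        self.matrix = [[0.0] * d_value for _ in range(d_key)]
        self.norm = [0.0] * d_key

    def update(self, keys, values):
        # Fold one segment's key/value pairs into the memory. The matrix
        # never grows, no matter how many segments are absorbed.
        for key, value in zip(keys, values):
            for i, k in enumerate(key):
                feature = elu_plus_one(k)
                self.norm[i] += feature
                for j, v in enumerate(value):
                    self.matrix[i][j] += feature * v

    def retrieve(self, query):
        # Read back a normalized blend of the stored values for this query.
        features = [elu_plus_one(q) for q in query]
        denom = sum(f * n for f, n in zip(features, self.norm))
        if denom == 0.0:
            return [0.0] * self.d_value  # nothing stored yet
        return [
            sum(features[i] * self.matrix[i][j] for i in range(self.d_key)) / denom
            for j in range(self.d_value)
        ]

    def num_floats(self):
        return self.d_key * self.d_value + self.d_key


def gated_mix(memory_out, local_out, gate_logit):
    # Assumed combination step: a sigmoid gate blends what was retrieved from
    # compressive memory with the output of attention over active memory.
    gate = 1.0 / (1.0 + math.exp(-gate_logit))
    return [gate * m + (1.0 - gate) * l for m, l in zip(memory_out, local_out)]


# Absorb four segments of made-up data and record the memory size each time.
memory = CompressiveMemory(d_key=2, d_value=3)
sizes = []
for segment in range(4):
    keys = [[0.1 * segment, -0.2], [0.3, 0.05 * segment]]
    values = [[1.0, 0.0, float(segment)], [0.0, 1.0, 2.0]]
    memory.update(keys, values)
    sizes.append(memory.num_floats())

print(sizes)  # [8, 8, 8, 8]
```

The point of the sketch is the last line: the memory holds the same eight numbers after the fourth segment as after the first, so the cost of remembering old context does not grow with the length of the input.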
Armed with the ability to put as much context into their models as they wished, the researchers compared their Infini-attention technique against existing LLMs and found their option was superior. “Our approach can naturally scale to a million length regime of input sequences, while outperforming the baselines on long-context language modeling benchmark and book summarization tasks,” the researchers wrote.
The paper reports benchmark results, but the researchers didn’t release the code or models that would let others confirm that their method indeed performs better than existing models. It stands to reason, however, that if they can eliminate context window limitations, models equipped with this technique should outperform those with limits in place.
Google’s technique could pave the way for dramatic improvements in LLM performance, allowing companies to create new applications, generate additional insights, and more. For now, though, Infini-attention is purely research. It’s unclear whether the technique will make its way to broadly available LLMs.