Google's new method makes LLMs faster and more powerful, and cheaper too

Large language models (LLMs) have taken the world by storm since late 2022, when OpenAI released ChatGPT, initially powered by GPT-3.5. They are widely used for tasks like coding and search, but the process of generating a response, known as inference, is slow and computationally expensive. As more people start using LLMs, making them faster and more affordable without sacrificing quality is a critical challenge for LLM makers.

Two existing methods can potentially speed up LLMs: cascades and speculative decoding. Cascades try a smaller, faster model first and engage a larger, more expensive one only when needed. This reduces computational cost, but it creates a sequential wait-and-see bottleneck: if the small model isn’t confident in its answer, the large model has to start from scratch after the small one has finished, which can be slow. Output quality can also vary, depending on which model ends up answering.
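
As a rough illustration of the cascade idea, here is a minimal Python sketch. The model interface (a function returning an answer plus a confidence score), the 0.8 threshold, and the two toy models are all assumptions made for this example, not details from Google's work.

```python
from typing import Callable, Tuple

# Hypothetical model interface: prompt in, (answer, confidence in [0, 1]) out.
Model = Callable[[str], Tuple[str, float]]


def cascade(prompt: str, small: Model, large: Model,
            threshold: float = 0.8) -> Tuple[str, str]:
    """Return (answer, name of the model that produced it)."""
    answer, confidence = small(prompt)
    if confidence >= threshold:
        # The cheap model is confident enough, so the large model never runs.
        return answer, "small"
    # The large model only starts after the small one has finished:
    # this is the sequential wait-and-see bottleneck.
    answer, _ = large(prompt)
    return answer, "large"


# Toy stand-ins for real models (assumptions for illustration only).
def toy_small(prompt: str) -> Tuple[str, float]:
    return ("4", 0.95) if prompt == "2+2?" else ("not sure", 0.30)


def toy_large(prompt: str) -> Tuple[str, float]:
    return ("a careful answer", 0.99)
```

With these toy models, an easy prompt is answered by the small model alone, while anything else pays for both models one after the other.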

On the other hand, speculative decoding uses a small “drafter” model to predict a run of upcoming tokens, which the larger model then verifies in parallel. It aims for speed, but it can throw away the rest of a draft at the first mismatched token, even if the small model’s answer was good. This can erase the initial speed advantage, and it results in no computational savings.
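
The strict verification step can be sketched as follows. This is a simplified, greedy-style version written for illustration: `target` stands for the token the large model would pick at each position given the draft so far, and both lists are made-up toy data rather than real model output.

```python
from typing import List


def verify_draft(draft: List[str], target: List[str]) -> List[str]:
    """Keep draft tokens until the first disagreement with the large model.

    At the first mismatch, the large model's token is used instead and
    every later draft token is discarded.
    """
    accepted: List[str] = []
    for draft_token, target_token in zip(draft, target):
        if draft_token != target_token:
            accepted.append(target_token)  # correction from the large model
            break                          # the rest of the draft is wasted
        accepted.append(draft_token)
    return accepted


# Toy example: the two models disagree on the third token.
draft = ["the", "cat", "sat", "down"]
target = ["the", "cat", "lay", "down"]
result = verify_draft(draft, target)
```

Here only the first two draft tokens survive. If the disagreement had been on the very first token, the whole draft would have been thrown away.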

Clearly neither method is ideal, so Google Research has developed a new approach called speculative cascades, which combines elements of both. The key innovation is a flexible deferral rule that dynamically decides whether to accept the small model’s draft tokens or defer to the large model. This avoids the sequential bottleneck of cascades and the strict token rejection of speculative decoding. The system can accept a good answer from the small model even if it doesn’t match the large model’s output, a match that speculative decoding normally requires.
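
To make the deferral idea concrete, here is a minimal sketch that replaces exact matching with one possible rule: keep a draft token if it matches the large model or if the small model was confident in it. The confidence-based rule, the 0.7 threshold, and the toy data are assumptions chosen for illustration; the article does not spell out the rules Google Research actually uses.

```python
from typing import List, Tuple


def speculative_cascade_step(draft: List[Tuple[str, float]],
                             target: List[str],
                             threshold: float = 0.7) -> List[str]:
    """Apply a hypothetical confidence-based deferral rule to one draft.

    draft  : (token, small-model confidence) pairs from the drafter
    target : the large model's preferred token at each position
    """
    output: List[str] = []
    for (token, confidence), target_token in zip(draft, target):
        if token == target_token or confidence >= threshold:
            # Accept the draft token, even when the large model disagrees.
            output.append(token)
        else:
            # Defer to the large model and drop the rest of the draft.
            output.append(target_token)
            break
    return output


target = ["the", "cat", "lay", "down"]
confident = [("the", 0.9), ("cat", 0.9), ("sat", 0.8), ("down", 0.9)]
unsure = [("the", 0.9), ("cat", 0.9), ("sat", 0.3), ("down", 0.9)]

kept = speculative_cascade_step(confident, target)
deferred = speculative_cascade_step(unsure, target)
```

With the confident draft, all four small-model tokens are kept despite the mismatch on the third one. With the unsure draft, the rule defers to the large model at that position, much as strict speculative decoding would.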

Google Research performed experiments on models including Gemma and T5 across various language tasks like summarization, reasoning, and coding. The results show that speculative cascades achieve better cost-quality trade-offs and higher speed-ups compared to the baselines. The method can also generate a correct solution faster than speculative decoding.

Right now, all this is still research, but if it proves to be effective, hopefully we will see it implemented to provide a better, and cheaper, experience for users.
