Implementing LLM Speculative Sampling in Under 100 Lines of Code
Introduction
Today we'll explore and implement DeepMind's paper "Accelerating Large Language Model Decoding with Speculative Sampling" [1]. I'll demonstrate how to reproduce this technique in under 100 lines of code while achieving more than a 2x speedup in inference time.
For example, using the following prompt (with temperature set to 0 to ensure deterministic results):
Prompt: Alan Turing theorized that computers would one day become
GPT2-XLARGE inference time is 16.8244 seconds:
-------------------- Naive GPT2 Auto-Regressive --------------------
Naive GPT2 Auto-Regressive completed in 16.8244 s
Generated: Alan Turing theorized that computers would one day become so powerful that they would be able to think like humans.
In the 1950s, he proposed a way to build a computer that could think like a human.
(skip ...)
-------------------- Naive GPT2 Auto-Regressive --------------------
With speculative sampling, we can generate exactly the same output in just 7.9031 seconds, a 2.12x speedup:
-------------------- Speculative Sampling --------------------
Speculative Sampling completed in 7.9031 s
Generated: Alan Turing theorized that computers would one day become so powerful that they would be able to think like humans.
In the 1950s, he proposed a way to build a computer that could think like a human.
(skip ...)
-------------------- Speculative Sampling --------------------
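The banners above were produced by a small timing harness. Here is a minimal sketch of what such a harness might look like; the names timed, naive_generate, and speculative_generate are hypothetical placeholders, not toyllm's actual API:

```python
import time

def timed(label: str, generate_fn, *args, **kwargs) -> None:
    """Run a generation function once and print a timing banner."""
    start = time.perf_counter()
    text = generate_fn(*args, **kwargs)
    elapsed = time.perf_counter() - start
    banner = f"{'-' * 20} {label} {'-' * 20}"
    print(banner)
    print(f"{label} completed in {elapsed:.4f} s")
    print(f"Generated: {text}")
    print(banner)

# Usage (generation functions are placeholders):
# timed("Naive GPT2 Auto-Regressive", naive_generate, prompt)
# timed("Speculative Sampling", speculative_generate, prompt)
```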
Overview
Speculative sampling is a technique for accelerating large language model inference. Specifically, it trades space for time: we spend extra memory on a second model in exchange for faster decoding. Autoregressive inference with a large language model is slow, and the large model whose inference we want to speed up is called the "Target Model."
To accelerate inference, speculative sampling introduces a much smaller "Draft Model" that cheaply drafts several tokens ahead; the Target Model then verifies those drafts using a scheme similar to rejection sampling, which guarantees the generated distribution matches the Target Model's distribution exactly. Two key factors make speculative sampling effective (see the code sketch after this list):
- The Draft Model must be significantly faster than the Target Model
- The Draft Model and Target Model must have high similarity in their predictions
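To make the accept/reject mechanics concrete, here is a minimal sketch of one speculative-sampling step. This is an illustration, not the toyllm implementation: draft_probs_fn and target_probs_fn are hypothetical callables that return the next-token probability distribution for a given token sequence, and the target scoring is written one position at a time for readability (a real implementation scores all k+1 positions in a single batched forward pass).

```python
import numpy as np

def speculative_step(target_probs_fn, draft_probs_fn, prefix, k=4, rng=None):
    """One speculative-sampling step (a sketch, not the toyllm code)."""
    rng = rng or np.random.default_rng()

    # 1. Draft Model proposes k tokens autoregressively (cheap).
    drafted, draft_probs = [], []
    seq = list(prefix)
    for _ in range(k):
        q = draft_probs_fn(seq)                      # next-token distribution
        token = int(rng.choice(len(q), p=q))
        drafted.append(token)
        draft_probs.append(q)
        seq.append(token)

    # 2. Target Model scores all k+1 positions (written per-position here;
    #    in practice this is one batched forward pass).
    target_probs = [target_probs_fn(list(prefix) + drafted[:i]) for i in range(k + 1)]

    # 3. Accept/reject each drafted token so the output distribution
    #    exactly matches the Target Model's.
    accepted = list(prefix)
    for i, token in enumerate(drafted):
        p, q = target_probs[i], draft_probs[i]
        if rng.random() < min(1.0, p[token] / q[token]):
            accepted.append(token)                   # draft token accepted
        else:
            # On rejection, resample from the normalized residual max(0, p - q)
            # and stop; later drafts are discarded.
            residual = np.maximum(p - q, 0.0)
            residual /= residual.sum()
            accepted.append(int(rng.choice(len(residual), p=residual)))
            return accepted

    # 4. All k drafts accepted: take one bonus token from the Target Model.
    p_last = target_probs[k]
    accepted.append(int(rng.choice(len(p_last), p=p_last)))
    return accepted
```

The rejection rule, accepting a draft token with probability min(1, p/q) and otherwise resampling from the normalized residual max(0, p - q), is what guarantees the output tokens are distributed exactly as if the Target Model had generated them alone.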
I'll explain both of these factors in detail below. All experiments in this article are based on GPT2 models, with GPT2-XLARGE as the Target Model and GPT2-SMALL as the Draft Model. The complete implementation code is available on GitHub: ai-glimpse/toyllm
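GPT2-XLARGE and GPT2-SMALL correspond to the Hugging Face checkpoints gpt2-xl (~1.5B parameters) and gpt2 (~124M parameters). For a quick experiment outside the repo, the same pairing can be loaded with the transformers library; this is a sketch under that assumption, and toyllm itself may wire the models up differently:

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")           # both models share this vocabulary
target = GPT2LMHeadModel.from_pretrained("gpt2-xl").eval()  # Target Model, ~1.5B params
draft = GPT2LMHeadModel.from_pretrained("gpt2").eval()      # Draft Model, ~124M params

prompt = "Alan Turing theorized that computers would one day become"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

with torch.no_grad():
    # Next-token distributions from both models for the same prefix.
    p = torch.softmax(target(input_ids).logits[0, -1], dim=-1)
    q = torch.softmax(draft(input_ids).logits[0, -1], dim=-1)
```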