
Implementing LLM Speculative Sampling in Under 100 Lines of Code

Introduction

Today we'll explore and implement DeepMind's paper Accelerating large language model decoding with speculative sampling [1]. I'll demonstrate how to reproduce the technique in less than 100 lines of code while achieving a more than 2x speedup in inference time.

For example, using the following prompt (with temperature set to 0 to ensure deterministic results):

Prompt: Alan Turing theorized that computers would one day become

GPT2-XLARGE inference time is 16.8244 seconds:

-------------------- Naive GPT2 Auto-Regressive --------------------
Naive GPT2 Auto-Regressive completed in 16.8244 s
Generated: Alan Turing theorized that computers would one day become so powerful that they would be able to think like humans.

In the 1950s, he proposed a way to build a computer that could think like a human.
(skip ...)
-------------------- Naive GPT2 Auto-Regressive --------------------

With speculative sampling, we can generate the exact same inference result in just 7.9031 seconds - a 2.12x speedup:

-------------------- Speculative Sampling --------------------
Speculative Sampling completed in 7.9031 s
Generated: Alan Turing theorized that computers would one day become so powerful that they would be able to think like humans.

In the 1950s, he proposed a way to build a computer that could think like a human.
(skip ...)
-------------------- Speculative Sampling --------------------
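
For reference, here is a minimal sketch of how the greedy (temperature 0) autoregressive baseline can be timed. It uses the Hugging Face transformers GPT2-XLARGE checkpoint ("gpt2-xl") rather than the toyllm implementation, so the exact timings will differ from the numbers above:

import time
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl").eval()

prompt = "Alan Turing theorized that computers would one day become"
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

start = time.perf_counter()
with torch.no_grad():
    # Greedy decoding (do_sample=False) is equivalent to temperature 0.
    output_ids = model.generate(input_ids, max_new_tokens=100, do_sample=False)
print(f"Naive GPT2 Auto-Regressive completed in {time.perf_counter() - start:.4f} s")
print(tokenizer.decode(output_ids[0]))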

Overview

Speculative sampling is a technique for accelerating large language model inference. In essence, it trades space for time: we spend extra memory on a second, smaller model in order to cut inference latency. Inference with a large model is typically slow, and that cost is what we want to reduce. Throughout this article, the large model whose inference we are accelerating is called the "Target Model."

To accelerate inference, speculative sampling introduces a much smaller "Draft Model" that cheaply drafts several tokens ahead; the Target Model then verifies all of them in a single forward pass, using a technique similar to rejection sampling to ensure the generated distribution exactly matches the Target Model's distribution (the exact acceptance rule is spelled out right after the list below). There are two key factors that make speculative sampling effective:

  1. The Draft Model must be significantly faster than the Target Model
  2. The Draft Model and Target Model must have high similarity in their predictions
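
Concretely, the paper's modified rejection sampling works as follows. For each drafted token, let q be the probability the Draft Model assigned to it and p the probability the Target Model assigns to the same token at the same position. The drafted token is accepted with probability

\[ \min\left(1, \frac{p(\tilde{x})}{q(\tilde{x})}\right) \]

If it is rejected, a replacement token is sampled from the residual distribution obtained by normalizing max(0, p(x) - q(x)) over the vocabulary, and the current round stops there. If all drafted tokens are accepted, one extra token is sampled from the Target Model for free. This procedure provably yields exactly the same output distribution as sampling from the Target Model alone.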

I'll explain both of these factors in detail below. All experiments in this article are based on GPT2 models, with GPT2-XLARGE as the Target Model and GPT2-SMALL as the Draft Model. The complete implementation code is available on GitHub: ai-glimpse/toyllm
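
To make the procedure concrete, below is a compact sketch of one way the speculative sampling loop can be written on top of the Hugging Face GPT2 checkpoints. It is an illustration rather than the toyllm implementation: it samples at temperature 1 (for temperature 0, replace sampling with argmax), uses no KV cache and batch size 1, so it demonstrates the algorithm rather than the speedup.

import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

def speculative_sampling(prompt: str, max_new_tokens: int = 100, k: int = 4) -> str:
    """Sketch of the speculative sampling loop: GPT2-SMALL draft, GPT2-XLARGE target."""
    tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
    target = GPT2LMHeadModel.from_pretrained("gpt2-xl").eval()  # Target Model
    draft = GPT2LMHeadModel.from_pretrained("gpt2").eval()      # Draft Model

    ids = tokenizer(prompt, return_tensors="pt").input_ids[0]
    prompt_len = ids.shape[0]

    with torch.no_grad():
        while ids.shape[0] - prompt_len < max_new_tokens:
            n = ids.shape[0]  # prompt + tokens accepted so far

            # 1. Draft Model proposes k tokens autoregressively (cheap forward passes).
            draft_ids, draft_probs = ids.clone(), []
            for _ in range(k):
                logits = draft(draft_ids.unsqueeze(0)).logits[0, -1]
                probs = torch.softmax(logits, dim=-1)
                draft_probs.append(probs)
                draft_ids = torch.cat([draft_ids, torch.multinomial(probs, 1)])

            # 2. A single Target Model forward pass scores every drafted position.
            target_probs = torch.softmax(target(draft_ids.unsqueeze(0)).logits[0], dim=-1)

            # 3. Modified rejection sampling: accept a token with probability min(1, p/q).
            all_accepted = True
            for i in range(k):
                token = draft_ids[n + i]
                p = target_probs[n - 1 + i, token].item()
                q = draft_probs[i][token].item()
                if torch.rand(1).item() < min(1.0, p / q):
                    ids = torch.cat([ids, token.view(1)])  # accept the drafted token
                else:
                    # Reject: resample from the normalized residual max(0, p - q).
                    residual = torch.clamp(target_probs[n - 1 + i] - draft_probs[i], min=0)
                    ids = torch.cat([ids, torch.multinomial(residual / residual.sum(), 1)])
                    all_accepted = False
                    break

            # 4. If every drafted token was accepted, sample one bonus token from the target.
            if all_accepted:
                ids = torch.cat([ids, torch.multinomial(target_probs[-1], 1)])

    return tokenizer.decode(ids)

print(speculative_sampling("Alan Turing theorized that computers would one day become"))

The key point is step 2: the Target Model scores all k drafted positions in one forward pass, so the expensive model runs roughly once per k drafted tokens instead of once per generated token.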

Presentia: A Simple and Elegant Presentation Template

Why

I really can't get along with PowerPoint, and Keynote is no better; these tools are simply too complicated for me. Drag-and-drop tools come with lots of small annoyances that wear me down, such as never being quite sure whether two blocks of text are actually aligned. What I want is a simple tool that lets me focus on the content and automatically produces a clean, good-looking layout. At the same time, the source files should be plain text, so that I can use Git for version control.

My first answer to this problem was LaTeX's Beamer; my second is Typst's Touying.

KL Divergence in Deepseek GRPO

After Deepseek R1 was released, I noticed that the RL algorithm used in the paper is GRPO, which was originally proposed in the earlier Deepseek Math paper. The GRPO objective function is as follows:

\[ \begin{aligned} \mathcal{J}_{GRPO}(\theta) &= \mathbb{E}_{[q \sim P(Q), \{o_i\}_{i=1}^G \sim \pi_{\theta_{old}}(O\mid q)]} \frac{1}{G}\sum_{i=1}^G \frac{1}{|o_i|} \sum_{t=1}^{|o_i|} \Biggl\{ \min \Biggl[ \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})} \hat{A}_{i,t}, \text{clip}\Biggl( \frac{\pi_\theta(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta_{old}}(o_{i,t} \mid q, o_{i,<t})}, 1 - \epsilon, 1 + \epsilon \Biggr) \hat{A}_{i,t} \Biggr] \\ &\quad - \beta \, \mathbb{D}_{KL}\left[\pi_{\theta} \parallel \pi_{ref}\right] \Biggr\} \end{aligned} \]
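
The per-token KL term above is not computed in closed form; the Deepseek Math paper estimates it with the following unbiased estimator, which is guaranteed to be non-negative:

\[ \mathbb{D}_{KL}\left[\pi_{\theta} \parallel \pi_{ref}\right] = \frac{\pi_{ref}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta}(o_{i,t} \mid q, o_{i,<t})} - \log \frac{\pi_{ref}(o_{i,t} \mid q, o_{i,<t})}{\pi_{\theta}(o_{i,t} \mid q, o_{i,<t})} - 1 \]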

LLM in 2024

[Figure: a glimpse at the core components inside a precision machine (the Transformer internals of a GPT model). From bbycroft.net/llm]

Personally, I am completely disenchanted with AI / LLMs (Large Language Models). Even after ChatGPT's debut, and even today with LLMs sweeping through every field, I still don't believe there is anything genuinely "intelligent" here: I don't think today's LLMs can think, nor that they can truly create. I prefer to see today's LLMs as enormous yet precise machines: enormous enough to contain tens of billions of components, and precise enough to converse with humans and complete all kinds of complex tasks. Even so, I still believe we are living in a golden age of AI, an age in which AI can shine and substantially reshape the way we live.

zhplot: Making It Easy to Plot with Chinese Text in Python

Why

In a small number of day-to-day scenarios, I need Python to draw plots that contain Chinese text, and for speed and simplicity I usually reach for matplotlib. After spending half a minute writing the plotting code, it is genuinely frustrating to find the text in the figure rendered as a pile of empty boxes... The truth is that Chinese font support simply isn't a priority for many open-source libraries; a quick search in the community turns up a long list of issues about Chinese characters not displaying in figures.

All I wanted was to draw a plot, yet now I have to go search for how to install Chinese fonts and how to make these libraries find the fonts I've installed... Something that should have taken half a minute now costs ten-odd minutes of searching for solutions and fiddling with fonts. These "small but annoying" problems can really spoil your mood, to say nothing of how the context switch disrupts the rhythm of the original work. Solving this "small but annoying" problem is exactly the goal of the zhplot project.
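
For context, this is roughly what the manual workaround looks like in plain matplotlib. The font name used below is only an example and must already be installed on the system; finding and installing such a font is exactly the chore zhplot aims to remove:

import matplotlib.pyplot as plt

# Point matplotlib at a CJK-capable font that is already installed on the machine.
# "Noto Sans CJK SC" is just an example; the available font name varies by system.
plt.rcParams["font.sans-serif"] = ["Noto Sans CJK SC"]
# Keep the minus sign rendering correctly once a CJK font is in use.
plt.rcParams["axes.unicode_minus"] = False

plt.plot([1, 2, 3], [2, 4, 8])
plt.title("中文标题")  # the Chinese title now renders instead of empty boxes
plt.show()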