Lifelong Learning
Implementing LLM Speculative Sampling in Under 100 Lines of Code
Introduction
Today we'll explore and implement DeepMind's paper Accelerating large language model decoding with speculative sampling [1]. I'll demonstrate how to reproduce the technique in under 100 lines of code while achieving more than a 2x speedup in inference time.
For example, using the following prompt (with temperature set to 0 to ensure deterministic results):
Prompt: Alan Turing theorized that computers would one day become
GPT2-XLARGE inference time is 16.8244 seconds:
-------------------- Naive GPT2 Auto-Regressive --------------------
Naive GPT2 Auto-Regressive completed in 16.8244 s
Generated: Alan Turing theorized that computers would one day become so powerful that they would be able to think like humans.
In the 1950s, he proposed a way to build a computer that could think like a human.
(skip ...)
-------------------- Naive GPT2 Auto-Regressive --------------------
With speculative sampling, we can generate the exact same inference result in just 7.9031 seconds, a 2.12x speedup:
-------------------- Speculative Sampling --------------------
Speculative Sampling completed in 7.9031 s
Generated: Alan Turing theorized that computers would one day become so powerful that they would be able to think like humans.
In the 1950s, he proposed a way to build a computer that could think like a human.
(skip ...)
-------------------- Speculative Sampling --------------------
Overview
Speculative sampling is a technique for accelerating large language model inference; concretely, it trades space for time during decoding. Inference with a large model is typically slow, and it is this model, the one whose decoding we want to speed up, that we call the "Target Model."
To accelerate it, speculative sampling introduces a much smaller "Draft Model" that cheaply drafts several tokens ahead; a verification step similar to rejection sampling then guarantees that the generated distribution matches the Target Model's. Two key factors make speculative sampling effective:
- The Draft Model must be significantly faster than the Target Model
- The Draft Model and Target Model must have high similarity in their predictions
I'll explain both of these factors in detail below. All experiments in this article are based on GPT2 models, with GPT2-XLARGE as the Target Model and GPT2-SMALL as the Draft Model. The complete implementation code is available on GitHub: ai-glimpse/toyllm
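To make the verification step concrete, here is a minimal sketch of one speculative-sampling step in PyTorch. This is not the toyllm implementation: `target_model` and `draft_model` are assumed to be HF-style causal LMs exposing `.logits`, batch size is fixed at 1, and KV caching is omitted for clarity. (The experiments above use temperature 0, which makes acceptance deterministic; the sketch implements the general stochastic accept/reject rule from the paper.)

```python
import torch

@torch.no_grad()
def speculative_step(target_model, draft_model, input_ids, k=4):
    """One step of speculative sampling: draft k tokens with the small model,
    then verify them with a single forward pass of the large model."""
    # 1. Draft Model proposes k tokens auto-regressively (cheap).
    draft_ids, draft_probs = input_ids, []
    for _ in range(k):
        logits = draft_model(draft_ids).logits[:, -1, :]
        p = torch.softmax(logits, dim=-1)
        draft_probs.append(p)
        draft_ids = torch.cat([draft_ids, torch.multinomial(p, 1)], dim=-1)

    # 2. Target Model scores all k proposals in ONE forward pass (expensive,
    #    but amortized over up to k+1 accepted tokens).
    target_logits = target_model(draft_ids).logits
    n = input_ids.shape[1]
    target_probs = torch.softmax(target_logits[:, n - 1 : n + k - 1, :], dim=-1)

    # 3. Accept/reject each drafted token left to right.
    out = input_ids
    for i in range(k):
        tok = draft_ids[:, n + i]            # i-th drafted token
        p = target_probs[0, i, tok]          # target prob of that token
        q = draft_probs[i][0, tok]           # draft prob of that token
        if torch.rand(1) < torch.clamp(p / q, max=1.0):
            out = torch.cat([out, tok.view(1, 1)], dim=-1)  # accept
        else:
            # Reject: resample from the residual max(0, p - q), renormalized.
            residual = torch.clamp(target_probs[0, i] - draft_probs[i][0], min=0)
            tok = torch.multinomial(residual / residual.sum(), 1)
            return torch.cat([out, tok.view(1, 1)], dim=-1)
    # All k drafted tokens accepted: the target's last position yields one
    # extra "bonus" token for free.
    bonus = torch.multinomial(torch.softmax(target_logits[0, -1, :], dim=-1), 1)
    return torch.cat([out, bonus.view(1, 1)], dim=-1)
```

The key property is that the Target Model scores all k drafted tokens in a single forward pass, so each expensive call can yield up to k+1 tokens instead of one; the accept/reject rule is what keeps the output distribution identical to plain auto-regressive sampling from the Target Model.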
KL Divergence in DeepSeek GRPO
Origins
After the release of DeepSeek R1, I noticed that the RL algorithm used in the paper is GRPO, which was originally proposed in the earlier DeepSeek Math paper. The GRPO objective function is given below.
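As written in the DeepSeekMath paper (reproduced here for reference; see the paper for the precise notation and definitions):

$$
\begin{aligned}
\mathcal{J}_{\mathrm{GRPO}}(\theta) ={}& \mathbb{E}\!\left[q \sim P(Q),\ \{o_i\}_{i=1}^{G} \sim \pi_{\theta_{\mathrm{old}}}(O \mid q)\right] \\
& \frac{1}{G}\sum_{i=1}^{G}\frac{1}{|o_i|}\sum_{t=1}^{|o_i|}\left\{\min\!\left[r_{i,t}(\theta)\,\hat{A}_{i,t},\ \operatorname{clip}\!\big(r_{i,t}(\theta),\,1-\varepsilon,\,1+\varepsilon\big)\,\hat{A}_{i,t}\right]-\beta\,\mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right]\right\}
\end{aligned}
$$

where $r_{i,t}(\theta) = \frac{\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})}{\pi_{\theta_{\mathrm{old}}}(o_{i,t}\mid q,\,o_{i,<t})}$ is the token-level probability ratio and $\hat{A}_{i,t}$ is the group-normalized advantage. The KL term, the focus of this post, uses the unbiased low-variance estimator

$$
\mathbb{D}_{\mathrm{KL}}\!\left[\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right]=\frac{\pi_{\mathrm{ref}}(o_{i,t}\mid q,\,o_{i,<t})}{\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})}-\log\frac{\pi_{\mathrm{ref}}(o_{i,t}\mid q,\,o_{i,<t})}{\pi_\theta(o_{i,t}\mid q,\,o_{i,<t})}-1
$$

rather than the plain log-ratio.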
Recommended Books on Large Language Models and Deep Learning
Preface
The NLP/LLM books I previously recommended on WeChat Moments and Twitter were well received, so for easier reference I've collected them here in one place (along with some books on deep learning fundamentals); the list is also published on my WeChat official account for convenient bookmarking.
A Systems Walkthrough of Stanford's AI-Town
Key Points
This article analyzes the Stanford AI-Town project, focusing on its innovative architecture for generative agents. It covers the following key components:
- Memory Stream: the long-term memory module (a retrieval sketch follows this list)
- Reflection: higher-level reasoning on top of raw memories
- Planning: behavior planning and execution
- Evaluation: validating the believability of agent behavior
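As a concrete illustration of the Memory Stream, here is a minimal sketch of the retrieval scoring described in the Generative Agents paper: each memory is ranked by a weighted sum of recency (exponential decay per hour, factor 0.995 in the paper), importance (an LLM-assigned score from 1 to 10), and relevance (embedding similarity to the current query). The field names and helper functions are hypothetical; this is not AI-Town's actual code.

```python
import math
import time

def cosine_similarity(a, b):
    """Plain cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieval_score(memory, query_embedding, now, decay=0.995):
    """Score one memory as in Generative Agents: recency + importance +
    relevance, each normalized to roughly [0, 1] and weighted equally
    (the paper's default weights are all 1)."""
    hours_since_access = (now - memory["last_accessed"]) / 3600
    recency = decay ** hours_since_access      # exponential decay per hour
    importance = memory["importance"] / 10     # LLM-rated 1-10, normalized
    relevance = cosine_similarity(memory["embedding"], query_embedding)
    return recency + importance + relevance

def retrieve(memories, query_embedding, k=5):
    """Return the top-k memories for the current query."""
    now = time.time()
    return sorted(memories,
                  key=lambda m: retrieval_score(m, query_embedding, now),
                  reverse=True)[:k]
```

In the paper, reflections and plans are written back into the same memory stream, so this single scoring function serves retrieval for all of the components listed above.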
Insights and Applications
The core ideas are a valuable reference for game NPC design, particularly for:
- designing NPC memory systems
- making behavior realistic and believable
- building dynamic social relationships
- natural interaction with the environment