Today we'll explore and implement DeepMind's paper Accelerating Large Language Model Decoding with Speculative Sampling [1]. I'll demonstrate how to reproduce this technique in under 100 lines of code while achieving a more than 2x speedup in inference time.