# LLM Engineering

## LLM Serving

### Importance metrics
Databricks: LLM Inference Performance Engineering: Best Practices
- Time To First Token (TTFT): How quickly users start seeing the model's output after entering their query. Low waiting times for a response are essential in real-time interactions, but less important in offline workloads. This metric is driven by the time required to process the prompt and then generate the first output token.
- Time Per Output Token (TPOT): Time to generate an output token for each user that is querying our system. This metric corresponds to how each user perceives the "speed" of the model. For example, a TPOT of 100 milliseconds/token would be 10 tokens per second per user, or ~450 words per minute, which is faster than a typical person can read.
- Latency: The overall time it takes for the model to generate the full response for a user. Overall response latency can be calculated using the previous two metrics: latency = (TTFT) + (TPOT) * (the number of tokens to be generated). See the worked example after this list.
- Throughput: The number of output tokens per second an inference server can generate across all users and requests.
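To make the latency formula concrete, here is a tiny Python sketch; the TTFT/TPOT values below are illustrative assumptions, not measurements:

```python
# Hypothetical numbers for illustration only.
ttft = 0.2        # seconds until the first token appears
tpot = 0.05       # seconds per subsequent output token (20 tok/s per user)
n_tokens = 500    # length of the generated response in tokens

latency = ttft + tpot * n_tokens
print(f"end-to-end latency: {latency:.1f} s")  # 0.2 + 0.05 * 500 = 25.2 s
```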
## Optimization
NVIDIA: Mastering LLM Techniques: Inference Optimization
### Speculative Sampling
- DeepMind: Accelerating large language model decoding with speculative sampling[^1]
- Google: Fast inference from transformers via speculative decoding[^2]
DeepMind and Google published papers on speculative sampling in quick succession; the two are quite similar in content (though they differ in some details). A minimal sketch of the shared core algorithm follows the implementation list below.
- feifeibear/LLMSpeculativeSampling provides PyTorch implementations of both algorithms
- jaymody/speculative-sampling provides a JAX implementation of the DeepMind algorithm
- ai-glimpse/toyllm provides a PyTorch implementation of the DeepMind algorithm (validated on GPT-2)
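As a rough illustration of the core algorithm both papers describe, here is a minimal single-sequence PyTorch sketch. It is not production code: `target` and `draft` are assumed to be callables mapping token ids of shape `(1, T)` to logits of shape `(1, T, V)` (e.g. a wrapped HuggingFace model), and the draft model is re-run without caching for clarity.

```python
import torch

@torch.no_grad()
def speculative_sampling(target, draft, x, k=4, max_new_tokens=64):
    """Sketch of DeepMind-style speculative sampling for one sequence."""
    stop = x.shape[1] + max_new_tokens
    while x.shape[1] < stop:
        n0 = x.shape[1]
        # 1) Draft model proposes k tokens autoregressively.
        xs = x
        for _ in range(k):
            q = torch.softmax(draft(xs)[:, -1], dim=-1)
            xs = torch.cat([xs, torch.multinomial(q, 1)], dim=1)
        # 2) Score all k draft positions (plus one bonus position) in a
        #    single target-model forward pass -- the source of the speedup.
        p = torch.softmax(target(xs)[:, n0 - 1:], dim=-1)  # (1, k+1, V)
        q = torch.softmax(draft(xs)[:, n0 - 1:], dim=-1)   # (1, k+1, V)
        # 3) Accept draft token i with prob min(1, p/q); on the first
        #    rejection, resample from the residual max(0, p - q).
        for i in range(k):
            t = xs[0, n0 + i]
            if torch.rand(()) < torch.clamp(p[0, i, t] / q[0, i, t], max=1.0):
                continue
            resid = torch.clamp(p[0, i] - q[0, i], min=0.0)
            t_new = torch.multinomial(resid / resid.sum(), 1)
            x = torch.cat([xs[:, : n0 + i], t_new.view(1, 1)], dim=1)
            break
        else:
            # All k draft tokens accepted: keep them and sample a bonus token.
            t_new = torch.multinomial(p[0, k], 1)
            x = torch.cat([xs, t_new.view(1, 1)], dim=1)
    return x
```

A real implementation would reuse KV caches for both models; the acceptance rule min(1, p/q) plus residual resampling is what guarantees the output distribution matches ordinary sampling from the target model.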
### KV Cache
The article 图解大模型推理优化之 KV Cache (an illustrated guide to KV Cache optimization for LLM inference) gives a detailed yet concise explanation; the HuggingFace code it references comes from transformers/models/decision_transformer/modeling_decision_transformer.py:
```python
query = self._split_heads(query, self.num_heads, self.head_dim)
key = self._split_heads(key, self.num_heads, self.head_dim)
value = self._split_heads(value, self.num_heads, self.head_dim)

if layer_past is not None:
    # Reuse the keys/values cached from previous decoding steps and append
    # only the projections of the new token(s), so earlier positions are
    # never recomputed.
    past_key, past_value = layer_past
    key = torch.cat((past_key, key), dim=-2)      # concat along the sequence axis
    value = torch.cat((past_value, value), dim=-2)
```
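To see the effect end to end, here is a minimal sketch comparing HuggingFace generation with and without the KV cache; GPT-2 is just a convenient small example model:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
inputs = tok("The KV cache stores past keys and values", return_tensors="pt")

for use_cache in (False, True):
    start = time.perf_counter()
    with torch.no_grad():
        # use_cache=False recomputes attention over the whole prefix at every
        # step; use_cache=True reuses the cached K/V tensors instead.
        model.generate(**inputs, max_new_tokens=128, do_sample=False,
                       use_cache=use_cache, pad_token_id=tok.eos_token_id)
    print(f"use_cache={use_cache}: {time.perf_counter() - start:.2f}s")
```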
#### MQA

Multi-Query Attention (MQA): all query heads share a single key/value head, shrinking the KV cache by a factor of the number of heads.
#### GQA

Grouped-Query Attention (GQA): query heads are partitioned into groups, each group sharing one key/value head; it interpolates between standard multi-head attention and MQA.
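A minimal PyTorch sketch of the shared-KV-head idea behind both variants; the batch size, head counts, and dimensions below are illustrative assumptions. With `n_kv_heads = 1` this is MQA, and with `1 < n_kv_heads < n_heads` it is GQA:

```python
import torch

n_heads, n_kv_heads, seq, d = 8, 2, 16, 64  # GQA: 4 query heads per KV head
q = torch.randn(1, n_heads, seq, d)
k = torch.randn(1, n_kv_heads, seq, d)      # KV cache shrinks by n_heads / n_kv_heads
v = torch.randn(1, n_kv_heads, seq, d)

# Expand each KV head to the group of query heads that shares it.
group = n_heads // n_kv_heads
k, v = k.repeat_interleave(group, dim=1), v.repeat_interleave(group, dim=1)

attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1) @ v
print(attn.shape)  # torch.Size([1, 8, 16, 64])
```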
### Flash Attention

### PagedAttention
[^1]: C. Chen, S. Borgeaud, G. Irving, J.-B. Lespiau, L. Sifre, and J. Jumper, "Accelerating large language model decoding with speculative sampling," arXiv preprint arXiv:2302.01318, 2023.

[^2]: Y. Leviathan, M. Kalman, and Y. Matias, "Fast inference from transformers via speculative decoding," in International Conference on Machine Learning, PMLR, 2023, pp. 19274–19286.