# LLM Engineering

## LLM Serving

### Importance metrics
Databricks: LLM Inference Performance Engineering: Best Practices
- Time To First Token (TTFT): How quickly users start seeing the model's output after entering their query. Low waiting times for a response are essential in real-time interactions, but less important in offline workloads. This metric is driven by the time required to process the prompt and then generate the first output token.
- Time Per Output Token (TPOT): Time to generate an output token for each user that is querying our system. This metric corresponds to how each user perceives the "speed" of the model. For example, a TPOT of 100 milliseconds/token would be 10 tokens per second per user, or ~450 words per minute, which is faster than a typical person can read.
- Latency: The overall time it takes for the model to generate the full response for a user. Overall response latency can be calculated using the previous two metrics: latency = (TTFT) + (TPOT) * (the number of tokens to be generated). See the worked example after this list.
- Throughput: The number of output tokens per second an inference server can generate across all users and requests.
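To make the latency formula concrete, here is a tiny Python sketch; the TTFT/TPOT values below are illustrative assumptions, not measurements:

```python
# Hypothetical numbers for illustration only.
ttft = 0.2        # seconds until the first token appears
tpot = 0.05       # seconds per subsequent output token (20 tok/s per user)
n_tokens = 500    # length of the generated response in tokens

latency = ttft + tpot * n_tokens
print(f"end-to-end latency: {latency:.1f} s")  # 0.2 + 0.05 * 500 = 25.2 s
```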
## Optimization
NVIDIA: Mastering LLM Techniques: Inference Optimization
### Speculative Sampling
- DeepMind: Accelerating large language model decoding with speculative sampling[^1]
- Google: Fast inference from transformers via speculative decoding[^2]
DeepMind and Google published papers on speculative sampling in quick succession; the two are quite similar in content (though they differ in some details). A minimal sketch of the shared core algorithm follows the implementation list below.
- feifeibear/LLMSpeculativeSampling provides PyTorch implementations of both algorithms
- jaymody/speculative-sampling provides a JAX implementation of the DeepMind algorithm
- ai-glimpse/toyllm provides a PyTorch implementation of the DeepMind algorithm (validated on GPT-2)
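As a rough illustration of the core algorithm both papers describe, here is a minimal single-sequence PyTorch sketch. It is not production code: `target` and `draft` are assumed to be callables mapping token ids of shape `(1, T)` to logits of shape `(1, T, V)` (e.g. a wrapped HuggingFace model), and the draft model is re-run without caching for clarity.

```python
import torch

@torch.no_grad()
def speculative_sampling(target, draft, x, k=4, max_new_tokens=64):
    """Sketch of DeepMind-style speculative sampling for one sequence."""
    stop = x.shape[1] + max_new_tokens
    while x.shape[1] < stop:
        n0 = x.shape[1]
        # 1) Draft model proposes k tokens autoregressively.
        xs = x
        for _ in range(k):
            q = torch.softmax(draft(xs)[:, -1], dim=-1)
            xs = torch.cat([xs, torch.multinomial(q, 1)], dim=1)
        # 2) Score all k draft positions (plus one bonus position) in a
        #    single target-model forward pass -- the source of the speedup.
        p = torch.softmax(target(xs)[:, n0 - 1:], dim=-1)  # (1, k+1, V)
        q = torch.softmax(draft(xs)[:, n0 - 1:], dim=-1)   # (1, k+1, V)
        # 3) Accept draft token i with prob min(1, p/q); on the first
        #    rejection, resample from the residual max(0, p - q).
        for i in range(k):
            t = xs[0, n0 + i]
            if torch.rand(()) < torch.clamp(p[0, i, t] / q[0, i, t], max=1.0):
                continue
            resid = torch.clamp(p[0, i] - q[0, i], min=0.0)
            t_new = torch.multinomial(resid / resid.sum(), 1)
            x = torch.cat([xs[:, : n0 + i], t_new.view(1, 1)], dim=1)
            break
        else:
            # All k draft tokens accepted: keep them and sample a bonus token.
            t_new = torch.multinomial(p[0, k], 1)
            x = torch.cat([xs, t_new.view(1, 1)], dim=1)
    return x
```

A real implementation would reuse KV caches for both models; the acceptance rule min(1, p/q) plus residual resampling is what guarantees the output distribution matches ordinary sampling from the target model.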
### KV Cache
The article 图解大模型推理优化之 KV Cache (an illustrated guide to KV Cache optimization for LLM inference) gives a detailed yet concise explanation; the HuggingFace code it references comes from transformers/models/decision_transformer/modeling_decision_transformer.py:
```python
query = self._split_heads(query, self.num_heads, self.head_dim)
key = self._split_heads(key, self.num_heads, self.head_dim)
value = self._split_heads(value, self.num_heads, self.head_dim)

if layer_past is not None:
    # Reuse the keys/values cached from previous decoding steps and append
    # only the projections of the new token(s), so earlier positions are
    # never recomputed.
    past_key, past_value = layer_past
    key = torch.cat((past_key, key), dim=-2)      # concat along the sequence axis
    value = torch.cat((past_value, value), dim=-2)
```
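To see the effect end to end, here is a minimal sketch comparing HuggingFace generation with and without the KV cache; GPT-2 is just a convenient small example model:

```python
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()
inputs = tok("The KV cache stores past keys and values", return_tensors="pt")

for use_cache in (False, True):
    start = time.perf_counter()
    with torch.no_grad():
        # use_cache=False recomputes attention over the whole prefix at every
        # step; use_cache=True reuses the cached K/V tensors instead.
        model.generate(**inputs, max_new_tokens=128, do_sample=False,
                       use_cache=use_cache, pad_token_id=tok.eos_token_id)
    print(f"use_cache={use_cache}: {time.perf_counter() - start:.2f}s")
```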
#### MQA

Multi-Query Attention (MQA): all query heads share a single key/value head, shrinking the KV cache by a factor of the number of heads.
#### GQA

Grouped-Query Attention (GQA): query heads are partitioned into groups, each group sharing one key/value head; it interpolates between standard multi-head attention and MQA.
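A minimal PyTorch sketch of the shared-KV-head idea behind both variants; the batch size, head counts, and dimensions below are illustrative assumptions. With `n_kv_heads = 1` this is MQA, and with `1 < n_kv_heads < n_heads` it is GQA:

```python
import torch

n_heads, n_kv_heads, seq, d = 8, 2, 16, 64  # GQA: 4 query heads per KV head
q = torch.randn(1, n_heads, seq, d)
k = torch.randn(1, n_kv_heads, seq, d)      # KV cache shrinks by n_heads / n_kv_heads
v = torch.randn(1, n_kv_heads, seq, d)

# Expand each KV head to the group of query heads that shares it.
group = n_heads // n_kv_heads
k, v = k.repeat_interleave(group, dim=1), v.repeat_interleave(group, dim=1)

attn = torch.softmax(q @ k.transpose(-2, -1) / d ** 0.5, dim=-1) @ v
print(attn.shape)  # torch.Size([1, 8, 16, 64])
```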
### Flash Attention

### PagedAttention
[^1]: C. Chen, S. Borgeaud, G. Irving, J.-B. Lespiau, L. Sifre, and J. Jumper, "Accelerating large language model decoding with speculative sampling," arXiv preprint arXiv:2302.01318, 2023.

[^2]: Y. Leviathan, M. Kalman, and Y. Matias, "Fast inference from transformers via speculative decoding," in International Conference on Machine Learning, PMLR, 2023, pp. 19274–19286.