LLM Engineering

LLM Serving

Important metrics

Databricks: LLM Inference Performance Engineering: Best Practices
  • Time To First Token (TTFT): How quickly users start seeing the model's output after entering their query. Low waiting times for a response are essential in real-time interactions, but less important in offline workloads. This metric is driven by the time required to process the prompt and then generate the first output token.
  • Time Per Output Token (TPOT): Time to generate an output token for each user that is querying our system. This metric corresponds to how each user will perceive the "speed" of the model. For example, a TPOT of 100 milliseconds/tok would be 10 tokens per second per user, or ~450 words per minute, which is faster than a typical person can read.
  • Latency: The overall time it takes for the model to generate the full response for a user. Overall response latency can be calculated using the previous two metrics: latency = (TTFT) + (TPOT) * (the number of tokens to be generated); a small worked example follows this list.
  • Throughput: The number of output tokens per second an inference server can generate across all users and requests.
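
As a quick sanity check of how these metrics combine, here is a tiny calculation sketch (the numbers are purely illustrative, not measurements):

ttft = 0.5          # seconds until the first token appears
tpot = 0.1          # seconds per subsequent output token (10 tokens/s per user)
n_tokens = 200      # number of tokens to be generated

latency = ttft + tpot * n_tokens      # 0.5 + 0.1 * 200 = 20.5 seconds end to end
per_user_rate = 1.0 / tpot            # 10 output tokens/s as seen by this user
print(latency, per_user_rate)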

Optimization

NVIDIA: Mastering LLM Techniques: Inference Optimization

KV Cache

The article 图解大模型推理优化之 KV Cache gives a detailed yet concise explanation; the HuggingFace code it references comes from transformers/models/decision_transformer/modeling_decision_transformer.py:

# Split the projected hidden states into (num_heads, head_dim) per token
query = self._split_heads(query, self.num_heads, self.head_dim)
key = self._split_heads(key, self.num_heads, self.head_dim)
value = self._split_heads(value, self.num_heads, self.head_dim)

# If keys/values from earlier decoding steps are cached, concatenate them
# with the current step's key/value along the sequence dimension (dim=-2),
# so past tokens' keys and values never have to be recomputed.
if layer_past is not None:
    past_key, past_value = layer_past
    key = torch.cat((past_key, key), dim=-2)
    value = torch.cat((past_value, value), dim=-2)

The core logic: whenever historical keys and values exist, the current step's key and value are concatenated onto them, so the keys and values of past tokens never need to be recomputed and the per-step computation is reduced.
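
To make the effect concrete, here is a minimal sketch of KV caching inside a toy greedy-decoding loop. The interface is assumed for illustration (a model that accepts layer_past and returns (logits, new_past), mirroring the excerpt above); it is not the actual transformers API.

import torch

@torch.no_grad()
def greedy_decode_with_kv_cache(model, input_ids, max_new_tokens):
    # past holds the per-layer (key, value) tensors, i.e. the KV cache
    past = None
    for _ in range(max_new_tokens):
        if past is None:
            step_ids = input_ids            # prefill: process the whole prompt once
        else:
            step_ids = input_ids[:, -1:]    # decode: feed only the newest token
        logits, past = model(step_ids, layer_past=past)   # assumed (logits, cache) return
        next_id = logits[:, -1].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_id], dim=-1)
    return input_ids

Without the cache, every step would re-run attention over the full sequence; with it, each step computes only the new token's query/key/value and attends to the cached keys and values.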

MQA

Multi-Query Attention (MQA): all query heads share a single key/value head, which shrinks the KV cache and the memory traffic during decoding.

GQA

Grouped-query Attention (GQA): query heads are split into groups and each group shares one key/value head, interpolating between standard multi-head attention and MQA.
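
A minimal sketch of the head-sharing idea (the function name and shapes are assumptions for illustration, not taken from any particular library): with num_kv_heads < num_heads, each key/value head serves num_heads // num_kv_heads query heads; num_kv_heads = 1 recovers MQA, and num_kv_heads = num_heads recovers ordinary multi-head attention.

import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, num_heads, num_kv_heads):
    # q: (batch, seq, num_heads * head_dim); k, v: (batch, seq, num_kv_heads * head_dim)
    b, t, _ = q.shape
    head_dim = q.shape[-1] // num_heads
    q = q.view(b, t, num_heads, head_dim).transpose(1, 2)      # (b, H, t, d)
    k = k.view(b, t, num_kv_heads, head_dim).transpose(1, 2)   # (b, Hkv, t, d)
    v = v.view(b, t, num_kv_heads, head_dim).transpose(1, 2)

    # Each group of query heads attends to the same key/value head:
    # repeat the KV heads so their shapes line up with the query heads.
    group = num_heads // num_kv_heads
    k = k.repeat_interleave(group, dim=1)                      # (b, H, t, d)
    v = v.repeat_interleave(group, dim=1)

    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    return out.transpose(1, 2).reshape(b, t, num_heads * head_dim)

Only the key/value projections change size, so the KV cache stores num_kv_heads heads per layer instead of num_heads, which is the main saving during serving.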

Flash Attention

PagedAttention