LLM Engineering

LLM Serving

Importance metrics

Databricks: LLM Inference Performance Engineering: Best Practices
  • Time To First Token (TTFT): How quickly users start seeing the model's output after entering their query. Low waiting times for a response are essential in real-time interactions, but less important in offline workloads. This metric is driven by the time required to process the prompt and then generate the first output token.
  • Time Per Output Token (TPOT): Time to generate an output token for each user that is querying our system. This metric corresponds to how each user will perceive the "speed" of the model. For example, a TPOT of 100 milliseconds/tok would be 10 tokens per second per user, or ~450 words per minute, which is faster than a typical person can read.
  • Latency: The overall time it takes for the model to generate the full response for a user. Overall response latency can be calculated using the previous two metrics: latency = (TTFT) + (TPOT) * (the number of tokens to be generated).
  • Throughput: The number of output tokens per second an inference server can generate across all users and requests.
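
As a quick sanity check on the latency formula above, a minimal sketch (the numbers are illustrative, not taken from the Databricks article):

ttft = 0.5               # seconds until the first token appears (prompt processing + first decode step)
tpot = 0.1               # seconds per output token per user (100 ms/token ≈ 10 tokens/s)
num_output_tokens = 200  # length of the generated response

latency = ttft + tpot * num_output_tokens
print(f"estimated end-to-end latency: {latency:.1f} s")  # 0.5 + 0.1 * 200 = 20.5 s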

Optimization

NVIDIA: Mastering LLM Techniques: Inference Optimization

KV Cache

The article 图解大模型推理优化之KV Cache gives a very detailed and concise explanation; the HuggingFace code it references comes from transformers/models/decision_transformer/modeling_decision_transformer.py

# reshape query/key/value into (batch, num_heads, seq_len, head_dim)
query = self._split_heads(query, self.num_heads, self.head_dim)
key = self._split_heads(key, self.num_heads, self.head_dim)
value = self._split_heads(value, self.num_heads, self.head_dim)

if layer_past is not None:
    # layer_past holds the key/value tensors of all previously processed tokens;
    # append the current token's key/value along the sequence dimension (dim=-2)
    past_key, past_value = layer_past
    key = torch.cat((past_key, key), dim=-2)
    value = torch.cat((past_value, value), dim=-2)
The core logic: if keys and values from previous steps already exist, the current key and value are simply concatenated onto them, so the attention computation for past tokens never has to be repeated, which reduces the per-step compute during decoding.
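
To make the effect concrete, a minimal greedy-decoding sketch using the HuggingFace use_cache / past_key_values mechanism (gpt2 is only a placeholder model; this is not the code from the article above):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2").eval()

input_ids = tokenizer("The KV cache reduces", return_tensors="pt").input_ids
past_key_values = None  # no cache yet: the first forward pass processes the whole prompt

with torch.no_grad():
    for _ in range(20):
        # after the first step, only the newest token is fed in;
        # keys/values of earlier tokens are reused from past_key_values
        inputs = input_ids if past_key_values is None else input_ids[:, -1:]
        outputs = model(inputs, past_key_values=past_key_values, use_cache=True)
        past_key_values = outputs.past_key_values
        next_token = outputs.logits[:, -1, :].argmax(dim=-1, keepdim=True)
        input_ids = torch.cat([input_ids, next_token], dim=-1)

print(tokenizer.decode(input_ids[0]))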

MQA

Multi-Query Attention (MQA): all query heads share a single key/value head, so the KV cache (and the memory traffic needed to read it at each decoding step) shrinks by roughly the number of heads.

GQA

Grouped-query Attention (GQA): query heads are split into groups and each group shares one key/value head; standard multi-head attention (one KV head per query head) and MQA (one KV head in total) are the two extremes. A shape-only sketch comparing the three follows below.
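
A minimal shape-only sketch of how the number of key/value heads differs between MHA, GQA and MQA (all dimensions and names here are illustrative, not tied to any particular model):

import torch
import torch.nn as nn

batch, seq_len, d_model = 2, 16, 512
num_q_heads, head_dim = 8, 64  # d_model == num_q_heads * head_dim

def kv_projections(num_kv_heads):
    # MHA: num_kv_heads == num_q_heads; MQA: num_kv_heads == 1;
    # GQA: somewhere in between (each group of query heads shares one KV head)
    k_proj = nn.Linear(d_model, num_kv_heads * head_dim)
    v_proj = nn.Linear(d_model, num_kv_heads * head_dim)
    return k_proj, v_proj

x = torch.randn(batch, seq_len, d_model)
for name, num_kv_heads in [("MHA", 8), ("GQA", 2), ("MQA", 1)]:
    k_proj, v_proj = kv_projections(num_kv_heads)
    k = k_proj(x).view(batch, seq_len, num_kv_heads, head_dim)
    v = v_proj(x).view(batch, seq_len, num_kv_heads, head_dim)
    # the per-layer KV cache scales with num_kv_heads instead of num_q_heads
    print(name, "cached K/V shape per layer:", tuple(k.shape))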

Flash Attention

PagedAttention