Write for one person: If you try to write for everyone, you write for no one. Focus on one person: a loved one, your boss, or a peer. What do you want to share with that person?
Quantization: A method to shrink a model’s size and make it run faster. It works by converting the model’s internal numbers from high precision (like 32-bit floats) to lower precision (like 8-bit integers). This reduces memory usage and speeds up inference, with only a small drop in accuracy.
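To make the conversion concrete, here is a minimal sketch of symmetric int8 quantization in NumPy; the function names and the toy weight matrix are illustrative, not a production scheme.

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric int8 quantization: the largest-magnitude weight maps to +/-127."""
    scale = np.abs(weights).max() / 127.0
    q = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate float32 weights from the int8 values and the scale."""
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)
q, scale = quantize_int8(w)
print("max reconstruction error:", np.abs(w - dequantize(q, scale)).max())
```

The int8 tensor takes a quarter of the memory of the float32 original; the small reconstruction error printed at the end is the "small drop in accuracy" the definition refers to.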
Time to first token (TTFT): The time from sending a request to the moment the first token of the response appears. Even if the full answer takes longer, a fast TTFT makes the system feel more responsive to the user.
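A minimal sketch of how TTFT can be measured, assuming a streaming response that yields tokens one at a time; the generator below is a simulated stand-in, not a real model API.

```python
import time

def measure_ttft(token_stream):
    """Return (seconds until the first token, the first token) for a stream."""
    start = time.perf_counter()
    first_token = next(iter(token_stream))  # blocks until the first token arrives
    return time.perf_counter() - start, first_token

def simulated_stream():
    """Stand-in for a streaming LLM response (generators run lazily)."""
    time.sleep(0.3)          # simulated delay before the first token
    yield "Hello"
    for token in [",", " world", "!"]:
        time.sleep(0.05)     # later tokens don't affect TTFT
        yield token

ttft, token = measure_ttft(simulated_stream())
print(f"TTFT: {ttft:.2f}s (first token: {token!r})")
```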
Inference: The process of using a trained model to make predictions on new data. For large language models, this means generating a response from the input prompt. It’s what happens when you “ask” the model something.
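As a concrete example, this sketch runs inference with the Hugging Face `transformers` pipeline; the model and prompt are arbitrary choices for illustration.

```python
from transformers import pipeline

# Load a small text-generation model (downloads weights on first run).
generator = pipeline("text-generation", model="gpt2")

# Inference: the trained model continues the prompt with new tokens.
result = generator("The capital of France is", max_new_tokens=10)
print(result[0]["generated_text"])
```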
Failure modes: Coherent, non-overlapping categories of errors that emerge from analyzing LLM traces (e.g., hallucination, incorrect format, or missed instruction). Each failure type is binary (present or absent), easy to recognize, and forms the basis for a targeted metric.
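A minimal sketch of turning binary failure types into targeted metrics, assuming traces have already been labeled; the failure-mode names and the traces themselves are hypothetical.

```python
# Hypothetical labeled traces: each failure type is a binary judgment.
traces = [
    {"hallucination": True,  "incorrect_format": False, "missed_instruction": False},
    {"hallucination": False, "incorrect_format": True,  "missed_instruction": False},
    {"hallucination": False, "incorrect_format": False, "missed_instruction": False},
]

# Each binary failure type becomes a targeted metric: its rate across traces.
for mode in ["hallucination", "incorrect_format", "missed_instruction"]:
    rate = sum(trace[mode] for trace in traces) / len(traces)
    print(f"{mode}: {rate:.0%}")
```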
Bottom-up vs. top-down analysis: Two approaches to defining AI metrics. Bottom-up analysis identifies application-specific failure modes directly from the data; top-down analysis applies generic, predefined metrics (like hallucination or toxicity) that may miss domain-specific nuances.
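To illustrate the contrast, this sketch pairs generic top-down checks with bottom-up checks derived from one hypothetical application's traces; every check here is a crude placeholder, not a real evaluation library.

```python
def top_down_metrics(response: str) -> dict:
    """Generic, application-agnostic checks applied to any LLM output."""
    return {
        "suspected_hallucination": "as an AI" in response,  # crude placeholder
        "too_long": len(response) > 1000,
    }

def bottom_up_metrics(response: str) -> dict:
    """Checks derived from failures seen in this app's own traces,
    e.g. a support bot that must cite a ticket ID and never promise refunds."""
    return {
        "missing_ticket_id": "TICKET-" not in response,
        "promises_refund": "refund" in response.lower(),
    }

response = "We opened TICKET-123 and will follow up within 24 hours."
print(top_down_metrics(response))
print(bottom_up_metrics(response))
```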