Overview#
What are evaluations?#
Evaluation is a crucial technique for checking that your AI workflow is reliable. It can be the difference between a flaky proof of concept and a solid production workflow. It's important both in the building phase and after deploying to production.

The foundation of evaluation is running a test dataset through your workflow. This dataset contains multiple test cases. Each test case contains a sample input for your workflow, and often includes the expected output(s) too.
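As a rough sketch, a test dataset can be thought of as a list of input/expected-output pairs. The shape below is purely illustrative; the field names are assumptions, not a fixed schema:

```typescript
// Illustrative sketch only: field names here are assumptions, not a fixed schema.
interface TestCase {
  input: Record<string, unknown>; // sample input the workflow receives
  expectedOutput?: string;        // reference answer to compare against (optional)
}

const dataset: TestCase[] = [
  {
    input: { question: "What is the refund policy?" },
    expectedOutput: "Purchases can be refunded within 30 days.",
  },
  // A test case without an expected output is still useful for manual inspection.
  { input: { question: "Do you ship internationally?" } },
];
```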
Evaluation allows you to:

- Test your workflow over a range of inputs so you know how it performs on edge cases
- Make changes with confidence without inadvertently making things worse elsewhere
- Compare performance across different models or prompts
The following video explains what evaluations are, why they're useful, and how they work:
Why is evaluation needed?#
AI models are fundamentally different from code. Code is deterministic, so you can reason about it directly. That's difficult with LLMs, since they're black boxes: instead, you must measure LLM behavior by running data through them and observing the output.

You can only build confidence that your model performs reliably after you have run it over multiple inputs that accurately reflect all the edge cases it will have to deal with in production.
Two types of evaluation#
Light evaluation (pre-deployment)#
Building a clean, comprehensive dataset is hard. In the initial building phase, it often makes sense to generate just a handful of examples. These can be enough to iterate the workflow to a releasable state (or a proof of concept). You can visually compare the results to get a sense of the workflow's quality, without setting up formal metrics.
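A minimal sketch of this kind of light evaluation might look like the following, where `runWorkflow` is a hypothetical stand-in for whatever executes your workflow, and `TestCase` reuses the shape sketched above:

```typescript
// Hypothetical stand-in for executing your AI workflow on one input.
async function runWorkflow(input: Record<string, unknown>): Promise<string> {
  // ...call your model or workflow here...
  return "model output";
}

// Light evaluation: run a handful of examples and compare the results by eye.
async function lightEvaluation(dataset: TestCase[]): Promise<void> {
  for (const testCase of dataset) {
    const actual = await runWorkflow(testCase.input);
    console.log({
      input: testCase.input,
      expected: testCase.expectedOutput ?? "(none)",
      actual,
    });
  }
}
```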
Metric-based evaluation (post-deployment)#
Once you deploy your workflow, it's easier to build a bigger, more representative dataset from production executions. When you discover a bug, you can add the input that caused it to the dataset. When fixing the bug, it's important to run the whole dataset through the workflow again as a regression test, to check that the fix hasn't inadvertently made something else worse.

Since there are too many test cases to check individually, evaluations measure the quality of the outputs using a metric, a numeric value representing a particular characteristic. This also allows you to track quality changes between runs.
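As an illustration of the idea (not a prescribed implementation), a metric can be as simple as the exact-match rate across the dataset, building on the hypothetical `TestCase` and `runWorkflow` sketches above:

```typescript
// Metric-based evaluation sketch: reduce a whole dataset run to one number
// (here an exact-match rate) so you can compare quality between runs.
async function metricBasedEvaluation(dataset: TestCase[]): Promise<number> {
  let matches = 0;
  for (const testCase of dataset) {
    const actual = await runWorkflow(testCase.input);
    if (
      testCase.expectedOutput !== undefined &&
      actual.trim() === testCase.expectedOutput.trim()
    ) {
      matches++;
    }
  }
  return matches / dataset.length; // 1.0 means every expected output was matched
}
```

In practice you would often use softer metrics, such as similarity scores or an LLM-based judge, rather than exact matching, but the principle of tracking one number per run is the same.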
Comparison of evaluation types#
| | Light evaluation (pre-deployment) | Metric-based evaluation (post-deployment) |
|---|---|---|
| Performance gain per iteration | Large | Small |
| Dataset size | Small | Large |
| Dataset source | Manually generated, AI-generated, other | Production executions, AI-generated, other |
| Actual outputs | Required | Required |
| Expected outputs | Optional | Required (usually) |
| Evaluation metrics | Optional | Required |
Learn more#
- Light evaluations: Perfect for evaluating your AI workflows against hand-selected test cases during development.
- Metric-based evaluations: Advanced evaluations to maintain performance and correctness in production by using scoring and metrics with large datasets.
- Tips and common issues: Learn how to set up specific evaluation use cases and work around common issues.