Overview#
What are evaluations?#
Evaluation is a crucial technique for checking that your AI workflow is reliable. It can be the difference between a flaky proof of concept and a solid production workflow. It's important both in the building phase and after deploying to production.

The foundation of evaluation is running a test dataset through your workflow. This dataset contains multiple test cases. Each test case contains a sample input for your workflow, and often includes the expected output(s) too.
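As a rough sketch, a test dataset can be thought of as a list of input/expected-output pairs. The shape below is purely illustrative; the field names are assumptions, not a fixed schema:

```typescript
// Illustrative sketch only: field names here are assumptions, not a fixed schema.
interface TestCase {
  input: Record<string, unknown>; // sample input the workflow receives
  expectedOutput?: string;        // reference answer to compare against (optional)
}

const dataset: TestCase[] = [
  {
    input: { question: "What is the refund policy?" },
    expectedOutput: "Purchases can be refunded within 30 days.",
  },
  // A test case without an expected output is still useful for manual inspection.
  { input: { question: "Do you ship internationally?" } },
];
```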
Evaluation allows you to:

- Test your workflow over a range of inputs so you know how it performs on edge cases
- Make changes with confidence without inadvertently making things worse elsewhere
- Compare performance across different models or prompts
The following video explains what evaluations are, why they're useful, and how they work:
Why is evaluation needed?#
AI models are fundamentally different from code. Code is deterministic, so you can reason about it directly. That's difficult with LLMs, since they're black boxes: instead, you must measure LLM behavior by running data through them and observing the output.

You can only build confidence that your model performs reliably after you have run it over multiple inputs that accurately reflect all the edge cases it will have to deal with in production.
Two types of evaluation#
Light evaluation (pre-deployment)#
Building a clean, comprehensive dataset is hard. In the initial building phase, it often makes sense to generate just a handful of examples. These can be enough to iterate the workflow to a releasable state (or a proof of concept). You can visually compare the results to get a sense of the workflow's quality, without setting up formal metrics.
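A minimal sketch of this kind of light evaluation might look like the following, where `runWorkflow` is a hypothetical stand-in for whatever executes your workflow, and `TestCase` reuses the shape sketched above:

```typescript
// Hypothetical stand-in for executing your AI workflow on one input.
async function runWorkflow(input: Record<string, unknown>): Promise<string> {
  // ...call your model or workflow here...
  return "model output";
}

// Light evaluation: run a handful of examples and compare the results by eye.
async function lightEvaluation(dataset: TestCase[]): Promise<void> {
  for (const testCase of dataset) {
    const actual = await runWorkflow(testCase.input);
    console.log({
      input: testCase.input,
      expected: testCase.expectedOutput ?? "(none)",
      actual,
    });
  }
}
```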
Metric-based evaluation (post-deployment)#
Once you deploy your workflow, it's easier to build a bigger, more representative dataset from production executions. When you discover a bug, you can add the input that caused it to the dataset. When fixing the bug, it's important to run the whole dataset through the workflow again as a regression test, to check that the fix hasn't inadvertently made something else worse.

Since there are too many test cases to check individually, evaluations measure the quality of the outputs using a metric, a numeric value representing a particular characteristic. This also allows you to track quality changes between runs.
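As an illustration of the idea (not a prescribed implementation), a metric can be as simple as the exact-match rate across the dataset, building on the hypothetical `TestCase` and `runWorkflow` sketches above:

```typescript
// Metric-based evaluation sketch: reduce a whole dataset run to one number
// (here an exact-match rate) so you can compare quality between runs.
async function metricBasedEvaluation(dataset: TestCase[]): Promise<number> {
  let matches = 0;
  for (const testCase of dataset) {
    const actual = await runWorkflow(testCase.input);
    if (
      testCase.expectedOutput !== undefined &&
      actual.trim() === testCase.expectedOutput.trim()
    ) {
      matches++;
    }
  }
  return matches / dataset.length; // 1.0 means every expected output was matched
}
```

In practice you would often use softer metrics, such as similarity scores or an LLM-based judge, rather than exact matching, but the principle of tracking one number per run is the same.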
Comparison of evaluation types#
| | Light evaluation (pre-deployment) | Metric-based evaluation (post-deployment) |
|---|---|---|
| Performance gain per iteration | Large | Small |
| Dataset size | Small | Large |
| Dataset source | Manually generated, AI-generated, other | Production executions, AI-generated, other |
| Actual outputs | Required | Required |
| Expected outputs | Optional | Required (usually) |
| Evaluation metrics | Optional | Required |
Learn more#
- Light evaluations: Perfect for evaluating your AI workflows against hand-selected test cases during development.
- Metric-based evaluations: Advanced evaluations to maintain performance and correctness in production by using scoring and metrics with large datasets.
- Tips and common issues: Learn how to set up specific evaluation use cases and work around common issues.