Metric-based evaluations#
Available on Pro and Enterprise plans
Metric-based evaluation is available on Pro and Enterprise plans. Registered Community and Starter plan users can also use it for a single workflow.
What are metric-based evaluations?#
Once your workflow is ready for deployment, you often want to test it on more examples than when you were building it.
For example, when production executions start to turn up edge cases, you want to add them to your test dataset so that you can make sure they're covered.
For large datasets like the ones built from production data, it can be hard to get a sense of performance just by eyeballing the results. Instead, you must measure performance. Metric-based evaluations can assign one or more scores to each test run, which you can compare to previous runs. Individual scores get rolled up to measure performance on the whole dataset.
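To make the roll-up concrete, here is a minimal sketch of how per-case scores could be averaged into run-level summary scores. It assumes a simple mean; the metric names and data shapes are illustrative, not n8n's internal representation.

```typescript
// Minimal sketch: average per-test-case scores into one summary
// score per metric (assumes a simple mean; illustrative only).
type CaseScores = Record<string, number>;

function summarize(cases: CaseScores[]): CaseScores {
  const totals: Record<string, { sum: number; n: number }> = {};
  for (const scores of cases) {
    for (const [metric, value] of Object.entries(scores)) {
      const t = (totals[metric] ??= { sum: 0, n: 0 });
      t.sum += value;
      t.n += 1;
    }
  }
  return Object.fromEntries(
    Object.entries(totals).map(([metric, t]) => [metric, t.sum / t.n]),
  );
}

// Two test cases, each scored on two metrics:
console.log(summarize([
  { correctness: 4, stringSimilarity: 0.9 },
  { correctness: 2, stringSimilarity: 0.5 },
])); // ≈ { correctness: 3, stringSimilarity: 0.7 }
```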
This feature allows you to run evaluations that calculate metrics, track how those metrics change between runs, and drill down into the reasons for those changes.
Metrics can be deterministic functions (such as the distance between two strings), or you can calculate them using AI. Metrics often involve checking how far the output is from a reference output (also called ground truth); for this, the dataset must contain that reference output. Some evaluations don't need a reference output, though (for example, checking text for sentiment or toxicity).
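To illustrate a deterministic metric, below is a sketch of edit-distance-based string similarity, normalized so that 1 means identical and 0 means completely different. n8n's built-in String Similarity metric works on the same principle, though this is not its exact implementation.

```typescript
// Levenshtein (edit) distance between two strings.
function levenshtein(a: string, b: string): number {
  const dp = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) =>
      i === 0 ? j : j === 0 ? i : 0,
    ),
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution
      );
    }
  }
  return dp[a.length][b.length];
}

// Normalize to a 0-1 similarity score (1 = identical).
function stringSimilarity(output: string, reference: string): number {
  const maxLen = Math.max(output.length, reference.length);
  return maxLen === 0 ? 1 : 1 - levenshtein(output, reference) / maxLen;
}

console.log(stringSimilarity("bug report", "bug")); // ≈ 0.3
```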
How it works#
Credentials for Google Sheets
Evaluations use data tables or Google Sheets to store the test dataset. To use Google Sheets as a dataset source, configure a Google Sheets credential.
1. Set up light evaluation
2. Add metrics to workflow
3. Run evaluation and view results
1. Set up light evaluation#
Follow the setup instructions to create a dataset and wire it up to your workflow, writing outputs back to the dataset.
The following steps use the same support ticket classification workflow from the light evaluation docs:
2. Add metrics to workflow#
Metrics are dimensions used to score the output of your workflow. They often compare the actual workflow output with a reference output. It's common to use AI to calculate metrics, although it's sometimes possible to just use code. In n8n, metrics are always numbers.
You need to add the logic that calculates the metrics for your workflow, at a point after it has produced its outputs. You can add any reference outputs your metric uses as columns in your dataset. This ensures they will be available in the workflow, since the evaluation trigger outputs them.
Use the Set Metrics operation to calculate:
- Correctness (AI-based): whether the answer's meaning is consistent with a supplied reference answer. Uses a scale of 1 to 5, with 5 being the best (see the judge-style sketch after this list for the general idea).
- Helpfulness (AI-based): whether the response answers the given query. Uses a scale of 1 to 5, with 5 being the best.
- String Similarity: how close the answer is to the reference answer, measured character by character (edit distance). Returns a score between 0 and 1.
- Categorization: whether the answer is an exact match with the reference answer. Returns 1 when matching and 0 otherwise.
- Tools Used: whether the execution used tools or not. Returns a score between 0 and 1.
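AI-based metrics such as Correctness and Helpfulness grade the output with a model ("LLM as judge"). The sketch below shows the general idea using the OpenAI chat completions API; the prompt, model name, and score parsing are assumptions for illustration, not n8n's internal implementation.

```typescript
// Judge-style sketch of an AI-based correctness score (1-5).
// Assumptions: OpenAI chat completions API, OPENAI_API_KEY set in the
// environment, and a prompt/model chosen purely for illustration.
async function judgeCorrectness(
  question: string,
  answer: string,
  reference: string,
): Promise<number> {
  const res = await fetch("https://api.openai.com/v1/chat/completions", {
    method: "POST",
    headers: {
      "Content-Type": "application/json",
      Authorization: `Bearer ${process.env.OPENAI_API_KEY}`,
    },
    body: JSON.stringify({
      model: "gpt-4o-mini",
      messages: [{
        role: "user",
        content:
          "Rate how consistent the answer is with the reference answer " +
          "in meaning, from 1 (contradicts it) to 5 (fully consistent). " +
          "Reply with a single digit.\n\n" +
          `Question: ${question}\nAnswer: ${answer}\nReference: ${reference}`,
      }],
    }),
  });
  const data = await res.json();
  const score = parseInt(data.choices[0].message.content.trim(), 10);
  return Math.min(5, Math.max(1, score)); // clamp to the 1-5 scale
}
```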
You can also add custom metrics: calculate them within the workflow and then map them into an Evaluation node. Use the Set Metrics operation and choose Custom Metrics as the Metric. You can then set the names and values of the metrics you want to return.
For example:
- RAG document relevance: when working with a vector database, whether the documents retrieved are relevant to the question (a Code node sketch follows below).
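As a sketch of how a custom metric could be produced, the Code node below ("Run Once for All Items" mode) computes a crude keyword-overlap relevance score and attaches it to each item, ready to map into the Evaluation node's Custom Metrics. The field names retrievedDocs and question are hypothetical; a real relevance metric would more likely use embeddings or an LLM.

```typescript
// n8n Code node sketch ($input is a global the Code node provides).
// Field names (retrievedDocs, question) are hypothetical — adapt them
// to your workflow. Keyword overlap is a crude stand-in for relevance.
return $input.all().map((item) => {
  const docs = item.json.retrievedDocs ?? [];
  const terms = String(item.json.question ?? "")
    .toLowerCase()
    .split(/\W+/)
    .filter((t) => t.length > 3);

  // Fraction of retrieved documents mentioning at least one query term.
  const relevant = docs.filter((doc) =>
    String(doc).toLowerCase().includes2 === undefined
      ? terms.some((t) => String(doc).toLowerCase().includes(t))
      : false,
  ).length;

  return {
    json: {
      ...item.json,
      documentRelevance: docs.length ? relevant / docs.length : 0,
    },
  };
});
```

You can then map the resulting documentRelevance field into the Evaluation node's Custom Metrics so it is scored and summarized alongside the built-in ones.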
Calculating metrics can add latency and cost, so you may want to calculate them only when running an evaluation and skip them in production executions. You can do this by putting the metric logic after a 'check if evaluating' operation.
3. Run evaluation and view results#
Switch to the Evaluations tab on your workflow and click the Run evaluation button. An evaluation will start. Once it has finished, it displays a summary score for each metric.
You can see the results for each test case by clicking on the test run row. Clicking on an individual test case opens the execution that produced it in a new tab.

