10. Benchmark Evaluation

Posted Apr 27, 2026

By Ahern



The benchmark evaluates an Agent's capabilities along multiple dimensions.

Evaluation Metrics

The metrics also need to cover multiple stages of a run, not only the final answer.

Result Quality

pass_rate: share of tasks that pass
reward / avg_reward: average score
accuracy: share of tasks that are fully correct
partial_match_rate: share of tasks that are partially correct
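
As a minimal sketch of how these four metrics could be aggregated from per-task results: the `TaskResult` record, the assumption that `reward` lies in [0, 1], and the rule that "fully correct" means `reward == 1.0` while "partially correct" means a reward strictly between 0 and 1 are all illustrative choices, not definitions taken from the benchmark itself.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task record; field names are assumptions."""
    task_id: str
    passed: bool    # verifier verdict for the task
    reward: float   # assumed to lie in [0, 1]


def summarize_quality(results):
    """Aggregate result-quality metrics over a list of TaskResult."""
    n = len(results)
    if n == 0:
        raise ValueError("no results to summarize")
    return {
        "pass_rate": sum(r.passed for r in results) / n,
        "avg_reward": sum(r.reward for r in results) / n,
        # Assumption: "fully correct" means the maximum reward of 1.0.
        "accuracy": sum(r.reward == 1.0 for r in results) / n,
        # Assumption: "partially correct" means a reward strictly between 0 and 1.
        "partial_match_rate": sum(0.0 < r.reward < 1.0 for r in results) / n,
    }


results = [
    TaskResult("t1", True, 1.0),
    TaskResult("t2", False, 0.5),
    TaskResult("t3", False, 0.0),
    TaskResult("t4", True, 1.0),
]
summary = summarize_quality(results)
print(summary)
```

Keeping `passed` separate from `reward` lets a verifier apply a pass threshold other than a perfect score without changing the aggregation.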

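Agent Behavior

turns_used: number of LLM call turns the task took
finished_naturally: whether the run ended on its own
tool call error count: unknown tools, malformed JSON arguments, failed terminal executions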
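
The three error categories above could be counted roughly as follows. This is a sketch under stated assumptions: the `KNOWN_TOOLS` registry, the `(name, raw_args, exit_code)` tuple shape, and the error labels are hypothetical, and a non-zero exit code is assumed to mean a failed terminal execution.

```python
import json
from collections import Counter

# Hypothetical tool registry; the real set of tools is not given in this post.
KNOWN_TOOLS = {"terminal", "read_file"}


def classify_tool_call(name, raw_args, exit_code=None):
    """Return an error label for one tool call, or None if no error is detected."""
    if name not in KNOWN_TOOLS:
        return "unknown_tool"
    try:
        json.loads(raw_args)
    except json.JSONDecodeError:
        return "invalid_json_args"
    # Assumption: a non-zero exit code marks a failed terminal execution.
    if name == "terminal" and exit_code not in (None, 0):
        return "terminal_failed"
    return None


def count_tool_errors(calls):
    """Count error labels over an iterable of (name, raw_args, exit_code) tuples."""
    labels = (classify_tool_call(*call) for call in calls)
    return Counter(label for label in labels if label is not None)


calls = [
    ("terminal", '{"command": "ls"}', 0),    # fine
    ("terminal", '{"command": "make"}', 2),  # command failed
    ("browser", "{}", None),                 # tool not in the registry
    ("read_file", '{"path": ', None),        # truncated JSON arguments
]
errors = count_tool_errors(calls)
print(errors)
```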

Efficiency and Cost

task elapsed time: time spent on a single task
token count: number of tokens consumed
API cost: what the model usage costs
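
For illustration, per-task time and cost could be derived like this. The helper names are hypothetical, and the per-million-token prices in the example are placeholders, not real pricing for any model.

```python
import time


def timed(fn, *args, **kwargs):
    """Run fn and return (result, elapsed_seconds) using a monotonic clock."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start


def api_cost(prompt_tokens, completion_tokens, usd_per_m_input, usd_per_m_output):
    """Cost in USD, assuming separate prices per million input and output tokens."""
    total = prompt_tokens * usd_per_m_input + completion_tokens * usd_per_m_output
    return total / 1_000_000


# Placeholder prices: 1.0 USD per million input tokens, 4.0 per million output tokens.
cost = api_cost(200_000, 50_000, usd_per_m_input=1.0, usd_per_m_output=4.0)
result, elapsed = timed(sum, [1, 2, 3])
print(f"cost={cost:.2f} USD, elapsed={elapsed:.6f}s")
```

`time.perf_counter` is used rather than wall-clock time because it is monotonic and therefore unaffected by system clock adjustments during a long run.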

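Dataset Task Categories

Single-task Benchmark: measures coding/sysadmin ability on a single task; score: pass/fail
Long-horizon planning Benchmark: evaluates how coherent the Agent's planning stays over many turns (100+); score: 0.5 × survival + 0.5 × normalised_funds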
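
The two scoring rules can be written out directly. In this sketch, `survival` is assumed to be a boolean treated as 0 or 1 and `normalised_funds` is assumed to be already scaled into [0, 1]; how the funds are normalised is not specified here, so that step is left out.

```python
def single_task_score(passed: bool) -> float:
    """Pass/fail scoring: 1.0 for a pass, 0.0 for a fail."""
    return 1.0 if passed else 0.0


def long_horizon_score(survived: bool, normalised_funds: float) -> float:
    """0.5 * survival + 0.5 * normalised_funds.

    Assumptions: survival counts as 1 if the agent survived the run and 0
    otherwise, and normalised_funds has already been scaled into [0, 1].
    """
    if not 0.0 <= normalised_funds <= 1.0:
        raise ValueError("normalised_funds must be in [0, 1]")
    return 0.5 * float(survived) + 0.5 * normalised_funds


print(long_horizon_score(True, 0.5))   # survived with half of the maximum funds
print(long_horizon_score(False, 1.0))  # did not survive, funds term at its maximum
```

With equal weights, an agent that survives but ends with no funds scores the same as one that does not survive but has a funds term of 1, so neither term alone can push the score above 0.5.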

How the Evaluation Works

Benchmark = dataset tasks + HermesAgentLoop multi-turn tool calling + verifier/reward + metrics/logging
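
The composition above can be sketched as a small harness. This is an assumption-heavy illustration: `run_agent` is a stand-in callable for the agent loop (the real HermesAgentLoop interface is not shown in this post), and the task dictionary shape, `verify` signature, and pass rule (`reward >= 1.0`) are all hypothetical.

```python
import time


def run_benchmark(tasks, run_agent, verify):
    """Run every task through the agent, score it, and collect metrics.

    tasks:     list of dicts with "id" and "prompt" keys (assumed shape)
    run_agent: callable(prompt) -> output, a stand-in for the agent loop
    verify:    callable(task, output) -> reward in [0, 1]
    """
    records = []
    for task in tasks:
        start = time.perf_counter()
        output = run_agent(task["prompt"])
        elapsed = time.perf_counter() - start
        reward = verify(task, output)
        records.append({
            "task_id": task["id"],
            "reward": reward,
            "passed": reward >= 1.0,  # assumed pass rule
            "elapsed_s": elapsed,
        })
    n = len(records)
    metrics = {
        "pass_rate": sum(r["passed"] for r in records) / n if n else 0.0,
        "avg_reward": sum(r["reward"] for r in records) / n if n else 0.0,
    }
    return records, metrics


# Toy stand-ins so the sketch runs end to end.
tasks = [
    {"id": "t1", "prompt": "echo hello", "expected": "ECHO HELLO"},
    {"id": "t2", "prompt": "echo world", "expected": "something else"},
]


def fake_agent(prompt):
    return prompt.upper()


def exact_match(task, output):
    return 1.0 if output == task["expected"] else 0.0


records, metrics = run_benchmark(tasks, fake_agent, exact_match)
print(metrics)
```

Passing the agent and the verifier in as callables keeps the harness independent of any one dataset, which matches the way the formula treats the four parts as separable.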

References

- Available Benchmarks

ai, hermes-agent


This post is licensed under CC BY 4.0 by the author.