agent-evaluation

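Testing and benchmarking for LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring; even top agents score below 50% on real-world benchmarks. Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent.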

3 installs · 0 popularity · @sebas-aikon-intelligence

Install

$ npx skills add https://github.com/sebas-aikon-intelligence/antigravity-awesome-skills --skill agent-evaluation

SKILL.md

You're a quality engineer who has seen agents that aced benchmarks fail spectacularly in production. You've learned that evaluating LLM agents is fundamentally different from testing traditional software—the same input can produce different outputs, and "correct" often has no single answer.
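Because the same input can produce different outputs, a single pass or fail says little about an agent. One common way to handle this, sketched here as an illustration rather than as the skill's own method, is to repeat a test case and report a pass rate. The names `pass_rate`, `make_fake_agent`, and the 0.8 success probability are hypothetical stand-ins; a real setup would call an actual agent.

```python
import random
from typing import Callable


def pass_rate(run_agent: Callable[[str], str],
              check: Callable[[str], bool],
              prompt: str,
              trials: int = 20) -> float:
    """Run the same prompt `trials` times and return the fraction of passing runs."""
    passed = sum(1 for _ in range(trials) if check(run_agent(prompt)))
    return passed / trials


def make_fake_agent(seed: int) -> Callable[[str], str]:
    """Hypothetical stand-in for a real agent: answers correctly about 80% of the time."""
    rng = random.Random(seed)

    def agent(prompt: str) -> str:
        return "4" if rng.random() < 0.8 else "5"

    return agent


def is_correct(output: str) -> bool:
    """Checker for the toy task below; real checks are usually less strict."""
    return output.strip() == "4"


rate = pass_rate(make_fake_agent(0), is_correct, "What is 2 + 2?", trials=50)
print(f"pass rate over 50 trials: {rate:.2f}")
```

A regression test can then assert that the pass rate stays above a chosen threshold instead of requiring every run to pass.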

You've built evaluation frameworks that catch issues before production: behavioral regression tests, capability assessments, and reliability metrics. You understand that the goal isn't 100% test pass rate—it

| Issue | Severity | Solution |
| --- | --- | --- |
| Agent scores well on benchmarks but fails in production | high | // Bridge benchmark and production evaluation |
| Same test passes sometimes, fails other times | high | // Handle flaky tests in LLM agent evaluation |
| Agent optimized for metric, not actual task | medium | // Multi-dimensional evaluation to prevent gaming |
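As an illustration of the multi-dimensional evaluation idea, the sketch below gates a release on a separate floor for each dimension, not on a single averaged score. The dimension names (`task_success`, `safety`, `efficiency`) and the 0.8 floors are assumptions chosen for the example; they do not come from the skill itself.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class EvalResult:
    """Hypothetical per-dimension scores, each in the range 0.0 to 1.0."""
    task_success: float
    safety: float
    efficiency: float


def failing_dimensions(result: EvalResult, minimums: Dict[str, float]) -> List[str]:
    """Return the dimensions whose score falls below its required minimum."""
    return [name for name, floor in minimums.items()
            if getattr(result, name) < floor]


# Assumed floors: every dimension must reach at least 0.8.
minimums = {"task_success": 0.8, "safety": 0.8, "efficiency": 0.8}

# This result averages above 0.8, yet safety alone is well below its floor.
result = EvalResult(task_success=0.95, safety=0.6, efficiency=0.9)
average = (result.task_success + result.safety + result.efficiency) / 3
failing = failing_dimensions(result, minimums)

print(f"average score: {average:.2f}")
print(f"failing dimensions: {failing}")
```

An agent tuned to maximize one metric can lift the average while another dimension degrades; per-dimension floors make that regression visible.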


Citable information

Stable fields and commands prepared for search and AI citation.

Install command
npx skills add https://github.com/sebas-aikon-intelligence/antigravity-awesome-skills --skill agent-evaluation
Category
Developer tools
Certification
Indexed on
2026-02-01
Updated on
2026-02-18

Quick answers

What is agent-evaluation?

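Testing and benchmarking for LLM agents, including behavioral testing, capability assessment, reliability metrics, and production monitoring; even top agents score below 50% on real-world benchmarks. Use when: agent testing, agent evaluation, benchmark agents, agent reliability, test agent. Source: sebas-aikon-intelligence/antigravity-awesome-skills.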

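How do I install agent-evaluation?

Open your terminal or command-line tool (such as Terminal, iTerm, or Windows Terminal). Copy and run the following command: npx skills add https://github.com/sebas-aikon-intelligence/antigravity-awesome-skills --skill agent-evaluation. Once installation finishes, the skill is configured automatically in your AI coding environment and can be used in Claude Code or Cursor.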

Where is the source code for this skill?

https://github.com/sebas-aikon-intelligence/antigravity-awesome-skills