
Evaluating and Benchmarking Pentest Agents

By Team Berialabs • May 29, 2026 • 1 min read

"The agent found a vuln" is not a metric. If you can't measure an offensive agent repeatably, you're not doing engineering; you're doing demos. We built an evaluation harness so every change to Gandalf earns its place with numbers.

Firing ranges, not production

We evaluate against versioned, disposable environments with known flags: Hack The Box (HTB)-style boxes, deliberately vulnerable apps, and our own scenarios. Each scenario declares its solution, so success is verifiable rather than a matter of opinion.
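To make "declares its solution" concrete, here is a minimal sketch of how a scenario could carry a verifiable flag. The `Scenario` schema, the field names and the choice to store a SHA-256 digest instead of the plaintext flag are all assumptions for illustration, not a description of the actual harness.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """A versioned, disposable target with a declared solution (hypothetical schema)."""
    name: str
    version: str
    flag_sha256: str  # digest of the expected flag, so the plaintext never ships


def flag_digest(flag: str) -> str:
    # Strip whitespace so a trailing newline from `cat flag.txt` still matches.
    return hashlib.sha256(flag.strip().encode("utf-8")).hexdigest()


def verify(scenario: Scenario, submitted_flag: str) -> bool:
    """Success is a digest match, not a judgment call."""
    return hmac.compare_digest(flag_digest(submitted_flag), scenario.flag_sha256)


scenario = Scenario(
    name="ctf-linux-privesc",  # hypothetical scenario name
    version="3.0.0",           # hypothetical version
    flag_sha256=flag_digest("FLAG{example-only}"),
)

print(verify(scenario, "FLAG{example-only}\n"))  # True
print(verify(scenario, "FLAG{guess}"))           # False
```

Pinning the digest to a scenario version means a result can always be traced back to the exact environment it was measured on.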

Metrics that matter

Success rate: did it capture the objective within the step budget?
Cost per flag: tokens, time and tool calls.
Step efficiency: useful actions vs dead ends.
Scope adherence: zero out-of-bounds actions; a single one is a critical failure.

bench run --suite ctf-linux-v3 --agent gandalf@pr-482 \
  --trials 20 --seed 1337 --budget-steps 60
# success 14/20 | median 31 steps | $0.42/flag | scope viol: 0
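The four metrics above can be derived from per-trial records along these lines. This is a sketch under stated assumptions: the `Trial` fields are hypothetical, cost per flag is taken as total spend divided by flags captured (so failed trials still count against the agent), and median steps is computed over successful trials only. The real harness may define any of these differently.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Trial:
    captured: bool         # objective captured within the step budget
    steps: int             # total actions taken
    useful_steps: int      # actions that advanced toward the flag
    cost_usd: float        # spend for the trial
    scope_violations: int  # out-of-bounds actions


def summarize(trials: list[Trial]) -> dict:
    wins = [t for t in trials if t.captured]
    total_cost = sum(t.cost_usd for t in trials)
    total_steps = sum(t.steps for t in trials)
    violations = sum(t.scope_violations for t in trials)
    return {
        "success_rate": len(wins) / len(trials),
        # Assumption: median is taken over successful trials only.
        "median_steps": median(t.steps for t in wins) if wins else None,
        # Assumption: failed trials still cost money, so divide total spend by flags.
        "cost_per_flag": total_cost / len(wins) if wins else float("inf"),
        "step_efficiency": (
            sum(t.useful_steps for t in trials) / total_steps if total_steps else 0.0
        ),
        "scope_violations": violations,
        # A single out-of-bounds action fails the whole run.
        "critical_failure": violations > 0,
    }


# Made-up trial data, for illustration only.
trials = [
    Trial(True, 30, 24, 0.40, 0),
    Trial(False, 60, 20, 0.80, 0),
    Trial(True, 40, 30, 0.60, 0),
    Trial(True, 20, 16, 0.20, 0),
]
report = summarize(trials)
print(report["success_rate"], report["median_steps"], report["critical_failure"])
```

Treating scope violations as a boolean gate rather than a score to be averaged keeps one bad action from being diluted by nineteen clean trials.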

Against regression and luck

LLMs are stochastic, so each configuration runs N times with fixed seeds and we report median and variance, not the best run. Every PR is compared against the baseline; if success drops or cost spikes, it doesn't land.
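A regression gate of this kind might look roughly like the following. It is a sketch, not the harness itself: the function names, the derivation of per-trial seeds from one base seed, and the tolerance thresholds (`max_success_drop`, `max_cost_increase`) are assumptions chosen for illustration.

```python
import random
from statistics import median, variance


def trial_seeds(base_seed: int, n: int) -> list[int]:
    """Derive n per-trial seeds from one base seed so reruns are reproducible."""
    rng = random.Random(base_seed)
    return [rng.randrange(2**32) for _ in range(n)]


def summarize(successes: list[bool], costs: list[float]) -> dict:
    """Report the center and the spread across trials, never the best run."""
    return {
        "success_rate": sum(successes) / len(successes),
        "median_cost": median(costs),
        "cost_variance": variance(costs),  # sample variance; needs at least 2 trials
    }


def gate(baseline: dict, candidate: dict,
         max_success_drop: float = 0.05,    # hypothetical tolerance
         max_cost_increase: float = 0.10) -> list[str]:  # hypothetical tolerance
    """Return the reasons a PR should not land; an empty list means it passes."""
    reasons = []
    if candidate["success_rate"] < baseline["success_rate"] - max_success_drop:
        reasons.append("success rate dropped")
    if candidate["median_cost"] > baseline["median_cost"] * (1 + max_cost_increase):
        reasons.append("cost spiked")
    return reasons


# Made-up numbers, for illustration only.
baseline = summarize([True, True, True, False], [0.40, 0.42, 0.44, 0.50])
candidate = summarize([True, False, False, False], [0.70, 0.72, 0.75, 0.90])
print(gate(baseline, candidate))  # ['success rate dropped', 'cost spiked']
```

Running the baseline and the candidate on the same seed list means both face the same sequence of trials, so a difference in the summary is more likely to come from the change than from the draw.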

What we ship

Every Gandalf release comes with a scorecard: success, cost and scope violations against the previous version. Without verifiable numbers, an improvement is just an anecdote.

Team Berialabs

Berialabs team member specializing in AI-assisted offensive security.

