Evaluating and Benchmarking Pentest Agents

"The agent found a vuln" is not a metric. If you can't measure an offensive agent repeatably, you're not doing engineering — you're doing demos. We built an evaluation harness so every change to Gandalf earns its place with numbers.

Firing ranges, not production

We evaluate against versioned, disposable environments with known flags: HTB-style boxes, vulnerable apps and our own scenarios. Each scenario declares its solution, so success is verifiable rather than a matter of opinion.
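As a minimal sketch of what "declares its solution" could look like, the Python below stores a digest of the expected flag in the scenario definition and checks a submission against it. The `Scenario` class, its field names and the example flag are hypothetical illustrations, not Gandalf's actual scenario format.

```python
import hashlib
import hmac
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    """Hypothetical scenario spec; field names are illustrative only."""
    name: str
    version: str
    flag_sha256: str  # digest of the expected flag, so the flag itself is not stored


def flag_digest(flag: str) -> str:
    """Normalize surrounding whitespace and hash the flag."""
    return hashlib.sha256(flag.strip().encode("utf-8")).hexdigest()


def is_solved(scenario: Scenario, submitted_flag: str) -> bool:
    """True only if the submitted flag matches the declared solution."""
    return hmac.compare_digest(flag_digest(submitted_flag), scenario.flag_sha256)


# Made-up scenario and flag, for illustration.
scenario = Scenario(
    name="example-box",
    version="1",
    flag_sha256=flag_digest("FLAG{example}"),
)

print(is_solved(scenario, "FLAG{example}\n"))
print(is_solved(scenario, "FLAG{guess}"))
```

Because the check is a pure function of the scenario and the submission, two people running the same trial reach the same verdict.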

Metrics that matter

  • Success rate: did it capture the objective within the step budget?
  • Cost per flag: tokens, time and tool calls.
  • Step efficiency: useful actions vs dead ends.
  • Scope adherence: zero out-of-bounds actions; a single one is a critical failure.

bench run --suite ctf-linux-v3 --agent gandalf@pr-482 \
  --trials 20 --seed 1337 --budget-steps 60
# success 14/20 | median 31 steps | $0.42/flag | scope viol: 0
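The four metrics above can be computed from per-trial records roughly as follows. This is a sketch under stated assumptions: the `Trial` fields, the definition of cost per flag (total spend across all trials divided by flags captured) and the CIDR-based scope check are illustrative choices, not a description of how `bench` works internally. The trial data is synthetic.

```python
import ipaddress
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Trial:
    """Hypothetical per-trial record."""
    solved: bool
    steps: int           # total actions taken
    dead_end_steps: int  # actions judged not to advance the objective
    cost_usd: float
    targets: tuple       # IP addresses the agent touched


def scope_violations(trial, allowed_cidrs):
    """Count touched addresses that fall outside every allowed network."""
    nets = [ipaddress.ip_network(c) for c in allowed_cidrs]
    return sum(
        1
        for ip in trial.targets
        if not any(ipaddress.ip_address(ip) in net for net in nets)
    )


def summarize(trials, allowed_cidrs):
    solved = [t for t in trials if t.solved]
    total_steps = sum(t.steps for t in trials)
    useful_steps = total_steps - sum(t.dead_end_steps for t in trials)
    total_cost = sum(t.cost_usd for t in trials)
    return {
        "success_rate": len(solved) / len(trials),
        "median_steps_solved": median([t.steps for t in solved]) if solved else None,
        # Failed runs still cost money, so they count toward the price of each flag.
        "cost_per_flag": total_cost / len(solved) if solved else None,
        "step_efficiency": useful_steps / total_steps if total_steps else 0.0,
        "scope_violations": sum(scope_violations(t, allowed_cidrs) for t in trials),
    }


# Synthetic trials, for illustration only.
trials = [
    Trial(True, 30, 6, 0.40, ("10.0.0.5",)),
    Trial(True, 20, 4, 0.30, ("10.0.0.5",)),
    Trial(False, 60, 40, 0.50, ("10.0.0.5", "8.8.8.8")),
    Trial(True, 40, 10, 0.60, ("10.0.0.7",)),
]
summary = summarize(trials, allowed_cidrs=["10.0.0.0/24"])
print(summary)
```

Any nonzero `scope_violations` would be treated as a hard failure regardless of the other numbers, in line with the rule that a single out-of-bounds action is critical.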

Against regression and luck

LLMs are stochastic, so each configuration runs N times with fixed seeds and we report median and variance, not the best run. Every PR is compared against the baseline; if success drops or cost spikes, it doesn't land.
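A baseline comparison of this kind might be sketched as below. The `RunStats` shape, the `gate` function and both thresholds (no tolerated success drop, 10% tolerated increase in median cost) are assumptions chosen for illustration; the numbers are made up.

```python
from dataclasses import dataclass
from statistics import median, variance


@dataclass(frozen=True)
class RunStats:
    """Hypothetical aggregate of N seeded trials for one configuration."""
    successes: int
    trials: int
    costs_usd: tuple  # one cost per trial


def describe(stats):
    return {
        "success_rate": stats.successes / stats.trials,
        "median_cost": median(stats.costs_usd),
        # Sample variance; requires at least two trials.
        "cost_variance": variance(stats.costs_usd),
    }


def gate(baseline, candidate, max_success_drop=0.0, max_cost_increase=0.10):
    """Return (passed, reasons). Thresholds are illustrative defaults."""
    b, c = describe(baseline), describe(candidate)
    reasons = []
    if c["success_rate"] < b["success_rate"] - max_success_drop:
        reasons.append(
            f"success rate fell from {b['success_rate']:.2f} to {c['success_rate']:.2f}"
        )
    if c["median_cost"] > b["median_cost"] * (1 + max_cost_increase):
        reasons.append(
            f"median cost rose from {b['median_cost']:.2f} to {c['median_cost']:.2f}"
        )
    return (not reasons, reasons)


# Made-up numbers, for illustration only.
baseline = RunStats(successes=14, trials=20, costs_usd=(0.40, 0.42, 0.45))
candidate = RunStats(successes=12, trials=20, costs_usd=(0.40, 0.42, 0.45))
passed, reasons = gate(baseline, candidate)
print(passed, reasons)
```

Comparing medians rather than best runs keeps one lucky trial from carrying a PR, and reporting variance alongside shows how much of a difference could be noise at a given N.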

What we ship

Every Gandalf release comes with a scorecard: success, cost and scope violations against the previous version. Without verifiable numbers, an improvement is just an anecdote.