Evaluating and Benchmarking Pentest Agents
If you can't measure an offensive agent repeatably, you're doing demos, not engineering. Our harness tracks success rate, cost per flag, and scope adherence.
Letting an agent run exploits demands serious isolation. When seccomp-bpf is enough, and when we stack gVisor on top.
A pentest does not fit in a context window. How we give our agents operational memory that survives across phases without leaking scope.
Flat RAG retrieves paragraphs; an operation thinks in relationships. We wired Beorn to an ATT&CK knowledge graph to decide the next move.
Why a single LLM cannot run an entire pentest end-to-end, and how we extended the Thought-Action-Observation loop to coordinate agents in Gandalf CLI.
How we built minimal seccomp-bpf profiles so that the exploits an LLM runs don't turn into an accidental rm -rf on the host.
Why a red team without traces is indefensible: how we instrument every decision of our agent with OpenTelemetry, eBPF and spans mapped to MITRE ATT&CK.
How a kill-switch, a seccomp-bpf filter, and CIDR rules cut off the silent leak of an LLM agent in a lab with no internet. Lessons from the field.
We train a PPO agent to turn crashes into control-flow hijacking. eBPF-based rewards, honest failures, and real code. What we learned along the way.
Indexing 9115 HTB writeups isn't building a search engine: it's giving operational memory to an agent in the middle of an exploit. Here's what we learned.
Why we make three agents (critical, evidential, and technical) debate each finding before closing it, with real metrics and trade-offs.
How we combined AFL++ with LLM-generated seeds in Gwaihir CLI to fuzz complex parsers without drowning in initial validation crashes.