ReAct Prompting for Kill Chain Orchestration
The first time we tried to have a single model orchestrate a complete pentest campaign, it failed at minute forty. Not for lack of technical capability. The model knew perfectly well what to do with an exposed Redis, what to do with an unauthenticated Jenkins, and how to chain SSRF with cloud metadata. The problem was something else, more boring: forty pages of context in, the agent had forgotten the scope we gave it at the beginning. It started scanning out-of-scope ranges, hallucinated two tools that don't exist in our toolbox (nmap-fast-aggressive and burp-pro-headless), and proposed exploiting a CVE in software the client didn't even have installed.
That night we rewrote a good part of the orchestrator. What came out of it is the foundation of today's Gandalf CLI engine: a fairly stubborn extension of the ReAct framework adapted to long, multi-phase campaigns, with several models collaborating under the same roof. This article is an attempt to explain why we arrived at that design, what we borrowed, what we had to invent ourselves, and where we keep falling on our faces.
ReAct, without marketing
ReAct was published by Yao and colleagues in late 2022 (Yao et al., 2022)[1]. The idea is deceptively simple. Instead of asking the model just for an answer, or just for an action, you ask it to interleave three things: a Thought (explicit natural-language reasoning about what needs to be done), an Action (the call to an external tool, normally well-typed) and an Observation (the output that the tool returns, which goes back into the context). Loop. Another turn. And so on until the model decides it has enough.
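As a rough illustration of that loop, here is a minimal sketch. The `policy` callable stands in for the model call, and `scripted_policy`, `lookup_ports`, and the `finish` action name are hypothetical placeholders, not anything from the ReAct paper or from our engine.

```python
def react_loop(policy, tools, task, max_turns=8):
    """Run Thought/Action/Observation turns until the policy emits 'finish'."""
    trace = [f"Task: {task}"]
    for _ in range(max_turns):
        # The policy sees the whole trace so far and returns the next step.
        thought, action, arg = policy(trace)
        trace.append(f"Thought: {thought}")
        trace.append(f"Action: {action}({arg})")
        if action == "finish":
            return arg, trace
        # The tool output goes back into the context as an Observation.
        observation = tools[action](arg)
        trace.append(f"Observation: {observation}")
    raise RuntimeError("turn budget exhausted without a final answer")


def scripted_policy(trace):
    # Stand-in for an LLM: act once, then finish after the first observation.
    if not any(line.startswith("Observation:") for line in trace):
        return ("I need the open ports first.", "lookup_ports", "10.42.0.17")
    return ("I have enough.", "finish", "ports known")


tools = {"lookup_ports": lambda host: "22,80,443,8080"}
answer, trace = react_loop(scripted_policy, tools, "enumerate services")
```

The `max_turns` cap is a design choice of the sketch: a loop that only ends when the model decides it has enough needs some external budget.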
The important thing is that the reasoning appears as text in the rollout itself. That solves two things at once. On one hand, the model gains the benefit of Chain-of-Thought (Wei et al., 2022)[2], that is, decomposing a problem before acting. On the other, it doesn't get trapped in its own head: every few iterations it has to touch the real world and come back with real observations. In the paper's own experiments, that grounding cut hallucination relative to pure CoT.
For a HotpotQA question or a WebShop shopping task it's perfect. For running a pentest, it isn't.
Where ReAct breaks when the campaign lasts more than an hour
We hit three failures in practically every test we ran with vanilla ReAct against real targets.
The first is compaction. A Thought/Action/Observation trace in a decent recon session takes up much more than it seems. A single nmap -sV -p- against a reasonable /24 is several thousand tokens in the Observation. Multiply by twenty hosts, mix with HTTP fingerprinting, and after two hours the context is saturated. When the orchestrator starts trimming old messages to fit the new ones, the first thing usually lost is precisely what matters most: rules of engagement, the initial hypothesis, the client's restrictions.
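One way to keep the rules of engagement from being the first casualty is to pin them so the trimmer can never drop them. The sketch below assumes a hypothetical `Message` type with a `pinned` flag and a crude four-characters-per-token estimate; a real orchestrator would use the model's own tokenizer.

```python
from dataclasses import dataclass


@dataclass
class Message:
    text: str
    pinned: bool = False  # scope, rules of engagement, initial hypothesis


def estimate_tokens(text):
    # Crude assumption: roughly 4 characters per token.
    return max(1, len(text) // 4)


def trim_context(messages, budget):
    """Keep every pinned message, then the newest unpinned ones that fit."""
    pinned_cost = sum(estimate_tokens(m.text) for m in messages if m.pinned)
    if pinned_cost > budget:
        raise ValueError("pinned messages alone exceed the budget")
    remaining = budget - pinned_cost
    keep = []
    full = False
    for m in reversed(messages):  # walk from newest to oldest
        if m.pinned:
            keep.append(m)
        elif not full and estimate_tokens(m.text) <= remaining:
            remaining -= estimate_tokens(m.text)
            keep.append(m)
        else:
            full = True  # everything older and unpinned is dropped
    keep.reverse()  # restore chronological order
    return keep


history = [
    Message("SCOPE: only 10.42.0.0/24", pinned=True),
    Message("x" * 400),  # old scan output
    Message("y" * 400),  # newer scan output
    Message("z" * 40),   # latest observation
]
trimmed = trim_context(history, budget=120)
```

With this budget the oldest scan output is dropped while the scope line survives, which is the opposite of what naive oldest-first trimming does.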
The second is drift. Without an objective explicitly re-anchored on each turn, the model wanders off. It starts with "find an initial vector" and ends up enumerating subdomains of a provider that was never in scope. PentestGPT (Deng et al., 2023)[3] documented this problem quite well and proposed splitting the work into modules. We arrived at a similar conclusion on our own.
The third is tool hallucination. When the model has gone through many iterations, it starts inventing actions that aren't in the toolbox. masscan-with-banners, burp-active-scan, nuclei-cloud-edition. Plausible names that don't exist. If your runner is strict, it fails hard. If it's lax, worse: it does something similar but not what you asked for.
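The strict behavior can be sketched in a few lines. The `dispatch` function and `UnknownToolError` are hypothetical names for illustration; the point is that only an exact registry match runs, and there is no fuzzy fallback to a similar tool.

```python
class UnknownToolError(Exception):
    """Raised when the model asks for a tool that is not registered."""


def dispatch(toolbox, name, args):
    """Strict runner: execute exact registry matches only, never a near match."""
    if name not in toolbox:
        raise UnknownToolError(
            f"'{name}' is not in the toolbox; available: {sorted(toolbox)}"
        )
    return toolbox[name](**args)


toolbox = {"nmap": lambda target: f"scanned {target}"}

ok = dispatch(toolbox, "nmap", {"target": "10.42.0.17"})

try:
    dispatch(toolbox, "masscan-with-banners", {"target": "10.42.0.17"})
    rejected = False
except UnknownToolError as err:
    rejected = True
    message = str(err)  # can be fed back to the model as an Observation
```

Returning the error text as the next Observation, with the list of tools that do exist, gives the model a chance to correct itself instead of ending the run.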
What we added: phase handoff and typed artifacts
The key intuition for us was to treat the kill chain as a state machine, not as a single loop. Each phase (Recon, Planning, Execution, Reporting) has its own agent, its own prompt and its own model. And between phases no raw context is passed: artifacts are passed.
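A minimal sketch of that state machine follows, under stated assumptions: the `Phase` enum, the `TRANSITIONS` table, and `advance` are illustrative names, and the artifact is a plain dict here. The edge from execution back to planning reflects replanning after a failed step.

```python
from enum import Enum


class Phase(Enum):
    RECON = "recon"
    PLANNING = "planning"
    EXECUTION = "execution"
    REPORTING = "reporting"


# Allowed handoffs. Execution may hand control back to planning.
TRANSITIONS = {
    Phase.RECON: {Phase.PLANNING},
    Phase.PLANNING: {Phase.EXECUTION},
    Phase.EXECUTION: {Phase.PLANNING, Phase.REPORTING},
    Phase.REPORTING: set(),
}


def advance(current, target, artifact):
    """Move to the next phase only with a legal edge and a matching artifact."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal handoff: {current.value} -> {target.value}")
    if artifact.get("phase") != current.value:
        raise ValueError("artifact was not produced by the closing phase")
    return target


state = advance(Phase.RECON, Phase.PLANNING, {"phase": "recon", "host": "10.42.0.17"})
```

Because the only thing `advance` accepts is an artifact, there is no code path through which a raw trace can cross a phase boundary.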
An artifact is a typed structure that one agent produces when closing its phase and another consumes when opening its own. It's not the complete trace. It's not the intermediate thoughts. It's the distilled conclusion, validated by a second review step before advancing.
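To make "typed" concrete, here is one possible shape for a recon artifact as a frozen dataclass that validates itself on construction. The class name, the service string format, and the specific checks are assumptions made for illustration, not the actual Gandalf CLI schema.

```python
import ipaddress
import re
from dataclasses import dataclass

CVE_RE = re.compile(r"^CVE-\d{4}-\d{4,}$")


@dataclass(frozen=True)
class ReconArtifact:
    host: str
    services: tuple
    candidate_cves: tuple
    confidence: str
    evidence: str
    phase: str = "recon"

    def __post_init__(self):
        # ip_address raises ValueError on a malformed address.
        ipaddress.ip_address(self.host)
        if self.confidence not in ("low", "medium", "high"):
            raise ValueError(f"unknown confidence level: {self.confidence}")
        for cve in self.candidate_cves:
            if not CVE_RE.match(cve):
                raise ValueError(f"malformed CVE identifier: {cve}")
        if self.candidate_cves and not self.evidence.strip():
            raise ValueError("candidate CVEs require supporting evidence")


artifact = ReconArtifact(
    host="10.42.0.17",
    services=("ssh/22", "http/80", "https/443", "jenkins/8080"),
    candidate_cves=("CVE-2024-23897",),
    confidence="high",
    evidence="banner X-Jenkins: 2.346.1",
)
```

Freezing the dataclass means the consuming agent cannot quietly edit what the producing agent concluded.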
Today the default distribution in Gandalf CLI is the following. Beorn handles reconnaissance, where we want a cheap model that is resistant to noisy observations. Gandalf handles planning, where we need the finest reasoning available and accept paying for it in latency. Gwaihir handles execution, where we prioritize throughput and rapid response to tool feedback. Each one has its own adapted system prompt and, most importantly, its own restricted toolbox. Beorn cannot call exploits. Gwaihir cannot modify the scope.
The latter is deliberate. Restricting the toolbox per phase is the cheapest way we found to cut tool hallucination. The model doesn't invent what it doesn't see.
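A per-phase allowlist can be as simple as the sketch below. The `PHASE_TOOLS` mapping is an assumed layout, and `http-fingerprint` is a made-up tool name for illustration. The same mapping drives both what the prompt advertises and what the runner accepts.

```python
PHASE_TOOLS = {
    "recon": frozenset({"nmap", "http-fingerprint", "emit_artifact"}),
    "planning": frozenset({"emit_plan"}),
    "execution": frozenset({"jenkins-cli-cve"}),
}


def render_tool_prompt(phase):
    """Text for the system prompt: the model only ever sees its own tools."""
    names = sorted(PHASE_TOOLS[phase])
    return "You may call only these tools: " + ", ".join(names)


def is_allowed(phase, tool):
    """Runner-side check, in case the model names a tool from another phase."""
    return tool in PHASE_TOOLS.get(phase, frozenset())


recon_prompt = render_tool_prompt("recon")
```

Checking on the runner side as well as in the prompt matters: hiding a tool reduces how often the model names it, but only the runner check guarantees it never executes.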
A traced example
Here is a sample. A Beorn agent closes the reconnaissance phase on a host with this:
Thought: I have completed the service scan. The host exposes
SSH (22), HTTP (80), HTTPS (443) and a Jenkins on 8080 with
header X-Jenkins: 2.346.1. That build has CVE-2024-23897
(arbitrary file read pre-auth). This is relevant.
Action: emit_artifact({
"phase": "recon",
"host": "10.42.0.17",
"services": [...],
"candidate_cves": ["CVE-2024-23897"],
"confidence": "high",
"evidence": "banner X-Jenkins: 2.346.1"
})
Observation: artifact_id=art_8c1a stored. handoff ready.
The artifact moves to Gandalf, which enters planning. Gandalf doesn't see Beorn's full trace. It only receives the artifact, the rules of engagement, and the current inventory. Its job is to decide the vector.
Thought: CVE-2024-23897 allows reading arbitrary files
without authentication via the Jenkins CLI. The client's
objective is to validate access to credentials. The natural
path is to read /var/jenkins_home/secrets/master.key and then
credentials.xml. High confidence, low noise. Accepted.
Action: emit_plan({
"vector": "cve-2024-23897-file-read",
"steps": [
{"id": 1, "tool": "jenkins-cli-cve", "target": "10.42.0.17:8080",
"args": {"file": "/var/jenkins_home/secrets/master.key"}},
{"id": 2, "tool": "jenkins-cli-cve", "target": "10.42.0.17:8080",
"args": {"file": "/var/jenkins_home/credentials.xml"}}
],
"abort_if": ["http_status != 200", "response_size < 16"]
})
Observation: plan_id=pln_2f9b validated. handoff to execution.
Gwaihir picks up the plan, executes step by step, returns raw and normalized evidence. If a step fails, it doesn't improvise: it returns control to Gandalf for replanning. That decision, not letting the execution agent improvise, is one of the things that has reduced false positives the most for us.
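The no-improvisation rule can be sketched as an executor that stops at the first step that trips an abort condition and reports back instead of trying an alternative. The two conditions mirror the `abort_if` entries in the plan above, but encoding them as a Python predicate, the shape of the result dict, and the `fake_run_tool` stub are all assumptions of this sketch.

```python
def should_abort(result):
    # Mirrors "http_status != 200" and "response_size < 16" from the plan.
    return result["http_status"] != 200 or result["response_size"] < 16


def execute_plan(plan, run_tool):
    """Run steps in order; on the first abort, hand control back for replanning."""
    evidence = []
    for step in plan["steps"]:
        result = run_tool(step["tool"], step["target"], step["args"])
        if should_abort(result):
            return {"status": "replan", "failed_step": step["id"], "evidence": evidence}
        evidence.append({"step": step["id"], "result": result})
    return {"status": "done", "failed_step": None, "evidence": evidence}


def fake_run_tool(tool, target, args):
    # Stub for illustration: the first file reads fine, the second returns 404.
    if args["file"].endswith("master.key"):
        return {"http_status": 200, "response_size": 256}
    return {"http_status": 404, "response_size": 0}


plan = {
    "steps": [
        {"id": 1, "tool": "jenkins-cli-cve", "target": "10.42.0.17:8080",
         "args": {"file": "/var/jenkins_home/secrets/master.key"}},
        {"id": 2, "tool": "jenkins-cli-cve", "target": "10.42.0.17:8080",
         "args": {"file": "/var/jenkins_home/credentials.xml"}},
    ]
}
outcome = execute_plan(plan, fake_run_tool)
```

The evidence gathered before the failure travels back with the replan request, so the planner does not start from zero.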
Cross-validation between phases
Between the output of one agent and the input of the next we insert a verification step that the original agent does not perform. For critical artifacts (candidate CVEs, credentials found, exploitation decisions), a second model, usually Gandalf itself in critical mode, reviews whether the artifact is well-formed, whether the evidence justifies the conclusion, and whether the next step stays in scope.
It's expensive in tokens. We accept it. The alternative we tried at first, trusting the self-consistency of the agent itself, didn't work. The model tends to agree with itself.
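The three review questions can be sketched as a gate that runs the cheap deterministic checks first and only then spends tokens on the second model. The `reviewer` callable stands in for that model, and the field names and the `10.42.0.0/24` scope are illustrative assumptions.

```python
import ipaddress

REQUIRED_FIELDS = ("phase", "host", "candidate_cves", "evidence")


def review_artifact(artifact, scope_cidrs, reviewer):
    """Return a list of problems; an empty list means the handoff may proceed."""
    problems = [f"missing field: {k}" for k in REQUIRED_FIELDS if k not in artifact]
    if problems:
        return problems  # not well-formed, no point checking further

    host = ipaddress.ip_address(artifact["host"])
    if not any(host in ipaddress.ip_network(cidr) for cidr in scope_cidrs):
        problems.append(f"host {artifact['host']} is out of scope")

    if artifact["candidate_cves"] and not artifact["evidence"]:
        problems.append("candidate CVEs without evidence")

    # Only call the (expensive) second model when the cheap checks pass.
    if not problems and not reviewer(artifact):
        problems.append("reviewer rejected: evidence does not justify the conclusion")
    return problems


good = {
    "phase": "recon",
    "host": "10.42.0.17",
    "candidate_cves": ["CVE-2024-23897"],
    "evidence": "banner X-Jenkins: 2.346.1",
}
verdict = review_artifact(good, ["10.42.0.0/24"], reviewer=lambda a: True)
```

Ordering the checks this way trims some of the token cost: malformed or out-of-scope artifacts are rejected before the reviewer model is ever invoked.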
Comparison with what already existed
We didn't invent anything radically new. What we did was mix pieces from various places.
Compared to pure Chain-of-Thought (Wei et al., 2022), we gain the capability of acting and verifying against the world. CoT by itself is a good prompt for closed problems, not for campaigns with external feedback.
Compared to Tree of Thoughts (Yao et al., 2023)[4], we simplify. ToT proposes exploring several reasoning branches in parallel with backtracking. It's elegant, but the cost in tokens and latency is prohibitive when a branch involves launching a real scanner against infrastructure. What we did adopt is the idea of evaluating candidates before committing.
Compared to AutoGPT and descendants, which pursue complete autonomy with a single agent, we go in the opposite direction. The field's review (Wang et al., 2023)[5] is clear: monolithic autonomous agents get stuck in loops, burn budget and lose track. Multi-agent with disciplined handoff gives us much more predictability.
Compared to PentestGPT (Deng et al., 2023), we share modularization by sub-task and the idea of isolating context. The main difference is that we type the artifacts and force cross-validation rather than letting the handoff between modules be free text.
Honest trade-offs
There are three permanent tensions that we don't resolve, only manage.
The first is latency vs. precision. Each handoff adds seconds. Each cross-validation adds more. A campaign that a monolithic agent would finish in fifteen minutes takes us forty. We have decided that's acceptable. The client prefers to wait and receive a report without hallucinations.
The second is token budget. Cross-validating means paying twice for critical decisions. We compensate by using cheap models for low-risk phases and reserving expensive models for planning. Without that segmentation the cost per engagement would be unsustainable.
The third is debugging. When a campaign fails, you have to reconstruct which agent decided what with what context. Logging the complete Thought/Action/Observation traces is non-negotiable. Without that you can't do a postmortem. But it takes up an indecent amount of disk.
Last month's failure
We close with a recent example, still warm. A campaign against a web perimeter. Beorn detected an endpoint that responded with a Server: Werkzeug/3.0.1 Python/3.11.4 header. Its artifact marked "possible Flask in debug mode" with medium confidence. Gandalf, in planning, decided to try the /console path expecting a PIN-protected debugger.
Up to that point, perfect. The problem came in Gwaihir. The toolbox had a werkzeug-debugger-pin-bruteforce module that was marked as deprecated in an internal branch but was still selectable. Gwaihir chose it, the module launched requests at a rate that the WAF flagged as abuse, and our source IP was blocked in the middle of the engagement.
The root cause was not the model. It was the toolbox. The tool registry had a deprecated: true flag that the loader was not respecting. The fix was a three-line change in the loader plus a regression test. But the failure taught us something deeper: cross-validation between phases assumes that the toolbox is correct. If the toolbox lies, the entire orchestrator lies.
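For illustration, a loader that honors the flag and the kind of regression test that would have caught the bug might look like the sketch below. The registry layout (a list of dicts with `name` and an optional `deprecated` key) and the function names are assumptions, not the actual Gandalf CLI code.

```python
def load_toolbox(registry):
    """Build the selectable toolbox, skipping entries flagged as deprecated."""
    return {
        entry["name"]: entry
        for entry in registry
        if not entry.get("deprecated", False)
    }


def test_deprecated_tools_are_not_selectable():
    registry = [
        {"name": "jenkins-cli-cve"},
        {"name": "werkzeug-debugger-pin-bruteforce", "deprecated": True},
    ]
    toolbox = load_toolbox(registry)
    assert "jenkins-cli-cve" in toolbox
    assert "werkzeug-debugger-pin-bruteforce" not in toolbox


test_deprecated_tools_are_not_selectable()
```

Treating a missing `deprecated` key as `False` keeps older registry entries loadable without edits.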
Since then we have an additional check when starting each agent: it lists the tools it will be able to call, prints them in the log, and a human signs off on the set the first time it runs against a new client. It's not elegant. But no agent, however well orchestrated, replaces a pair of eyes when there's production at stake.
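One way to make that sign-off checkable is to hash the tool set and compare it against the digest a human approved for that client. This is a sketch under assumptions: the `approvals` mapping, the client identifier `acme`, and the function names are hypothetical, and a real setup would store approvals somewhere more durable than a dict.

```python
import hashlib
import json


def toolset_digest(tool_names):
    """Order-independent SHA-256 fingerprint of the callable tool set."""
    canonical = json.dumps(sorted(tool_names)).encode("utf-8")
    return hashlib.sha256(canonical).hexdigest()


def check_signed(client, tool_names, approvals):
    """Log the tool set and report whether a human approved exactly this set."""
    digest = toolset_digest(tool_names)
    print(f"[{client}] tools={sorted(tool_names)} digest={digest[:12]}")
    return approvals.get(client) == digest


approved_set = ["nmap", "emit_artifact"]
approvals = {"acme": toolset_digest(approved_set)}  # recorded at human sign-off

same = check_signed("acme", ["emit_artifact", "nmap"], approvals)
changed = check_signed("acme", approved_set + ["werkzeug-debugger-pin-bruteforce"], approvals)
```

Any change to the set, including a deprecated tool slipping back in, changes the digest and forces a fresh human review.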
The Thought/Action/Observation loop remains the best primitive we know for an LLM to talk to the world. What changes, in a real campaign, is what happens between loops. That's where you win or lose.
References
[1] Yao, S. et al. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629.
[2] Wei, J. et al. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv:2201.11903.
[3] Deng, G. et al. (2023). PentestGPT: An LLM-empowered Automatic Penetration Testing Tool. arXiv:2308.06782.
[4] Yao, S. et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601.
[5] Wang, L. et al. (2023). A Survey on Large Language Model based Autonomous Agents. arXiv:2308.11432.