Multi-Agent Debate for Vulnerability Triage
A few months ago we closed a finding as a false positive. It was an endpoint that looked vulnerable to SSRF: it accepted a URL from the user, processed it in the backend, and returned content. Our single-model pipeline flagged it as "likely false positive — the URL appears to be validated against an allow-list". We let it go.
Three weeks later, the client wrote to us: an external researcher had reported exactly that endpoint. The allow-list bypass was trivial (a trick using @ in the userinfo of the URL). Reputational damage, an uncomfortable conversation, and the lesson: a single model is a bad triager, especially when the evidence is ambiguous.
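To make the @ trick concrete, here is a minimal sketch of how a prefix-based allow-list can be fooled by the userinfo part of a URL. The allow-listed host, the function names, and the payload are hypothetical; we are not reproducing the client's actual validation code, only the general class of mistake.

```python
from urllib.parse import urlsplit

ALLOWED_HOSTS = {"cdn.example.com"}  # hypothetical allow-list


def naive_is_allowed(url: str) -> bool:
    # Flawed: a string-prefix check never looks at which host is really contacted.
    return any(url.startswith(f"https://{host}") for host in ALLOWED_HOSTS)


def strict_is_allowed(url: str) -> bool:
    parts = urlsplit(url)
    # Reject any URL carrying userinfo, then compare the parsed hostname.
    if parts.scheme != "https" or parts.username is not None or parts.password is not None:
        return False
    return parts.hostname in ALLOWED_HOSTS


# Everything before "@" is userinfo; the real host is attacker.example.
payload = "https://cdn.example.com@attacker.example/latest"

print(urlsplit(payload).hostname)   # attacker.example
print(naive_is_allowed(payload))    # True
print(strict_is_allowed(payload))   # False
```

Comparing the parsed hostname closes this particular gap only. A full SSRF defense would also need to check what the hostname resolves to, which this sketch does not attempt.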
Why a single model is a bad triager
- Sycophancy toward the context. Sharma et al. (2023) show that modern models agree even when the user is objectively wrong.
- Anchoring on the first hypothesis. Liang et al. (2023) call this Degeneration-of-Thought: once a plausible solution is fixed, the model does not generate real alternatives.
- Absence of adversarial check. A human triager thinks "how would I break this validation?". A zero-shot LLM thinks "is this reasonable?".
Self-consistency (Wang et al., 2022) mitigates noise but not systematic bias.
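A small calculation illustrates the difference between noise and bias. It rests on a simplifying assumption that is ours, not the paper's: a binary verdict, with n independent samples that are each correct with probability p. Under that assumption, majority voting helps when p is above 0.5 (noise) and hurts when p is below 0.5 (a systematic lean toward the wrong answer).

```python
from math import comb


def majority_accuracy(p: float, n: int) -> float:
    """Probability that a strict majority of n independent samples is correct.

    Assumes a binary decision, odd n, and per-sample accuracy p.
    """
    return sum(
        comb(n, k) * p**k * (1 - p) ** (n - k)
        for k in range(n // 2 + 1, n + 1)
    )


# Noisy but unbiased model: voting helps.
print(round(majority_accuracy(0.7, 5), 3))  # 0.837

# Systematically biased model: voting amplifies the error.
print(round(majority_accuracy(0.4, 5), 3))  # 0.317
```

In the SSRF case the model was not noisy. It leaned consistently toward "validated against an allow-list", so sampling it more times would most likely have produced the same wrong answer with more confidence.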
Multi-agent debate: Du et al. (2023)
The paper Improving Factuality and Reasoning through Multiagent Debate proposes the following: several instances respond independently; then each agent receives the responses of the others and generates a new response conditioned on them. This repeats for N rounds. The output is a consensus or a vote.
What's interesting is that it does not require fine-tuning, external judges, or special architecture. It is prompting. The improvement comes from the fact that each agent, when seeing opposing arguments, reviews parts of its reasoning that it took for granted in isolation.
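The loop described above can be sketched in a few lines. This is a simplified reading of the procedure, not the paper's code: an agent is modeled as a plain function that takes the question and the other agents' previous answers, and the stub agents at the bottom stand in for real model calls. The names debate, stubborn, and follower are ours.

```python
from collections import Counter
from typing import Callable, Optional

# An agent maps (question, peers' answers from the previous round) to an answer.
Agent = Callable[[str, list[str]], str]


def debate(question: str, agents: list[Agent], rounds: int) -> tuple[Optional[str], list[list[str]]]:
    # Round 1: every agent answers independently (no peer answers yet).
    answers = [agent(question, []) for agent in agents]
    history = [answers]
    # Later rounds: each agent sees everyone else's previous answer.
    for _ in range(rounds - 1):
        answers = [
            agent(question, answers[:i] + answers[i + 1:])
            for i, agent in enumerate(agents)
        ]
        history.append(answers)
    # Output: strict-majority vote, or None when there is no majority.
    winner, votes = Counter(answers).most_common(1)[0]
    return (winner if votes > len(agents) / 2 else None), history


# Stub agents standing in for LLM calls (hypothetical behavior).
def stubborn(question: str, peers: list[str]) -> str:
    return "yes"


def follower(question: str, peers: list[str]) -> str:
    # Starts at "no", then adopts the most common peer answer.
    return Counter(peers).most_common(1)[0][0] if peers else "no"


final, history = debate("Is the finding exploitable?", [stubborn, stubborn, follower], rounds=2)
print(final)    # yes
print(history)  # [['yes', 'yes', 'no'], ['yes', 'yes', 'yes']]
```

In a real pipeline each agent would build a prompt from the question and the peer answers. The control flow stays this small, which is why the technique needs no fine-tuning.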
Our implementation: three roles, three rounds, one judge
- Gandalf — critical mode. Assumes the finding is real and constructs the worst exploitation scenario.
- Beorn — evidential mode. Reasons only about artifacts: HTTP responses, headers, body, traces. We forbid it from inferring behavior that has not been demonstrated.
- Gwaihir — technical mode. Reasons about the architecture: stack, framework, known defenses, sector context.
The three receive the same evidence package. They generate independent verdicts in round 1. In rounds 2-3 each one sees the others' verdicts and produces a revised version. A judge agent issues the final verdict: true_positive, false_positive, or needs_manual.
If after round 2 the three converge, we skip round 3. If after round 3 disagreement persists, we escalate to a human. That rule is deliberate — persistent disagreement between orthogonal roles is an informative signal.
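The round and escalation rules can be sketched as follows. Several things here are assumptions made for illustration: the ask callback is a hypothetical wrapper around the model call, the role instructions are paraphrases rather than our production prompts, the judge is reduced to returning the unanimous verdict, and escalation to a human is represented as needs_manual.

```python
from dataclasses import dataclass
from typing import Callable

VERDICTS = ("true_positive", "false_positive", "needs_manual")


@dataclass(frozen=True)
class Role:
    name: str
    instruction: str  # paraphrased, not the production prompt


ROLES = [
    Role("Gandalf", "Assume the finding is real; build the worst exploitation scenario."),
    Role("Beorn", "Reason only from demonstrated artifacts; infer nothing else."),
    Role("Gwaihir", "Reason about the architecture: stack, framework, known defenses."),
]

# ask(role, package, peer_verdicts) -> verdict. Hypothetical model wrapper.
AskFn = Callable[[Role, str, dict[str, str]], str]


def triage(package: str, ask: AskFn, max_rounds: int = 3) -> tuple[str, int]:
    verdicts: dict[str, str] = {}
    for round_no in range(1, max_rounds + 1):
        previous = dict(verdicts)  # empty in round 1, so verdicts are independent
        verdicts = {
            role.name: ask(role, package, {n: v for n, v in previous.items() if n != role.name})
            for role in ROLES
        }
        for verdict in verdicts.values():
            if verdict not in VERDICTS:
                raise ValueError(f"unexpected verdict: {verdict}")
        # Convergence is only checked from round 2 on (assumption: round 1 never closes a finding).
        if round_no >= 2 and len(set(verdicts.values())) == 1:
            return next(iter(verdicts.values())), round_no  # simplified judge
    # Persistent disagreement after the last round: hand over to a human.
    return "needs_manual", max_rounds


# Scripted replay of the verdict pattern in the SSRF trace (stub, no model involved).
def scripted_ask(role: Role, package: str, peers: dict[str, str]) -> str:
    if role.name == "Beorn" and not peers:
        return "needs_manual"
    return "true_positive"


print(triage("ssrf-candidate", scripted_ask))  # ('true_positive', 2)
```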
Traced example: SSRF candidate
Round 1 — independent verdicts:
Gandalf: "Possible SSRF. The header Server: gunicorn, different from the frontend (nginx), suggests the backend makes the request. ?url=http://169.254.169.254/ returned 403, but ?url=http://localhost:8080 returned 200 with HTML. True positive, high confidence."
Beorn: "Two requests: one with bypass (200), another against metadata (403). The 403 may be an allow-list or a cloud firewall. Needs manual."
Gwaihir: "Python + gunicorn stack behind nginx on AWS. The 403 is consistent with enforced IMDSv2, not with endpoint mitigation. Probable true positive."
Round 2:
Gandalf: "Beorn is right that the 403 is not proof of mitigation. But the 200 against localhost:8080 does demonstrate internal access. I maintain true positive."
Beorn: "I reconsider. The 200 response includes HTML from an internal admin panel. I change to true positive, high confidence."
Gwaihir: "I agree. The @-trick payload would also pass the allow-list. I recommend adding the test to the report."
Convergence in round 2. Judge: true_positive. Report with three distinct lines of evidence.
Metrics
Over 3 months and 1,847 findings (412 with ground-truth labels):
- Precision on true positives: 0.71 (single) → 0.89 (debate).
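- Recall: 0.68 → 0.83. For triage, recall matters more than precision: a missed true positive, like the SSRF we closed, costs more than an extra manual review.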
- Median latency: 8s → 47s per finding.
- Token cost: ~4.2x single-model.
- Escalation to human: 3% → 11%. The debate detects more genuinely ambiguous cases.
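For readers who prefer a single number, precision and recall can be combined into F1, and the latency figures into a ratio. The snippet below is plain arithmetic on the values listed above. F1 is not a metric we tracked separately, and it weights precision and recall equally, which understates how much recall matters for triage.

```python
def f1(precision: float, recall: float) -> float:
    # Harmonic mean of precision and recall.
    return 2 * precision * recall / (precision + recall)


single = f1(0.71, 0.68)
debate_f1 = f1(0.89, 0.83)
latency_ratio = 47 / 8  # median seconds per finding, debate vs single

print(round(single, 3))         # 0.695
print(round(debate_f1, 3))      # 0.859
print(round(latency_ratio, 2))  # 5.88
```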
When NOT to use debate
- Low severity with clear evidence (missing security headers). A deterministic check is enough.
- Trivial binary decisions (is this an API key?). Regex or single-call with critique.
- High-frequency pipelines. The cost accumulates.
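These exclusions amount to a routing step placed in front of the debate. A minimal sketch, assuming a hypothetical Finding record with pre-computed flags; the field names, route labels, and the choice to send high-frequency traffic to a single call are illustrative, not our production schema.

```python
from dataclasses import dataclass


@dataclass
class Finding:
    severity: str            # "low", "medium" or "high" (hypothetical scale)
    evidence_is_clear: bool  # e.g. a missing security header
    trivial_binary: bool     # e.g. "is this an API key?"


def route(finding: Finding, high_frequency: bool = False) -> str:
    if finding.trivial_binary:
        return "regex_or_single_call"
    if finding.severity == "low" and finding.evidence_is_clear:
        return "deterministic_check"
    if high_frequency:
        # Assumption: cost-sensitive pipelines fall back to a single call.
        return "single_call"
    # Everything else, in particular ambiguous evidence, goes to debate.
    return "debate"


print(route(Finding("low", True, False)))    # deterministic_check
print(route(Finding("high", False, False)))  # debate
```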
Quick comparison
CoT-SC: good for noise, bad for systematic bias.
Tree of Thoughts: excellent for structured spaces (Game of 24). Overkill for triage.
Single-call with self-critique: the same model critiques the same hypotheses, collapsing to the same fixed point.
Multi-agent debate: wins when a plurality of roles is needed and the evidence is ambiguous.
Closing
The client's SSRF should never have been closed in error. What we learned is not that LLMs are bad triagers — it is that a single LLM, no matter how good, has a single way of looking. Forcing plurality through well-defined roles is not elegant; it is pragmatic engineering. If your pipeline closes ambiguous findings without debate, you are probably letting real vulnerabilities slip through.
Bibliography
- Du, Y. et al. (2023). Improving Factuality and Reasoning through Multiagent Debate. arXiv:2305.14325.
- Wang, X. et al. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171.
- Yao, S. et al. (2023). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601.
- Chan, C-M. et al. (2023). ChatEval: Towards Better LLM-based Evaluators through Multi-Agent Debate. arXiv:2308.07201.
- Liang, T. et al. (2023). Encouraging Divergent Thinking in Large Language Models through Multi-Agent Debate. arXiv:2305.19118.
- Sharma, M. et al. (2023). Towards Understanding Sycophancy in Language Models. arXiv:2310.13548.