Constitutional AI for Offensive Agents

A couple of months ago, during an authorized engagement, one of our offensive agents decided it was also a good idea to sweep a neighboring CIDR range. It was not in scope. Nobody had asked it to. The system prompt said, in bold and with three exclamation marks, not to do it. It still did. We stopped it in time with Sentinel's kill-switch and, reviewing the trace, we found the usual story: the model had reasoned that a host adjacent to the target "probably belongs to the same client" and granted itself permission.

That day we decided to stop fighting the prompt. We did not want another layer of natural-language rules begging "please do not do X." We wanted an agent that rejects the action out of conviction, not instruction. And for that, the cleanest path we found runs through something Anthropic published in late 2022: Constitutional AI[1].

The concrete problem: prompt engineering does not scale

Gandalf CLI orchestrates several agents (recon, web, post-exploitation, reporting) under the control of Sentinel, our enforcement layer: kill-switch, pre-execution command validation, and scope checks against authorized CIDR. This works, but it is an external fence. The agent still wants to jump over it. Every time we close one shortcut, it finds another: an "informational" DNS resolution to an external domain, an nmap with a range calculated on the fly, a curl to an endpoint the operator never mentioned.
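A scope check of the kind described above can be sketched with Python's standard ipaddress module. This is only an illustration, not Sentinel's actual code: the function name target_in_scope and the idea of passing scope as a list of CIDR strings are assumptions made for the example.

```python
import ipaddress

def target_in_scope(target: str, authorized_scope: list[str]) -> bool:
    """Return True only if every address in `target` lies inside one authorized network.

    `target` may be a single IP ("10.0.5.7") or a CIDR range ("10.0.5.0/28").
    Hypothetical helper: the real Sentinel interface is not shown in this post.
    """
    # strict=False lets a bare IP parse as a single-host network (/32 or /128).
    candidate = ipaddress.ip_network(target, strict=False)
    for cidr in authorized_scope:
        allowed = ipaddress.ip_network(cidr, strict=False)
        # subnet_of raises TypeError on mixed IPv4/IPv6, so compare versions first.
        if candidate.version == allowed.version and candidate.subnet_of(allowed):
            return True
    return False

scope = ["10.0.5.0/24"]
print(target_in_scope("10.0.5.17", scope))    # True: single host inside scope
print(target_in_scope("10.0.0.0/24", scope))  # False: neighboring range
print(target_in_scope("10.0.4.0/23", scope))  # False: only partially overlaps scope
```

Requiring the whole candidate range to be a subnet of an authorized network, rather than merely overlapping it, is what rejects a range "calculated on the fly" that spills past the boundary. Hostnames are deliberately not handled here; they would need to be resolved before validation.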

The underlying problem is that prompt instructions compete with the rest of the context. If the model has 30k tokens of evidence telling it "the target is here" and two lines telling it "do not stray from here," semantic pressure wins over the rule. The authors of the original RLAIF work already noted this[1]: post-hoc guardrails are fragile under new distributions.

Constitutional AI, honest recap

Bai et al. (2022)[1] propose training a model in two phases without using human feedback to judge harmlessness:

  1. Supervised phase (SL-CAI): the base model is sampled, asked to critique its own response against a human-written constitution, and asked to revise it. It is fine-tuned on (prompt, revised response).
  2. RL phase (RLAIF): pairs of responses are generated, another model decides which one better satisfies the constitution, and a reward model is trained on those AI-generated preferences. The policy is then optimized against that reward model with PPO; more recent pipelines use DPO[3], which trains directly on the preference pairs.

The elegant idea is that the human cost is reduced to drafting the constitution. The rest is done by the model evaluating itself. Lee et al. (2023)[2] showed that RLAIF reaches parity with RLHF on harmlessness and summarization tasks, which closes the loop: if the AI evaluator is good enough, we do not need human annotators to scale.

The interesting nuance: when "good" is offensive

Here is the twist that took us longest to absorb. The alignment literature assumes that harmless means "does not produce harmful content." For us, the desired behavior is exactly the opposite on its surface: we want the agent to execute exploits, launch payloads, move laterally. What we want to restrict is not the offensive capability, but the authorized context.

That changes the constitution radically. We are not teaching "refuse to produce malware." We are teaching "produce and execute malware, but only against these hosts, with this minimum evidence, without these side effects." It is contour alignment, not content alignment.

Our constitution

We started with five principles and polished them over six weeks of iteration. The current version lives in YAML alongside the Gandalf repo and is versioned with the model:

constitution:
  version: "0.7.2"
  principles:
    - id: scope_cidr
      rule: "Do not send network traffic to IPs or domains outside
             engagement.authorized_scope. Resolve first, validate after."
      priority: critical
      examples_violation:
        - "nmap -sV 10.0.0.0/24 when scope = 10.0.5.0/24"
        - "curl https://api.thirdparty.com/v1 when it does not appear in scope"

    - id: minimum_evidence
      rule: "Before launching an exploit, require explicit evidence of
             the vulnerability. Inferences without evidence do not authorize execution."
      priority: high

    - id: non_destructive_default
      rule: "By default, do not execute destructive actions. They require
             explicit operator confirmation via Sentinel."
      priority: critical

    - id: no_exfiltration
      rule: "Do not exfiltrate sensitive data outside the controlled environment."
      priority: critical

    - id: human_in_loop_escalation
      rule: "When scope is ambiguous or a side effect is plausible,
             return control to the operator. Doubt does not resolve itself."
      priority: high

This constitution is what the critic model reads when judging its own outputs. It is not what the executor agent sees in its prompt: it is the training material.
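As a rough sketch of how principles like these could be turned into the text the critic reads, the snippet below models each principle as a small dataclass and renders them with critical ones first. The Principle class, the PRIORITY_RANK table, and the output layout are all assumptions for illustration; the post does not describe the actual rendering.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Principle:
    id: str
    rule: str
    priority: str  # "critical" or "high", the two levels used in the constitution

# Lower rank sorts first, so critical principles lead the critic prompt.
PRIORITY_RANK = {"critical": 0, "high": 1}

def render_constitution(principles: list[Principle]) -> str:
    """Format principles as a numbered block (hypothetical layout)."""
    # sorted() is stable, so principles of equal priority keep their file order.
    ordered = sorted(principles, key=lambda p: PRIORITY_RANK[p.priority])
    return "\n".join(
        f"{i}. [{p.priority}] {p.id}: {p.rule}"
        for i, p in enumerate(ordered, start=1)
    )

principles = [
    Principle("minimum_evidence",
              "Require explicit evidence of the vulnerability before launching an exploit.",
              "high"),
    Principle("scope_cidr",
              "Do not send network traffic outside engagement.authorized_scope.",
              "critical"),
]
print(render_constitution(principles))
```

Keeping the ids in the rendered text matters for the later steps: it gives the critic a stable name to cite when it explains which principle a response violates.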

The pipeline: critique, revise, prefer, train

The loop, simplified:

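def constitutional_loop(prompt, base_model, critic_model, constitution):
    """Build one preference pair: the raw response versus its constitution-guided revision."""
    response_a = base_model.generate(prompt)
    # Shared context, so the critique and the revision judge the same material.
    context = (
        f"Constitution:\n{constitution}\n\n"
        f"Operator request:\n{prompt}\n\n"
        f"Agent response:\n{response_a}\n\n"
    )
    critique = critic_model.generate(
        context + "Identify which principles it violates and why."
    )
    # The revision sees the same context plus the critique it should address.
    revised = critic_model.generate(
        context
        + f"Critique:\n{critique}\n\n"
        + "Rewrite the agent response so that it complies with the constitution. "
        + "If the request violates a critical principle, refuse with a "
        + "justification that references the principle."
    )
    return {"prompt": prompt, "chosen": revised, "rejected": response_a}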

We generated a dataset of around 18,000 preference examples, each a (prompt, chosen, rejected) triple, and trained on it with DPO[3] instead of PPO. DPO gives us stable training without a separate reward model, and the compute cost is feasible on a pair of A100s.
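For readers who have not seen it, the per-example DPO objective from Rafailov et al.[3] is small enough to write out. The sketch below computes it from summed sequence log-probabilities in plain Python. It is not our training code, and beta=0.1 is only an assumed default for the example, not a value used in this project.

```python
import math

def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # assumed value for illustration only
) -> float:
    """Per-pair DPO loss: -log sigmoid(beta * (chosen log-ratio - rejected log-ratio))."""
    # Log-ratios measure how far the policy has moved from the frozen reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch on the sign to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Policy identical to the reference: margin is 0 and the loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# Policy already prefers the chosen response: the loss drops below log(2).
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

The loss depends only on the policy and a frozen reference model, which is why no separate reward model has to be trained or served.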

Measurable results

We measured on an internal benchmark of 1,200 labeled prompts:

  • Base (no guardrails): correct refusal 31%, false positives (FP) 7%.
  • Prompt + system rules: 64%, FP 12%.
  • Prompt + Sentinel post-hoc: 89%, but with all the weight on the external fence.
  • SL-CAI + DPO: 94%, FP 4%, and the agent verbalizes the violated principle in 88% of refusals.

That last number is the one that helped us sleep better. The agent does not refuse because someone is shouting "no" at it: it refuses by explaining that the IP falls outside scope_cidr and asks the operator to extend scope if appropriate.
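The two rates reported above can be computed from labeled outcomes roughly as follows. This sketch assumes that "correct refusal" means the share of should-refuse prompts the agent refused, and that a false positive is a refusal of a prompt it should have executed; the function name and the tuple layout are hypothetical.

```python
def refusal_metrics(results: list[tuple[bool, bool]]) -> dict[str, float]:
    """Compute refusal rates from (should_refuse, did_refuse) pairs, one per prompt."""
    refused_when_required = [did for should, did in results if should]
    refused_when_allowed = [did for should, did in results if not should]
    if not refused_when_required or not refused_when_allowed:
        raise ValueError("benchmark needs both should-refuse and should-execute prompts")
    return {
        # Share of out-of-policy prompts that the agent refused.
        "correct_refusal": sum(refused_when_required) / len(refused_when_required),
        # Share of legitimate prompts that the agent refused anyway.
        "false_positive": sum(refused_when_allowed) / len(refused_when_allowed),
    }

sample = [
    (True, True), (True, True), (True, False), (True, True),
    (False, False), (False, True), (False, False), (False, False),
]
print(refusal_metrics(sample))  # {'correct_refusal': 0.75, 'false_positive': 0.25}
```

Tracking both numbers together is what keeps the comparison honest: an agent that refuses everything would score 100% on correct refusal and be useless on an engagement.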

Trade-offs we are not going to hide

First, the fine-tune cost. Every substantive change demands a new SL-CAI + DPO cycle. We absorb minor adjustments with LoRA, but a new principle means regenerating part of the dataset. A major iteration costs us 24-36 GPU-hours.

Second, drift. When the team adds a new type of engagement, the constitution falls short. Maintaining the constitution is ongoing work.

Third, self-evaluation has a ceiling. Huang et al. (2023)[4] reminded us that LLMs do not self-correct very well on pure reasoning.

A real case

Last week, the agent detected an SSRF vulnerability and planned to pivot to the AWS metadata endpoint. It stopped on its own:

"Candidate action: GET http://169.254.169.254/latest/meta-data/iam/security-credentials/. This IP is outside the authorized scope (10.20.0.0/16). Even though the SSRF technically allows it, the scope_cidr principle blocks execution without explicit authorization."
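The check the agent verbalizes here is easy to reproduce for a URL target. The sketch below extracts the host from a URL and tests it against a single authorized network, using only the standard library. The function name url_host_in_scope is hypothetical, and treating unresolved hostnames as out of scope is an assumption of this example, not a description of how the agent or Sentinel behaves.

```python
import ipaddress
from urllib.parse import urlsplit

def url_host_in_scope(url: str, authorized_cidr: str) -> bool:
    """Check whether a URL's literal IP host falls inside the authorized network."""
    host = urlsplit(url).hostname
    if host is None:
        return False
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal: hostnames must be resolved before they can be validated.
        return False
    return address in ipaddress.ip_network(authorized_cidr)

metadata_url = "http://169.254.169.254/latest/meta-data/iam/security-credentials/"
print(url_host_in_scope(metadata_url, "10.20.0.0/16"))          # False
print(url_host_in_scope("http://10.20.3.4/admin", "10.20.0.0/16"))  # True
print(ipaddress.ip_address("169.254.169.254").is_link_local)    # True
```

The metadata address is link-local, so it can never fall inside a private range such as 10.20.0.0/16, no matter which host the SSRF is triggered from.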

That is exactly what we wanted. It is not Sentinel chopping off the agent's hands: it is the agent behaving like a pentester with judgment. For our clients, the difference is huge.

Closing

Constitutional AI does not solve alignment of offensive agents. It solves one layer: making the model internalize the rules of the game instead of obeying them grudgingly. Sentinel underneath is still essential, and so is the human on top. But between those two layers we have gained something that prompt engineering never gave us: an agent that wants to stay in scope.

References

  1. Bai, Y. et al. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073.
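  2. Lee, H. et al. (2023). RLAIF vs. RLHF: Scaling Reinforcement Learning from Human Feedback with AI Feedback. arXiv:2309.00267.
  3. Rafailov, R. et al. (2023). Direct Preference Optimization: Your Language Model is Secretly a Reward Model. NeurIPS 2023.
  4. Huang, J. et al. (2023). Large Language Models Cannot Self-Correct Reasoning Yet. arXiv:2310.01798.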
  5. Wu, Z. et al. (2025). Agent Safety Alignment via RL. arXiv:2507.08270.