ai-ml

Reinforcement Learning for Exploit Generation

A few months ago, during a CTF final, our team got stuck for three hours on a pwn binary. Stack canary, NX, partial PIE, an overflow of exactly 72 bytes in an unremarkable function. We tried everything. Nothing worked. At 03:14 in the morning, one of us launched almost as a joke a five-line script that mutated the input with AFL-like bitflips and resent it without thinking. Twenty minutes later, the script won the flag through pure statistical luck.

That night we went home with a question rattling around: what if instead of a dumb process we had a process that learned from every failed attempt?

Why RL fits with control flow hijacking

The search space of an exploit is brutal but not uniform. Most inputs produce the same behavior, and only a narrow region leads to the machine state you want: instruction pointer pointing to memory you control. What distinguishes that subspace is an information gradient: an input that comes close to corrupting RIP exhibits observable side effects — partially overwritten registers, stack values matching input bytes, jumps to invalid addresses but near controlled bytes.

That is exactly what well-designed reward functions can capture.

The state of the art

Avgerinos and Brumley published in 2011 the first end-to-end system with AEG (NDSS 2011) on symbolic execution. Mayhem (Cha et al., S&P 2012) scaled the analysis. Both are quasi-deterministic but pay the classic price of symbolic execution: path explosion and choked SMT solvers.

Böttinger formalized fuzzing as MDP with Deep Reinforcement Fuzzing (2018) using DQN. PwnGPT (Shao et al., ACL 2025) reports raising the exploit generation rate from 26.3% to 57.9% with o1-preview. And PPO (Schulman et al., 2017) + RLHF (Ouyang et al., 2022) have matured enough that training an agent on a binary environment is a serious conversation.

Our setup

Gym-style environment over ptrace and an eBPF sidecar. The observation is a dense tensor with:

  • State of the 16 general registers (x86_64), normalized.
  • Stack window around RSP (256 bytes) as raw bytes.
  • Control flags: NX, canary_present, RIP_corrupted, register_controlled[].
  • Hash of the last coverage trace taken from eBPF (uprobe over each basic block).

The action space is discrete over input bytes. The reward combines new coverage (positive, decreasing), exploitability heuristics and a length penalty.

import gym
from gwaihir.tools.rl_env import BinaryPwnEnv

class CTFPwnEnv(BinaryPwnEnv):
    def step(self, action):
        mutated = self.apply_action(self.current_input, action)
        trace = self.run_with_ebpf(mutated)

        new_bbs = len(trace.basic_blocks - self.coverage_seen)
        cov_reward = np.log1p(new_bbs) * 0.3
        self.coverage_seen |= trace.basic_blocks

        expl = 0.0
        if trace.rip_controlled_bytes > 0:
            expl += 5.0 + 0.1 * trace.rip_controlled_bytes
        if trace.canary_leaked: expl += 2.5
        if trace.tainted_syscall_args: expl += 1.5

        reward = cov_reward + expl - 0.001 * len(mutated)
        done = trace.rip_fully_controlled or self.steps > 4096
        return self.observe(trace), reward, done, {"trace": trace}

PPO vs DQN

We started with DQN. As soon as the action space grew, DQN became fragile: the target network diverged every time the exploitability reward triggered a huge, isolated value. PPO, with its clipping of the policy ratio, absorbs those spikes without collapsing. PPO with GAE (λ=0.95), entropy coef 0.01 and batches of 4096 gave us stable training over 24 hours on an RTX 4090.

A detail that is rarely told: most of the time is not spent on gradients, it is spent running the binary. Each step requires fork, ptrace, eBPF collection, parse. We parallelized 32 envs per GPU with SubprocVecEnv and even so the bottleneck was I/O.

Results on 10 CTF binaries

Battery: two babys, three with canary leak required, two with format string, two with minimal ROP chain and one with heap (UAF). The agent, trained from scratch per binary, solved 7 of 10. Of the three it failed: the heap one (expected), one ROP that required a gadget not present, and a format string where the policy got stuck in a local minimum leaking addresses forever.

Compared against pure angr: RL won 5, tied 2, lost 3 (angr solved heap and format-write earlier). It is not a superior tool, it is complementary.

Brutal limitations

  • One policy per binary. We do not transfer knowledge between challenges.
  • Training times of hours for challenges that a senior human solves in minutes.
  • Fragile reward shaping. Custom allocators or non-standard mitigations break the heuristics.
  • Symbolic execution still wins when the path is deterministically derivable.

The real trade-off: cheap broad coverage (RL) vs expensive deep reasoning (symbolic). We run them in parallel inside Gwaihir.

What is coming next

Three fronts: (a) pretraining on exploit-db and CTF writeups, (b) RLHF-style reward learning, (c) integrating the agent as a sub-tool of a coordinator LLM.

For now we keep training. And the next time a team gets stuck for three hours at three in the morning, we want it to be the agent that finds the flag — not luck.

Bibliography

  1. Avgerinos, T. et al. (2011). AEG: Automatic Exploit Generation. NDSS.
  2. Cha, S.K. et al. (2012). Unleashing Mayhem. IEEE S&P.
  3. Böttinger, K. et al. (2018). Deep Reinforcement Fuzzing. IEEE SPW. arXiv:1801.04589.
  4. Schulman, J. et al. (2017). PPO. arXiv:1707.06347.
  5. Ouyang, L. et al. (2022). InstructGPT/RLHF. NeurIPS.
  6. Shao, Y. et al. (2025). PwnGPT. ACL.