LLM-Guided Fuzzing: More Coverage, Fewer Silly Crashes
There was one Tuesday when we lost seven hours chasing a SIGSEGV that looked beautiful. We were fuzzing a particular client's JWT parser, and AFL++ had spit out a reproducible crash almost instantly. It smelled like a finding. We isolated it, minimized with afl-tmin, wrote the PoC, and shipped it to the internal channel with the confidence of someone already mentally drafting the CVE.
The crash was a strncpy on an empty buffer, triggered because the JWT header was literally the character { followed by binary bytes. Before reaching anything interesting, the parser ran base64url_decode and blew up on a path that is never reached in production: the front-end load balancer rejects requests without a Content-Type header and tokens without at least one dot. It was a bug, sure. It was also useless.
That afternoon we decided enough was enough.
Why grey-box fuzzers choke on grammars
AFL's mutation strategies (bit flipping, splicing, arithmetic) are grammar-blind: most mutated inputs don't get past the parser. Wang et al. (Superion, ICSE 2019) already pointed out that "AFL spends a great deal of time dealing with syntactic correctness and only finds parsing errors".
For JSON, ASN.1, or any stateful format, effective coverage stalls. Lots of crashes appear, but most are from the prologue: validators, base64 decoders, length checks. The interesting stuff (signature verification logic, kid handling, algorithm confusion-style attacks) sits behind a wall that bit flips don't cross.
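To make the "wall" concrete, here is a minimal sketch of the kind of structural gate a token has to clear before any interesting logic runs. It assumes the JWS compact shape (three base64url segments separated by exactly two dots), which is stricter than the balancer rule described earlier. The function name jwt_shape_ok is hypothetical and does not come from the client's parser.

```c
#include <stddef.h>

/* Returns 1 if c belongs to the base64url alphabet (RFC 4648, section 5). */
static int is_b64url(unsigned char c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
           (c >= '0' && c <= '9') || c == '-' || c == '_';
}

/* Hypothetical gate: returns 1 if tok has the shape header.payload.signature,
 * with non-empty header and payload and only base64url characters.
 * The signature segment may be empty. */
int jwt_shape_ok(const unsigned char *tok, size_t len) {
    size_t dots = 0, seg_len = 0;
    for (size_t i = 0; i < len; i++) {
        if (tok[i] == '.') {
            if (dots < 2 && seg_len == 0) return 0; /* empty header/payload */
            dots++;
            seg_len = 0;
        } else if (is_b64url(tok[i])) {
            seg_len++;
        } else {
            return 0; /* byte outside the alphabet, e.g. '{' or binary */
        }
    }
    return dots == 2;
}
```

An input such as { followed by binary bytes fails this check on its first byte, which is why byte-level mutation tends to pile up crashes in the prologue instead of reaching signature or kid handling.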
The LLM as a semantic mutation engine
Xia et al. formalize this idea in Fuzz4All (ICSE 2024)[1]: the LLM "implicitly learns syntax, semantics, and valid API constraints". In practice it acts as a probabilistic generator of plausible inputs.
Deng et al. with TitanFuzz (ISSTA 2023)[2] reported 30-50% higher code coverage on TensorFlow and PyTorch than earlier fuzzers. Meng et al. with ChatAFL (NDSS 2024)[3] took the principle to network protocols: 47.6% more state transitions and 9 previously unknown vulnerabilities. Yang et al. with WhiteFox (OOPSLA 2024)[4] added an analysis step in which an LLM reads the target's source code to derive input requirements.
The common pattern: the LLM doesn't fuzz, it proposes candidates. The traditional fuzzer remains the coverage-guided engine.
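The division of labor can be sketched with the check that decides whether a candidate survives, regardless of who proposed it. This is a deliberately simplified version of edge-coverage feedback: it only tracks whether a map entry has ever been hit, whereas AFL++ also buckets hit counts. The name has_new_coverage and the map layout are assumptions made for illustration.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified coverage check. `trace` is the coverage map of the last
 * execution, `seen` accumulates every entry hit so far in the campaign.
 * Returns 1 (and updates `seen`) if the execution hit anything new. */
int has_new_coverage(const uint8_t *trace, uint8_t *seen, size_t map_size) {
    int is_new = 0;
    for (size_t i = 0; i < map_size; i++) {
        if (trace[i] && !seen[i]) {
            seen[i] = 1;
            is_new = 1;
        }
    }
    return is_new;
}
```

An LLM-proposed input and a bit-flipped input go through the same filter: the candidate is kept only when the function returns 1.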
Our architecture: AFL++ with a semantic plug-in
Gwaihir CLI wraps AFL++ and adds two things: a custom mutator that delegates a percentage of mutations to a semantic provider, and Beorn, which brings known grammars, historical samples, and previous CVEs from the same format.
- Gwaihir analyzes the target. If Beorn recognizes the format, it injects a seed grammar.
- AFL++ starts with an initial corpus generated by an LLM from the spec.
- The custom mutator hooks into afl_custom_fuzz. Every N iterations it asks the LLM for a semantic mutation.
- Inputs that increase coverage are reinjected into the corpus.
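The state that such a mutator carries between calls could be set up roughly as follows. The afl_custom_init and afl_custom_deinit entry points are part of the AFL++ custom mutator interface. The gwaihir_ctx_t layout and the default of one LLM call per 50 invocations are assumptions for this sketch, not the actual Gwaihir code.

```c
#include <stdint.h>
#include <stdlib.h>

/* Opaque forward declaration; the full definition lives in AFL++'s
 * afl-fuzz.h, which a real plug-in would include instead. */
typedef struct afl_state afl_state_t;

/* Hypothetical per-instance state for the mutator. */
typedef struct {
    afl_state_t *afl;       /* handle passed in by AFL++ */
    unsigned int seed;      /* RNG seed passed in by AFL++ */
    uint64_t counter;       /* calls to afl_custom_fuzz so far */
    uint64_t llm_every;     /* delegate one call in every llm_every */
} gwaihir_ctx_t;

/* Called once by AFL++ when the mutator library is loaded. */
void *afl_custom_init(afl_state_t *afl, unsigned int seed) {
    gwaihir_ctx_t *ctx = calloc(1, sizeof *ctx);
    if (!ctx) return NULL;
    ctx->afl = afl;
    ctx->seed = seed;
    ctx->llm_every = 50;    /* assumed default: 1 LLM call per 50 */
    return ctx;
}

/* Called once by AFL++ on shutdown; releases the state. */
void afl_custom_deinit(void *data) {
    free(data);
}
```

AFL++ hands the pointer returned by afl_custom_init back as the data argument of every later callback, which is how the counter persists across mutations.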
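    // gwaihir_mutator.c: illustrative fragment
    #include <stddef.h>
    #include <stdint.h>

    // gwaihir_ctx_t, gwaihir_llm_mutate and afl_havoc_mutate are assumed to
    // be plug-in helpers defined elsewhere; only afl_custom_fuzz is AFL++ API.
    size_t afl_custom_fuzz(void *data, uint8_t *buf, size_t buf_size,
                           uint8_t **out_buf, uint8_t *add_buf,
                           size_t add_buf_size, size_t max_size) {
        gwaihir_ctx_t *ctx = (gwaihir_ctx_t *)data;
        (void)add_buf; (void)add_buf_size;  // splicing input, unused here
        ctx->counter++;
        // Delegate one call in every llm_every to the semantic provider.
        if (ctx->counter % ctx->llm_every == 0) {
            return gwaihir_llm_mutate(ctx, buf, buf_size, out_buf, max_size);
        }
        // Otherwise fall back to ordinary byte-level mutation.
        return afl_havoc_mutate(ctx, buf, buf_size, out_buf, max_size);
    }
The JWT parser case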
24-hour campaign on the same binary:
- Vanilla AFL++: 12.3% coverage, 47 unique crashes, 2 exploitable.
- AFL++ + dictionary tokens: 19.8% coverage, 31 crashes, 3 exploitable.
- AFL++ + Gwaihir/Beorn: 41.7% coverage, 18 unique crashes, 7 exploitable.
Fewer crashes overall, but denser ones. Three of the seven ended up being real bugs in the kid logic and in the handling of asymmetric algorithms.
Fuzz4All versus our approach
Fuzz4All is more ambitious: the LLM is the generation loop. That gives it enormous versatility (98 bugs across systems such as GCC, Clang, Z3, and OpenJDK) but also two limitations: without coverage feedback it searches blindly, and every mutation costs a call to the model. Our approach is more modest and cheaper: AFL++ handles roughly 98% of the mutations at bit-flip speed, and the LLM only steps in to push coverage past the parser.
Trade-offs
Token cost. 24 h x 5,000 execs/s at 1 call per 50 execs = ~8.6M calls. That is not viable without aggressive caching and local models. We use a quantized model on a local GPU for 95% of the calls and reserve a large model for rescue mutations.
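The arithmetic behind that estimate, written out as a small sketch. The function names are hypothetical, and the split assumes that "95%" refers to the share of calls served by the local model. Caching is ignored, so these are upper bounds.

```c
#include <stdint.h>

/* Number of LLM calls in a campaign when one call is made every
 * execs_per_call executions. */
uint64_t llm_calls(uint64_t hours, uint64_t execs_per_sec,
                   uint64_t execs_per_call) {
    return hours * 3600u * execs_per_sec / execs_per_call;
}

/* Calls left for the large model once local_percent of them are
 * served by the local quantized model. */
uint64_t large_model_calls(uint64_t total_calls, unsigned local_percent) {
    return total_calls * (100u - local_percent) / 100u;
}
```

With the figures above, llm_calls(24, 5000, 50) gives 8,640,000 calls, and large_model_calls(8640000, 95) leaves 432,000 for the large model.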
Latency. One LLM call, even to a local model, takes tens to hundreds of milliseconds, and the AFL++ mutation loop is synchronous. We put an asynchronous queue in between: the LLM fills a pool of pre-generated mutations and the mutator draws from it.
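One way to build such a pool is a single-producer, single-consumer ring buffer: a worker thread pushes LLM outputs and the fuzzer thread pops them without ever blocking. The sketch below uses C11 atomics. All names and sizes are assumptions for illustration, and it is only safe with exactly one producer and one consumer.

```c
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define POOL_SLOTS 64       /* must be a power of two */
#define SLOT_BYTES 4096     /* largest mutation the pool accepts */

typedef struct {
    uint8_t data[SLOT_BYTES];
    size_t len;
} mutation_t;

typedef struct {
    mutation_t slots[POOL_SLOTS];
    _Atomic size_t head;    /* next slot to write, advanced by producer */
    _Atomic size_t tail;    /* next slot to read, advanced by consumer */
} mutation_pool_t;

/* Producer side (LLM worker). Returns 1 on success, 0 if the pool is
 * full or the mutation is empty or too large. */
int pool_push(mutation_pool_t *p, const uint8_t *buf, size_t len) {
    size_t head = atomic_load_explicit(&p->head, memory_order_relaxed);
    size_t tail = atomic_load_explicit(&p->tail, memory_order_acquire);
    if (len == 0 || len > SLOT_BYTES || head - tail == POOL_SLOTS) return 0;
    mutation_t *m = &p->slots[head % POOL_SLOTS];
    memcpy(m->data, buf, len);
    m->len = len;
    /* Publish the slot only after its contents are written. */
    atomic_store_explicit(&p->head, head + 1, memory_order_release);
    return 1;
}

/* Consumer side (fuzzer). Never blocks. Copies at most max_size bytes
 * into out and returns the number copied, or 0 if the pool is empty. */
size_t pool_pop(mutation_pool_t *p, uint8_t *out, size_t max_size) {
    size_t tail = atomic_load_explicit(&p->tail, memory_order_relaxed);
    size_t head = atomic_load_explicit(&p->head, memory_order_acquire);
    if (head == tail) return 0;
    const mutation_t *m = &p->slots[tail % POOL_SLOTS];
    size_t n = m->len < max_size ? m->len : max_size;
    memcpy(out, m->data, n);
    /* Hand the slot back to the producer. */
    atomic_store_explicit(&p->tail, tail + 1, memory_order_release);
    return n;
}
```

When pool_pop returns 0, the mutator can fall back to ordinary byte-level mutation for that iteration, so a slow model only lowers the share of semantic mutations instead of stalling the fuzzer.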
Underspecification. Vague prompt -> the LLM hallucinates boring inputs. Narrow prompt -> it replicates known cases. We iterate on the prompts almost as much as on the code.
What we take away
LLM-guided fuzzing is not a replacement for AFL++. It is a complement that raises the coverage ceiling when the target has complex parsing. It doesn't chase more crashes; it chases better crashes. For the team, that has meant fewer hours spent on useless triage and more on bugs that matter to the client.
Bibliography
[1] Xia, C.S. et al. (2024). Fuzz4All: Universal Fuzzing with LLMs. ICSE 2024. arXiv:2308.04748.
[2] Deng, Y. et al. (2023). TitanFuzz. ISSTA 2023. arXiv:2212.14834.
[3] Meng, R. et al. (2024). ChatAFL. NDSS 2024.
[4] Yang, C. et al. (2024). WhiteFox. OOPSLA 2024. arXiv:2310.15991.
[5] Liu, Z. et al. (2024). InputBlaster. ICSE 2024. arXiv:2310.15657.