Tactical RAG: From Writeups to Action
Last year, during an engagement against internal infrastructure, I lost forty minutes hunting for a very specific detail: how someone had escalated privileges against a particular version of a misconfigured Java service exposing JMX. I knew I had read it. In a writeup, in a Twitter thread, in a Discord server. I couldn't remember where. Google was serving me SEO sludge, my browser bookmarks were a graveyard, and my ~/notes folder held 1,300 unindexed Markdown files.
Forty minutes. In a time-boxed pentest, that's a Domain Admin you don't get.
That's where the idea for Beorn came from: a RAG that doesn't search documents, but delivers actionable context to an agent that is operating. Today it holds 9115 indexed vectors across 28 topical collections and answers with a median latency under 50 ms. But getting there was quite a bit messier than it sounds.
Generic RAG doesn't work for offense
The first prototype was straight out of the manual: LangChain, OpenAI embeddings, local ChromaDB, recursive chunking every 512 tokens. It worked for questions like "what is Kerberoasting?". It failed spectacularly for "I've got SeImpersonatePrivilege on a Windows Server 2019 box with IIS 10 running in an application pool, give me the path".
The problem wasn't the model. It was that most RAG literature assumes your corpus is a corporate FAQ, product documentation, or policy PDFs. Offensive writeups are a different beast: they mix narrative ("I noticed the banner returned Apache 2.4.49"), literal payloads, tool output, and implicit conclusions you only understand if you already know the TTP. If you slice that by fixed length, you break payloads, separate a CVE from its explanation, and turn ranking into a roll of the dice.
There's a pattern documented by the PentestAgent team in their 2024 paper (Shen et al., 2024)[1]: offensive knowledge is procedural, not declarative. You're not looking for a fact. You're looking for a recipe conditioned on your context. And that changes the entire retriever design.
What we learned indexing 9115 chunks of writeups
Chunking by semantic unit, not by tokens
We moved from recursive chunking to structure-aware chunking. Every Hack The Box (HTB) writeup has more or less recognizable phases: reconnaissance, initial foothold, lateral movement, privilege escalation, persistence. We parse them with a preprocessor that looks at markdown headers, code blocks, and explicit separators. Each chunk carries metadata:
```json
{
  "box": "Sauna",
  "os": "Windows",
  "phase": "privesc",
  "ttp": "asreproast",
  "tools": ["impacket", "hashcat"],
  "cve": [],
  "difficulty": "easy",
  "lang": "en"
}
```

The average chunk landed at 380 tokens, but with high variance: some are 90 (a command with its output), others 700 (a full explanation of Kerberos delegation). NVIDIA's team published an empirical analysis this year (Wang et al., 2024)[2] showing that optimal size depends on the domain and that forcing uniformity degrades precision by 8% to 15%. It matched what we were seeing.
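Here is a minimal sketch of what structure-aware chunking can look like, assuming writeups arrive as Markdown. It is not Beorn's actual preprocessor: `chunk_writeup`, `guess_phase`, and the keyword table are hypothetical names made up for illustration. The only idea it demonstrates is splitting on headers, never inside a fenced code block, and attaching metadata to every chunk.

```python
import re
from dataclasses import dataclass, field

HEADER_RE = re.compile(r"^#{1,6}\s+(.*)$")
FENCE = "`" * 3  # Markdown code-fence marker

# Hypothetical mapping from header keywords to phases (an assumption, not Beorn's).
PHASE_KEYWORDS = {
    "recon": "recon",
    "enumeration": "recon",
    "foothold": "foothold",
    "lateral": "lateral",
    "privilege": "privesc",
    "privesc": "privesc",
    "persistence": "persistence",
}


@dataclass
class Chunk:
    text: str
    metadata: dict = field(default_factory=dict)


def guess_phase(header: str) -> str:
    """Map a section header to a phase label, or 'unknown' if nothing matches."""
    lowered = header.lower()
    for keyword, phase in PHASE_KEYWORDS.items():
        if keyword in lowered:
            return phase
    return "unknown"


def chunk_writeup(markdown: str, base_meta: dict) -> list[Chunk]:
    """Split a writeup into one chunk per header section, keeping code fences whole."""
    chunks: list[Chunk] = []
    buffer: list[str] = []
    header = ""
    in_fence = False

    def flush() -> None:
        # Emit the buffered section under the header that opened it.
        text = "\n".join(buffer).strip()
        if text:
            chunks.append(Chunk(text, {**base_meta, "phase": guess_phase(header)}))
        buffer.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence
        # A '#' line inside a code fence is a shell or Python comment, not a header.
        match = None if in_fence else HEADER_RE.match(line)
        if match:
            flush()
            header = match.group(1)
        buffer.append(line)
    flush()
    return chunks
```

Because chunks follow sections rather than a token budget, their sizes vary naturally with the content, which is consistent with the variance described above.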
Multilingual embeddings because writeups aren't only in English
Nearly a third of our corpus is Spanish, French, or Russian. We tried OpenAI's text-embedding-3-small and multilingual-e5-large, and ended up on bge-m3 (Chen et al., 2024)[3], which supports dense, sparse, and multi-vector retrieval simultaneously, in 100+ languages, with an 8192-token context. It worked best for us when mixing languages in the same collection without losing recall.
A non-trivial detail: the agent's queries usually come in technical English ("CVE-2021-26855 SSRF chain") but point to explanations written in Spanish. Without quality multilingual embeddings, that cross-lingual recall collapses. With bge-m3 we measured an nDCG@10 of 0.81 on a hand-evaluated set of 200 queries; with OpenAI's model we were stuck at 0.69.
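For readers who haven't computed nDCG@10 by hand, here is a small sketch of the metric. It assumes graded relevance judgments per query and uses the linear-gain formulation; which variant we used for the figures above isn't specified here, so treat the function names and the formulation as assumptions.

```python
import math


def dcg(relevances, k):
    """Discounted cumulative gain over the first k results (linear gain)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances, judged_relevances, k=10):
    """nDCG@k for one query.

    ranked_relevances: grades of the results in the order the retriever returned them.
    judged_relevances: grades of every judged document for the query (any order),
    used to build the ideal ranking.
    """
    ideal = dcg(sorted(judged_relevances, reverse=True), k)
    return dcg(ranked_relevances, k) / ideal if ideal > 0 else 0.0


def mean_ndcg(per_query, k=10):
    """Average nDCG@k over (ranked, judged) pairs, one pair per query."""
    return sum(ndcg_at_k(ranked, judged, k) for ranked, judged in per_query) / len(per_query)
```

A perfect ordering scores 1.0, and putting a highly relevant chunk below a marginal one is penalized more the higher up the mistake happens, which is why the metric suits a retriever whose consumer only reads the first few results.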
Reranking, because dense top-k lies
Returning the top 10 vectors by cosine similarity is what the tutorials teach you. It's also what breaks in production. The first pass brings back plausible candidates but the ordering is noisy. We added a bge-reranker-v2-m3 reranker over the top 50 candidates and kept the top 5. Cost: +18 ms. Precision@5 improvement: from 0.64 to 0.87.
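```python
# Simplified pipeline: dense retrieval casts a wide net, the reranker reorders it.
candidates = vector_store.search(query_emb, k=50)  # top 50 by cosine similarity
reranked = reranker.score(query, [c.text for c in candidates])  # one score per candidate
# Pair each candidate with its reranker score, sort descending, keep the best 5.
top_k = sorted(zip(candidates, reranked), key=lambda pair: pair[1], reverse=True)[:5]
```

Delivering context to an agent that is exploiting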
Beorn isn't a chatbot. It lives as a service behind the Gandalf Gateway, and queries are issued by the agent without human intervention. That changes the contract: the response has to be ingestible by another LLM in milliseconds, not by a human reading.
A real query from the last engagement:
```shell
beorn query \
  --filter "os=linux,phase=privesc" \
  --context "kernel 5.4.0, sudo 1.8.31, capabilities cap_dac_read_search+ep" \
  "binary with capabilities exploitation"
```

Response:
```json
{
  "latency_ms": 41,
  "results": [
    {
      "chunk_id": "htb-academy-linux-privesc-cap-dac",
      "score": 0.94,
      "ttp": "T1548.001",
      "summary": "cap_dac_read_search permite leer cualquier fichero...",
      "command": "getcap -r / 2>/dev/null | grep dac_read",
      "next_steps": ["read /etc/shadow", "extract hashes", "john --wordlist"]
    }
  ]
}
```

The agent receives this, not a paragraph of prose. The difference between handing Gandalf "here's some text, figure out what to do" and "here's the MITRE TTP, the initial command, and the next three steps" is the difference between an agent that improvises and one that executes.
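On the consuming side, a contract like this is only useful if it is enforced. The following is a sketch of how an agent could validate the response before acting on it. The field names come from the response shown above; `Result`, `parse_response`, and the `min_score` cutoff are hypothetical and not part of Beorn's API.

```python
import json
from dataclasses import dataclass

# Fields every result must carry, taken from the sample response.
REQUIRED_FIELDS = ("chunk_id", "score", "ttp", "summary", "command", "next_steps")


@dataclass(frozen=True)
class Result:
    chunk_id: str
    score: float
    ttp: str
    summary: str
    command: str
    next_steps: tuple[str, ...]


def parse_response(raw: str, min_score: float = 0.0) -> list[Result]:
    """Parse a JSON response, rejecting malformed results and dropping low scores."""
    payload = json.loads(raw)
    results = []
    for item in payload.get("results", []):
        missing = [key for key in REQUIRED_FIELDS if key not in item]
        if missing:
            # Fail loudly: an agent acting on a half-filled result is worse than no result.
            raise ValueError(f"result is missing fields: {missing}")
        if item["score"] < min_score:
            continue
        results.append(
            Result(
                chunk_id=item["chunk_id"],
                score=float(item["score"]),
                ttp=item["ttp"],
                summary=item["summary"],
                command=item["command"],
                next_steps=tuple(item["next_steps"]),
            )
        )
    return results
```

Failing on a missing field rather than silently defaulting it is a deliberate choice in this sketch: the consumer is another program, so a broken contract should surface immediately.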
Real metrics
We've been running Beorn in internal production for four months. The numbers we monitor:
- Latency: p50 41 ms, p99 78 ms, with no cache.
- Precision@5 on an annotated set of 312 real queries: 0.87.
- Recall@20: 0.93.
- Corpus coverage: 9115 chunks, 28 collections.
- Languages: EN 64%, ES 22%, FR 8%, RU 4%, other 2%.
A concrete case is worth more than those numbers: in a recent Active Directory assessment, the agent identified an ESC8 attack path (ADCS HTTP enrollment) in 6 minutes from the initial foothold. Without Beorn, in comparable previous exercises, the equivalent phase took us between 40 and 90 minutes of manually hunting for exploitation details. It's not magic. It's simply no longer wasting time trying to remember where a detail was.
What we wouldn't do again
First: we started by stuffing everything into a single huge collection with metadata as a filter. Bad idea. Once you reach several thousand vectors, post-retrieval filters get expensive and recall degrades because embeddings from different domains compete in the same space. The current 28 collections are split by source type and OS. The gateway orchestrates them.
Second: we underestimated deduplication. Three different writeups explain the same technique in nearly identical words. Without semantic dedup, the top-5 hands you back five versions of the same paragraph. We added an MMR (Maximal Marginal Relevance) step and the subjective usefulness of the top-k went up noticeably, even though it's hard to measure with classical metrics.
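For reference, MMR fits in a few lines. This is a plain-Python sketch of the standard greedy formulation, not our production code: `mmr_select` and the default `lambda_=0.7` are illustrative assumptions, and it expects the query and candidate embeddings as lists of floats.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def mmr_select(query_vec, doc_vecs, k=5, lambda_=0.7):
    """Greedily pick k document indices, trading relevance against redundancy.

    lambda_=1.0 is pure relevance ranking; lower values penalize candidates
    that resemble something already selected.
    """
    relevance = [cosine(query_vec, doc) for doc in doc_vecs]
    selected = []
    remaining = list(range(len(doc_vecs)))
    while remaining and len(selected) < k:
        best, best_score = None, -math.inf
        for i in remaining:
            # Redundancy is the similarity to the closest already-selected document.
            redundancy = max(
                (cosine(doc_vecs[i], doc_vecs[j]) for j in selected), default=0.0
            )
            score = lambda_ * relevance[i] - (1 - lambda_) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected
```

With two identical chunks and one distinct but equally relevant chunk, this picks one of the duplicates and then the distinct chunk, instead of returning the same paragraph twice.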
Third, and this one hurt: the first release had no continuous evaluation. We were uploading new embeddings, swapping the reranker, and going by gut. Until a model change knocked our precision@5 down eight points and nobody noticed for a week. Now we have a golden set of 300+ queries with validated answers, and any change is evaluated before being promoted. The lesson is also documented in Gao et al.'s (2024)[4] survey: without an evaluation loop, a RAG system ages badly.
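The evaluation gate itself doesn't need to be elaborate. Below is a sketch under stated assumptions: the golden set is a list of `(query, relevant_chunk_ids)` pairs, a retriever is any callable that returns ranked chunk ids, and the `max_drop` tolerance of 0.01 is an arbitrary placeholder. None of these names come from our codebase.

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved ids that are in the relevant set."""
    return sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant) / k


def evaluate(retriever, golden_set, k=5):
    """Mean precision@k of a retriever over the whole golden set."""
    scores = [
        precision_at_k(retriever(query), relevant, k) for query, relevant in golden_set
    ]
    return sum(scores) / len(scores)


def can_promote(baseline_score, candidate_score, max_drop=0.01):
    """Allow promotion only if the candidate does not regress beyond the tolerance."""
    return candidate_score >= baseline_score - max_drop
```

Run in CI against every embedding or reranker change, a check like this would have flagged an eight-point drop in precision@5 on the day it was introduced rather than a week later.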
RAG isn't magic, and it's not a search engine. It's the operational memory your agent should have had all along. Treating it as such changes the design from the chunk all the way to the output contract.
If your team is building something similar, the most useful advice I can give is: start with the contract. Not the model. What does the consumer — human or agent — need to receive to act? That determines the chunking, the metadata, and the response format. Everything else is engineering.
References
- [1] Shen, X., et al. (2024). PentestAgent: Incorporating LLM Agents to Automated Penetration Testing. arXiv:2411.05185.
- [2] Wang, Y., et al. (2024). Finding the Best Chunking Strategy for Accurate AI Responses. NVIDIA Technical Blog.
- [3] Chen, J., et al. (2024). BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings. BAAI / Hugging Face.
- [4] Gao, A., et al. (2024). Retrieval Augmented Generation for Robust Cyber Defense. PNNL-36792.