red-team

Secure Tool Calling in Air-Gapped Environments

The first warning was a DNS packet. Just one, leaving a lab that was supposed to be isolated, headed toward a domain that looked like a base32 identifier. The operator was auditing it with a local agent running Qwen 2.5 7B served by Ollama, and the agent, in theory, had no network access beyond the lab's internal range. The packet went out because one of the tools the agent could call was resolve_target, and the model, faced with a poisoned prompt that came in via an SMB banner indexed by our RAG, decided that the next logical step was to resolve an FQDN the attacker had deliberately planted in the banner.

It wasn't an APT. It was a proof of concept we ran against ourselves, and it helped us nail down something we'd been chewing on for months: tool calling is the new attack surface, and in air-gapped environments that surface feels safer than it actually is. Spoiler: it isn't.

What happens when you hand the model hands

Tool calling changed everything. We went from models that spat out text to agents that run nmap, read files, query vectors and, if we're careless, make outbound requests. Yao and colleagues described it well in their work on jailbreaking via function calling, where they report success rates above 90% attacking GPT-4o, Claude 3.5 Sonnet and Gemini 1.5 Pro through the function mechanism itself (Wu et al., 2024)[1]. The model is aligned against the user, but not necessarily against the tools you hand it.

The problem is structural. When an LLM decides which tool to call, it's consuming context. And that context, in a pentest, is untrusted text by definition: HTTP responses, banners, retrieved files, scanner outputs. Wang and colleagues formalized it in From Allies to Adversaries, showing how an attacker can inject manipulator tools or poison responses to force the agent into calling functions it shouldn't (Wang et al., 2024)[2]. The line between "data" and "instructions" blurs, and the model, which has no semantic parser to tell the two apart, swallows the poisoning whole.

The air-gap myth

A lot of folks assume that if the model runs locally on Ollama with no internet egress, the problem goes away. It doesn't go away, it morphs. We saw three recurring patterns in our own labs:

  • Side-channel exfiltration: internal DNS, ARP, logs that sync later, temp files someone picks up.
  • Internal pivoting: the agent legitimately has access to the client's network; the attacker uses it as a proxy.
  • RAG self-poisoning: the agent stores observations that a future agent will read as authoritative context.

That last one is especially nasty. Shi and others' work on Log-To-Leak describes exactly this vector: the agent calls a seemingly benign logging tool that ends up leaking queries, responses and internal state (Shi et al., 2025)[3]. And because the log lives inside the perimeter, traditional SIEMs see nothing weird.

How we tackled it in practice

I'll level with you: there's no silver bullet. What we have is a series of layers, each one assuming the previous can fail. In our internal stack we split three responsibilities, because mixing them into a single binary is asking for trouble.

On one side, Gandalf plays gateway. It's the only thing that talks to the model and to the operator. It has a component we call Sentinel that applies policies before and after every tool call: it validates that the arguments fall within the authorized CIDR scope, that the command doesn't match any exfiltration signature, and that the agent isn't trying to wander out of the garden. If something smells off, the kill-switch cuts the session, drops the context and raises an alert.

On the other, Gwaihir is the executor. When a call passes Sentinel, it materializes as a child process with a seccomp-bpf filter that only allows the subset of syscalls that specific tool needs. No arbitrary connect(), no execve() to binaries outside the allowlist. This is directly inspired by what Wei and colleagues propose in Securing AI Agent Execution, where they argue that the executor should be provisioned dynamically with only the permissions for the current plan step (Wei et al., 2025)[4]. Least privilege applied per syscall, not per role.

And then Beorn, which maintains the RAG with the HTB vectors we use as operational knowledge (around 9115 right now). Beorn never receives input from the model directly; queries flow through Gandalf, which normalizes them. The RAG is read-only from the agent's perspective, which kills the self-poisoning vector I mentioned earlier.

An example of what a Sentinel policy looks like for a scanning tool:

{
  "tool": "nmap_scan",
  "scope_cidr": ["10.10.11.0/24"],
  "deny_flags": ["-oN", "--script=http-fetch", "-iL"],
  "max_runtime_s": 120,
  "seccomp_profile": "gwaihir/profiles/nmap.json",
  "kill_switch": {
    "on_outbound_dns": true,
    "on_unexpected_egress": true,
    "on_token_budget_exceeded": 4096
  }
}

The on_outbound_dns was what caught the SMB banner incident. Any resolution that doesn't fall inside the lab's internal zone trips the kill-switch before the packet leaves the interface.

Local models: why Qwen, Llama and Phi

Operating against hosted models is simply incompatible with air-gap. But there's a subtler reason: commercial models have tool calling trained against a fixed schema, and when you stray from that schema they tend to hallucinate arguments. With Qwen 2.5, Llama 3.1 and Phi-4 served by Ollama, we can force structured output with GBNF grammars, which drastically shrinks the "creative arguments" surface. Not perfect, but auditable.

The trade-off is real. You lose raw capability: a Qwen 7B doesn't reason like Claude Opus across a five-step chain with chained tools. You compensate by breaking the plan into smaller steps and letting Sentinel validate each one separately. In exchange, you gain full traceability, predictable latency (tokens cost ms, not dollars) and the option to audit the model bit by bit if you need to.

What breaks when you isolate

It's not all upside. Three things broke on us once we isolated everything:

First, visibility into emerging threats. With no telemetry leaving the lab you can't correlate against CTI feeds in real time. We solved it with an asymmetric channel: the lab doesn't talk outbound, but a process outside the lab can pull from an internal bucket every X minutes and enrich.

Second, RAG updates. The 9115 HTB vectors we keep in Beorn don't refresh themselves. You have to reindex outside and push the signed snapshot back in. Operational friction, sure, but predictable.

Third, operator UX. Used to chatting with a big model, the operator sometimes feels Qwen is "dumber". It is, in a sense. But a dumb model executing inside a seccomp-bpf with CIDR scope and a kill-switch is far less dangerous than a brilliant model with free access to syscalls.

Take-away

If you take one thing from this, let it be this: air-gap isn't a property, it's an architecture. And within that architecture, tool calling is where trust breaks first. Start with the cheapest and most effective, in this order:

  1. Define an explicit CIDR scope per session, not per agent. A Sentinel-equivalent that validates it before every call.
  2. Put every tool execution behind a seccomp-bpf filter with a syscall allowlist. If you don't have Gwaihir, look at bubblewrap, gVisor or nsjail.
  3. Implement a kill-switch that reacts to unexpected egress, not just to suspicious commands. The model will surprise you in the arguments, not in the tool names.
  4. Treat RAG context as untrusted data. Yes, even the stuff you put in yesterday.

That DNS packet never made it anywhere. But the Sentinel log is still in our post-mortem, reminding us that a local agent, with no internet and the best intentions, can try to talk to an invented domain because someone wrote it into a banner six months ago. That's the operational reality. The rest is theater.

Trust models just enough to do their job. Distrust their context, always.

References

  1. Wu, Z. et al. (2024). The Dark Side of Function Calling: Pathways to Jailbreaking Large Language Models. arXiv:2407.17915.
  2. Wang, H. et al. (2024). From Allies to Adversaries: Manipulating LLM Tool-Calling through Adversarial Injection. arXiv:2412.10198.
  3. Shi, Y. et al. (2025). Log-To-Leak: Prompt Injection Attacks on Tool-Using LLM Agents via Model Context Protocol. OpenReview.
  4. Wei, J. et al. (2025). Securing AI Agent Execution. arXiv:2510.21236.
  5. Patel, R. et al. (2025). Architecting Resilient LLM Agents: A Guide to Secure Plan-and-Execute Patterns. arXiv:2509.08646.