Observability in Offensive Operations

By Team Berialabs • May 15, 2026

The email arrived at 09:14 on a Tuesday. The client's CISO, blunt: "What exactly did your agent do at 03:42 last night? The SOC saw an outbound connection that wasn't in the runbook and blocked it. We need to know if it was your doing."

We answered in eleven minutes. Not because we had good memories, but because we looked up the trace_id of the command the SOC sent us and reconstructed the entire chain: the agent had found an endpoint, /api/internal/health, that responded with a version banner, had decided to validate whether it was exploitable, and before any payload went out Sentinel cut the operation because the destination was outside the authorized CIDR. The "outbound connection" was a SYN that didn't even complete the handshake. We sent the client the full span, with the ATT&CK ID, the policy engine's decision and the 47 bytes that went over the wire.

That day we decided observability was going to be a product, not a nice-to-have.

Offensive observability doesn't look like defensive observability

I've spent years building telemetry pipelines for SOCs. Loki, Tempo, Jaeger, whatever fits. And at first I thought instrumenting an offensive agent was the same problem, just with another schema. I was wrong.

A defensive SIEM assumes that system noise is signal: every auth.fail matters, every weird DNS matters. Cardinality explodes but the mental model is clear: keep everything, decide later. In offense the model inverts. The agent is the noise by definition: it scans, probes, fails, retries. If you keep everything with EDR-level granularity, a four-hour operation generates 18 GB of telemetry no one will read.

But there's something you do have to be able to reconstruct with surgical precision: the causal chain of any action that touched the client. Any of them. If the agent sends a payload to production, you have to know why it decided on that payload, what prior information justified it, what the policy engine validated and what the target responded. It's not paranoia: it's what separates a professional red team from a script kiddie with a budget.

MITRE ATT&CK mapping guides say it without hedging: evidence has to be anchored in real telemetry, not in operator assumptions (CISA, 2023)^[1]. If your report says "T1190 was executed against host X", I want to see the span with the http.request.body, the agent's decision, and the response code. Without that, it's opinion.

Event design: spans, traces, attributes

We use OpenTelemetry as the backbone. Not because it's trendy, but because the semantic conventions solve a real problem we had: every operator wrote logs however they felt like. One day target_host, another day dst, another victim_ip. Impossible to correlate. OpenTelemetry imposes a shared schema and, more importantly, propagates the trace context across processes and workers (OpenTelemetry, 2024)^[2].

The Gandalf CLI emits each command as a span. Every span carries its trace_id, its span_id, a parent_span_id pointing to the reasoning that originated it, and attributes following our berialabs.* convention on top of the standard semantics. The internal rule: if an attribute exists in the official spec, we use it as-is. If it's specific to offense (ATT&CK technique, agent decision, payload hash), we prefix it to avoid polluting the namespace.

A concrete example. This is a real span (anonymized) from an agent testing a SQL injection on a search parameter:

{
  "name": "gandalf.exploit.sqli_attempt",
  "trace_id": "4a1f9b2c8e3d7f6a5b4c3d2e1f0a9b8c",
  "span_id": "7c8b9a0d1e2f3a4b",
  "parent_span_id": "6b7a8c9d0e1f2a3b",
  "start_time_unix_nano": 1709823742891000000,
  "end_time_unix_nano": 1709823743104000000,
  "kind": "SPAN_KIND_CLIENT",
  "status": { "code": "STATUS_CODE_OK" },
  "attributes": {
    "http.request.method": "GET",
    "http.response.status_code": 500,
    "url.full": "https://target.example.com/api/search?q=*REDACTED*",
    "server.address": "10.42.7.18",
    "server.port": 443,
    "berialabs.attack.tactic": "TA0001",
    "berialabs.attack.technique": "T1190",
    "berialabs.agent.decision_id": "dec_8f3a",
    "berialabs.agent.reasoning_ref": "trace://4a1f9b2c.../6b7a8c9d",
    "berialabs.payload.sha256": "9e3f...c7a1",
    "berialabs.payload.family": "boolean_blind_sqli",
    "berialabs.sentinel.scope_check": "passed",
    "berialabs.sentinel.cidr_match": "10.42.0.0/16",
    "berialabs.evidence.response_signature": "mysql_error_xpath"
  },
  "events": [
    {
      "name": "sentinel.validation",
      "attributes": {
        "policy.id": "scope_v3",
        "policy.result": "allow"
      }
    },
    {
      "name": "response.received",
      "attributes": {
        "response.size_bytes": 1247,
        "response.contains_error_signature": true
      }
    }
  ]
}

There are three things I care about in this format. First, the parent_span_id points to the agent's reasoning, not to the previous command; that lets me reconstruct why it did what it did, not just what it did. Second, the payload's sha256 is referenced, not inlined: the full body lives in a separate store, and the span only keeps the hash. Third, the span's events capture key transitions (Sentinel's validation, the response) that a flat attribute couldn't represent well.
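To make the second point concrete, here is a minimal sketch of how a span's attributes could reference a payload by hash while the body goes to a separate store. It also derives the two Sentinel attributes seen in the example span. Everything here is an assumption for illustration: the function names, the in-memory PAYLOAD_STORE and the "blocked" value are hypothetical, and only the attribute keys come from the span above.

```python
import hashlib
import ipaddress
from typing import Dict, List, Optional

# Hypothetical stand-in for the separate payload store (assumption).
PAYLOAD_STORE: Dict[str, bytes] = {}


def scope_check(dest_ip: str, authorized_cidrs: List[str]) -> Optional[str]:
    """Return the first authorized CIDR that contains dest_ip, or None."""
    addr = ipaddress.ip_address(dest_ip)
    for cidr in authorized_cidrs:
        if addr in ipaddress.ip_network(cidr):
            return cidr
    return None


def build_attributes(payload: bytes, family: str, dest_ip: str,
                     authorized_cidrs: List[str]) -> Dict[str, str]:
    """Build span attributes that carry the payload hash, never the body."""
    digest = hashlib.sha256(payload).hexdigest()
    PAYLOAD_STORE[digest] = payload  # the full body lives outside the span

    matched = scope_check(dest_ip, authorized_cidrs)
    attrs = {
        "server.address": dest_ip,
        "berialabs.payload.sha256": digest,
        "berialabs.payload.family": family,
        # "blocked" is an assumed value; the source only shows "passed".
        "berialabs.sentinel.scope_check": "passed" if matched else "blocked",
    }
    if matched:
        attrs["berialabs.sentinel.cidr_match"] = matched
    return attrs
```

Calling build_attributes(b"...", "boolean_blind_sqli", "10.42.7.18", ["10.42.0.0/16"]) would yield a dictionary with the same keys as the example span, and the payload itself would only be retrievable through the hash.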

ATT&CK as an attribute, not a loose tag

We map each action to its ATT&CK technique at the moment of emitting the span, not after the fact. The agent carries an internal table that associates payload families with technique IDs, and the attribute travels with the span all the way to Tempo. When the client asks us for a report, it's not an archaeological exercise: it's a query in Grafana filtering by berialabs.attack.technique.
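A rough sketch of what tagging at emission time could look like. The table contents and function names are hypothetical: only the boolean_blind_sqli row is taken from the example span, and the second row is there purely to show the shape of the table.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical table: payload family -> (tactic ID, technique ID).
# Only the first row comes from the example span; the second is illustrative.
ATTACK_TABLE: Dict[str, Tuple[str, str]] = {
    "boolean_blind_sqli": ("TA0001", "T1190"),
    "port_scan": ("TA0007", "T1046"),
}


def tag_with_attack(attributes: dict, family: str) -> dict:
    """Return a copy of the attributes with the ATT&CK IDs attached."""
    # An unmapped family raises KeyError, so no span ships untagged.
    tactic, technique = ATTACK_TABLE[family]
    tagged = dict(attributes)
    tagged["berialabs.payload.family"] = family
    tagged["berialabs.attack.tactic"] = tactic
    tagged["berialabs.attack.technique"] = technique
    return tagged


def spans_for_technique(spans: Iterable[dict], technique: str) -> List[dict]:
    """Local equivalent of filtering by berialabs.attack.technique."""
    return [
        span for span in spans
        if span.get("attributes", {}).get("berialabs.attack.technique") == technique
    ]
```

Against Tempo, the same filter would presumably be a TraceQL query along the lines of { span.berialabs.attack.technique = "T1190" }; treat that exact form as an assumption to check against your Tempo version.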

eBPF to capture what the agent doesn't tell you

Here's the trick that took me time to accept. No matter how well you instrument your agent, there are things the agent doesn't know it's doing. A third-party library that opens a socket you weren't expecting. A DNS call that goes through getaddrinfo without going through your HTTP client. A child process that writes a temporary file. If you only trust userspace instrumentation, your traceability has holes.

That's why we put eBPF hooks in the kernel of the host where the agent runs. eBPF lets you run sandboxed programs inside the kernel and capture events without changing kernel source or loading kernel modules (Gregg, 2019; ebpf.io)^[3]. We hook four things: tcp_connect, execve, openat and DNS resolution via udp_sendmsg. Each event is enriched with the process's cgroup_id, which we correlate with the agent's active trace_id through a small shared-memory table.

The result: if the agent opens a connection to an IP that the span says it opened, perfect, everything matches. If it opens one to an IP that doesn't appear in any span, an alert fires. We've used this a couple of times to discover that a scraping library was prefetching favicons without warning. It wasn't malicious, but it could have been, and the client had a right to know.
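The reconciliation itself is simple once kernel events and spans share a key. The sketch below assumes the kernel side has already been reduced to (cgroup_id, destination IP) pairs and that the shared table is a plain dictionary; ConnectEvent and unexplained_connections are hypothetical names, not the real hook output.

```python
from typing import Dict, Iterable, List, NamedTuple, Optional, Set, Tuple


class ConnectEvent(NamedTuple):
    """A tcp_connect observation from the kernel side (assumed shape)."""
    cgroup_id: int
    dest_ip: str


def unexplained_connections(
    events: Iterable[ConnectEvent],
    cgroup_to_trace: Dict[int, str],
    span_dests: Dict[str, Set[str]],
) -> List[Tuple[Optional[str], ConnectEvent]]:
    """Return the events whose destination no span of the trace declared."""
    alerts = []
    for event in events:
        trace_id = cgroup_to_trace.get(event.cgroup_id)
        # A cgroup with no active trace has nothing declared at all.
        declared = span_dests.get(trace_id, set()) if trace_id else set()
        if event.dest_ip not in declared:
            alerts.append((trace_id, event))
    return alerts
```

A favicon prefetch like the one described above would show up here as an event whose dest_ip is missing from the trace's declared destinations.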

An uncomfortable note: eBPF is powerful, but it's not unbreakable. There's public work showing how purpose-built rootkits can blind eBPF-based tools if the attacker already controls the kernel (Matheuz, 2024)^[4]. In our case the threat model is different (we want to audit our own agent, not defend against an adversary with root), but it's worth keeping in mind.

The pipeline

From agent to Grafana, the hops are these. The Gandalf CLI emits OTLP over gRPC to the OpenTelemetry Collector running as a sidecar. The Collector does three things: it filters sensitive attributes, with a processor that sends anything matching PII patterns (emails, numbers that look like cards, auth headers) to /dev/null; it batches; and it re-exports. Traces go to Tempo. Structured logs go to Loki through the Collector's otlphttp exporter, since Loki has accepted OTLP natively since 3.0 (Grafana Labs, 2024)^[5]. Metrics (latency per technique, ratio of payloads blocked by Sentinel, request throughput) go to Prometheus.
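The filtering step is the easiest one to reason about in isolation. The following is a sketch of the idea in plain Python, not the Collector processor itself: the regular expressions, the key list and the function name are all assumptions, and real PII detection needs more than two patterns.

```python
import re

# Illustrative patterns only; both are assumptions, not the production rules.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
CARD_RE = re.compile(r"\b(?:\d[ -]?){13,19}\b")  # 13 to 19 digits, optional separators
SENSITIVE_KEYS = {"authorization", "proxy-authorization", "cookie"}


def drop_sensitive(attributes: dict) -> dict:
    """Return a copy of the attributes without anything that looks like PII."""
    kept = {}
    for key, value in attributes.items():
        # Auth material is dropped by key name, whatever the value is.
        if key.lower().rsplit(".", 1)[-1] in SENSITIVE_KEYS:
            continue
        # String values are dropped whole if they contain a PII-like match.
        if isinstance(value, str) and (EMAIL_RE.search(value) or CARD_RE.search(value)):
            continue
        kept[key] = value
    return kept
```

Dropping the whole attribute rather than masking the match mirrors the "/dev/null" behaviour described above; it loses more context, but it cannot leak a partially masked value.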

On top of everything, Grafana. And the killer feature isn't any pretty dashboard: it's trace-to-logs. Click on any span, jump to the logs correlated by trace_id. Click on any log, jump to the span. That correlation is what let us reply to the CISO in eleven minutes.
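Underneath the click, trace-to-logs is a join on trace_id. A toy version of that join, assuming log records are dictionaries with hypothetical trace_id and timestamp fields:

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def index_logs_by_trace(logs: Iterable[dict]) -> Dict[str, List[dict]]:
    """Group log records by trace_id; records without one are skipped."""
    index: Dict[str, List[dict]] = defaultdict(list)
    for record in logs:
        trace_id = record.get("trace_id")
        if trace_id:
            index[trace_id].append(record)
    return index


def logs_for_span(span: dict, index: Dict[str, List[dict]]) -> List[dict]:
    """Return the logs that share the span's trace_id, oldest first."""
    matches = index.get(span["trace_id"], [])
    return sorted(matches, key=lambda record: record["timestamp"])
```

The join only works if every log line carries the trace_id in the first place, which is the part that context propagation buys you.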

The trade-offs that hurt

Not everything is pretty. Three real tensions we keep negotiating.

Noise vs signal. If you instrument every internal decision of the agent, you generate millions of spans per operation. If you only instrument external commands, you lose the causal chain. We found a reasonable balance instrumenting the agent's decision nodes (not every token) and external side-effects without exception. Even so, in long operations we've seen spikes of 200k spans/hour.
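As a sketch, the balance described above comes down to a rule about which kinds of nodes become spans. The NodeKind categories and the select_spans helper are hypothetical names chosen for illustration:

```python
from enum import Enum
from typing import Iterable, List, Tuple


class NodeKind(Enum):
    TOKEN = "token"              # intermediate model output, never a span
    DECISION = "decision"        # a decision node in the agent's reasoning
    SIDE_EFFECT = "side_effect"  # anything that leaves the host


# Decision nodes and external side-effects are emitted; tokens are not.
EMITTED = {NodeKind.DECISION, NodeKind.SIDE_EFFECT}


def select_spans(nodes: Iterable[Tuple[NodeKind, str]]) -> List[str]:
    """Return the names of the nodes that should be emitted as spans."""
    return [name for kind, name in nodes if kind in EMITTED]
```

Because side-effects are in the emitted set unconditionally, no sampling decision upstream can make an action against the client disappear from the trace.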

Retention. Tempo on S3 is cheap, but logs in Loki with high cardinality cost more. Our current policy: full traces 90 days, logs 30 days, aggregates (metrics and summaries) three years. The legal pressure from SOC 2 and red team contracts pushes us to keep more, not less.
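Written down as data, the policy looks like this. The sketch assumes three record kinds named trace, log and aggregate, and approximates "three years" as 3 × 365 days; in practice retention would be configured in Tempo, Loki and the metrics store rather than in application code.

```python
from datetime import datetime, timedelta

# Retention windows from the policy above; the kind names are assumptions.
RETENTION = {
    "trace": timedelta(days=90),
    "log": timedelta(days=30),
    "aggregate": timedelta(days=3 * 365),  # "three years", ignoring leap days
}


def is_expired(kind: str, created_at: datetime, now: datetime) -> bool:
    """True when a record of this kind has outlived its retention window."""
    return now - created_at > RETENTION[kind]
```

The asymmetry matters when answering a client months later: between day 31 and day 90 of a record's life the trace is still there but the correlated logs are gone.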

PII in logs. This is the one that worries me most. If the agent extracts a database dump to demonstrate impact, that dump must not end up in the logs. The Collector's filter helps, but it's not enough. We keep a second layer: sensitive findings are encrypted with the client's public key before touching the pipeline, and only references are stored.

How it changed how we operate

Before having this pipeline, we did red team reports by hand. Screenshots, copy-paste of outputs, narrative reconstructed from memory. It took us days. Now most of the report is generated by querying Tempo and Loki, and the human operator focuses on interpreting, not transcribing.

More importantly: we discuss with the client from common ground. Not "we think the agent did X" but "here's the span at 03:42:18 with the decision, the payload and the response". A client told us last month it was the first time a red team had handed them telemetry their SOC could ingest as-is to train its detection rules. That, to me, is the metric that matters: that offensive observability is also useful to the defender.

And when an email arrives at 09:14 on a Tuesday, we answer in eleven minutes.

References

Team Berialabs

Member of Berialabs, specializing in AI-assisted offensive security.
