14 May 2026
Productionizing Cybersecurity AI Agents: Why Fresh Context Must Be Prebuilt
Table of contents
- The Use Case: SOC Copilot for Account Takeover and Lateral Movement
- Why Runtime Raw-Data Assembly Fails
- What DeltaStream Builds
- The Benchmark
- Benchmark Summary
- Detailed Benchmark Results
- Cost and Tool-Call Comparison
- Why the Raw Runtime Agent Failed
- What Makes Security Context Hard?
- DeltaStream’s Role
- Bigger Model vs. Better Context
- The Real Lesson
- Final Takeaway
AI agents are a natural fit for cybersecurity. SOC teams are overloaded, alerts are noisy, and analysts need help triaging incidents, explaining risk, and deciding what to do next.
But cybersecurity is also one of the clearest examples of why agents should not assemble context from raw data at runtime.
Security decisions depend on fast-changing, multi-source, stateful context:
Who is the user?
Is the login unusual?
Was there MFA fatigue?
Is the device managed?
Is the endpoint protected?
Did the device contact malicious infrastructure?
Was a cloud access key created after the login?
Was there lateral movement?
Did the user access sensitive data?
Is this a repeat incident?
What should the analyst do now?
No single raw system has that answer.
The right architecture is:
Security telemetry + threat intel + asset context
↓
DeltaStream
↓
Fresh, stateful, prebuilt security context
↓
SOC AI agent
↓
Accurate triage and response guidance
DeltaStream continuously builds the context before the agent is called. The agent gets current security truth, not raw log fragments.
The Use Case: SOC Copilot for Account Takeover and Lateral Movement
Consider a SOC AI agent helping analysts triage suspicious activity.
A raw alert says:
User: [email protected]
Signal: suspicious login
Source IP: 185.199.110.153
Device: LAP-8831
Time: 2026-05-08T17:00:00Z
A runtime-fetching agent may call the obvious systems:
get_recent_login()
get_identity_risk()
get_ip_reputation()
get_latest_endpoint_alert()
get_open_tickets()
That sounds reasonable. But it is not enough.
To triage correctly, the agent also needs:
MFA push count in the last 10 minutes
impossible travel calculation
device management status
EDR sensor status
endpoint alert burst in the last 30 minutes
fresh threat-intel enrichment
network egress count in the last 15 minutes
cloud access key creation after login
OAuth app consent events
sensitive data access after login
lateral movement indicators
closed ticket history
prior similar incident fingerprints
response playbook mapping
That is not simple retrieval. That is stateful security context.
Why Runtime Raw-Data Assembly Fails
A cybersecurity agent cannot reliably build this context at inference time because the important signals are distributed across many systems:
Identity provider: Okta / Entra ID
Endpoint/XDR: CrowdStrike / SentinelOne / Defender
Network: DNS / proxy / firewall / VPN
Cloud audit: AWS CloudTrail / Azure / GCP
SaaS audit: Google Workspace / Microsoft 365 / Salesforce
Threat intel: IP/domain/hash reputation feeds
Asset inventory: CMDB / vulnerability scanner / device management
Ticketing/SOAR: Jira / ServiceNow / PagerDuty / Cortex
The raw data is noisy, partial, late, duplicated, and often contradictory.
For example:
Identity says login succeeded.
MFA says push approved.
Endpoint says the device has suspicious PowerShell.
IP reputation says “unknown.”
Ticketing says there is no open incident.
A runtime agent may reasonably conclude:
This is suspicious, but not enough evidence for high severity. Investigate further.
But if the missing context says:
7 MFA pushes in 10 minutes
impossible travel at 7,800 km/h
unmanaged device
EDR inactive
fresh C2 threat-intel match
4 outbound connections to C2 in 15 minutes
new cloud access key created after login
new OAuth app consent granted
sensitive data accessed
lateral movement to SRV-FIN-22
similar incident in the last 30 days
Then the correct answer is very different:
High-severity account takeover with persistence and possible lateral movement. Revoke sessions, disable the new access key, revoke OAuth consent, isolate affected hosts, force password reset, block C2, audit data access, and escalate.
That answer requires context that does not exist in any single source.
What DeltaStream Builds
DeltaStream continuously builds a context view such as:
soc_incident_context_mv
Example context row:
{
  "incident_key": "INC-U100-20260508",
  "user_privileged": true,
  "login_status": "SUCCESS",
  "source_ip": "185.199.110.153",
  "impossible_travel": true,
  "geo_velocity_kmph": 7800,
  "mfa_push_count_10m": 7,
  "mfa_fatigue_pattern": true,
  "device_id": "LAP-8831",
  "device_managed": false,
  "edr_sensor_active": false,
  "endpoint_alert_count_30m": 3,
  "high_severity_endpoint_alert": true,
  "threat_intel_match": true,
  "threat_intel_type": "C2_IP",
  "egress_count_15m": 4,
  "new_cloud_access_key_created": true,
  "new_oauth_app_consent": true,
  "lateral_movement_detected": true,
  "sensitive_data_access_after_login": true,
  "similar_incident_last_30d": true,
  "incident_severity": "HIGH",
  "recommended_action": "REVOKE_SESSIONS_DISABLE_KEYS_ISOLATE_DEVICE_FORCE_PASSWORD_RESET_ESCALATE"
}
This is not a summary of raw logs. It is fresh operational security state.
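To make this concrete, here is a minimal sketch of how an agent might consume such a row. The `build_prompt` helper and the trimmed set of fields are illustrative, not a DeltaStream API; the field names mirror the example row above.

```python
# Hedged sketch: a SOC agent consuming a prebuilt context row.
# Field names mirror the example soc_incident_context_mv row;
# build_prompt is a hypothetical helper, not a DeltaStream API.
import json

CONTEXT_ROW = json.loads("""{
  "incident_key": "INC-U100-20260508",
  "impossible_travel": true,
  "mfa_push_count_10m": 7,
  "threat_intel_match": true,
  "incident_severity": "HIGH",
  "recommended_action": "REVOKE_SESSIONS_DISABLE_KEYS_ISOLATE_DEVICE_FORCE_PASSWORD_RESET_ESCALATE"
}""")

def build_prompt(row: dict) -> str:
    """Turn the prebuilt context row into a compact prompt for the model."""
    facts = [f"{k}={v}" for k, v in row.items() if k != "incident_key"]
    return (
        f"Incident {row['incident_key']}. "
        f"Current state: {'; '.join(facts)}. "
        "Explain the severity and recommended response to the analyst."
    )

prompt = build_prompt(CONTEXT_ROW)
```

The point is that the agent issues one read and one inference call; all cross-source correlation happened before this code runs.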
The Benchmark
We ran the benchmark with two models:
Big model: GPT-5.5
Small model: GPT-5.4-mini
Judge model: GPT-5.5
Each model answered the same 10 SOC questions in two modes:
Mode 1: Runtime raw-data assembly
The model receives only limited raw tool results and must infer the answer.
Mode 2: DeltaStream prebuilt context
The model receives one fresh, stateful context row computed by DeltaStream.
The result was clear: both models failed with raw runtime context assembly and succeeded with DeltaStream prebuilt context. The benchmark output shows GPT-5.5 raw-runtime answers were judged incorrect across all 10 cases, while GPT-5.5 with DeltaStream context was correct across all 10. The same pattern held for GPT-5.4-mini: 0/10 with raw runtime context and 10/10 with DeltaStream context.
Benchmark Summary
| Model | Approach | Correct Answers | Accuracy | Tool Calls | Total Tokens | Avg. Tokens / Question |
|---|---|---|---|---|---|---|
| GPT-5.5 | Runtime raw-data assembly | 0 / 10 | 0% | 32 | 7,765 | 777 |
| GPT-5.5 | DeltaStream prebuilt context | 10 / 10 | 100% | 10 | 4,419 | 442 |
| GPT-5.4-mini | Runtime raw-data assembly | 0 / 10 | 0% | 32 | 5,303 | 530 |
| GPT-5.4-mini | DeltaStream prebuilt context | 10 / 10 | 100% | 10 | 3,988 | 399 |
DeltaStream reduced tool calls by 69% for both models. For GPT-5.5, DeltaStream reduced token usage by 43%. For GPT-5.4-mini, DeltaStream reduced token usage by 25%. Most importantly, the small model with DeltaStream context achieved 10/10 correctness, while the big model with raw runtime assembly achieved 0/10 correctness.
Detailed Benchmark Results
| # | SOC Question | GPT-5.5 Raw Runtime | GPT-5.5 + DeltaStream | GPT-5.4-mini Raw Runtime | GPT-5.4-mini + DeltaStream |
|---|---|---|---|---|---|
| 1 | Is this suspicious login high severity? | ❌ | ✅ | ❌ | ✅ |
| 2 | Should the SOC immediately revoke sessions? | ❌ | ✅ | ❌ | ✅ |
| 3 | Is this just a false-positive impossible-travel alert? | ❌ | ✅ | ❌ | ✅ |
| 4 | Did the attacker attempt persistence? | ❌ | ✅ | ❌ | ✅ |
| 5 | Should endpoint LAP-8831 be isolated? | ❌ | ✅ | ❌ | ✅ |
| 6 | Is the destination IP malicious enough to escalate? | ❌ | ✅ | ❌ | ✅ |
| 7 | What is the blast radius? | ❌ | ✅ | ❌ | ✅ |
| 8 | Is there evidence of lateral movement? | ❌ | ✅ | ❌ | ✅ |
| 9 | Is this a repeat incident for this user? | ❌ | ✅ | ❌ | ✅ |
| 10 | What should the analyst do now? | ❌ | ✅ | ❌ | ✅ |
This is the key lesson: the raw-runtime agents were not failing because they were weak models. GPT-5.5 is a strong model. They failed because the correct answer depended on state that was not present in the raw tool results.
The DeltaStream agents succeeded because the complex security context was already computed.
Cost and Tool-Call Comparison
| Model | Metric | Runtime Raw Data | DeltaStream Context | DeltaStream Improvement |
|---|---|---|---|---|
| GPT-5.5 | Correct answers | 0 / 10 | 10 / 10 | +10 correct answers |
| GPT-5.5 | Tool calls | 32 | 10 | 69% fewer |
| GPT-5.5 | Total tokens | 7,765 | 4,419 | 43% fewer |
| GPT-5.4-mini | Correct answers | 0 / 10 | 10 / 10 | +10 correct answers |
| GPT-5.4-mini | Tool calls | 32 | 10 | 69% fewer |
| GPT-5.4-mini | Total tokens | 5,303 | 3,988 | 25% fewer |
The most important comparison is not just raw vs. DeltaStream for the same model. It is this:
| # | Comparison | Correctness | Tokens | Tool Calls |
|---|---|---|---|---|
| 1 | GPT-5.5 + raw runtime assembly | 0 / 10 | 7,765 | 32 |
| 2 | GPT-5.4-mini + DeltaStream context | 10 / 10 | 3,988 | 10 |
In this benchmark, the smaller model with better context beat the larger model with incomplete raw data. It was more accurate, used about 49% fewer tokens, and required 69% fewer tool calls.
That is a major production point.
If the context is fresh, complete, and decision-ready, teams can often use smaller, cheaper models for many operational agent tasks. The model does not need to spend expensive inference tokens reconstructing the attack graph. It can focus on explanation and response.
Why the Raw Runtime Agent Failed
The raw-runtime agent frequently gave careful and reasonable answers. That is exactly the problem.
It said things like:
Not enough evidence to declare high severity.
Do not revoke sessions yet.
Cannot determine whether persistence occurred.
No confirmed evidence of lateral movement.
The IP reputation is unknown.
The blast radius is potentially elevated but not confirmed.
Those answers are reasonable given the partial data.
But they are wrong given the real state.
The missing context included:
MFA fatigue window
impossible travel calculation
fresh threat-intel update
endpoint alert aggregation
EDR and device posture
cloud access key creation
OAuth app consent
malicious egress count
sensitive data access
lateral movement sequence
closed ticket history
similar incident fingerprint
playbook action mapping
In other words, the model was not missing intelligence. It was missing context.
What Makes Security Context Hard?
The hard part is not fetching “the latest alert.”
The hard part is computing state that no source system directly stores.
1. Rolling-Window Aggregations
Security decisions often depend on counts over time:
MFA pushes in the last 10 minutes
endpoint alerts in the last 30 minutes
egress connections in the last 15 minutes
SMB/RDP activity in the last 20 minutes
similar incidents in the last 30 days
A runtime agent may fetch the latest MFA event or the latest endpoint alert. That misses the pattern.
DeltaStream continuously computes these windows.
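As a rough mental model of what such a window computes, here is a minimal in-memory sketch of a sliding-window event count. In production this would be continuous streaming SQL, not Python; the timestamps and the fatigue threshold of 5 pushes are illustrative assumptions.

```python
# Hedged sketch of a rolling-window count, the in-memory equivalent of
# what a streaming engine maintains continuously. Timestamps are epoch
# seconds; the 5-push fatigue threshold is an illustrative assumption.
from collections import deque

class RollingCount:
    """Count events inside a sliding time window."""
    def __init__(self, window_seconds: int):
        self.window = window_seconds
        self.events = deque()  # event timestamps, oldest first

    def add(self, ts: float) -> None:
        self.events.append(ts)

    def count(self, now: float) -> int:
        # Evict events that fell out of the window, then count the rest.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events)

mfa_pushes = RollingCount(window_seconds=600)  # 10-minute window
for ts in [0, 60, 120, 180, 240, 300, 360]:    # 7 pushes in 6 minutes
    mfa_pushes.add(ts)

mfa_fatigue = mfa_pushes.count(now=400) >= 5   # fatigue pattern detected
```

A runtime agent that fetches only the latest MFA event never sees the count; the window is the signal.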
2. Temporal Joins
The order of events matters:
MFA fatigue → successful login
successful login → cloud access key creation
successful login → C2 egress
initial host login → second host login
second host login → SMB/RDP activity
risky login → sensitive data access
A runtime agent must discover and join these sequences during inference. That is fragile.
DeltaStream performs the joins continuously and exposes the result.
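The core of such a join is an ordering check with a time bound. Here is a minimal sketch for one of the sequences above, "successful login → cloud access key creation"; the one-hour window and the function name are illustrative assumptions, not a DeltaStream API.

```python
# Hedged sketch of a temporal join: did a cloud access key get created
# shortly AFTER a risky login? The 1-hour window is an illustrative
# assumption; event ordering is what makes this a persistence signal.
from datetime import datetime, timedelta

def key_created_after_login(login_at: datetime,
                            key_created_at: datetime,
                            within: timedelta = timedelta(hours=1)) -> bool:
    """True only if key creation follows the login inside the window."""
    return login_at < key_created_at <= login_at + within

login = datetime(2026, 5, 8, 17, 0, 0)       # successful login
key_event = datetime(2026, 5, 8, 17, 12, 0)  # access key created 12 min later

persistence_signal = key_created_after_login(login, key_event)
```

The same check reversed (key created before the login) is benign, which is why order, not mere co-occurrence, carries the meaning.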
3. Threat-Intel Enrichment
Threat intelligence changes quickly.
In the benchmark, the raw runtime agent saw the IP reputation as “unknown” or stale. DeltaStream context included fresh threat intel classifying the IP as C2 and correlated it with multiple outbound connections.
That changed the correct response from “monitor” to “escalate.”
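A sketch of why freshness matters in the enrichment itself: the lookup below records both the verdict and its age, so a stale "unknown" is distinguishable from a current C2 classification. The feed contents, field names, and six-hour freshness bound are illustrative assumptions, not a real reputation source.

```python
# Hedged sketch of threat-intel enrichment with a freshness check.
# THREAT_FEED is an illustrative in-memory stand-in for a reputation
# feed; the 6-hour max age is an assumed policy, not a standard.
from datetime import datetime, timedelta, timezone

THREAT_FEED = {
    # ip -> (classification, last_updated)
    "185.199.110.153": ("C2_IP",
                        datetime(2026, 5, 8, 16, 45, tzinfo=timezone.utc)),
}

def enrich_ip(ip: str, now: datetime,
              max_age: timedelta = timedelta(hours=6)) -> dict:
    entry = THREAT_FEED.get(ip)
    if entry is None:
        return {"ip": ip, "classification": "UNKNOWN", "fresh": False}
    classification, updated = entry
    return {"ip": ip, "classification": classification,
            "fresh": now - updated <= max_age}

verdict = enrich_ip("185.199.110.153",
                    now=datetime(2026, 5, 8, 17, 0, tzinfo=timezone.utc))
```

An agent that cached an older "unknown" verdict would reach the wrong escalation decision even with the same code.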
4. Asset and Identity Context
The same alert has very different severity depending on the user and asset.
Is the user privileged?
Does the user have cloud admin permissions?
Can the user access customer data?
Is the device managed?
Is EDR active?
Is the asset business critical?
DeltaStream joins identity, asset, endpoint, and access context into the incident state.
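The join itself can be pictured as overlaying identity and asset records onto the alert. The in-memory tables below are illustrative stand-ins for an IdP, a CMDB, and EDR posture data; the field names are assumptions that echo the context row shown earlier.

```python
# Hedged sketch of joining identity and asset context onto one alert.
# IDENTITY and ASSETS are illustrative stand-ins for an IdP, CMDB,
# and EDR posture source; field names are assumptions.
IDENTITY = {"[email protected]": {"privileged": True, "cloud_admin": True}}
ASSETS = {"LAP-8831": {"managed": False, "edr_active": False, "critical": False}}

def incident_context(user: str, device: str) -> dict:
    """Fuse identity and asset posture into one alert-scoped dict."""
    ctx = {"user": user, "device": device}
    ctx.update(IDENTITY.get(user, {"privileged": False, "cloud_admin": False}))
    ctx.update(ASSETS.get(device,
                          {"managed": None, "edr_active": None, "critical": None}))
    return ctx

ctx = incident_context("[email protected]", "LAP-8831")
```

A privileged user on an unmanaged device with no EDR is a very different incident than the same alert on a hardened corporate laptop.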
5. Pattern Recognition
A single signal may be ambiguous. A pattern is not.
MFA approved: maybe benign.
MFA approved after 7 push attempts: suspicious.
MFA fatigue + impossible travel + C2 egress + new access key: high severity.
DeltaStream turns streams of events into recognized patterns the agent can trust.
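The escalation from single signal to pattern can be sketched as one rule over the fused booleans. The specific combination and the severity tiers below are illustrative assumptions that mirror the examples above, not DeltaStream's actual policy logic.

```python
# Hedged sketch of pattern recognition over fused context fields.
# The combination rules and severity tiers are illustrative, not
# DeltaStream's actual policy; field names mirror the context row.
def classify_severity(ctx: dict) -> str:
    takeover_pattern = (
        ctx.get("mfa_fatigue_pattern")
        and ctx.get("impossible_travel")
        and ctx.get("threat_intel_match")
    )
    persistence = (ctx.get("new_cloud_access_key_created")
                   or ctx.get("new_oauth_app_consent"))
    if takeover_pattern and persistence:
        return "HIGH"    # takeover pattern plus persistence attempt
    if takeover_pattern:
        return "MEDIUM"  # pattern without persistence yet
    return "LOW"         # isolated signals stay ambiguous

severity = classify_severity({
    "mfa_fatigue_pattern": True,
    "impossible_travel": True,
    "threat_intel_match": True,
    "new_cloud_access_key_created": True,
})
```

Each input field on its own is ambiguous; only the conjunction justifies a high-severity verdict, which is exactly why the fields must be computed before inference.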
DeltaStream’s Role
DeltaStream is the real-time context platform for AI agents.
For cybersecurity agents, DeltaStream continuously performs:
stream ingestion
schema normalization
event-time ordering
deduplication
stateful joins
rolling-window aggregations
threat-intel enrichment
identity-to-device correlation
asset criticality joins
policy evaluation
pattern recognition
materialized context serving
The agent receives one compact, fresh context row:
{
  "incident_classification": "LIKELY_ACCOUNT_TAKEOVER_WITH_PERSISTENCE_AND_LATERAL_MOVEMENT",
  "incident_severity": "HIGH",
  "mfa_fatigue_pattern": true,
  "impossible_travel": true,
  "threat_intel_type": "C2_IP",
  "new_cloud_access_key_created": true,
  "new_oauth_app_consent": true,
  "lateral_movement_detected": true,
  "sensitive_data_access_after_login": true,
  "similar_incident_last_30d": true,
  "recommended_actions": [
    "REVOKE_USER_SESSIONS",
    "DISABLE_NEW_ACCESS_KEY",
    "REVOKE_OAUTH_APP_CONSENT",
    "ISOLATE_LAP-8831_AND_SRV-FIN-22",
    "FORCE_PASSWORD_RESET",
    "BLOCK_C2_IP",
    "AUDIT_SENSITIVE_DATA_ACCESS",
    "ESCALATE_TO_INCIDENT_COMMANDER"
  ]
}
Now the model can do what it is good at: communicate the situation clearly and help the analyst act.
Bigger Model vs. Better Context
One of the most important findings from the benchmark is that a bigger model did not save the raw-runtime approach.
GPT-5.5 with incomplete raw data scored 0/10.
GPT-5.4-mini with DeltaStream context scored 10/10.
That matters for production.
Teams often assume that using a larger model will fix agent accuracy. But if the model does not receive the right state, it cannot reliably produce the right decision. It may produce a more polished answer, but not necessarily a correct one.
Better context changes the economics:
Raw runtime assembly:
larger prompts
more tool calls
more latency
higher cost
lower correctness
DeltaStream prebuilt context:
smaller prompts
fewer tool calls
lower latency
lower cost
higher correctness
smaller models become viable
This is how you productionize agents: do not ask the model to reconstruct the world at inference time. Give it the current truth.
The Real Lesson
Cybersecurity agents do not fail only because models hallucinate.
They fail because context is incomplete.
A raw login event is not enough. An MFA approval is not enough. A medium endpoint alert is not enough. An “unknown” IP reputation is not enough. An empty open-ticket list is not enough.
The agent needs fused, stateful, fresh context.
That context must be built before inference.
Final Takeaway
For cybersecurity AI agents, prebuilt fresh context is not optional.
It is required for correctness, latency, cost control, and operational safety.
When the answer depends on rolling windows, cross-source correlation, threat-intel enrichment, identity posture, endpoint state, cloud audit, lateral movement, incident history, and response policy, the agent should not build context at runtime.
DeltaStream should.
DeltaStream turns raw security telemetry into fresh, trusted, agent-ready context. That context makes agents more accurate, reduces token and tool-call cost, and can make smaller, cheaper models viable for production workflows.
If you are building a SOC copilot, incident-response agent, threat-hunting assistant, or security operations AI agent, DeltaStream can provide the fresh context layer your agents need to make accurate, safe, and timely decisions.