14 May 2026
Productionizing Cybersecurity AI Agents: Why Fresh Context Must Be Prebuilt
Table of contents
- The Use Case: SOC Copilot for Account Takeover and Lateral Movement
- Why Runtime Raw-Data Assembly Fails
- What DeltaStream Builds
- The Benchmark
- Benchmark Summary
- Detailed Benchmark Results
- Cost and Tool-Call Comparison
- Why the Raw Runtime Agent Failed
- What Makes Security Context Hard?
- DeltaStream’s Role
- Bigger Model vs. Better Context
- The Real Lesson
- Final Takeaway
AI agents are a natural fit for cybersecurity. SOC teams are overloaded, alerts are noisy, and analysts need help triaging incidents, explaining risk, and deciding what to do next.
But cybersecurity is also one of the clearest examples of why agents should not assemble context from raw data at runtime.
Security decisions depend on fast-changing, multi-source, stateful context:
Who is the user?
Is the login unusual?
Was there MFA fatigue?
Is the device managed?
Is the endpoint protected?
Did the device contact malicious infrastructure?
Was a cloud access key created after the login?
Was there lateral movement?
Did the user access sensitive data?
Is this a repeat incident?
What should the analyst do now?
No single raw system has that answer.
The right architecture is:
Security telemetry + threat intel + asset context
↓
DeltaStream
↓
Fresh, stateful, prebuilt security context
↓
SOC AI agent
↓
Accurate triage and response guidance
DeltaStream continuously builds the context before the agent is called. The agent gets current security truth, not raw log fragments.
The Use Case: SOC Copilot for Account Takeover and Lateral Movement
Consider a SOC AI agent helping analysts triage suspicious activity.
A raw alert says:
User: [email protected]
Signal: suspicious login
Source IP: 185.199.110.153
Device: LAP-8831
Time: 2026-05-08T17:00:00Z
A runtime-fetching agent may call the obvious systems:
get_recent_login()
get_identity_risk()
get_ip_reputation()
get_latest_endpoint_alert()
get_open_tickets()
That sounds reasonable. But it is not enough.
To triage correctly, the agent also needs:
MFA push count in the last 10 minutes
impossible travel calculation
device management status
EDR sensor status
endpoint alert burst in the last 30 minutes
fresh threat-intel enrichment
network egress count in the last 15 minutes
cloud access key creation after login
OAuth app consent events
sensitive data access after login
lateral movement indicators
closed ticket history
prior similar incident fingerprints
response playbook mapping
That is not simple retrieval. That is stateful security context.
Why Runtime Raw-Data Assembly Fails
A cybersecurity agent cannot reliably build this context at inference time because the important signals are distributed across many systems:
Identity provider: Okta / Entra ID
Endpoint/XDR: CrowdStrike / SentinelOne / Defender
Network: DNS / proxy / firewall / VPN
Cloud audit: AWS CloudTrail / Azure / GCP
SaaS audit: Google Workspace / Microsoft 365 / Salesforce
Threat intel: IP/domain/hash reputation feeds
Asset inventory: CMDB / vulnerability scanner / device management
Ticketing/SOAR: Jira / ServiceNow / PagerDuty / Cortex
The raw data is noisy, partial, late, duplicated, and often contradictory.
For example:
Identity says login succeeded.
MFA says push approved.
Endpoint says the device has suspicious PowerShell.
IP reputation says “unknown.”
Ticketing says there is no open incident.
A runtime agent may reasonably conclude:
This is suspicious, but not enough evidence for high severity. Investigate further.
But if the missing context says:
7 MFA pushes in 10 minutes
impossible travel at 7,800 km/h
unmanaged device
EDR inactive
fresh C2 threat-intel match
4 outbound connections to C2 in 15 minutes
new cloud access key created after login
new OAuth app consent granted
sensitive data accessed
lateral movement to SRV-FIN-22
similar incident in the last 30 days
Then the correct answer is very different:
High-severity account takeover with persistence and possible lateral movement. Revoke sessions, disable the new access key, revoke OAuth consent, isolate affected hosts, force password reset, block C2, audit data access, and escalate.
That answer requires context that does not exist in any single source.
What DeltaStream Builds
DeltaStream continuously builds a context view such as:
soc_incident_context_mv
Example context row:
{
  "incident_key": "INC-U100-20260508",
  "user_privileged": true,
  "login_status": "SUCCESS",
  "source_ip": "185.199.110.153",
  "impossible_travel": true,
  "geo_velocity_kmph": 7800,
  "mfa_push_count_10m": 7,
  "mfa_fatigue_pattern": true,
  "device_id": "LAP-8831",
  "device_managed": false,
  "edr_sensor_active": false,
  "endpoint_alert_count_30m": 3,
  "high_severity_endpoint_alert": true,
  "threat_intel_match": true,
  "threat_intel_type": "C2_IP",
  "egress_count_15m": 4,
  "new_cloud_access_key_created": true,
  "new_oauth_app_consent": true,
  "lateral_movement_detected": true,
  "sensitive_data_access_after_login": true,
  "similar_incident_last_30d": true,
  "incident_severity": "HIGH",
  "recommended_action": "REVOKE_SESSIONS_DISABLE_KEYS_ISOLATE_DEVICE_FORCE_PASSWORD_RESET_ESCALATE"
}
This is not a summary of raw logs. It is fresh operational security state.
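To make this concrete, here is a minimal sketch of how an agent might consume such a row. The `build_prompt` helper and the trimmed set of fields are illustrative, not a DeltaStream API; the field names mirror the example row above.

```python
# Hedged sketch: a SOC agent consuming a prebuilt context row.
# Field names mirror the example soc_incident_context_mv row;
# build_prompt is a hypothetical helper, not a DeltaStream API.
import json

CONTEXT_ROW = json.loads("""{
  "incident_key": "INC-U100-20260508",
  "impossible_travel": true,
  "mfa_push_count_10m": 7,
  "threat_intel_match": true,
  "incident_severity": "HIGH",
  "recommended_action": "REVOKE_SESSIONS_DISABLE_KEYS_ISOLATE_DEVICE_FORCE_PASSWORD_RESET_ESCALATE"
}""")

def build_prompt(row: dict) -> str:
    """Turn the prebuilt context row into a compact prompt for the model."""
    facts = [f"{k}={v}" for k, v in row.items() if k != "incident_key"]
    return (
        f"Incident {row['incident_key']}. "
        f"Current state: {'; '.join(facts)}. "
        "Explain the severity and recommended response to the analyst."
    )

prompt = build_prompt(CONTEXT_ROW)
```

The point is that the agent issues one read and one inference call; all cross-source correlation happened before this code runs.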
The Benchmark
We ran the benchmark with two models:
Big model: GPT-5.5
Small model: GPT-5.4-mini
Judge model: GPT-5.5
Each model answered the same 10 SOC questions in two modes:
Mode 1: Runtime raw-data assembly
The model receives only limited raw tool results and must infer the answer.
Mode 2: DeltaStream prebuilt context
The model receives one fresh, stateful context row computed by DeltaStream.
The result was clear: both models failed with raw runtime context assembly and succeeded with DeltaStream prebuilt context. The benchmark output shows GPT-5.5 raw-runtime answers were judged incorrect across all 10 cases, while GPT-5.5 with DeltaStream context was correct across all 10. The same pattern held for GPT-5.4-mini: 0/10 with raw runtime context and 10/10 with DeltaStream context.
Benchmark Summary
| Model | Approach | Correct Answers | Accuracy | Tool Calls | Total Tokens | Avg. Tokens / Question |
|---|---|---|---|---|---|---|
| GPT-5.5 | Runtime raw-data assembly | 0 / 10 | 0% | 32 | 7,765 | 777 |
| GPT-5.5 | DeltaStream prebuilt context | 10 / 10 | 100% | 10 | 4,419 | 442 |
| GPT-5.4-mini | Runtime raw-data assembly | 0 / 10 | 0% | 32 | 5,303 | 530 |
| GPT-5.4-mini | DeltaStream prebuilt context | 10 / 10 | 100% | 10 | 3,988 | 399 |
DeltaStream reduced tool calls by 69% for both models. For GPT-5.5, DeltaStream reduced token usage by 43%. For GPT-5.4-mini, DeltaStream reduced token usage by 25%. Most importantly, the small model with DeltaStream context achieved 10/10 correctness, while the big model with raw runtime assembly achieved 0/10 correctness.
Detailed Benchmark Results
| # | SOC Question | GPT-5.5 Raw Runtime | GPT-5.5 + DeltaStream | GPT-5.4-mini Raw Runtime | GPT-5.4-mini + DeltaStream |
|---|---|---|---|---|---|
| 1 | Is this suspicious login high severity? | ❌ | ✅ | ❌ | ✅ |
| 2 | Should the SOC immediately revoke sessions? | ❌ | ✅ | ❌ | ✅ |
| 3 | Is this just a false-positive impossible-travel alert? | ❌ | ✅ | ❌ | ✅ |
| 4 | Did the attacker attempt persistence? | ❌ | ✅ | ❌ | ✅ |
| 5 | Should endpoint LAP-8831 be isolated? | ❌ | ✅ | ❌ | ✅ |
| 6 | Is the destination IP malicious enough to escalate? | ❌ | ✅ | ❌ | ✅ |
| 7 | What is the blast radius? | ❌ | ✅ | ❌ | ✅ |
| 8 | Is there evidence of lateral movement? | ❌ | ✅ | ❌ | ✅ |
| 9 | Is this a repeat incident for this user? | ❌ | ✅ | ❌ | ✅ |
| 10 | What should the analyst do now? | ❌ | ✅ | ❌ | ✅ |
This is the key lesson: the raw-runtime agents were not failing because they were weak models. GPT-5.5 is a strong model. They failed because the correct answer depended on state that was not present in the raw tool results.
The DeltaStream agents succeeded because the complex security context was already computed.
Cost and Tool-Call Comparison
| Model | Metric | Runtime Raw Data | DeltaStream Context | DeltaStream Improvement |
|---|---|---|---|---|
| GPT-5.5 | Correct answers | 0 / 10 | 10 / 10 | +10 correct answers |
| GPT-5.5 | Tool calls | 32 | 10 | 69% fewer |
| GPT-5.5 | Total tokens | 7,765 | 4,419 | 43% fewer |
| GPT-5.4-mini | Correct answers | 0 / 10 | 10 / 10 | +10 correct answers |
| GPT-5.4-mini | Tool calls | 32 | 10 | 69% fewer |
| GPT-5.4-mini | Total tokens | 5,303 | 3,988 | 25% fewer |
The most important comparison is not just raw vs. DeltaStream for the same model. It is this:
| # | Comparison | Correctness | Tokens | Tool Calls |
|---|---|---|---|---|
| 1 | GPT-5.5 + raw runtime assembly | 0 / 10 | 7,765 | 32 |
| 2 | GPT-5.4-mini + DeltaStream context | 10 / 10 | 3,988 | 10 |
In this benchmark, the smaller model with better context beat the larger model with incomplete raw data. It was more accurate, used about 49% fewer tokens, and required 69% fewer tool calls.
That is a major production point.
If the context is fresh, complete, and decision-ready, teams can often use smaller, cheaper models for many operational agent tasks. The model does not need to spend expensive inference tokens reconstructing the attack graph. It can focus on explanation and response.
Why the Raw Runtime Agent Failed
The raw-runtime agent frequently gave careful and reasonable answers. That is exactly the problem.
It said things like:
Not enough evidence to declare high severity.
Do not revoke sessions yet.
Cannot determine whether persistence occurred.
No confirmed evidence of lateral movement.
The IP reputation is unknown.
The blast radius is potentially elevated but not confirmed.
Those answers are reasonable given the partial data.
But they are wrong given the real state.
The missing context included:
MFA fatigue window
impossible travel calculation
fresh threat-intel update
endpoint alert aggregation
EDR and device posture
cloud access key creation
OAuth app consent
malicious egress count
sensitive data access
lateral movement sequence
closed ticket history
similar incident fingerprint
playbook action mapping
In other words, the model was not missing intelligence. It was missing context.
What Makes Security Context Hard?
The hard part is not fetching “the latest alert.”
The hard part is computing state that no source system directly stores.
1. Rolling-Window Aggregations
Security decisions often depend on counts over time:
MFA pushes in the last 10 minutes
endpoint alerts in the last 30 minutes
egress connections in the last 15 minutes
SMB/RDP activity in the last 20 minutes
similar incidents in the last 30 days
A runtime agent may fetch the latest MFA event or the latest endpoint alert. That misses the pattern.
DeltaStream continuously computes these windows.
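As a rough mental model of what such a window computes, here is a minimal in-memory sketch of a sliding-window event count. In production this would be continuous streaming SQL, not Python; the timestamps and the fatigue threshold of 5 pushes are illustrative assumptions.

```python
# Hedged sketch of a rolling-window count, the in-memory equivalent of
# what a streaming engine maintains continuously. Timestamps are epoch
# seconds; the 5-push fatigue threshold is an illustrative assumption.
from collections import deque

class RollingCount:
    """Count events inside a sliding time window."""
    def __init__(self, window_seconds: int):
        self.window = window_seconds
        self.events = deque()  # event timestamps, oldest first

    def add(self, ts: float) -> None:
        self.events.append(ts)

    def count(self, now: float) -> int:
        # Evict events that fell out of the window, then count the rest.
        while self.events and self.events[0] <= now - self.window:
            self.events.popleft()
        return len(self.events)

mfa_pushes = RollingCount(window_seconds=600)  # 10-minute window
for ts in [0, 60, 120, 180, 240, 300, 360]:    # 7 pushes in 6 minutes
    mfa_pushes.add(ts)

mfa_fatigue = mfa_pushes.count(now=400) >= 5   # fatigue pattern detected
```

A runtime agent that fetches only the latest MFA event never sees the count; the window is the signal.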
2. Temporal Joins
The order of events matters:
MFA fatigue → successful login
successful login → cloud access key creation
successful login → C2 egress
initial host login → second host login
second host login → SMB/RDP activity
risky login → sensitive data access
A runtime agent must discover and join these sequences during inference. That is fragile.
DeltaStream performs the joins continuously and exposes the result.
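The core of such a join is an ordering check with a time bound. Here is a minimal sketch for one of the sequences above, "successful login → cloud access key creation"; the one-hour window and the function name are illustrative assumptions, not a DeltaStream API.

```python
# Hedged sketch of a temporal join: did a cloud access key get created
# shortly AFTER a risky login? The 1-hour window is an illustrative
# assumption; event ordering is what makes this a persistence signal.
from datetime import datetime, timedelta

def key_created_after_login(login_at: datetime,
                            key_created_at: datetime,
                            within: timedelta = timedelta(hours=1)) -> bool:
    """True only if key creation follows the login inside the window."""
    return login_at < key_created_at <= login_at + within

login = datetime(2026, 5, 8, 17, 0, 0)       # successful login
key_event = datetime(2026, 5, 8, 17, 12, 0)  # access key created 12 min later

persistence_signal = key_created_after_login(login, key_event)
```

The same check reversed (key created before the login) is benign, which is why order, not mere co-occurrence, carries the meaning.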
3. Threat-Intel Enrichment
Threat intelligence changes quickly.
In the benchmark, the raw runtime agent saw the IP reputation as “unknown” or stale. DeltaStream context included fresh threat intel classifying the IP as C2 and correlated it with multiple outbound connections.
That changed the correct response from “monitor” to “escalate.”
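A sketch of why freshness matters in the enrichment itself: the lookup below records both the verdict and its age, so a stale "unknown" is distinguishable from a current C2 classification. The feed contents, field names, and six-hour freshness bound are illustrative assumptions, not a real reputation source.

```python
# Hedged sketch of threat-intel enrichment with a freshness check.
# THREAT_FEED is an illustrative in-memory stand-in for a reputation
# feed; the 6-hour max age is an assumed policy, not a standard.
from datetime import datetime, timedelta, timezone

THREAT_FEED = {
    # ip -> (classification, last_updated)
    "185.199.110.153": ("C2_IP",
                        datetime(2026, 5, 8, 16, 45, tzinfo=timezone.utc)),
}

def enrich_ip(ip: str, now: datetime,
              max_age: timedelta = timedelta(hours=6)) -> dict:
    entry = THREAT_FEED.get(ip)
    if entry is None:
        return {"ip": ip, "classification": "UNKNOWN", "fresh": False}
    classification, updated = entry
    return {"ip": ip, "classification": classification,
            "fresh": now - updated <= max_age}

verdict = enrich_ip("185.199.110.153",
                    now=datetime(2026, 5, 8, 17, 0, tzinfo=timezone.utc))
```

An agent that cached an older "unknown" verdict would reach the wrong escalation decision even with the same code.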
4. Asset and Identity Context
The same alert has very different severity depending on the user and asset.
Is the user privileged?
Does the user have cloud admin permissions?
Can the user access customer data?
Is the device managed?
Is EDR active?
Is the asset business critical?
DeltaStream joins identity, asset, endpoint, and access context into the incident state.
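The join itself can be pictured as overlaying identity and asset records onto the alert. The in-memory tables below are illustrative stand-ins for an IdP, a CMDB, and EDR posture data; the field names are assumptions that echo the context row shown earlier.

```python
# Hedged sketch of joining identity and asset context onto one alert.
# IDENTITY and ASSETS are illustrative stand-ins for an IdP, CMDB,
# and EDR posture source; field names are assumptions.
IDENTITY = {"[email protected]": {"privileged": True, "cloud_admin": True}}
ASSETS = {"LAP-8831": {"managed": False, "edr_active": False, "critical": False}}

def incident_context(user: str, device: str) -> dict:
    """Fuse identity and asset posture into one alert-scoped dict."""
    ctx = {"user": user, "device": device}
    ctx.update(IDENTITY.get(user, {"privileged": False, "cloud_admin": False}))
    ctx.update(ASSETS.get(device,
                          {"managed": None, "edr_active": None, "critical": None}))
    return ctx

ctx = incident_context("[email protected]", "LAP-8831")
```

A privileged user on an unmanaged device with no EDR is a very different incident than the same alert on a hardened corporate laptop.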
5. Pattern Recognition
A single signal may be ambiguous. A pattern is not.
MFA approved: maybe benign.
MFA approved after 7 push attempts: suspicious.
MFA fatigue + impossible travel + C2 egress + new access key: high severity.
DeltaStream turns streams of events into recognized patterns the agent can trust.
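The escalation from single signal to pattern can be sketched as one rule over the fused booleans. The specific combination and the severity tiers below are illustrative assumptions that mirror the examples above, not DeltaStream's actual policy logic.

```python
# Hedged sketch of pattern recognition over fused context fields.
# The combination rules and severity tiers are illustrative, not
# DeltaStream's actual policy; field names mirror the context row.
def classify_severity(ctx: dict) -> str:
    takeover_pattern = (
        ctx.get("mfa_fatigue_pattern")
        and ctx.get("impossible_travel")
        and ctx.get("threat_intel_match")
    )
    persistence = (ctx.get("new_cloud_access_key_created")
                   or ctx.get("new_oauth_app_consent"))
    if takeover_pattern and persistence:
        return "HIGH"    # takeover pattern plus persistence attempt
    if takeover_pattern:
        return "MEDIUM"  # pattern without persistence yet
    return "LOW"         # isolated signals stay ambiguous

severity = classify_severity({
    "mfa_fatigue_pattern": True,
    "impossible_travel": True,
    "threat_intel_match": True,
    "new_cloud_access_key_created": True,
})
```

Each input field on its own is ambiguous; only the conjunction justifies a high-severity verdict, which is exactly why the fields must be computed before inference.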
DeltaStream’s Role
DeltaStream is the real-time context platform for AI agents.
For cybersecurity agents, DeltaStream continuously performs:
stream ingestion
schema normalization
event-time ordering
deduplication
stateful joins
rolling-window aggregations
threat-intel enrichment
identity-to-device correlation
asset criticality joins
policy evaluation
pattern recognition
materialized context serving
The agent receives one compact, fresh context row:
{
  "incident_classification": "LIKELY_ACCOUNT_TAKEOVER_WITH_PERSISTENCE_AND_LATERAL_MOVEMENT",
  "incident_severity": "HIGH",
  "mfa_fatigue_pattern": true,
  "impossible_travel": true,
  "threat_intel_type": "C2_IP",
  "new_cloud_access_key_created": true,
  "new_oauth_app_consent": true,
  "lateral_movement_detected": true,
  "sensitive_data_access_after_login": true,
  "similar_incident_last_30d": true,
  "recommended_actions": [
    "REVOKE_USER_SESSIONS",
    "DISABLE_NEW_ACCESS_KEY",
    "REVOKE_OAUTH_APP_CONSENT",
    "ISOLATE_LAP-8831_AND_SRV-FIN-22",
    "FORCE_PASSWORD_RESET",
    "BLOCK_C2_IP",
    "AUDIT_SENSITIVE_DATA_ACCESS",
    "ESCALATE_TO_INCIDENT_COMMANDER"
  ]
}
Now the model can do what it is good at: communicate the situation clearly and help the analyst act.
Bigger Model vs. Better Context
One of the most important findings from the benchmark is that a bigger model did not save the raw-runtime approach.
GPT-5.5 with incomplete raw data scored 0/10.
GPT-5.4-mini with DeltaStream context scored 10/10.
That matters for production.
Teams often assume that using a larger model will fix agent accuracy. But if the model does not receive the right state, it cannot reliably produce the right decision. It may produce a more polished answer, but not necessarily a correct one.
Better context changes the economics:
Raw runtime assembly:
larger prompts
more tool calls
more latency
higher cost
lower correctness
DeltaStream prebuilt context:
smaller prompts
fewer tool calls
lower latency
lower cost
higher correctness
smaller models become viable
This is how you productionize agents: do not ask the model to reconstruct the world at inference time. Give it the current truth.
The Real Lesson
Cybersecurity agents do not fail only because models hallucinate.
They fail because context is incomplete.
A raw login event is not enough. An MFA approval is not enough. A medium endpoint alert is not enough. An “unknown” IP reputation is not enough. An empty open-ticket list is not enough.
The agent needs fused, stateful, fresh context.
That context must be built before inference.
Final Takeaway
For cybersecurity AI agents, prebuilt fresh context is not optional.
It is required for correctness, latency, cost control, and operational safety.
When the answer depends on rolling windows, cross-source correlation, threat-intel enrichment, identity posture, endpoint state, cloud audit, lateral movement, incident history, and response policy, the agent should not build context at runtime.
DeltaStream should.
DeltaStream turns raw security telemetry into fresh, trusted, agent-ready context. That context makes agents more accurate, reduces token and tool-call cost, and can make smaller, cheaper models viable for production workflows.
If you are building a SOC copilot, incident-response agent, threat-hunting assistant, or security operations AI agent, DeltaStream can provide the fresh context layer your agents need to make accurate, safe, and timely decisions.