18 May 2026

9 Min Read

Productionizing Flight Disruption AI Agents: Why Fresh Context Must Be Prebuilt

Hojjat Jafarpour

Founder & CEO

Airline operations are one of the strongest examples of why AI agents need fresh, prebuilt context.

A Flight Disruption Manager Agent sounds simple on the surface: a flight is delayed, and the agent should recommend what to do next. But in real airline operations, that decision depends on a fast-changing web of operational state:

flight status and delay propagation
passenger itineraries and connection risk
airport minimum connection times
seat inventory and reaccommodation options
crew legality and reserve availability
aircraft tail assignment and maintenance restrictions
gate, slot, and airport constraints
baggage scan location and reroute feasibility
voucher eligibility and customer-care policy
loyalty tier, special service requests, and passenger priority rules

No single raw system has the full answer.

That is why runtime context assembly breaks down.

The right architecture is:

Flight ops + crew + aircraft + passengers + baggage + inventory + policy
        ↓
DeltaStream
        ↓
Fresh, stateful, prebuilt disruption context
        ↓
Flight Disruption Manager Agent
        ↓
Correct, timely, lower-cost operational decisions

DeltaStream continuously builds the context before the agent is called. The agent gets current operational truth, not raw operational fragments.

The Use Case: Flight Disruption Manager Agent

Consider a delayed DFW-LGA flight with passengers connecting onward to BOS. Some passengers will misconnect. A few seats are available on a DFW-BOS nonstop. Bags may or may not be reroutable. A downstream LGA-BOS connection could be held, but holding it might violate crew or ATC slot constraints. Another aircraft looks available but has a maintenance restriction. A separate cancelled flight may require hotel vouchers, but only if the disruption is airline-controllable.

A production agent must answer questions like:

Should we proactively rebook this passenger?
Should we hold the connection?
Can we safely swap aircraft?
Is the crew still legal?
Should we issue hotel vouchers?
Can the checked bag be rerouted?
Which passengers should get the last few seats?
Is this weather or airline-controllable?
What is the best integrated next action?

Those answers require more than retrieval. They require stateful, policy-aware, real-time context.

Why Runtime Raw-Data Assembly Fails

In a runtime-fetch architecture, the agent calls a few tools:

get_flight_status()
get_passenger_itinerary()
get_seat_inventory()
get_crew_schedule()
get_aircraft_status()
get_bag_status()
get_voucher_policy()

But the correct answer often depends on the data the agent did not fetch:

alternate origin/destination reaccommodation inventory
crew duty projection after the new arrival time
active MEL restrictions and mission compatibility
ATC slot risk if a connection is held
bag runner availability and load-control acceptance
passenger special-service priority rules
final controllability classification after delay-code override
integrated next-best action across passengers, bags, aircraft, crew, and policy

A large model can reason well over the data it sees. But if the state was not fetched or computed, the model cannot reliably make the correct operational decision.

This is the key lesson: the bottleneck is not the model. The bottleneck is context.

What DeltaStream Builds

DeltaStream continuously turns raw airline operations events into fresh, agent-ready context.

Example context:

{
  "case_id": "FD-AA2197-PAX1001",
  "flight": "AA2197",
  "passenger_id": "PAX-1001",
  "estimated_arrival_lga": "2026-05-08T19:45:00-04:00",
  "connection_departure": "2026-05-08T20:05:00-04:00",
  "connection_buffer_minutes": 20,
  "minimum_connection_minutes": 35,
  "will_misconnect": true,
  "last_same_day_lga_bos_missed": true,
  "alternate_option": "AA1802 DFW-BOS nonstop",
  "alternate_seats_available": 3,
  "bag_reroute_feasible": true,
  "rebook_decision": "PROACTIVELY_REBOOK"
}

This is not a summary. It is live operational context computed from multiple systems.
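As a rough illustration of how two of the fields in the example row relate, the sketch below derives connection_buffer_minutes and will_misconnect from the timestamps shown above. The function names are hypothetical, and the rule (misconnect when the buffer is shorter than the minimum connection time) is an assumption that matches the example values, not a statement of how DeltaStream computes them.

```python
from datetime import datetime

def connection_buffer_minutes(estimated_arrival: str, connection_departure: str) -> int:
    """Minutes between projected arrival and onward departure (ISO 8601 inputs)."""
    arrival = datetime.fromisoformat(estimated_arrival)
    departure = datetime.fromisoformat(connection_departure)
    return int((departure - arrival).total_seconds() // 60)

def will_misconnect(buffer_minutes: int, minimum_connection_minutes: int) -> bool:
    """Assumed rule: the connection fails when the buffer is below the airport minimum."""
    return buffer_minutes < minimum_connection_minutes

# Values taken from the example context row above.
buffer = connection_buffer_minutes("2026-05-08T19:45:00-04:00", "2026-05-08T20:05:00-04:00")
misconnect = will_misconnect(buffer, minimum_connection_minutes=35)
print(buffer, misconnect)  # 20 True
```

A real pipeline would also account for gate arrival time and boarding cutoff, which this sketch leaves out.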

DeltaStream builds context such as:

passenger disruption context
misconnect risk context
reaccommodation option context
crew legality context
aircraft swap safety context
bag reroute feasibility context
voucher eligibility context
connection hold decision context
passenger priority scoring context
delay controllability context
integrated next-best-action context

The agent receives the state it needs and explains the action.

The Benchmark

We ran a benchmark with 10 realistic flight disruption questions. Each question was evaluated in two modes:

Mode 1: Runtime raw-data assembly
The model receives limited raw tool results and must infer the operational answer.

Mode 2: DeltaStream prebuilt context
The model receives one fresh, stateful context row computed by DeltaStream.

We tested both a large model and a smaller, cheaper model:

Large model: GPT-5.5
Small model: GPT-5.4-mini
Judge model: GPT-5.5

The benchmark results show a clear pattern: DeltaStream prebuilt context produced 100% correctness for both models, while raw runtime assembly failed on 8 of 10 questions with GPT-5.5 and on all 10 with GPT-5.4-mini. The benchmark output also captured exact token usage, tool calls, model outputs, and GPT-5.5 judge decisions.

Benchmark Summary

Model	Approach	Correct Answers	Accuracy	Tool Calls	Total Tokens	Avg. Tokens / Question
GPT-5.5	Runtime raw-data assembly	2 / 10	20%	37	9,326	933
GPT-5.5	DeltaStream prebuilt context	10 / 10	100%	10	4,547	455
GPT-5.4-mini	Runtime raw-data assembly	0 / 10	0%	37	6,608	661
GPT-5.4-mini	DeltaStream prebuilt context	10 / 10	100%	10	4,194	419

The large model improved from 20% to 100% correctness with DeltaStream context. The small model improved from 0% to 100% correctness.

DeltaStream also reduced tool calls from 37 to 10, a 73% reduction, for both models.

Detailed Benchmark Results

#	Flight Ops Question	GPT-5.5 Raw Runtime	GPT-5.5 + DeltaStream	GPT-5.4-mini Raw Runtime	GPT-5.4-mini + DeltaStream
1	Should PAX-1001 be proactively rebooked from delayed AA2197?
2	Should N732AA be swapped onto AA441 ORD-SFO?
3	Can crew C-883 legally operate AA1289 after delay?
4	Should hotel vouchers be issued for AA771 PHX-SEA?
5	Should bag B-7782 be rerouted after rebooking?
6	Should AA4321 LGA-BOS be held for misconnects?
7	Who gets the three remaining AA1802 seats?
8	Confirmed same-day rebooking or standby for PAX-2040?
9	Is AA610 weather-related or airline-controllable?
10	What is the integrated next action for AA2197?

The raw-runtime agent was not making random mistakes. It often gave reasonable, cautious answers based on incomplete tool results.

That is exactly the problem.

In airline operations, a cautious but incomplete answer can still be operationally wrong.

Cost and Tool-Call Comparison

Using OpenAI’s listed standard short-context API pricing, GPT-5.5 is priced at $2.50 per 1M input tokens and $15 per 1M output tokens, while GPT-5.4-mini is priced at $0.375 per 1M input tokens and $2.25 per 1M output tokens. (OpenAI Developers)

Model	Approach	Input Tokens	Output Tokens	Estimated Token Cost
GPT-5.5	Runtime raw-data assembly	4,505	4,821	$0.0836
GPT-5.5	DeltaStream prebuilt context	3,169	1,378	$0.0286
GPT-5.4-mini	Runtime raw-data assembly	4,505	2,103	$0.0064
GPT-5.4-mini	DeltaStream prebuilt context	3,169	1,025	$0.0035

For GPT-5.5, DeltaStream reduced estimated token cost by about 66% in this benchmark.

For GPT-5.4-mini, DeltaStream reduced estimated token cost by about 46%.
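The estimated costs in the table can be reproduced from the quoted per-token prices and the token counts. The sketch below is only that arithmetic; the dictionary layout and function names are illustrative choices, not part of the benchmark harness.

```python
# Prices in USD per 1M tokens, as quoted in the text above.
PRICES = {
    "GPT-5.5": {"input": 2.50, "output": 15.00},
    "GPT-5.4-mini": {"input": 0.375, "output": 2.25},
}

def token_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimated cost in USD for one benchmark run."""
    price = PRICES[model]
    return (input_tokens * price["input"] + output_tokens * price["output"]) / 1_000_000

def reduction_pct(before: float, after: float) -> float:
    """Relative reduction, in percent."""
    return 100 * (before - after) / before

raw_large = token_cost("GPT-5.5", 4505, 4821)
ctx_large = token_cost("GPT-5.5", 3169, 1378)
raw_small = token_cost("GPT-5.4-mini", 4505, 2103)
ctx_small = token_cost("GPT-5.4-mini", 3169, 1025)

print(f"{raw_large:.4f} {ctx_large:.4f} {reduction_pct(raw_large, ctx_large):.0f}%")
print(f"{raw_small:.4f} {ctx_small:.4f} {reduction_pct(raw_small, ctx_small):.0f}%")
```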

But the more important comparison is this:

Comparison	Correctness	Tool Calls	Total Tokens	Estimated Token Cost
GPT-5.5 + raw runtime assembly	2 / 10	37	9,326	$0.0836
GPT-5.4-mini + DeltaStream context	10 / 10	10	4,194	$0.0035

In this benchmark, the smaller model with DeltaStream context was more accurate than the larger model with raw runtime assembly, used 55% fewer tokens, required 73% fewer tool calls, and had an estimated token cost about 96% lower.
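The three percentages in that sentence follow directly from the comparison table. A minimal check, using the rounded dollar figures as printed in the table:

```python
def percent_reduction(before: float, after: float) -> float:
    """How much smaller `after` is than `before`, in percent."""
    return 100 * (before - after) / before

# GPT-5.5 + raw runtime assembly vs. GPT-5.4-mini + DeltaStream context.
tokens = percent_reduction(9326, 4194)
tool_calls = percent_reduction(37, 10)
cost = percent_reduction(0.0836, 0.0035)

print(round(tokens), round(tool_calls), round(cost))  # 55 73 96
```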

That is the production point.

Better context can matter more than a bigger model.

Why the Raw Runtime Agent Failed

The raw-runtime agent often lacked the stateful facts required to make the correct decision.

For example:

Misconnect recovery

The raw GPT-5.5 agent correctly saw that PAX-1001 would likely misconnect at LGA. But because it only had LGA-BOS inventory and did not have alternate DFW-BOS recovery context, it recommended not proactively rebooking yet.

DeltaStream had already computed:

will_misconnect = true
last_same_day_lga_bos_missed = true
alternate_option = AA1802 DFW-BOS nonstop
alternate_seats_available = 3
bag_reroute_feasible = true
rebook_decision = PROACTIVELY_REBOOK

The DeltaStream-context agent gave the correct answer: rebook proactively.

Crew legality

The raw agent saw the delayed departure and assigned crew schedule, but did not have projected duty minutes or the applicable legal duty limit. It said legality could not be confirmed.

DeltaStream had already computed:

projected_duty_minutes = 1003
legal_duty_limit_minutes = 960
crew_legal = false
reserve_crew_available = true

The DeltaStream-context agent correctly recommended assigning the LAX reserve crew.
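Once the projected duty minutes and the legal limit are available, the decision itself is a simple comparison. The sketch below assumes a single duty-limit check and uses hypothetical action labels; real crew legality also involves required rest and contractual rules that are not modeled here.

```python
from dataclasses import dataclass

@dataclass
class CrewDutyContext:
    projected_duty_minutes: int
    legal_duty_limit_minutes: int
    reserve_crew_available: bool

def evaluate_crew(ctx: CrewDutyContext) -> dict:
    """Compare projected duty against the limit and pick a (hypothetical) action."""
    crew_legal = ctx.projected_duty_minutes <= ctx.legal_duty_limit_minutes
    if crew_legal:
        action = "KEEP_ASSIGNED_CREW"
    elif ctx.reserve_crew_available:
        action = "ASSIGN_RESERVE_CREW"
    else:
        action = "ESCALATE_TO_CREW_SCHEDULING"
    return {
        "crew_legal": crew_legal,
        "minutes_over_limit": max(0, ctx.projected_duty_minutes - ctx.legal_duty_limit_minutes),
        "action": action,
    }

# Values from the context shown above.
result = evaluate_crew(CrewDutyContext(1003, 960, reserve_crew_available=True))
print(result)
```

With the values above, the crew is 43 minutes over the limit, so the sketch selects the reserve-crew action.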

Connection hold decision

The raw agent suggested a conditional short hold for AA4321. That sounds reasonable, but it was wrong. The required hold was 28 minutes, the maximum allowed hold was 15 minutes, and holding would create downstream crew and ATC slot risk.

DeltaStream had already computed:

required_hold_minutes = 28
max_allowed_hold_minutes = 15
downstream_crew_legality_risk_if_hold = true
atc_slot_loss_risk = true
alternate_reaccommodation_available = true
hold_decision = DO_NOT_HOLD

The DeltaStream-context agent correctly recommended departing AA4321 on time and reaccommodating the misconnecting passengers.
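The hold rule implied by those fields can be sketched as follows. The ordering of the checks and the "HOLD" label are assumptions made for illustration; only DO_NOT_HOLD appears in the benchmark context.

```python
def hold_decision(required_hold_minutes: int,
                  max_allowed_hold_minutes: int,
                  downstream_crew_legality_risk_if_hold: bool,
                  atc_slot_loss_risk: bool) -> str:
    """Return a hold decision from precomputed context fields."""
    # A hold longer than policy allows is rejected outright.
    if required_hold_minutes > max_allowed_hold_minutes:
        return "DO_NOT_HOLD"
    # Even a short hold is rejected if it endangers crew legality or the ATC slot.
    if downstream_crew_legality_risk_if_hold or atc_slot_loss_risk:
        return "DO_NOT_HOLD"
    return "HOLD"  # hypothetical label for the permissive case

# Values from the AA4321 context shown above.
decision = hold_decision(28, 15, True, True)
print(decision)  # DO_NOT_HOLD
```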

Passenger priority scoring

The raw agent prioritized passengers by loyalty tier alone. That is a common but incomplete heuristic.

DeltaStream had already computed policy-aware disruption priority:

P3: unaccompanied minor policy priority
P5: EXP plus missed international long-haul
P1: EXP plus no same-day alternative

The DeltaStream-context agent correctly assigned the three remaining seats to the top three passengers by disruption priority score, not by loyalty alone.

Controllability classification

The raw agent saw a WX delay code and nearby thunderstorms, so it marked AA610 as weather-related and not airline-controllable.

DeltaStream had already fused the delay history and aircraft rotation chain:

initial_delay_code = WX
final_ops_delay_driver = LATE_INBOUND_AIRCRAFT_FROM_MAINTENANCE_RECOVERY
maintenance_recovery_chain = true
controllable_disruption = true

The DeltaStream-context agent correctly applied controllable-delay customer-care rules.
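The override logic can be sketched as: the final operational driver, when present, takes precedence over the initial delay code. The set of controllable drivers below is a hypothetical stand-in for an airline's actual policy table.

```python
from typing import Optional

# Hypothetical policy table: drivers the airline treats as controllable.
CONTROLLABLE_DRIVERS = {"LATE_INBOUND_AIRCRAFT_FROM_MAINTENANCE_RECOVERY"}

def is_controllable(initial_delay_code: str,
                    final_ops_delay_driver: Optional[str] = None) -> bool:
    """Classify using the final driver when one exists, else the initial code."""
    effective_driver = final_ops_delay_driver or initial_delay_code
    return effective_driver in CONTROLLABLE_DRIVERS

# What the raw agent saw: only the initial WX code.
raw_view = is_controllable("WX")
# What the fused context contains for AA610.
fused_view = is_controllable("WX", "LATE_INBOUND_AIRCRAFT_FROM_MAINTENANCE_RECOVERY")
print(raw_view, fused_view)  # False True
```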

What Makes Flight Disruption Context Hard?

The hard part is not fetching a flight status.

The hard part is computing the state that no single airline system directly stores.

1. Misconnect and reaccommodation context

A delay does not automatically mean a passenger should be rebooked. The decision depends on:

estimated arrival
gate arrival time
minimum connection time
boarding cutoff
last same-day option
alternate routing inventory
passenger priority
bag reroute feasibility

DeltaStream computes this continuously.

2. Crew legality

Crew legality is not just assigned crew plus flight delay. It requires:

duty start
projected duty end
flight time
delays
contractual and regulatory limits
reserve availability
required rest

This is stateful, time-sensitive compute.

3. Aircraft swap safety

An aircraft can look available but still be unusable.

The context must include:

tail availability
aircraft type
seat compatibility
MEL/CDL restrictions
mission profile
maintenance-control approval
gate and tow timing
downstream rotation impact

The model should not discover this at inference time.
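One way to picture a precomputed swap-safety check is a function that returns the list of reasons a tail cannot fly a mission. Every field name and the KNOWN_ICING capability below are hypothetical, chosen only to show the shape of the check.

```python
def swap_blockers(tail: dict, mission: dict) -> list:
    """Return reasons the tail cannot operate the mission (empty list means no blocker found)."""
    blockers = []
    if not tail["available"]:
        blockers.append("tail_not_available")
    if tail["seats"] < mission["booked_passengers"]:
        blockers.append("insufficient_seats")
    # Capabilities the mission needs that an open MEL item currently restricts.
    restricted = mission["required_capabilities"] & tail["mel_restricted_capabilities"]
    if restricted:
        blockers.append("mel_restriction:" + ",".join(sorted(restricted)))
    if not tail["maintenance_control_approved"]:
        blockers.append("no_maintenance_control_approval")
    return blockers

# Hypothetical tail that looks available but carries an MEL restriction.
tail = {"available": True, "seats": 172,
        "mel_restricted_capabilities": {"KNOWN_ICING"},
        "maintenance_control_approved": True}
mission = {"booked_passengers": 160, "required_capabilities": {"KNOWN_ICING"}}
blockers = swap_blockers(tail, mission)
print(blockers)  # ['mel_restriction:KNOWN_ICING']
```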

4. Baggage reaccommodation

A passenger rebooking decision is incomplete if the bag cannot move.

Bag context requires:

last scan location
current airport
sortation state
bag cutoff
runner availability
load-control acceptance
new flight compatibility

This changes minute by minute.
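A bag reroute check might combine those inputs roughly as below. The field names, the minute-based clock, and the all-conditions-must-hold rule are assumptions for illustration, not the benchmark's actual logic.

```python
def bag_reroute_feasible(bag: dict, new_flight: dict, now_minute: int) -> bool:
    """True only if location, time, runner, and load control all allow the transfer."""
    same_airport = bag["last_scan_airport"] == new_flight["origin"]
    minutes_to_cutoff = new_flight["bag_cutoff_minute"] - now_minute
    enough_time = minutes_to_cutoff >= bag["transfer_minutes_needed"]
    return bool(same_airport and enough_time
                and bag["runner_available"]
                and new_flight["load_control_accepts"])

# Hypothetical bag sitting at DFW, rerouted onto a DFW departure.
bag = {"last_scan_airport": "DFW", "transfer_minutes_needed": 25, "runner_available": True}
new_flight = {"origin": "DFW", "bag_cutoff_minute": 17 * 60, "load_control_accepts": True}

feasible_now = bag_reroute_feasible(bag, new_flight, now_minute=16 * 60 + 20)   # 40 minutes left
feasible_late = bag_reroute_feasible(bag, new_flight, now_minute=16 * 60 + 45)  # 15 minutes left
print(feasible_now, feasible_late)  # True False
```

The second call shows why the result has to be recomputed continuously: nothing changed except the clock.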

5. Passenger priority

During disruptions, passengers are not ranked by loyalty alone.

A realistic priority score may include:

unaccompanied minor status
medical or accessibility needs
loyalty tier
cabin
missed international connection
last same-day option
party split policy
customer value
regulatory or service obligations

DeltaStream can precompute that score and expose it to the agent.
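To make the idea concrete, here is a minimal scoring sketch. The weights are invented for illustration, and the attributes of P2 and P4 are hypothetical; only the descriptions of P3, P5, and P1 come from the benchmark.

```python
# Hypothetical weights; a real policy would be defined by the airline.
WEIGHTS = {
    "unaccompanied_minor": 100,
    "missed_international_long_haul": 40,
    "no_same_day_alternative": 30,
}
TIER_POINTS = {"EXP": 20}  # other tiers score 0 in this sketch

def priority_score(pax: dict) -> int:
    """Sum tier points and the weight of every disruption flag that is set."""
    score = TIER_POINTS.get(pax.get("tier"), 0)
    score += sum(weight for flag, weight in WEIGHTS.items() if pax.get(flag))
    return score

passengers = [
    {"id": "P1", "tier": "EXP", "no_same_day_alternative": True},
    {"id": "P2", "tier": "EXP"},                      # hypothetical: loyalty only
    {"id": "P3", "tier": None, "unaccompanied_minor": True},
    {"id": "P4", "tier": None},                       # hypothetical: no flags
    {"id": "P5", "tier": "EXP", "missed_international_long_haul": True},
]

top_three = [p["id"] for p in sorted(passengers, key=priority_score, reverse=True)[:3]]
print(top_three)  # ['P3', 'P5', 'P1']
```

Ranking by tier alone would leave P1, P2, and P5 tied and would place P3 last, which is the failure mode described above.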

6. Integrated next-best action

The final operational decision often combines everything:

rebook top impacted passengers
do not hold the connection
reroute feasible bags
do not use unsafe aircraft swap
monitor crew legality
do not issue vouchers yet

This is exactly the kind of context that should be continuously computed and served to the agent.

Why Smaller Models Become Viable

One of the most important findings from the benchmark is that GPT-5.4-mini with DeltaStream context outperformed GPT-5.5 with raw runtime assembly.

This matters for production.

Teams often assume that using a larger model will fix agent accuracy. But if the model receives incomplete context, a larger model may simply produce a more polished wrong answer.

DeltaStream changes the model’s job.

Without DeltaStream, the model must act like:

flight operations analyst
crew legality engine
aircraft maintenance validator
baggage operations coordinator
passenger priority scorer
customer-care policy engine

With DeltaStream, the model only needs to explain the already-computed operational decision.

That means smaller, cheaper models can become viable for many production workflows.

The benchmark showed exactly that:

GPT-5.5 + raw runtime assembly:
2 / 10 correct
37 tool calls
9,326 tokens

GPT-5.4-mini + DeltaStream context:
10 / 10 correct
10 tool calls
4,194 tokens

Better context beat a bigger model.

Why DeltaStream Is the Right Platform

Airline disruption management is a real-time data problem before it is an AI problem.

DeltaStream continuously performs:

stream ingestion
schema normalization
event-time ordering
stateful joins
rolling-window aggregations
passenger-to-flight-to-bag correlation
aircraft-to-maintenance validation
crew legality computation
inventory and reaccommodation scoring
policy evaluation
materialized context serving

That is the work agents should not do at inference time.
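The core pattern behind several of those steps is a stateful join: keep the latest state per key and recompute the context row whenever any input changes. The sketch below shows that pattern in plain Python. It is not DeltaStream syntax or its API; the class, method names, and minutes-since-midnight clock are all hypothetical.

```python
from collections import defaultdict

class DisruptionContextBuilder:
    """Keeps the latest state per flight and recomputes context rows on every event."""

    def __init__(self):
        self.flight_eta = {}                  # flight -> latest ETA (minutes since midnight)
        self.connections = defaultdict(dict)  # flight -> {passenger_id: (departure, mct)}
        self.context = {}                     # (flight, passenger_id) -> context row

    def on_flight_update(self, flight, eta_minute):
        self.flight_eta[flight] = eta_minute
        for passenger_id in self.connections[flight]:
            self._recompute(flight, passenger_id)

    def on_itinerary(self, flight, passenger_id, departure_minute, mct_minutes):
        self.connections[flight][passenger_id] = (departure_minute, mct_minutes)
        self._recompute(flight, passenger_id)

    def _recompute(self, flight, passenger_id):
        eta = self.flight_eta.get(flight)
        if eta is None:
            return  # no flight state yet, so nothing to join against
        departure_minute, mct_minutes = self.connections[flight][passenger_id]
        buffer_minutes = departure_minute - eta
        self.context[(flight, passenger_id)] = {
            "connection_buffer_minutes": buffer_minutes,
            "will_misconnect": buffer_minutes < mct_minutes,
        }

builder = DisruptionContextBuilder()
builder.on_itinerary("AA2197", "PAX-1001", departure_minute=20 * 60 + 5, mct_minutes=35)
builder.on_flight_update("AA2197", eta_minute=19 * 60 + 15)  # hypothetical earlier estimate
before = dict(builder.context[("AA2197", "PAX-1001")])
builder.on_flight_update("AA2197", eta_minute=19 * 60 + 45)  # the delay grows
after = builder.context[("AA2197", "PAX-1001")]
print(before, after)
```

The agent never runs this logic. It reads the already-materialized row for its case, which is why a single lookup can replace several raw tool calls.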

The agent should answer from trusted, fresh context.

Final Takeaway

For Flight Disruption Manager Agents, fresh context is not optional.

When the answer depends on passengers, flights, crews, aircraft, bags, inventory, airport constraints, weather, and policy, the agent should not assemble context from raw data at runtime.

DeltaStream should.

DeltaStream turns raw airline operations streams into fresh, stateful, decision-ready context. That context improves correctness, reduces token and tool-call cost, and can make smaller, cheaper models viable for production workflows.

If you are building airline operations agents, customer reaccommodation agents, baggage recovery assistants, crew recovery copilots, or airport operations agents, DeltaStream can provide the fresh context layer your agents need to operate correctly and cost-effectively.
