A spot purchase can clear in under five minutes. A competitive auction can stay open for five business days, and an award can wait ninety-six hours on a human approver. The system has to be fast where speed matters and patient where it doesn't — and across that entire range, a process will die, a Runtime will recycle, a deploy will roll. This post is about the property that holds all of it together: durability. How a negotiation survives crashes, restarts, and long deliberate pauses without ever losing state, redoing work, or quietly corrupting itself.
Most of what we've written about Buyer Team treats a negotiation as a thing that happens: agents classify, suppliers bid, a winner is picked. That framing hides the hard part. A negotiation isn't an event; it's a process whose duration is set by the deal, not by the system: a spot buy finishes in minutes, a strategic partnership negotiates over weeks. For almost all of a long negotiation's lifetime, there is no agent running and no compute burning; the system is waiting for suppliers to respond before a deadline, for an auction round to converge, for a human to click approve. The durability question isn't "how do we make this fast?" (it already is, where the deal allows). It's "how do we keep a multi-step workflow coherent across idle time we don't control, and across the failures that idle time guarantees you'll hit?"
Get that right and duration stops being a liability. A negotiation that can be killed and restarted at any node, paused for four days and resumed in five seconds, and carried across a deploy without a hiccup is more agile, not less — because none of those events forces a restart from zero, and none of them is something an operator has to babysit. Durability is what lets the system be patient without being fragile.
We've gestured at the pieces in other posts — DynamoDB checkpoints here, a 600-second recovery lock there, "resumes from checkpoint, no duplicate bids" in a test matrix. This post owns the whole story end to end: what the state machine actually persists, how resume reconstructs a negotiation, why every node has to be idempotent for that to be safe, and how we pause a deterministic DAG for four days without holding a process open. The governing constraint behind all of it is a number from the timeout table: a negotiation can stay non-terminal for up to 168 hours (REQ-R351). Everything below exists to make those seven days durable — and to keep the sub-five-minute path (REQ-G403) just as crash-safe as the seven-day one.
The first decision that makes durability tractable is one we made before any of the recovery machinery: the orchestration layer is a deterministic seven-node DAG, not an autonomous planner that decides its own next step. We've argued the case for that choice on auditability and governance grounds elsewhere. The reliability argument is just as load-bearing, and it's the one most relevant here: failure recovery requires known checkpoints, not full workflow restarts.
If the next action is whatever an LLM decides in the moment, then "where was I?" has no deterministic answer — you'd have to replay the model's reasoning to find out, and the model isn't guaranteed to reason the same way twice. With a fixed DAG, the question collapses to a lookup: which node completed last? The negotiation's status is its position in the graph. EVALUATING means Node 5 is running; PENDING_APPROVAL means we're parked at the Node 6 interrupt. Recovery becomes "read the status, resume the next node," and that only works because the graph topology is known ahead of time.
The DAG has exactly one branch (a four-way split at the Strategy Router, Node 3, where a negotiation is routed to Spot, Leverage, Bottleneck, or Strategic) and exactly one governed loop (a single permitted cycle-back from Node 6 to a Node 4 variant when a human rejects an award but asks for another round). Bounded branching and a max-one retry loop mean the set of reachable states is small and enumerable — which is precisely what lets us reason about recovery exhaustively rather than hopefully.
The rule is simple and absolute: checkpoint after every node completion (REQ-G002). Not mid-node, not opportunistically — after each node produces its output and before the graph advances. The checkpoint is written to the {env}-checkpoints DynamoDB table under a composite key of (tenant_id#negotiation_id, checkpoint#version), with a 30-day TTL and optimistic concurrency control on the write (REQ-G301, REQ-R500, REQ-R503).
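A minimal sketch of that write, assuming a plain dict stands in for the DynamoDB table and an existence check stands in for the conditional write. The function names, the `VersionConflict` exception, and the row layout are illustrative, not the production schema; only the key shape and the 30-day TTL come from the text above.

```python
import time
from typing import Optional, Tuple

THIRTY_DAYS_S = 30 * 24 * 3600  # 30-day checkpoint TTL, from the text


class VersionConflict(Exception):
    """Raised when a checkpoint with this version was already written."""


def checkpoint_key(tenant_id: str, negotiation_id: str, version: int) -> Tuple[str, str]:
    # Composite key shape: (tenant_id#negotiation_id, checkpoint#version).
    return (f"{tenant_id}#{negotiation_id}", f"checkpoint#{version}")


def write_checkpoint(table: dict, tenant_id: str, negotiation_id: str,
                     version: int, snapshot: dict,
                     now: Optional[float] = None) -> None:
    """Write one checkpoint, refusing to overwrite an existing version.

    `table` is a dict standing in for the checkpoints table (assumption).
    The existence check plays the role of optimistic concurrency control:
    a second writer with the same version loses instead of overwriting.
    """
    now = time.time() if now is None else now
    key = checkpoint_key(tenant_id, negotiation_id, version)
    if key in table:
        raise VersionConflict(f"version {version} already written for {key[0]}")
    table[key] = {
        "snapshot": snapshot,
        "completed_at": now,
        "ttl": int(now) + THIRTY_DAYS_S,  # cleanup only, not correctness
    }
```

In this sketch a losing writer gets an exception rather than a silent overwrite, which is the property the optimistic concurrency check exists to provide.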
Each node persists the entities it produced, so the cumulative checkpoint trail is a complete reconstruction of the negotiation's progress:
The checkpoint carries enough to rebuild execution context without re-running anything: negotiation_id, tenant_id, step_name, step_status, a serialized state-machine snapshot, completed_at, and a monotonic version. There's also a coarser source of truth running alongside it — the Step Functions execution history — which the recovery flow reads in addition to the latest checkpoint. The checkpoint tells us the negotiation-domain state; the execution history tells us where in the state machine we were. Recovery uses both (REQ-R501).
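As a rough illustration, the checkpoint fields listed above map onto a small record type, and "the latest checkpoint" is simply the one with the highest version. The field types and the `latest_checkpoint` helper are assumptions made for this sketch.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Checkpoint:
    negotiation_id: str
    tenant_id: str
    step_name: str
    step_status: str
    snapshot: str      # serialized state-machine snapshot (JSON string assumed)
    completed_at: str  # timestamp; ISO-8601 string assumed
    version: int       # monotonic, increases with every node completion


def latest_checkpoint(checkpoints: Iterable[Checkpoint]) -> Optional[Checkpoint]:
    """Return the checkpoint with the highest version, or None if there are none."""
    return max(checkpoints, key=lambda c: c.version, default=None)
```

Because the version is monotonic, picking the maximum is enough; no timestamp comparison is needed to find where the negotiation-domain state left off.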
A checkpoint only earns its keep if a crash between two checkpoints loses a bounded, affordable amount of work. The expensive nodes here are the strategy agents at Node 4x — they run multi-round supplier outreach, TCO and risk computation, sometimes over days. Checkpointing only at the end would mean a crash during Node 4x throws away the entire round of supplier communication and forces a re-run, which isn't just slow — re-sending auction invitations or proposals has real-world side effects. The per-node boundary caps the blast radius of any single crash at one node's worth of work, and the idempotency layer (next section) shrinks even that to near-zero.
Here's the subtlety that makes "resume from checkpoint" harder than it sounds. When a new instance picks up a negotiation, it knows the last node that completed. What it doesn't know is how far the next node got before the crash. Did Node 4x send three of five auction invitations and then die? Did Node 7 write the purchase order but crash before logging the communication? Resume can't assume the next node is a clean slate. It has to assume the next node may have already done some — or all — of its work.
So the resume contract isn't "start the next node fresh." It's stronger: every node must be idempotent and detect its own previously-completed work (REQ-G005). Each node carries a dedup key that lets it recognise a side effect it already performed:
Node 1 → (tenant_id, category_id, hash(items), deadline)
Node 2 → semantic cache key
Node 4x → (negotiation_id, supplier_id, action, round_number)
Node 5 → existing evaluation_score on the Bid
Node 7 → existing CommunicationLog entry
The Node 4x key is the important one, because Node 4x is where the side effects live — every message to a supplier is an irreversible act. Before any side-effecting tool call, the agent computes that deterministic idempotency key and checks the {env}-idempotency table (7-day TTL). On a hit, it returns the cached result instead of re-executing; on a miss, it executes and writes the key (REQ-R200–R202). A new instance resuming a half-finished auction round therefore re-walks the round, but each invitation it tries to re-send hits a cached key and becomes a no-op. The supplier never sees a duplicate.
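The check-then-execute pattern can be sketched in a few lines. This is an illustration under stated assumptions: a dict stands in for the idempotency table, the key is a SHA-256 over the Node 4x tuple (the real hashing scheme isn't specified here), and `run_once` is a hypothetical helper name.

```python
import hashlib
import json


def idempotency_key(negotiation_id: str, supplier_id: str,
                    action: str, round_number: int) -> str:
    """Deterministic key built from the Node 4x dedup tuple."""
    raw = json.dumps([negotiation_id, supplier_id, action, round_number])
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def run_once(table: dict, key: str, side_effect):
    """Run side_effect only if key is unseen; otherwise return the cached result.

    `table` is a dict standing in for the idempotency table (assumption).
    """
    if key in table:
        return table[key]   # hit: no re-execution, the supplier sees nothing new
    result = side_effect()  # miss: perform the irreversible act
    table[key] = result     # record it so a later resume becomes a no-op
    return result
```

The sketch ignores the window between performing the side effect and recording the key; a real implementation has to decide how to handle a crash inside that window, for example by reserving the key before the call.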
The Leverage Auction agent makes this concrete with two layers of round-level idempotency. Before sending an invitation, it checks the CommunicationLog for an existing AUCTION_INVITATION on (negotiation_id, supplier_id, type) — found means skip. Before executing round N, it checks for an existing AUCTION_ROUND_FEEDBACK entry on (negotiation_id, round_number, type) — found means that round already ran, skip to the next. A crash three rounds into a five-round auction resumes at round four, not round one, and re-sends nothing.
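Here is one way those two checks could look, assuming the CommunicationLog is available as a list of dicts with `negotiation_id`, `type`, and either `supplier_id` or `round_number`. The entry shape and both function names are hypothetical; only the two entry types come from the text.

```python
def suppliers_needing_invitation(comm_log, negotiation_id, supplier_ids):
    """Return the suppliers with no AUCTION_INVITATION logged yet."""
    invited = {
        e["supplier_id"] for e in comm_log
        if e["negotiation_id"] == negotiation_id
        and e["type"] == "AUCTION_INVITATION"
    }
    return [s for s in supplier_ids if s not in invited]


def next_round_to_run(comm_log, negotiation_id, total_rounds):
    """Return the first round without AUCTION_ROUND_FEEDBACK, or None if all ran."""
    finished = {
        e["round_number"] for e in comm_log
        if e["negotiation_id"] == negotiation_id
        and e["type"] == "AUCTION_ROUND_FEEDBACK"
    }
    for round_number in range(1, total_rounds + 1):
        if round_number not in finished:
            return round_number
    return None
```

With feedback logged for rounds one to three of a five-round auction, `next_round_to_run` returns 4, which matches the resume behaviour described above.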
The one governed loop in the graph — Node 6 rejecting an award back to Node 4x — is itself a resume hazard. When the graph cycles back, it first marks every Bid from the rejected round SUPERSEDED so stale bids don't leak into the next evaluation. That marking has to be idempotent as well: if recovery interrupts the cycle-back, re-marking an already-SUPERSEDED bid is a no-op, and Node 5 filters SUPERSEDED bids unconditionally. The retry loop and the recovery loop have to compose cleanly, because in production they will eventually overlap.
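A small sketch of why re-marking is harmless, assuming each Bid is a dict with `round_number` and `status` fields (an illustrative shape, not the real entity):

```python
def supersede_round(bids, rejected_round):
    """Mark every bid from the rejected round SUPERSEDED.

    Returns the number of bids actually changed, so a repeated call after
    an interrupted cycle-back reports 0 and leaves the data untouched.
    """
    changed = 0
    for bid in bids:
        if bid["round_number"] == rejected_round and bid["status"] != "SUPERSEDED":
            bid["status"] = "SUPERSEDED"
            changed += 1
    return changed


def bids_for_evaluation(bids):
    """Filter applied before evaluation: SUPERSEDED bids are always dropped."""
    return [b for b in bids if b["status"] != "SUPERSEDED"]
```

Calling `supersede_round` twice yields the same end state as calling it once, which is what lets the retry loop and the recovery loop overlap safely.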
Idempotency makes a single resuming instance safe. It does not, on its own, make two resuming instances safe. After a Runtime disruption, the recovery flow queries DynamoDB for every negotiation whose status isn't AWARDED or CANCELLED and tries to resume each one. If two instances run that sweep concurrently, they'll both find the same in-progress negotiation. Idempotency means they won't duplicate side effects — but they'll waste compute racing each other, and they can interleave checkpoint writes in confusing ways. We'd rather exactly one instance own a recovery.
The mechanism is a DynamoDB lock on {env}-recovery-locks, scoped to (tenant_id, negotiation_id), acquired with a conditional write:
ConditionExpression = "attribute_not_exists(pk) OR expires_at < :now_epoch"
No lock row means acquisition succeeds. A lock row whose expires_at has already passed means the previous holder is presumed dead and the new instance takes over. Crucially, lock expiry is driven by the expires_at comparison, not by DynamoDB's TTL deletion. TTL deletion is a best-effort background process: it can trail the expiry time by many minutes, and AWS documents it only as typically completing within a few days, which is far too imprecise to gate failover on. We keep a ttl attribute set to expires_at + 3600s purely for background row cleanup; it is never load-bearing for correctness.
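The acquisition logic can be emulated locally to make the semantics concrete. In this sketch a dict stands in for the lock table, the `pk` format and the `owner` field are assumptions, and the `if` mirrors the condition expression rather than calling DynamoDB.

```python
LOCK_LIFETIME_S = 600    # 600-second lock lifetime
CLEANUP_GRACE_S = 3600   # ttl = expires_at + 3600s, background cleanup only


def try_acquire_lock(table: dict, tenant_id: str, negotiation_id: str,
                     owner: str, now_epoch: int) -> bool:
    """Take the recovery lock if it is free or its holder's lease has expired."""
    pk = f"{tenant_id}#{negotiation_id}"  # key format assumed for the sketch
    row = table.get(pk)
    # Mirrors: attribute_not_exists(pk) OR expires_at < :now_epoch
    if row is not None and not (row["expires_at"] < now_epoch):
        return False  # a live holder owns this recovery
    expires_at = now_epoch + LOCK_LIFETIME_S
    table[pk] = {
        "owner": owner,
        "expires_at": expires_at,            # compared in the control path
        "ttl": expires_at + CLEANUP_GRACE_S,  # never used for correctness
    }
    return True
```

Note that the comparison is strict: at exactly `expires_at` the lock is still held, and takeover becomes possible one second later.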
The lock's lifetime is 600 seconds, and that number isn't arbitrary — it's sized to exceed the worst-case execution time of a single node so a legitimately-slow node can't have its lock yanked mid-flight:
A2A timeout (120s) + cold-start budget (30s) = 150s per attempt
× max_retries (3) = 450s
+ backoff ceiling (~45s) = 495s
+ checkpoint write budget (~5s) ≈ 500s
→ TTL set to 600s (≈1.2× safety margin)
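Keeping that derivation as code next to the constant is one way to stop the TTL drifting away from its inputs. The constant and function names below are illustrative; the numbers are the ones in the budget above.

```python
A2A_TIMEOUT_S = 120
COLD_START_S = 30
MAX_RETRIES = 3
BACKOFF_CEILING_S = 45      # approximate ceiling
CHECKPOINT_WRITE_S = 5      # approximate budget
LOCK_TTL_S = 600


def worst_case_node_seconds() -> int:
    """Worst-case single-node execution time, following the budget above."""
    per_attempt = A2A_TIMEOUT_S + COLD_START_S        # 150s per attempt
    return per_attempt * MAX_RETRIES + BACKOFF_CEILING_S + CHECKPOINT_WRITE_S


def lock_ttl_margin() -> float:
    """Ratio of the lock TTL to the worst-case node time."""
    return LOCK_TTL_S / worst_case_node_seconds()


def ttl_covers_budget() -> bool:
    """True while the lock outlives the slowest legitimate node execution."""
    return LOCK_TTL_S > worst_case_node_seconds()
```

A check like `ttl_covers_budget()` could run in CI, so that raising a timeout or a retry count without revisiting the TTL fails loudly instead of silently shrinking the margin.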
There's a second-order failure to handle: what if the recovery attempt itself crashes while holding the lock? Then the expires_at comparison gives a deterministic upper bound — 600 seconds — on when the next instance can step in. No reliance on TTL timing, no indefinite stall. Ops watches a recovery_lock_acquisition_failures metric with a CloudWatch alarm that fires at one failure in five minutes; sustained elevation points at a systemic problem (a node that genuinely can't complete) rather than a transient one, and the right response is to investigate instance health, not to keep retrying recovery into the same wall.
Why keep both a lock and idempotency? We kept asking ourselves that question, and the answer is that the two mechanisms guard different things. Idempotency guarantees correctness under concurrent execution: no duplicate invitation, no double-counted bid. The recovery lock guarantees efficiency and clean ownership: one instance does the work, checkpoint writes don't interleave, and the recovery-time target (30 seconds, REQ-G300) is met by one instance moving forward rather than several colliding. Idempotency is the safety net; the lock is the thing that keeps you off the net most of the time. Drop the lock and the system is still correct, just wasteful and harder to reason about. Drop idempotency and the lock becomes load-bearing for correctness, which is exactly the coupling we didn't want.
Everything so far is about surviving crashes. The harder durability problem is the deliberate pause. A STRATEGIC or above-threshold negotiation hits the Node 6 Approval Gate and has to stop and wait for a human. That human has up to 96 hours. You cannot hold a process — or a microVM session — open for four days waiting on a click. So the interrupt doesn't hold anything open at all.
At Node 6, the gate evaluates whether approval is required (spend above threshold, or a quadrant that always requires it). If it is, the graph executes a true interrupt: persist the full state snapshot, set Negotiation.status = PENDING_APPROVAL and Award.approval_status = PENDING_HUMAN, return a session_id and checkpoint_id, publish the approval request, and pause (REQ-G200). The execution stops. No compute is held. The negotiation now lives entirely as a row in DynamoDB and an open span in the trace.
Resume is an inbound API call, not a polling loop. When the approver submits {decision, reason, alternative_bid_id}, the graph validates the decision and resumes from the checkpoint within five seconds (REQ-G201). That resume endpoint is also where user-level authorisation is enforced — a separate plane from the agent→tool policies — evaluating the caller's JWT claims against the action before the state transition is accepted (REQ-G251). The agent never carries a user JWT and never evaluates user claims; entity access control is purely an orchestration-layer concern, enforced at exactly this boundary.
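The interrupt and the resume can be sketched as two plain functions with a store between them. This is illustrative only: a dict stands in for DynamoDB, the decision values in `VALID_DECISIONS` are assumed (the text doesn't enumerate them), and publishing the approval request and the JWT check at the resume endpoint are left out.

```python
import json
import uuid

VALID_DECISIONS = {"approve", "reject"}  # assumed values for the sketch


def interrupt_for_approval(store: dict, negotiation: dict, award: dict,
                           state: dict) -> dict:
    """Persist the full snapshot, park the negotiation, and return resume handles."""
    checkpoint_id = str(uuid.uuid4())
    store[checkpoint_id] = json.dumps(state)   # full state snapshot
    negotiation["status"] = "PENDING_APPROVAL"
    award["approval_status"] = "PENDING_HUMAN"
    # Nothing stays running after this returns; the row is the negotiation.
    return {"session_id": str(uuid.uuid4()), "checkpoint_id": checkpoint_id}


def resume_from_approval(store: dict, checkpoint_id: str, payload: dict) -> dict:
    """Validate the approver's decision and rebuild state from the checkpoint."""
    if payload.get("decision") not in VALID_DECISIONS:
        raise ValueError(f"invalid decision: {payload.get('decision')!r}")
    state = json.loads(store[checkpoint_id])
    state["approval"] = {
        "decision": payload["decision"],
        "reason": payload.get("reason"),
        "alternative_bid_id": payload.get("alternative_bid_id"),
    }
    return state
```

The two functions share nothing but the store, so the resume can run on a different instance, after a deploy, days later.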
When 96 hours elapse with no decision, the negotiation does not auto-approve or auto-reject. It transitions to REQUIRES_ATTENTION (trigger #3) with full state preserved, and a human compliance reviewer makes the call (REQ-G203). The reasoning: a stale approval past 96 hours signals either an unavailable approver or a broken escalation path — both of which are conditions where guessing is worse than escalating. There is a final backstop: if the negotiation then sits in REQUIRES_ATTENTION for more than seven further days, it auto-cancels with approval_stale_7day_limit_exceeded (REQ-G203a). Auto-cancel is safe in a way auto-approve never is — cancelling commits no money and signs no contract.
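The two-stage timeout reads naturally as a pure function over the current status and how long the negotiation has been in it. The function name and the `(new_status, cancel_reason)` return shape are assumptions for this sketch; the thresholds and the reason string come from the text.

```python
from datetime import datetime, timedelta

APPROVAL_WINDOW = timedelta(hours=96)
ATTENTION_LIMIT = timedelta(days=7)
STALE_REASON = "approval_stale_7day_limit_exceeded"


def approval_timeout_action(status: str, entered_status_at: datetime, now: datetime):
    """Decide what the approval-timeout path should do, if anything.

    Call only for negotiations on the approval path: REQUIRES_ATTENTION here
    means the negotiation was escalated by the approval timeout.
    Returns (new_status, cancel_reason) or None when no transition is due.
    """
    waited = now - entered_status_at
    if status == "PENDING_APPROVAL" and waited >= APPROVAL_WINDOW:
        return ("REQUIRES_ATTENTION", None)   # escalate, never auto-decide
    if status == "REQUIRES_ATTENTION" and waited > ATTENTION_LIMIT:
        return ("CANCELLED", STALE_REASON)    # final backstop
    return None
```

Neither branch ever returns an approval or a rejection; the only automatic outcome is the cancel, which commits nothing.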
Stacking up the waits — a five-business-day Leverage auction, a 96-hour approval window, retry backoffs, a governed cycle-back — a negotiation's wall-clock lifespan can stretch. The outer bound is 168 hours (seven days) from created_at. A periodic Graph-Orchestrator sweep flags any negotiation that's been non-terminal that long and routes it to REQUIRES_ATTENTION with entry_trigger = "negotiation_total_timeout" (trigger #16, REQ-R351).
The design choice worth calling out is that the system does not auto-cancel at 168 hours. It escalates and waits for an operator to decide between cancel and extend. A negotiation that's been alive for a week is either genuinely stuck — and a human needs to know why — or legitimately long-running, in which case silently killing it would destroy days of supplier engagement. Either way the correct action is a human decision, so the ceiling is a tripwire that summons attention, not a guillotine.
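A sketch of that sweep, assuming negotiations are dicts with `negotiation_id`, `status`, and `created_at`. Skipping rows already in REQUIRES_ATTENTION is an assumption made here so that the sweep does not re-flag the same negotiation on every pass.

```python
from datetime import datetime, timedelta

TOTAL_TIMEOUT = timedelta(hours=168)
TERMINAL_STATUSES = {"AWARDED", "CANCELLED"}


def sweep_total_timeout(negotiations, now: datetime):
    """Escalate every negotiation that has been non-terminal for 168 hours.

    Escalates only; it never cancels. Returns the ids flagged on this pass.
    """
    flagged = []
    for n in negotiations:
        if n["status"] in TERMINAL_STATUSES:
            continue
        if n["status"] == "REQUIRES_ATTENTION":
            continue  # already waiting on an operator (assumption)
        if now - n["created_at"] >= TOTAL_TIMEOUT:
            n["status"] = "REQUIRES_ATTENTION"
            n["entry_trigger"] = "negotiation_total_timeout"
            flagged.append(n["negotiation_id"])
    return flagged
```

Running the sweep twice in a row flags nothing the second time, so the periodic job is itself safe to retry.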
This is the same philosophy that runs through the rest of the failure surface: every terminal-ish event is observable and categorised, never swallowed. The escalation itself is built to survive the failure that triggered it. The authoritative signal is always the DynamoDB status write to REQUIRES_ATTENTION; the SQS DLQ publish that follows is fire-and-forget (REQ-G306). If the DLQ is down, the negotiation is already visibly in REQUIRES_ATTENTION on the ops dashboard and in CloudWatch, and a dlq_publish_failed metric surfaces the DLQ's own degradation independently. A failure in the notification path can never mask the failure it was supposed to report.
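The ordering is the whole point, so it is worth seeing in code. In this sketch a dict stands in for the status table, `publish_to_dlq` is any callable that may raise, and `metrics` is a plain counter dict; all three are stand-ins for the real DynamoDB, SQS, and CloudWatch clients.

```python
def escalate(status_table: dict, negotiation_id: str, trigger: str,
             publish_to_dlq, metrics: dict) -> None:
    """Write the authoritative status first, then attempt the DLQ publish."""
    # 1. Authoritative signal: the status write. If this raises, nothing
    #    below runs and the caller sees the failure.
    status_table[negotiation_id] = {
        "status": "REQUIRES_ATTENTION",
        "entry_trigger": trigger,
    }
    # 2. Fire-and-forget notification: a failure here is counted, not raised,
    #    so it cannot undo or hide the escalation above.
    try:
        publish_to_dlq({"negotiation_id": negotiation_id, "entry_trigger": trigger})
    except Exception:
        metrics["dlq_publish_failed"] = metrics.get("dlq_publish_failed", 0) + 1
```

If the publisher raises, the negotiation is still in REQUIRES_ATTENTION and the only trace of the DLQ problem is the incremented metric.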
None of this is free, and most of it is a deliberate trade against an alternative that looked simpler on paper. Six decisions shaped the durability model; each bought a reliability property at a real, named price.
Chose: a fixed seven-node graph. Trade-off: the system can't creatively re-plan around a novel situation — every path is one we anticipated. Gain: "where was I?" is a status lookup, not a reasoning replay, so recovery is deterministic and the reachable-state set is small enough to test exhaustively. A planner's state lives in an LLM's head and can't be reconstructed byte-for-byte; a DAG's state is a row.
Chose: persist after each of the seven nodes. Trade-off: a DynamoDB write on every node boundary, which means more writes, more latency, and a 30-day storage cost per negotiation. Gain: a crash loses at most one node's work instead of the whole negotiation. For the expensive Node 4x (days of supplier outreach with real-world side effects), that's the difference between a recovery inside the 30-second target and re-sending invitations.
Chose: every side-effecting node owns a dedup key; recovery stays dumb. Trade-off: every node author must define a deterministic key and an idempotency check as part of "done" — discipline that's easy to skip and painful to retrofit. Gain: resume is provably safe even when the crashed node already produced side effects, and the retry loop and recovery loop compose without special-casing.
Chose: a conditional-write lock on top of idempotency. Trade-off: a second mechanism to maintain, plus a TTL that must be re-derived whenever node timeouts change. Gain: exactly one instance owns a recovery — no wasted compute racing, no interleaved checkpoint writes, and the 30-second recovery target is met by one instance moving forward. Idempotency keeps the system correct under a race; the lock keeps it efficient and legible.
Chose: serialise state to DynamoDB and tear the execution down during a human wait. Trade-off: resume is an inbound API call with its own auth and validation path, not a paused thread that just wakes up — more moving parts at the boundary. Gain: a four-day pause costs zero compute and survives any deploy or recycle that happens during it. Holding a microVM session open for 96 hours isn't just wasteful — it makes the pause itself a thing that can crash.
Chose: 168h non-terminal → REQUIRES_ATTENTION, human decides. Trade-off: a stuck negotiation needs an operator instead of self-clearing. Gain: a week-old negotiation is either genuinely broken (someone should know) or legitimately long (killing it destroys days of engagement). Both resolve to a human decision, so the ceiling summons attention rather than silently discarding state.
The thread running through all six is the same: push correctness into deterministic, replayable mechanisms and out of anything that has to be reasoned about live. The DAG, the checkpoints, the dedup keys, the conditional-write lock, the serialised interrupt — each one converts a "hope it works" into a "lookup that can't lie." That's the same instinct behind treating the agent as untrusted infrastructure on the security side: the less the live reasoning loop is load-bearing for a critical property, the more durable that property is.
Putting the pieces together, here's what durability looks like for a Leverage auction that runs the full distance and survives a crash and a long human wait on the way:
At no point in those five days was a process held open across the idle stretches, and at no point did the round-3 crash cost more than the seconds it took a fresh instance to acquire the lock and reload the checkpoint. The auction that the supplier experienced was continuous; the execution behind it was anything but.
Three things that took longer to land than they should have.
Size the lock TTL from the execution budget, not a round number. An earlier version of the recovery lock used a five-minute TTL because five minutes felt safe. It wasn't — a Node 4x execution that exhausts its retries (three A2A attempts at 150s each, plus backoff) can legitimately run past 495 seconds, and a five-minute lock would expire under a slow-but-healthy node, letting a second instance barge in. We re-derived the TTL from the worst-case single-node execution time and landed at 600 seconds with a 1.2× margin. The lesson is general: any TTL that gates failover has to be a function of the thing it's protecting, and it should be written down next to the derivation, because the inputs (timeouts, retry counts) drift over time and the TTL has to drift with them.
Don't trust DynamoDB TTL for anything time-sensitive. We initially leaned on TTL deletion to expire recovery locks, and discovered the hard way that TTL is best-effort and can lag by many minutes. A crashed recovery holding a lock would block the next instance for far longer than intended. The fix was to make expires_at a value we compare in the conditional write itself, and demote TTL to background cleanup only. If correctness depends on when something expires, expiry has to be in your control path, not in a background process you don't schedule.
Make resume idempotency a node-authoring requirement, not a recovery-layer afterthought. The instinct is to build resume as a clever recovery routine that figures out what to skip. That's backwards. The recovery routine should be dumb — read status, resume next node — and the cleverness should live in each node's own contract: "I detect my previously-completed work via this dedup key." We retrofitted idempotency into the auction agent after an audit found a duplicate-invitation path on cycle-back, and retrofitting it was much more painful than it would have been to require it up front. Every node that has a side effect needs a dedup key as part of its definition of done, the same way it needs a checkpoint.
Durable execution in Buyer Team rests on four mechanisms that each close a gap the others leave open. The deterministic DAG makes "where was I?" a lookup instead of a replay. Per-node DynamoDB checkpoints (REQ-G002) cap the blast radius of any crash at one node's work. Per-node idempotency with dedup keys (REQ-G005, REQ-R200) makes resume safe even when the crashed node had already produced side effects. And the conditional-write recovery lock (REQ-R502) with an execution-budget-derived 600s TTL ensures exactly one instance owns a recovery, hitting the 30-second target without colliding.
Layered on top, the interrupt-resume pattern lets a deterministic workflow pause for up to 96 hours of human deliberation without holding a process open — the negotiation lives as a DynamoDB row and resumes on an inbound API call in under five seconds. And the 168-hour ceiling (REQ-R351) is a tripwire that escalates a stuck workflow to a human rather than auto-deciding, because a week-old negotiation carries enough ambiguity that the right move is attention, not automation.
The principle underneath all of it: a long-running negotiation is not a long-running process. Treat the workflow as a sequence of durable, idempotent, checkpointed steps over a store of record, and the days of idle time and the crashes that punctuate them stop being a reliability problem. The agents do the negotiating. The state machine just refuses to forget where it was.