Engineering · Buyer Team · May 2026

Why we chose a deterministic DAG over a pure LLM planner

Autonomous negotiation sounds like a job for a powerful LLM — give it the requisition, the supplier list, and a strategy, and let it figure out the rest. We tried that. It didn't work well enough for production. The reason is governance and recovery, not capability.

Three things a pure agent system can't give you

When you let an LLM plan and execute a negotiation end-to-end, three problems compound. First, auditability becomes a property of a chain-of-thought trace rather than of the workflow itself — and a regulator can't audit reasoning that was never structurally constrained. Second, governance lives in the prompt, where it can be misread, deprioritised against other instructions, or quietly skipped on an edge case the prompt didn't anticipate. Third, failure recovery means re-running the plan — there are no named checkpoints to resume from, so a crashed runtime means restarting the negotiation from the beginning, with all the duplicate-invitation and double-bid hazards that implies.

None of these are acceptable in a production procurement system where a missed approval gate on a $500k order creates real liability, and where 30 seconds of recovery latency is the difference between resuming a negotiation and re-running it.

Orchestration before intelligence

The principle that organises our architecture is orchestration before intelligence: a deterministic workflow DAG provides the structural backbone, while LLM-powered agents supply adaptive intelligence at specific decision points within governed guardrails. The DAG owns governance, routing, checkpointing, and recovery. The agents own classification nuance, supplier communication, and multi-constraint scoring. Neither tries to do the other's job.

Concretely: every negotiation flows through a seven-node graph with a four-way conditional branch at Node 3 (Strategy Router), converging at Node 5 (Bid Evaluation) and terminating at Node 7 (Award & Communications). Exactly one Node 4 variant executes per negotiation, selected by the Kraljic quadrant determined upstream. A single governed cycle-back from Node 6 to Node 4x is permitted — at most one retry, enforced in code.
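
A minimal sketch of the routing rule and the retry cap described above. The quadrant-to-variant mapping, the function names `route_strategy` and `may_cycle_back`, and the `MAX_CYCLE_BACKS` constant are illustrative assumptions, not the production identifiers; the text only states that one of Nodes 4a to 4d runs per quadrant and that at most one cycle-back is allowed.

```python
from enum import Enum


class Quadrant(Enum):
    NON_CRITICAL = "non_critical"
    LEVERAGE = "leverage"
    BOTTLENECK = "bottleneck"
    STRATEGIC = "strategic"


# Hypothetical mapping: which quadrant selects which variant is an assumption.
NODE_4_VARIANT = {
    Quadrant.NON_CRITICAL: "4a",
    Quadrant.LEVERAGE: "4b",
    Quadrant.BOTTLENECK: "4c",
    Quadrant.STRATEGIC: "4d",
}

MAX_CYCLE_BACKS = 1  # at most one governed retry from Node 6 to Node 4x


def route_strategy(quadrant: Quadrant) -> str:
    """Node 3: pick exactly one Node 4 variant; a plain lookup, no LLM call."""
    return NODE_4_VARIANT[quadrant]


def may_cycle_back(cycle_backs_used: int) -> bool:
    """Node 6: allow a retry only while the cap has not been reached."""
    return cycle_backs_used < MAX_CYCLE_BACKS
```

Because both functions are total and side-effect free, the "exactly one variant" and "at most one retry" claims can be asserted directly in unit tests rather than inferred from agent behaviour.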

Figure: The seven-node Buyer Team workflow. Deterministic nodes (grey, amber) own structure and governance; LLM-backed nodes (blue, green, purple) own adaptive intelligence under steering hooks.

The Step Functions approach

Our orchestrator is a deterministic state machine built on AWS Step Functions. Each state is a typed, versioned task with explicit inputs, outputs, failure modes, and idempotency keys. The state machine encodes the governance rules; LLMs enter as Strands A2A agents — each deployed as an independent AgentCore Runtime — invoked at specific states, executing within the flow, never controlling it. The human approval gate is a Step Functions Task Token callback, not an agent decision: execution pauses until a decision is submitted to the callback API, with negotiation state persisted to DynamoDB throughout.
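
A hedged sketch of the callback side of the approval gate. `send_task_success` with `taskToken` and `output` is the Step Functions call that resumes a task paused on a task token; the function name `submit_approval_decision`, the payload fields, and the choice to report a rejection as a completed task carrying `decision: REJECTED` (so the state machine can branch on it) are assumptions for illustration.

```python
import json
from typing import Any, Protocol


class StepFunctionsClient(Protocol):
    """The single method this sketch needs from boto3.client("stepfunctions")."""

    def send_task_success(self, *, taskToken: str, output: str) -> Any: ...


def submit_approval_decision(
    sfn: StepFunctionsClient, task_token: str, approved: bool, reviewer: str
) -> dict:
    """Resume the paused execution with a human decision."""
    payload = {
        "decision": "APPROVED" if approved else "REJECTED",
        "reviewer": reviewer,
    }
    # Assumption: rejections also resume the task so a Choice state can branch.
    sfn.send_task_success(taskToken=task_token, output=json.dumps(payload))
    return payload
```

In a deployed callback API the `sfn` argument would be a real Step Functions client; passing it in keeps the handler testable with a stub and keeps the decision itself outside any agent.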

Architecture note

The Approval Gate is a pure function over policy and state. No LLM is involved in the decision to require human review — the rule cannot be reasoned around, retried out of, or convinced to skip itself for an unusual edge case.

Governance as code, not as prompt

The Approval Gate at Node 6 is the clearest example. It runs two structural checks before any auto-approval, and the second one fires on the actual award outcome — not on the requisition's budget limit. That distinction matters: flagged bids are allowed to exceed budget_limit (REQ-305), so a budget check at requisition time isn't sufficient. The Amount Gate re-evaluates after the bid is chosen.
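
The shape of such a gate can be sketched as a pure function. The text names the Amount Gate but not the first check, so treating "the chosen bid was flagged" as check one is an assumption, as are the type names, field names, and reason codes; only `auto_award_below_threshold_usd` and `budget_limit` come from the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Award:
    amount_usd: float  # price of the bid actually chosen
    flagged: bool      # set upstream when the bid exceeded budget_limit


@dataclass(frozen=True)
class GateResult:
    requires_human: bool
    reason: str


def approval_gate(award: Award, auto_award_below_threshold_usd: float) -> GateResult:
    """Pure function over award state and policy; no LLM, no I/O."""
    # Check 1 (assumed): a flagged bid never auto-approves.
    if award.flagged:
        return GateResult(True, "FLAGGED_BID")
    # Check 2, the Amount Gate: evaluated on the chosen bid, not the requisition.
    if award.amount_usd >= auto_award_below_threshold_usd:
        return GateResult(True, "AMOUNT_AT_OR_ABOVE_THRESHOLD")
    return GateResult(False, "AUTO_APPROVED")
```

For example, `approval_gate(Award(500_000, False), 25_000)` requires human review regardless of what the requisition's budget said, because the gate only looks at the award it is handed.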

Every threshold the gate reads is per-tenant configuration. The quality_minimum and auto_award_below_threshold_usd values, the ESG cutoffs, and the approval timeouts are all resolved from the {env}-system-config DynamoDB table (with per-tenant overrides in {env}-tenant-evaluation-config) and read at agent instantiation. One tenant runs a $25k auto-approval ceiling; another runs $250k; a third disables auto-approval for a specific category entirely. The Approval Gate doesn't care: it's the same function, with the same structural shape, evaluated against whichever policy the tenant currently has in effect. Changing the rule is a config change, not a prompt rewrite, and it takes effect on the next agent instantiation with no redeploy. The OTEL span emitted by the gate records the policy values that fired, so a regulator auditing a decision sees the exact thresholds in force at the moment of the award.
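
One way to picture the override resolution, as a sketch only: plain dictionaries stand in for the two DynamoDB tables, and the tenant ids, the `quality_minimum` value, and the `policy.` attribute prefix are made up for the example.

```python
from collections import ChainMap
from typing import Mapping

# Stand-ins for rows in {env}-system-config and {env}-tenant-evaluation-config.
# Values are examples only.
SYSTEM_DEFAULTS = {"auto_award_below_threshold_usd": 25_000, "quality_minimum": 0.7}
TENANT_OVERRIDES = {
    "tenant-a": {"auto_award_below_threshold_usd": 250_000},
    "tenant-b": {},
}


def resolve_policy(tenant_id: str) -> Mapping[str, float]:
    """Tenant overrides win; anything not overridden falls back to system config."""
    return ChainMap(TENANT_OVERRIDES.get(tenant_id, {}), SYSTEM_DEFAULTS)


def policy_span_attributes(policy: Mapping[str, float]) -> dict:
    """Flatten the resolved policy into attributes to attach to the gate's span."""
    return {f"policy.{key}": value for key, value in sorted(policy.items())}
```

Recording the resolved values, rather than a pointer to the config row, is what lets an audit show the thresholds that applied at decision time even after the tenant later changes them.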

ESG compliance has a similar two-tier structure, also code-enforced. Tier 1 is a binary filter at Node 3: suppliers below require_min_esg_score never enter the candidate pool and never receive invitations. Tier 2 is a weighted dimension (10–15% depending on quadrant) in the bid evaluation formula at Node 5. Bids that pass Tier 1 but score sub-optimally are flagged for reviewer attention, not auto-rejected. Both tiers are versioned in code; both pull their cutoff values from per-tenant DynamoDB config; both produce OTEL spans with the policy values that fired.
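
The two tiers can be sketched as two small functions. This assumes dimension scores normalised to the range 0 to 1 and a simple weighted sum; the example weights are hypothetical, and only the 10 to 15% ESG share and the `require_min_esg_score` name come from the text.

```python
from typing import Dict, List


def tier1_filter(esg_scores: Dict[str, float], require_min_esg_score: float) -> List[str]:
    """Node 3: suppliers below the cutoff never enter the candidate pool."""
    return [s for s, score in esg_scores.items() if score >= require_min_esg_score]


def weighted_score(dimensions: Dict[str, float], weights: Dict[str, float]) -> float:
    """Node 5: weighted sum of normalised dimension scores (each in 0..1)."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(weights[name] * dimensions[name] for name in weights)


# Hypothetical weights; only the 10-15% ESG share comes from the text.
EXAMPLE_WEIGHTS = {"cost": 0.40, "delivery": 0.20, "quality": 0.20, "esg": 0.10, "history": 0.10}
```

Tier 1 removes suppliers outright, while in Tier 2 a weak ESG score only lowers the total, which matches the "flag, don't auto-reject" behaviour described above.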

The same pattern extends across the spec. There are 16 distinct REQUIRES_ATTENTION triggers, each with a typed condition, a machine-readable entry_trigger code, an escalation path, and an SLA. They range from supplier-availability expiry through negotiation total-timeout to token-ceiling cost-harvesting. Ops dashboards filter on the entry_trigger code, and an alarm is wired to each trigger. None of them are LLM judgements.
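
As an illustration of what "a typed condition plus a machine-readable code" could look like, here is a sketch with two invented entries. The trigger codes, state keys, escalation paths, and SLA values are all placeholders, not the 16 real triggers.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass(frozen=True)
class AttentionTrigger:
    entry_trigger: str                 # machine-readable code for dashboards
    condition: Callable[[Dict], bool]  # deterministic predicate on state
    escalation_path: str
    sla_minutes: int


# Two hypothetical entries; codes, paths and SLAs are placeholders.
TRIGGERS: List[AttentionTrigger] = [
    AttentionTrigger(
        "NEGOTIATION_TOTAL_TIMEOUT",
        lambda s: s["elapsed_s"] > s["total_timeout_s"],
        "buyer-ops",
        60,
    ),
    AttentionTrigger(
        "TOKEN_CEILING_EXCEEDED",
        lambda s: s["tokens_used"] > s["token_ceiling"],
        "platform-oncall",
        30,
    ),
]


def first_trigger(state: Dict) -> Optional[str]:
    """Return the entry_trigger code of the first condition that fires, if any."""
    for trigger in TRIGGERS:
        if trigger.condition(state):
            return trigger.entry_trigger
    return None
```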

Where LLMs actually live

Node 2: Kraljic Classify (SimpleLLM · semantic cache 24h)
    profit_impact_score, supply_risk_score, kraljic_quadrant, classification.schema_fallback

Nodes 4a–4d: Strategy execution agents (4a SimpleLLM · 4b–4d DefaultLLM · A2A · steering hooks)
    supplier_comms, bid_rounds, convergence_detect, TCO_analysis, budget_ceiling_hook

Node 5: Bid Evaluation (DefaultLLM · structured output)
    score_cost, score_delivery, score_quality, score_esg, score_history, award_recommendation

Node 7: Award & Communications (SimpleLLM · WinnerDisclosureGuard)
    award_notification, rejection_notifications, comms_before_PO_assembly

Each agent sits behind steering hooks — PRE-CALL and POST-CALL interceptors that enforce invariants the agent itself shouldn't be trusted to enforce. BidConfidentiality sanitizes outgoing supplier messages, stripping competitor pricing, budget figures, and supplier names before they leave. BudgetCeiling runs after check_bid_responses and flags collected bids that exceed budget_limit, annotating them for Bid Evaluation review rather than silently rejecting (a flagged bid may legitimately win — REQ-305). WinnerDisclosure scrubs rejection notifications of any reference to the winning supplier. AuctionIntegrity blocks the auction agent from leaking competitor prices in ranking feedback. Eight hooks across the graph — seven PRE-CALL, one POST-CALL — all deterministic, all versioned, all logged. A hook crash suppresses its target tool call rather than letting the tool run with unguarded inputs.
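
A small sketch of the fail-closed behaviour for a PRE-CALL hook. The `guarded_call` wrapper and the regex are assumptions for illustration; the toy sanitiser only redacts dollar amounts, whereas the hook described above also strips supplier names and budget figures.

```python
import re
from typing import Callable, Optional

PreCallHook = Callable[[str], str]


def bid_confidentiality(message: str) -> str:
    """Toy sanitiser: redacts dollar amounts. Real rules would cover more."""
    return re.sub(r"\$\s?\d[\d,]*(\.\d+)?[kKmM]?", "[REDACTED]", message)


def guarded_call(tool: Callable[[str], str], hook: PreCallHook, message: str) -> Optional[str]:
    """Run the PRE-CALL hook, then the tool. Fail closed if the hook crashes."""
    try:
        safe_message = hook(message)
    except Exception:
        # The hook crashed: suppress the tool call instead of sending unguarded input.
        return None
    return tool(safe_message)
```

The important property is the `except` branch: a broken hook results in no outbound message at all, never in an unsanitised one.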

Recovery is the other half of the argument

Governance is the headline reason for the DAG, but recovery is the quieter one. Every node checkpoints to DynamoDB on completion — Negotiation entity, all child entities, the routing decision, the discarded-suppliers list, the bid records, the evaluation scores. Every node is idempotent: re-running it with the same inputs detects previously-completed work via a dedup key. On Runtime restart, the orchestrator reads the last status from DynamoDB and resumes from the next node in the DAG within 30 seconds.
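
The checkpoint-and-resume behaviour can be sketched with an in-memory store standing in for DynamoDB. The dedup key format, the node names, and the linear node order (which ignores the Node 4 branch for brevity) are simplifying assumptions.

```python
from typing import Callable, Dict, List

NODE_ORDER: List[str] = ["node1", "node2", "node3", "node4", "node5", "node6", "node7"]


class CheckpointStore:
    """In-memory stand-in for the DynamoDB checkpoint table."""

    def __init__(self) -> None:
        self.completed: Dict[str, dict] = {}  # dedup key -> stored output

    def run_once(self, negotiation_id: str, node: str, work: Callable[[], dict]) -> dict:
        key = f"{negotiation_id}#{node}"  # hypothetical dedup key format
        if key in self.completed:         # already done: skip side effects
            return self.completed[key]
        output = work()
        self.completed[key] = output      # checkpoint on completion
        return output

    def next_node(self, negotiation_id: str) -> str:
        """On restart, resume from the first node with no checkpoint."""
        for node in NODE_ORDER:
            if f"{negotiation_id}#{node}" not in self.completed:
                return node
        return "DONE"
```

Calling `run_once` twice for the same node runs the work once, which is the property that prevents duplicate invitations after a crash and restart.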

The concurrent-recovery story is also a code property, not a prompt one. We hold a DynamoDB lock on (tenant_id, negotiation_id) with a 600s TTL — sized to cover worst-case single-node execution (A2A timeout 120s × max_retries 3 + backoff ceiling + checkpoint write budget). If the lock holder crashes before releasing, TTL expiry lets a failover instance acquire and resume. If a cycle-back fires after a human reject-with-retry, we mark prior bids SUPERSEDED in a write wrapped in three exponential-backoff retries — and if all three fail, the graph halts the cycle-back rather than risk reinvoking the strategy agent with unconfirmed state, escalating to REQUIRES_ATTENTION instead. Halting is safer than proceeding.
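
Both mechanisms can be sketched in a few lines. The dictionary below is a stand-in for the lock table and is not atomic; against DynamoDB the same check would need to be a conditional write. The function names, the `base_delay_s` value, and the return codes are assumptions.

```python
import time
from typing import Callable, Dict, Tuple

LOCK_TTL_S = 600
LockKey = Tuple[str, str]  # (tenant_id, negotiation_id)


def try_acquire(locks: Dict[LockKey, float], key: LockKey, now: float) -> bool:
    """Acquire if the lock is free or its TTL has expired (failover case)."""
    expires_at = locks.get(key)
    if expires_at is not None and expires_at > now:
        return False
    locks[key] = now + LOCK_TTL_S
    return True


def mark_superseded(write: Callable[[], None], attempts: int = 3,
                    base_delay_s: float = 0.1,
                    sleep: Callable[[float], None] = time.sleep) -> str:
    """Retry the write with exponential backoff; halt the cycle-back if all fail."""
    for attempt in range(attempts):
        try:
            write()
            return "CYCLE_BACK"
        except Exception:
            if attempt < attempts - 1:
                sleep(base_delay_s * 2 ** attempt)
    return "REQUIRES_ATTENTION"
```

Note that `mark_superseded` never raises: the only two outcomes are a confirmed write followed by the cycle-back, or an escalation, so the strategy agent is never re-invoked on unconfirmed state.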

A pure LLM planner doesn't give you any of this. It gives you "resume from a chain-of-thought" — and chains of thought aren't checkpoints.

Testable governance

The separation between deterministic structure and adaptive intelligence gives us something a pure planner cannot: three orthogonal testing oracles, three CI gates.

Deterministic unit tests assert routing correctness — given a Kraljic quadrant, the Strategy Router selects the right Node 4 variant; given a flagged bid above threshold, the Approval Gate requires human review; given an empty candidate pool after the delivery gate, the graph routes to TERMINATED. These are pure functions; the tests are pure functions on them.
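
A sketch of what one of these tests might look like. `route_after_delivery_gate` and the `INVITE_SUPPLIERS` state name are hypothetical stand-ins for the real routing function; only the TERMINATED outcome for an empty pool comes from the text.

```python
from typing import List


def route_after_delivery_gate(candidates: List[str]) -> str:
    """Hypothetical pure routing rule: no candidates means the graph terminates."""
    return "TERMINATED" if not candidates else "INVITE_SUPPLIERS"


def test_empty_pool_terminates() -> None:
    assert route_after_delivery_gate([]) == "TERMINATED"


def test_non_empty_pool_continues() -> None:
    assert route_after_delivery_gate(["supplier-1"]) == "INVITE_SUPPLIERS"
```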

Integration tests assert flow adherence: a happy-path spot bid reaches AWARDED, a STRATEGIC negotiation triggers the interrupt and resumes cleanly, an all-bids-fail run terminates with the right cancellation reason, recovery from a mid-node crash produces no duplicate bids, and 50 concurrent negotiations clear within SLA.

Evals assert agent quality independently — Kraljic classification accuracy against a labelled dataset, bid evaluation rationale against an LLM-as-judge rubric for governance compliance, communication outputs against confidentiality criteria. These run on agent outputs, not on graph routing.
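
For the classification eval, the oracle can be as simple as accuracy against labels, gated on a minimum. This is a sketch: the function names, the requisition ids, and any particular minimum are assumptions, and the LLM-as-judge rubric is not shown.

```python
from typing import Dict


def classification_accuracy(predicted: Dict[str, str], labelled: Dict[str, str]) -> float:
    """Share of labelled requisitions whose predicted quadrant matches the label."""
    if not labelled:
        raise ValueError("labelled dataset is empty")
    hits = sum(1 for req_id, label in labelled.items() if predicted.get(req_id) == label)
    return hits / len(labelled)


def eval_gate(accuracy: float, minimum: float) -> bool:
    """CI gate: pass only when accuracy meets the (assumed) minimum."""
    return accuracy >= minimum
```

A missing prediction counts as a miss, so an agent that fails to classify some requisitions cannot score better than one that classifies them wrongly.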

Three test types, three different oracles, all running in CI before merge. The point isn't that LLMs are unreliable — they're plenty reliable for what we ask them to do. The point is that reliability of a structural property (every $500k award routes to human review, no exceptions) and quality of an adaptive output (this Kraljic classification is right) are different claims with different proofs. Conflating them is what makes pure-LLM-planner systems hard to ship into regulated workflows.

Deterministic DAG

Governance rules are code. Every approval gate, every spend threshold, every policy constraint, every recovery semantic is a typed function you can unit-test, version-control, and audit by reading the source.

LLM agents at the leaves

Adaptive intelligence where it earns its place — classification nuance, supplier communication tone, multi-constraint bid scoring — bounded by steering hooks and evaluated against quality benchmarks independent of flow correctness.