Engineering · Architecture Decision

How we decided to build Buyer Team

Everyone asks why we chose Strands. It's the wrong first question. The SDK was the third decision in a tree — after "build from scratch or adopt?" and "which language?" — and by the time we reached it, the architecture had deliberately made it the smallest, most reversible choice of the three.

Buyer Team runs seven specialized procurement-negotiation agents in production. They classify spend, run auctions, evaluate bids, draft counter-offers, and award purchase orders — under guardrails, with a full audit trail, across tenants whose data must never touch. When engineers see the stack, the question is almost always the same: why the Strands Agents SDK and not LangGraph, CrewAI, or your own loop?

It's a fair question, but it skips the two decisions that came before it and that did most of the work. The real question we sat down to answer in the design phase was broader: what is the most effective and efficient way to implement this system at all? Hand-rolled or SDK-based? Python, TypeScript, Go, or the JVM? This post is the full decision record — and the pattern that emerges is the actual lesson: each architectural choice we'd already made shrank the next decision below it.

Decision 1 — the agent layer Build from scratch · Adopt an SDK → adopt

↓ narrows to

Decision 2 — the language Python · TypeScript · Go / Rust · JVM → Python

↓ narrows to

Decision 3 — the SDK CrewAI · AutoGen · LangGraph · roll-your-own · Strands → Strands

The question behind the question

Before any of the three decisions, one principle was already fixed: orchestration before intelligence. A deterministic AWS Step Functions DAG owns the negotiation lifecycle — sequence, retries, audit, human-in-the-loop interrupts. LLM agents supply judgment at each decision point, then hand control back. Durable state lives in portable services: DynamoDB and Step Functions, not inside any agent process.

And the runtime question was already settled too — we covered that decision in why we chose Bedrock AgentCore. AgentCore owns tenant isolation, burst concurrency, outbound OAuth, and evaluation coverage as managed infrastructure.

Why this matters for the implementation choice

Those two prior decisions removed the hardest problems from the agent code entirely. What was left to implement was the inner agent loop: the reasoning cycle, tool dispatch, and the behavioral guardrails inside a single negotiation turn. That is a thin layer by design — which means the build-vs-buy, language, and SDK decisions were all scoped to something deliberately small and replaceable. The architecture shrank the decision before we made it.

Fig. 1 — Two prior decisions removed the hard questions; the implementation choice was scoped to the thin layer that remained.

Decision 1 — build from scratch, or adopt an SDK?

The case for hand-rolling is real. An agent loop is conceptually simple: call the model, parse tool requests, execute, append results, repeat. A competent engineer has a v1 in days. No dependency churn, no abstractions fighting you, total control. For a team whose product is the agent infrastructure, that's often the right call.

But "the loop" is the cheap part. What we'd actually own, forever, is everything around it:

⊖

What building from scratch really costs

Streaming and retry semantics across model round-trips. A2A server plumbing so seven agents interoperate. OTEL span emission on every tool call. Prompt-cache management against a <$0.10-per-negotiation cost target. And the security-critical piece: a guardrail chokepoint that every tool invocation must pass through. That last one is exactly the kind of code that's easy to get subtly wrong — and a chokepoint that fails open is worse than no chokepoint, because you trust it.

⊕

What adopting really buys

Not convenience — reviewed, exercised infrastructure for the parts where a subtle bug is a security incident. An SDK-owned event loop means the guardrail hook sits below the model's reach by construction, hardened by every other production user. Our engineering goes into negotiation logic and guardrail policy — the things only we can write — instead of re-deriving dispatch plumbing.

Fig. 2 — Decision 1: what building from scratch actually puts on the team's books.

We adopted — with one condition that runs through this whole series: the SDK must stay replaceable. Durable state lives outside the agents; orchestration belongs to Step Functions; the SDK is confined to the inner loop. If we ever rip it out, we rewrite a thin layer, not the system. That condition is what made Decision 1 safe to take quickly.

Decision 2 — which language?

This one generates the most heat and deserved the least. But first, a constraint worth ruling out explicitly: the runtime didn't force the choice. AgentCore's first-party tooling is Python-first, but the Runtime executes containers — it's language-agnostic by construction. And this wasn't theoretical for us: our architect had already built and run JVM agents (Spring AI, deployed as ARM64 containers with OTEL/ADOT tracing) fully adapted to AgentCore Runtime. We knew from experience that any language on this list could ship. The field was genuinely open, which means what follows is a real decision, not a rationalized constraint. Three observations decided it.

First, the workload shape. A negotiation agent is I/O-bound: it waits on Bedrock, on supplier MCP servers, on DynamoDB. Raw compute throughput — the headline argument for Go or Rust — is solving a problem we don't have. And the concurrency story that does matter (one auction fanning out to 200 concurrent bids) is handled by AgentCore's microVM-per-session model, not by the language runtime. The platform absorbed the argument for a systems language before it could be made.

Second, ecosystem gravity. The agentic ecosystem is overwhelmingly Python-first: SDKs (or frameworks, as most brand themselves), eval tooling, guardrail libraries, reference implementations, and the hiring pool that knows them. Going against that gravity is a permanent tax paid on every integration.

Third — and decisively — what each language unlocked in the SDK shortlist. At decision time, the capability matrix was lopsided: the SDKs we were evaluating carried their full production surface — multi-agent graph patterns, SDK-owned hook systems, the OpenTelemetry stack — in Python. TypeScript coverage, where it existed at all, was preview-grade and missing exactly the capabilities Buyer Team depends on: graph-based human-in-the-loop interrupts and OTEL end-to-end. The language decision wasn't a taste decision. It was a capability matrix.

Why language before SDK, if they're coupled?

Honestly: decisions 2 and 3 were evaluated jointly. A language is only as good as the SDKs it gives you, and an SDK shortlist presupposes a language — you can't score one axis without peeking at the other. The tree is ordered not by evaluation chronology but by size and irreversibility of commitment. The language touches hiring, tooling, and every line of code we write; the SDK is confined to the inner loop and contractually replaceable (Decision 1's condition). Bigger, harder-to-reverse decisions sit higher in the tree — so when the joint evaluation resolved, we recorded the language verdict first.

Option	Agentic ecosystem	SDK capability (at decision time)	Workload fit	Verdict
Python	Deepest by far	Production SDK, full surface	I/O-bound: fine	chosen
TypeScript	Growing fast	Preview — no Graph, no OTEL stack	I/O-bound: fine	capability gap
Go / Rust	Thin	No first-party agent SDK	Solves a problem we don't have	wrong axis
JVM / Spring AI	Maturing	Proven on AgentCore (containers); separate SDK lineage	Credible, heavier	credible, declined

The honest caveat: the TypeScript SDK has since reached 1.0, and the gap is closing. If we were deciding today, the language column would be a closer contest — though ecosystem gravity and our existing eval tooling would likely still tip it. Python's real costs — packaging friction, optional typing — we paid down deliberately with modern tooling: uv for environments, Pydantic models on every tool boundary, and strict type-checking in CI. The weaknesses are manageable; they're just not free.

Decision 3 — which SDK?

Only now does the familiar question arrive — and notice how much smaller it has become. We're not choosing an orchestrator (Step Functions owns that), not choosing a runtime (AgentCore owns that), not choosing where state lives (DynamoDB owns that). We're choosing a library for the inner loop, against one non-negotiable: guardrails the LLM cannot bypass. Buyer Team moves real money under real governance. Confidentiality and policy controls must be a property of the architecture, not of the prompt.

CrewAI

Excellent for role-based collaborative crews — but it wants to be the orchestrator. Its strengths duplicate the layer Step Functions already owns, and its guardrail story is prompt- and convention-level, not a structural chokepoint.

AutoGen

Built for conversational, emergent multi-agent reasoning. Powerful where emergent behavior is the goal; in a governed negotiation, emergence is precisely what the deterministic DAG exists to prevent.

LangGraph

The serious contender. But its core strength — graph-based orchestration with checkpointing — again duplicates Step Functions, and its guardrail guarantee is topological: a guard holds only if every path through the graph routes through the guard node. One stray edge added in a refactor and a guardrail silently disappears. That's a guarantee maintained by discipline, not by construction.

Roll-your-own loop

Already resolved by Decision 1 — but it re-enters here as the baseline every SDK must beat: if an SDK doesn't give us something we couldn't responsibly build, adopting it is pure dependency risk.

Strands

What won it is one structural property: the hook system is an SDK-owned chokepoint. Every tool invocation — every one — passes through BeforeToolCallEvent, where our steering hooks inspect, modify, or cancel_tool before any side effect occurs. The model cannot route around it, because the route doesn't exist; there is one door, and the guard is bolted to it. It fails closed. On top of that: A2A deployment as a first-class primitive, native Bedrock AgentCore alignment (OTEL spans, Evaluations as the quality gate, per-tenant model selection), and an agent loop that preserves prompt-cache prefix purity — which is what keeps us under the $0.10-per-negotiation cost KPI.

Fig. 3 — Decision 3's non-negotiable: a guardrail guarantee maintained by review vs. one that holds by construction.

Option	Guardrail chokepoint	Orchestration posture	AgentCore / A2A fit	Verdict
CrewAI	Prompt-level	Wants to own it	Partial	layer overlap
AutoGen	Convention	Emergent by design	Partial	wrong shape
LangGraph	Topological — edge-dependent	Duplicates Step Functions	Portable, generic	discipline-bound
Roll-your-own	Ours to get right, forever	Neutral	Build everything	resolved in D1
Strands	SDK-owned, fails closed	Library, not orchestrator	Native	chosen

What "effective and efficient" actually meant

◎

Effective — the guarantees

Guardrails that survive a misbehaving model, by construction. A complete OTEL audit trail on every negotiation. Hard tenant isolation below the agent layer. And replaceability: every layer of the tree — SDK, even language — can be swapped without touching durable state or orchestration.

⚡

Efficient — the economics

A small team shipping seven production agents in weeks, not quarters, because we built only the layer that differentiates us. Engineering hours spent on negotiation logic and guardrail policy. AI cost held under $0.10 per negotiation through prompt-cache purity the SDK preserves rather than fights.

The honest verdict

Not "Strands is best," and not even "Python is best." Rather: for a small team building governed, multi-tenant negotiation agents, the effective and efficient implementation was the one that made every decision as small as possible before making it. Orchestration before intelligence took the workflow off the table. AgentCore took the runtime off the table. What remained — adopt over build, Python over the field, Strands over its peers — was a sequence of progressively smaller bets, each reversible in inverse order of its importance.

If your system needs the SDK to be the orchestrator, or you're building exploratory multi-agent reasoning where emergent behavior is the goal, the tree resolves differently — CrewAI, AutoGen, or LangGraph may well win, and TypeScript's gap has narrowed since we chose. SDKs are decisions you should be willing to revisit as the system and the ecosystem evolve. But decide them in order, and let the architecture shrink each one first. For this boundary, this constraint set, this team: adopt, Python, Strands.