Wintermute Framework, Part 5: A First Agentic Flow
Across Parts 1–4 we built up the plumbing: the engagement data model, the console, the AI router, the cartridges and MCP. Now we build the first actual agent.
The intentional constraint of this post is one prompt, one tool-calling loop, one test case. We are not yet looping over the whole test plan (Part 6) or dispatching sub-agents (Part 7). The point is to make the single-step pattern bulletproof so the orchestrator above it has a known primitive to compose.
What “Agentic” Means in Wintermute
For our purposes, an agent is the standard loop:
1. Build a ChatRequest with messages + tool specs2. provider.chat(req) → ChatResponse3. If response has tool_calls: for each call: result = ToolsRuntime.run_tool(name, args) append a tool message; goto 24. Otherwise: stop. The response.content is the answer.
That is the loop we want. The framework gives us most of it for free —
Router.choose, tool_calling_chat, and ToolsRuntime.run_tool. What we
have to write is:
- the system prompt that explains the engagement context,
- the tool selection (a curated subset of the global registry),
- the post-loop interpretation that turns the model’s final answer into
mutations on the liveOperation.
The Test Case We Will Automate
Continuing the IoT-camera engagement from Part 2, we automate a single
case: IOT-HW-I2C-001:iot-cam-01:mcio-eeprom-1 — “extract the I²C EEPROM
on the MCIO board and find anything sensitive.”
In Part 4 we wrote I2CCartridge and walked through the calls manually.
Here, the LLM picks the calls itself.
The Code, Top to Bottom
"""Part 5: One-shot agent that executes a single TestCaseRun on theIoT-camera operation. Uses Wintermute's tool_calling_chat againstBedrock Claude with the i2c + firmware_analysis cartridges loaded.Pre-conditions: - operation `acme-iotcam-2026-Q2` already saved (Part 2). - JsonFileBackend registered. - AWS_REGION + Bedrock model configured."""from __future__ import annotationsimport loggingfrom datetime import datetime, timezonefrom wintermute.ai.bootstrap import init_routerfrom wintermute.ai.tools_runtime import tools as registry, spec_from_toolfrom wintermute.ai.types import ChatRequest, Message, ToolSpecfrom wintermute.ai.use import tool_calling_chatfrom wintermute.backends.json_storage import JsonFileBackendfrom wintermute.cartridges.manager import CartridgeManagerfrom wintermute.core import Operation, RunStatusfrom wintermute.findings import ReproductionStep, Vulnerabilitylog = logging.getLogger("wintermute.agent.part5")logging.basicConfig(level=logging.INFO)# ---------------------------------------------------------------------------# 0. Engagement load# ---------------------------------------------------------------------------Operation.register_backend( "json", JsonFileBackend(base_path="./.wintermute_data"), make_default=True)op = Operation("acme-iotcam-2026-Q2")op.load()# Find the run we want to drive.target_run_id = "IOT-HW-I2C-001:iot-cam-01:mcio-eeprom-1"run = next(r for r in op.test_runs if r.run_id == target_run_id)tc = next(t for t in op.iterTestCases() if t.code == run.test_case_code)peripheral = next(p for p in op.getDeviceByHostname("iot-cam-01").peripherals if p.name == "mcio-eeprom-1")# ---------------------------------------------------------------------------# 1. AI bootstrap + cartridge load (capability surface)# ---------------------------------------------------------------------------router = init_router() # Bedrock defaultmgr = CartridgeManager()for name in ("i2c", "firmware_analysis"): if name not in mgr.list_loaded(): mgr.load(name)# A curated tool list — we DO NOT hand the agent the full global registry.# The 80+ MCP tools would balloon the prompt and tempt the model into# unrelated work.allowed_tools = ("detect", "dump_eeprom", "extract_strings", "scan_for_secrets")tool_specs: list[ToolSpec] = [ spec_from_tool(t) for name, t in registry._tools.items() if name in allowed_tools]log.info("Tools available to agent: %s", [t.name for t in tool_specs])# ---------------------------------------------------------------------------# 2. System prompt — engagement-grounded# ---------------------------------------------------------------------------system = f"""You are an autonomous hardware security analyst working asanctioned penetration test for ACME on engagement{op.operation_name!r}.Scope (the only thing you may touch in this turn): - Device : {peripheral.__dict__.get('name', 'unknown')!r} on host iot-cam-01 - Bus : I2C-2 - Address: 0x{0x50:02x} - Test case: {tc.code} — {tc.name}Available tools: detect() -> list of int (responding I2C addresses on bus 2) dump_eeprom(address: int, size: int) -> blob descriptor (file_path/sha256/size) extract_strings(file_path: str, min_length: int) -> top printable strings + counts scan_for_secrets(file_path: str) -> regex+byte-pattern hits (PEM, AWS keys, ...)Process:1. detect() to confirm the EEPROM is at 0x50.2. dump_eeprom(0x50, 256). Pass the returned `file_path` to subsequent tools — never embed bytes in your reply.3. Run extract_strings (min_length=8) and scan_for_secrets on the dump.4. Decide whether the dump contains a security-relevant artifact: credentials, private keys, hardcoded URLs, command tokens.5. Reply with a JSON object of the form: {{ "verdict": "passed" | "failed", "title": "<concise vuln title or empty>", "description": "<2-3 sentences>", "cvss": <int 0-10>, "evidence_offsets": ["0x<hex>", ...], "interesting_strings": ["<str>", ...] }} `passed` means no finding; `failed` means a vulnerability is recorded.Constraints:- Do not call any tool outside the four above.- Do not invent reproduction steps; the framework records them from your tool calls.- Stop after one diagnostic round-trip."""# ---------------------------------------------------------------------------# 3. Run — start the test run, dispatch the loop, parse the verdict# ---------------------------------------------------------------------------run.status = RunStatus.in_progressrun.start()run.executed_by = "wintermute-agent-part5"messages = [ Message(role="system", content=system), Message(role="user", content=f"Execute test case {tc.code} now. Reply with the JSON verdict only."),]# Loop: tool_calling_chat handles one round; we drive the round-trip until# the model returns content with no tool_calls.import jsonwhile True: resp = tool_calling_chat(router, messages, tool_specs, response_format="text") if not resp.tool_calls: verdict = json.loads(resp.content) break # Execute every tool the model requested, append tool messages, loop. messages.append( Message(role="assistant", content=resp.content or "", tool_call_id=None, tool_name=None) ) for call in resp.tool_calls: result = registry.call(call.name, dict(call.arguments)) messages.append( Message(role="tool", content=json.dumps(result, default=str), tool_name=call.name, tool_call_id=call.id) )# ---------------------------------------------------------------------------# 4. Write the verdict back into the live operation# ---------------------------------------------------------------------------if verdict["verdict"] == "failed": vuln = Vulnerability( title=verdict["title"], description=verdict["description"], cvss=int(verdict["cvss"]), threat="unauthorized device access via static artifact", verified=True, reproduction_steps=[ ReproductionStep( title="Probe I2C bus 2", tool="i2c.detect", action="probe", confidence=85, arguments=[]), ReproductionStep( title="Read 256 bytes from 0x50", tool="i2c.dump_eeprom", action="read", confidence=90, arguments=["0x50", "256"]), ReproductionStep( title="Static analysis: strings + secret scan", tool="firmware_analysis", action="analyze", confidence=80, arguments=verdict.get("evidence_offsets", [])), ], ) run.findings.append(vuln) run.status = RunStatus.failedelse: run.status = RunStatus.passedrun.notes = ( f"agent={run.executed_by} ts={datetime.now(timezone.utc).isoformat()} " f"strings_seen={len(verdict.get('interesting_strings', []))} " f"verdict={verdict['verdict']}")run.finish()op.save()log.info("Test run %s -> %s", run.run_id, run.status.value)
Run this against a real EEPROM (or the in-memory transport pattern from
examples/07-Programmatic-Hardware-Cartridges.ipynb) and the agent loop
will:
- call
detect()→ see0x50, - call
dump_eeprom(0x50, 256)→ get a blob descriptor, - call
extract_strings(file_path=..., min_length=8)→ see something like["admin:hunter2", "http://10.0.0.1/api/v1/login"], - call
scan_for_secrets(file_path=...)→ maybe see no secret, maybe see
PEM offsets, - reply with JSON (e.g.,
{"verdict": "failed", "title": "Hardcoded credentials in MCIO EEPROM", "cvss": 8, ...}),
and the post-loop block writes a Vulnerability with ReproductionSteps
into the TestCaseRun, marks the run failed, and persists.
This is the first usable agent. It has scope, structure, real side-effects on the operation, and reproduction steps. It is also the exact pattern the orchestrator (Part 6) and the per-run sub-agents (Part 7) call internally — one agent, one run.
What This Agent Already Does Right
A few decisions in the code above are not obvious but matter:
- Curated tool list. Only the four cartridge methods relevant to this
test case are intool_specs. This is critical for cost (smaller prompts),
determinism (the model can’t “improvise” withrun_ssh_command), and
blast radius (a misfire can’t reach Surgeon’sstart_fuzzing). Part 7’s
sub-agent generalizes this withselect_tools_for_test_case(tc). - Engagement-grounded system prompt. The prompt includes the actual
device, peripheral, address, and test-case code from the live operation —
not a generic “you are a pentester” preamble. The agent’s output is
concrete because its input is. - JSON verdict, not free text. We parse the verdict deterministically.
The framework cares about typed fields (cvss: int,status: RunStatus),
so the agent must produce typed output. This is what makes the result
attachable to aTestCaseRuninstead of pasted into a notes field. reproduction_stepsreflect the actual tool calls, not the model’s
prose. A reproduction step is an artifact a different analyst replays
six months later — itstool,action,argumentsare the cartridge
call shape, exactly the thing Part 7 will replay for retest passes.
Failure Modes and How the Framework Helps
A few things go wrong with single-loop agents in practice. Each has a corresponding Wintermute primitive:
- Tool-call infinite loops. The model keeps calling tools without
emitting a JSON verdict. Mitigation: cap iterations (afor _ in range(max_iters): ...around thewhile Trueis one line); the agent
loop in Part 7 also enforces a wall-clock budget. - Model returns junk JSON. Wrap
json.loadsin try/except, push a
retry message (“Your previous reply was not valid JSON. Reply only with
the schema.”). One retry is usually enough; two is the upper bound. - Tool returns are huge. Already handled —
tool_factory._maybe_offload_payload
routes raw bytes and >1 KiB strings into the workspace and hands the LLM
a compact descriptor. You almost never need to truncate manually. - Provider rate limit. Wintermute ships
wintermute/ai/retry.py
(Backoff); the cloud providers wrap theirchat()in it. Backoff
defaults are conservative; tune inbedrock_provider.py/groq_provider.pyper engagement.
A Useful Variant: Read-Only Recon Agent
Same loop, different tool list, no mutation. This is the agent you run during recon when you don’t want any state writes:
tool_specs = [ spec_from_tool(t) for name, t in registry._tools.items() if name in ("detect", "extract_strings", "scan_for_secrets", "ai_list_test_runs") # console-bound read-only AI tool]system = "You are a recon analyst. Only read. ..."# ... same loop, no `run.findings.append(...)`.
The model still gets the engagement context, still chooses tools, still emits a JSON verdict — but the post-loop simply prints the verdict instead of mutating the run. Useful for “what should we test next?”-style brainstorming where letting the LLM commit to a finding is premature.
What’s Next
Part 6 lifts this pattern up. We will read a Bugzilla/Jira ticket through Ticket.read(...), derive scope, attach a TestPlan, generate TestCaseRuns, and enqueue one Part-5-style agent per run. The orchestrator is the loop above this loop; the per-run agent is exactly what we wrote here.
Part 7 takes the next step: the orchestrator dispatches sub-agents per TestCaseRun, each with its own curated tool surface tailored to the test case (UART → JTAG cartridge + SSH; I²C → I2CCartridge + firmware_analysis; cloud IAM → AWS-only tools), and the orchestrator reasons over the merged results.






Leave a Reply