Wintermute Framework, Part 7: Per-Test-Case Sub-Agents
This is the closing post. Across the series we built up:
- the engagement model and architecture (Part 1),
- driving Wintermute by hand (Part 2),
- the AI router, RAG, and the tool registry (Part 3),
- cartridges, the MCP runtime, and Surgeon (Part 4),
- a single-test-case agent (Part 5),
- the orchestrator that reads a ticket and fans out runs (Part 6).
Now we build the body of run_subagent_for_test_case_run — the function the
orchestrator calls per TestCaseRun. Each invocation:
- fetches the reproduction steps from the parent TestCase,
- picks the right cartridges and MCP tools for the run’s bound target kind (UART → JTAG; I²C → I²C cartridge + firmware analysis; TPM → tpm20 cartridge; AWS-IAM → AWS MCP),
- runs a Part-5-style loop with a curated tool surface,
- verifies the result (re-reads what changed, captures reproduction artifacts),
- mutates the live TestCaseRun (status, notes, findings).
By the end of this post the IoT-camera engagement runs autonomously from the ticket on the operator’s desk to the DOCX report in the customer’s inbox.
The Sub-Agent Contract
    def run_subagent_for_test_case_run(
        op: Operation,
        run: TestCaseRun,
        *,
        router: Router,
        wall_clock_budget_s: int = 180,
        max_iterations: int = 10,
    ) -> TestCaseRun:
        """Drive a single TestCaseRun to a terminal status.

        Picks a tool surface tailored to run.bound, builds a system prompt
        with the parent test case's reproduction steps, and runs a
        tool-calling loop until either the agent emits a JSON verdict or the
        iteration / wall-clock budget is exhausted.

        Always returns with `run.status` in a terminal state (passed, failed,
        blocked, or not_applicable). Always sets run.executed_by and run.notes.
        """
A few invariants that matter when this is invoked from the orchestrator:
- Idempotent on passed/failed. If the run is already terminal, return immediately. Lets the operator retry the orchestrator without re-executing completed runs.
- Side-effects in one place. Every mutation to run happens inside this function. The orchestrator never reaches in.
- No silent failures. A timeout or tool exception sets run.status = blocked and run.notes describing the cause. The DOCX report shows the block reason directly.
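The idempotency and no-silent-failure invariants can be pictured as a thin guard around the sub-agent body. The sketch below is illustrative only: `RunStatus` and `Run` are stand-ins reduced to the fields used here (not Wintermute's real classes), and `guarded` is a hypothetical helper name.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class RunStatus(Enum):  # stand-in for Wintermute's RunStatus
    not_run = "not_run"
    in_progress = "in_progress"
    passed = "passed"
    failed = "failed"
    blocked = "blocked"
    not_applicable = "not_applicable"


TERMINAL = {RunStatus.passed, RunStatus.failed,
            RunStatus.blocked, RunStatus.not_applicable}


@dataclass
class Run:  # stand-in for TestCaseRun, reduced to the fields used here
    status: RunStatus = RunStatus.not_run
    notes: str = ""
    executed_by: str = ""


def guarded(body: Callable[[Run], None], run: Run) -> Run:
    """Run `body` so that `run` always ends in a terminal status."""
    if run.status in TERMINAL:
        return run  # idempotent: never re-execute a finished run
    run.executed_by = "wintermute-subagent"
    try:
        body(run)
    except Exception as exc:  # no silent failures: record the cause
        run.status = RunStatus.blocked
        run.notes = f"{run.notes}\n{exc!r}".strip()
    if run.status not in TERMINAL:
        # The body returned without deciding; treat that as blocked too.
        run.status = RunStatus.blocked
        run.notes = f"{run.notes}\nbody returned without a verdict".strip()
    return run
```

With this shape, a retry of the orchestrator simply calls `guarded` again for every run; finished runs fall through the first check untouched.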
The Tool-Surface Selector
The single most important design decision for sub-agents is which tools each one sees. Hand the agent everything (80+ MCP tools + 16+ cartridge methods) and the prompt explodes, the model wanders, and runs cost more.
Selection is driven by the BoundObjectRef.kind and the bound objects’
types — i.e., what the test case is actually bound to.
    from typing import Iterable

    from wintermute.ai.types import ToolSpec
    from wintermute.ai.tools_runtime import tools as registry, spec_from_tool
    from wintermute.cartridges.manager import CartridgeManager

    # Mapping: cartridges to load + which of their methods belong on the surface.
    # Profiles for every bound object are merged; duplicate tool names are dropped.
    _TOOL_PROFILES = {
        # IoT / hardware
        "uart": (
            ["jtag"],                                    # cartridges to ensure loaded
            ("halt_core", "resume_core",                 # JTAG primitives
             "read_memory", "read_registers",
             "open_ssh_session", "run_ssh_session_command",   # via MCP runtime
             "run_ssh_session_background", "poll_ssh_background_job"),
        ),
        "jtag": (
            ["jtag", "firmware_analysis"],
            ("halt_core", "resume_core", "read_memory", "read_registers",
             "dump_firmware", "analyze_entropy", "scan_for_secrets",
             "extract_strings", "find_base_address"),
        ),
        "i2c": (
            ["i2c", "firmware_analysis"],
            ("detect", "dump_eeprom", "extract_strings", "scan_for_secrets"),
        ),
        "tpm": (
            ["tpm20"],
            ("get_random", "test_pcr_state", "test_da_lockout",
             "read_public", "fuzz_command"),
        ),
        "spi-flash": (
            ["firmware_analysis"],
            ("analyze_entropy", "scan_for_secrets", "extract_strings",
             "find_base_address"),
        ),
        # Cloud / red team
        "aws-iam-role": (
            [],                                          # no cartridges; tools come from external MCP
            ("aws_iam_get_role", "aws_iam_simulate_principal_policy",
             "aws_sts_assume_role", "aws_iam_list_attached_policies"),
        ),
        "aws-s3-bucket": (
            [],
            ("aws_s3_list_objects", "aws_s3_get_bucket_policy",
             "aws_s3_get_object_acl", "aws_s3_get_public_access_block"),
        ),
        "burp-target": (
            [],
            ("burp_active_scan", "burp_get_issues", "burp_get_sitemap"),
        ),
    }


    def _peripheral_kind(p: object) -> str:
        """Map a Peripheral / cloud object to a profile key."""
        p_type = (getattr(p, "pType", None) or "").lower()
        if p_type:
            # "uart" / "i2c" / "spi" / "jtag" / ...
            # Assumption: a plain "spi" pType is meant to use the "spi-flash"
            # profile; without this alias it would match no profile at all.
            return "spi-flash" if p_type == "spi" else p_type
        cls = type(p).__name__.lower()
        if "tpm" in cls:
            return "tpm"
        if "uart" in cls:
            return "uart"
        if "jtag" in cls:
            return "jtag"
        if "iamrole" in cls:
            return "aws-iam-role"
        if "s3" in cls or "bucket" in cls:
            return "aws-s3-bucket"
        return "spi-flash"   # safe analyzer-only fallback


    def select_tool_surface(run: TestCaseRun, op: Operation) -> list[ToolSpec]:
        """Resolve run.bound to objects, classify, and return the tool surface."""
        profile_keys: list[str] = []
        for b in run.bound:
            # Find the actual object in the operation: peripherals first.
            obj = None
            for d in op.devices:
                for p in d.peripherals:
                    if p.name == b.object_id or getattr(p, "device_path", "") == b.object_id:
                        obj = p
                        break
                if obj is not None:
                    break
            # Then cloud objects. Stop at the first match on every loop level
            # so a later account cannot overwrite an earlier hit.
            if obj is None:
                for acc in op.cloud_accounts:
                    for lst_name in ("iamroles", "iamusers", "services"):
                        for x in getattr(acc, lst_name, []):
                            if getattr(x, "role_name", "") == b.object_id \
                               or getattr(x, "username", "") == b.object_id \
                               or getattr(x, "name", "") == b.object_id:
                                obj = x
                                break
                        if obj is not None:
                            break
                    if obj is not None:
                        break
            if obj is not None:
                profile_keys.append(_peripheral_kind(obj))

        # Load required cartridges + collect tools
        mgr = CartridgeManager()
        wanted_tools: list[str] = []
        for k in profile_keys:
            cartridges, tool_names = _TOOL_PROFILES.get(k, ([], ()))
            for c in cartridges:
                if c not in mgr.list_loaded():
                    try:
                        mgr.load(c)
                    except Exception:
                        # Best effort: a cartridge that fails to load leaves its
                        # tools out of the registry, so they are skipped below.
                        pass
            wanted_tools.extend(tool_names)

        # De-duplicate while keeping order; skip names the registry does not know.
        seen, surface = set(), []
        for name in wanted_tools:
            if name in seen or name not in registry._tools:
                continue
            seen.add(name)
            surface.append(spec_from_tool(registry._tools[name]))
        return surface
This is the lever the framework gives you: we pick the tool surface
from the live operation’s actual objects, not from a static config. Add an
SPI flash to a device → next sub-agent for that device sees the firmware
analysis cartridge. Replace tpm20 with a fork → swap one cartridge name
and the dispatch is automatic.
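The merge-and-dedupe step at the end of select_tool_surface can be tried in isolation. What follows is a minimal sketch, assuming a trimmed copy of the profile table and a hypothetical `merge_profiles` helper; the `available` set stands in for the names present in the real registry.

```python
# Trimmed, illustrative copy of the profile table: key -> (cartridges, tools).
PROFILES = {
    "i2c": (["i2c", "firmware_analysis"],
            ("detect", "dump_eeprom", "extract_strings", "scan_for_secrets")),
    "spi-flash": (["firmware_analysis"],
                  ("analyze_entropy", "scan_for_secrets", "extract_strings")),
}


def merge_profiles(keys, available):
    """Return (cartridges, tools) for the given profile keys.

    Order is preserved, duplicates are dropped, and tool names that are
    not in `available` (the registry stand-in) are skipped.
    """
    cartridges, tools = [], []
    for key in keys:
        wanted_cartridges, wanted_tools = PROFILES.get(key, ([], ()))
        for c in wanted_cartridges:
            if c not in cartridges:
                cartridges.append(c)
        for t in wanted_tools:
            if t in available and t not in tools:
                tools.append(t)
    return cartridges, tools


registered = {"detect", "dump_eeprom", "extract_strings",
              "scan_for_secrets", "analyze_entropy"}
cartridges, tools = merge_profiles(["i2c", "spi-flash"], registered)
print(cartridges)  # ['i2c', 'firmware_analysis']
print(tools)       # ['detect', 'dump_eeprom', 'extract_strings', 'scan_for_secrets', 'analyze_entropy']
```

A run bound to both an I²C EEPROM and an SPI flash ends up with five tools rather than seven, because the two shared analyzers appear only once.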
The Sub-Agent Body
    import json
    import time
    from datetime import datetime, timezone

    from wintermute.ai.use import tool_calling_chat
    from wintermute.ai.types import ChatRequest, Message
    from wintermute.core import RunStatus, TestCaseRun, Operation
    from wintermute.findings import ReproductionStep, Vulnerability

    SUBAGENT_SYSTEM = """You are an autonomous sub-agent for one TestCaseRun on
    a sanctioned penetration test. You operate ONLY on the bound target.

    Output contract: your final reply must be a single JSON object:
      {
        "verdict": "passed" | "failed" | "blocked" | "not_applicable",
        "title": "<concise vulnerability title or empty>",
        "description": "<2-4 sentence finding description>",
        "cvss": <int 0-10>,
        "evidence": {<key>: <value>, ...},   // tool outputs you want recorded
        "tool_trace": [{"tool": "<name>", "args": {...}}, ...]
      }

    You may only call the tools listed in your tool spec. Do not invent tools.
    Do not exceed your iteration budget. If a destructive operation would be
    required to confirm the finding, return "blocked" and put the reason in
    description.
    """


    def _bound_summary(run: TestCaseRun, op: Operation) -> dict:
        items = []
        for b in run.bound:
            items.append({"alias": b.alias, "kind": b.kind, "object_id": b.object_id})
        return {"run_id": run.run_id, "test_case": run.test_case_code, "bound": items}


    def run_subagent_for_test_case_run(
        op, run, *, router,
        wall_clock_budget_s: int = 180,
        max_iterations: int = 10,
    ):
        if run.status.value not in ("not_run", "in_progress"):
            return run   # idempotent

        surface = select_tool_surface(run, op)
        if not surface:
            run.status = RunStatus.blocked
            run.notes = "sub-agent: no tool surface for bound target"
            run.executed_by = "wintermute-subagent"
            run.finish()
            return run

        # A missing parent test case must block the run, not raise StopIteration.
        tc = next((t for t in op.iterTestCases() if t.code == run.test_case_code), None)
        if tc is None:
            run.status = RunStatus.blocked
            run.notes = f"sub-agent: test case {run.test_case_code} not found"
            run.executed_by = "wintermute-subagent"
            run.finish()
            return run

        steps_text = "\n".join(
            f"- {i+1}. {s.title}: {s.description} (tool={s.tool}, action={s.action})"
            for i, s in enumerate(tc.steps)
        )
        user_prompt = (
            f"{json.dumps(_bound_summary(run, op), indent=2)}\n\n"
            f"Test case: {tc.code} - {tc.name}\n"
            f"Description: {tc.description}\n\n"
            f"Reproduction steps from the test plan (use these as guidance, "
            f"adapt to the available tools):\n{steps_text}\n\n"
            f"Available tools: {[s.name for s in surface]}\n\n"
            "Execute the test now. Reply with the JSON verdict only."
        )
        messages = [
            Message(role="system", content=SUBAGENT_SYSTEM),
            Message(role="user", content=user_prompt),
        ]

        run.status = RunStatus.in_progress
        run.start()
        run.executed_by = f"wintermute-subagent ({datetime.now(timezone.utc).isoformat()})"

        deadline = time.monotonic() + wall_clock_budget_s
        tool_trace: list[dict] = []
        verdict: dict | None = None

        for _ in range(max_iterations):
            if time.monotonic() > deadline:
                run.status = RunStatus.blocked
                run.notes = (run.notes + "\n" if run.notes else "") + "wall-clock budget exhausted"
                run.finish()
                return run

            # task_tag="cheap": sub-agents go to Groq when registered.
            resp = tool_calling_chat(
                router, messages, surface,
                response_format="text", task_tag="cheap",
            )

            if not resp.tool_calls:
                # No tool calls means the agent is answering: expect the verdict.
                try:
                    verdict = json.loads(resp.content)
                except json.JSONDecodeError:
                    messages.append(Message(
                        role="user",
                        content="Your last reply did not parse as JSON. Reply ONLY with the schema.",
                    ))
                    continue
                break

            # Append the assistant message + all tool results
            messages.append(Message(role="assistant", content=resp.content or ""))
            for call in resp.tool_calls:
                try:
                    result = registry.call(call.name, dict(call.arguments))
                except Exception as exc:
                    result = {"error": str(exc)}
                tool_trace.append({"tool": call.name, "args": dict(call.arguments)})
                messages.append(Message(
                    role="tool",
                    content=json.dumps(result, default=str),
                    tool_name=call.name,
                    tool_call_id=call.id,
                ))
        else:
            # for/else: the loop ran out of iterations without a verdict.
            run.status = RunStatus.blocked
            run.notes = (run.notes + "\n" if run.notes else "") + "iteration budget exhausted"
            run.finish()
            return run

        # ----- VERIFICATION & WRITE-BACK ---------------------------------------
        if verdict is None:
            run.status = RunStatus.blocked
            run.notes = "sub-agent: no verdict produced"
            run.finish()
            return run

        if verdict["verdict"] == "failed":
            repro = []
            for entry in (verdict.get("tool_trace") or tool_trace):
                repro.append(ReproductionStep(
                    title=entry["tool"],
                    description=f"agent invocation: {entry['tool']}({entry.get('args', {})})",
                    tool=entry["tool"],
                    action="agent",
                    confidence=80,
                    arguments=[json.dumps(entry.get("args", {}), default=str)],
                ))
            run.findings.append(Vulnerability(
                title=verdict.get("title") or f"Finding from {tc.code}",
                description=verdict.get("description", ""),
                cvss=int(verdict.get("cvss", 0)),
                verified=True,
                reproduction_steps=repro,
            ))
            run.status = RunStatus.failed
        elif verdict["verdict"] == "passed":
            run.status = RunStatus.passed
        elif verdict["verdict"] == "not_applicable":
            run.status = RunStatus.not_applicable
        else:
            run.status = RunStatus.blocked

        run.notes = (run.notes + "\n" if run.notes else "") + (
            f"verdict={verdict['verdict']} "
            f"evidence_keys={list((verdict.get('evidence') or {}).keys())} "
            f"tool_calls={len(tool_trace)}"
        )
        run.finish()
        return run
A few things this body does that matter for real engagements:
- Idempotent re-entry. The first if statement returns immediately for already-terminal runs.
- Wall-clock + iteration budgets. Both are enforced. Either expiring marks the run blocked, never passed by accident.
- Reproduction steps are derived from the actual tool calls the agent made. Six months later a different analyst replays i2c.dump_eeprom({"address": 0x50, "size": 256}) → extract_strings(...). That is a real artifact, not LLM prose.
- Cheap routing. task_tag="cheap" sends each sub-agent to Groq Llama 3.3 70B when registered. The orchestrator above stays on Sonnet.
- registry.call(...) per tool. That goes through ToolsRuntime-style dispatch: local cartridge methods, MCP tools, Surgeon, all looked up by name.
End-to-End Trace: The IoT Camera Ticket
Operator drops a ticket:
    Title: I2C EEPROM extraction on MCIO
    Tags: #i2c #eeprom #blackbox #hardware
    Description: Analyze I2C EEPROM on MCIO port. Suspected hardcoded credentials in flash dump. Target: iot-cam-01 (10.0.0.5). Bus: I2C-2, Address: 0x50.
Operator runs (over MCP from Claude Desktop, or from the REPL once we ship
orchestrator run):
    $ wintermute
    onoSendai > operation load acme-iotcam-2026-Q2
    onoSendai [acme-iotcam-2026-Q2] > orchestrator run T000001
Trace:
    [orchestrator] read_ticket:
        title="I2C EEPROM extraction on MCIO"
        tags=["i2c","eeprom","blackbox","hardware"]
        scope.target_host="iot-cam-01" scope.bus="I2C-2" scope.address="0x50"

    [orchestrator] plan_node:
        matched scope.tags ⊂ {hardware, blackbox} → attaching TP-HW-BLACKBOX-001
        generated 17 TestCaseRuns

    [orchestrator] dispatch_node:
        IOT-HW-GEN-001:iot-cam-01                  → defer (manual constraint review)
        IOT-HW-DISC-001:iot-cam-01                 → defer (manual board photos)
        IOT-HW-UART-001:iot-cam-01:debug-uart      → execute
        IOT-HW-I2C-001:iot-cam-01:mcio-eeprom-1    → execute
        IOT-HW-JTAG-001:iot-cam-01:main-jtag       → execute
        IOT-HW-FAULT-001:iot-cam-01                → defer (destructive)
        IOT-HW-TPM-001:iot-cam-01                  → skip (no TPM peripheral on device)
        ...
        9 execute / 6 defer / 2 skip

    [orchestrator] execute_runs_node: dispatching 9 sub-agents

    [subagent IOT-HW-I2C-001:iot-cam-01:mcio-eeprom-1]
        surface = [detect, dump_eeprom, extract_strings, scan_for_secrets]
        call detect()                        -> [80, 81]
        call dump_eeprom(0x50, 256)          -> {file_path: ".../blob-3a4f..bin", sha256: ...}
        call extract_strings(file=...,8)     -> top: ["admin:hunter2", "http://10.0.0.1/api/v1/login", "/bin/sh -c reboot"]
        call scan_for_secrets(file=...)      -> {pem_block: [], aws_key: []}
        verdict = {"verdict":"failed",
                   "title":"Hardcoded credentials in MCIO I²C EEPROM",
                   "description":"Recovered admin string and login URL from 256-byte EEPROM dump. The dump is unencrypted...",
                   "cvss":8, "evidence": {...}, "tool_trace":[...]}
        run.findings += Vulnerability(...)
        run.status = failed

    [subagent IOT-HW-UART-001:iot-cam-01:debug-uart]
        surface = [halt_core, resume_core, read_memory, read_registers, open_ssh_session, run_ssh_session_command, ...]
        call open_ssh_session(host=..., user=..., password=...)     -> session=...
        call run_ssh_session_command(session, "cat /proc/cmdline")  -> "console=ttyS0,115200 init=/bin/sh ..."
        call halt_core()                     -> True (target halted)
        call read_registers()                -> {pc: 0x80008034, lr: ..., sp: ...}
        call read_memory("0x80008000", 4)    -> "0x80008000: e3a01000 e58d1000 ..."
        verdict = {"verdict":"failed",
                   "title":"Interruptible U-Boot via UART; init=/bin/sh in cmdline",
                   "cvss":9, ...}
        run.status = failed

    [subagent IOT-HW-JTAG-001:iot-cam-01:main-jtag]
        surface = [halt_core, resume_core, read_memory, read_registers, dump_firmware, analyze_entropy, scan_for_secrets, extract_strings, find_base_address]
        call halt_core()                     -> True
        call dump_firmware("0x08000000", 1048576, "iotcam-flash.bin") -> blob descriptor (1 MiB)
        call analyze_entropy(file=...)       -> {overall: 7.85, high_entropy_blocks: [...]}
        call extract_strings(file=...,8)     -> top: ["root:$1$...", "wpa_supplicant ...", "BEGIN OPENSSH PRIVATE KEY"]
        call scan_for_secrets(file=...)      -> {pem_block: ["0x40123", "0x40e80"], aws_key: []}
        call find_base_address(file=..., arch="arm", min_addr=0x08000000, max_addr=0x40000000) -> {top_5: {"0x08000000": 412, ...}}
        verdict = {"verdict":"failed",
                   "title":"Firmware contains private keys + crypted root pwd hash; flash is openly dumpable via JTAG",
                   "cvss":9, ...}
        run.status = failed

    [orchestrator] checkpointed after each run -> .wintermute_data/...

    [orchestrator] report_node: ./reports/acme-iotcam-2026-Q2.docx
        - 9 runs completed (3 failed with vulnerabilities, 6 passed)
        - 8 runs blocked/deferred (operator review)
        - 2 runs not_applicable
What the operator gets, all from a Bugzilla ticket:
- 3 high-CVSS findings, each with reproduction steps the JTAG / I²C cartridge can replay,
- a checkpointed Operation they can re-open, modify, and re-render,
- a polished DOCX with their template branding,
- 8 deferred runs with explicit reasons: exactly the items needing the operator’s attention.
This is what the framework was built for. The orchestrator and sub-agents above are roughly 350 lines of Python; everything else (the engagement model, persistence, peripherals, cartridges, MCP, RAG, reports) is Wintermute itself.
Where to Take This Next
A few extensions worth noting; each is one or two posts of follow-up.
- Cross-run reasoning. Have the orchestrator re-summarize after every N runs and re-prioritize the queue. (“We just found JTAG dumpable, escalate any test that benefits from the dump.”) The data is all on the Operation; only the prompt shape changes.
- Surgeon-driven hypothesis testing. When a finding suggests a parsing bug (e.g., a field on an I²C-loaded config triggers an __assert_fail), spawn a Surgeon-backed sub-agent that calls create_hook_skeleton + build_firmware + start_fuzzing to confirm reproducibility from the emulator. This is exactly what the integration in Part 4 was set up for.
- Multi-engagement portfolio orchestrator. Same orchestrator, parallel engagements. Wintermute’s Operation.register_backend(... make_default=True) is global; for portfolio mode, you instead want a per-engagement context (an OperationContext stack would be a small refactor) so two runs against different operations don’t trample each other’s defaults.
- Retest mode. Re-run only the runs that have findings by replaying their ReproductionStep records. The tool/argument arrays are precisely what registry.call consumes; retest is for step in run.findings[*].reproduction_steps: registry.call(step.tool, ...) and a comparison against original output.
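The retest idea can be sketched without the real registry. The example assumes each step stores its arguments the way the sub-agent body writes them (a one-element list holding a JSON string). `Step`, `replay`, the stub tool, and the stored baseline outputs are all stand-ins for illustration, since the post does not say where the original output is kept.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Step:  # stand-in for ReproductionStep
    tool: str
    arguments: list[str] = field(default_factory=list)


def replay(steps: list[Step],
           call: Callable[[str, dict], Any],
           baseline: list[Any]) -> list[bool]:
    """Re-run each step and report whether its output matches the baseline."""
    matches = []
    for step, expected in zip(steps, baseline):
        args = json.loads(step.arguments[0]) if step.arguments else {}
        try:
            result = call(step.tool, args)
        except Exception as exc:  # a failing tool counts as a mismatch
            result = {"error": str(exc)}
        # Compare canonical JSON so dict key order does not matter.
        same = (json.dumps(result, sort_keys=True, default=str)
                == json.dumps(expected, sort_keys=True, default=str))
        matches.append(same)
    return matches


def fake_call(name: str, args: dict) -> Any:  # stub standing in for registry.call
    if name == "detect":
        return [80, 81]
    raise KeyError(name)


steps = [Step("detect", ["{}"]),
         Step("dump_eeprom", ['{"address": 80, "size": 256}'])]
print(replay(steps, fake_call, baseline=[[80, 81], {"sha256": "..."}]))
# [True, False]
```

A mismatch does not by itself mean the finding is fixed; it only flags the step for an analyst to look at.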
Series Wrap
The series has built up the case that Wintermute is not “a Python LLM client with hardware drivers” — it is a framework for engagement state, with deliberately decoupled subsystems (router, RAG, tools, cartridges, MCP, storage, tickets, reports), and the AI sits as a peer of the operator on top of all of it. The orchestrator + sub-agent split we just walked through is genuinely small because the framework underneath does most of the work.
If you build on this, two things will keep you out of trouble:
- Always write through the operation graph. Anything you discover should land on Operation somewhere: Vulnerability, TestCaseRun.findings, Device.peripherals. Findings off the graph are findings the report cannot render and the retest cannot replay.
- Curate every agent’s tool surface. The four-tool surface in the I²C sub-agent above is not a constraint; it is the reason the sub-agent terminates correctly. Resist the temptation to hand the LLM the entire global registry.
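One way to make a curated surface binding rather than advisory is to check each requested tool name against the surface before dispatch. This is a minimal sketch, assuming a hypothetical `make_dispatcher` wrapper and a plain callable in place of registry.call; it is not how Wintermute itself dispatches.

```python
from typing import Any, Callable, Iterable


def make_dispatcher(allowed: Iterable[str],
                    call: Callable[[str, dict], Any]) -> Callable[[str, dict], Any]:
    """Wrap `call` so that only tools on the curated surface are executed."""
    allowed_names = frozenset(allowed)

    def dispatch(name: str, args: dict) -> Any:
        if name not in allowed_names:
            # Returned to the model as a tool result instead of being executed.
            return {"error": f"tool {name!r} is not on this run's surface"}
        return call(name, args)

    return dispatch


executed = []


def fake_call(name: str, args: dict) -> Any:  # stub standing in for registry.call
    executed.append(name)
    return {"ok": True}


dispatch = make_dispatcher(["detect", "dump_eeprom"], fake_call)
print(dispatch("detect", {}))               # {'ok': True}
print(dispatch("aws_sts_assume_role", {}))  # error dict, nothing executed
print(executed)                             # ['detect']
```

Returning an error object instead of raising keeps the conversation going: the model sees the rejection as an ordinary tool result and can pick a tool that is on its surface.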
Source: https://github.com/nahualito/wintermute. Documentation: https://nahualito.github.io/wintermute/. Series home: https://exploit.ninja.
— fin.




