Wintermute Framework, Part 7: Per-Test-Case Sub-Agents

This is the closing post. Across the series we built up:

  • the engagement model and architecture (Part 1),
  • driving Wintermute by hand (Part 2),
  • the AI router, RAG, and the tool registry (Part 3),
  • cartridges, the MCP runtime, and Surgeon (Part 4),
  • a single-test-case agent (Part 5),
  • the orchestrator that reads a ticket and fans out runs (Part 6).

Now we build the body of run_subagent_for_test_case_run — the function the orchestrator calls per TestCaseRun. Each invocation:

  1. fetches the reproduction steps from the parent TestCase,
  2. picks the right cartridges and MCP tools for the run’s bound target
    kind (UART → JTAG; I²C → I²C cartridge + firmware analysis; TPM → tpm20
    cartridge; AWS-IAM → AWS MCP),
  3. runs a Part-5-style loop with a curated tool surface,
  4. verifies the result (re-reads what changed, captures
    reproduction artifacts),
  5. mutates the live TestCaseRun (status, notes, findings).

By the end of this post the IoT-camera engagement runs autonomously from the ticket on the operator’s desk to the DOCX report in the customer’s inbox.

The Sub-Agent Contract

def run_subagent_for_test_case_run(
    op: Operation,
    run: TestCaseRun,
    *,
    router: Router,
    wall_clock_budget_s: int = 180,
    max_iterations: int = 10,
) -> TestCaseRun:
    """Drive a single TestCaseRun to a terminal status.

    Picks a tool surface tailored to run.bound, builds a system prompt
    with the parent test case's reproduction steps, and runs a tool-
    calling loop until either the agent emits a JSON verdict or the
    iteration / wall-clock budget is exhausted.

    Always returns with `run.status` in a terminal state (passed,
    failed, blocked, or not_applicable). Always sets run.executed_by
    and run.notes.
    """

A few invariants that matter when this is invoked from the orchestrator:

  • Idempotent on terminal runs. If the run is already terminal,
    return immediately. This lets the operator retry the orchestrator
    without re-executing completed runs.
  • Side effects in one place. Every mutation to run happens inside
    this function. The orchestrator never reaches in.
  • No silent failures. A timeout or an exhausted iteration budget sets
    run.status = blocked and writes the cause to run.notes; a tool
    exception is fed back to the agent as an error result rather than
    swallowed. The DOCX report shows the block reason directly.
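
The first and third invariants can be tried out in isolation. The snippet below is a minimal sketch, not Wintermute code: Status, FakeRun, and guarded are stand-ins invented for this example, and the wrapper is a stricter variant that also blocks the run when the body itself crashes (for instance, because the router is unreachable).

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Status(Enum):  # stand-in for RunStatus; the real enum may differ
    not_run = "not_run"
    in_progress = "in_progress"
    passed = "passed"
    failed = "failed"
    blocked = "blocked"
    not_applicable = "not_applicable"


TERMINAL = {Status.passed, Status.failed, Status.blocked, Status.not_applicable}


@dataclass
class FakeRun:  # hypothetical stand-in for TestCaseRun
    status: Status = Status.not_run
    notes: str = ""


def guarded(run: FakeRun, body: Callable[[FakeRun], Status]) -> FakeRun:
    """Wrap an arbitrary run body with the idempotency and failure rules."""
    if run.status in TERMINAL:
        return run  # idempotent: a finished run is never re-executed
    run.status = Status.in_progress
    try:
        run.status = body(run)
    except Exception as exc:
        # No silent failures: the cause lands in notes, the run is blocked.
        run.status = Status.blocked
        run.notes = f"sub-agent error: {type(exc).__name__}: {exc}"
    return run


def crashing_body(run: FakeRun) -> Status:
    raise ConnectionError("router unreachable")


crashed = guarded(FakeRun(), crashing_body)
already_done = guarded(FakeRun(status=Status.passed), crashing_body)
print(crashed.status.value, "|", crashed.notes)
print(already_done.status.value)
```

The second call never reaches crashing_body, which is the property that makes an orchestrator retry safe.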

The Tool-Surface Selector

The single most important design decision for sub-agents is which tools each one sees. Hand the agent everything (80+ MCP tools + 16+ cartridge methods) and the prompt explodes, the model wanders, and runs cost more.

Selection is driven by the BoundObjectRef.kind and the bound objects’ types — i.e., what the test case is actually bound to.

from wintermute.ai.types import ToolSpec
from wintermute.ai.tools_runtime import tools as registry, spec_from_tool
from wintermute.cartridges.manager import CartridgeManager
from wintermute.core import Operation, TestCaseRun

# Mapping: profile key -> (cartridges to load, tool names for the surface).
# Profiles are additive: a run bound to several objects gets the union of
# their tools, de-duplicated in select_tool_surface below.
_TOOL_PROFILES = {
    # IoT / hardware
    "uart": (
        ["jtag"],                                   # cartridges to ensure loaded
        ("halt_core", "resume_core",                # JTAG primitives
         "read_memory", "read_registers",
         "open_ssh_session", "run_ssh_session_command",   # via MCP runtime
         "run_ssh_session_background", "poll_ssh_background_job"),
    ),
    "jtag": (
        ["jtag", "firmware_analysis"],
        ("halt_core", "resume_core", "read_memory", "read_registers",
         "dump_firmware",
         "analyze_entropy", "scan_for_secrets",
         "extract_strings", "find_base_address"),
    ),
    "i2c": (
        ["i2c", "firmware_analysis"],
        ("detect", "dump_eeprom",
         "extract_strings", "scan_for_secrets"),
    ),
    "tpm": (
        ["tpm20"],
        ("get_random", "test_pcr_state", "test_da_lockout",
         "read_public", "fuzz_command"),
    ),
    "spi-flash": (
        ["firmware_analysis"],
        ("analyze_entropy", "scan_for_secrets",
         "extract_strings", "find_base_address"),
    ),
    # Cloud / red team
    "aws-iam-role": (
        [],                                         # no cartridges; tools come from external MCP
        ("aws_iam_get_role", "aws_iam_simulate_principal_policy",
         "aws_sts_assume_role", "aws_iam_list_attached_policies"),
    ),
    "aws-s3-bucket": (
        [],
        ("aws_s3_list_objects", "aws_s3_get_bucket_policy",
         "aws_s3_get_object_acl", "aws_s3_get_public_access_block"),
    ),
    "burp-target": (
        [],
        ("burp_active_scan", "burp_get_issues", "burp_get_sitemap"),
    ),
}


def _peripheral_kind(p: object) -> str:
    """Map a Peripheral / cloud object to a profile key."""
    p_type = (getattr(p, "pType", None) or "").lower()
    if p_type:
        # pType is "uart" / "i2c" / "spi" / "jtag" / ...; "spi" has no profile
        # of its own, so route it to the analyzer-only "spi-flash" entry.
        return "spi-flash" if p_type == "spi" else p_type
    cls = type(p).__name__.lower()
    if "tpm" in cls:
        return "tpm"
    if "uart" in cls:
        return "uart"
    if "jtag" in cls:
        return "jtag"
    if "iamrole" in cls:
        return "aws-iam-role"
    if "s3" in cls or "bucket" in cls:
        return "aws-s3-bucket"
    return "spi-flash"   # safe analyzer-only fallback


def select_tool_surface(run: TestCaseRun, op: Operation) -> list[ToolSpec]:
    """Resolve run.bound to objects, classify, and return the tool surface."""
    profile_keys: list[str] = []
    for b in run.bound:
        # Find the actual object in the operation: peripherals first.
        obj = None
        for d in op.devices:
            for p in d.peripherals:
                if p.name == b.object_id or getattr(p, "device_path", "") == b.object_id:
                    obj = p
                    break
            if obj is not None:
                break
        if obj is None:
            # Then cloud objects, matched on whichever name attribute they carry.
            for acc in op.cloud_accounts:
                for lst_name in ("iamroles", "iamusers", "services"):
                    for x in getattr(acc, lst_name, []):
                        if getattr(x, "role_name", "") == b.object_id \
                                or getattr(x, "username", "") == b.object_id \
                                or getattr(x, "name", "") == b.object_id:
                            obj = x
                            break
                    if obj is not None:
                        break
                if obj is not None:
                    break   # stop at the first match across all accounts
        if obj is not None:
            profile_keys.append(_peripheral_kind(obj))

    # Load required cartridges + collect tools
    mgr = CartridgeManager()
    wanted_tools: list[str] = []
    for k in profile_keys:
        cartridges, tool_names = _TOOL_PROFILES.get(k, ([], ()))
        for c in cartridges:
            if c not in mgr.list_loaded():
                try:
                    mgr.load(c)
                except Exception:
                    # Best effort: tools missing from the registry are
                    # filtered out below, and an empty surface blocks the run.
                    pass
        wanted_tools.extend(tool_names)

    # De-duplicate while preserving order; keep only registered tools.
    seen, surface = set(), []
    for name in wanted_tools:
        if name in seen or name not in registry._tools:
            continue
        seen.add(name)
        surface.append(spec_from_tool(registry._tools[name]))
    return surface

This is the lever the framework gives you: we pick the tool surface from the live operation’s actual objects, not from a static config. Add an SPI flash to a device → next sub-agent for that device sees the firmware analysis cartridge. Replace tpm20 with a fork → swap one cartridge name and the dispatch is automatic.
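
To see the "add a peripheral, get a wider surface" effect without a live operation, here is a small sketch that reuses two of the profiles above. PROFILES and resolve are names made up for this example; the real selector also loads cartridges and filters against the registry, which this sketch leaves out.

```python
from typing import Mapping, Sequence

Profile = tuple[Sequence[str], Sequence[str]]  # (cartridges, tool names)

# Trimmed copy of two profiles from the post, for illustration only.
PROFILES: dict[str, Profile] = {
    "i2c": (["i2c", "firmware_analysis"],
            ("detect", "dump_eeprom", "extract_strings", "scan_for_secrets")),
    "spi-flash": (["firmware_analysis"],
                  ("analyze_entropy", "scan_for_secrets",
                   "extract_strings", "find_base_address")),
}


def resolve(kinds: Sequence[str],
            profiles: Mapping[str, Profile]) -> tuple[list[str], list[str]]:
    """Return (cartridges, tools) for the given kinds, de-duplicated in order."""
    cartridges: list[str] = []
    tools: list[str] = []
    for kind in kinds:
        wanted_cartridges, wanted_tools = profiles.get(kind, ([], ()))
        for c in wanted_cartridges:
            if c not in cartridges:
                cartridges.append(c)
        for t in wanted_tools:
            if t not in tools:
                tools.append(t)
    return cartridges, tools


i2c_only = resolve(["i2c"], PROFILES)
with_flash = resolve(["i2c", "spi-flash"], PROFILES)
print(i2c_only[1])
print(with_flash[1])
```

Binding the SPI flash adds only the two analyzer tools the I²C profile did not already carry (analyze_entropy and find_base_address), and an unknown kind contributes nothing rather than raising.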

The Sub-Agent Body

import json
import time
from datetime import datetime, timezone

from wintermute.ai.use import tool_calling_chat
from wintermute.ai.types import ChatRequest, Message
from wintermute.core import RunStatus, TestCaseRun, Operation
from wintermute.findings import ReproductionStep, Vulnerability

SUBAGENT_SYSTEM = """You are an autonomous sub-agent for one TestCaseRun on
a sanctioned penetration test. You operate ONLY on the bound target.

Output contract: your final reply must be a single JSON object:
{
  "verdict": "passed" | "failed" | "blocked" | "not_applicable",
  "title": "<concise vulnerability title or empty>",
  "description": "<2-4 sentence finding description>",
  "cvss": <int 0-10>,
  "evidence": {<key>: <value>, ...},   // tool outputs you want recorded
  "tool_trace": [{"tool": "<name>", "args": {...}}, ...]
}

You may only call the tools listed in your tool spec. Do not invent tools.
Do not exceed your iteration budget. If a destructive operation would be
required to confirm the finding, return "blocked" and put the reason in
description.
"""


def _bound_summary(run: TestCaseRun, op: Operation) -> dict:
    items = []
    for b in run.bound:
        items.append({"alias": b.alias, "kind": b.kind, "object_id": b.object_id})
    return {"run_id": run.run_id, "test_case": run.test_case_code, "bound": items}


def run_subagent_for_test_case_run(
    op, run, *, router, wall_clock_budget_s: int = 180,
    max_iterations: int = 10,
):
    if run.status.value not in ("not_run", "in_progress"):
        return run   # idempotent: already terminal

    surface = select_tool_surface(run, op)
    if not surface:
        run.status = RunStatus.blocked
        run.notes = "sub-agent: no tool surface for bound target"
        run.executed_by = "wintermute-subagent"
        run.finish()
        return run

    # Default to None so a missing test case blocks the run instead of raising.
    tc = next((t for t in op.iterTestCases() if t.code == run.test_case_code), None)
    if tc is None:
        run.status = RunStatus.blocked
        run.notes = f"sub-agent: test case {run.test_case_code} not found"
        run.executed_by = "wintermute-subagent"
        run.finish()
        return run

    steps_text = "\n".join(
        f"- {i+1}. {s.title}: {s.description} (tool={s.tool}, action={s.action})"
        for i, s in enumerate(tc.steps)
    )
    user_prompt = (
        f"{json.dumps(_bound_summary(run, op), indent=2)}\n\n"
        f"Test case: {tc.code}: {tc.name}\n"
        f"Description: {tc.description}\n\n"
        f"Reproduction steps from the test plan (use these as guidance, "
        f"adapt to the available tools):\n{steps_text}\n\n"
        f"Available tools: {[s.name for s in surface]}\n\n"
        "Execute the test now. Reply with the JSON verdict only."
    )
    messages = [
        Message(role="system", content=SUBAGENT_SYSTEM),
        Message(role="user", content=user_prompt),
    ]

    run.status = RunStatus.in_progress
    run.start()
    run.executed_by = f"wintermute-subagent ({datetime.now(timezone.utc).isoformat()})"
    deadline = time.monotonic() + wall_clock_budget_s
    tool_trace: list[dict] = []
    verdict: dict | None = None

    for _ in range(max_iterations):
        if time.monotonic() > deadline:
            run.status = RunStatus.blocked
            run.notes = (run.notes + "\n" if run.notes else "") + "wall-clock budget exhausted"
            run.finish()
            return run

        # task_tag="cheap": sub-agents go to Groq when registered.
        resp = tool_calling_chat(
            router, messages, surface,
            response_format="text", task_tag="cheap",
        )
        if not resp.tool_calls:
            try:
                verdict = json.loads(resp.content)
            except (json.JSONDecodeError, TypeError):   # TypeError: content was None
                messages.append(Message(role="user",
                    content="Your last reply did not parse as JSON. Reply ONLY with the schema."))
                continue
            break

        # Append the assistant message + all tool results
        messages.append(Message(role="assistant", content=resp.content or ""))
        for call in resp.tool_calls:
            try:
                result = registry.call(call.name, dict(call.arguments))
            except Exception as exc:
                result = {"error": str(exc)}   # surfaced to the agent, not swallowed
            tool_trace.append({"tool": call.name, "args": dict(call.arguments)})
            messages.append(Message(
                role="tool",
                content=json.dumps(result, default=str),
                tool_name=call.name,
                tool_call_id=call.id,
            ))
    else:
        # for/else: the loop ran out of iterations without reaching `break`.
        run.status = RunStatus.blocked
        run.notes = (run.notes + "\n" if run.notes else "") + "iteration budget exhausted"
        run.finish()
        return run

    # ----- VERIFICATION & WRITE-BACK ---------------------------------------
    if not isinstance(verdict, dict):   # e.g. the model replied with JSON null
        run.status = RunStatus.blocked
        run.notes = "sub-agent: no verdict produced"
        run.finish()
        return run

    outcome = verdict.get("verdict")
    if outcome == "failed":
        repro = []
        # Use the calls we actually dispatched, not the model's self-reported
        # trace, so the reproduction steps stay replayable.
        for entry in tool_trace:
            repro.append(ReproductionStep(
                title=entry["tool"],
                description=f"agent invocation: {entry['tool']}({entry.get('args', {})})",
                tool=entry["tool"],
                action="agent",
                confidence=80,
                arguments=[json.dumps(entry.get("args", {}), default=str)],
            ))
        run.findings.append(Vulnerability(
            title=verdict.get("title") or f"Finding from {tc.code}",
            description=verdict.get("description", ""),
            cvss=int(verdict.get("cvss") or 0),
            verified=True,
            reproduction_steps=repro,
        ))
        run.status = RunStatus.failed
    elif outcome == "passed":
        run.status = RunStatus.passed
    elif outcome == "not_applicable":
        run.status = RunStatus.not_applicable
    else:
        run.status = RunStatus.blocked   # "blocked" or an unrecognized verdict

    run.notes = (run.notes + "\n" if run.notes else "") + (
        f"verdict={outcome} "
        f"evidence_keys={list((verdict.get('evidence') or {}).keys())} "
        f"tool_calls={len(tool_trace)}"
    )
    run.finish()
    return run

A few things this body does that matter for real engagements:

  • Idempotent re-entry. The first if statement returns immediately for
    runs that are already terminal.
  • Wall-clock + iteration budgets. Both are enforced. Either expiring
    marks the run blocked, never passed by accident.
  • Reproduction steps are derived from the actual tool calls the
    agent made. Six months later a different analyst replays
    i2c.dump_eeprom({"address": 0x50, "size": 256}) → extract_strings(...).
    That is a real artifact, not LLM prose.
  • Cheap routing. task_tag="cheap" sends each sub-agent to Groq
    Llama 3.3 70B when registered. The orchestrator above stays on Sonnet.
  • registry.call(...) per tool. That goes through ToolsRuntime-style
    dispatch — local cartridge methods, MCP tools, Surgeon, all looked up by
    name.

End-to-End Trace: The IoT Camera Ticket

Operator drops a ticket:

Title: I2C EEPROM extraction on MCIO
Tags: #i2c #eeprom #blackbox #hardware
Description: Analyze I2C EEPROM on MCIO port. Suspected hardcoded credentials in flash dump. Target: iot-cam-01 (10.0.0.5). Bus: I2C-2, Address: 0x50.

Operator runs (over MCP from Claude Desktop, or from the REPL once we ship orchestrator run):

$ wintermute
onoSendai > operation load acme-iotcam-2026-Q2
onoSendai [acme-iotcam-2026-Q2] > orchestrator run T000001

Trace:

[orchestrator] read_ticket: title="I2C EEPROM extraction on MCIO"
    tags=["i2c","eeprom","blackbox","hardware"]
    scope.target_host="iot-cam-01"
    scope.bus="I2C-2"
    scope.address="0x50"
[orchestrator] plan_node: matched scope.tags ⊂ {hardware, blackbox}
    → attaching TP-HW-BLACKBOX-001
    generated 17 TestCaseRuns
[orchestrator] dispatch_node:
    IOT-HW-GEN-001:iot-cam-01                → defer (manual constraint review)
    IOT-HW-DISC-001:iot-cam-01               → defer (manual board photos)
    IOT-HW-UART-001:iot-cam-01:debug-uart    → execute
    IOT-HW-I2C-001:iot-cam-01:mcio-eeprom-1  → execute
    IOT-HW-JTAG-001:iot-cam-01:main-jtag     → execute
    IOT-HW-FAULT-001:iot-cam-01              → defer (destructive)
    IOT-HW-TPM-001:iot-cam-01                → skip (no TPM peripheral on device)
    ... 9 execute / 6 defer / 2 skip
[orchestrator] execute_runs_node: dispatching 9 sub-agents
[subagent IOT-HW-I2C-001:iot-cam-01:mcio-eeprom-1]
    surface = [detect, dump_eeprom, extract_strings, scan_for_secrets]
    call detect() -> [80, 81]
    call dump_eeprom(0x50, 256) -> {file_path: ".../blob-3a4f..bin", sha256: ...}
    call extract_strings(file=...,8) -> top: ["admin:hunter2",
                                              "http://10.0.0.1/api/v1/login",
                                              "/bin/sh -c reboot"]
    call scan_for_secrets(file=...) -> {pem_block: [], aws_key: []}
    verdict = {"verdict":"failed",
               "title":"Hardcoded credentials in MCIO I²C EEPROM",
               "description":"Recovered admin string and login URL from
                              256-byte EEPROM dump. The dump is unencrypted...",
               "cvss":8, "evidence": {...}, "tool_trace":[...]}
    run.findings += Vulnerability(...)
    run.status = failed
[subagent IOT-HW-UART-001:iot-cam-01:debug-uart]
    surface = [halt_core, resume_core, read_memory, read_registers,
               open_ssh_session, run_ssh_session_command, ...]
    call open_ssh_session(host=..., user=..., password=...) -> session=...
    call run_ssh_session_command(session, "cat /proc/cmdline")
        -> "console=ttyS0,115200 init=/bin/sh ..."
    call halt_core() -> True (target halted)
    call read_registers() -> {pc: 0x80008034, lr: ..., sp: ...}
    call read_memory("0x80008000", 4) -> "0x80008000: e3a01000 e58d1000 ..."
    verdict = {"verdict":"failed",
               "title":"Interruptible U-Boot via UART; init=/bin/sh in cmdline",
               "cvss":9, ...}
    run.status = failed
[subagent IOT-HW-JTAG-001:iot-cam-01:main-jtag]
    surface = [halt_core, resume_core, read_memory, read_registers,
               dump_firmware, analyze_entropy, scan_for_secrets,
               extract_strings, find_base_address]
    call halt_core() -> True
    call dump_firmware("0x08000000", 1048576, "iotcam-flash.bin")
        -> blob descriptor (1 MiB)
    call analyze_entropy(file=...) -> {overall: 7.85,
                                       high_entropy_blocks: [...]}
    call extract_strings(file=...,8) -> top: ["root:$1$...", "wpa_supplicant ...",
                                              "BEGIN OPENSSH PRIVATE KEY"]
    call scan_for_secrets(file=...) -> {pem_block: ["0x40123", "0x40e80"],
                                        aws_key: []}
    call find_base_address(file=..., arch="arm",
                           min_addr=0x08000000, max_addr=0x40000000)
        -> {top_5: {"0x08000000": 412, ...}}
    verdict = {"verdict":"failed",
               "title":"Firmware contains private keys + crypted root pwd
                        hash; flash is openly dumpable via JTAG",
               "cvss":9, ...}
    run.status = failed
[orchestrator] checkpointed after each run -> .wintermute_data/...
[orchestrator] report_node:
    ./reports/acme-iotcam-2026-Q2.docx
    - 9 runs completed (3 failed with vulnerabilities, 6 passed)
    - 8 runs blocked/deferred (operator review)
    - 2 runs not_applicable
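
A cheap sanity check worth running on a trace like this is that every generated run received exactly one dispatch decision. The sketch below tallies the seven dispatch lines the trace prints in full and then checks the totals from the dispatch summary line; shown and unaccounted are hypothetical names for this example, not orchestrator APIs.

```python
from collections import Counter

# The seven dispatch lines the trace shows in full (the rest are elided there).
shown = {
    "IOT-HW-GEN-001": "defer",
    "IOT-HW-DISC-001": "defer",
    "IOT-HW-UART-001": "execute",
    "IOT-HW-I2C-001": "execute",
    "IOT-HW-JTAG-001": "execute",
    "IOT-HW-FAULT-001": "defer",
    "IOT-HW-TPM-001": "skip",
}


def unaccounted(generated: int, tally: dict[str, int]) -> int:
    """Number of generated runs that received no dispatch decision."""
    return generated - sum(tally.values())


shown_tally = Counter(shown.values())
missing = unaccounted(17, {"execute": 9, "defer": 6, "skip": 2})
print(dict(shown_tally), missing)
```

A non-zero result would mean a run fell through the dispatcher without a decision, which is the kind of gap the "no silent failures" rule is meant to rule out.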

What the operator gets, all from a Bugzilla ticket:

  • 3 high-CVSS findings, each with reproduction steps the JTAG / I²C
    cartridge can replay,
  • a checkpointed Operation they can re-open, modify, and re-render,
  • a polished DOCX with their template branding,
  • 8 deferred runs with explicit reasons — exactly the items needing the
    operator’s attention.

This is what the framework was built for. The orchestrator and sub-agents above are roughly 350 lines of Python; everything else (the engagement model, persistence, peripherals, cartridges, MCP, RAG, reports) is Wintermute itself.

Where to Take This Next

A few extensions worth noting; each is one or two posts of follow-up.

  • Cross-run reasoning. Have the orchestrator re-summarize after every
    N runs and re-prioritize the queue. (“We just found JTAG dumpable, escalate
    any test that benefits from the dump.”) The data is all on the
    Operation; only the prompt shape changes.
  • Surgeon-driven hypothesis testing. When a finding suggests a
    parsing bug (e.g., a field on an I²C-loaded config triggers an
    __assert_fail), spawn a Surgeon-backed sub-agent that calls
    create_hook_skeleton + build_firmware + start_fuzzing to confirm
    reproducibility from the emulator. This is exactly what the integration
    in Part 4 was set up for.
  • Multi-engagement portfolio orchestrator. Same orchestrator, parallel
    engagements. Wintermute’s Operation.register_backend(... make_default=True)
    is global — for portfolio mode, you instead want a per-engagement
    context (an OperationContext stack would be a small refactor) so two
    runs against different operations don’t trample each other’s defaults.
  • Retest mode. Re-run only the runs that produced findings by replaying
    their ReproductionStep records. The tool/argument arrays are precisely
    what registry.call consumes: retest loops over each finding's
    reproduction_steps, calls registry.call(step.tool, ...), and compares
    the result against the original output.
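
Retest mode is the easiest of these to prototype. The sketch below assumes each step stores its arguments as a single JSON string (which is how the sub-agent body writes them) and that the original tool outputs were kept somewhere as a baseline list; that baseline, the Step class, and fake_call are assumptions made for this example, with fake_call standing in for registry.call so the code runs without hardware.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Step:  # hypothetical stand-in for ReproductionStep
    tool: str
    arguments: list[str] = field(default_factory=list)


def replay(steps: list[Step], call: Callable[[str, dict], Any],
           baseline: list[Any]) -> list[dict]:
    """Re-run each step and flag outputs that differ from the baseline."""
    report = []
    for step, expected in zip(steps, baseline):
        # The sub-agent stored the args as one JSON string per step.
        args = json.loads(step.arguments[0]) if step.arguments else {}
        try:
            actual = call(step.tool, args)
        except Exception as exc:
            actual = {"error": str(exc)}
        report.append({"tool": step.tool, "same": actual == expected})
    return report


def fake_call(name: str, args: dict) -> Any:
    """Fake registry so the sketch runs without a device attached."""
    if name == "detect":
        return [80, 81]
    if name == "dump_eeprom":
        return {"size": args["size"]}
    raise KeyError(name)


steps = [Step("detect"), Step("dump_eeprom", ['{"address": 80, "size": 256}'])]
result = replay(steps, fake_call, baseline=[[80, 81], {"size": 128}])
print(result)
```

Here the second step reports a difference because the replayed dump size no longer matches the stored baseline, which is the signal a retest report would surface to the operator.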

Series Wrap

The series has built up the case that Wintermute is not “a Python LLM client with hardware drivers” — it is a framework for engagement state, with deliberately decoupled subsystems (router, RAG, tools, cartridges, MCP, storage, tickets, reports), and the AI sits as a peer of the operator on top of all of it. The orchestrator + sub-agent split we just walked through is genuinely small because the framework underneath does most of the work.

If you build on this, two things will keep you out of trouble:

  1. Always write through the operation graph. Anything you discover
    should land on Operation somewhere — Vulnerability, TestCaseRun.findings,
    Device.peripherals. Findings off the graph are findings the report
    cannot render and the retest cannot replay.
  2. Curate every agent’s tool surface. The four-tool surface in
    the I²C sub-agent above is not a constraint — it is the reason the
    sub-agent terminates correctly. Resist the temptation to hand the LLM
    the entire global registry.

Source: https://github.com/nahualito/wintermute. Documentation: https://nahualito.github.io/wintermute/. Series home: https://exploit.ninja.

fin.
