Wintermute Framework, Part 7: Per-Test-Case Sub-Agents

This is the closing post. Across the series we built up:

the engagement model and architecture (Part 1),
driving Wintermute by hand (Part 2),
the AI router, RAG, and the tool registry (Part 3),
cartridges, the MCP runtime, and Surgeon (Part 4),
a single-test-case agent (Part 5),
the orchestrator that reads a ticket and fans out runs (Part 6).

Now we build the body of run_subagent_for_test_case_run — the function the orchestrator calls per TestCaseRun. Each invocation:

fetches the reproduction steps from the parent TestCase,
picks the right cartridges and MCP tools for the run’s bound target
kind (UART → JTAG; I²C → I²C cartridge + firmware analysis; TPM → tpm20
cartridge; AWS-IAM → AWS MCP),
runs a Part-5-style loop with a curated tool surface,
verifies the result (re-reads what changed, captures
reproduction artifacts),
mutates the live TestCaseRun (status, notes, findings).

By the end of this post the IoT-camera engagement runs autonomously from the ticket on the operator’s desk to the DOCX report in the customer’s inbox.

The Sub-Agent Contract

			
def run_subagent_for_test_case_run(
    op: Operation,
    run: TestCaseRun,
    *,
    router: Router,
    wall_clock_budget_s: int = 180,
    max_iterations: int = 10,
) -> TestCaseRun:
    """Drive a single TestCaseRun to a terminal status.
    Picks a tool surface tailored to run.bound, builds a system prompt
    with the parent test case's reproduction steps, and runs a tool-
    calling loop until either the agent emits a JSON verdict or the
    iteration / wall-clock budget is exhausted.
    Always returns with `run.status` in a terminal state (passed,
    failed, blocked, or not_applicable). Always sets run.executed_by
    and run.notes.
    """

		

A few invariants that matter when this is invoked from the orchestrator:

Idempotent on terminal runs. If the run is already terminal (passed,
failed, blocked, or not_applicable), return immediately. This lets the
operator retry the orchestrator without re-executing completed runs.
Side-effects in one place. Every mutation to run happens inside
this function. The orchestrator never reaches in.
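No silent failures. A timeout or exhausted iteration budget sets run.status = blocked and records the cause in run.notes, so the DOCX report shows the
block reason directly. A tool exception is handed back to the model as an error result rather than swallowed.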
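
The three invariants can be sketched as a thin wrapper around any sub-agent body. This is a minimal illustration, not Wintermute's implementation: Status, Run, and drive_to_terminal are hypothetical stand-ins for RunStatus, TestCaseRun, and the real function.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Status(Enum):  # hypothetical stand-in for Wintermute's RunStatus
    not_run = "not_run"
    in_progress = "in_progress"
    passed = "passed"
    failed = "failed"
    blocked = "blocked"
    not_applicable = "not_applicable"


TERMINAL = {Status.passed, Status.failed, Status.blocked, Status.not_applicable}


@dataclass
class Run:  # hypothetical stand-in for TestCaseRun
    status: Status = Status.not_run
    notes: str = ""
    executed_by: str = ""


def drive_to_terminal(run: Run, body: Callable[[Run], None]) -> Run:
    """Apply the three invariants around an arbitrary sub-agent body."""
    if run.status in TERMINAL:
        return run  # idempotent: completed runs are never re-executed
    run.executed_by = "wintermute-subagent"
    try:
        body(run)  # the only place the run is mutated
    except Exception as exc:  # no silent failures: record the cause
        run.status = Status.blocked
        run.notes = f"sub-agent error: {type(exc).__name__}: {exc}"
    if run.status not in TERMINAL:
        # A body that returns without deciding is treated as blocked.
        run.status = Status.blocked
        run.notes = run.notes or "sub-agent returned without a verdict"
    return run
```

Whatever the body does, the caller gets back a run in a terminal state with a note explaining any block, which is the property the orchestrator relies on.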

The Tool-Surface Selector

The single most important design decision for sub-agents is which tools each one sees. Hand the agent everything (80+ MCP tools + 16+ cartridge methods) and the prompt explodes, the model wanders, and runs cost more.

Selection is driven by the BoundObjectRef.kind and the bound objects’ types — i.e., what the test case is actually bound to.

			
from wintermute.ai.types import ToolSpec
from wintermute.ai.tools_runtime import tools as registry, spec_from_tool
from wintermute.cartridges.manager import CartridgeManager
from wintermute.core import Operation, TestCaseRun
# Mapping: profile key -> (cartridges to load, tool names that belong on the surface).
# Lookup is by key; when a run binds several objects, their profiles are unioned.
_TOOL_PROFILES = {
    # IoT / hardware
    "uart": (
        ["jtag"],                                   # cartridges to ensure loaded
        ("halt_core", "resume_core",                # JTAG primitives
         "read_memory", "read_registers",
         "open_ssh_session", "run_ssh_session_command",  # via MCP runtime
         "run_ssh_session_background", "poll_ssh_background_job"),
    ),
    "jtag": (
        ["jtag", "firmware_analysis"],
        ("halt_core", "resume_core", "read_memory", "read_registers",
         "dump_firmware",
         "analyze_entropy", "scan_for_secrets",
         "extract_strings", "find_base_address"),
    ),
    "i2c": (
        ["i2c", "firmware_analysis"],
        ("detect", "dump_eeprom",
         "extract_strings", "scan_for_secrets"),
    ),
    "tpm": (
        ["tpm20"],
        ("get_random", "test_pcr_state", "test_da_lockout",
         "read_public", "fuzz_command"),
    ),
    "spi-flash": (
        ["firmware_analysis"],
        ("analyze_entropy", "scan_for_secrets",
         "extract_strings", "find_base_address"),
    ),
    # Cloud / red team
    "aws-iam-role": (
        [],                          # no cartridges; tools come from external MCP
        ("aws_iam_get_role", "aws_iam_simulate_principal_policy",
         "aws_sts_assume_role", "aws_iam_list_attached_policies"),
    ),
    "aws-s3-bucket": (
        [],
        ("aws_s3_list_objects", "aws_s3_get_bucket_policy",
         "aws_s3_get_object_acl", "aws_s3_get_public_access_block"),
    ),
    "burp-target": (
        [],
        ("burp_active_scan", "burp_get_issues", "burp_get_sitemap"),
    ),
}
def _peripheral_kind(p: object) -> str:
    """Map a Peripheral / cloud object to a profile key."""
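    p_type = (getattr(p, "pType", None) or "").lower()
    if p_type:
        # "uart" / "i2c" / "jtag" / ... match profile keys as-is; "spi" is
        # stored under "spi-flash" in _TOOL_PROFILES.
        return "spi-flash" if p_type == "spi" else p_type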
    cls = type(p).__name__.lower()
    if "tpm" in cls:           return "tpm"
    if "uart" in cls:          return "uart"
    if "jtag" in cls:          return "jtag"
    if "iamrole" in cls:       return "aws-iam-role"
    if "s3" in cls or "bucket" in cls: return "aws-s3-bucket"
    return "spi-flash"                  # safe analyzer-only fallback
def select_tool_surface(run: TestCaseRun, op: Operation) -> list[ToolSpec]:
    """Resolve run.bound to objects, classify, and return the tool surface."""
    profile_keys: list[str] = []
    for b in run.bound:
        # Find the actual object in the operation
        obj = None
        for d in op.devices:
            for p in d.peripherals:
                if p.name == b.object_id or getattr(p, "device_path", "") == b.object_id:
                    obj = p
                    break
            if obj is not None:
                break
        if obj is None:
            for acc in op.cloud_accounts:
                for lst_name in ("iamroles", "iamusers", "services"):
                    for x in getattr(acc, lst_name, []):
                        if getattr(x, "role_name", "") == b.object_id \
                           or getattr(x, "username", "") == b.object_id \
                           or getattr(x, "name", "") == b.object_id:
                            obj = x
                            break
                    if obj is not None:
                        break
                if obj is not None:
                    break              # stop at the first account that matches
        if obj is not None:
            profile_keys.append(_peripheral_kind(obj))
    # Load required cartridges + collect tools
    mgr = CartridgeManager()
    wanted_tools: list[str] = []
    for k in profile_keys:
        cartridges, tool_names = _TOOL_PROFILES.get(k, ([], ()))
        for c in cartridges:
            if c not in mgr.list_loaded():
                try:
                    mgr.load(c)
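                except Exception:
                    # Best effort: if a cartridge fails to load, its tools never
                    # reach the registry and are filtered out below; an empty
                    # surface then blocks the run with an explicit note.
                    pass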
        wanted_tools.extend(tool_names)
    seen, surface = set(), []
    for name in wanted_tools:
        if name in seen or name not in registry._tools: continue
        seen.add(name)
        surface.append(spec_from_tool(registry._tools[name]))
    return surface

		

This is the lever the framework gives you: we pick the tool surface from the live operation’s actual objects, not from a static config. Add an SPI flash to a device → next sub-agent for that device sees the firmware analysis cartridge. Replace tpm20 with a fork → swap one cartridge name and the dispatch is automatic.
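
To make the union-and-filter step concrete, here is a small standalone sketch. PROFILES and REGISTERED are hypothetical, trimmed-down stand-ins for _TOOL_PROFILES and the tool registry; nothing here imports Wintermute.

```python
# Hypothetical, trimmed-down profile table; names mirror the post's examples.
PROFILES = {
    "i2c": ("detect", "dump_eeprom", "extract_strings", "scan_for_secrets"),
    "spi-flash": ("analyze_entropy", "scan_for_secrets", "extract_strings",
                  "find_base_address"),
}
# Pretend registry contents; find_base_address is deliberately not registered.
REGISTERED = {"detect", "dump_eeprom", "extract_strings", "scan_for_secrets",
              "analyze_entropy"}


def merge_surface(kinds: list[str]) -> list[str]:
    """Union the profiles for each bound kind, keeping first-seen order."""
    surface: list[str] = []
    for kind in kinds:
        for name in PROFILES.get(kind, ()):
            # Skip duplicates and tools whose cartridge never registered them.
            if name in REGISTERED and name not in surface:
                surface.append(name)
    return surface


i2c_only = merge_surface(["i2c"])
i2c_plus_flash = merge_surface(["i2c", "spi-flash"])
print(i2c_only)
print(i2c_plus_flash)
```

Binding a second object only grows the surface by the tools that are new and actually registered, so a run bound to both an I²C EEPROM and an SPI flash gains analyze_entropy here and nothing else.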

The Sub-Agent Body

			
import json
import time
from datetime import datetime, timezone
from wintermute.ai.use import tool_calling_chat
from wintermute.ai.types import ChatRequest, Message
from wintermute.core import RunStatus, TestCaseRun, Operation
from wintermute.findings import ReproductionStep, Vulnerability
SUBAGENT_SYSTEM = """You are an autonomous sub-agent for one TestCaseRun on
a sanctioned penetration test. You operate ONLY on the bound target.
Output contract — your final reply must be a single JSON object:
  {
    "verdict": "passed" | "failed" | "blocked" | "not_applicable",
    "title": "<concise vulnerability title or empty>",
    "description": "<2-4 sentence finding description>",
    "cvss": <int 0-10>,
    "evidence": {<key>: <value>, ...},  // tool outputs you want recorded
    "tool_trace": [{"tool": "<name>", "args": {...}}, ...]
  }
You may only call the tools listed in your tool spec. Do not invent tools.
Do not exceed your iteration budget. If a destructive operation would be
required to confirm the finding, return "blocked" and put the reason in
description.
"""
def _bound_summary(run: TestCaseRun, op: Operation) -> dict:
    items = []
    for b in run.bound:
        items.append({"alias": b.alias, "kind": b.kind, "object_id": b.object_id})
    return {"run_id": run.run_id, "test_case": run.test_case_code, "bound": items}
def run_subagent_for_test_case_run(
    op, run, *, router, wall_clock_budget_s: int = 180,
    max_iterations: int = 10,
):
    if run.status.value not in ("not_run", "in_progress"):
        return run                                  # idempotent
    surface = select_tool_surface(run, op)
    if not surface:
        run.status = RunStatus.blocked
        run.notes = "sub-agent: no tool surface for bound target"
        run.executed_by = "wintermute-subagent"
        run.finish()
        return run
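    # Look up the parent test case; block instead of raising StopIteration.
    tc = next((t for t in op.iterTestCases() if t.code == run.test_case_code), None)
    if tc is None:
        run.status = RunStatus.blocked
        run.notes = f"sub-agent: test case {run.test_case_code} not found on operation"
        run.executed_by = "wintermute-subagent"
        run.finish()
        return run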
    steps_text = "\n".join(
        f"- {i+1}. {s.title}: {s.description} (tool={s.tool}, action={s.action})"
        for i, s in enumerate(tc.steps)
    )
    user_prompt = (
        f"{json.dumps(_bound_summary(run, op), indent=2)}\n\n"
        f"Test case: {tc.code} — {tc.name}\n"
        f"Description: {tc.description}\n\n"
        f"Reproduction steps from the test plan (use these as guidance, "
        f"adapt to the available tools):\n{steps_text}\n\n"
        f"Available tools: {[s.name for s in surface]}\n\n"
        "Execute the test now. Reply with the JSON verdict only."
    )
    messages = [
        Message(role="system", content=SUBAGENT_SYSTEM),
        Message(role="user", content=user_prompt),
    ]
    run.status = RunStatus.in_progress
    run.start()
    run.executed_by = f"wintermute-subagent ({datetime.now(timezone.utc).isoformat()})"
    deadline = time.monotonic() + wall_clock_budget_s
    tool_trace: list[dict] = []
    verdict: dict | None = None
    for _ in range(max_iterations):
        if time.monotonic() > deadline:
            run.status = RunStatus.blocked
            run.notes = (run.notes + "\n" if run.notes else "") + "wall-clock budget exhausted"
            run.finish()
            return run
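        # task_tag="cheap": sub-agents go to Groq when registered.
        try:
            resp = tool_calling_chat(
                router, messages, surface,
                response_format="text", task_tag="cheap",
            )
        except Exception as exc:
            # A router/provider failure must not escape: block with the cause.
            run.status = RunStatus.blocked
            run.notes = (run.notes + "\n" if run.notes else "") + f"router error: {exc}"
            run.finish()
            return run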
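        if not resp.tool_calls:
            try:
                verdict = json.loads(resp.content or "")
            except json.JSONDecodeError:
                verdict = None
            # Accept only a JSON object that carries a "verdict" key.
            if not isinstance(verdict, dict) or "verdict" not in verdict:
                verdict = None
                messages.append(Message(role="user",
                    content="Your last reply did not parse as JSON. Reply ONLY with the schema."))
                continue
            break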
        # Append the assistant message + all tool results
        messages.append(Message(role="assistant", content=resp.content or ""))
        for call in resp.tool_calls:
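            try:
                # Enforce the curated surface: the registry is global, so refuse
                # any tool name that was not offered to this sub-agent.
                if call.name not in {s.name for s in surface}:
                    raise PermissionError(f"tool {call.name!r} is not on this run's surface")
                result = registry.call(call.name, dict(call.arguments))
            except Exception as exc:
                result = {"error": str(exc)}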
            tool_trace.append({"tool": call.name, "args": dict(call.arguments)})
            messages.append(Message(
                role="tool",
                content=json.dumps(result, default=str),
                tool_name=call.name,
                tool_call_id=call.id,
            ))
    else:
        run.status = RunStatus.blocked
        run.notes = (run.notes + "\n" if run.notes else "") + "iteration budget exhausted"
        run.finish()
        return run
    # ----- VERIFICATION & WRITE-BACK ---------------------------------------
    if verdict is None:
        run.status = RunStatus.blocked
        run.notes = "sub-agent: no verdict produced"
        run.finish()
        return run
    if verdict["verdict"] == "failed":
        repro = []
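        # Build repro steps from the calls we actually dispatched, not from the
        # model's self-reported tool_trace, so every step is replayable.
        for entry in tool_trace: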
            repro.append(ReproductionStep(
                title=entry["tool"],
                description=f"agent invocation: {entry['tool']}({entry.get('args', {})})",
                tool=entry["tool"],
                action="agent",
                confidence=80,
                arguments=[json.dumps(entry.get("args", {}), default=str)],
            ))
        run.findings.append(Vulnerability(
            title=verdict.get("title") or f"Finding from {tc.code}",
            description=verdict.get("description", ""),
            cvss=int(verdict.get("cvss", 0)),
            verified=True,
            reproduction_steps=repro,
        ))
        run.status = RunStatus.failed
    elif verdict["verdict"] == "passed":
        run.status = RunStatus.passed
    elif verdict["verdict"] == "not_applicable":
        run.status = RunStatus.not_applicable
    else:
        run.status = RunStatus.blocked
    run.notes = (run.notes + "\n" if run.notes else "") + (
        f"verdict={verdict['verdict']} "
        f"evidence_keys={list((verdict.get('evidence') or {}).keys())} "
        f"tool_calls={len(tool_trace)}"
    )
    run.finish()
    return run

		

A few things this body does that matter for real engagements:

Idempotent re-entry. The first if returns immediately for runs that
are already terminal.
Wall-clock + iteration budgets. Both are enforced. Either expiring
marks the run blocked, never passed by accident.
Reproduction steps are derived from the actual tool calls the
agent made. Six months later a different analyst replays
i2c.dump_eeprom({"address": 0x50, "size": 256}) → extract_strings(...).
That is a real artifact, not LLM prose.
Cheap routing. task_tag="cheap" sends each sub-agent to Groq
Llama 3.3 70B when registered. The orchestrator above stays on Sonnet.
registry.call(...) per tool. That goes through ToolsRuntime-style
dispatch — local cartridge methods, MCP tools, Surgeon, all looked up by
name.
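
The body trusts json.loads on the final reply. In practice models sometimes wrap the JSON in a Markdown fence or return an out-of-range score, so a stricter parser can help. The sketch below is one possible hardening, not part of Wintermute; parse_verdict and its normalization rules are assumptions layered on the output contract in SUBAGENT_SYSTEM.

```python
import json

ALLOWED_VERDICTS = ("passed", "failed", "blocked", "not_applicable")
FENCE = "`" * 3  # Markdown code-fence marker some models wrap JSON in


def parse_verdict(reply):
    """Return a normalized verdict dict, or None if the reply is unusable."""
    text = (reply or "").strip()
    if text.startswith(FENCE) and text.endswith(FENCE):
        # Drop the fence and an optional "json" language tag.
        text = text[len(FENCE):-len(FENCE)].strip()
        text = text.removeprefix("json")
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict) or data.get("verdict") not in ALLOWED_VERDICTS:
        return None
    try:
        cvss = int(float(data.get("cvss") or 0))
    except (TypeError, ValueError, OverflowError):
        cvss = 0
    data["cvss"] = max(0, min(10, cvss))  # clamp to the contract's 0-10 range
    if not isinstance(data.get("evidence"), dict):
        data["evidence"] = {}
    return data
```

A None result maps naturally onto the existing retry path: append the "did not parse" message and continue the loop, so a malformed reply costs one iteration instead of crashing the write-back.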

End-to-End Trace: The IoT Camera Ticket

Operator drops a ticket:

Title: I2C EEPROM extraction on MCIO
Tags: #i2c #eeprom #blackbox #hardware
Description: Analyze I2C EEPROM on MCIO port. Suspected hardcoded credentials in flash dump. Target: iot-cam-01 (10.0.0.5). Bus: I2C-2, Address: 0x50.

Operator runs (over MCP from Claude Desktop, or from the REPL once we ship orchestrator run):

			
$ wintermute
onoSendai > operation load acme-iotcam-2026-Q2
onoSendai [acme-iotcam-2026-Q2] > orchestrator run T000001

Trace:

			
[orchestrator] read_ticket: title="I2C EEPROM extraction on MCIO"
                            tags=["i2c","eeprom","blackbox","hardware"]
                            scope.target_host="iot-cam-01"
                            scope.bus="I2C-2"
                            scope.address="0x50"
[orchestrator] plan_node: matched {hardware, blackbox} ⊆ tags
                          → attaching TP-HW-BLACKBOX-001
                          generated 17 TestCaseRuns
[orchestrator] dispatch_node:
   IOT-HW-GEN-001:iot-cam-01    → defer (manual constraint review)
   IOT-HW-DISC-001:iot-cam-01   → defer (manual board photos)
   IOT-HW-UART-001:iot-cam-01:debug-uart → execute
   IOT-HW-I2C-001:iot-cam-01:mcio-eeprom-1 → execute
   IOT-HW-JTAG-001:iot-cam-01:main-jtag → execute
   IOT-HW-FAULT-001:iot-cam-01  → defer (destructive)
   IOT-HW-TPM-001:iot-cam-01    → skip (no TPM peripheral on device)
   ... 9 execute / 6 defer / 2 skip
[orchestrator] execute_runs_node: dispatching 9 sub-agents
[subagent IOT-HW-I2C-001:iot-cam-01:mcio-eeprom-1]
   surface = [detect, dump_eeprom, extract_strings, scan_for_secrets]
   call detect()                     -> [80, 81]
   call dump_eeprom(0x50, 256)       -> {file_path: ".../blob-3a4f..bin", sha256: ...}
   call extract_strings(file=...,8)  -> top: ["admin:hunter2",
                                              "http://10.0.0.1/api/v1/login",
                                              "/bin/sh -c reboot"]
   call scan_for_secrets(file=...)   -> {pem_block: [], aws_key: []}
   verdict = {"verdict":"failed",
              "title":"Hardcoded credentials in MCIO I²C EEPROM",
              "description":"Recovered admin string and login URL from
                  256-byte EEPROM dump. The dump is unencrypted...",
              "cvss":8, "evidence": {...}, "tool_trace":[...]}
   run.findings += Vulnerability(...)
   run.status = failed
[subagent IOT-HW-UART-001:iot-cam-01:debug-uart]
   surface = [halt_core, resume_core, read_memory, read_registers,
              open_ssh_session, run_ssh_session_command, ...]
   call open_ssh_session(host=..., user=..., password=...)  -> session=...
   call run_ssh_session_command(session, "cat /proc/cmdline")
                                       -> "console=ttyS0,115200 init=/bin/sh ..."
   call halt_core()                    -> True (target halted)
   call read_registers()               -> {pc: 0x80008034, lr: ..., sp: ...}
   call read_memory("0x80008000", 4)   -> "0x80008000: e3a01000 e58d1000 ..."
   verdict = {"verdict":"failed",
              "title":"Interruptible U-Boot via UART; init=/bin/sh in cmdline",
              "cvss":9, ...}
   run.status = failed
[subagent IOT-HW-JTAG-001:iot-cam-01:main-jtag]
   surface = [halt_core, resume_core, read_memory, read_registers,
              dump_firmware, analyze_entropy, scan_for_secrets,
              extract_strings, find_base_address]
   call halt_core()                    -> True
   call dump_firmware("0x08000000", 1048576, "iotcam-flash.bin")
                                       -> blob descriptor (1 MiB)
   call analyze_entropy(file=...)      -> {overall: 7.85,
                                            high_entropy_blocks: [...]}
   call extract_strings(file=...,8)    -> top: ["root:$1$...", "wpa_supplicant ...",
                                                 "BEGIN OPENSSH PRIVATE KEY"]
   call scan_for_secrets(file=...)     -> {pem_block: ["0x40123", "0x40e80"],
                                            aws_key: []}
   call find_base_address(file=..., arch="arm",
                          min_addr=0x08000000, max_addr=0x40000000)
                                       -> {top_5: {"0x08000000": 412, ...}}
   verdict = {"verdict":"failed",
              "title":"Firmware contains private keys + crypted root pwd
                       hash; flash is openly dumpable via JTAG",
              "cvss":9, ...}
   run.status = failed
[orchestrator] checkpointed after each run -> .wintermute_data/...
[orchestrator] report_node:
   ./reports/acme-iotcam-2026-Q2.docx
   - 9 runs completed (3 failed with vulnerabilities, 6 passed)
   - 6 runs blocked/deferred (operator review)
   - 2 runs not_applicable

		

What the operator gets, all from a Bugzilla ticket:

3 high-CVSS findings, each with reproduction steps the JTAG / I²C
cartridge can replay,
a checkpointed Operation they can re-open, modify, and re-render,
a polished DOCX with their template branding,
6 deferred runs with explicit reasons: exactly the items needing the
operator’s attention.

This is what the framework was built for. The orchestrator and sub-agents above are roughly 350 lines of Python; everything else (the engagement model, persistence, peripherals, cartridges, MCP, RAG, reports) is Wintermute itself.

Where to Take This Next

A few extensions worth noting; each is one or two posts of follow-up.

Cross-run reasoning. Have the orchestrator re-summarize after every
N runs and re-prioritize the queue. (“We just found JTAG dumpable, escalate
any test that benefits from the dump.”) The data is all on the
Operation; only the prompt shape changes.
Surgeon-driven hypothesis testing. When a finding suggests a
parsing bug (e.g., a field on an I²C-loaded config triggers an
__assert_fail), spawn a Surgeon-backed sub-agent that calls
create_hook_skeleton + build_firmware + start_fuzzing to confirm
reproducibility from the emulator. This is exactly what the integration
in Part 4 was set up for.
Multi-engagement portfolio orchestrator. Same orchestrator, parallel
engagements. Wintermute’s Operation.register_backend(... make_default=True)
is global — for portfolio mode, you instead want a per-engagement
context (an OperationContext stack would be a small refactor) so two
runs against different operations don’t trample each other’s defaults.
Retest mode. Re-run only the runs that produced findings, by replaying
their ReproductionStep records. The tool/argument arrays are precisely
what registry.call consumes; retest is a loop over each finding's reproduction_steps that calls registry.call(step.tool, ...) and compares the result against the original output.
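
A retest loop along those lines might look like the sketch below. It is self-contained and makes several assumptions: Step is a hypothetical stand-in for ReproductionStep (with arguments stored as a one-element list of JSON, as the sub-agent body writes them), the call parameter stands in for registry.call, and the baseline outputs are assumed to have been saved somewhere at the time of the original run.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class Step:  # hypothetical stand-in for ReproductionStep
    tool: str
    arguments: list[str] = field(default_factory=list)  # [json.dumps(args)]


def retest(steps: list[Step], call: Callable[[str, dict], Any],
           baseline: list[Any]) -> list[dict]:
    """Replay each step through `call` and compare against recorded output."""
    report = []
    for step, expected in zip(steps, baseline):
        args = json.loads(step.arguments[0]) if step.arguments else {}
        try:
            actual = call(step.tool, args)
        except Exception as exc:
            actual = {"error": str(exc)}  # a failing replay counts as a change
        report.append({"tool": step.tool, "changed": actual != expected})
    return report


# Usage with a fake registry; a real run would pass registry.call instead.
fake_tools = {
    "detect": lambda: [80, 81],
    "dump_eeprom": lambda address, size: {"size": size},
}
steps = [
    Step("detect"),
    Step("dump_eeprom", [json.dumps({"address": 80, "size": 256})]),
]
report = retest(steps, lambda name, args: fake_tools[name](**args),
                baseline=[[80, 81], {"size": 128}])
print(report)
```

A step whose replayed output differs from the baseline is flagged as changed, which gives the operator a short list of findings to re-examine rather than a full re-run.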

Series Wrap

The series has built up the case that Wintermute is not “a Python LLM client with hardware drivers” — it is a framework for engagement state, with deliberately decoupled subsystems (router, RAG, tools, cartridges, MCP, storage, tickets, reports), and the AI sits as a peer of the operator on top of all of it. The orchestrator + sub-agent split we just walked through is genuinely small because the framework underneath does most of the work.

If you build on this, two things will keep you out of trouble:

Always write through the operation graph. Anything you discover
should land on Operation somewhere — Vulnerability, TestCaseRun.findings,
Device.peripherals. Findings off the graph are findings the report
cannot render and the retest cannot replay.
Curate every agent’s tool surface. The four-tool surface in
the I²C sub-agent above is not a constraint — it is the reason the
sub-agent terminates correctly. Resist the temptation to hand the LLM
the entire global registry.

Source: https://github.com/nahualito/wintermute. Documentation: https://nahualito.github.io/wintermute/. Series home: https://exploit.ninja.

— fin.
