Wintermute Framework, Part 5: A First Agentic Flow

Across Parts 1–4 we built up the plumbing: the engagement data model, the console, the AI router, the cartridges and MCP. Now we build the first actual agent.

The intentional constraint of this post is one prompt, one tool-calling loop, one test case. We are not yet looping over the whole test plan (Part 6) or dispatching sub-agents (Part 7). The point is to make the single-step pattern bulletproof so the orchestrator above it has a known primitive to compose.

What “Agentic” Means in Wintermute

For our purposes, an agent is the standard loop:

			
1. Build a ChatRequest with messages + tool specs
2. provider.chat(req) → ChatResponse
3. If response has tool_calls:
       for each call: result = ToolsRuntime.run_tool(name, args)
       append a tool message; goto 2
4. Otherwise: stop. The response.content is the answer.

		

That is the loop we want. The framework gives us most of it for free — Router.choose, tool_calling_chat, and ToolsRuntime.run_tool. What we have to write is:

the system prompt that explains the engagement context,
the tool selection (a curated subset of the global registry),
the post-loop interpretation that turns the model’s final answer into
mutations on the live Operation.

The Test Case We Will Automate

Continuing the IoT-camera engagement from Part 2, we automate a single case: IOT-HW-I2C-001:iot-cam-01:mcio-eeprom-1 — “extract the I²C EEPROM on the MCIO board and find anything sensitive.”

In Part 4 we wrote I2CCartridge and walked through the calls manually. Here, the LLM picks the calls itself.

The Code, Top to Bottom

			
"""
Part 5: One-shot agent that executes a single TestCaseRun on the
IoT-camera operation. Uses Wintermute's tool_calling_chat against
Bedrock Claude with the i2c + firmware_analysis cartridges loaded.
Pre-conditions:
  - operation `acme-iotcam-2026-Q2` already saved (Part 2).
  - JsonFileBackend registered.
  - AWS_REGION + Bedrock model configured.
"""
from __future__ import annotations
import logging
from datetime import datetime, timezone
from wintermute.ai.bootstrap import init_router
from wintermute.ai.tools_runtime import tools as registry, spec_from_tool
from wintermute.ai.types import ChatRequest, Message, ToolSpec
from wintermute.ai.use import tool_calling_chat
from wintermute.backends.json_storage import JsonFileBackend
from wintermute.cartridges.manager import CartridgeManager
from wintermute.core import Operation, RunStatus
from wintermute.findings import ReproductionStep, Vulnerability
log = logging.getLogger("wintermute.agent.part5")
logging.basicConfig(level=logging.INFO)
# ---------------------------------------------------------------------------
# 0. Engagement load
# ---------------------------------------------------------------------------
Operation.register_backend(
    "json", JsonFileBackend(base_path="./.wintermute_data"), make_default=True
)
op = Operation("acme-iotcam-2026-Q2")
op.load()
# Find the run we want to drive.
target_run_id = "IOT-HW-I2C-001:iot-cam-01:mcio-eeprom-1"
run = next(r for r in op.test_runs if r.run_id == target_run_id)
tc  = next(t for t in op.iterTestCases() if t.code == run.test_case_code)
peripheral = next(p for p in op.getDeviceByHostname("iot-cam-01").peripherals
                  if p.name == "mcio-eeprom-1")
# ---------------------------------------------------------------------------
# 1. AI bootstrap + cartridge load (capability surface)
# ---------------------------------------------------------------------------
router = init_router()                                # Bedrock default
mgr = CartridgeManager()
for name in ("i2c", "firmware_analysis"):
    if name not in mgr.list_loaded():
        mgr.load(name)
# A curated tool list — we DO NOT hand the agent the full global registry.
# The 80+ MCP tools would balloon the prompt and tempt the model into
# unrelated work.
allowed_tools = ("detect", "dump_eeprom",
                 "extract_strings", "scan_for_secrets")
tool_specs: list[ToolSpec] = [
    spec_from_tool(t)
    for name, t in registry._tools.items()
    if name in allowed_tools
]
log.info("Tools available to agent: %s", [t.name for t in tool_specs])
# ---------------------------------------------------------------------------
# 2. System prompt — engagement-grounded
# ---------------------------------------------------------------------------
system = f"""You are an autonomous hardware security analyst working a
sanctioned penetration test for ACME on engagement
{op.operation_name!r}.
Scope (the only thing you may touch in this turn):
  - Device : {peripheral.__dict__.get('name', 'unknown')!r} on host iot-cam-01
  - Bus    : I2C-2
  - Address: 0x{0x50:02x}
  - Test case: {tc.code} — {tc.name}
Available tools:
  detect()                              -> list of int (responding I2C addresses on bus 2)
  dump_eeprom(address: int, size: int)  -> blob descriptor (file_path/sha256/size)
  extract_strings(file_path: str, min_length: int) -> top printable strings + counts
  scan_for_secrets(file_path: str)      -> regex+byte-pattern hits (PEM, AWS keys, ...)
Process:
1. detect() to confirm the EEPROM is at 0x50.
2. dump_eeprom(0x50, 256). Pass the returned `file_path` to subsequent
   tools — never embed bytes in your reply.
3. Run extract_strings (min_length=8) and scan_for_secrets on the dump.
4. Decide whether the dump contains a security-relevant artifact:
   credentials, private keys, hardcoded URLs, command tokens.
5. Reply with a JSON object of the form:
     {{
       "verdict": "passed" | "failed",
       "title": "<concise vuln title or empty>",
       "description": "<2-3 sentences>",
       "cvss": <int 0-10>,
       "evidence_offsets": ["0x<hex>", ...],
       "interesting_strings": ["<str>", ...]
     }}
   `passed` means no finding; `failed` means a vulnerability is recorded.
Constraints:
- Do not call any tool outside the four above.
- Do not invent reproduction steps; the framework records them from your
  tool calls.
- Stop after one diagnostic round-trip.
"""
# ---------------------------------------------------------------------------
# 3. Run — start the test run, dispatch the loop, parse the verdict
# ---------------------------------------------------------------------------
run.status = RunStatus.in_progress
run.start()
run.executed_by = "wintermute-agent-part5"
messages = [
    Message(role="system", content=system),
    Message(role="user",
            content=f"Execute test case {tc.code} now. Reply with the JSON verdict only."),
]
# Loop: tool_calling_chat handles one round; we drive the round-trip until
# the model returns content with no tool_calls.
import json
while True:
    resp = tool_calling_chat(router, messages, tool_specs, response_format="text")
    if not resp.tool_calls:
        verdict = json.loads(resp.content)
        break
    # Execute every tool the model requested, append tool messages, loop.
    messages.append(
        Message(role="assistant",
                content=resp.content or "",
                tool_call_id=None, tool_name=None)
    )
    for call in resp.tool_calls:
        result = registry.call(call.name, dict(call.arguments))
        messages.append(
            Message(role="tool",
                    content=json.dumps(result, default=str),
                    tool_name=call.name,
                    tool_call_id=call.id)
        )
# ---------------------------------------------------------------------------
# 4. Write the verdict back into the live operation
# ---------------------------------------------------------------------------
if verdict["verdict"] == "failed":
    vuln = Vulnerability(
        title=verdict["title"],
        description=verdict["description"],
        cvss=int(verdict["cvss"]),
        threat="unauthorized device access via static artifact",
        verified=True,
        reproduction_steps=[
            ReproductionStep(
                title="Probe I2C bus 2",
                tool="i2c.detect", action="probe",
                confidence=85, arguments=[]),
            ReproductionStep(
                title="Read 256 bytes from 0x50",
                tool="i2c.dump_eeprom", action="read",
                confidence=90, arguments=["0x50", "256"]),
            ReproductionStep(
                title="Static analysis: strings + secret scan",
                tool="firmware_analysis", action="analyze",
                confidence=80,
                arguments=verdict.get("evidence_offsets", [])),
        ],
    )
    run.findings.append(vuln)
    run.status = RunStatus.failed
else:
    run.status = RunStatus.passed
run.notes = (
    f"agent={run.executed_by} ts={datetime.now(timezone.utc).isoformat()} "
    f"strings_seen={len(verdict.get('interesting_strings', []))} "
    f"verdict={verdict['verdict']}"
)
run.finish()
op.save()
log.info("Test run %s -> %s", run.run_id, run.status.value)

		

Run this against a real EEPROM (or the in-memory transport pattern from examples/07-Programmatic-Hardware-Cartridges.ipynb) and the agent loop will:

call detect() → see 0x50,
call dump_eeprom(0x50, 256) → get a blob descriptor,
call extract_strings(file_path=..., min_length=8) → see something like
["admin:hunter2", "http://10.0.0.1/api/v1/login"],
call scan_for_secrets(file_path=...) → maybe see no secret, maybe see
PEM offsets,
reply with JSON (e.g., {"verdict": "failed", "title": "Hardcoded credentials in MCIO EEPROM", "cvss": 8, ...}),

and the post-loop block writes a Vulnerability with ReproductionSteps into the TestCaseRun, marks the run failed, and persists.

This is the first usable agent. It has scope, structure, real side-effects on the operation, and reproduction steps. It is also the exact pattern the orchestrator (Part 6) and the per-run sub-agents (Part 7) call internally — one agent, one run.

What This Agent Already Does Right

A few decisions in the code above are not obvious but matter:

Curated tool list. Only the four cartridge methods relevant to this
test case are in tool_specs. This is critical for cost (smaller prompts),
determinism (the model can’t “improvise” with run_ssh_command), and
blast radius (a misfire can’t reach Surgeon’s start_fuzzing). Part 7’s
sub-agent generalizes this with select_tools_for_test_case(tc).
Engagement-grounded system prompt. The prompt includes the actual
device, peripheral, address, and test-case code from the live operation —
not a generic “you are a pentester” preamble. The agent’s output is
concrete because its input is.
JSON verdict, not free text. We parse the verdict deterministically.
The framework cares about typed fields (cvss: int, status: RunStatus),
so the agent must produce typed output. This is what makes the result
attachable to a TestCaseRun instead of pasted into a notes field.
reproduction_steps reflect the actual tool calls, not the model’s
prose. A reproduction step is an artifact a different analyst replays
six months later — its tool, action, arguments are the cartridge
call shape, exactly the thing Part 7 will replay for retest passes.

Failure Modes and How the Framework Helps

A few things go wrong with single-loop agents in practice. Each has a corresponding Wintermute primitive:

Tool-call infinite loops. The model keeps calling tools without
emitting a JSON verdict. Mitigation: cap iterations (a for _ in range(max_iters): ... around the while True is one line); the agent
loop in Part 7 also enforces a wall-clock budget.
Model returns junk JSON. Wrap json.loads in try/except, push a
retry message (“Your previous reply was not valid JSON. Reply only with
the schema.”). One retry is usually enough; two is the upper bound.
Tool returns are huge. Already handled — tool_factory._maybe_offload_payload
routes raw bytes and >1 KiB strings into the workspace and hands the LLM
a compact descriptor. You almost never need to truncate manually.
Provider rate limit. Wintermute ships wintermute/ai/retry.py
(Backoff); the cloud providers wrap their chat() in it. Backoff
defaults are conservative; tune in bedrock_provider.py /
groq_provider.py per engagement.

A Useful Variant: Read-Only Recon Agent

Same loop, different tool list, no mutation. This is the agent you run during recon when you don’t want any state writes:

			
tool_specs = [
    spec_from_tool(t) for name, t in registry._tools.items()
    if name in ("detect", "extract_strings", "scan_for_secrets",
                "ai_list_test_runs")    # console-bound read-only AI tool
]
system = "You are a recon analyst. Only read. ..."
# ... same loop, no `run.findings.append(...)`.

		

The model still gets the engagement context, still chooses tools, still emits a JSON verdict — but the post-loop simply prints the verdict instead of mutating the run. Useful for “what should we test next?”-style brainstorming where letting the LLM commit to a finding is premature.

What’s Next

Part 6 lifts this pattern up. We will read a Bugzilla/Jira ticket through Ticket.read(...), derive scope, attach a TestPlan, generate TestCaseRuns, and enqueue one Part-5-style agent per run. The orchestrator is the loop above this loop; the per-run agent is exactly what we wrote here.

Part 7 takes the next step: the orchestrator dispatches sub-agents per TestCaseRun, each with its own curated tool surface tailored to the test case (UART → JTAG cartridge + SSH; I²C → I2CCartridge + firmware_analysis; cloud IAM → AWS-only tools), and the orchestrator reasons over the merged results.

One response to “Wintermute Framework, Part 5: A First Agentic Flow — One Tool-Calling Loop”

Wintermute Framework, Part 6: The Orchestrator — Ticket to TestPlan to TestCaseRun Fan-Out – Exploit.Ninja

May 11, 2026 at 7:17 am

[…] Part 5 gave us a single-test-case agent. In this post we build the orchestrator: a higher-level agent that ingests a ticket, derives engagement scope, attaches a TestPlan, generates TestCaseRuns, picks a starting test, and sequences the work. This is the backbone the per-run sub-agents in Part 7 will plug into. […]

Loading…

One response to “Wintermute Framework, Part 5: A First Agentic Flow — One Tool-Calling Loop”

Leave a ReplyCancel reply

Hey!

Join the club

Categories

Tags

Recent Posts

Wintermute Framework, Part 9: Attacking U-Boot Over UART — init=/bin/bash via bootargs Injection

Wintermute Framework, Part 8: U-Boot Secure Boot Testing With the Depthcharge Backend

Wintermute Framework, Part 7: Per-Test-Case Sub-Agents

Blogroll

Wintermute Framework, Part 5: A First Agentic Flow — One Tool-Calling Loop

Wintermute Framework, Part 5: A First Agentic Flow

What “Agentic” Means in Wintermute

The Test Case We Will Automate

The Code, Top to Bottom

What This Agent Already Does Right

Failure Modes and How the Framework Helps

A Useful Variant: Read-Only Recon Agent

What’s Next

Share this:

Like this:

One response to “Wintermute Framework, Part 5: A First Agentic Flow — One Tool-Calling Loop”

Leave a ReplyCancel reply

Hey!

Join the club

Categories

Tags

Recent Posts

Wintermute Framework, Part 9: Attacking U-Boot Over UART — init=/bin/bash via bootargs Injection

Wintermute Framework, Part 8: U-Boot Secure Boot Testing With the Depthcharge Backend

Wintermute Framework, Part 7: Per-Test-Case Sub-Agents

Blogroll

Discover more from Exploit.Ninja