Wintermute Framework, Part 1: Architecture and the Engagement Data Model

This is the opening post of a multi-part series on Wintermute, an open-source agentic framework for hardware penetration testing and red team engagements (MIT-licensed, docs, source). The series is published at exploit.ninja.

The objective of the series is not “how to run an LLM,” it is how to drive a real engagement with one — from a Bugzilla ticket through hardware test plan generation, peripheral-level execution, and a DOCX deliverable. Every post adds to the previous one; by the end of the series we will have a working orchestrator that reads a ticket, derives scope, generates TestCaseRuns, and dispatches a sub-agent per test case to execute reproduction steps over UART/JTAG/MCP and write findings back into a live Operation.

This first post is the mental model. If you skim, skim this one — every later post assumes the vocabulary defined here.

What Wintermute Actually Is

Wintermute is a Python 3.11+ engagement-management library plus two binaries:

Binary	Module	Purpose
`wintermute`	`wintermute.WintermuteConsole:main`	Metasploit-style REPL with builder/context navigation.
`wintermute-mcp`	`wintermute.WintermuteMCP:main`	MCP server (SSE or stdio) exposing 80+ tools.

It is provider-agnostic — Router selects between AWS Bedrock (Claude / DeepSeek / Llama), OpenAI, Groq, and a local HuggingFace embedding provider — and backend-agnostic: storage, tickets, and reports are Python Protocols with at least two implementations each. You can run the entire stack offline with local_embedder + JsonFileBackend + InMemoryBackend + DocxTplPerVulnBackend, and swap any of those four for a cloud equivalent without touching the agent code.

The Architecture, From Top to Bottom

                ┌──────────────────────────────────────────────────────┐
                │   WintermuteConsole       │   WintermuteMCP server   │
                │   (REPL + builder ctx)    │   (SSE / stdio, 80+ tools)│
                └──────────────────────────┬┴──────────────────────────┘
                                           │
                                ┌──────────▼──────────┐
                                │      Operation      │  ← top-level aggregate
                                │  Devices · Users    │
                                │  Analysts · Cloud   │
                                │  TestPlans · Runs   │
                                │  Vulnerabilities    │
                                └──┬──────────┬───────┘
                                   │          │
        ┌──────────────────────────▼──┐    ┌──▼─────────────────────────┐
        │       AI Subsystem          │    │  Storage / Ticket / Report │
        │                             │    │   Backend Protocols        │
        │  Router → LLMRegistry       │    │                            │
        │     • bedrock               │    │  Json / DynamoDB           │
        │     • openai                │    │  InMemory / Bugzilla       │
        │     • groq                  │    │  DocxTplPerVuln            │
        │     • local_embedder        │    │                            │
        │     • rag-<kb_name>         │    └────────────────────────────┘
        │                             │
        │  ToolsRuntime               │
        │     • ToolRegistry (local)  │
        │     • MCPRuntime backends   │
        │     • SurgeonBackend        │
        │                             │
        │  CartridgeManager           │
        │     • tpm20                 │
        │     • jtag                  │
        │     • firmware_analysis     │
        └─────────────────────────────┘

Three things to internalize before going further:

Operation is the only persistent state. Devices, services, peripherals,
cloud accounts, vulnerabilities, test plans, and test runs are all attached
to one Operation object that knows how to serialize itself via to_dict()
/ from_dict(). Every backend (JSON, DynamoDB, MCP transport) round-trips
that single dict.
The AI is a peer of the operator, not a layer on top of the framework.
The same tools_runtime.ToolRegistry that an LLM hits via tool-calling is
the registry the human operator drives via cartridges run <name> <fn> in
the REPL. Cartridge methods are introspected and registered as AI tools
through wintermute.cartridges.manager.CartridgeManager._register_instance_methods
— there is no second tool surface for the AI.
Everything pluggable is a Protocol. StorageBackend, TicketBackend,
ReportBackend, ToolBackend, LLMProvider — all five are typing.Protocols.
New providers, new ticket systems, new vector stores plug in by implementing
the protocol and calling register_backend(...).

The Engagement Data Model — Verbatim From `core.py`

Wintermute does not invent a new schema. The objects are dataclass-style models with __schema__ attributes that describe how to deserialize nested types. Here is the canonical hierarchy, with line references in wintermute/core.py:

			
Operation                                       core.py:621
 ├── analysts:        list[Analyst]             core.py:365
 ├── devices:         list[Device]              core.py:161
 │     ├── services:       list[Service]        core.py:61
 │     │     └── vulnerabilities:               findings.py:114
 │     │           └── reproduction_steps:      findings.py:36
 │     ├── peripherals:    list[Peripheral]     peripherals.py
 │     │     └── (UART, JTAG, SPI, I2C, SWD, TPMPeripheral, USB, PCIe,
 │     │          Bluetooth, Wifi, Zigbee, Ethernet)
 │     ├── processor:      Processor            hardware.py:42
 │     │     └── architecture: Architecture     hardware.py:32
 │     └── memory:         Memory               hardware.py:58
 ├── users:           list[User]                core.py:274
 ├── cloud_accounts:  list[AWSAccount|...]      cloud/aws.py
 │     ├── iamusers, iamroles, services, vulnerabilities
 ├── test_plans:      list[TestPlan]            core.py:534
 │     ├── test_cases: list[TestCase]           core.py:495
 │     │     ├── target_scope: TargetScope      core.py:480
 │     │     │     └── bindings: list[ObjectSelector]   core.py:457
 │     │     ├── steps: list[ReproductionStep]
 │     │     ├── execution_mode: ExecutionMode
 │     │     │   ("once" | "per_device" | "per_binding")
 │     │     └── execution_binding: str
 │     └── test_plans: list[TestPlan]   ← test plans nest
 └── test_runs:       list[TestCaseRun]         core.py:572
       ├── status: RunStatus
       │   ("not_run" | "in_progress" | "passed" | "failed"
       │    | "blocked" | "not_applicable")
       ├── started_at / ended_at: datetime
       ├── bound: list[BoundObjectRef]
       └── findings: list[Vulnerability]

		

A few things stand out compared to typical pentest tooling:

Test plans are declarative JSON, with the schema implemented as Python
dataclasses. The
TestPlans/
directory in the repo ships seven plans; TP-HW-BLACKBOX-001.json is a
full hardware blackbox playbook (board survey → debug interfaces → buses
→ boot chain → TPM 2.0).
target_scope.bindings are selectors, not pointers, resolved against
the live Operation at run-generation time. A TestCase declares “I need
one device tagged dut and many UART peripherals on it”; Operation.resolveBindings
(core.py:805) walks the operation’s devices, services, peripherals,
IAM users, IAM roles, and cloud services to find matches and validates
cardinality.
ExecutionMode controls fan-out: once produces a single run, per_device
one run per matched DUT, and per_binding one run per matched peripheral.
This is the lever the orchestrator (Part 6 / 7) will pull to dispatch a
sub-agent per UART, per JTAG TAP, per IAM role.

Concretely, here is the same shape from the included examples/02-Operations-and-Storage.ipynb:

			
from wintermute.core import Operation
op = Operation("acme-pentest-2025")
op.addAnalyst("Jane Doe", "jdoe", "jane@acme.com")
op.addUser(uid="rsmith", name="Robert Smith",
           email="robert@acme.com", teams=["red-team"])
op.addDevice("web-server-01", "10.0.1.10", operatingsystem="Ubuntu 22.04")
dev = op.getDeviceByHostname("web-server-01")
dev.addService(portNumber=443, app="nginx",
               protocol="ipv4", transport_layer="HTTPS")
svc = dev.services[0]
svc.addVulnerability(
    title="SQL Injection in /api/search",
    cvss=9,
    description="Union-based SQLi in search parameter",
    risk={"likelihood": "High", "impact": "High", "severity": "Critical"},
)

		

Nothing AI yet — and that is the point. For the rest of the series, anything we build with the AI lands inside this same object graph.

How a Query Becomes a Tool Call

When a query enters the system (REPL ai <prompt>, MCP ai_chat, or programmatic simple_chat(...)), it follows this path. I’m walking through it because the orchestrator we build later overrides each of these stages.

Router.choose(req) (wintermute/ai/provider.py) selects a provider.
The default is Bedrock; if req.task_tag contains "cheap" and a Groq
provider is registered, it routes to Groq instead. router.set_default(provider="rag-<kb>")
pins the route to a RAG knowledge base for subsequent calls — this is how
the console’s ai rag use <name> toggles document grounding live.
If the chosen provider is a RAGProvider (any provider whose name
starts with rag-), the last user message is sent to the provider’s
LlamaIndex query engine. Retrieved chunks are injected as a preamble; the
augmented request is forwarded to the provider’s configured base_provider
(Bedrock, OpenAI, or Groq). RAG and tool-calling are independent — the
tools list is forwarded unchanged so the base LLM sees both retrieved
context and available tools.
provider.chat(req) → ChatResponse comes back via LiteLLM. If the
response includes tool_calls, the consumer enters the standard
“execute → reply with tool message → ask again” loop.
ToolsRuntime.run_tool(name, args) (wintermute/ai/tools_runtime.py:198)
is the dispatcher. It first asks every registered dynamic backend whether
it owns the tool (Surgeon over MCP, any MCPRuntime-managed stdio server,
etc.) and falls back to the local ToolRegistry only if no backend claims
it. This is what makes adding a new MCP server a runtime operation:
mcp register <name> <command> [args...] in the REPL plus mcp start <name>
and the LLM immediately sees the new tools.

The single most useful thing about this layout for an offensive engineer is that the LLM cannot tell apart a Python tool, a binary wrapped via tools.json path mapping, a cartridge method on a loaded plugin, or a tool sitting on a remote MCP server. They are all {"type": "function", "function": {...}} entries in ToolsRuntime.get_all_tools(). We exploit this in Part 4 to drop custom red-team tools next to vendor MCPs without writing any glue.

Why This Layout Matters for Pentests and Red Teams

Tools that wrap an LLM around bash and call it a “pentest agent” tend to fail on three fronts:

No engagement state. They run shell commands, dump output, ask another
question. There is no TestCaseRun to mark failed, no Vulnerability to
attach to a service, no DOCX to deliver.
No scoping primitives. Real engagements need scope; “test only IoT cameras
tagged dut with UART peripherals exposed on header J3″ is not something
you express with a system prompt. Wintermute’s ObjectSelector does.
No reproducibility. A finding without a ReproductionStep is folklore.
Wintermute attaches ReproductionStep objects to Vulnerability records,
each with tool, action, arguments, vulnOutput, and confidence —
structured enough for the same agent (or a human, or a re-test six months
later) to replay.

This is why every step of the orchestrator we will build by Part 7 writes back into the Operation. The DOCX produced at the end is not a separate artifact; it is Report.save(spec, [op], "out.docx") walking the operation tree we have been mutating the whole time.

What’s Next

Part	Topic
1	Architecture and the engagement data model ← you are here
2	Operations, storage backends, the console — driving the framework by hand
3	The AI subsystem: Router, providers, RAG, and the tool registry
4	Cartridges, MCP, Surgeon — turning hardware tools into LLM tools
5	A first agentic flow — single-prompt enrichment + tool calling
6	The orchestrator — ticket → scope → TestPlan → TestCaseRun fan-out
7	Per-test-case sub-agents — one agent per UART, per JTAG, per IAM role

Each post adds code to the previous one. By Part 7 we will have a working attacker-perspective workflow: drop a Bugzilla ticket “Audit MCIO I2C EEPROM on rasp1”, and the orchestrator pulls the ticket, builds a TestPlan from TestPlans/TP-HW-BLACKBOX-001.json, generates TestCaseRuns for every matching peripheral, dispatches a sub-agent per run that picks the right tools (JTAG cartridge, firmware analysis cartridge, Surgeon hooks, SSH session) to execute the reproduction steps, attaches Vulnerability objects with ReproductionStep records to the live Operation, marks each run passed/failed/blocked, and emits the DOCX report.

Bring a soldering iron to Part 4.

Leave a ReplyCancel reply

Hey!

Join the club

Categories

Tags

Recent Posts

Wintermute Framework, Part 9: Attacking U-Boot Over UART — init=/bin/bash via bootargs Injection

Wintermute Framework, Part 8: U-Boot Secure Boot Testing With the Depthcharge Backend

Wintermute Framework, Part 7: Per-Test-Case Sub-Agents

Blogroll