Wintermute Framework, Part 1: Architecture and the Engagement Data Model
This is the opening post of a multi-part series on Wintermute, an open-source agentic framework for hardware penetration testing and red team engagements (MIT-licensed, docs, source). The series is published at exploit.ninja.
The objective of the series is not “how to run an LLM,” it is how to drive a
real engagement with one — from a Bugzilla ticket through hardware test plan
generation, peripheral-level execution, and a DOCX deliverable. Every post adds
to the previous one; by the end of the series we will have a working orchestrator
that reads a ticket, derives scope, generates TestCaseRuns, and dispatches a
sub-agent per test case to execute reproduction steps over UART/JTAG/MCP and
write findings back into a live Operation.
This first post is the mental model. If you skim, skim this one — every later post assumes the vocabulary defined here.
What Wintermute Actually Is
Wintermute is a Python 3.11+ engagement-management library plus two binaries:
| Binary | Module | Purpose |
|---|---|---|
wintermute | wintermute.WintermuteConsole:main | Metasploit-style REPL with builder/context navigation. |
wintermute-mcp | wintermute.WintermuteMCP:main | MCP server (SSE or stdio) exposing 80+ tools. |
It is provider-agnostic — Router selects between AWS Bedrock (Claude /
DeepSeek / Llama), OpenAI, Groq, and a local HuggingFace embedding provider —
and backend-agnostic: storage, tickets, and reports are Python Protocols
with at least two implementations each. You can run the entire stack offline
with local_embedder + JsonFileBackend + InMemoryBackend + DocxTplPerVulnBackend,
and swap any of those four for a cloud equivalent without touching the agent code.
The Architecture, From Top to Bottom
┌──────────────────────────────────────────────────────┐
│ WintermuteConsole │ WintermuteMCP server │
│ (REPL + builder ctx) │ (SSE / stdio, 80+ tools)│
└──────────────────────────┬┴──────────────────────────┘
│
┌──────────▼──────────┐
│ Operation │ ← top-level aggregate
│ Devices · Users │
│ Analysts · Cloud │
│ TestPlans · Runs │
│ Vulnerabilities │
└──┬──────────┬───────┘
│ │
┌──────────────────────────▼──┐ ┌──▼─────────────────────────┐
│ AI Subsystem │ │ Storage / Ticket / Report │
│ │ │ Backend Protocols │
│ Router → LLMRegistry │ │ │
│ • bedrock │ │ Json / DynamoDB │
│ • openai │ │ InMemory / Bugzilla │
│ • groq │ │ DocxTplPerVuln │
│ • local_embedder │ │ │
│ • rag-<kb_name> │ └────────────────────────────┘
│ │
│ ToolsRuntime │
│ • ToolRegistry (local) │
│ • MCPRuntime backends │
│ • SurgeonBackend │
│ │
│ CartridgeManager │
│ • tpm20 │
│ • jtag │
│ • firmware_analysis │
└─────────────────────────────┘
Three things to internalize before going further:
Operationis the only persistent state. Devices, services, peripherals,
cloud accounts, vulnerabilities, test plans, and test runs are all attached
to oneOperationobject that knows how to serialize itself viato_dict()
/from_dict(). Every backend (JSON, DynamoDB, MCP transport) round-trips
that single dict.The AI is a peer of the operator, not a layer on top of the framework.
The sametools_runtime.ToolRegistrythat an LLM hits via tool-calling is
the registry the human operator drives viacartridges run <name> <fn>in
the REPL. Cartridge methods are introspected and registered as AI tools
throughwintermute.cartridges.manager.CartridgeManager._register_instance_methods
— there is no second tool surface for the AI.Everything pluggable is a
Protocol.StorageBackend,TicketBackend,ReportBackend,ToolBackend,LLMProvider— all five aretyping.Protocols.
New providers, new ticket systems, new vector stores plug in by implementing
the protocol and callingregister_backend(...).
The Engagement Data Model — Verbatim From core.py
Wintermute does not invent a new schema. The objects are dataclass-style models
with __schema__ attributes that describe how to deserialize nested types.
Here is the canonical hierarchy, with line references in
wintermute/core.py:
Operation core.py:621 ├── analysts: list[Analyst] core.py:365 ├── devices: list[Device] core.py:161 │ ├── services: list[Service] core.py:61 │ │ └── vulnerabilities: findings.py:114 │ │ └── reproduction_steps: findings.py:36 │ ├── peripherals: list[Peripheral] peripherals.py │ │ └── (UART, JTAG, SPI, I2C, SWD, TPMPeripheral, USB, PCIe, │ │ Bluetooth, Wifi, Zigbee, Ethernet) │ ├── processor: Processor hardware.py:42 │ │ └── architecture: Architecture hardware.py:32 │ └── memory: Memory hardware.py:58 ├── users: list[User] core.py:274 ├── cloud_accounts: list[AWSAccount|...] cloud/aws.py │ ├── iamusers, iamroles, services, vulnerabilities ├── test_plans: list[TestPlan] core.py:534 │ ├── test_cases: list[TestCase] core.py:495 │ │ ├── target_scope: TargetScope core.py:480 │ │ │ └── bindings: list[ObjectSelector] core.py:457 │ │ ├── steps: list[ReproductionStep] │ │ ├── execution_mode: ExecutionMode │ │ │ ("once" | "per_device" | "per_binding") │ │ └── execution_binding: str │ └── test_plans: list[TestPlan] ← test plans nest └── test_runs: list[TestCaseRun] core.py:572 ├── status: RunStatus │ ("not_run" | "in_progress" | "passed" | "failed" │ | "blocked" | "not_applicable") ├── started_at / ended_at: datetime ├── bound: list[BoundObjectRef] └── findings: list[Vulnerability]
A few things stand out compared to typical pentest tooling:
- Test plans are declarative JSON, with the schema implemented as Python
dataclasses. TheTestPlans/
directory in the repo ships seven plans;TP-HW-BLACKBOX-001.jsonis a
full hardware blackbox playbook (board survey → debug interfaces → buses
→ boot chain → TPM 2.0). target_scope.bindingsare selectors, not pointers, resolved against
the liveOperationat run-generation time. ATestCasedeclares “I need
one device taggeddutand many UART peripherals on it”;Operation.resolveBindings
(core.py:805) walks the operation’s devices, services, peripherals,
IAM users, IAM roles, and cloud services to find matches and validates
cardinality.ExecutionModecontrols fan-out:onceproduces a single run,per_device
one run per matched DUT, andper_bindingone run per matched peripheral.
This is the lever the orchestrator (Part 6 / 7) will pull to dispatch a
sub-agent per UART, per JTAG TAP, per IAM role.
Concretely, here is the same shape from the included
examples/02-Operations-and-Storage.ipynb:
from wintermute.core import Operationop = Operation("acme-pentest-2025")op.addAnalyst("Jane Doe", "jdoe", "jane@acme.com")op.addUser(uid="rsmith", name="Robert Smith", email="robert@acme.com", teams=["red-team"])op.addDevice("web-server-01", "10.0.1.10", operatingsystem="Ubuntu 22.04")dev = op.getDeviceByHostname("web-server-01")dev.addService(portNumber=443, app="nginx", protocol="ipv4", transport_layer="HTTPS")svc = dev.services[0]svc.addVulnerability( title="SQL Injection in /api/search", cvss=9, description="Union-based SQLi in search parameter", risk={"likelihood": "High", "impact": "High", "severity": "Critical"},)
Nothing AI yet — and that is the point. For the rest of the series, anything we build with the AI lands inside this same object graph.
How a Query Becomes a Tool Call
When a query enters the system (REPL ai <prompt>, MCP ai_chat, or
programmatic simple_chat(...)), it follows this path. I’m walking through it
because the orchestrator we build later overrides each of these stages.
Router.choose(req)(wintermute/ai/provider.py) selects a provider.
The default isBedrock; ifreq.task_tagcontains"cheap"and a Groq
provider is registered, it routes to Groq instead.router.set_default(provider="rag-<kb>")
pins the route to a RAG knowledge base for subsequent calls — this is how
the console’sai rag use <name>toggles document grounding live.If the chosen provider is a
RAGProvider(any provider whose name
starts withrag-), the last user message is sent to the provider’s
LlamaIndex query engine. Retrieved chunks are injected as a preamble; the
augmented request is forwarded to the provider’s configuredbase_provider
(Bedrock, OpenAI, or Groq). RAG and tool-calling are independent — the
tools list is forwarded unchanged so the base LLM sees both retrieved
context and available tools.provider.chat(req) → ChatResponsecomes back via LiteLLM. If the
response includestool_calls, the consumer enters the standard
“execute → reply with tool message → ask again” loop.ToolsRuntime.run_tool(name, args)(wintermute/ai/tools_runtime.py:198)
is the dispatcher. It first asks every registered dynamic backend whether
it owns the tool (Surgeon over MCP, anyMCPRuntime-managed stdio server,
etc.) and falls back to the localToolRegistryonly if no backend claims
it. This is what makes adding a new MCP server a runtime operation:mcp register <name> <command> [args...]in the REPL plusmcp start <name>
and the LLM immediately sees the new tools.
The single most useful thing about this layout for an offensive engineer is
that the LLM cannot tell apart a Python tool, a binary wrapped via tools.json
path mapping, a cartridge method on a loaded plugin, or a tool sitting on a
remote MCP server. They are all {"type": "function", "function": {...}}
entries in ToolsRuntime.get_all_tools(). We exploit this in Part 4 to drop
custom red-team tools next to vendor MCPs without writing any glue.
Why This Layout Matters for Pentests and Red Teams
Tools that wrap an LLM around bash and call it a “pentest agent” tend to fail on three fronts:
- No engagement state. They run shell commands, dump output, ask another
question. There is noTestCaseRunto markfailed, noVulnerabilityto
attach to a service, no DOCX to deliver. - No scoping primitives. Real engagements need scope; “test only IoT cameras
taggeddutwith UART peripherals exposed on header J3″ is not something
you express with a system prompt. Wintermute’sObjectSelectordoes. - No reproducibility. A finding without a
ReproductionStepis folklore.
Wintermute attachesReproductionStepobjects toVulnerabilityrecords,
each withtool,action,arguments,vulnOutput, andconfidence—
structured enough for the same agent (or a human, or a re-test six months
later) to replay.
This is why every step of the orchestrator we will build by Part 7 writes back
into the Operation. The DOCX produced at the end is not a separate artifact;
it is Report.save(spec, [op], "out.docx") walking the operation tree we have
been mutating the whole time.
What’s Next
| Part | Topic |
|---|---|
| 1 | Architecture and the engagement data model ← you are here |
| 2 | Operations, storage backends, the console — driving the framework by hand |
| 3 | The AI subsystem: Router, providers, RAG, and the tool registry |
| 4 | Cartridges, MCP, Surgeon — turning hardware tools into LLM tools |
| 5 | A first agentic flow — single-prompt enrichment + tool calling |
| 6 | The orchestrator — ticket → scope → TestPlan → TestCaseRun fan-out |
| 7 | Per-test-case sub-agents — one agent per UART, per JTAG, per IAM role |
Each post adds code to the previous one. By Part 7 we will have a working
attacker-perspective workflow: drop a Bugzilla ticket “Audit MCIO I2C EEPROM
on rasp1”, and the orchestrator pulls the ticket, builds a TestPlan from
TestPlans/TP-HW-BLACKBOX-001.json, generates TestCaseRuns for every
matching peripheral, dispatches a sub-agent per run that picks the right tools
(JTAG cartridge, firmware analysis cartridge, Surgeon hooks, SSH session) to
execute the reproduction steps, attaches Vulnerability objects with
ReproductionStep records to the live Operation, marks each run
passed/failed/blocked, and emits the DOCX report.
Bring a soldering iron to Part 4.






Leave a Reply