Wintermute Framework, Part 1: Architecture and the Engagement Data Model

This is the opening post of a multi-part series on Wintermute, an open-source agentic framework for hardware penetration testing and red team engagements (MIT-licensed, docs, source). The series is published at exploit.ninja.

The objective of the series is not “how to run an LLM” but how to drive a real engagement with one: from a Bugzilla ticket through hardware test plan generation, peripheral-level execution, and a DOCX deliverable. Every post builds on the previous one; by the end of the series we will have a working orchestrator that reads a ticket, derives scope, generates TestCaseRuns, and dispatches a sub-agent per test case to execute reproduction steps over UART/JTAG/MCP and write findings back into a live Operation.

This first post is the mental model. If you skim, skim this one — every later post assumes the vocabulary defined here.

What Wintermute Actually Is

Wintermute is a Python 3.11+ engagement-management library plus two binaries:

Binary          Module                              Purpose
wintermute      wintermute.WintermuteConsole:main   Metasploit-style REPL with builder/context navigation.
wintermute-mcp  wintermute.WintermuteMCP:main       MCP server (SSE or stdio) exposing 80+ tools.

It is provider-agnostic: Router selects between AWS Bedrock (Claude / DeepSeek / Llama), OpenAI, Groq, and a local HuggingFace embedding provider. It is also backend-agnostic: storage, tickets, and reports are Python Protocols with at least two implementations each. You can run the entire stack offline with local_embedder + JsonFileBackend + InMemoryBackend + DocxTplPerVulnBackend, and swap any of those four for a cloud equivalent without touching the agent code.

The Architecture, From Top to Bottom

                ┌──────────────────────────────────────────────────────┐
                │   WintermuteConsole       │   WintermuteMCP server   │
                │   (REPL + builder ctx)    │  (SSE / stdio, 80+ tools)│
                └──────────────────────────┬┴──────────────────────────┘
                                           │
                                ┌──────────▼──────────┐
                                │      Operation      │  ← top-level aggregate
                                │  Devices · Users    │
                                │  Analysts · Cloud   │
                                │  TestPlans · Runs   │
                                │  Vulnerabilities    │
                                └──┬──────────┬───────┘
                                   │          │
        ┌──────────────────────────▼──┐    ┌──▼─────────────────────────┐
        │       AI Subsystem          │    │  Storage / Ticket / Report │
        │                             │    │   Backend Protocols        │
        │  Router → LLMRegistry       │    │                            │
        │     • bedrock               │    │  Json / DynamoDB           │
        │     • openai                │    │  InMemory / Bugzilla       │
        │     • groq                  │    │  DocxTplPerVuln            │
        │     • local_embedder        │    │                            │
        │     • rag-<kb_name>         │    └────────────────────────────┘
        │                             │
        │  ToolsRuntime               │
        │     • ToolRegistry (local)  │
        │     • MCPRuntime backends   │
        │     • SurgeonBackend        │
        │                             │
        │  CartridgeManager           │
        │     • tpm20                 │
        │     • jtag                  │
        │     • firmware_analysis     │
        └─────────────────────────────┘





Three things to internalize before going further:

  1. Operation is the only persistent state. Devices, services, peripherals,
    cloud accounts, vulnerabilities, test plans, and test runs are all attached
    to one Operation object that knows how to serialize itself via to_dict()
    / from_dict(). Every backend (JSON, DynamoDB, MCP transport) round-trips
    that single dict.


  2. The AI is a peer of the operator, not a layer on top of the framework.
    The same tools_runtime.ToolRegistry that an LLM hits via tool-calling is
    the registry the human operator drives via cartridges run <name> <fn> in
    the REPL. Cartridge methods are introspected and registered as AI tools
    through wintermute.cartridges.manager.CartridgeManager._register_instance_methods
    — there is no second tool surface for the AI.


  3. Everything pluggable is a Protocol. StorageBackend, TicketBackend,
    ReportBackend, ToolBackend, LLMProvider — all five are typing.Protocols.
    New providers, new ticket systems, new vector stores plug in by implementing
    the protocol and calling register_backend(...).


The Engagement Data Model — Verbatim From core.py

Wintermute does not invent a new schema. The objects are dataclass-style models with __schema__ attributes that describe how to deserialize nested types. Here is the canonical hierarchy, with line references in wintermute/core.py:

Operation                                         core.py:621
├── analysts: list[Analyst]                       core.py:365
├── devices: list[Device]                         core.py:161
│   ├── services: list[Service]                   core.py:61
│   │   └── vulnerabilities:                      findings.py:114
│   │       └── reproduction_steps:               findings.py:36
│   ├── peripherals: list[Peripheral]             peripherals.py
│   │   └── (UART, JTAG, SPI, I2C, SWD, TPMPeripheral, USB, PCIe,
│   │        Bluetooth, Wifi, Zigbee, Ethernet)
│   ├── processor: Processor                      hardware.py:42
│   │   └── architecture: Architecture            hardware.py:32
│   └── memory: Memory                            hardware.py:58
├── users: list[User]                             core.py:274
├── cloud_accounts: list[AWSAccount|...]          cloud/aws.py
│   └── iamusers, iamroles, services, vulnerabilities
├── test_plans: list[TestPlan]                    core.py:534
│   ├── test_cases: list[TestCase]                core.py:495
│   │   ├── target_scope: TargetScope             core.py:480
│   │   │   └── bindings: list[ObjectSelector]    core.py:457
│   │   ├── steps: list[ReproductionStep]
│   │   ├── execution_mode: ExecutionMode
│   │   │     ("once" | "per_device" | "per_binding")
│   │   └── execution_binding: str
│   └── test_plans: list[TestPlan]                ← test plans nest
└── test_runs: list[TestCaseRun]                  core.py:572
    ├── status: RunStatus
    │     ("not_run" | "in_progress" | "passed" | "failed"
    │      | "blocked" | "not_applicable")
    ├── started_at / ended_at: datetime
    ├── bound: list[BoundObjectRef]
    └── findings: list[Vulnerability]
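
As a rough illustration of how a __schema__ attribute can drive deserialization of nested types, the sketch below round-trips a tiny two-level model through a plain dict. The Device and Service classes here are simplified stand-ins, and the mapping format of __schema__ (field name to item class) is an assumption; the real attribute in core.py may be structured differently.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, ClassVar


@dataclass
class Service:
    portNumber: int
    app: str = ""

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "Service":
        return cls(**data)


@dataclass
class Device:
    hostname: str
    services: list[Service] = field(default_factory=list)

    # Assumed convention: field name -> class used to rebuild each list item.
    __schema__: ClassVar[dict[str, type]] = {"services": Service}

    def to_dict(self) -> dict[str, Any]:
        # asdict recurses into nested dataclasses and lists.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "Device":
        kwargs = dict(data)
        for name, item_cls in cls.__schema__.items():
            kwargs[name] = [item_cls.from_dict(item) for item in data.get(name, [])]
        return cls(**kwargs)


original = Device("web-server-01", [Service(443, "nginx")])
restored = Device.from_dict(original.to_dict())
print(restored == original)  # True
```

Because the intermediate value is an ordinary dict, the same round trip works whether the dict is written to a JSON file, a database item, or a network transport.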

A few things stand out compared to typical pentest tooling:

  • Test plans are declarative JSON, with the schema implemented as Python
    dataclasses. The TestPlans/ directory in the repo ships seven plans;
    TP-HW-BLACKBOX-001.json is a full hardware blackbox playbook (board
    survey → debug interfaces → buses → boot chain → TPM 2.0).
  • target_scope.bindings are selectors, not pointers, resolved against
    the live Operation at run-generation time. A TestCase declares “I need
    one device tagged dut and many UART peripherals on it”; Operation.resolveBindings
    (core.py:805) walks the operation’s devices, services, peripherals,
    IAM users, IAM roles, and cloud services to find matches and validates
    cardinality.
  • ExecutionMode controls fan-out: once produces a single run, per_device
    one run per matched DUT, and per_binding one run per matched peripheral.
    This is the lever the orchestrator (Part 6 / 7) will pull to dispatch a
    sub-agent per UART, per JTAG TAP, per IAM role.
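
The selector and fan-out behaviour described in the last two bullets can be sketched in a few lines. This is a toy model under stated assumptions, not Operation.resolveBindings: the Device and Peripheral classes, the resolve and fan_out helpers, and the run labels are all hypothetical, and cardinality validation is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Peripheral:
    kind: str  # e.g. "uart" or "jtag"
    name: str


@dataclass
class Device:
    hostname: str
    tags: set[str] = field(default_factory=set)
    peripherals: list[Peripheral] = field(default_factory=list)


def resolve(devices: list[Device], tag: str, kind: str) -> list[tuple[Device, list[Peripheral]]]:
    """Return (device, matching peripherals) for every device carrying the tag."""
    matches = []
    for dev in devices:
        if tag in dev.tags:
            hits = [p for p in dev.peripherals if p.kind == kind]
            if hits:
                matches.append((dev, hits))
    return matches


def fan_out(matches: list[tuple[Device, list[Peripheral]]], mode: str) -> list[str]:
    """Expand resolved matches into one label per run for the given mode."""
    if mode == "once":
        return ["all"] if matches else []
    if mode == "per_device":
        return [dev.hostname for dev, _ in matches]
    if mode == "per_binding":
        return [f"{dev.hostname}/{p.name}" for dev, hits in matches for p in hits]
    raise ValueError(f"unknown execution mode: {mode}")


devices = [
    Device("cam-01", {"dut"}, [Peripheral("uart", "uart0"),
                               Peripheral("uart", "uart1"),
                               Peripheral("jtag", "jtag0")]),
    Device("laptop", set(), [Peripheral("uart", "ftdi")]),
]
matches = resolve(devices, tag="dut", kind="uart")
for mode in ("once", "per_device", "per_binding"):
    print(mode, fan_out(matches, mode))
# once ['all']
# per_device ['cam-01']
# per_binding ['cam-01/uart0', 'cam-01/uart1']
```

The selector is evaluated against whatever devices exist at the time of the call, which is the practical difference between a selector and a stored pointer.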

Concretely, here is the same shape from the included examples/02-Operations-and-Storage.ipynb:

from wintermute.core import Operation

# Create the engagement and register the people involved.
op = Operation("acme-pentest-2025")
op.addAnalyst("Jane Doe", "jdoe", "jane@acme.com")
op.addUser(uid="rsmith", name="Robert Smith",
           email="robert@acme.com", teams=["red-team"])

# Add a device, then attach a service to it.
op.addDevice("web-server-01", "10.0.1.10", operatingsystem="Ubuntu 22.04")
dev = op.getDeviceByHostname("web-server-01")
dev.addService(portNumber=443, app="nginx",
               protocol="ipv4", transport_layer="HTTPS")

# Findings hang off the service they affect.
svc = dev.services[0]
svc.addVulnerability(
    title="SQL Injection in /api/search",
    cvss=9,
    description="Union-based SQLi in search parameter",
    risk={"likelihood": "High", "impact": "High", "severity": "Critical"},
)

Nothing AI yet — and that is the point. For the rest of the series, anything we build with the AI lands inside this same object graph.

How a Query Becomes a Tool Call

When a query enters the system (REPL ai <prompt>, MCP ai_chat, or programmatic simple_chat(...)), it follows this path. I’m walking through it because the orchestrator we build later overrides each of these stages.

  1. Router.choose(req) (wintermute/ai/provider.py) selects a provider.
    The default is Bedrock; if req.task_tag contains "cheap" and a Groq
    provider is registered, it routes to Groq instead. router.set_default(provider="rag-<kb>")
    pins the route to a RAG knowledge base for subsequent calls — this is how
    the console’s ai rag use <name> toggles document grounding live.


  2. If the chosen provider is a RAGProvider (any provider whose name
    starts with rag-), the last user message is sent to the provider’s
    LlamaIndex query engine. Retrieved chunks are injected as a preamble; the
    augmented request is forwarded to the provider’s configured base_provider
    (Bedrock, OpenAI, or Groq). RAG and tool-calling are independent — the
    tools list is forwarded unchanged so the base LLM sees both retrieved
    context and available tools.


  3. provider.chat(req) → ChatResponse comes back via LiteLLM. If the
    response includes tool_calls, the consumer enters the standard
    “execute → reply with tool message → ask again” loop.


  4. ToolsRuntime.run_tool(name, args) (wintermute/ai/tools_runtime.py:198)
    is the dispatcher. It first asks every registered dynamic backend whether
    it owns the tool (Surgeon over MCP, any MCPRuntime-managed stdio server,
    etc.) and falls back to the local ToolRegistry only if no backend claims
    it. This is what makes adding a new MCP server a runtime operation:
    mcp register <name> <command> [args...] in the REPL plus mcp start <name>
    and the LLM immediately sees the new tools.
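
Steps 1 and 4 above reduce to two small decisions, sketched here in isolation. Nothing in this block is Wintermute's implementation: choose_provider, DictBackend, and this run_tool signature are hypothetical stand-ins that only mirror the described behaviour (route "cheap" tasks to Groq when registered; let dynamic backends claim a tool before the local registry does).

```python
from typing import Any, Callable


def choose_provider(task_tag: str, registered: set[str], default: str = "bedrock") -> str:
    """Routing rule from step 1: 'cheap' tasks go to Groq if it is registered."""
    if "cheap" in task_tag and "groq" in registered:
        return "groq"
    return default


class DictBackend:
    """Stand-in for a dynamic backend (for example an MCP server) owning tools."""

    def __init__(self, tools: dict[str, Callable[..., Any]]) -> None:
        self._tools = tools

    def owns(self, name: str) -> bool:
        return name in self._tools

    def call(self, name: str, args: dict[str, Any]) -> Any:
        return self._tools[name](**args)


def run_tool(name: str, args: dict[str, Any],
             backends: list[DictBackend],
             local_registry: dict[str, Callable[..., Any]]) -> Any:
    """Dispatch order from step 4: dynamic backends first, local registry last."""
    for backend in backends:
        if backend.owns(name):
            return backend.call(name, args)
    if name in local_registry:
        return local_registry[name](**args)
    raise KeyError(f"no backend or local tool named {name!r}")


backends = [DictBackend({"read_flash": lambda offset: f"remote read at {offset:#x}"})]
local = {
    "read_flash": lambda offset: f"local read at {offset:#x}",
    "crc32": lambda data: f"crc of {len(data)} bytes",
}

print(choose_provider("cheap-summary", {"bedrock", "groq"}))      # groq
print(run_tool("read_flash", {"offset": 4096}, backends, local))  # remote read at 0x1000
print(run_tool("crc32", {"data": "abcd"}, backends, local))       # crc of 4 bytes
```

Note that read_flash exists in both places and the backend wins, which is why registering a new backend at runtime changes what a tool name resolves to without touching local code.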


The single most useful thing about this layout for an offensive engineer is that the LLM cannot tell apart a Python tool, a binary wrapped via tools.json path mapping, a cartridge method on a loaded plugin, or a tool sitting on a remote MCP server. They are all {"type": "function", "function": {...}} entries in ToolsRuntime.get_all_tools(). We exploit this in Part 4 to drop custom red-team tools next to vendor MCPs without writing any glue.
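
To show why the origin of a tool is invisible to the model, here is a minimal sketch that turns the public methods of an object into function-style tool entries using only the standard library. UartCartridge, to_tool_schema, and public_methods are hypothetical names, and the type mapping is deliberately tiny; the framework's own introspection is likely more thorough.

```python
import inspect
from typing import Any, Callable

# Minimal mapping from Python annotations to JSON Schema type names.
_JSON_TYPES = {int: "integer", float: "number", str: "string", bool: "boolean"}


def to_tool_schema(fn: Callable[..., Any]) -> dict[str, Any]:
    """Describe a callable as a {"type": "function", ...} tool entry."""
    properties: dict[str, Any] = {}
    required: list[str] = []
    for name, param in inspect.signature(fn).parameters.items():
        # Unannotated or unknown types fall back to "string".
        properties[name] = {"type": _JSON_TYPES.get(param.annotation, "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)
    return {
        "type": "function",
        "function": {
            "name": fn.__name__,
            "description": inspect.getdoc(fn) or "",
            "parameters": {"type": "object", "properties": properties, "required": required},
        },
    }


class UartCartridge:
    """Hypothetical cartridge with one public method."""

    def send(self, port: str, baud: int = 115200) -> str:
        """Open a UART port at the given baud rate."""
        return f"{port}@{baud}"


def public_methods(obj: object) -> list[Callable[..., Any]]:
    """Bound methods whose names do not start with an underscore."""
    return [m for n, m in inspect.getmembers(obj, inspect.ismethod) if not n.startswith("_")]


tools = [to_tool_schema(m) for m in public_methods(UartCartridge())]
print(tools[0]["function"]["name"])                    # send
print(tools[0]["function"]["parameters"]["required"])  # ['port']
```

Once every source is flattened into this shape, the model only ever sees a name, a description, and a parameter schema.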

Why This Layout Matters for Pentests and Red Teams

Tools that wrap an LLM around bash and call it a “pentest agent” tend to fail on three fronts:

  1. No engagement state. They run shell commands, dump output, ask another
    question. There is no TestCaseRun to mark failed, no Vulnerability to
    attach to a service, no DOCX to deliver.
  2. No scoping primitives. Real engagements need scope; “test only IoT cameras
    tagged dut with UART peripherals exposed on header J3” is not something
    you express with a system prompt. Wintermute’s ObjectSelector can express it.
  3. No reproducibility. A finding without a ReproductionStep is folklore.
    Wintermute attaches ReproductionStep objects to Vulnerability records,
    each with tool, action, arguments, vulnOutput, and confidence fields,
    structured enough for the same agent (or a human, or a re-test six months
    later) to replay.
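
The replay idea in the third point can be sketched as follows. The field names come from the list above, but the field types, the replay helper, and the fake_uart tool are assumptions for illustration; the real ReproductionStep in findings.py may carry more or differently typed fields.

```python
from dataclasses import dataclass, field
from typing import Any, Callable


@dataclass
class ReproductionStep:
    # Field names follow the post; the types are assumed.
    tool: str
    action: str
    arguments: dict[str, Any] = field(default_factory=dict)
    vulnOutput: str = ""
    confidence: float = 0.0


def replay(step: ReproductionStep, tools: dict[str, Callable[..., str]]) -> bool:
    """Re-run a recorded step and report whether the output still matches."""
    output = tools[step.tool](step.action, **step.arguments)
    return output == step.vulnOutput


def fake_uart(action: str, **kwargs: Any) -> str:
    """Stand-in for a real UART tool; returns a canned bootloader prompt."""
    if action == "send_break":
        return "U-Boot> "
    return ""


step = ReproductionStep(
    tool="uart",
    action="send_break",
    arguments={"port": "/dev/ttyUSB0"},
    vulnOutput="U-Boot> ",
    confidence=0.9,
)
print(replay(step, {"uart": fake_uart}))  # True
```

A re-test then amounts to iterating over the stored steps and flagging any whose output no longer matches.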

This is why every step of the orchestrator we will build by Part 7 writes back into the Operation. The DOCX produced at the end is not a separate artifact; it is Report.save(spec, [op], "out.docx") walking the operation tree we have been mutating the whole time.

What’s Next

Part  Topic
1     Architecture and the engagement data model (you are here)
2     Operations, storage backends, the console: driving the framework by hand
3     The AI subsystem: Router, providers, RAG, and the tool registry
4     Cartridges, MCP, Surgeon: turning hardware tools into LLM tools
5     A first agentic flow: single-prompt enrichment + tool calling
6     The orchestrator: ticket → scope → TestPlan → TestCaseRun fan-out
7     Per-test-case sub-agents: one agent per UART, per JTAG, per IAM role

Each post adds code to the previous one. By Part 7 we will have a working attacker-perspective workflow: drop a Bugzilla ticket “Audit MCIO I2C EEPROM on rasp1”, and the orchestrator pulls the ticket, builds a TestPlan from TestPlans/TP-HW-BLACKBOX-001.json, generates TestCaseRuns for every matching peripheral, dispatches a sub-agent per run that picks the right tools (JTAG cartridge, firmware analysis cartridge, Surgeon hooks, SSH session) to execute the reproduction steps, attaches Vulnerability objects with ReproductionStep records to the live Operation, marks each run passed/failed/blocked, and emits the DOCX report.

Bring a soldering iron to Part 4.
