OpenWinBot

A high-performance Windows UI automation engine designed as a native LLM skill.

OpenWinBot gives any LLM structured, token-efficient perception of a running Windows application and the ability to interact with it — without screenshots, without OCR, and without HTTP. It reads the application's accessibility tree directly from the OS via the Windows UIAutomation COM API and streams it as semantic JSON over ZeroMQ.

┌──────────────────────────────────────────────────────────────────┐
│  LLM (Claude, GPT-4, …)                                          │
│  "I see 47 elements. I will click the Nine button by its ID."    │
└────────┬─────────────────────────────────────┬────────────────────┘
         │ read_state (SUB)                    │ click_id / type (REQ)
         ▼                                     ▼
┌─────────────────────┐             ┌──────────────────────┐
│   win-observer.exe  │             │   win-actuator.exe   │
│                     │             │                      │
│  UIAutomation COM   │  ZMQ PUB    │  ZMQ REP             │
│  CacheRequest bulk  │ ──────────► │  State sync thread   │
│  fetch → JSON       │  5555       │  click_id resolver   │
│  --semantic pruning │             │  SendInput injection │
└──────────┬──────────┘             └──────────────────────┘
           │  Windows UIAutomation                │  Win32 SendInput
           ▼                                      ▼
┌──────────────────────────────────────────────────────────────────┐
│  Target Application  (Calculator, Notepad, Paint, …)             │
└──────────────────────────────────────────────────────────────────┘

Why ZeroMQ and UIAutomation

| Concern | Approach | Why |
| --- | --- | --- |
| Transport | ZeroMQ raw TCP, no HTTP | Eliminates REST overhead for high-FPS local IPC; PUB/SUB is fire-and-forget with no request latency |
| Perception | Windows UIAutomation COM API | Reads the app's own accessibility tree: structured, deterministic, no pixel heuristics or OCR fragility |
| Performance | `IUIAutomationCacheRequest` | All properties for the full subtree fetched in one cross-process call. ~250 IPC round-trips per tick → 1 |
| Token efficiency | `--semantic` pruning | Strips layout containers (Pane, Group, TitleBar, …) that carry no LLM-useful signal. Typical reduction: 50+ elements → 18–47 |
| LLM targeting | `click_id` action | LLM copies the element's stable ID; actuator resolves the pixel centre automatically. Zero coordinate arithmetic |

Project structure

OpenWinBot/
├── common/
│   └── Protocol.hpp        # Shared JSON schema (StateMessage, DeltaMessage,
│                           #   ActionCommand, ActionResponse, UiElement)
├── win-observer/
│   ├── CMakeLists.txt
│   └── main.cpp            # UIA scanner → ZMQ PUB
├── win-actuator/
│   ├── CMakeLists.txt
│   └── main.cpp            # ZMQ REP → SendInput + state cache thread
├── cmake/
│   └── FetchDeps.cmake     # libzmq v4.3.5, cppzmq v4.10.0, nlohmann/json v3.11.3
├── agent.py                # Claude-powered LLM agent (perception-action loop)
├── test_suite.py           # End-to-end test runner (Calculator, Notepad, Paint)
├── test_client.py          # Minimal manual test helper
├── CMakeLists.txt
├── build.bat               # One-command build
└── README.md

Requirements

C++ build

Windows 10 / 11
CMake 3.21+
Visual Studio 2022 — Desktop development with C++ workload
Git (used by CMake FetchContent to pull dependencies)

Python agent & tests

Python 3.8+
pip install anthropic pyzmq
ANTHROPIC_API_KEY environment variable

Quick start

:: 1. Build both binaries (first run fetches ~40 MB of deps — takes 2–5 min)
build.bat

:: 2. Install Python deps
pip install anthropic pyzmq

:: 3. Set your Anthropic API key
set ANTHROPIC_API_KEY=sk-ant-...

:: 4. Open Calculator, then run the agent
python agent.py --window "Calculator" --task "Compute 9 + 3 and tell me the result"

When the agent is launched through test_suite.py, win-observer and win-actuator are started automatically. To start them yourself, see the Running manually section below.

Running manually

For development, debugging, or non-Claude integrations you can run the three components in separate terminals.

Terminal 1 — Observer

:: Full mode — every element in the UIA tree
build\bin\Release\win-observer.exe --window "Calculator" --fps 5

:: Semantic mode — buttons, inputs, text only (recommended for LLM use)
build\bin\Release\win-observer.exe --window "Calculator" --fps 5 --semantic

:: Semantic + delta — minimal payload on static screens (~60 bytes vs ~2 KB)
build\bin\Release\win-observer.exe --window "Calculator" --fps 10 --semantic --delta

Expected output (semantic mode):

[win-observer] Window   : Calculator
[win-observer] FPS      : 5
[win-observer] Bind     : tcp://*:5555
[win-observer] Mode     : semantic
[win-observer] IUIAutomation + CacheRequest ready.
[win-observer] ZMQ PUB bound. Starting scan loop...
[win-observer] full   elements=47  bytes=4821
[win-observer] full   elements=47  bytes=4821

If elements=0 every tick, the window title fragment does not match any visible window. Use Task Manager to confirm the exact title, then pass a substring: --window "Calc".
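To check what the observer is publishing without running the agent, a small subscriber can be attached to the PUB socket. The sketch below assumes each frame arrives as a single-part UTF-8 JSON message with no topic prefix, which this README does not state explicitly; `summarise_frame` and `watch` are hypothetical helper names, not part of test_client.py.

```python
import json


def summarise_frame(raw: bytes) -> str:
    """Turn one published frame into a one-line summary.

    Assumes the frame is a single UTF-8 JSON document shaped like the
    "full" or "delta" messages in the JSON protocol section.
    """
    msg = json.loads(raw.decode("utf-8"))
    if msg.get("type") == "full":
        return f"full  window={msg['window']!r} elements={len(msg['elements'])}"
    if msg.get("type") == "delta":
        return (
            f"delta window={msg['window']!r} "
            f"+{len(msg.get('added', []))} "
            f"-{len(msg.get('removed', []))} "
            f"~{len(msg.get('changed', []))}"
        )
    return f"unknown message type: {msg.get('type')!r}"


def watch(addr: str = "tcp://localhost:5555", timeout_ms: int = 2000) -> None:
    """Print a summary of each frame until none arrives within timeout_ms."""
    import zmq  # imported lazily so summarise_frame works without pyzmq

    sock = zmq.Context.instance().socket(zmq.SUB)
    sock.setsockopt(zmq.SUBSCRIBE, b"")  # empty prefix: receive every message
    sock.connect(addr)
    try:
        while sock.poll(timeout_ms):  # poll returns 0 when the timeout expires
            print(summarise_frame(sock.recv()))
        print("no frame received; is win-observer publishing on", addr, "?")
    finally:
        sock.close(linger=0)
```

Calling `watch()` while the observer is running should print one line per tick; if it reports no frame, the bind address or port is the first thing to check.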

Terminal 2 — Actuator

:: Without click_id support
build\bin\Release\win-actuator.exe --window "Calculator"

:: With click_id support (recommended — enables ID-based targeting)
build\bin\Release\win-actuator.exe --window "Calculator" --obs tcp://localhost:5555

Expected output:

[win-actuator] Window  : Calculator
[win-actuator] Bind    : tcp://*:5556
[win-actuator] Obs     : tcp://localhost:5555
[win-actuator] State sync connected to tcp://localhost:5555
[win-actuator] ZMQ REP bound. Waiting for commands...

Terminal 3 — Manual test client

:: Dump the live element tree (useful for finding button names and IDs)
python test_client.py --dump

:: Click a button by name
python test_client.py --click "Nine"

:: Type text
python test_client.py --type "42"

LLM agent

agent.py closes the full perception-action loop. It reads the semantic state from the observer, formats it into a structured prompt, calls Claude with four tool definitions, executes each tool call against the actuator, and loops until the task is complete or --max-iters is reached.
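The control flow described above can be sketched independently of Claude and ZeroMQ by injecting the three moving parts as callables. This is an illustration of the loop shape only, not the code in agent.py: `run_loop`, `read_state`, `ask_model`, and `execute` are hypothetical names, and treating an empty list of tool calls as "task complete" is an assumption.

```python
from typing import Any, Callable, Dict, List

ToolCall = Dict[str, Any]  # e.g. {"name": "click_id", "input": {"id": "..."}}


def run_loop(
    read_state: Callable[[], str],
    ask_model: Callable[[str, List[str]], List[ToolCall]],
    execute: Callable[[ToolCall], str],
    max_iters: int = 12,
) -> Dict[str, Any]:
    """Perceive, ask the model, act, and repeat.

    ask_model receives the formatted state plus a history of tool results and
    returns the tool calls it wants to make. An empty list is taken to mean
    the model considers the task complete (an assumption of this sketch).
    """
    history: List[str] = []
    for iteration in range(1, max_iters + 1):
        state = read_state()
        calls = ask_model(state, history)
        if not calls:
            return {"success": True, "iterations": iteration}
        for call in calls:
            history.append(f"{call['name']} -> {execute(call)}")
    # The iteration budget ran out before the model declared completion.
    return {"success": False, "iterations": max_iters}
```

Keeping the model and transport behind callables makes the loop testable with a scripted fake model, with no API key or running binaries required.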

Run the agent

:: Calculator arithmetic
python agent.py --window "Calculator" --task "Compute 9 + 3"

:: Notepad text entry
python agent.py --window "Notepad" --task "Type 'Hello from OpenWinBot' into the editor"

:: Paint exploration
python agent.py --window "Paint" --task "List every toolbar button you can see, then click the Pencil tool"

:: Custom model or iteration limit
python agent.py --window "Calculator" --task "Compute 99 + 1" --model claude-opus-4-6 --max-iters 20

Tools available to the LLM

| Tool | Input | Description |
| --- | --- | --- |
| `click_id` | `id: string` | Preferred. Click an element by its stable ID. The actuator resolves the centre automatically from its state cache. |
| `click` | `x, y: int` | Click a raw screen coordinate. Use when no element ID is available. |
| `type_text` | `text: string` | Inject a Unicode string as keystrokes. Works for any language or emoji. |
| `read_state` | (none) | Fetch the latest full UI state. Call after every action to observe the result. |
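For readers wiring these tools into their own client, the table translates into tool definitions in the shape the Anthropic Messages API expects (`name`, `description`, `input_schema`). The definitions and the `to_action` mapping below are a sketch written from the table and the JSON protocol section, not copied from agent.py, whose actual wording and structure may differ. Treating `read_state` as served from the observer subscription rather than the actuator is an assumption based on the architecture diagram.

```python
from typing import Any, Dict, Optional


def _schema(properties: Dict[str, Any]) -> Dict[str, Any]:
    """Build a JSON Schema object in which every listed property is required."""
    return {"type": "object", "properties": properties, "required": list(properties)}


TOOLS = [
    {
        "name": "click_id",
        "description": "Preferred. Click an element by its stable ID.",
        "input_schema": _schema({"id": {"type": "string"}}),
    },
    {
        "name": "click",
        "description": "Click a raw screen coordinate when no element ID is available.",
        "input_schema": _schema({"x": {"type": "integer"}, "y": {"type": "integer"}}),
    },
    {
        "name": "type_text",
        "description": "Inject a Unicode string as keystrokes.",
        "input_schema": _schema({"text": {"type": "string"}}),
    },
    {
        "name": "read_state",
        "description": "Fetch the latest full UI state.",
        "input_schema": _schema({}),
    },
]


def to_action(tool_name: str, args: Dict[str, Any]) -> Optional[Dict[str, Any]]:
    """Map a tool call onto an actuator command, or None for read_state."""
    if tool_name == "click_id":
        return {"action": "click_id", "id": args["id"]}
    if tool_name == "click":
        return {"action": "click", "x": args["x"], "y": args["y"]}
    if tool_name == "type_text":
        return {"action": "type", "text": args["text"]}
    if tool_name == "read_state":
        return None  # answered from the observer's PUB stream, not the actuator
    raise ValueError(f"unknown tool: {tool_name}")
```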

What the LLM receives

In semantic mode the agent formats the state as a compact block. For a Calculator window:

Window: 'Calculator'  (timestamp=1712055600123)

── DISPLAY / STATUS ──────────────────────────────
  Text           'Display is 0'

── ACTIONABLE ELEMENTS ───────────────────────────
  ID        TYPE            NAME                          CENTER
  0a1b2c3d  Button          Zero                          (140,590)
  1c2d3e4f  Button          One                           (140,530)
  2d3e4f5a  Button          Two                           (220,530)
  ...
  9e0f1a2b  Button          Nine                          (300,470)
  ab1c2d3e  Button          Plus                          (380,530)
  bc2d3e4f  Button          Equals                        (380,590)
  cd3e4f5a  Text            Display is 0                  (190,90)

The LLM copies the short ID directly into a click_id call — no pixel arithmetic, no coordinate guessing.
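A formatter along these lines could produce a block of that shape from a full state message. It is a sketch under several assumptions: the column widths are read off the sample above, the ID is shortened to its first 8 characters as in the sample, and the centre is taken as the midpoint of `rect` using integer division. `format_state` is a hypothetical name and agent.py may format things differently.

```python
from typing import Any, Dict, List, Tuple


def center(rect: Dict[str, int]) -> Tuple[int, int]:
    """Midpoint of a rect given as {"x", "y", "w", "h"} (integer division)."""
    return rect["x"] + rect["w"] // 2, rect["y"] + rect["h"] // 2


def format_state(msg: Dict[str, Any]) -> str:
    """Render a full state message as a compact text block for the prompt."""
    elements: List[Dict[str, Any]] = msg["elements"]
    info = [e for e in elements if e.get("role") == "informational"]
    actionable = [e for e in elements if e.get("role") == "actionable"]

    lines = [f"Window: {msg['window']!r}  (timestamp={msg['timestamp']})", ""]
    lines.append("── DISPLAY / STATUS ──")
    for e in info:
        lines.append(f"  {e['type']:<15}{e['name']!r}")
    lines.append("")
    lines.append("── ACTIONABLE ELEMENTS ──")
    lines.append(f"  {'ID':<10}{'TYPE':<16}{'NAME':<30}CENTER")
    for e in actionable:
        cx, cy = center(e["rect"])
        # Shortening the ID to 8 characters mirrors the sample, not a known rule.
        lines.append(f"  {e['id'][:8]:<10}{e['type']:<16}{e['name']:<30}({cx},{cy})")
    return "\n".join(lines)
```

Elements with an empty `role` (structural noise in full mode) are dropped by the two filters, so the same function works on both full-mode and semantic-mode frames.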

System prompt used by agent.py

You are a Windows desktop automation agent powered by the OpenWinBot framework.
You receive a structured view of a window's UI elements and can interact with
them using the provided tools.

Rules:
1. Only interact with elements where role is "actionable" and is_enabled is true.
2. Prefer click_id over click — paste the element's id field exactly as shown.
3. After every action call read_state to observe the result before deciding next step.
4. When the task is complete, describe exactly what you did and the final state.
5. If an element you need is not visible, say so clearly rather than guessing.

Import as a module

from agent import run_agent

result = run_agent(
    task     = "Compute 9 + 3",
    window   = "Calculator",
    obs_addr = "tcp://localhost:5555",
    act_addr = "tcp://localhost:5556",
)
# result = {"success": True, "summary": "Clicked Nine, Plus, Three, Equals. Display shows 12.", "iterations": 5}

Automated test suite

test_suite.py launches Calculator, Notepad, and Paint; starts the observer and actuator for each; runs the Claude agent; and validates the result.

:: Run all three tests
python test_suite.py

:: Run a single test
python test_suite.py --tests calculator

:: Run two tests, leave apps open for inspection
python test_suite.py --tests calculator notepad --no-close

:: Custom binary path (e.g. Debug build)
python test_suite.py --bin-dir build\bin\Debug

Expected output:

════════════════════════════════════════════════════════════════
  OpenWinBot Test Suite  (3 test(s))
  Binaries : D:\...\build\bin\Release
  Model    : claude-sonnet-4-6
════════════════════════════════════════════════════════════════

── Calculator — arithmetic (9 + 3 = 12) ────────────────────────
  [1/4] Launching calc.exe...
  [2/4] Starting win-observer (semantic, 3 fps)...
  [3/4] Starting win-actuator (with --obs for click_id)...
  [4/4] Running agent (Claude)...

  [agent] Tool call: click_id({"id": "9e0f1a2b..."})
  [agent] Tool call: click_id({"id": "ab1c2d3e..."})
  [agent] Tool call: click_id({"id": "2d3e4f5a..."})
  [agent] Tool call: click_id({"id": "bc2d3e4f..."})
  [agent] Tool call: read_state({})
  [agent] ✓ Done after 6 iterations.

  Result :  PASS  — display shows 12

════ Test Summary ════════════════════════════════════════════════
   PASS   calculator      display shows 12
   PASS   notepad         typed successfully
   PASS   paint           completed without error

  Total: 3/3 passed

CLI reference

win-observer

| Flag | Default | Description |
| --- | --- | --- |
| `--window` / `-w` | (required) | Partial window title fragment to scan |
| `--fps` / `-f` | `10` | Scan rate 1–120 fps |
| `--bind` / `-b` | `tcp://*:5555` | ZMQ PUB socket bind address |
| `--semantic` / `-s` | off | Emit only `actionable` and `informational` elements; drop all structural noise |
| `--delta` / `-d` | off | After the first full frame, publish only incremental diffs. Every 30 ticks a full frame is re-broadcast for late-joining subscribers |

win-actuator

| Flag | Default | Description |
| --- | --- | --- |
| `--window` / `-w` | (required) | Partial window title fragment to inject input into |
| `--bind` / `-b` | `tcp://*:5556` | ZMQ REP socket bind address |
| `--obs` / `-o` | (none) | Observer PUB address to subscribe to. Enables the background state-cache thread and the `click_id` action |

agent.py

| Flag | Default | Description |
| --- | --- | --- |
| `--window` / `-w` | (required) | Target window title |
| `--task` / `-t` | (required) | Natural-language task for Claude |
| `--obs` | `tcp://localhost:5555` | Observer address |
| `--act` | `tcp://localhost:5556` | Actuator address |
| `--model` | `claude-sonnet-4-6` | Anthropic model ID |
| `--max-iters` | `12` | Maximum agent loop iterations before giving up |

test_suite.py

| Flag | Default | Description |
| --- | --- | --- |
| `--bin-dir` | `build/bin/Release` | Directory containing the compiled `.exe` files |
| `--tests` | (all) | Space-separated list of test names: `calculator`, `notepad`, `paint` |
| `--no-close` | off | Leave target apps running after each test |
| `--obs-port` | `5555` | PUB socket port |
| `--act-port` | `5556` | REP socket port |

test_client.py

| Flag | Description |
| --- | --- |
| `--dump` | Print element tree and exit |
| `--click NAME` | Click element whose name matches `NAME` |
| `--type TEXT` | Type `TEXT` into the window |
| `--obs ADDR` | Observer address override |
| `--act ADDR` | Actuator address override |

JSON protocol

Full state message `"type": "full"`

Published by win-observer every tick (or every 30 ticks in delta mode as a re-sync).

{
  "type":      "full",
  "timestamp": 1712055600123,
  "window":    "Calculator",
  "elements": [
    {
      "id":         "9e0f1a2b3c4d5e6f",
      "type":       "Button",
      "name":       "Nine",
      "rect":       { "x": 260, "y": 470, "w": 80, "h": 60 },
      "role":       "actionable",
      "is_enabled": true
    },
    {
      "id":         "1b2c3d4e5f6a7b8c",
      "type":       "Text",
      "name":       "Display is 0",
      "rect":       { "x": 10, "y": 60, "w": 340, "h": 54 },
      "role":       "informational",
      "is_enabled": true
    }
  ]
}

| Field | Description |
| --- | --- |
| `id` | 16-char hex hash of `type + name + rect`. Stable across frames as long as the element doesn't move, resize, or change name, so it is safe to hold as a reference while the layout is unchanged. |
| `type` | UIAutomation control type: `Button`, `Edit`, `Text`, `CheckBox`, `ComboBox`, `ListItem`, `MenuItem`, `Slider`, `TabItem`, etc. |
| `role` | `"actionable"`: the LLM can click, type, or toggle. `"informational"`: the LLM should read it. `""`: structural noise, only present in full mode. |
| `is_enabled` | `false` when the element is visible but greyed-out. LLMs must not attempt to interact with disabled elements. |
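The stability property of `id` follows from how it is derived: the same type, name, and rect always hash to the same value, and any change to one of them yields a different ID. The sketch below illustrates that behaviour only. The README does not say which hash function or field encoding the observer uses, so SHA-256 and the `|`-joined key are stand-ins, and the IDs produced here will not match real observer output.

```python
import hashlib
from typing import Dict


def element_id(ctrl_type: str, name: str, rect: Dict[str, int]) -> str:
    """Derive a 16-char hex ID from type + name + rect.

    SHA-256 and the key layout are assumptions made for illustration; the
    observer's real hash is not documented here.
    """
    key = f"{ctrl_type}|{name}|{rect['x']},{rect['y']},{rect['w']},{rect['h']}"
    return hashlib.sha256(key.encode("utf-8")).hexdigest()[:16]


nine = element_id("Button", "Nine", {"x": 260, "y": 470, "w": 80, "h": 60})
same = element_id("Button", "Nine", {"x": 260, "y": 470, "w": 80, "h": 60})
moved = element_id("Button", "Nine", {"x": 260, "y": 480, "w": 80, "h": 60})

print(nine == same)   # identical inputs give an identical ID
print(nine == moved)  # a moved element gets a different ID
```

A practical consequence is that an ID cached before the window is moved or resized should be treated as stale, and the state re-read before the next `click_id`.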

Delta message `"type": "delta"`

Published when --delta is active and the screen has changed since the last tick.

{
  "type":      "delta",
  "timestamp": 1712055600456,
  "window":    "Calculator",
  "added":     [ { ...UiElement... } ],
  "removed":   [ "9e0f1a2b3c4d5e6f" ],
  "changed":   [ { ...UiElement... } ]
}

removed contains only the element IDs (strings), not the full objects.
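A subscriber that wants a complete picture in delta mode has to fold each message into a local cache keyed by element ID. The following sketch shows one way to do that. It assumes `changed` entries carry the full replacement element under an unchanged `id`, which fits the message shape above but is not spelled out; `apply_message` is a hypothetical name.

```python
from typing import Any, Dict

Element = Dict[str, Any]
Cache = Dict[str, Element]


def apply_message(cache: Cache, msg: Dict[str, Any]) -> Cache:
    """Return a new cache reflecting one "full" or "delta" message."""
    if msg["type"] == "full":
        # A full frame replaces everything, which also re-syncs late joiners.
        return {e["id"]: e for e in msg["elements"]}
    if msg["type"] == "delta":
        updated = dict(cache)  # shallow copy so the caller's cache is untouched
        for element_id in msg.get("removed", []):
            updated.pop(element_id, None)  # removed holds bare ID strings
        for element in msg.get("added", []) + msg.get("changed", []):
            updated[element["id"]] = element
        return updated
    raise ValueError(f"unexpected message type: {msg['type']!r}")
```

Because a full frame is re-broadcast periodically in delta mode, a subscriber that misses a delta drifts only until the next full frame arrives.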

Action command `client → actuator`

{ "action": "click",    "x": 300, "y": 530 }
{ "action": "type",     "text": "Hello World" }
{ "action": "click_id", "id": "9e0f1a2b3c4d5e6f" }

click_id requires win-actuator to have been started with --obs. The actuator's background thread caches the full state (applying deltas as they arrive) and resolves the element centre on every click_id request.
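The resolution step can be pictured as a lookup in that cache followed by an enabled check and a midpoint calculation. The Python sketch below mirrors the behaviour described here and reuses the two error strings that appear in this README; the real resolver lives in the C++ actuator, so the function name, the "unknown element id" message, and the integer-division midpoint are assumptions.

```python
from typing import Any, Dict, Optional, Tuple

Response = Dict[str, str]
Point = Tuple[int, int]


def resolve_click_id(
    cache: Dict[str, Dict[str, Any]], element_id: str
) -> Tuple[Response, Optional[Point]]:
    """Resolve an element ID to a click point, or explain why it cannot be."""
    if not cache:
        return {"status": "error", "message": "no state cached yet"}, None
    element = cache.get(element_id)
    if element is None:
        # Hypothetical wording; the actuator's real message is not documented.
        return {"status": "error", "message": f"unknown element id: {element_id}"}, None
    if not element.get("is_enabled", False):
        return {"status": "error", "message": f"element is disabled: {element['name']}"}, None
    rect = element["rect"]
    point = (rect["x"] + rect["w"] // 2, rect["y"] + rect["h"] // 2)
    return {"status": "ok", "message": ""}, point
```

With the `Nine` element from the full state example, whose rect is `{x: 260, y: 470, w: 80, h: 60}`, this returns the point `(300, 500)`.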

Action response `actuator → client`

{ "status": "ok",    "message": "" }
{ "status": "error", "message": "element is disabled: Clear" }
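For non-Claude integrations, a client only needs a REQ socket and the three command shapes above. The sketch below assumes commands and responses travel as single-part UTF-8 JSON messages, which matches the examples but is not stated outright; `encode_command` and `send_action` are hypothetical helpers, and the timeout response is produced locally by the client, not by the actuator.

```python
import json
from typing import Any, Dict

# Fields each action must carry, taken from the command examples above.
REQUIRED_FIELDS = {"click": ("x", "y"), "type": ("text",), "click_id": ("id",)}


def encode_command(action: str, **fields: Any) -> bytes:
    """Validate a command against the documented shapes and serialise it."""
    if action not in REQUIRED_FIELDS:
        raise ValueError(f"unknown action: {action}")
    missing = [f for f in REQUIRED_FIELDS[action] if f not in fields]
    if missing:
        raise ValueError(f"{action} is missing fields: {', '.join(missing)}")
    return json.dumps({"action": action, **fields}).encode("utf-8")


def send_action(
    payload: bytes, addr: str = "tcp://localhost:5556", timeout_ms: int = 3000
) -> Dict[str, str]:
    """Send one command to the actuator and wait for its response."""
    import zmq  # imported lazily so encode_command works without pyzmq

    sock = zmq.Context.instance().socket(zmq.REQ)
    sock.setsockopt(zmq.LINGER, 0)  # do not block on close if nothing answers
    sock.connect(addr)
    try:
        sock.send(payload)
        if not sock.poll(timeout_ms):
            # Client-side message, not one the actuator emits.
            return {"status": "error", "message": f"no reply from {addr}"}
        return json.loads(sock.recv().decode("utf-8"))
    finally:
        sock.close()
```

A typical call is `send_action(encode_command("click_id", id="9e0f1a2b3c4d5e6f"))`, after which `status` should be checked before issuing the next action.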

Building from source

One command

build.bat

Manual (specific generator)

cmake -S . -B build -G "Visual Studio 17 2022" -A x64
cmake --build build --config Release --parallel

Binaries are written to build\bin\Release\.

What gets fetched

| Library | Version | Role |
| --- | --- | --- |
| libzmq | v4.3.5 | ZeroMQ C core (built as static lib) |
| cppzmq | v4.10.0 | Header-only C++ wrapper |
| nlohmann/json | v3.11.3 | JSON serialisation |

All three are fetched by CMake FetchContent on first configure. No vcpkg or manual installation needed.

Troubleshooting

elements=0 every tick: The window title fragment doesn't match any visible window. Check the exact title in Task Manager → Details and pass a substring: --window "Calc".

click_id returns "no state cached yet": The actuator was started without --obs, or the observer hasn't published a full frame yet. Add --obs tcp://localhost:5555 when starting the actuator and wait ~1 second for the cache to warm.

Clicks land in the wrong place on a high-DPI display: UIAutomation reports physical pixel coordinates. The actuator normalises these to the 0–65535 absolute range that SendInput expects, using MulDiv(x, 65535, screen width), where the width is the primary screen width (the SM_CXSCREEN system metric). If the target app is DPI-unaware and Windows is scaling it, coordinates may be off. Fix: right-click the target app's .exe → Properties → Compatibility → Override high DPI scaling behaviour → Application.
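The normalisation itself is simple arithmetic and can be checked by hand when debugging an offset click. The function below is a Python equivalent of the Win32 `MulDiv` call for non-negative inputs, which multiplies, divides, and rounds to the nearest integer. It is a sketch for verification, not the actuator's code; `to_absolute` is a hypothetical name and the screen sizes in the usage note are example values.

```python
def to_absolute(pixel: int, screen_extent: int) -> int:
    """Map a pixel coordinate onto the 0..65535 absolute input range.

    Equivalent to MulDiv(pixel, 65535, screen_extent) for non-negative
    arguments: adding half the divisor before floor division rounds the
    quotient to the nearest integer.
    """
    if screen_extent <= 0:
        raise ValueError("screen_extent must be positive")
    if pixel < 0:
        raise ValueError("pixel must be non-negative")
    return (pixel * 65535 + screen_extent // 2) // screen_extent
```

On a 1920-pixel-wide primary screen, `to_absolute(960, 1920)` gives 32768, the horizontal midpoint. If the same pixel value were normalised against a scaled width such as 1280, the result would be 49151, which is the kind of mismatch that makes clicks land off-target.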

Clicks are silently ignored: SendInput is blocked by UIPI when the target process runs at a higher integrity level than the actuator. Run win-actuator.exe as Administrator, or launch the target app from a non-elevated shell so both share the same integrity level.

agent.py times out waiting for a state frame: The observer isn't running, or isn't publishing to the address the agent is subscribing to. Confirm both are using the same port (--bind tcp://*:5555 on the observer, --obs tcp://localhost:5555 on the agent).

Build fails on first configure: CMake fetches dependencies over HTTPS using Git. Ensure Git is on PATH. Behind a corporate proxy, set HTTP_PROXY and HTTPS_PROXY before running CMake, or configure Git's proxy: git config --global http.proxy http://proxy:port.
