A high-performance Windows UI automation engine designed as a native LLM skill.
OpenWinBot gives any LLM structured, token-efficient perception of a running Windows application and the ability to interact with it — without screenshots, without OCR, and without HTTP. It reads the application's accessibility tree directly from the OS via the Windows UIAutomation COM API and streams it as semantic JSON over ZeroMQ.
┌──────────────────────────────────────────────────────────────────┐
│                      LLM (Claude, GPT-4, …)                      │
│   "I see 47 elements. I will click the Nine button by its ID."   │
└────────┬─────────────────────────────────────┬───────────────────┘
         │  read_state (SUB)                   │  click_id / type (REQ)
         ▼                                     ▼
┌─────────────────────┐             ┌──────────────────────┐
│  win-observer.exe   │             │  win-actuator.exe    │
│                     │             │                      │
│  UIAutomation COM   │  ZMQ PUB    │  ZMQ REP             │
│  CacheRequest bulk  │ ──────────► │  State sync thread   │
│  fetch → JSON       │    5555     │  click_id resolver   │
│  --semantic pruning │             │  SendInput injection │
└──────────┬──────────┘             └──────────────────────┘
           │  Windows UIAutomation             │  Win32 SendInput
           ▼                                   ▼
┌──────────────────────────────────────────────────────────────────┐
│        Target Application (Calculator, Notepad, Paint, …)        │
└──────────────────────────────────────────────────────────────────┘
| Concern | Approach | Why |
|---|---|---|
| Transport | ZeroMQ raw TCP, no HTTP | Eliminates REST overhead for high-FPS local IPC; PUB/SUB is fire-and-forget with no request latency |
| Perception | Windows UIAutomation COM API | Reads the app's own accessibility tree — structured, deterministic, no pixel heuristics or OCR fragility |
| Performance | IUIAutomationCacheRequest | All properties for the full subtree are fetched in one cross-process call. ~250 IPC round-trips per tick → 1 |
| Token efficiency | --semantic pruning | Strips layout containers (Pane, Group, TitleBar, …) that carry no LLM-useful signal. Typical reduction: 50+ elements → 18–47 |
| LLM targeting | click_id action | LLM copies the element's stable ID; actuator resolves the pixel centre automatically. Zero coordinate arithmetic |
OpenWinBot/
├── common/
│ └── Protocol.hpp # Shared JSON schema (StateMessage, DeltaMessage,
│ # ActionCommand, ActionResponse, UiElement)
├── win-observer/
│ ├── CMakeLists.txt
│ └── main.cpp # UIA scanner → ZMQ PUB
├── win-actuator/
│ ├── CMakeLists.txt
│ └── main.cpp # ZMQ REP → SendInput + state cache thread
├── cmake/
│ └── FetchDeps.cmake # libzmq v4.3.5, cppzmq v4.10.0, nlohmann/json v3.11.3
├── agent.py # Claude-powered LLM agent (perception-action loop)
├── test_suite.py # End-to-end test runner (Calculator, Notepad, Paint)
├── test_client.py # Minimal manual test helper
├── CMakeLists.txt
├── build.bat # One-command build
└── README.md
- Windows 10 / 11
- CMake 3.21+
- Visual Studio 2022 — Desktop development with C++ workload
- Git (used by CMake FetchContent to pull dependencies)
- Python 3.8+ with the agent dependencies: pip install anthropic pyzmq
- ANTHROPIC_API_KEY environment variable
:: 1. Build both binaries (first run fetches ~40 MB of deps — takes 2–5 min)
build.bat
:: 2. Install Python deps
pip install anthropic pyzmq
:: 3. Set your Anthropic API key
set ANTHROPIC_API_KEY=sk-ant-...
:: 4. Open Calculator, then run the agent
python agent.py --window "Calculator" --task "Compute 9 + 3 and tell me the result"

The agent starts win-observer and win-actuator automatically through test_suite.py. For manual control, see the Running manually section below.
For development, debugging, or non-Claude integrations you can run the three components in separate terminals.
:: Full mode — every element in the UIA tree
build\bin\Release\win-observer.exe --window "Calculator" --fps 5
:: Semantic mode — buttons, inputs, text only (recommended for LLM use)
build\bin\Release\win-observer.exe --window "Calculator" --fps 5 --semantic
:: Semantic + delta — minimal payload on static screens (~60 bytes vs ~2 KB)
build\bin\Release\win-observer.exe --window "Calculator" --fps 10 --semantic --delta

Expected output (semantic mode):
[win-observer] Window : Calculator
[win-observer] FPS : 5
[win-observer] Bind : tcp://*:5555
[win-observer] Mode : semantic
[win-observer] IUIAutomation + CacheRequest ready.
[win-observer] ZMQ PUB bound. Starting scan loop...
[win-observer] full elements=47 bytes=4821
[win-observer] full elements=47 bytes=4821
If elements=0 appears on every tick, the window title fragment does not match any visible window. Use Task Manager to confirm the exact title, then pass a substring: --window "Calc".
:: Without click_id support
build\bin\Release\win-actuator.exe --window "Calculator"
:: With click_id support (recommended — enables ID-based targeting)
build\bin\Release\win-actuator.exe --window "Calculator" --obs tcp://localhost:5555

Expected output:
[win-actuator] Window : Calculator
[win-actuator] Bind : tcp://*:5556
[win-actuator] Obs : tcp://localhost:5555
[win-actuator] State sync connected to tcp://localhost:5555
[win-actuator] ZMQ REP bound. Waiting for commands...
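Once the actuator reports that it is waiting for commands, it can also be driven from a few lines of Python. The sketch below is an illustration rather than the contents of test_client.py: `make_command` and `send_command` are hypothetical helper names, the command shapes follow the ActionCommand examples in the protocol section, and it assumes the actuator exchanges single-part JSON strings over its REP socket.

```python
import json

# Fields each action expects, taken from the ActionCommand examples.
_FIELDS = {"click": {"x", "y"}, "type": {"text"}, "click_id": {"id"}}


def make_command(action, **fields):
    """Build an ActionCommand dict and reject unknown or malformed ones early."""
    if action not in _FIELDS:
        raise ValueError(f"unknown action: {action}")
    if set(fields) != _FIELDS[action]:
        raise ValueError(f"{action} expects fields {sorted(_FIELDS[action])}")
    return {"action": action, **fields}


def send_command(cmd, addr="tcp://localhost:5556", timeout_ms=2000):
    """Send one command over a REQ socket and return the parsed reply."""
    import zmq  # pyzmq, imported lazily so make_command works without it

    sock = zmq.Context.instance().socket(zmq.REQ)
    sock.setsockopt(zmq.LINGER, 0)             # do not hang on close
    sock.setsockopt(zmq.RCVTIMEO, timeout_ms)  # recv raises zmq.Again on timeout
    sock.connect(addr)
    try:
        sock.send_string(json.dumps(cmd))
        return json.loads(sock.recv_string())
    finally:
        sock.close()
```

For example, `send_command(make_command("type", text="42"))` should come back as a dict with `status` and `message` keys if the actuator is running on the default port.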
:: Dump the live element tree (useful for finding button names and IDs)
python test_client.py --dump
:: Click a button by name
python test_client.py --click "Nine"
:: Type text
python test_client.py --type "42"

agent.py closes the full perception-action loop. It reads the semantic state from the observer, formats it into a structured prompt, calls Claude with four tool definitions, executes each tool call against the actuator, and loops until the task is complete or --max-iters is reached.
:: Calculator arithmetic
python agent.py --window "Calculator" --task "Compute 9 + 3"
:: Notepad text entry
python agent.py --window "Notepad" --task "Type 'Hello from OpenWinBot' into the editor"
:: Paint exploration
python agent.py --window "Paint" --task "List every toolbar button you can see, then click the Pencil tool"
:: Custom model or iteration limit
python agent.py --window "Calculator" --task "Compute 99 + 1" --model claude-opus-4-6 --max-iters 20

| Tool | Input | Description |
|---|---|---|
| click_id | id: string | Preferred. Click an element by its stable ID. The actuator resolves the centre automatically from its state cache. |
| click | x, y: int | Click a raw screen coordinate. Use when no element ID is available. |
| type_text | text: string | Inject a Unicode string as keystrokes. Works for any language or emoji. |
| read_state | (none) | Fetch the latest full UI state. Call after every action to observe the result. |
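The four tools map naturally onto the tool-definition format of the Anthropic Messages API (`name`, `description`, `input_schema`). The following sketch shows how they might be declared; the variable name `TOOLS` and the description strings are assumptions for illustration and are not copied from agent.py.

```python
# Hypothetical declarations of the four tools from the table above.
# Names and inputs follow the table; everything else is illustrative.
TOOLS = [
    {
        "name": "click_id",
        "description": "Preferred. Click an element by its stable ID.",
        "input_schema": {
            "type": "object",
            "properties": {"id": {"type": "string"}},
            "required": ["id"],
        },
    },
    {
        "name": "click",
        "description": "Click a raw screen coordinate.",
        "input_schema": {
            "type": "object",
            "properties": {"x": {"type": "integer"}, "y": {"type": "integer"}},
            "required": ["x", "y"],
        },
    },
    {
        "name": "type_text",
        "description": "Inject a Unicode string as keystrokes.",
        "input_schema": {
            "type": "object",
            "properties": {"text": {"type": "string"}},
            "required": ["text"],
        },
    },
    {
        "name": "read_state",
        "description": "Fetch the latest full UI state.",
        "input_schema": {"type": "object", "properties": {}},
    },
]
```

A list in this shape can be passed as the `tools` argument of a Messages API request.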
In semantic mode the agent formats the state as a compact block. For a Calculator window:
Window: 'Calculator' (timestamp=1712055600123)
── DISPLAY / STATUS ──────────────────────────────
Text 'Display is 0'
── ACTIONABLE ELEMENTS ───────────────────────────
ID TYPE NAME CENTER
0a1b2c3d Button Zero (140,590)
1c2d3e4f Button One (140,530)
2d3e4f5a Button Two (220,530)
...
9e0f1a2b Button Nine (300,470)
ab1c2d3e Button Plus (380,530)
bc2d3e4f Button Equals (380,590)
cd3e4f5a Text Display is 0 (190,90)
The LLM copies the short ID directly into a click_id call — no pixel arithmetic, no coordinate guessing.
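A formatter for a block like the one above could look roughly as follows. This is a sketch, not the code in agent.py: `format_state` and `centre` are hypothetical names, the short ID is assumed to be the first 8 characters of the 16-character `id`, and the centre is assumed to be the midpoint of `rect`. Disabled elements are left out of the actionable table here, which is a design choice of this sketch.

```python
def centre(rect):
    """Midpoint of a rect dict with x, y, w, h keys (integer pixels)."""
    return rect["x"] + rect["w"] // 2, rect["y"] + rect["h"] // 2


def format_state(state, id_len=8):
    """Render a full StateMessage as a compact text block for the prompt."""
    out = [f"Window: '{state['window']}' (timestamp={state['timestamp']})"]
    out.append("-- DISPLAY / STATUS --")
    for e in state["elements"]:
        if e["role"] == "informational":
            out.append(f"{e['type']} '{e['name']}'")
    out.append("-- ACTIONABLE ELEMENTS --")
    out.append(f"{'ID':<{id_len}}  {'TYPE':<8}  {'NAME':<16}  CENTER")
    for e in state["elements"]:
        if e["role"] == "actionable" and e["is_enabled"]:
            cx, cy = centre(e["rect"])
            out.append(
                f"{e['id'][:id_len]:<{id_len}}  {e['type']:<8}  "
                f"{e['name']:<16}  ({cx},{cy})"
            )
    return "\n".join(out)
```

Keeping one element per line with fixed-width columns keeps the prompt short while still letting the model copy an ID verbatim.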
You are a Windows desktop automation agent powered by the OpenWinBot framework.
You receive a structured view of a window's UI elements and can interact with
them using the provided tools.
Rules:
1. Only interact with elements where role is "actionable" and is_enabled is true.
2. Prefer click_id over click — paste the element's id field exactly as shown.
3. After every action call read_state to observe the result before deciding next step.
4. When the task is complete, describe exactly what you did and the final state.
5. If an element you need is not visible, say so clearly rather than guessing.
from agent import run_agent

result = run_agent(
    task="Compute 9 + 3",
    window="Calculator",
    obs_addr="tcp://localhost:5555",
    act_addr="tcp://localhost:5556",
)
# result = {"success": True, "summary": "Clicked Nine, Plus, Three, Equals. Display shows 12.", "iterations": 5}

test_suite.py launches Calculator, Notepad, and Paint; starts the observer and actuator for each; runs the Claude agent; and validates the result.
:: Run all three tests
python test_suite.py
:: Run a single test
python test_suite.py --tests calculator
:: Run two tests, leave apps open for inspection
python test_suite.py --tests calculator notepad --no-close
:: Custom binary path (e.g. Debug build)
python test_suite.py --bin-dir build\bin\Debug

Expected output:
════════════════════════════════════════════════════════════════
OpenWinBot Test Suite (3 test(s))
Binaries : D:\...\build\bin\Release
Model : claude-sonnet-4-6
════════════════════════════════════════════════════════════════
── Calculator — arithmetic (9 + 3 = 12) ────────────────────────
[1/4] Launching calc.exe...
[2/4] Starting win-observer (semantic, 3 fps)...
[3/4] Starting win-actuator (with --obs for click_id)...
[4/4] Running agent (Claude)...
[agent] Tool call: click_id({"id": "9e0f1a2b..."})
[agent] Tool call: click_id({"id": "ab1c2d3e..."})
[agent] Tool call: click_id({"id": "2d3e4f5a..."})
[agent] Tool call: click_id({"id": "bc2d3e4f..."})
[agent] Tool call: read_state({})
[agent] ✓ Done after 6 iterations.
Result : PASS — display shows 12
════ Test Summary ════════════════════════════════════════════════
PASS calculator display shows 12
PASS notepad typed successfully
PASS paint completed without error
Total: 3/3 passed
| Flag | Default | Description |
|---|---|---|
| --window / -w | (required) | Partial window title fragment to scan |
| --fps / -f | 10 | Scan rate, 1–120 fps |
| --bind / -b | tcp://*:5555 | ZMQ PUB socket bind address |
| --semantic / -s | off | Emit only actionable and informational elements; drop all structural noise |
| --delta / -d | off | After the first full frame, publish only incremental diffs. Every 30 ticks a full frame is re-broadcast for late-joining subscribers |
| Flag | Default | Description |
|---|---|---|
| --window / -w | (required) | Partial window title fragment to inject input into |
| --bind / -b | tcp://*:5556 | ZMQ REP socket bind address |
| --obs / -o | (none) | Observer PUB address to subscribe to. Enables the background state-cache thread and the click_id action |
| Flag | Default | Description |
|---|---|---|
| --window / -w | (required) | Target window title |
| --task / -t | (required) | Natural-language task for Claude |
| --obs | tcp://localhost:5555 | Observer address |
| --act | tcp://localhost:5556 | Actuator address |
| --model | claude-sonnet-4-6 | Anthropic model ID |
| --max-iters | 12 | Maximum agent loop iterations before giving up |
| Flag | Default | Description |
|---|---|---|
| --bin-dir | build/bin/Release | Directory containing the compiled .exe files |
| --tests | (all) | Space-separated list of test names: calculator, notepad, paint |
| --no-close | off | Leave target apps running after each test |
| --obs-port | 5555 | PUB socket port |
| --act-port | 5556 | REP socket port |
| Flag | Description |
|---|---|
| --dump | Print element tree and exit |
| --click NAME | Click element whose name matches NAME |
| --type TEXT | Type TEXT into the window |
| --obs ADDR | Observer address override |
| --act ADDR | Actuator address override |
Published by win-observer every tick (or every 30 ticks in delta mode as a re-sync).
{
"type": "full",
"timestamp": 1712055600123,
"window": "Calculator",
"elements": [
{
"id": "9e0f1a2b3c4d5e6f",
"type": "Button",
"name": "Nine",
"rect": { "x": 260, "y": 470, "w": 80, "h": 60 },
"role": "actionable",
"is_enabled": true
},
{
"id": "1b2c3d4e5f6a7b8c",
"type": "Text",
"name": "Display is 0",
"rect": { "x": 10, "y": 60, "w": 340, "h": 54 },
"role": "informational",
"is_enabled": true
}
]
}

| Field | Description |
|---|---|
| id | 16-char hex hash of type + name + rect. Stable across frames as long as the element doesn't move or change name, so it can be reused as a reference until then. |
| type | UIAutomation control type: Button, Edit, Text, CheckBox, ComboBox, ListItem, MenuItem, Slider, TabItem, etc. |
| role | "actionable": LLM can click/type/toggle. "informational": LLM should read. "": structural noise, only present in full mode. |
| is_enabled | false when the element is visible but greyed-out. LLMs must not attempt to interact with disabled elements. |
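On the consumer side, a full frame can be parsed into typed records and narrowed to the elements an LLM is allowed to touch. The sketch below assumes each element carries exactly the six fields shown in the example; `UiElement` here is a local Python dataclass, and `parse_full_frame` and `clickable` are hypothetical helper names.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class UiElement:
    id: str
    type: str
    name: str
    rect: dict
    role: str
    is_enabled: bool


def parse_full_frame(raw: str) -> List[UiElement]:
    """Parse a 'full' StateMessage JSON string into UiElement records."""
    frame = json.loads(raw)
    if frame.get("type") != "full":
        raise ValueError(f"expected a full frame, got {frame.get('type')!r}")
    # Assumes no extra keys per element; unknown keys would raise TypeError.
    return [UiElement(**e) for e in frame["elements"]]


def clickable(elements: List[UiElement]) -> List[UiElement]:
    """Keep only elements that are actionable and currently enabled."""
    return [e for e in elements if e.role == "actionable" and e.is_enabled]
```

Filtering on both `role` and `is_enabled` mirrors the rule that disabled elements must not be interacted with.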
Published when --delta is active and the screen has changed since the last tick.
{
"type": "delta",
"timestamp": 1712055600456,
"window": "Calculator",
"added": [ { ...UiElement... } ],
"removed": [ "9e0f1a2b3c4d5e6f" ],
"changed": [ { ...UiElement... } ]
}

removed contains only the element IDs (strings), not the full objects.
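A subscriber that wants the current state has to fold these messages into a cache. A minimal sketch, assuming the cache is a dict keyed by element `id` and that entries in `changed` replace the cached element with the same `id` (`apply_message` is a hypothetical name, not a function from the repository):

```python
def apply_message(cache, msg):
    """Return a new id -> element cache after applying a full or delta message."""
    if msg["type"] == "full":
        # A full frame replaces the cache outright (this also covers re-syncs).
        return {e["id"]: e for e in msg["elements"]}
    if msg["type"] != "delta":
        raise ValueError(f"unknown message type: {msg['type']!r}")
    updated = dict(cache)
    for element_id in msg.get("removed", []):
        updated.pop(element_id, None)  # removed holds bare ID strings
    for element in msg.get("added", []) + msg.get("changed", []):
        updated[element["id"]] = element
    return updated
```

Returning a fresh dict instead of mutating in place makes it easy to swap the cache atomically when a background thread maintains it.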
{ "action": "click", "x": 300, "y": 530 }
{ "action": "type", "text": "Hello World" }
{ "action": "click_id", "id": "9e0f1a2b3c4d5e6f" }

click_id requires win-actuator to have been started with --obs. The actuator's background thread caches the full state (applying deltas as they arrive) and resolves the element centre on every click_id request.
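The resolution step can be pictured as a lookup that turns a click_id into a plain coordinate click. The Python sketch below only illustrates that logic; the real resolver lives in the C++ actuator and may differ. `resolve_click_id` is a hypothetical name, the "unknown element id" message is invented for the example, and the two other error strings are the ones quoted elsewhere in this README.

```python
def resolve_click_id(cache, element_id):
    """Return (point, error): a centre (x, y) on success, else an error string.

    cache maps element id -> element dict, as built from observer frames.
    """
    if not cache:
        return None, "no state cached yet"
    element = cache.get(element_id)
    if element is None:
        # Hypothetical wording; the actuator's actual message is not documented.
        return None, f"unknown element id: {element_id}"
    if not element["is_enabled"]:
        return None, f"element is disabled: {element['name']}"
    r = element["rect"]
    return (r["x"] + r["w"] // 2, r["y"] + r["h"] // 2), None
```

With the Nine button from the state example cached, `resolve_click_id(cache, "9e0f1a2b3c4d5e6f")` yields the midpoint of its rect and no error.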
{ "status": "ok", "message": "" }
{ "status": "error", "message": "element is disabled: Clear" }

build.bat

cmake -S . -B build -G "Visual Studio 17 2022" -A x64
cmake --build build --config Release --parallel

Binaries are written to build\bin\Release\.
| Library | Version | Role |
|---|---|---|
| libzmq | v4.3.5 | ZeroMQ C core (built as static lib) |
| cppzmq | v4.10.0 | Header-only C++ wrapper |
| nlohmann/json | v3.11.3 | JSON serialisation |
All three are fetched by CMake FetchContent on first configure. No vcpkg or manual installation needed.
elements=0 every tick
The window title fragment doesn't match any visible window. Check the exact title in Task Manager → Details and pass a substring: --window "Calc".
click_id returns "no state cached yet"
The actuator was started without --obs, or the observer hasn't published a full frame yet. Add --obs tcp://localhost:5555 when starting the actuator and wait ~1 second for the cache to warm.
Clicks land in the wrong place on a high-DPI display
UIAutomation reports physical pixel coordinates. The actuator normalises these with MulDiv(x, 65535, SM_CXSCREEN). If the target app is DPI-unaware and Windows is scaling it, coordinates may be off. Fix: right-click the target app's .exe → Properties → Compatibility → Override high DPI scaling behaviour → Application.
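To make the normalisation concrete: absolute SendInput coordinates run from 0 to 65535 across the screen, so a physical pixel position is rescaled by the screen size. The sketch below reproduces that arithmetic in Python for illustration. It assumes `SM_CXSCREEN` in the expression above stands for the screen width returned by `GetSystemMetrics(SM_CXSCREEN)`, and that the y axis is treated the same way with the screen height; `muldiv` and `to_absolute` are hypothetical names.

```python
def muldiv(a, b, c):
    """a * b / c rounded to the nearest integer, for non-negative inputs.

    Mirrors the rounding of Win32 MulDiv, which returns -1 when c is 0.
    """
    if c == 0:
        return -1
    return (a * b + c // 2) // c


def to_absolute(x, y, screen_w, screen_h):
    """Map physical pixel coordinates onto the 0..65535 absolute input range."""
    return muldiv(x, 65535, screen_w), muldiv(y, 65535, screen_h)


# Centre of a 1920x1080 screen lands in the middle of the absolute range.
print(to_absolute(960, 540, 1920, 1080))  # (32768, 32768)
```

If the target app is DPI-unaware, the coordinates it reports are in its own scaled space, so the same arithmetic produces a point that is offset on screen; that is the symptom the compatibility override addresses.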
Clicks are silently ignored
SendInput is blocked by UIPI when the target process runs at a higher integrity level than the actuator. Run win-actuator.exe as Administrator, or launch the target app from a non-elevated shell so both share the same integrity level.
agent.py times out waiting for a state frame
The observer isn't running, or isn't publishing to the address the agent is subscribing to. Confirm both are using the same port (--bind tcp://*:5555 on the observer, --obs tcp://localhost:5555 on the agent).
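A quick way to check whether frames are arriving at all is to subscribe directly with a timeout. This is a diagnostic sketch, not part of the repository: `wait_for_frame` and `summarise` are hypothetical names, and it assumes the observer publishes each frame as a single-part JSON message with no topic prefix.

```python
import json


def summarise(frame):
    """One-line description of a decoded observer message."""
    kind = frame.get("type")
    if kind == "full":
        return f"full frame: {len(frame.get('elements', []))} elements"
    if kind == "delta":
        return "delta: +{} -{} ~{}".format(
            len(frame.get("added", [])),
            len(frame.get("removed", [])),
            len(frame.get("changed", [])),
        )
    return f"unexpected message type: {kind!r}"


def wait_for_frame(addr="tcp://localhost:5555", timeout_ms=3000):
    """Return the first frame received within timeout_ms, or None."""
    import zmq  # pyzmq, imported lazily so summarise works without it

    sock = zmq.Context.instance().socket(zmq.SUB)
    sock.setsockopt(zmq.LINGER, 0)
    sock.setsockopt_string(zmq.SUBSCRIBE, "")  # subscribe to everything
    sock.connect(addr)
    try:
        if sock.poll(timeout_ms):  # 0 means nothing arrived in time
            return json.loads(sock.recv())
        return None
    finally:
        sock.close()
```

If `wait_for_frame()` returns None while the observer's console shows it scanning, the two sides are most likely using different addresses or ports.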
Build fails on first configure
CMake fetches dependencies over HTTPS using Git. Ensure Git is on PATH. Behind a corporate proxy, set HTTP_PROXY and HTTPS_PROXY before running CMake, or configure Git's proxy: git config --global http.proxy http://proxy:port.