Desktop automation CLI for AI agents. Fast native Rust CLI.
Browser automation is out of scope. agent-ctrl drives native UI; for Chromium-via-CDP use the sibling agent-browser project. The two are designed to compose in the same agent loop.
npm install -g @agent-ctrl/cli

The package ships a Node launcher; the postinstall step downloads the matching native binary for your platform from the corresponding GitHub Release. Supported in v0.1.x: Windows x64, macOS arm64, macOS x64. Linux is on the roadmap.
For tagged releases, download the Windows zip from GitHub Releases or run:
powershell -ExecutionPolicy Bypass -File .\scripts\install-windows.ps1

The installer downloads the latest agent-ctrl.exe, installs it under
%LOCALAPPDATA%\agent-ctrl\bin, and adds that directory to the user PATH
unless -NoPath is passed.
For tagged releases, download the macOS tarball from GitHub Releases (one
asset per arch: aarch64-apple-darwin for Apple Silicon, x86_64-apple-darwin
for Intel) or run:
curl -fsSL https://raw.githubusercontent.com/k4cper-g/agent-ctrl/main/scripts/install-macos.sh | bash

The installer detects the host arch, downloads the matching tarball, places
agent-ctrl at ~/.local/bin/agent-ctrl, and runs agent-ctrl info to
verify. Pass --install-dir <path> to install elsewhere or --no-path to
skip the PATH-update reminder.
After install, grant Accessibility permission in System Settings >
Privacy & Security > Accessibility (and Screen Recording in the same pane
if you'll use screenshot). Run agent-ctrl doctor to verify.
git clone https://github.com/k4cper-g/agent-ctrl
cd agent-ctrl
cargo build --release -p agent-ctrl-cli
# put target/release/agent-ctrl on your PATH

The Rust workspace crates are not published to crates.io in v0.1. The public
distribution paths are npm install -g @agent-ctrl/cli, the GitHub release
binaries (agent-ctrl-*), and source builds.
npm install @agent-ctrl/client
# expects `agent-ctrl` on PATH for the daemon transport

- Windows 10/11 for UIA, macOS 12+ for AX. Both surfaces ship the full action vocabulary. Linux (AT-SPI) ships the snapshot-read path - snapshot, find, inspect, and window-list; its action vocabulary, plus Android / iOS, are not implemented yet. Other OSes build cleanly with stub surfaces.
- Rust 1.85+ (workspace MSRV; rustup will install it from rust-toolchain.toml).
- Node.js 20+ only when using the TypeScript client.
agent-ctrl info # OS, available surfaces, active sessions
agent-ctrl open uia # spawn a daemon (background)
agent-ctrl snapshot --target-process <name> # tree of refs (@e0, @e1, ...)
agent-ctrl click @e4 # click by ref
agent-ctrl get name @e4 # inspect cached snapshot fields
agent-ctrl is enabled @e4 # boolean state checks
agent-ctrl fill @e0 "hello from agent-ctrl" # set value via UIA ValuePattern
agent-ctrl press "Ctrl+S" # key chord via SendInput
agent-ctrl screenshot result.png # PNG of the pinned window
agent-ctrl close # stop the daemon

Every action follows the same pattern: snapshot once to learn what's on screen, then issue actions by ref. Refs are valid only for the snapshot that produced them - re-snapshot before acting on a tree that has changed.
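One way a caller might enforce the ref-lifetime rule on its own side is to tag every ref with the snapshot that produced it and refuse anything older. The sketch below is a hypothetical bookkeeping helper: `SnapshotLedger`, `TaggedRef`, `record`, and `use` are names invented for illustration and are not part of agent-ctrl or its client.

```typescript
// Hypothetical caller-side helper; none of these names come from agent-ctrl.
interface TaggedRef {
  ref: string;        // e.g. "@e4", as printed by snapshot
  generation: number; // which snapshot produced it
}

class SnapshotLedger {
  private generation = 0;

  // Call once after each successful snapshot with the refs it returned.
  record(refs: string[]): TaggedRef[] {
    this.generation += 1;
    const generation = this.generation;
    return refs.map((ref) => ({ ref, generation }));
  }

  // Hand back the bare ref only if it belongs to the latest snapshot.
  use(tagged: TaggedRef): string {
    if (tagged.generation !== this.generation) {
      throw new Error(`stale ref ${tagged.ref}: re-snapshot before acting`);
    }
    return tagged.ref;
  }
}
```

Failing loudly on a stale ref is a deliberate choice here: it turns a silent click on the wrong element into an error the agent loop can react to by re-snapshotting.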
agent-ctrl open <surface> # spawn a daemon (uia, mock, ...)
agent-ctrl close # stop the daemon
agent-ctrl list [--json] # active sessions
agent-ctrl info [--json] # static facts about this binary
agent-ctrl doctor [--json] [--fix] [--quick] # diagnose the install + live probe
agent-ctrl launch [--json] <path> [--wait MS] # spawn an app detached from this shell

agent-ctrl snapshot # capture pinned window's a11y tree
agent-ctrl snapshot --target-process <name> # pin by process executable name
agent-ctrl snapshot --target-pid <pid> # pin by PID
agent-ctrl snapshot --target-title <substring> # pin by window title (locale-dependent)
agent-ctrl snapshot --settle # re-snapshot until the tree stabilizes
agent-ctrl snapshot --json # full JSON for programmatic consumption
agent-ctrl snapshot --compact false # disable compact-tree filtering

The first snapshot after open pins the session to a target window. Subsequent actions on the session target that window until a focus-window re-pins it.
--settle re-snapshots (every 200ms, ~8s cap) until the tree's structural signature holds steady, then prints it. Use it right after launch / switch-app against a Chromium/Electron app (Slack, Teams, VS Code, ...), whose accessibility tree is populated lazily on the first query - so the first plain snapshot often shows only the window frame.
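The settle idea (poll, compare signatures, stop when they hold steady or the budget runs out) can be sketched as a small state tracker. This is an illustration of the general technique only, not agent-ctrl's implementation: `SettleTracker` and its option names are invented here, and the number of identical polls that counts as "steady" is an assumption.

```typescript
// Illustrative sketch only; not how agent-ctrl implements --settle.
type SettleState = "pending" | "stable" | "timed-out";

interface SettleOptions {
  intervalMs: number;      // poll spacing (the text above cites 200 ms)
  capMs: number;           // overall budget (the text above cites roughly 8 s)
  requiredMatches: number; // ASSUMPTION: consecutive identical polls that count as steady
}

class SettleTracker {
  private readonly opts: SettleOptions;
  private last: string | null = null;
  private matches = 0;
  private elapsedMs = 0;

  constructor(opts: SettleOptions) {
    this.opts = opts;
  }

  // Feed one poll result; each call stands for one interval of waiting.
  observe(signature: string): SettleState {
    if (signature === this.last) {
      this.matches += 1;
    } else {
      this.last = signature;
      this.matches = 0; // the tree changed, so start counting again
    }
    if (this.matches >= this.opts.requiredMatches) return "stable";
    this.elapsedMs += this.opts.intervalMs;
    return this.elapsedMs >= this.opts.capMs ? "timed-out" : "pending";
  }
}
```

A caller would compute some structural signature per snapshot (for example a hash of roles and child counts, also an assumption), feed it to `observe`, and sleep `intervalMs` while the result is "pending".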
agent-ctrl click @eN # primary-button click on a ref
agent-ctrl double-click @eN # double-click
agent-ctrl right-click @eN # secondary-button click
agent-ctrl hover @eN # cursor over element, no buttons
agent-ctrl focus @eN # UIA SetFocus
agent-ctrl highlight @eN # move cursor to element for human debugging

agent-ctrl type "hello" # synthetic Unicode keystrokes
agent-ctrl fill @eN "value" # native value setting where supported
agent-ctrl clear @eN # clear an editable field
agent-ctrl press "Ctrl+S" # key chord - Enter, Tab, Ctrl+A, Cmd+A, etc.
agent-ctrl key-down "Shift" # hold a modifier
agent-ctrl key-up "Shift" # release it
agent-ctrl clipboard read # read clipboard text
agent-ctrl clipboard write "text" # replace clipboard text
agent-ctrl clipboard copy # send Ctrl+C
agent-ctrl clipboard paste # send Ctrl+V

agent-ctrl select @eN "Option name" # pick an item in a select / combo / list
agent-ctrl select-all [@eN] # select all in field; without ref, sends Ctrl+A to focus
agent-ctrl check @eN # set a TogglePattern control on
agent-ctrl uncheck @eN # set a TogglePattern control off
agent-ctrl toggle @eN # toggle a TogglePattern control
agent-ctrl scroll <DX> <DY> [--ref @eN] # wheel scroll (positive DY = down)
agent-ctrl scroll-into-view @eN # UIA ScrollItemPattern
agent-ctrl drag @eFROM @eTO # source-to-destination drag
agent-ctrl mouse move X Y # raw mouse move
agent-ctrl mouse down X Y --button left # raw button down
agent-ctrl mouse up X Y --button left # raw button up
agent-ctrl mouse wheel X Y --dy -120 # raw wheel

agent-ctrl find "Save" # case-insensitive substring on name
agent-ctrl find "Save" --role button # narrow by role (kebab-case)
agent-ctrl find "Save" --exact # case-sensitive equality
agent-ctrl find --role menu-item # all nodes of a role; no name filter
agent-ctrl find "OK" --in @e2 # restrict to subtree under @e2
agent-ctrl find "Save" --first # bare ref for shell substitution
agent-ctrl find --limit 5 # cap result count

find queries the cached snapshot - it does not re-walk the OS tree. With no match, it writes "no match" to stderr and exits non-zero. --first prints just @eN so the canonical "find then act" pattern composes:

agent-ctrl click "$(agent-ctrl find "Save" --role button --first)"

agent-ctrl get text @eN # value if present, otherwise accessible name
agent-ctrl get value @eN # editable/value-bearing field value
agent-ctrl get name @eN # accessible name
agent-ctrl get role @eN # canonical role
agent-ctrl get state @eN # full state object
agent-ctrl get bounds @eN # logical screen bounds
agent-ctrl get window # cached window context
agent-ctrl is visible @eN
agent-ctrl is enabled @eN
agent-ctrl is focused @eN
agent-ctrl is selected @eN
agent-ctrl is checked @eN
agent-ctrl is expanded @eN

Inspect commands read the cached snapshot. They are fast and deterministic, but require a prior snapshot.
agent-ctrl wait <MS> # dumb sleep on the daemon worker
agent-ctrl wait-for "Save" --role button # wait for a node to appear
agent-ctrl wait-for "Loading..." --gone # wait for a node to disappear
agent-ctrl wait-for "Agree" --state checked # wait for a boolean state
agent-ctrl wait-for --role text-field --value-contains ready
agent-ctrl wait-for --window-appears "Dialog title" # wait for a sibling window title
agent-ctrl wait-for --stable [--idle-ms 500] # wait for the tree signature to settle
agent-ctrl wait-for ... --timeout 10000 --poll 250 # tune the poll loop

Three reliability tiers. Use --stable after a click to let the UI settle before the next action. Exit codes: 0 satisfied, 1 bad args, 2 timeout - branch on those in shell pipelines instead of parsing strings.
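For callers that spawn the CLI from TypeScript rather than a shell, the same branching can be sketched as a pure mapping from exit status to a named outcome. The function names and the retry policy below are assumptions made for illustration; only the 0 / 1 / 2 meanings come from the text above.

```typescript
type WaitOutcome = "satisfied" | "bad-args" | "timeout" | "unexpected";

// Map a wait-for exit status to a named outcome (0, 1, 2 as documented above).
function classifyWaitExit(status: number | null): WaitOutcome {
  switch (status) {
    case 0:
      return "satisfied";
    case 1:
      return "bad-args";
    case 2:
      return "timeout";
    default:
      return "unexpected"; // null (no exit status) or any other code
  }
}

// Decide what the agent loop does next.
// ASSUMPTION: retrying only on timeout is an illustrative policy, not a project recommendation.
function nextStep(
  outcome: WaitOutcome,
  attempt: number,
  maxAttempts: number,
): "continue" | "retry" | "abort" {
  if (outcome === "satisfied") return "continue";
  if (outcome === "timeout" && attempt < maxAttempts) return "retry";
  return "abort";
}
```

The `number | null` parameter matches the `status` field returned by Node's `child_process.spawnSync`, so the result of spawning `agent-ctrl wait-for ...` can be passed in directly.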
agent-ctrl window-list # all top-level windows owned by the pinned process
agent-ctrl window-list --first-other # bare hex id of the first non-pinned window
agent-ctrl focus-window <hex_id> # bring a window to the foreground; re-pins the session
agent-ctrl switch-app <app_id> # foreground an app by id (path or bare exe name); re-pins

When a file dialog, confirmation dialog, or popup appears as a sibling top-level window, window-list is how you find it. focus-window re-pins so subsequent snapshot / find / actions target the dialog. Mirrors agent-browser's tab_list / tab_switch.
switch-app and focus-window un-minimize their target (a window only minimized to the taskbar) before bringing it forward. They do not un-hide a window the app has hidden to the system tray (Slack, Teams, Discord and friends "close to tray") - tray apps re-hide a window shown out from under them. If snapshot --target-process X reports the process is running but its window is hidden, bring it forward through the app's own channels: click its tray icon, or agent-ctrl launch <path> (launching a packaged/Store app's exe still routes through the activation broker, which resumes and shows it correctly).
agent-ctrl press "Ctrl+S" # may open a sibling dialog HWND
agent-ctrl focus-window "$(agent-ctrl window-list --first-other)"
agent-ctrl snapshot # now sees the dialog
agent-ctrl click "$(agent-ctrl find "OK" --role button --first)"

For detailed Windows guidance on dialogs, elevation, stale refs, foreground focus, IME, screenshots, and app framework quirks, see docs/windows-reliability.md.
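The dialog sequence above (list, focus, snapshot, find, click) can also be sketched as one function that stops at the first failing step. The command runner is injected, so the sketch makes no claim about how the CLI is spawned: `Run`, `RunResult`, and `clickInDialog` are hypothetical names, and treating an empty `window-list --first-other` result as "no dialog" is an assumption.

```typescript
// Hypothetical runner contract: takes agent-ctrl arguments, returns exit status and stdout.
type RunResult = { status: number; stdout: string };
type Run = (args: string[]) => RunResult;

// Returns the clicked ref, or null if any step fails or no sibling window is found.
function clickInDialog(run: Run, buttonName: string): string | null {
  const win = run(["window-list", "--first-other"]);
  const windowId = win.stdout.trim();
  // ASSUMPTION: a failure or empty output both mean there is no sibling window.
  if (win.status !== 0 || windowId === "") return null;

  if (run(["focus-window", windowId]).status !== 0) return null; // re-pins the session
  if (run(["snapshot"]).status !== 0) return null; // refs now describe the dialog

  const found = run(["find", buttonName, "--role", "button", "--first"]);
  if (found.status !== 0) return null; // find exits non-zero on no match

  const ref = found.stdout.trim();
  return run(["click", ref]).status === 0 ? ref : null;
}
```

In practice `run` could wrap Node's `child_process.spawnSync("agent-ctrl", args)`; keeping it as a parameter also lets the flow be exercised against a fake runner.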
agent-ctrl screenshot # PNG of the pinned window to a temp path
agent-ctrl screenshot result.png # to a specific path
agent-ctrl screenshot --region X,Y,W,H # crop in physical screen pixels
agent-ctrl screenshot --target desktop # virtual desktop
agent-ctrl screenshot --target window # pinned window
agent-ctrl screenshot --target ref --ref @eN # element bounds
agent-ctrl screenshot --annotated # draw @eN labels from cached snapshot bounds

--annotated draws cached snapshot refs onto the PNG. Run snapshot first so the screenshot has a current ref map and bounds.
agent-ctrl batch --file steps.json
Get-Content steps.json | agent-ctrl batch --stdin # PowerShell-friendly
agent-ctrl batch '[{"op":"find","query":{"name":"Save","limit":1}}]' # Unix-shell-friendly

Batch steps run in order on one daemon session and return structured per-step results. Supported step ops: act, find, get, is, wait, and list_windows.
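Hand-quoting JSON on a command line is error-prone, so a script might build the step list in code and write it to a file for `batch --file`. The sketch below models only the `find` step, because that is the only step shape shown above; `FindStep`, `findStep`, and `toBatchPayload` are hypothetical helper names, and the fields of the other ops are deliberately left out rather than guessed.

```typescript
// Shape copied from the inline batch example above; other ops are not modeled here.
interface FindStep {
  op: "find";
  query: { name: string; limit?: number };
}

// Build one find step, omitting `limit` entirely when it is not given.
function findStep(name: string, limit?: number): FindStep {
  const query: FindStep["query"] = { name };
  if (limit !== undefined) query.limit = limit;
  return { op: "find", query };
}

// Serialize steps into the JSON array that batch accepts.
function toBatchPayload(steps: FindStep[]): string {
  return JSON.stringify(steps);
}
```

`toBatchPayload([findStep("Save", 1)])` reproduces the inline example byte for byte, which makes it easy to check the builder against the documented form.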
agent-ctrl list --json
agent-ctrl find "Save" --role button --json
agent-ctrl get state @eN --json
agent-ctrl is enabled @eN --json
agent-ctrl click @eN --json
agent-ctrl wait-for --stable --json
agent-ctrl window-list --json
agent-ctrl screenshot out.png --json

Most runtime commands accept --json for machine-readable output. snapshot --json returns the full snapshot; get --json, is --json, action commands, wait-for --json, and window-list --json return structured protocol results. batch output is always JSON, so batch --json is accepted as a compatibility no-op.
Session commands redact the TCP auth token from JSON output. screenshot --json writes the PNG to disk and prints file metadata (path, width, height, bytes, annotated) instead of echoing the base64 image payload.
When --json is present, parse and runtime failures are emitted as one structured object with ok: false, error.code, error.message, and, when available, error.hint. Exit codes still matter: 0 means success, 1 means command/request failure, and wait-for --json keeps exit 2 for timeouts while printing the structured wait outcome.
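A consumer of the structured failure object might validate it before trusting any field. The sketch below checks only the fields named above (`ok`, `error.code`, `error.message`, `error.hint`); treating `code` and `message` as strings is an assumption, `CliFailure` and `parseFailure` are hypothetical names, and the success shape is not modeled because it is not described here.

```typescript
interface CliFailure {
  code: string;    // ASSUMPTION: error.code is a string
  message: string;
  hint?: string;   // present only when the CLI provides one
}

// Returns the failure details if `text` is a structured failure object, otherwise null.
function parseFailure(text: string): CliFailure | null {
  let doc: unknown;
  try {
    doc = JSON.parse(text);
  } catch {
    return null; // not JSON at all
  }
  if (typeof doc !== "object" || doc === null) return null;
  const record = doc as Record<string, unknown>;
  if (record["ok"] !== false) return null;

  const error = record["error"];
  if (typeof error !== "object" || error === null) return null;
  const fields = error as Record<string, unknown>;
  const code = fields["code"];
  const message = fields["message"];
  const hint = fields["hint"];
  if (typeof code !== "string" || typeof message !== "string") return null;

  return typeof hint === "string" ? { code, message, hint } : { code, message };
}
```

The exit code remains the primary signal; a parser like this is only for surfacing the message and hint to the agent once a failure is already known.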
Run multiple isolated UIA sessions side by side:
agent-ctrl open uia --session app1
agent-ctrl open uia --session app2
agent-ctrl snapshot --session app1 --target-process <process-a>
agent-ctrl snapshot --session app2 --target-process <process-b>
agent-ctrl list
# SESSION SURFACE PID ENDPOINT
# app1 uia 12345 127.0.0.1:54001
# app2 uia 12346 127.0.0.1:54002
agent-ctrl close --session app1
agent-ctrl close --session app2

The default session is default, so most commands need no flag. Each session has its own daemon process, pinned target window, cached snapshot, and refs. Session metadata lives at ~/.agent-ctrl/<name>.json while the daemon is running. TCP session files include a random per-session auth token, and every CLI TCP request sends it automatically. Stdio daemon clients, including the TypeScript client, do not need a token.
The mock surface returns a fixed two-button window - handy for testing the protocol without UIA permissions or a target app:
agent-ctrl open mock
agent-ctrl snapshot
agent-ctrl click @e0
agent-ctrl close

Available on every OS, no setup required. Used by the integration tests under packages/client/tests/.
import { AgentCtrl } from "@agent-ctrl/client";
const ctrl = new AgentCtrl(); // spawns `agent-ctrl daemon` over stdio
const session = await ctrl.openSession("uia");
await ctrl.snapshot(session, {
target: { by: "process-name", name: "target-app" },
});
const matches = await ctrl.find(session, {
name: "Save",
role: "button",
});
await ctrl.act(session, { kind: "click", ref_id: matches[0]!.ref_id });
const outcome = await ctrl.waitFor(session, {
predicate: { kind: "stable", idle_ms: 500 },
timeout_ms: 5000,
poll_ms: 250,
});
await ctrl.closeSession(session);
await ctrl.close();

Method surface: openSession, snapshot, act, find, waitFor, listWindows, closeSession, close. Both transports (shell CLI and stdio TypeScript) talk the same wire protocol; agents can mix and match.
See packages/client/README.md for the full API.
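The example above indexes `matches[0]!` directly, which throws if find returns nothing. A more defensive find-then-click might look like the sketch below. It is written against a minimal structural interface rather than the real package so that it stands alone: `ClientLike`, `Match`, and `clickFirstMatch` are hypothetical, and the method shapes are inferred from the example rather than copied from the client's typings.

```typescript
// ASSUMPTION: shapes inferred from the usage example, not from @agent-ctrl/client's typings.
interface Match {
  ref_id: string;
}

interface ClientLike<S> {
  find(session: S, query: { name?: string; role?: string }): Promise<Match[]>;
  act(session: S, action: { kind: "click"; ref_id: string }): Promise<unknown>;
}

// Click the first match and return its ref, or return null when nothing matched.
async function clickFirstMatch<S>(
  ctrl: ClientLike<S>,
  session: S,
  query: { name?: string; role?: string },
): Promise<string | null> {
  const matches = await ctrl.find(session, query);
  const first = matches[0];
  if (first === undefined) return null; // no match: caller can wait or re-snapshot
  await ctrl.act(session, { kind: "click", ref_id: first.ref_id });
  return first.ref_id;
}
```

Returning null instead of throwing leaves the decision to the caller, for instance running a `waitFor` and trying again before giving up.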
agent-ctrl uses a client-daemon architecture mirroring agent-browser:
- Rust CLI (crates/cli) - parses commands, dials the daemon, prints results.
- Rust daemon (crates/daemon) - long-running process that owns surface sessions and dispatches snapshot / action / find / wait / list-windows requests.
- Surface trait (crates/core) - cross-platform contract every backend implements. Per-platform crates (crates/surface-uia, surface-ax, surface-atspi) provide the implementations, gated by target_os.

The daemon starts via agent-ctrl open <surface> and persists across CLI invocations for fast subsequent operations. Each session has its own daemon process and writes a discovery file at ~/.agent-ctrl/<session>.json.
The repository is a dual workspace - a Cargo workspace for the Rust engine and an npm workspace for the TypeScript client.
| Crate / package | Purpose |
|---|---|
| crates/core | Shared types and the Surface trait. Schema, role taxonomy, action vocabulary, errors. |
| crates/daemon | Long-running process that owns surface sessions and dispatches actions. |
| crates/cli | The agent-ctrl binary - user-facing entrypoint. |
| crates/surface-uia | Windows UI Automation surface (Windows-only). |
| crates/uia-fixture | Deterministic native Win32 fixture app for UIA reliability tests. |
| crates/surface-ax | macOS Accessibility surface (full action vocabulary; macOS-only). |
| crates/ax-fixture | Deterministic native Cocoa fixture app for AX reliability tests. |
| crates/surface-atspi | Linux AT-SPI surface - snapshot-read path (Linux-only). |
| crates/atspi-fixture | Deterministic GTK4 fixture app (Python) for AT-SPI tests. |
| packages/client | @agent-ctrl/client - typed TypeScript wrapper over stdio JSON-RPC. |

Surfaces gated by target_os compile to empty crates on other platforms, so the workspace builds on any host.
A surface is one accessibility protocol - UIA, AX, AT-SPI, etc. A platform is an operating system. They aren't 1-to-1: most platforms can be driven by more than one surface.
| Platform | Native surface | Status |
|---|---|---|
| Windows | surface-uia - UI Automation | ready |
| macOS | surface-ax - Accessibility / AX | ready |
| Linux | surface-atspi - AT-SPI / D-Bus | snapshot-read - snapshot, find, inspect, and window-list work; the action vocabulary is a follow-up. Mapping in docs/atspi-mapping.md; headless dev/CI stack in docker/linux-dev/ |
| Android | planned surface-accessibility-service (JNI) | not started |
| iOS | planned surface-xcuitest (WebDriverAgent) | not started |

For browsers, run agent-ctrl alongside agent-browser; the two are complementary, not competing.
Acronyms in one line: UIA = Microsoft UI Automation, AX = macOS Accessibility, AT-SPI = the Linux GNOME accessibility bus, XCUITest = Apple's UI test framework.
AX feature coverage is in docs/macos-ax.md; production guidance for macOS lives in docs/macos-ax-reliability.md. Windows production guidance lives in docs/windows-reliability.md.
cargo check --workspace # fast type-check
cargo build --release -p agent-ctrl-cli # the binary
cargo test --workspace # all unit + integration tests
cargo clippy --workspace --all-targets -- -D warnings # lint, fail on warnings
cargo fmt --all -- --check # format check

Windows UIA fixture:
cargo build -p agent-ctrl-cli -p agent-ctrl-uia-fixture
.\target\debug\agent-ctrl-uia-fixture.exe --ready-file "$env:TEMP\agent-ctrl-fixture.ready"
.\target\debug\agent-ctrl.exe open uia --session fixture
.\target\debug\agent-ctrl.exe snapshot --session fixture --target-process agent-ctrl-uia-fixture

The fixture is the preferred real-UIA test target. It exposes common native controls through stable Win32/UIA patterns so tests do not depend on Notepad, Calculator, localized strings, or Windows-version-specific app redesigns.
Opt-in fixture integration test:
cargo build -p agent-ctrl-uia-fixture
$env:RUN_UIA_TESTS = "1"
cargo test -p agent-ctrl-cli --test windows_uia_fixture

Successful UIA actions may print a method diagnostic such as ok method=keyboard-space, ok method=selection-item-pattern, or ok method=toggle-pattern. These are intended for agents and humans debugging cross-app behavior.
macOS AX fixture:
cargo build -p agent-ctrl-cli -p agent-ctrl-ax-fixture
target/debug/agent-ctrl-ax-fixture --ready-file /tmp/agent-ctrl-ax-fixture.ready &
target/debug/agent-ctrl open ax --session fixture
target/debug/agent-ctrl snapshot --session fixture --target-process agent-ctrl-ax-fixture

Opt-in AX fixture integration test:
cargo build -p agent-ctrl-ax-fixture
RUN_AX_TESTS=1 cargo test -p agent-ctrl-cli --test macos_ax_fixture

The AX fixture covers the deterministic macOS loop for snapshots, find,
click, fill, check, uncheck, toggle, and window-list. Keyboard
actions exist, but are still validated manually because host focus and event-tap
behavior can vary under the Rust test harness.
Linux AT-SPI fixture:
AT-SPI does not exist on Windows or macOS, so the surface-atspi crate is
developed and tested inside the headless container in docker/linux-dev/
(Xvfb + a private session D-Bus + the AT-SPI registry + GTK4). The fixture is a
deterministic GTK4 app, crates/atspi-fixture/main.py.
docker build -t agent-ctrl-linux-dev docker/linux-dev/
docker run --rm -v "$PWD:/work" -w /work agent-ctrl-linux-dev \
  bash -c 'RUN_ATSPI_TESTS=1 cargo test -p agent-ctrl-cli --test linux_atspi_fixture'

The opt-in linux_atspi_fixture test (gated by RUN_ATSPI_TESTS=1, like the
UIA and AX fixture tests) opens an atspi session, launches the GTK fixture,
and exercises snapshot, find, get, is, and window-list. See
docker/linux-dev/README.md for the full
container invocation.
TypeScript client:
npm install
npm run build --workspace=@agent-ctrl/client
npm run test --workspace=@agent-ctrl/client # spawns the Rust daemon under cargo run

The TS test suite spawns the Rust daemon under cargo run and exercises the full protocol against the mock surface - including find, waitFor, and listWindows.
The simplest approach - tell your agent it can use it:
Use agent-ctrl to drive Windows apps. Run `agent-ctrl --help` to see the command list,
and `agent-ctrl info` to check what's available on this machine.
The --help output is comprehensive and most modern agents can figure out the rest from there.
For consistent results, add to your project or global instructions:
## OS automation
Use `agent-ctrl` for native UI automation on Windows and macOS. Core workflow:
1. `agent-ctrl open uia` (Windows) or `agent-ctrl open ax` (macOS) - spawn a daemon
2. Bring the app forward: `agent-ctrl switch-app <name>` (un-minimizes it too) if it has a visible
window; `agent-ctrl launch <path>` if it isn't running or is parked in the system tray
3. `agent-ctrl snapshot --target-process <name>` - pin to the app and capture refs.
Add `--settle` for Chromium/Electron apps (Slack, Teams, VS Code) - their tree
is built lazily, so the first plain snapshot is often just the window frame
4. `agent-ctrl find "Save" --role button --first` - discover refs by name/role
5. `agent-ctrl click @eN` / `fill @eN "text"` / `type "text"` / `press "Ctrl+S"` (or `Cmd+S` on macOS) - interact.
Prefer `type` over `fill` for web-based / Chromium text boxes that need real key events
6. `agent-ctrl wait-for --stable` - let the UI settle before the next action
7. Before anything hard to undo (send a message, hit OK, delete): re-`snapshot`,
confirm the field shows your input and the submit control is enabled, then `press "Enter"`
8. `agent-ctrl window-list` + `focus-window <id>` - switch to dialogs / popups
9. Re-`snapshot` after the tree changes

The recommended pattern is app-agnostic: bring the target forward, snapshot
(with --settle for Chromium-based apps), find by role/name, act, wait for
stability, re-snapshot to verify before any irreversible step, then commit.
Concrete walkthroughs:
- examples/notepad-tour.sh - a simple Win32 app
- examples/chat-dm.sh - the quick-switcher -> compose -> verify -> send flow for a Slack/Teams-style chat app
Production agents should prefer the generic loop over app-specific assumptions.
These are real today - the goal is to fix or document them as the project matures.
- Windows and macOS are the action-ready surfaces. Linux (AT-SPI) is snapshot-read only - it captures trees, resolves find/inspect refs, and lists windows, but cannot yet click, type, or focus; Android / iOS / browser flows are not implemented in this project yet. macOS additionally requires Screen Recording permission for screenshot and may require Automation permission for some Apple system apps (Notes, Calendar, Music) - see docs/macos-ax-reliability.md.
- Linux apps must have accessibility enabled to be visible. GTK and Qt only build their AT-SPI tree when org.a11y.Status.IsEnabled is set; agent-ctrl open atspi flips it, but an app already running may take a moment to register its tree (use snapshot --settle). Headless geometry is approximate - GTK under Xvfb reports no screen coordinates, so bounds may be absent.
- Local TCP daemon auth is developer-machine scoped. TCP session files include a random bearer token and the daemon rejects missing or incorrect tokens, but anyone who can read ~/.agent-ctrl/<session>.json can still use that session. Treat sessions as a local developer-machine boundary, not a multi-user security sandbox.
- Refs are valid only against the snapshot that produced them. If wait-for runs in parallel with another command on the same session (across two shells), the wait loop refreshes the cached refs on each poll, and a previously-issued ref may resolve to a different element. Sequential CLI usage in one shell - the realistic flow - doesn't trip this.
- Modern Win11 file dialogs and popup menus open as sibling top-level windows, not as children of the app's main window. Use window-list + focus-window to discover and switch to them.
- type bypasses IME. Synthetic Unicode keystrokes via SendInput are reliable for ASCII; CJK with IME composition is not supported yet. fill (UIA ValuePattern) is the right escape hatch for non-ASCII text input.
- HWND recycling. Windows reassigns numeric HWNDs after a window closes; window-list shows whatever currently holds an id, with no UIA-runtime-id verification. Theoretical, never observed in practice.
- An unresponsive target wedges the UIA session. UIA calls are cross-process COM calls; if the target app stops pumping messages, a snapshot or action can't return. After ~45s the call times out, the session is marked wedged, and every subsequent call on it fails fast - run agent-ctrl close then agent-ctrl open uia to start a fresh one. (The stuck worker thread is abandoned, so the daemon and other sessions keep working.)

Apache-2.0. See LICENSE.
