# Agent Vision

Give AI agents eyes — and hands — on your screen.
## Description
A macOS utility that gives AI agents eyes — and optionally hands — on your screen. Mark a region and any AI coding agent can take screenshots on demand or control the mouse and keyboard within that area, enabling visual feedback loops for UI development.
Key capabilities:
- Screenshot capture of marked screen regions on demand
- Element discovery via the macOS Accessibility API and Vision OCR
- Focus-free control — clicks, text input, keyboard commands, scrolling, and drag gestures without stealing user focus
- Isolated sessions with UUIDs so multiple agents can work simultaneously without interfering
- Annotated screenshots with numbered badges on discovered elements for spatial context
## Installation
Requires macOS 14+ (Sonoma) and Apple Silicon. Xcode 16+ is needed only when building from source.
```bash
brew tap rvanbaalen/agent-vision
brew install agent-vision
```
After installation, grant permissions in System Settings > Privacy & Security:
- Screen Recording — for capturing screenshots
- Accessibility — for discovering and controlling UI elements
## Quick Start
```bash
agent-vision start                                        # Prints session UUID, shows floating toolbar
# Click "Select Window" to pick a window, or "Select Area" to drag-select a region
agent-vision wait --session <uuid>                        # Blocks until area is selected
agent-vision elements --session <uuid>                    # Discover clickable elements
agent-vision control click --element 1 --session <uuid>   # Click (focus-free)
agent-vision stop --session <uuid>                        # End session
```
Every command except `start` requires `--session <uuid>`. Multiple agent instances can run separate sessions simultaneously without interfering.
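For agents scripting the CLI, a thin wrapper can keep the `--session` plumbing in one place. This is a sketch, not part of the tool: `build_argv` and `run` are hypothetical helpers that only assume the flag conventions shown above.

```python
import subprocess

def build_argv(command, session, *args):
    """Build the agent-vision argv for one invocation.

    `command` may be multi-word, e.g. "control click".
    Every command except `start` takes --session <uuid>."""
    argv = ["agent-vision", *command.split(), *args]
    if command != "start":
        argv += ["--session", session]
    return argv

def run(command, session, *args):
    # Hypothetical convenience: invoke the CLI and return trimmed stdout.
    result = subprocess.run(build_argv(command, session, *args),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()
```

With this in place, `run("capture", session, "--output", "/tmp/before.png")` mirrors the shell examples above while keeping the session UUID out of every call site.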
## Visual Feedback Loop
The core workflow for UI development:
```bash
# Start a session and wait for area selection
SESSION=$(agent-vision start)
agent-vision wait --session $SESSION

# Capture reference state
agent-vision capture --session $SESSION --output /tmp/before.png

# Make code changes, wait for hot reload...

# Capture the result and compare
agent-vision capture --session $SESSION --output /tmp/after.png
```
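The before/after comparison can start as a cheap byte-level check before reaching for real pixel diffing. This standard-library sketch uses a hypothetical `files_differ` helper; note that identical bytes guarantee identical pixels, while differing bytes usually (but not always) mean a visual change, since PNG encoder metadata can also vary.

```python
import hashlib
from pathlib import Path

def files_differ(before, after):
    """Hypothetical helper: True if two capture files differ byte-for-byte.

    Cheap first pass for a visual feedback loop; escalate to a proper
    image diff only when this reports a difference."""
    digest = lambda p: hashlib.sha256(Path(p).read_bytes()).hexdigest()
    return digest(before) != digest(after)
```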
## Screen Control
Agents follow a scan → act → re-scan loop. Element indices change when the UI updates, so always re-scan after actions.
```bash
# Discover interactive elements
agent-vision elements --session $SESSION

# Click by element index (focus-free, uses Accessibility API)
agent-vision control click --element 3 --session $SESSION

# Type text into a field (focus-free, replaces field value)
agent-vision control type --text "hello@example.com" --element 2 --session $SESSION

# Keyboard shortcuts
agent-vision control key --key "cmd+a" --session $SESSION

# Scroll
agent-vision control scroll --delta 0,-300 --session $SESSION

# Drag (for mobile simulators — swipe up to scroll down)
agent-vision control drag --from 200,500 --to 200,200 --session $SESSION

# Re-scan after every action that changes the UI
agent-vision elements --session $SESSION
```
Element targeting is focus-free. When you use `--element N`, the agent interacts via the Accessibility API — the system cursor doesn't move and the user's active window stays focused. The user can keep working while the agent clicks buttons and fills forms. Manual `--at X,Y` coordinates are available as a fallback for custom-drawn UIs.
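Because indices shift between scans, agents typically resolve elements by label on each pass rather than caching an index. A sketch of that lookup, assuming a hypothetical JSON shape for the `elements` output (the real schema may differ):

```python
import json

def find_element_index(elements_json, label):
    """Return the index of the first element whose label matches.

    Assumed (hypothetical) schema:
    {"elements": [{"index": 1, "role": "button", "label": "Save"}, ...]}
    Re-run this after every UI-changing action — indices go stale."""
    data = json.loads(elements_json)
    for element in data.get("elements", []):
        if element.get("label") == label:
            return element["index"]
    return None
```

An agent would call `agent-vision elements`, feed the output through `find_element_index(output, "Save")`, click that index, then scan again.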
## CLI Commands
| Command | Purpose |
|---|---|
| `start` | Launch a session; returns a UUID |
| `wait` | Block until the user selects an area |
| `capture` | Screenshot the selected region |
| `calibrate` | Capture with coordinate crosshairs |
| `preview` | Preview click coordinates visually |
| `elements` | Discover interactive UI elements (with optional `--annotated` screenshots) |
| `control click` | Click an element or coordinate |
| `control type` | Type text into an element |
| `control key` | Send keyboard shortcuts |
| `control scroll` | Scroll within the region |
| `control drag` | Drag between two points |
| `stop` | End the session |
All commands except `start` require `--session <uuid>`.
## How It Works
- Toolbar: A floating macOS panel (like the built-in screenshot toolbar) that stays above all windows
- Area selection: Full-screen overlay with crosshair cursor — click and drag to select, or hover and click to select an entire window
- Border: A dashed blue border with “Agent Vision” label marks the active capture area — click-through and invisible to screenshots
- Capture: Uses `CGWindowListCreateImage` for pixel-perfect region screenshots with the border overlay auto-excluded
- Element discovery: Combines the Accessibility API (interactive elements) and Vision OCR (static text) into a unified JSON output with roles, labels, and exact coordinates
- Input execution: Actions are sent via JSON files and executed using the macOS CGEvent API with visual ripple feedback. All coordinates are bounds-checked to stay within the selected area
- State: The app and CLI communicate via `~/.agent-vision/state.json` for area coordinates and element data
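A sketch of reading that shared state file from an external script. The schema shown here is an assumption for illustration only — the tool's actual `state.json` layout is not documented above:

```python
import json
from pathlib import Path

STATE_PATH = Path.home() / ".agent-vision" / "state.json"

def load_areas(path=STATE_PATH):
    """Return {session_uuid: area_dict} from the shared state file.

    Assumed (hypothetical) schema:
    {"sessions": {"<uuid>": {"area": {"x": 0, "y": 0,
                                      "width": 800, "height": 600}}}}"""
    state = json.loads(Path(path).read_text())
    return {uuid: s["area"] for uuid, s in state.get("sessions", {}).items()}
```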