Agent Vision

Give AI agents eyes — and hands — on your screen

v0.1.0 Swift macOS CLI Source

Description

A macOS utility that gives AI agents eyes — and optionally hands — on your screen. Mark a region and any AI coding agent can take screenshots on demand or control the mouse and keyboard within that area, enabling visual feedback loops for UI development.

Key capabilities:

  • Screenshot capture of marked screen regions on demand
  • Element discovery via the macOS Accessibility API and Vision OCR
  • Focus-free control — clicks, text input, keyboard commands, scrolling, and drag gestures without stealing user focus
  • Isolated sessions with UUIDs so multiple agents can work simultaneously without interfering
  • Annotated screenshots with numbered badges on discovered elements for spatial context

Installation

Requires macOS 14 (Sonoma) or later on Apple Silicon; building from source additionally requires Xcode 16+.

brew tap rvanbaalen/agent-vision
brew install agent-vision

After installation, grant permissions in System Settings > Privacy & Security:

  • Screen Recording — for capturing screenshots
  • Accessibility — for discovering and controlling UI elements

Quick Start

agent-vision start                                        # Prints session UUID, shows floating toolbar
# Click "Select Window" to pick a window, or "Select Area" to drag-select a region
agent-vision wait --session <uuid>                        # Blocks until area is selected
agent-vision elements --session <uuid>                    # Discover clickable elements
agent-vision control click --element 1 --session <uuid>   # Click (focus-free)
agent-vision stop --session <uuid>                        # End session

Every command (except start) requires --session <uuid>. Multiple agent instances can run separate sessions simultaneously without interfering.

Visual Feedback Loop

The core workflow for UI development:

# Start a session and wait for area selection
SESSION=$(agent-vision start)
agent-vision wait --session $SESSION

# Capture reference state
agent-vision capture --session $SESSION --output /tmp/before.png

# Make code changes, wait for hot reload...

# Capture the result and compare
agent-vision capture --session $SESSION --output /tmp/after.png
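A minimal way to close the loop is to check whether the two captures differ at all. The sketch below does a byte-level hash comparison, which is enough to detect "did anything render differently"; a tolerant comparison (antialiasing, clocks in the UI) would need a perceptual image diff instead. This helper is illustrative, not part of agent-vision.

```python
import hashlib
from pathlib import Path

def screenshots_differ(before: str, after: str) -> bool:
    """Return True if the two capture files differ at the byte level."""
    digest = lambda p: hashlib.sha256(Path(p).read_bytes()).hexdigest()
    return digest(before) != digest(after)
```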

Screen Control

Agents follow a scan → act → re-scan loop. Element indices change when the UI updates, so always re-scan after actions.

# Discover interactive elements
agent-vision elements --session $SESSION

# Click by element index (focus-free, uses Accessibility API)
agent-vision control click --element 3 --session $SESSION

# Type text into a field (focus-free, replaces field value)
agent-vision control type --text "hello@example.com" --element 2 --session $SESSION

# Keyboard shortcuts
agent-vision control key --key "cmd+a" --session $SESSION

# Scroll
agent-vision control scroll --delta 0,-300 --session $SESSION

# Drag (for mobile simulators — swipe up to scroll down)
agent-vision control drag --from 200,500 --to 200,200 --session $SESSION

# Re-scan after every action that changes the UI
agent-vision elements --session $SESSION

Element targeting is focus-free. When you use --element N, the agent interacts via the Accessibility API — the system cursor doesn’t move and the user’s active window stays focused. The user can keep working while the agent clicks buttons and fills forms. Manual --at X,Y coordinates are available as a fallback for custom-drawn UIs.

CLI Commands

Command          Purpose
start            Launch a session and print its UUID
wait             Block until the user selects an area
capture          Screenshot the selected region
calibrate        Capture with coordinate crosshairs
preview          Preview click coordinates visually
elements         Discover interactive UI elements (optional --annotated screenshots)
control click    Click an element or coordinate
control type     Type text into an element
control key      Send keyboard shortcuts
control scroll   Scroll within the region
control drag     Drag between two points
stop             End the session

All commands except start require --session <uuid>.

How It Works

  • Toolbar: A floating macOS panel (like the built-in screenshot toolbar) that stays above all windows
  • Area selection: Full-screen overlay with crosshair cursor — click and drag to select, or hover and click to select an entire window
  • Border: A dashed blue border with “Agent Vision” label marks the active capture area — click-through and invisible to screenshots
  • Capture: Uses CGWindowListCreateImage for pixel-perfect region screenshots with the border overlay auto-excluded
  • Element discovery: Combines the Accessibility API (interactive elements) and Vision OCR (static text) into a unified JSON output with roles, labels, and exact coordinates
  • Input execution: Actions are sent via JSON files and executed using the macOS CGEvent API with visual ripple feedback. All coordinates are bounds-checked to stay within the selected area
  • State: The app and CLI communicate via ~/.agent-vision/state.json for area coordinates and element data