Here's the thing that's been eating at me, Herman. We've spent years treating Linux as the default for programmatic control — everything has a CLI, everything is scriptable, everything exposes a surface you can reason about. And then you step outside that world into professional creative software, CAD tools, video editors, and suddenly you're staring at a wall of buttons and sliders with no entry point. That gap, between what Linux assumes about software and what proprietary applications actually give you, is the last real frontier for local MCP.
It's becoming urgent in a way it wasn't even eighteen months ago. When agents lived mostly in the cloud, calling APIs, this wasn't a bottleneck. But now we're seeing MCP clients running locally, on people's actual machines, and the question flips. It's not "can I call this web service." It's "can I tell Photoshop to apply a filter, or tell Fusion three sixty to export a mesh, or tell DaVinci Resolve to add a node." And the answer, for most software, is not obvious.
Which is exactly what Daniel sent us. He's been deep in the networked MCP world, aggregating tools at the router level, bundling them so every device on the network can reach them. But he's hitting the local problem now. He frames it like this — if you've got software that exposes a real CLI, like Blender with its Python API, you're fine. You interact at the protocol level. But what about programs that don't expose anything externally, or that bury their scripting behind an internal-only interface? Are we stuck just pointing a camera at the screen and hoping Vision models can click the right button, or are there actual protocol-level workarounds?
That question is way more interesting than most people realize. Because the knee-jerk answer is "yeah, Vision is your fallback, screenshot plus OCR plus click simulation, good luck." But that's wrong. Or at least, it's incomplete. There's a whole layer of control surfaces hiding in plain sight on Windows and macOS that most developers never think to look for.
Before we dig into those, quick note — this episode's script is being generated by DeepSeek V four Pro. There, I said it.
Okay, so let's frame the problem properly. When Daniel says "expose a control surface," what does that actually mean at the OS process level?
At the most basic level, a control surface is any mechanism that lets an external process interact with the internal state and behavior of an application. A CLI is the cleanest version of this — you type a command, the program executes it, you get structured output back. An API library, like Blender's Python bindings, is another. But these are just two species in a much larger genus.
The genus is basically "interprocess communication that carries intent." You want to say "select this layer," "apply this effect," "export to this format," and have the target application understand and execute those commands. Whether that happens through a command line, a Python import, a COM interface, an AppleScript bridge, or an accessibility tree — the intent is the same. The mechanism is what varies.
Let's start with the obvious question Daniel's dancing around. Why is Vision the fallback everyone reaches for, and why is it actually a terrible first choice?
Vision is seductive because it's universal. You point a model at the screen, it sees what a human sees, it clicks where a human clicks. No integration required, no API to learn, no vendor cooperation needed. And for quick demos, it's genuinely impressive. But in practice, it's slow, brittle, and expensive. Every action requires a screenshot, inference time, coordinate calculation. You're adding maybe two hundred to five hundred milliseconds of latency per interaction, minimum. If you're doing something that requires fifty sequential operations — like setting up a complex color grade in DaVinci Resolve — that's ten to twenty-five seconds of pure overhead.
That's before you get to the reliability problems. Screen resolution changes, UI scaling factors, OS theme updates, even just a window being slightly repositioned — all of these can break a Vision-based automation that worked perfectly yesterday.
Vision treats the application as a black box that produces pixels, and the agent as something that interprets pixels to infer state. But the application already knows its own state. It knows which button is where, which text field has focus, which menu is expanded. The goal of a proper control surface is to tap into that internal knowledge directly, rather than reconstructing it from the outside.
The real question Daniel's asking is: what are the mechanisms we can use to bypass the pixel layer and talk to the application at the level of widgets, actions, and state? And the answer, especially on Windows, is surprisingly rich.
Let's start with the one that most developers overlook entirely: accessibility APIs. On Windows, this is UI Automation, which has been part of the .NET Framework since version three point zero, back in two thousand six. On macOS, it's the Accessibility API, introduced in OS X ten point two in two thousand two. And on Linux, there's AT hyphen SPI. These were all designed for screen readers and assistive technologies, but they expose something incredibly valuable for MCP: a structured tree of every UI element in every running application.
That tree isn't just a list of labels. It includes the element type — button, text field, slider, list item — its current state, its position, its parent-child relationships, and crucially, the actions you can perform on it. Click, focus, set value, expand, collapse. An MCP client can walk that tree, find the element it needs, and invoke the action directly, without ever looking at a single pixel.
This is a massive unlock. Imagine you're building an MCP server that needs to control a proprietary accounting application. That application has no CLI, no scripting API, no COM interface. But it does have a standard Windows UI — buttons, menus, text fields. UI Automation gives you a programmatic handle to every one of those elements. You can write a tool called "click_button" that takes a button name, searches the UI Automation tree for a button with that name, and invokes the click action. Latency is maybe ten to twenty milliseconds, compared to two hundred plus for Vision.
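To make that tangible, here's a minimal sketch using the pywinauto library's UI Automation backend — the window title and button name are placeholders I've invented, not anything from a real product:

```python
# Minimal sketch: click a button through the UIA tree instead of the pixels.
# Window title and button name are hypothetical placeholders.
from pywinauto import Desktop

win = Desktop(backend="uia").window(title_re=".*Ledger Pro.*")
button = win.child_window(title="Post Invoice", control_type="Button")
button.wait("visible", timeout=5)
button.click()  # for UIA buttons this goes through the Invoke pattern, not mouse movement
```

Wrap that in an MCP tool that takes the button name as a parameter and you've got your "click_button" primitive.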
Here's the part that most people don't realize: UI Automation works even on applications that were never designed to be automated. The Windows accessibility framework hooks into the windowing system at a low level. As long as the application uses standard Windows controls — and most professional software does, at least partially — the tree is there, whether the developer thought about it or not.
The same is true on macOS. The Accessibility API is deeply integrated into AppKit. If an application uses standard Cocoa controls, the accessibility hierarchy is populated automatically. Third-party frameworks that draw their own custom UI can opt in, and many do, because otherwise VoiceOver wouldn't work with their software. So even applications like Final Cut Pro or Logic Pro, which are famously opaque and proprietary, expose a surprising amount of structure through the accessibility layer.
Okay, but accessibility APIs have limits. They're great for clicking buttons and reading text fields, but they don't give you access to the application's internal document model. You can't use UI Automation to tell Photoshop "apply a Gaussian blur with a radius of five pixels to the selected layer." You can click the Filter menu, navigate to Blur, select Gaussian Blur, and type "five" into the radius field — but that's fragile. It depends on the menu structure staying exactly the same between versions.
Which brings us to the second technique: interprocess communication hooks, or IPC. This is where things get really interesting on Windows. Microsoft built an entire automation infrastructure called COM, the Component Object Model, and it's been baked into Windows and Office for decades. COM allows one application to expose objects, methods, and properties that another application can call directly.
The key word here is "directly." When you use COM to control Microsoft Excel, you're not simulating clicks. You're actually creating an Excel application object, calling its methods, setting its properties. You can say "open this workbook, read the value of cell B seven, set the formula in column C to this expression, then save and close." It's as close to a native API as you can get without the vendor explicitly publishing one.
It's not just Office. Adobe Creative Suite exposes COM interfaces. AutoCAD exposes COM interfaces. Dozens, maybe hundreds of professional Windows applications have COM automation layers. The challenge is discovery — these interfaces are often poorly documented, or documented only in legacy MSDN pages that haven't been updated in a decade. But they're there, and they're functional.
If you're building an MCP server and your target application is on Windows, the first thing you should do is check whether it registers any COM objects. You can use tools like OLEView or just browse the registry under HKEY underscore CLASSES underscore ROOT. If you find a COM class with a name like "Photoshop dot Application" or "AutoCAD dot Application," you've struck gold. You can instantiate that object from Python using the win32com library and start calling methods.
The Python win32com library, by the way, is an unsung hero here. It wraps the COM infrastructure in a way that feels almost Pythonic. You can do things like "excel dot Workbooks dot Open, parentheses, quote, myfile dot xlsx, quote, parentheses" and it just works. An MCP server can wrap these COM calls into MCP tools with almost no boilerplate.
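In actual code, that Excel call looks roughly like this — a sketch assuming the pywin32 package and an installed Excel that registers the usual ProgID; the file path is made up:

```python
# Sketch: drive Excel through its COM automation interface via pywin32.
import win32com.client

excel = win32com.client.Dispatch("Excel.Application")
excel.Visible = False
wb = excel.Workbooks.Open(r"C:\reports\myfile.xlsx")    # hypothetical path
value = wb.Worksheets(1).Range("B7").Value               # read cell B7
wb.Worksheets(1).Range("C2").Formula = "=A2*B2"          # set a formula
wb.Save()
wb.Close()
excel.Quit()
print(value)
```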
Let's get concrete. Say you want to control Photoshop from an MCP agent. You're on Windows. You check the registry and find Photoshop dot Application. You write an MCP tool called "photoshop underscore open underscore document" that takes a file path, creates the COM object, and calls the Open method. Another tool called "photoshop underscore apply underscore filter" takes a filter name and parameters. Another tool called "photoshop underscore export" takes a format and destination path. From the agent's perspective, it's just calling tools. It has no idea there's a COM bridge underneath.
The latency on COM calls is essentially native. You're not going through a network stack, you're not doing screenshot inference. You're making a direct in-process or cross-process method call. The overhead is measured in microseconds, not milliseconds.
Now, COM is Windows-specific, and it's tied to applications that explicitly support it. What about macOS? The equivalent on the Apple side is AppleScript, or more precisely, the Open Scripting Architecture that AppleScript sits on top of.
AppleScript is one of those technologies that people love to make fun of — the syntax is weird, it reads like broken English, it's been "dying" for twenty years. But underneath the quirky surface, it's incredibly powerful for automation. Applications can expose an AppleScript dictionary that defines the nouns and verbs the application understands. "Document," "layer," "export," "apply filter." An MCP server can construct AppleScript commands and send them to the target application via the osascript command-line tool or the Scripting Bridge framework.
The AppleScript dictionary is essentially a schema. It tells you exactly what objects exist, what properties they have, what commands they respond to. If you're building an MCP server for macOS, you can parse the application's AppleScript dictionary at startup and dynamically generate MCP tools for every command the application supports.
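The discovery side is nicely scriptable too. macOS ships an sdef command-line tool that dumps any scriptable application's dictionary as XML, so a rough sketch of that "parse the dictionary at startup" step might look like this — the application path is a placeholder:

```python
# Sketch: dump an app's AppleScript dictionary (sdef ships with macOS) and list
# the commands it declares — each one is a candidate MCP tool.
import subprocess
import xml.etree.ElementTree as ET

sdef_xml = subprocess.run(
    ["sdef", "/Applications/SomeCreativeApp.app"],   # hypothetical app
    capture_output=True, text=True, check=True,
).stdout

root = ET.fromstring(sdef_xml)
for command in root.iter("command"):
    params = [p.get("name") for p in command.findall("parameter")]
    print(command.get("name"), params)
```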
That's the dream, right? A universal adapter that probes the target application, discovers what automation surfaces are available, and exposes them as MCP tools automatically. No manual wrapping required.
We'll get to that architecture in a bit. But first, let's talk about the third technique, which is the bridge between accessibility and full IPC: automation frameworks that simulate user input at the event level. On Windows, this is AutoIt and AutoHotkey. On Linux, it's xdotool and similar tools. On macOS, it's the cliclick utility or the CGEvent API.
These sit in a weird middle ground. They're more reliable than Vision because they target specific UI elements by class or title rather than by pixel position. But they're less reliable than COM or AppleScript because they're still simulating user actions rather than calling internal methods. If the application's window layout changes, an AutoHotkey script that clicks at coordinates will break, but one that finds a button by its class name and sends a click event to it will survive.
The key distinction is between coordinate-based automation and element-based automation. Coordinate-based is just Vision without the AI — you hardcode where to click, and it breaks constantly. Element-based uses the same accessibility tree we talked about earlier to identify the target, then sends a synthetic input event to that element. It's more robust, but it's still limited to actions that can be expressed as user inputs — click, type, drag, scroll.
Which brings us to the practical question. If you're Daniel, sitting down to build a local MCP server that needs to control some proprietary software, what's your decision tree? What order do you try things in?
First, check for a native API or CLI. Blender has Python. DaVinci Resolve has a Lua socket interface. Many professional tools have something, even if it's not well advertised. If that exists, use it directly. It'll be the fastest, most reliable, and most maintainable option.
Second, check for COM or AppleScript. On Windows, search the registry for COM classes related to your application. On macOS, open Script Editor, go to File, Open Dictionary, and see if your application is listed. If it is, you've got a structured automation interface that's likely been stable for years.
Third, check the accessibility tree. Use inspect.exe on Windows or Accessibility Inspector on macOS. Browse the running application's UI hierarchy. If you can find the elements you need and the actions they support, you can build a reliable automation layer even without vendor cooperation.
Fourth, fall back to element-based automation frameworks — AutoHotkey on Windows, AppleScript UI scripting on macOS, xdotool on Linux. These are less elegant but still more reliable than pure Vision.
Only then, fifth, consider Vision. Screenshot plus OCR plus coordinate calculation. It's the last resort, not the first choice. And even then, you can make it smarter by combining it with accessibility data — use the accessibility tree to narrow down where on screen a button is, then use Vision to verify and click. Hybrid approaches often outperform either technique alone.
Let's talk about a real case study that illustrates the power of this approach: DaVinci Resolve. It's a professional color grading and video editing application. It doesn't expose a traditional CLI. But it does have a scripting interface — you can send Lua commands to it over a socket connection on localhost.
This is the kind of thing that most users never discover. Buried in the DaVinci Resolve documentation is a section on "Fusion Scripting" and "Resolve Scripting." You enable it in the preferences, and suddenly the application is listening on a port. An MCP client can open a TCP socket to localhost on that port, send a Lua script like "resolve, colon, GetProjectManager, parentheses, colon, GetCurrentProject, parentheses, colon, GetCurrentTimeline, parentheses, colon, AddTrack, parentheses, quote, video, quote, parentheses," and it executes natively.
This pattern — socket-based scripting — is more common than you'd think in professional creative software. It's a way for vendors to provide automation without committing to a public API that they have to maintain and document. The socket interface is there for internal tooling and power users, and it's stable enough to build on.
For Daniel's local MCP architecture, the approach is: build an MCP server that acts as an automation router. It checks what's available — COM, AppleScript, accessibility, socket interfaces — and exposes a unified set of tools to the agent. The agent doesn't need to know which backend is being used. It just calls "apply underscore color underscore grade" or "export underscore timeline," and the server figures out the best way to execute that command.
The server should log which backend it used, so the user can debug when things go wrong. "Applied filter using COM interface, latency two milliseconds." "Clicked export button using UI Automation, latency fifteen milliseconds." "Fell back to Vision for color picker interaction, latency three hundred milliseconds." That transparency is crucial for building trust in the automation.
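That logging doesn't have to be fancy. A sketch of the shape I have in mind — the helper name is mine, not a standard:

```python
# Sketch: record which backend served each tool call and how long it took.
import logging
import time

log = logging.getLogger("mcp.automation")

def run_with_backend(tool_name: str, backend: str, fn):
    start = time.perf_counter()
    result = fn()
    elapsed_ms = (time.perf_counter() - start) * 1000
    log.info("%s via %s, latency %.0f ms", tool_name, backend, elapsed_ms)
    return result
```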
Now, there's a security dimension here that we can't ignore. Granting an MCP agent access to COM, AppleScript, or the accessibility tree is a significant privilege escalation. You're giving an AI the ability to control applications on your machine, potentially including file operations, network access, and system settings.
This is where the local MCP architecture has an advantage over cloud-based agents. When the MCP server is running on your machine, you can sandbox it. You can restrict which applications it can control, which methods it can call, which files it can access. You can require explicit user confirmation for destructive operations. The trust boundary is local, and you control it.
This connects back to something Daniel's been thinking about with networked MCP. When you aggregate MCP tools at the network level, you're centralizing access — every device on the network can reach the same tools. But for local control of proprietary software, you want the opposite. You want the MCP server to be as close to the application as possible, ideally on the same machine, with minimal network surface area.
The two architectures complement each other. Network-level MCP aggregation handles cloud services, shared resources, things that make sense to centralize. Local MCP servers handle the machine-specific stuff — controlling Photoshop on your workstation, automating your DAW, scripting your CAD tool. The agent can use both, routing different tool calls to different MCP servers depending on where the capability lives.
That routing decision — local versus networked — is something the agent can make automatically based on tool metadata. If a tool is tagged as "requires local execution" or "targets application X on host Y," the MCP client knows to route it to the appropriate server. Daniel's network aggregation work already handles the remote case. The local case just adds another server to the pool.
Which brings us back to the core insight here. The gap between Linux's programmatic culture and proprietary software's opaque interfaces isn't unbridgeable. It's just poorly documented, and the tools are scattered across different operating systems and frameworks. But once you know to look for accessibility APIs, COM interfaces, AppleScript dictionaries, and socket-based scripting, you realize that most professional software actually does expose control surfaces. They're just not called "APIs" in the marketing materials.
That's the misconception we should tackle head-on. The assumption that "no CLI equals no programmatic control" is just wrong. It's a Linux-centric view of the world that doesn't account for the automation infrastructure that's been built into Windows and macOS for decades. These operating systems were designed with automation in mind — not for AI agents, originally, but for accessibility, for enterprise management, for power user scripting. The infrastructure is there. You just have to know where to look.
Once you find it, you can build MCP tools that are fast, reliable, and maintainable. That's the practical takeaway for Daniel and for anyone building local MCP servers. Don't reach for Vision first. Reach for the accessibility tree. Reach for COM. Reach for AppleScript. The control surfaces are there, hiding in plain sight.
That's the landscape. Accessibility APIs give you structured UI trees. COM and AppleScript give you direct method invocation. Socket-based scripting gives you a bridge to internal automation engines. And automation frameworks like AutoHotkey give you a middle ground when nothing else works. With these four techniques in your toolkit, the number of applications that truly require Vision as a fallback is surprisingly small.
We haven't even talked about some of the more exotic approaches — DLL injection, function hooking, memory inspection. Those are fragile and often violate software licenses, but they exist in the extreme case. For most practical purposes, the four techniques we've outlined cover the vast majority of professional software.
Daniel's question gets at something bigger, though. As local MCP adoption grows, the demand for automation surfaces is going to increase. Right now, vendors like Adobe and Autodesk expose COM interfaces almost as an afterthought, a legacy of the nineties enterprise automation era. But if AI agents become a significant user base, there's going to be pressure to modernize these interfaces, document them properly, and make them first-class features.
I think we're already seeing early signs of this. The fact that Blender's Python API is so comprehensive isn't an accident — it's a reflection of Blender's community, which includes a lot of technical artists who want to script their workflows. As more creative professionals start using AI agents, the demand for scriptable interfaces in tools like Photoshop and Premiere is going to follow the same trajectory.
The MCP protocol itself could become a standard that vendors target directly. Instead of exposing a COM interface and hoping developers figure out how to wrap it, a vendor could expose an MCP endpoint natively. The agent connects, discovers the available tools, and starts controlling the application — no middleware required.
That's the optimistic scenario. The pessimistic one is that vendors see agent automation as a threat — something that reduces reliance on their UI, makes it easier to switch tools, undermines their platform lock-in — and they actively resist exposing control surfaces.
History suggests both will happen. Some vendors will embrace it, seeing automation as a feature that attracts power users. Others will fight it, either by not exposing interfaces or by actively breaking third-party automation. The MCP community will adapt either way — that's what the fallback chain we described is for.
That fallback chain is worth making explicit, because it's going to be the practical architecture for a lot of local MCP servers. Tier one: native API or CLI. Tier two: COM, AppleScript, or socket-based scripting. Tier three: accessibility tree navigation. Tier four: element-based automation frameworks. Tier five: Vision. Each tier is slower and more brittle than the one above it, but together they cover essentially everything.
The MCP server can probe each tier at startup, determine what's available for the target application, and build its tool list accordingly. If Photoshop is available via COM, great — the tools use COM. If not, the server checks whether UI Automation can reach the relevant controls. If not, it falls back to AutoHotkey. If even that fails, it fires up the Vision model. The agent calling the tools doesn't need to know or care which path was taken.
That probing process itself is automatable. You can write a discovery script that checks the registry for COM classes, queries the accessibility tree for known element patterns, tests whether a socket connection is accepted on known ports. The results feed into a configuration that the MCP server uses to decide which backend to use for which operation.
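A discovery pass like that really is small. Here's a sketch of the Windows side — checking for registered COM ProgIDs and a listening localhost socket; the specific ProgIDs and port number are illustrative, not a definitive list:

```python
# Sketch of a discovery probe: which control surfaces does this machine offer?
import socket
import winreg

def com_progid_registered(progid: str) -> bool:
    try:
        winreg.CloseKey(winreg.OpenKey(winreg.HKEY_CLASSES_ROOT, progid))
        return True
    except OSError:
        return False

def localhost_port_open(port: int) -> bool:
    with socket.socket() as s:
        s.settimeout(0.5)
        return s.connect_ex(("127.0.0.1", port)) == 0

print("Photoshop COM:", com_progid_registered("Photoshop.Application"))
print("Excel COM:", com_progid_registered("Excel.Application"))
print("Resolve socket:", localhost_port_open(51000))   # port is an assumption
```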
This is where the "universal adapter" idea starts to look less like science fiction and more like a weekend project. The building blocks are all there. The challenge is packaging them into something that works out of the box for common applications.
That's really the frontier Daniel's pointing at. Not "can we control proprietary software from an agent," because the answer is clearly yes, with enough effort. The real question is "can we make it easy?" Can we build MCP servers that auto-detect available control surfaces and expose them as tools without requiring the user to understand COM, AppleScript, or accessibility APIs?
I think the answer is yes, but it's going to require a community effort. Someone needs to build the Photoshop MCP server, the Excel MCP server, the DaVinci Resolve MCP server, and share them. Over time, patterns will emerge, and generic adapters will get better at handling new applications without custom code.
That's exactly the kind of open-source, community-driven development that Daniel's been involved in. He's already thinking about networked MCP aggregation. Local MCP automation is the natural next step.
To answer Daniel's question directly: no, we are absolutely not relegated to Vision. The toolbox is deep. Accessibility APIs, COM, AppleScript, socket scripting, automation frameworks — these are mature technologies that have been hiding in plain sight. The challenge isn't technical feasibility. It's discovery, documentation, and packaging. And that's a solvable problem.
The timing is right. Local agents are becoming practical. The MCP protocol gives us a standard way to expose tools. The operating system infrastructure has been there for years. All that's missing is the glue.
Where do we even start unpacking the technical details of how these mechanisms actually work at the protocol level?
Let's start with what "exposing a control surface" actually means at the operating system level. When a program runs, it's a process with memory, threads, windows, and handles. A control surface is any mechanism that lets an external process — our MCP server — inspect or manipulate those resources without going through the human interface layer.
The GUI is one surface, but it's designed for human eyes and mouse clicks. The CLI is another, designed for text in, text out. What Daniel's asking about is the third category — the programmatic surfaces that sit between "human with a keyboard" and "raw process memory."
And this is where the distinction between networked MCP and local MCP becomes crucial. With networked MCP, you're aggregating tools that already speak over HTTP or WebSocket. The control surface is the network endpoint itself. But local software — Photoshop, Excel, Final Cut Pro — doesn't listen on a port waiting for commands. It's a process on your machine, and you need to reach into it through the operating system's own interprocess communication channels.
Which is why Vision feels like the obvious fallback. If the program only exposes pixels on a screen, you screenshot it, read the pixels, and simulate clicks. It's universal, yes. But Herman, you mentioned earlier that Vision adds two hundred to five hundred milliseconds per action. Why is that latency so painful in practice?
Because it compounds. A single operation — say, applying a filter in Photoshop — might require the agent to open a menu, click a submenu, wait for a dialog, click a dropdown, select an option, and click OK. That's six or seven Vision cycles. At three hundred milliseconds each, you're looking at two full seconds just for the interaction layer, before Photoshop even starts processing the filter. Compare that to COM, where the same operation is a single method call that returns in tens of milliseconds.
Vision isn't just slow — it's fragile. If Adobe moves the filter menu by three pixels in an update, your screenshot-based coordinate clicking breaks. If they change the label from "Apply Filter" to "Apply Effect," your OCR-based targeting breaks. You're building automation on a foundation of sand.
Whereas the accessibility tree gives you a structured representation of the UI that's stable across minor updates. A button has a name, a role, a position, and a unique identifier that doesn't change just because the designer moved it. You're not guessing where the button is — you're asking the operating system for the button object and telling it to invoke the default action.
The hierarchy is clear. Native APIs and CLIs are the gold standard — direct, fast, stable. Accessibility APIs and COM are the silver standard — slightly more overhead, but still structured and reliable. Input-simulating automation frameworks — AutoHotkey, xdotool, AppleScript UI scripting — are bronze. Vision is the emergency backup you use when the building has no other doors.
The crucial insight for anyone building local MCP servers is that most professional software on Windows and macOS actually has a silver-standard door. It's just not labeled "API" on the outside. You have to know to look for it. On Windows, you run inspect.exe and point it at the application — suddenly you see the entire widget tree, every button and text field, with their automation IDs and control patterns. On macOS, you open Accessibility Inspector and get the same thing.
That's the discovery problem in a nutshell. The infrastructure exists, has existed for decades in some cases, but it was built for screen readers and enterprise management tools, not for AI agents. Nobody marketed it to developers as an automation surface. So it's been hiding in plain sight.
To make that concrete, take Windows UI Automation. When you launch inspect.exe and hover over a button in, say, Excel, you're not just seeing a pixel rectangle. You're seeing an object in a tree — the desktop is the root, Excel is a child window, the ribbon toolbar is a child of that, and the "Bold" button is a leaf node with properties. It has a name, a control type, an automation ID, and it supports the Invoke pattern.
The Invoke pattern is what lets you click it programmatically. But the tree also exposes the Value pattern for text fields, the Selection pattern for dropdowns, the ExpandCollapse pattern for tree views. Each pattern is a contract — if a control supports it, you know exactly which methods are available.
So an MCP server wrapping UI Automation doesn't need to know anything about Excel specifically. It can expose generic tools — get_element_by_name, invoke_element, set_text_value, get_selection — that work across any application exposing the standard patterns. The tool schema maps almost one-to-one with the pattern methods.
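Walking the tree and handing back structured element data is only a few lines with pywinauto — a sketch, with the window title as a placeholder:

```python
# Sketch: a "describe_ui" style dump of a running application's UIA tree.
from pywinauto import Desktop

win = Desktop(backend="uia").window(title_re=".*Excel.*")
for elem in win.descendants():
    info = elem.element_info
    print(info.control_type, repr(info.name), info.automation_id, info.rectangle)
```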
Which is elegant, but here's the catch — and this is where the failure modes get interesting. That automation ID I mentioned? It's supposed to be stable across versions, but developers often don't assign one, or they change it between releases. If your MCP tool is targeting "button_export_pdf" and the next version renames it to "btn_export", your automation breaks silently.
This is the selector fragility problem, and it's the same issue that plagues web automation with Selenium or Playwright. You're coupling your tool definitions to implementation details of the target application's UI hierarchy. The fix is the same as in web testing — fallback selector strategies. First, try the automation ID. If that fails, fall back to name and control type. If that fails, fall back to a position-based heuristic. The MCP server can log which strategy succeeded so the user knows when their selectors are getting stale.
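In code, that cascade might look something like this sketch — the identifiers are hypothetical, and the strategy labels are just for the log:

```python
# Sketch: try selectors from most to least stable, and report which one matched.
from pywinauto import Desktop
from pywinauto.findwindows import ElementNotFoundError

def find_element(win, automation_id=None, name=None, control_type=None):
    strategies = []
    if automation_id:
        strategies.append(("automation_id", {"auto_id": automation_id}))
    if name and control_type:
        strategies.append(("name+control_type",
                           {"title": name, "control_type": control_type}))
    for label, criteria in strategies:
        try:
            return win.child_window(**criteria).wrapper_object(), label
        except ElementNotFoundError:
            continue
    raise LookupError("no selector strategy matched")

win = Desktop(backend="uia").window(title_re=".*Target App.*")
element, strategy = find_element(win, automation_id="button_export_pdf",
                                 name="Export PDF", control_type="Button")
print("matched via", strategy)   # tells you when selectors are getting stale
```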
This is where COM has a genuine advantage over accessibility APIs. COM automation doesn't go through the UI layer at all. When you call Photoshop dot Application dot Open, you're not simulating a click on the File menu — you're calling a method on an object that represents the application itself. The UI could be completely redesigned and the COM interface would still work as long as Adobe maintains backward compatibility.
Let's walk through the Photoshop COM example properly. You create a COM object by its ProgID — that's "Photoshop.Application" — and you get back a reference to the running instance, or it launches one if needed. From there, you call methods on the Application object: Open a document, get back a Document object, call its methods to manipulate layers, apply filters, export. Every operation is a synchronous method call with a return value. The MCP server wraps each method as a tool — open_document, apply_filter, export_as_png — and the agent calls them like any other API.
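Wrapped as MCP tools, that walkthrough comes out to something like this sketch — assuming the MCP Python SDK's FastMCP helper and pywin32; the Photoshop method names follow Adobe's scripting documentation, but treat them as approximate and verify against your installed version:

```python
# Sketch: MCP tools over Photoshop's COM automation interface.
import win32com.client
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("photoshop-bridge")

def _photoshop():
    # Connects to a running instance, or launches one.
    return win32com.client.Dispatch("Photoshop.Application")

@mcp.tool()
def photoshop_open_document(path: str) -> str:
    """Open a file in Photoshop and return the document name."""
    doc = _photoshop().Open(path)
    return doc.Name

@mcp.tool()
def photoshop_gaussian_blur(radius: float) -> str:
    """Apply a Gaussian blur to the active layer of the active document."""
    _photoshop().ActiveDocument.ActiveLayer.ApplyGaussianBlur(radius)
    return f"applied gaussian blur, radius {radius}"

if __name__ == "__main__":
    mcp.run()
```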
The latency difference is stark. A COM method call to open a document takes maybe fifty milliseconds of overhead on top of whatever time Photoshop needs to actually load the file. The equivalent operation through Vision would be: screenshot, OCR the File menu, click, screenshot again, OCR the Open dialog, type the path, click Open. You're looking at two seconds of automation overhead versus fifty milliseconds. That's a forty-to-one ratio.
The tradeoff, of course, is that COM is Windows-only and application-specific. Each program exposes its own object model with its own quirks. Excel's COM interface is famously massive and well-documented. Photoshop's is smaller and less documented, but still powerful. AutoCAD's is extensive but uses ActiveX conventions that feel dated. The MCP server developer has to learn each application's object model and map it to tools.
Which brings us to the third technique — automation bridges — and this is where the platform differences really show. On macOS, AppleScript is deeply integrated into the operating system. Applications that support it expose a scripting dictionary that defines nouns and verbs — documents, windows, paragraphs are nouns; open, close, save, print are verbs. You can literally ask an application "tell me what you can do" and it responds with its dictionary.
The AppleScript dictionary maps beautifully to an MCP tool schema. The nouns become resource types, the verbs become tool functions. A Final Cut Pro MCP server reads the dictionary at startup, discovers that the application supports "export" on "project" objects with parameters for codec and destination, and exposes an export_project tool. The agent never touches the UI. It sends AppleScript commands through the osascript interface, and Final Cut Pro executes them directly.
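Sending those commands from an MCP server is just a subprocess call — a sketch here, where the application name and command are illustrative; the real nouns and verbs come from whatever the dictionary actually declares:

```python
# Sketch: run an AppleScript command against a scriptable app via osascript.
import subprocess

def run_applescript(script: str) -> str:
    result = subprocess.run(["osascript", "-e", script],
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

# Hypothetical app and command, for illustration only.
print(run_applescript('tell application "SomeEditor" to get name of front document'))
```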
I want to pause on that Final Cut Pro example because it illustrates something important about the accessibility approach versus the scripting approach. With accessibility, you could read the current selection in the timeline by traversing the accessibility tree to find the timeline widget and reading its selected children. That works, but it's fragile — if Apple redesigns the timeline widget, your tree traversal might break. With AppleScript, you just ask Final Cut Pro for "selection of front project" and it tells you. The scripting interface is a contract; the accessibility tree is an implementation detail.
That's the key architectural insight. Whenever possible, you want to bind your MCP tools to contracts, not implementations. COM interfaces are contracts. AppleScript dictionaries are contracts. Socket-based scripting APIs like DaVinci Resolve's are contracts. Accessibility trees are implementation details. Automation frameworks like AutoHotkey that simulate keystrokes and mouse movements are implementation details squared.
Here's the reality — sometimes the contract doesn't exist, or it's so incomplete that you can't do what you need. That's when you reach for the implementation-level techniques. And understanding the failure modes at each level is what separates a reliable MCP server from one that works in demos and breaks in production.
The most common failure mode with accessibility-based automation is what I'd call the "missing pattern" problem. A button looks clickable to a human, but the developer didn't implement the Invoke pattern on that control. The accessibility tree shows the button exists, but you can't actually click it through the API. Your MCP tool returns an error, and the agent has to fall back to a lower tier — maybe simulating a mouse click at the button's bounding rectangle coordinates.
Which is functionally equivalent to Vision at that point, just with better targeting. You know exactly where the button is because the accessibility tree gave you its coordinates, but you're still clicking pixels. And if the button moves between versions, your coordinate is stale, but at least you can detect that by checking whether the element at those coordinates still has the expected name and control type.
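Even that degraded path can be made defensive. A sketch with pywinauto — verify the element before clicking its center, so a stale coordinate fails loudly instead of clicking the wrong thing:

```python
# Sketch: the element exposes no Invoke pattern, so click its center coordinates,
# but check name and control type first to catch stale targets.
from pywinauto import Desktop, mouse

win = Desktop(backend="uia").window(title_re=".*Target App.*")   # hypothetical
elem = win.child_window(title="Export", control_type="Button").wrapper_object()

info = elem.element_info
assert info.control_type == "Button" and info.name == "Export", "stale target"

center = elem.rectangle().mid_point()
mouse.click(coords=(center.x, center.y))
```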
With COM, the failure mode is usually version skew. The COM interface for Photoshop twenty twenty-four might have methods that don't exist in Photoshop twenty twenty-three, or the behavior of a method might change subtly. Your MCP server calls "apply_filter" with the same parameters, but the filter algorithm changed between versions and produces a different result. The call succeeds, but the output is wrong. That's harder to detect than a hard failure.
With automation bridges like AppleScript, the failure mode is often that the application's scripting dictionary is incomplete or buggy. The dictionary says a command exists, but calling it throws an error because the developer never fully implemented it. Or the command works but the documentation is wrong about the parameters. You discover these things at runtime, and your MCP server needs robust error handling and logging to surface them to the user.
Security is the other dimension we haven't touched on enough. When you grant an MCP server access to UI Automation or COM, you're giving it the ability to control applications with the same privileges as the logged-in user. That agent can now open files, send emails, modify documents, delete data. The attack surface is enormous. If the agent is compromised or makes a bad decision, the blast radius isn't contained to a sandboxed API call — it's the entire user session.
Which is why I think the security model for local MCP is going to have to be much more conservative than for networked MCP. With networked tools, you can scope permissions to specific API endpoints. With local automation, the agent potentially has access to everything the user can do. The MCP server needs its own permission layer — maybe prompting the user for confirmation before executing high-risk operations, or maintaining an allowlist of applications and actions that the agent is authorized to control.
That permission model gets complicated fast. Does the agent need confirmation to save a file but not to read one? To apply a filter but not to change a color? The granularity of permissions needs to match the granularity of the tool definitions, and that means the MCP server developer has to think carefully about how they partition functionality into tools.
Right, and that security layer is one piece of the universal adapter architecture. But the other piece, the one that makes the whole thing practical, is the fallback chain. You don't want your MCP server to just fail when the best technique isn't available. You want it to degrade gracefully.
Exactly, and this is where the adapter gets smart. When it initializes, it probes the system. On Windows, it checks if the target application registers a COM ProgID. If yes, it loads the COM type library, inspects the available methods, and generates tools from those. If no COM surface exists, it checks whether the application exposes a UI Automation tree with meaningful control patterns. If yes, it generates tools from the tree. On macOS, it checks for an AppleScript dictionary first, then falls back to the Accessibility API.
The probing itself tells you something useful. If a professional tool like DaVinci Resolve doesn't expose COM or AppleScript but does have a localhost socket listening on a known port, that's your signal that there's a scripting interface — just not one that follows the platform's standard automation conventions.
DaVinci Resolve is actually a perfect case study for this. It doesn't have a CLI in the traditional sense, and it doesn't register as a COM server on Windows. But if you go into the preferences, there's a scripting section where you enable external scripting, and Resolve opens a socket — usually on localhost port fifty-one thousand something — that accepts Lua commands. The MCP server can spawn a subprocess that opens that socket, sends a Lua script to add a color grade node or render a timeline, and reads back the result. It's a contract, just a custom one.
That's the point Daniel was driving at, I think. The surface is there, but it's not advertised as a CLI. It's buried in a preferences panel under "scripting." Most users never see it. But once you know to look for it, you can build an MCP tool layer on top of it that makes it feel like a native API.
The Lua socket approach is actually cleaner than COM in some ways because it's text in, text out. Your MCP tool definition for "add_color_grade_node" constructs a Lua script string with the parameters the agent provided, sends it over the socket, and parses the response. No binary marshalling, no type library registration. But the downside is that error handling is ad hoc — Resolve might return a Lua error string, or it might just return nothing, and your MCP server has to handle both cases.
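Taking the episode's description at face value, the plumbing is a plain TCP exchange. Here's a sketch where the port number and the newline-terminated wire format are assumptions on my part, not a documented protocol — and note that Resolve also ships an official scripting module, DaVinciResolveScript, which is usually the safer entry point:

```python
# Sketch: send a Lua snippet to a localhost scripting socket and read the reply.
# Port and wire format are assumptions based on the discussion above.
import socket

RESOLVE_PORT = 51000  # hypothetical

def send_lua(script: str) -> str:
    with socket.create_connection(("127.0.0.1", RESOLVE_PORT), timeout=2.0) as s:
        s.sendall(script.encode("utf-8") + b"\n")
        return s.recv(65536).decode("utf-8", errors="replace")

reply = send_lua('return resolve:GetProjectManager():GetCurrentProject():GetName()')
print(reply or "<empty reply>")
```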
Now let's compare this to what happens if Resolve didn't have that socket interface and you had to fall all the way back to Vision. The agent says "add a color grade node to the current clip." Vision takes a screenshot, runs OCR to find the "Color" workspace button, clicks it, waits for the UI to update, takes another screenshot, finds the node editor area, right-clicks to open the context menu, OCRs the menu for "Add Node," finds "Color Grade," clicks it. That's five or six discrete Vision operations, each with two hundred to five hundred milliseconds of model inference time, plus the UI transition delays. You're looking at three to five seconds for something that the socket approach does in maybe eighty milliseconds.
That's the best case for Vision — where every OCR read is correct and every UI element is where the model expects it. The real-world accuracy of Vision-based control on complex interfaces like video editing software is maybe seventy to eighty percent per operation. Compound that across six operations and the end-to-end success rate drops to somewhere between roughly ten and twenty-five percent. The agent has to implement retry logic, error recovery, and probably still gets stuck on edge cases like custom UI themes or high-DPI displays.
The fallback chain is really a hierarchy of reliability. At the top, you've got direct object model access — COM, socket scripting, well-implemented AppleScript. These are synchronous, deterministic, and fast. One tier down, you've got accessibility APIs — still fast, still deterministic, but fragile against UI changes. Then you've got automation bridges that simulate input — AppleScript when the dictionary is incomplete, AutoHotkey on Windows, xdotool on Linux. These are fast but coordinate-dependent. And at the bottom, Vision — slow, expensive, non-deterministic, but universal.
The universal adapter MCP server logs which tier it used for each operation. That log becomes incredibly valuable over time. If you see that eighty percent of your Photoshop operations are going through COM but twenty percent are falling back to UI Automation because certain filters aren't exposed in the COM interface, you know exactly where to focus your improvement efforts. Maybe you write a custom socket bridge for those specific operations, or you file a feature request with the vendor.
This is also where the middleware ecosystem starts to get interesting. Tools like PyAutoGUI and SikuliX have been around for years as desktop automation frameworks, but they were built for test automation and RPA, not for agent-driven control. The opportunity now is to wrap them as MCP tool providers — give them a standard tool schema, handle the authentication and permission model, and let agents discover and use them like any other API.
SikuliX is particularly interesting here because it's a hybrid approach. It uses image recognition to find UI elements on screen — so it's Vision-adjacent — but it operates on screenshot fragments rather than full-screen OCR. You give it a reference image of a button, and it finds that exact pixel pattern on screen and clicks it. It's faster than Vision because it's doing pattern matching, not semantic understanding, but it inherits the same fragility — change the button's color or size and the match fails.
Playwright is another one worth watching. It's browser automation at heart, but it already has experimental support for driving Electron apps, and its selector-based locator model is exactly the kind of element-based targeting we've been describing. If desktop targets become reachable through that style of locator API, you can wrap them in an MCP server and get tool definitions that work for both web and desktop.
The knock-on effect here is what I find most compelling. Right now, software vendors have an incentive to keep their control surfaces limited because it drives lock-in. If you can't script Photoshop, you're more likely to stay in the Adobe ecosystem for your entire workflow. But as agents become a standard part of how people work, the vendors that expose rich automation surfaces will have an advantage. Users will prefer tools their agents can control.
We might see a new category emerge — what I'd call automation-first software. Applications designed from the ground up with the assumption that both humans and agents will interact with them. Every feature has a programmatic endpoint. The GUI is just one client of an underlying service layer. That's basically the Blender model, but applied to everything from video editing to CAD to music production.
The counterpressure is real, though. Exposing a rich automation surface also exposes a surface for competitors to build compatibility layers. If DaVinci Resolve's socket API is comprehensive enough, someone could build a tool that migrates Premiere projects to Resolve automatically. The vendor has to weigh the ecosystem growth from agent compatibility against the competitive risk.
That tension is going to define the next few years of the local MCP landscape. Some vendors will embrace it, some will resist it, and the universal adapter pattern is what bridges the gap in the meantime.
For someone listening who's thinking "okay, I want to actually do this," where do they start? What's the first practical step?
The very first thing — before writing a line of code — is to check whether your target software already has an automation surface you didn't know about. On Windows, you open a tool called inspect.exe, which ships with the Windows SDK. You hover over any UI element in your target application, and inspect.exe shows you the full UI Automation tree — control types, names, automation IDs, available patterns. If you see a rich tree with meaningful names and patterns like Invoke or Value, you've got a viable accessibility-based control surface right there.
On macOS, the equivalent is Accessibility Inspector, which comes with Xcode. Same idea — you point it at an application and it reveals the entire accessibility hierarchy. The thing that surprises people is how many professional tools expose a lot more through accessibility than their documentation suggests. Developers add accessibility support for screen readers and VoiceOver compliance, and that same infrastructure becomes your automation API.
And for COM, the discovery tool is different. On Windows, you can use PowerShell to query the COM registration database. A simple command — Get-ChildItem with the right registry path — will list every registered COM ProgID on the system. If you see Photoshop dot Application or Excel dot Application in that list, you know there's a COM automation interface available, even if Adobe or Microsoft doesn't market it as an automation API.
Step one is discovery. Step two is building a thin MCP server wrapper. And I want to emphasize thin here — you're not building a full automation framework. You're defining tools that map to specific operations: execute_menu_command, read_window_state, get_selected_text, set_field_value. The MCP server translates those tool calls into the appropriate backend calls, whether that's COM method invocation, UI Automation pattern execution, or AppleScript osascript commands.
The tool schema should be generic enough to work across applications. "click_button" takes a name or automation ID and a target application. "get_text_field" takes a field identifier and returns the current value. The MCP client — the agent — doesn't need to know whether the backend is COM or Accessibility or AppleScript. It just sees tools. That abstraction is the whole value of the MCP layer.
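A sketch of what that abstraction looks like from the server side — the backend functions here are empty placeholders standing in for the COM, UI Automation, and Vision paths we've discussed:

```python
# Sketch: generic tools whose backend is chosen by the server, not the agent.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("desktop-automation")

# Placeholder backends, ordered by the fallback chain; real implementations
# would wrap COM, UI Automation, and Vision respectively.
def _com_click(app, name):    raise NotImplementedError
def _uia_click(app, name):    raise NotImplementedError
def _vision_click(app, name): raise NotImplementedError

BACKENDS = [("com", _com_click), ("uia", _uia_click), ("vision", _vision_click)]

@mcp.tool()
def click_button(app: str, name: str) -> str:
    """Click a named button in the target app; the agent never sees which backend ran."""
    for label, fn in BACKENDS:
        try:
            fn(app, name)
            return f"clicked '{name}' in {app} via {label}"
        except NotImplementedError:
            continue
    raise RuntimeError(f"no available backend could click '{name}' in {app}")

if __name__ == "__main__":
    mcp.run()
```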
Here's the actionable bit for anyone who wants to try this today: pick one application you use heavily, run inspect.exe or Accessibility Inspector on it, find three operations you'd want an agent to perform, and build an MCP server that exposes just those three tools. You'll learn more about the real challenges — selector fragility, error handling, permission boundaries — in that one afternoon than from any amount of reading.
That hands-on learning is exactly what makes me wonder whether we'll even need these adapter layers in five years, or whether software vendors will start exposing MCP-native endpoints directly. Right now everything we've described is translation work — COM to MCP tool, Accessibility tree to MCP tool, AppleScript to MCP tool. It works, but it's an adapter pattern. The adapter exists because the native interface doesn't speak the protocol the agent expects.
Adapters have a way of becoming permanent infrastructure. Look at how long we've been wrapping COM objects in REST APIs, or building translation layers between database wire protocols. The adapter pattern sticks around because the incentives for the vendor to adopt the new protocol directly are usually weak, at least initially. Adobe has no reason to add an MCP endpoint to Photoshop when the COM interface already serves their automation needs — which are mostly enterprise batch processing, not agent-driven interaction.
The pressure might come from a different direction. If local agents become as common as web browsers, the software people actually prefer will be the software their agents can drive natively. That's not a technical argument, it's a market argument. And market arguments are the ones that move product roadmaps.
I think the real future is both. Third-party adapters will always be necessary for legacy software and for vendors who actively resist opening up. But we'll also see a new category emerge — call them automation adapters, MCP bridges, whatever — that become a standard part of the local agent stack. You install an MCP server for your creative suite the same way you install a plugin today. And some of those will be vendor-built, some community-built, some commercial.
That's actually a useful way to think about the next few years. If you're building tools in this space, you're not just building one-off scripts — you're building the adapter category itself. The patterns we've talked about today, the fallback chain architecture, the generic tool schemas, the discovery process using inspect.exe and Accessibility Inspector — those are the design patterns for a whole new class of software.
Which leaves us with an interesting open question. Will the MCP protocol itself evolve to include a standard automation surface description — something like an "automation manifest" that an application can expose, telling any MCP client exactly what tools are available and how to call them? Because if that happens, the adapter problem starts to solve itself from the other direction. The vendors don't need to adopt MCP — they just need to describe what they already expose in a way MCP clients can consume.
That's a great place to leave it, because it's unresolved. The pieces are all there — COM, Accessibility, AppleScript, socket scripting, automation bridges, Vision fallback. The question is who assembles them and how.
Today's episode was written by DeepSeek V four Pro, by the way. Which I suppose is fitting — an AI model scripting an episode about how AI models control software.
Thanks as always to our producer Hilbert Flumingtop for making this whole thing run.
This has been My Weird Prompts. If you enjoyed this episode, leave us a review wherever you get your podcasts — it helps people find the show.
Until next time.