AI Vision Changes Everything

Maps are for tourists. Locals use their eyes. So do my agents.

I’ve spent the better part of the last two years trusting a computer to hurl me down the interstate at 85 miles per hour. I drive a Tesla Model Y, and since early 2024, I’ve logged over 35,000 miles on Full Self-Driving (FSD). I use it 99% of the time.

When I first started, the skepticism was palpable. My wife, sitting in the passenger seat, was a coiled spring, ready to grab the wheel at the slightest twitch. But then came the updates. Then came the smooth lane changes. Recently, after a particularly aggressive merge that cut off a dawdling sedan, she muttered, “That was a nice move by MadMax,” and went back to her phone. (MadMax is a Tesla FSD mode for “drive with exuberance”)

That’s when I knew: Vision had won.

The argument against autonomous driving always hinged on the “Map Problem.” How can a car navigate a road it hasn’t seen? How does it handle a construction zone that popped up overnight, where the lines are painted over with chaotic orange tape and a guy named Steve is waving a flag?

The answer is embarrassingly simple: Roads are designed for people with eyes.

We don’t drive using LIDAR or pre-downloaded geometric schemas of the asphalt. We drive by looking. If a car can see and reason about what it sees, it doesn’t need a map. It just needs to understand the rules of the road.

Here is the pertinent truth that traditional developers are ignoring: Software is just another road. And for the longest time, we’ve been trying to navigate it with the equivalent of blindfolded GPS.

The API Trap: Driving Blind

In the world of software automation, APIs (Application Programming Interfaces) are the maps. They are rigid, pre-defined routes.

“Go to Endpoint A, fetch JSON Object B, update Field C.”

It’s efficient, clean, deterministic, and totally fragile. If the “road” changes—if a developer renames a button, changes a <div> to a <span>, or moves a text box—the API breaks. The map no longer matches the territory. Your code crashes. You spend your weekend debugging.
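To make that fragility concrete, here is a toy sketch in plain Python (no real Coda or browser code; all markup and ids are invented for illustration): a lookup keyed to an exact element id breaks the moment the markup changes, while a lookup keyed to what a human would see survives.

```python
# Toy illustration: a "map-based" integration hard-codes the exact
# markup it expects, so a cosmetic change breaks it.
import re

OLD_PAGE = '<div id="submit-btn">Save</div>'
NEW_PAGE = '<span id="save-now">Save</span>'   # dev renamed the id and tag

def find_by_id(html: str, element_id: str):
    """The brittle 'map': look up one exact id."""
    match = re.search(rf'id="{element_id}">([^<]+)<', html)
    return match.group(1) if match else None

def find_by_text(html: str, text: str):
    """The 'vision' fallback: look for what a human would see."""
    match = re.search(rf'>({text})<', html)
    return match.group(1) if match else None

print(find_by_id(OLD_PAGE, "submit-btn"))  # works on the old page
print(find_by_id(NEW_PAGE, "submit-btn"))  # None: the map broke
print(find_by_text(NEW_PAGE, "Save"))      # still found by appearance
```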

I recently hit this wall hard while building an automation for Antigravity, my chosen platform for agentic workflows and skills. As an experiment, I wanted to automate a simple temperature control workflow in a Coda document.

The Experiment: Headless vs. Head-On

My goal was simple: Use the Coda Model Context Protocol (MCP) to read a temperature, press a button to change it, and log the result. I wanted to keep it “headless”—pure code, no browser UI.

I failed immediately.

Coda MCP is very capable in its lane, and that lane is presently very narrow.

  1. The Button Problem: I could identify the “Set Temp” button ID (ctrl-RQdAjmDXwR), but the MCP had no hands. It could see the button existed in the schema, but it couldn’t push it.
  2. The Canvas Problem: I wanted to write the result onto the canvas. The API screamed schema validation errors. It turns out, directly modifying loose text controls that aren’t backed by a database table is like trying to drive a tank through a revolving door.

The API gave me read access to the structure but blocked me from the interaction. I was a ghost in the machine—able to float through walls but unable to turn a doorknob.

Turbo Mode: Giving the Agent Eyes

If the API is a map, Computer Vision is the driver.

I decided to stop fighting the API and treat the Coda document exactly like my Tesla treats a construction zone. I switched to an agentic “Vision” approach using browser subagents.

Instead of asking the Coda API “What is the value of Row 4?”, I told the agent: “Look at the screen. Find the box that says ‘Inside Temp’. Read it.”

The results were startling.

The Workflow Shift

  • Old Way (API/Map): Authenticate → Call Endpoint → Parse JSON → Handle Error → Retry.
  • New Way (Vision):
    1. Reuse Session: The agent sees the ‘PramaHub’ tab is open and attaches to it. (Hot Swap).
    2. Read: It visually scans the DOM for the value.
    3. Act: It locates the pixels of the “Set Temp” button and clicks it.
    4. Verify: It watches the screen for the visual update (the recalculation).
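The read/act/verify steps above can be sketched as a loop. This is a runnable toy, not the actual Antigravity subagent: a plain dict stands in for what the agent sees on screen, and the widget names and values merely mirror the Coda doc described above.

```python
# Toy version of the vision workflow: read a value, click a button,
# re-read to verify the change, and report a signed delta.
class FakeScreen:
    """Stand-in for the live Coda page; names here are illustrative only."""
    def __init__(self):
        self.widgets = {"Inside Temp": "72", "Set Temp": "button"}

    def read(self, label):
        return self.widgets[label]

    def click(self, label):
        if self.widgets.get(label) == "button":
            # Clicking "Set Temp" triggers a recalculation on the page.
            self.widgets["Inside Temp"] = "75"
            return True
        return False

def run_workflow(screen):
    old = int(screen.read("Inside Temp"))   # Read
    assert screen.click("Set Temp")         # Act
    new = int(screen.read("Inside Temp"))   # Verify via re-read
    return f"{new - old:+d}f"               # signed delta, e.g. '+3f'

print(run_workflow(FakeScreen()))  # '+3f'
```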

It was fast. It was context-aware. It felt like working with a human intern who was sharing my screen.

The Stress Test: Breaking the Map

Here is where “AI Vision Changes Everything.”

To prove this wasn’t just a fancy script, I tried to trick the Antigravity skill. I went into the Coda doc and renamed the button and field labels. In a traditional API integration, this is a death sentence. The code looks for btn_submit_v2, finds btn_save_now, and throws a 404 error.

But my Vision Agent? It paused. It reasoned. It looked at the context. It saw a button that was semantically identical to what it needed, even if the label had shifted.

It clicked the button.

The skill was developed solely from the original layout (the green box in the screenshot); it was never shown the renamed version (the red box). Just as my Tesla navigates a lane shift where the white lines are faded and confused, the Coda automation skill adapted to the UI change. It didn’t need the map to be perfect; it just needed to see the objective.

Why This Matters

We are entering an era where “Vibe-Coding” isn’t just a meme; it’s a competitive advantage. What I’ve described here is its automation counterpart: Vibe-Automation.

Ben Stancil often notes that AI blurs roles, making everyone a “good enough” coder. But I’d argue it goes deeper. We are moving from Deterministic Automation (if X, then Y) to Probabilistic Agency (Here is the goal, figure out X and Y).

  • APIs are brittle. They require maintenance, documentation, and permission.
  • Vision is universal. If a human can use the software, the agent can use the software.

This is the “Universal Adapter.” We don’t need to wait for Coda (or Salesforce, or Jira) to update their MCP or fix a webhook. If the pixels are on the screen, the data is ours. The action is ours.

Field Notes from the Edge

As a pioneer building these systems at Stream It and for other clients, I’ve seen the shift firsthand.

  • Cold Start vs. Hot Swap: My vision agent is robust enough to open the browser if it’s closed (Cold Start) or latch onto an existing session (secure Hot Swap).
  • Domination of the Canvas: Vision allows us to treat the entire screen as an application surface. We aren’t limited to the database rows; we can interact with the interface.
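The Cold Start / Hot Swap decision above is simple enough to sketch. This is a toy with invented session names (the real agent attaches to an actual browser): reuse a matching open tab when one exists, launch fresh only when it doesn’t.

```python
# Toy attach-or-launch logic: prefer Hot Swap, fall back to Cold Start.
def attach_or_launch(open_tabs: dict, url: str) -> tuple[str, str]:
    if url in open_tabs:
        return "hot-swap", open_tabs[url]   # reuse the existing session
    session = f"session-for-{url}"          # pretend to launch a browser
    open_tabs[url] = session
    return "cold-start", session

tabs = {"https://coda.io/d/PramaHub": "session-1"}
print(attach_or_launch(tabs, "https://coda.io/d/PramaHub"))  # hot swap
print(attach_or_launch(tabs, "https://example.com"))         # cold start
```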

This is the complete Antigravity skill:

Set Temp Workflow
This skill automates the user's "Set Temp" workflow in Coda.

Usage
Run this skill when the user runs the slash command /set-temp or asks to "run the set temp workflow".

Steps
  1. Context Check: Check if the Coda document PramaHub (PageID: 804443C032DE791B2E9058FB88F943DD, URL: https://coda.io/d/PramaHub_dz3Pj-NB9iX/_canvas-P3cqo7UIuK) is already open in the browser. If not, open it.
  2. Browser Automation: Launch a Browser Subagent with the following task: "Read the 'Inside Temp' value. Click the 'Set Temp' button. Wait 5 seconds. Read the new 'Inside Temp' value. Calculate the delta (New - Old). Type the signed delta (e.g., '+3f') into the 'Temp Change' text input and press Enter."
  3. Reporting: Report the Initial Temp, New Temp, and calculated Delta to the user.

The Future is Visual

There is a beautiful irony here. For decades, computer scientists have tried to turn the messy, visual world into clean, structured text so computers could understand it.

Now, we are doing the opposite. We are giving computers eyes so they can understand the messy world on their own terms.

My Tesla doesn’t need a map of every pothole in my town. It magically steers around them, even the new one created last night. And my Antigravity agent doesn’t need perfect API documentation of every button in my Coda doc.

They just need to see.

Stop building fragile integrations. Give your agents eyes.

5 Likes

An impressive example, @Bill_French. I’m trying to imagine the impact if this approach can be used robustly and reliably in the future. It opens up countless application possibilities, and perhaps smaller companies in particular could benefit greatly. I hope I’ll soon have the opportunity to gain access to the closed MCP beta program; unfortunately, I haven’t yet. Perhaps the time is approaching when MCP will be available to everyone. Somehow, this all sounds like an exciting future.

1 Like

Indeed. They need to accelerate access. It works. It’s secure.

Bear in mind that this example with vision doesn’t require Coda MCP. In fact, that’s the entire point of this vision excursion - to show that when you give an agentic platform access to a browser, it can proxy as a vision-based API that has none of the typical limits of an actual code-based API. Coda MCP is precisely that - a code-backed service that provides the definitive, deterministic pathway to create and update Coda documents, subject to all of its limitations (and there are several).

As to reliability, much of that depends on the reasoning and agentic agility your chosen platform/models provide. Claude Code and Antigravity seem to be extremely competent. I suspect Cursor and Codex are as well.

While I agree that this approach has many benefits, it’s not a silver bullet and doesn’t remove the need for proper APIs for automation.

Determinism/reliability
Determinism isn’t a solved problem yet for LLMs, and I’ve used enough AI automation to know that even for a simple task such as “Type the signed delta (e.g., ‘+3f’)”, it will spit out a stray “Here is the code that types the signed delta…” 1 out of X times (that’s an obvious error and relatively easy to catch, but malformed values also happen, and those are not as easy to catch).
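For what it’s worth, one cheap guard against exactly this failure mode is to validate the agent’s typed value against the expected shape before accepting it. A sketch, assuming the '+3f' delta format from the article:

```python
# Reject anything that doesn't look like a signed temperature delta.
import re

DELTA_PATTERN = re.compile(r"^[+-]\d+f$")   # expected shape, e.g. '+3f'

def validate_delta(text: str) -> bool:
    return bool(DELTA_PATTERN.fullmatch(text.strip()))

print(validate_delta("+3f"))                              # True
print(validate_delta("-12f"))                             # True
print(validate_delta("Here is the code that types..."))   # False
```

This catches the obvious chatty failures; subtly malformed but well-shaped values still require a semantic check.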

Depending on the importance and amount of the data, having even a small % of corrupted data could be unacceptable.

Scale
Just because I have a car that can drive itself doesn’t mean it’s the fastest or most efficient way to move 50 people from point A to point B (although there are certainly cases where this would be a useful option nonetheless). A better solution would be a bus.

In the same way, this approach would be slower and more error-prone for handling large amounts of data. Paging, loading times, UI layout, missing data, different formats, etc. can all cause issues.

There are other problems as well, but I don’t want to fall into this rabbit hole :sweat_smile: In short, I think your solution is great for situations where there is no proper API, or for Ad Hoc or non critical tasks, but for scale and reliability a proper API is the way to go.

2 Likes

I agree. When they exist, they should be used.

I agree. And it may never be; LLMs are nondeterministic by design.

Agents, on the other hand, live by a constitution and increasingly, those constitutions are designed to overcome their lack of determinism.

If you examine the traces and reasoning of LLMs, you see lots of assumptions that are obvious missteps. Agentic traces, depending on the agent’s programming, are very different. Here’s an actual example:

Last night I asked an agent to recommend an approach in this community question. While examining the agent’s reasoning and solution-validation steps, I saw this error.

Note: The agent often conflates Coda’s MCP with the API. This work is being performed with Coda MCP because it is a superset of the Coda API.

As you can see, there are two important distinctions we can make based on this examination:

  1. The agent recognized it had made an error.
  2. It corrected the error.

This agent, unlike general conversational LLMs running inside ChatGPT, Gemini, and Claude, operates within a set of constitutional laws. The reasoning traces call out its exact understanding of the requirements to verify, test with actual documents and components, and directly evaluate the tests against outputs and documentation. Many of these rigorous guidelines live in mcpOS, a framework that extends Coda’s MCP in ways that nudge it closer to the determinism we seek.

No LLM or agent will ever be 100% perfect. But, increasingly, they are performing in ways that are more perfect than people. I think they’re able to do this because we’re giving them more guidance in frameworks that perform deep assessments of their own work - processes that humans mostly skip, whether because of constraints or because they’re too lazy to check their work from different perspectives and scenarios.

Another significant reason these tools are different is the fundamental agentic loop. LLMs don’t operate in a loop based on goals, a plan, and testable outcomes. Some models are being built with loops inside, but these are currently limited to the expensive variety, and, as you’d expect, their outputs show far fewer hallucinations.
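That goal/plan/verify loop can be sketched in a few lines. This is a toy, not any particular agent framework: flaky_action stands in for any nondeterministic step, and the loop retries until the outcome matches the goal or the attempt budget runs out.

```python
# Minimal goal-directed loop: act, verify against the goal, retry on failure.
def agentic_loop(goal_value, flaky_action, max_attempts=3):
    for attempt in range(1, max_attempts + 1):
        result = flaky_action()            # act
        if result == goal_value:           # verify against the goal
            return result, attempt
    raise RuntimeError("goal not reached within budget")

# A step that fails twice before succeeding:
attempts = iter(["garbage", "malformed", "+3f"])
value, tries = agentic_loop("+3f", lambda: next(attempts))
print(value, tries)  # '+3f' on attempt 3
```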

Thanks for the detailed response.

I agree, agentic loops can catch a lot of mistakes and the problem of determinism can be mostly avoided with function/tool calling.

In the end it depends on what you are trying to automate. If you are doing a one off task, or automation is not supported, then this is definitely the way to go. If you are automating some type of data processing in an environment with proper APIs, then the agent could be just a very expensive regex or switch and a potential source of malformed data.

My point, to return to your original example, is that “vision has not won”. It is just another tool in our arsenal and shouldn’t be used for every problem. That is why we have microcontrollers and don’t have Intel i9s running our washing machines :wink:

1 Like

In the context in which I concluded, it appears to have won. My car takes me everywhere and has done a relatively complete and safe job for the last ~39,000 miles, solely based on camera vision. It has reduced my accident risk roughly 9x. That seems like vision won.

In the context of the article, AI vision seems to change everything. When tools can see what you’re doing and what they’re doing, doesn’t that feel like it vastly changes how we approach problems?

The analogies are getting a bit strained here so I will focus on the software side :sweat_smile:

Yes, it changes how we can approach problems, but it doesn’t mean it is the most efficient or reliable solution for every problem. An AI query is, on average, 10x more expensive than a Google search (this was provided by an AI overview on Google, oh the irony :rofl:). Depending on whether you believe that current AI pricing is subsidized, this difference may grow.

We saw a less pronounced version of this with blockchain a couple of years ago. Yes, a distributed ledger is the correct solution for some problems, but it shouldn’t be the default option for when you need a database. Megawatts of power were wasted because of this. AI is much more broadly applicable than blockchain, and the amount of power wasted because it is the “easy” solution will be huge.

Even if we ignore the other problems, the problem of cost at scale says that if vision has won, it has done so in a very specific context (or we will feel the consequences of this victory later).

1 Like

The predicates I wrote in the article matter. In the context of my Tesla, I concluded that Vision has won. Other sensors are not required. It does a better job than humans.

As to software solutions, vision changes how we look at future solutions. I think you agree, notwithstanding the additional predicate that cost, practicality, risk, and several other aspects of smart software engineering matter greatly. On that, I think we agree.

1 Like

Sorry, I dipped back into the analogy there. By “vision” I meant the AI approach. I didn’t want to start a discussion about vision vs. Lidar (although I do think a hybrid approach would be better than vision alone :wink:).

And yes, we are in agreement. This does change how we look at future solutions. I’m just trying to point out that we shouldn’t forget about possibly better solutions because of the shiny new thing, as software developers usually tend to do.

Thank you for the lively discussion on an interesting topic :person_bowing:

1 Like

Yes, absolutely. …

2 Likes