The Billion-Dollar Hallucination: Why Your Chatbot Can’t Count (And Why It Matters)
Let’s be honest: we have spent the last two years building the world’s most expensive, elaborate, and seemingly intelligent tape recorder, and we are shocked—shocked—when it doesn’t exhibit judgment, taste, or the ability to do basic arithmetic.
We are witnessing a pivotal moment in the enterprise AI hangover. The “vibes” era is ending. The “truth” era is gasping for air.
Since the early days of Coda AI, I have maintained that Large Language Models (LLMs) are stochastic remix engines—probabilistic parrots that excel at sounding plausible but fail spectacularly at being precise. They are dream machines, not calculating engines. And nowhere is this failure more expensive, or more embarrassing, than in data analytics.
A recent head-to-head battle between Coda’s native AI and a Coda MCP (Model Context Protocol) agent running on mcpOS has just provided the smoking gun. It turns out that when you ask a chatbot to analyze your business data, it doesn’t count rows; it guesses them.
And it guessed wrong by $1.1 million.
The “Bolt-On” Fallacy: Coda AI is Just Copilot in a Hoodie
First, let’s clear the air regarding the elephant in the server room: Coda AI is effectively no different from Microsoft Copilot.
They represent the “Bolt-On” era of AI: taking a chat interface and duct-taping it to the side of an existing product in a desperate attempt to convince users that a conversation is the same thing as a workflow. It isn’t.
The industry fell into a collective delusion that if we just “bolted on” a chat window to our spreadsheets and documents, the AI would magically understand the structured reality of the data underneath. But that is not how these models work.
Microsoft Copilot suffers from the exact same “garbage in, garbage out” hallucination loops when asked to perform data analytics. If you ask Copilot to sum a column of 10,000 sales figures, it doesn’t run a SQL query; it predicts the next likely number in a sequence based on the tokens it sees. It’s like asking a poet to do your taxes—the result might rhyme, but you’ll probably go to jail.
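The distinction is easy to demonstrate. Here is a minimal sketch using a toy SQLite table with made-up figures: on the deterministic path, the database engine does the counting and summing, and no token prediction is involved anywhere.

```python
import sqlite3

# Toy sales table; the schema and values are illustrative, not real data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (amount REAL)")
conn.executemany(
    "INSERT INTO sales VALUES (?)",
    [(v,) for v in (1200.0, 850.5, 3100.25)],
)

# Deterministic aggregation: the engine scans every row and returns
# an exact answer. A chat model fed the same rows as tokens would
# instead *predict* a plausible-looking total.
total, count = conn.execute("SELECT SUM(amount), COUNT(*) FROM sales").fetchone()
print(count, total)  # 3 5150.75
```

Nothing here is clever, and that is the point: exactness is a property of the execution engine, not of the model's eloquence.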
In our test, Coda AI fell into this exact trap. It hallucinated on aggregation. It confidently presented 186 orders (missing ~10% of the data) and a Total Retail Value of $12,735,000.
Sounds impressive, right? It’s a big number. It looks authoritative. It was formatted beautifully.
It was also $1,128,970 short.
The outcome would have been just as disastrous had we tried this in Microsoft Copilot. The failure isn’t the brand; it’s the architecture. When you rely on a Bolt-On AI, you are relying on a vibe-check, not a calculation.
Taming the Stochastic Beast: Gemini 3 (or Claude, etc.) + MCP
Here is the nuance that the “AI is Magic” crowd misses: The agent that won this battle—a Google Antigravity agent running Gemini 3—is also a stochastic Large Language Model capable of serious hallucinatory missteps.
Left to its own devices, Gemini 3 is just as capable of hallucinating as GPT-4 or Claude. It wants to dream. It wants to predict the next word, not the next truth. It is designed to be creative, not compliant.
The difference? mcpOS puts the beast in a straitjacket.
When paired with the Coda MCP (Model Context Protocol), we aren’t asking Gemini 3 to do the math. We are asking it to orchestrate the math. The MCP tools act as the “guardrails of truth,” forcing the stochastic model to defer to deterministic functions.
- The LLM (Gemini 3) provides the intent and the reasoning (“The user wants to know the total revenue”).
- The MCP (Coda’s MCP Beta) provides the execution and the database access (“I will run a query on Table A for Column B”).
It’s the difference between asking a well-read friend to guess the number of jellybeans in a jar versus handing a counting machine to an engineer. Gemini 3 didn’t guess the row count; it used a tool to fetch it. The result? 206 orders and $13,863,970. Accuracy: 100%.
We effectively “contained” the stochastic qualities of the LLM. We let it be smart about what to do, but we forbade it from being creative about how to count.
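The intent/execution split above can be sketched roughly as follows. Everything here is an illustrative stand-in—the tool names, the stand-in order data, and the hard-coded “plan” are hypothetical, not Coda’s actual MCP API—but the shape is the same: the model emits a structured tool call, and a deterministic function does the arithmetic.

```python
# Stand-in data: 206 orders, each worth a flat 100.0 for simplicity.
ORDERS = [{"id": i, "retail_value": 100.0} for i in range(206)]

def count_rows(table):
    """Deterministic MCP-style tool: count rows, don't guess them."""
    return len(table)

def sum_column(table, column):
    """Deterministic aggregation over a column; no token prediction."""
    return sum(row[column] for row in table)

TOOLS = {"count_rows": count_rows, "sum_column": sum_column}

# In a real agent, the LLM would emit these structured calls after
# reading the user's question. Here we hard-code the plan to show
# the division of labor: the model supplies intent, the tools supply truth.
plan = [
    ("count_rows", {"table": ORDERS}),
    ("sum_column", {"table": ORDERS, "column": "retail_value"}),
]

results = {name: TOOLS[name](**args) for name, args in plan}
print(results)  # {'count_rows': 206, 'sum_column': 20600.0}
```

The model is never allowed to emit a number directly; it can only name a tool and its arguments, which is exactly the straitjacket described above.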
The Rot in the Machine: Why Context Efficiency Wins
But accuracy isn’t the only casualty of the “Bolt-On” model; efficiency is dying too.
There is a prevalent myth in Silicon Valley that “bigger context windows” solve everything. “Just throw 2 million tokens at it!” they scream. “The model will figure it out!”
This leads to a phenomenon known as Context Rot.
As you stuff an LLM’s context window with thousands of tool definitions, API schemas, and document text, the model’s reasoning capabilities degrade. It gets confused. It suffers from “attention dilution,” where critical instructions get buried in the noise. Even a model as powerful as Gemini 3 will start to wander if you give it too much junk to look at.
This is where mcpOS quietly changes the game.
Look at the difference in architecture:
- Before (The “Bolt-On” Way): You dump 40 GitHub tools, 10 Figma tools, and 25 Linear tools into the context window just in case the user needs them. You are burning tokens on potentiality, not reality. You are cluttering the workspace before the work has even begun.
- After (The mcpOS Way): You load a lightweight “MCP Server List.” The agent uses a search sub-agent to find the one tool it needs, when it needs it.
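A rough sketch of the “MCP Server List” idea, with invented tool names and a trivial keyword match standing in for the search sub-agent (a real implementation would presumably use embeddings or a dedicated retrieval model):

```python
# Lightweight index: just names and one-line descriptions, not full schemas.
# All entries are hypothetical examples, not real MCP server definitions.
TOOL_INDEX = {
    "github.create_issue": "Open an issue in a GitHub repo",
    "figma.export_frame": "Export a Figma frame as an image",
    "linear.create_task": "Create a task in Linear",
}

def search_tools(query, index):
    """Tiny stand-in for a search sub-agent: keyword match over descriptions."""
    q = query.lower()
    return [name for name, desc in index.items() if q in desc.lower()]

# Only the matching tool's full schema would then be loaded into context,
# instead of all 75 schemas "just in case".
matches = search_tools("issue", TOOL_INDEX)
print(matches)  # ['github.create_issue']
```

The context cost of the index is a few dozen tokens per tool, versus hundreds per full schema, which is where the savings come from.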
In my testing, this approach saved 26% of the conversation context; others report similar gains. Warp leans into this as well, with context-optimized agents: a CLI capable of engineering context with precision.
Why does this matter? Because “context rot” is the silent killer of complex agents. By keeping the context window clean, we ensure the LLM has the “brain space” to follow complex instructions without getting lost in the weeds. We aren’t just saving money on tokens; we are saving the model’s sanity.
The Triple Crown
This test serves as a pivotal proof point. By moving from “Chat” (Coda AI/Copilot) to “Agentic Orchestration” (Gemini 3 + MCP), we achieved the holy trinity of enterprise AI:
- The Right Answer: Deterministic truth, not probabilistic guessing ($1.1M variance eliminated).
- Quicker Execution: No back-and-forth prompting to fix hallucinations. The agent gets it right the first time because it is reading the database, not reading the room.
- Fewer Tokens: Optimized context usage via mcpOS, preventing context rot and lowering costs.
A Memo to Coda Leadership
You are sitting on the engine of truth in a world drowning in synthetic noise.
Coda AI (the chat feature) is a fun toy. It’s excellent for summarizing meeting notes or drafting emails. But if you want Coda to be the operating system for the enterprise, you cannot rely on a system that hallucinates 10% of the data.
The future isn’t “smarter” LLMs that hallucinate slightly less. The future is hybrid architectures: LLMs for the interface (the “intent”) and MCP agents for execution (the “truth”). Perhaps there’s an engineering team working on integrating Coda MCP tools into Coda AI features.
We need to stop trying to train the parrot to be an accountant. Let the parrot talk. Let the MCP agent count.
And for the love of accuracy, let’s stop accepting “close enough” as a standard for enterprise intelligence.
On the Docket: The Rise of Agentic AI in CLIs
The rapid momentum behind CLI-based AI agents and broader agentic platforms signals a pivotal shift in how we build and interact with intelligent systems. As we enter 2026, with tools like Gemini CLI, Claude Code, and emerging multi-agent frameworks gaining traction, it’s an ideal moment to examine the evolving role of AI in collaborative, cloud-based SaaS ecosystems—like Coda, Notion, and Airtable.
These platforms are already embedding powerful AI assistants for content generation, data insights, and workflow automation. But the agentic wave—autonomous agents that reason, plan, and execute multi-step tasks—could redefine them: from reactive tools to proactive “digital colleagues” that orchestrate complex processes across docs, tables, and integrations.
There seems to be a significant rise in the use of agentic tools within CLI frameworks. This is precisely how I mostly use Coda MCP [beta]. Will agentic Coda increasingly move to the desktop, or will the Superhuman Packs Agent SDK interrupt this landslide of adoption?
Key questions on the horizon:
- How will agentic workflows bridge CLI power with no-code accessibility?
- Can cloud SaaS platforms evolve into true agent orchestration hubs without losing their intuitive, user-friendly core?
- What governance and interoperability challenges arise as agents proliferate?
This intersection of developer-centric agentic tools and collaborative SaaS feels ripe for disruption. Worth diving deeper—thoughts?

