The model is not the bottleneck. The code around it is.

Every time a client comes to us frustrated with their AI agent, the conversation goes the same way. They have swapped the model twice, maybe three times. GPT-4o to Claude, Claude to Gemini, back again. The agent still underperforms. They want to know which model to try next.

The answer is almost always: none of them. The model is not the problem.

What determines whether an agent actually works in production is the infrastructure you build around it. The tools it can call. The memory it can access. The permissions that define what it is allowed to touch. Get those wrong, and no model upgrade will save you. Get them right, and even a mid-tier model will outperform a frontier one running on a weak scaffold.

This is not a fringe position. It is where serious AI engineering has been quietly landing for the past year. And it has real consequences for how agencies, product teams, and founders should be allocating their time.

What the scaffold actually does

An AI agent is not a model. It is a system. The model is one component inside a larger architecture that includes tool definitions, memory stores, retrieval mechanisms, orchestration logic, and permission layers. Each of those components shapes what the agent can do more than the model weights themselves.

Take tool access. A model has no ability to act on the world unless it is given tools: APIs it can call, databases it can query, services it can write to. The quality of those tool definitions, how clearly they describe what a tool does and when to use it, directly affects whether the model reasons about them correctly. A vague tool description produces wrong tool calls. A precise one produces reliable behavior. The model did not change. The interface did.

Memory is the same story. Most agent failures in the wild are not reasoning failures. They are context failures. The agent did not have access to the right information at the right moment. Whether that means a well-structured vector store, a short-term scratchpad, or a summary of prior conversation turns, the architecture of memory determines what the model can know when it needs to know it. On a knowledge-heavy task, a GPT-3.5-era model with excellent retrieval will often beat a frontier model that never sees the relevant documents.

Permissions are the component people talk about least and break most often. What can the agent read? What can it write? What requires a human confirmation? These boundaries are not just safety guardrails; they are functional design. An agent that can write to production without a checkpoint is not a powerful agent. It is a liability. An agent with well-designed permission gates is one you can actually deploy.

The most common mistake I see is treating the model as the product. The model is an engine. The product is everything you build around it.
Max Pinas, founder, Studio Hyra

Where agencies have been getting this wrong

The agency world has a particular version of this problem. There is a commercial incentive to lead with the model, because models have names, benchmarks, and press coverage. Telling a client you are using Claude 3.5 Sonnet or GPT-4o signals something. Telling them you spent three weeks on retrieval architecture and permission schema design is harder to put in a proposal.

So most agencies do not do it. They bolt a capable model onto a thin scaffold, show a demo that works in controlled conditions, and hand it over. The client then runs it in the real world, where the inputs are messier and the context is richer and more ambiguous, and the thing breaks.

The fix is never to swap the model. The fix is to go back and redesign the scaffold. But by that point, the relationship is strained and the budget is spent.

At Studio Hyra we have made a deliberate choice to invert this. Before we touch model selection, we map the operational layer: what tools does this agent need, what memory architecture fits the use case, where do humans need to stay in the loop. Model selection comes after that. It is almost always the easiest part of the project.

Three things that actually move agent performance

Tool schema design. The JSON schema you write for each tool is a prompt in disguise. If the description field is vague, the model will misfire. Write tool descriptions as if you were explaining the tool to a smart junior colleague who cannot ask follow-up questions. Include what the tool does, what it does not do, and when to prefer it over a similar tool. This alone eliminates a large category of agent errors.
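To make the contrast concrete, here is a minimal sketch of a vague and a precise definition for the same kind of tool. It assumes a generic function-calling shape (a name, a description, and JSON Schema parameters); the exact envelope varies by provider. The tool names (search, search_orders, search_products) and the helper undocumented_parameters are hypothetical, invented for illustration.

```python
# Hypothetical tool definitions. The outer shape is an assumption;
# check your provider's documentation for the exact format it expects.
VAGUE_TOOL = {
    "name": "search",
    "description": "Searches stuff.",
    "parameters": {
        "type": "object",
        "properties": {"q": {"type": "string"}},
        "required": ["q"],
    },
}

PRECISE_TOOL = {
    "name": "search_orders",
    "description": (
        # What it does, what it does not do, and when to prefer it.
        "Look up existing customer orders by order ID or customer email. "
        "Read-only: it does not create, modify, or cancel orders. "
        "Prefer this over search_products when the user asks about a "
        "purchase they have already made."
    ),
    "parameters": {
        "type": "object",
        "properties": {
            "order_id": {
                "type": "string",
                "description": "Exact order ID, if the user supplied one.",
            },
            "customer_email": {
                "type": "string",
                "description": "Customer email; use when no order ID is known.",
            },
        },
        "required": [],
    },
}


def undocumented_parameters(tool: dict) -> list[str]:
    """Return the names of parameters that have no description."""
    props = tool["parameters"]["properties"]
    return [name for name, spec in props.items() if not spec.get("description")]


print(undocumented_parameters(VAGUE_TOOL))
print(undocumented_parameters(PRECISE_TOOL))
```

A check like this catches only the crudest gaps. Whether a description actually reads well to the model still has to be judged by running the agent against realistic inputs.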

Retrieval before generation. For any agent operating in a knowledge-heavy domain, the retrieval step deserves as much attention as the generation step. Chunking strategy, embedding model choice, reranking, metadata filtering: these decisions compound. A retrieval pipeline that surfaces the right document 90% of the time produces a dramatically different agent than one hitting 60%, regardless of which generation model sits downstream.
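One way to put a number on "surfaces the right document" is a hit rate over a labeled set of queries. The sketch below is one possible measurement, not a standard we are attributing to any library: hit_rate_at_k is a hypothetical helper, and the queries and document IDs are made up for illustration.

```python
def hit_rate_at_k(retrieved: dict[str, list[str]],
                  relevant: dict[str, set[str]],
                  k: int = 5) -> float:
    """Fraction of queries whose top-k results include a relevant document."""
    if not retrieved:
        return 0.0
    hits = 0
    for query, doc_ids in retrieved.items():
        # A query counts as a hit if any relevant doc appears in the top k.
        if relevant.get(query, set()) & set(doc_ids[:k]):
            hits += 1
    return hits / len(retrieved)


# Illustrative data: ranked results per query, and the known-good answers.
retrieved = {
    "refund policy": ["doc_7", "doc_2", "doc_9"],
    "shipping times": ["doc_4", "doc_1", "doc_3"],
    "warranty length": ["doc_5", "doc_8", "doc_6"],
}
relevant = {
    "refund policy": {"doc_2"},
    "shipping times": {"doc_3"},
    "warranty length": {"doc_5"},
}

print(hit_rate_at_k(retrieved, relevant, k=2))
print(hit_rate_at_k(retrieved, relevant, k=3))
```

Tracking this figure while changing chunking, embeddings, or reranking shows whether a retrieval change helped before any generation model is involved.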

Explicit human checkpoints. Agents that can act without limit do not inspire confidence. They produce anxiety, and rightly so. Designing explicit confirmation points, where the agent pauses and asks a human before writing, deleting, or sending, turns an unpredictable system into a trustworthy one. The goal is not automation for its own sake. The goal is reliable outcomes. Sometimes that means the agent stops and waits.
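A confirmation point can be as small as a gate in front of the code that executes actions. The following is a minimal sketch under stated assumptions: Action, REQUIRES_CONFIRMATION, and run_action are hypothetical names, and confirm stands in for whatever approval channel a real system would use (a chat prompt, a ticket, a review UI).

```python
from dataclasses import dataclass
from typing import Callable

# Assumption: these action kinds are the ones that need a human sign-off.
REQUIRES_CONFIRMATION = {"write", "delete", "send"}


@dataclass
class Action:
    kind: str    # e.g. "read", "write", "delete", "send"
    target: str  # what the action operates on


def run_action(action: Action,
               execute: Callable[[Action], None],
               confirm: Callable[[Action], bool]) -> str:
    """Execute an action, pausing for human approval when its kind requires it."""
    if action.kind in REQUIRES_CONFIRMATION and not confirm(action):
        return "blocked"
    execute(action)
    return "executed"


# Usage: a log stands in for real side effects, and the human always declines.
executed_log: list[str] = []


def record(action: Action) -> None:
    executed_log.append(f"{action.kind}:{action.target}")


def always_decline(action: Action) -> bool:
    return False


print(run_action(Action("read", "orders"), record, always_decline))
print(run_action(Action("delete", "orders"), record, always_decline))
```

Keeping the gated kinds in one explicit set makes the permission boundary reviewable on its own, separate from the prompt and the model.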

Reliable outcomes sometimes mean the agent stops and waits. That is not a limitation. That is the design.
Max Pinas, founder, Studio Hyra

What this means for where you spend your time

If you are building an AI product or integrating agents into a workflow, the practical implication is straightforward: stop auditing model leaderboards and start auditing your scaffold.

Ask where your agent fails in production. In the majority of cases you will find one of three things: it called the wrong tool because the schema was ambiguous; it gave a stale or irrelevant answer because the retrieval step missed; or it did something it should not have done because the permission layer was under-specified.
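A scaffold audit can start as a simple tally of labeled failures. The sketch below assumes someone has reviewed failed runs and assigned each a cause by hand; the labels, run IDs, and the tally_causes helper are all hypothetical, and the log entries are invented for illustration rather than taken from a real project.

```python
from collections import Counter

# Hypothetical labels assigned during manual review of failed agent runs.
FAILURE_LOG = [
    {"run_id": "r1", "cause": "ambiguous_tool_schema"},
    {"run_id": "r2", "cause": "retrieval_miss"},
    {"run_id": "r3", "cause": "retrieval_miss"},
    {"run_id": "r4", "cause": "underspecified_permission"},
    {"run_id": "r5", "cause": "retrieval_miss"},
]


def tally_causes(log: list[dict]) -> Counter:
    """Count how many failed runs fall under each labeled cause."""
    return Counter(entry["cause"] for entry in log)


for cause, count in tally_causes(FAILURE_LOG).most_common():
    print(f"{cause}: {count}")
```

The most frequent cause tells you which part of the scaffold to fix first.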

None of those failures are fixed by a better model. All of them are fixed by better engineering.

This is actually good news. Model capabilities are largely outside your control. You take what the labs give you, and the labs update on their own schedule. The scaffold is entirely within your control. It is where craft shows. It is where the difference between an agent that demos well and one that runs reliably in production is made or lost.

Frontier models are impressive. But the most capable agent we have built this year runs on a model that is not the newest or the largest. It runs on a scaffold that took four weeks to get right. That is the work.

The shift worth making

The industry conversation around AI agents is slowly moving in this direction. Infrastructure companies are building better tooling primitives. Evaluation frameworks are maturing. Memory and retrieval are becoming recognized disciplines, not afterthoughts.

For builders who have been in the model-swapping loop, the reframe is simple: treat the model as a commodity input and the scaffold as the product. That is where differentiation lives. That is where your time is worth spending.

If you want to talk through what that looks like for a specific use case, we are easy to find.
