Technology · 6 min read

From AI pilot to AI in production

April 29, 2026

Most companies have run a pilot by now. A few have run several. The pattern is familiar: a small team picks a use case, wraps a model, demos it to the leadership group, gets applause, and then stalls somewhere between "this works in staging" and "we can ship this."

According to a 2024 McKinsey survey, 79% of executives say AI adoption is causing them pain. A separate figure from Gartner puts the share of organisations with a formal measurement framework for production agents at 31%. Read those two numbers together and the problem becomes clear. The pilots are running everywhere. The production discipline is almost nowhere.

This is not a capability gap. The models are good enough. The tooling is mature enough. What is missing is the craft of taking an experiment and making it something you would stake your quarter on.

Why pilots die at the threshold

A pilot is designed to answer one question: can this work? It is allowed to be fragile. Someone watches it. Someone intervenes when it halts. Someone manually checks the output before it touches anything real.

Production is a different contract. No one is watching every run. Failure is silent. Costs accumulate in the background. Users form habits around the system, good and bad. And when something goes wrong, you need to know within minutes, not weeks.

The gap between those two states is not a sprint of cleanup work. It is a different way of thinking about what you built. Most teams skip that re-think because the pilot looked so promising. That is exactly where the trouble starts.

There is also an organisational pull toward the demo. A working demo is visible. It produces excitement. Production readiness is invisible until the moment it fails in front of a customer. So the incentive to ship the demo and call it done is real, and it takes deliberate pushback to resist it.

A pilot is allowed to be fragile. Someone watches it. Production is a different contract. No one is watching every run, and failure is silent.

Max Pinas, Studio Hyra

Three things that matter in the second half of 2026

After working with production agent systems across several client tracks, we have seen three disciplines separate the ones that hold from the ones that quietly get switched off.

1. A measurement framework per agent, not per product

The default metric for most shipped AI features is adoption. How many users opened it. How many sessions. How many seats activated. That is a product metric, not an agent metric. It tells you nothing about whether the agent is doing what you built it to do.

Useful metrics live one level down. Task success rate: did the agent complete the assigned task without a human having to intervene or redo it? Tool call accuracy: when the agent reached for a function or an API, did it call the right one with the right arguments? Cost per outcome: not cost per token, not cost per session, but cost per unit of actual value delivered.

These are harder to instrument, especially if your agent was scaffolded quickly for a demo. But they are the only numbers that tell you whether the system is earning its place in the stack. Pick two or three before you go live. You can always add more later. Starting with none means you are flying without instruments.
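The three metrics above can be tracked with a small per-agent counter. A minimal Python sketch, our own illustration rather than any particular library; the class and method names are hypothetical:

```python
from dataclasses import dataclass

@dataclass
class AgentMetrics:
    """Per-agent counters for the three metrics discussed above."""
    tasks_total: int = 0
    tasks_succeeded: int = 0     # completed without human intervention or redo
    tool_calls_total: int = 0
    tool_calls_correct: int = 0  # right function, right arguments
    spend_total: float = 0.0     # everything: tokens, API fees, compute

    def record_task(self, succeeded: bool, cost: float) -> None:
        self.tasks_total += 1
        self.tasks_succeeded += succeeded
        self.spend_total += cost

    def record_tool_call(self, correct: bool) -> None:
        self.tool_calls_total += 1
        self.tool_calls_correct += correct

    @property
    def task_success_rate(self) -> float:
        return self.tasks_succeeded / self.tasks_total if self.tasks_total else 0.0

    @property
    def tool_call_accuracy(self) -> float:
        return self.tool_calls_correct / self.tool_calls_total if self.tool_calls_total else 0.0

    @property
    def cost_per_outcome(self) -> float:
        # Cost per *successful* task, not per token or per session.
        return self.spend_total / self.tasks_succeeded if self.tasks_succeeded else float("inf")
```

The point is not the code but the granularity: one of these per agent, not one per product.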

2. Identity, audit logs, rollback, and human override are the floor, not the ceiling

Every production agent needs to know who it is acting on behalf of, leave a trace of what it did, be reversible when it acts on bad data, and have a clear path for a human to step in and take over.

Those four things are not a compliance checklist. They are the mechanical properties that make an agentic system safe to run at scale. Without them, a single bad run can corrupt state, charge a customer incorrectly, delete something irreversible, or trigger a downstream process that takes days to unwind.
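As a concrete sketch of those four properties, here is a hypothetical Python wrapper that attaches identity, an audit entry, and an undo path to every action an agent takes. The function and field names are our own illustration, not a real framework:

```python
import time
import uuid
from typing import Any, Callable

def run_action(actor: str, action: str,
               apply: Callable[[], Any],
               undo: Callable[[], None],
               audit_log: list) -> dict:
    """Execute one agent action with the floor properties:
    identity (actor), an audit trail entry, and rollback on failure.
    The audit entry is also what a human override reads and reverses."""
    entry = {
        "id": str(uuid.uuid4()),
        "actor": actor,          # who the agent is acting on behalf of
        "action": action,
        "ts": time.time(),
        "status": "pending",
    }
    audit_log.append(entry)
    try:
        entry["result"] = apply()
        entry["status"] = "applied"
    except Exception as exc:
        undo()                   # reverse any partial effect
        entry["status"] = "rolled_back"
        entry["error"] = str(exc)
    return entry
```

Even this toy version makes the incident story different: every run leaves a record with an identity attached, and a bad run reverses itself instead of corrupting state.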

The pushback I usually hear is that adding this infrastructure slows the team down. It does, slightly, the first time. By the third agent it is a two-hour setup because the patterns are already in place. The cost of retrofitting them after an incident is orders of magnitude higher.

This is the same logic that made version control non-negotiable for code. No one argues about it anymore. Audit logs and rollback for agents will get there too. The teams building now who treat these as optional are simply borrowing time.

3. ROI at the outcome level, not the tool level

This is where a lot of the business case collapses quietly. A team builds an agent, measures the cost of running it (compute, API calls, engineering time), compares it against the license cost of the SaaS tool it replaced, and calls it a win.

But the tool cost was never the real cost. The real cost was the time a person spent doing a task that produced a result. The right question is: does the agent produce that result faster, more accurately, and with fewer downstream corrections? That is an outcome. That is where the ROI lives.

Measuring at the tool level is comfortable because the numbers are easy to pull. Measuring at the outcome level requires you to define what a good result looks like, which forces a conversation about quality that many teams are not ready to have. Have that conversation before you ship. It makes everything else sharper.
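One way to put outcome-level ROI into numbers is a sketch like the following. The model and all inputs (run cost, success rate, rework cost, human baseline) are hypothetical; you would measure your own:

```python
def saving_per_outcome(agent_cost_per_run: float,
                       agent_success_rate: float,
                       rework_cost: float,
                       human_cost_per_result: float) -> float:
    """Compare the cost of one *good result* from the agent against the
    human baseline it replaces. Failed runs are not free: each failure
    pays the run cost and a human redoing the task (rework_cost)."""
    runs_per_success = 1 / agent_success_rate
    failures_per_success = runs_per_success - 1
    agent_cost_per_result = (agent_cost_per_run * runs_per_success
                             + rework_cost * failures_per_success)
    return human_cost_per_result - agent_cost_per_result
```

Note what this forces you to know: the success rate and the rework cost, neither of which appears in a tool-level comparison against a SaaS license.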

The production mindset in practice

At Studio Hyra, this work sits inside what we call Track B. Track A is the fast, opinionated build: assisted coding, rapid prototyping, getting something real in front of people within weeks. Track B is the discipline that follows. Not a handoff, not a separate engagement. The same thinking, applied to the question of whether what we built can actually be trusted over time.

That framing matters because it changes the conversation with the client. If Track A and Track B are two separate projects with two separate budgets, the client will often stop at Track A and assume the work is done. If they are two phases of the same arc, the production questions show up early, during design, during scaffolding, before the demo is even finished.

The "Decision Making, Speed of Taste" principle we work with is relevant here too. Fast taste is the ability to make a call without a three-week analysis cycle. In agent production, that means being able to look at your task success rate on a Tuesday morning and decide by noon whether you need to pull the agent back, tune a prompt, or reroute a tool call. That speed of judgment requires the instrumentation to already be in place. You cannot improvise it mid-incident.

Fast taste is the ability to make a call without a three-week analysis cycle. In agent production, that means looking at your task success rate on a Tuesday morning and deciding by noon.

Max Pinas, Studio Hyra

What to do this month

If you have a pilot running and are thinking about the path to production, start with three questions before you write a single line of infrastructure code.

First: what does a successful run look like for this agent, in a sentence a non-engineer could read? If you cannot write that sentence, you cannot instrument it.

Second: what is the worst thing this agent can do silently? A bad email draft is low stakes. A write to a customer record or a payment trigger is not. Map the risk profile before you decide how much override and rollback you actually need.
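That risk map can live in something as plain as a table from action type to required controls. A hypothetical sketch; the action names and control labels are our own:

```python
# Hypothetical risk map: for each kind of action the agent can take
# silently, how bad is the worst case, and what controls does it imply?
RISK_PROFILE = {
    "draft_email":     {"severity": "low",      "controls": []},
    "update_customer": {"severity": "high",     "controls": ["audit_log", "rollback"]},
    "trigger_payment": {"severity": "critical",
                        "controls": ["audit_log", "rollback", "human_approval"]},
}

def required_controls(action: str) -> list:
    # Unknown actions get the strictest treatment by default.
    profile = RISK_PROFILE.get(action)
    if profile is None:
        return ["audit_log", "rollback", "human_approval"]
    return profile["controls"]
```

Defaulting unknown actions to the strictest tier is the important design choice: the agent earns lighter controls per action, never the other way around.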

Third: who owns this agent in six months? Not the team that built it. The person who will be paged when it degrades, who will read the logs, who will decide whether to retrain or replace it. If there is no name attached to that role, the agent is not production-ready regardless of how good the demo looked.

Those three questions take an afternoon. They will save you months.
