Inference costs halved. So why are AI budgets still growing?

Inference costs for large language models have dropped by more than half over the past twelve months. You would expect AI budgets to follow. They have not. Most agencies and product teams are spending more on AI than they were a year ago, not less. That gap is worth examining closely, because it tells you something useful about how organisations actually adopt new technology versus how they plan to.

The short version: cheaper inference does not mean cheaper AI. It means more AI. And more AI, run without a clear architecture, compounds your costs in ways that a per-token price cut does not fix.

The Jevons trap, in your sprint backlog

In the 1860s, economist William Stanley Jevons observed that more efficient coal engines led to greater coal consumption, not less. Efficiency made coal economical for more use cases, so demand expanded faster than efficiency improved. The same dynamic is running through AI budgets right now.

When GPT-4-class inference cost several cents per thousand tokens, teams were disciplined. You routed only what needed routing to the expensive model. You cached aggressively. You designed prompts to be short. When costs dropped sharply, that discipline relaxed. Teams added features, expanded context windows, connected more data sources, ran more evaluations. Each of these is individually reasonable. Cumulatively, they erase the savings and then some.

This is not a failure of judgment. It is the normal behaviour of a rational team working with a newly affordable input. The mistake is assuming that cheaper per-unit cost translates into a smaller total bill.
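
To make the arithmetic concrete, here is a minimal sketch of how a halved unit price can still produce a larger bill. Every figure is hypothetical and chosen only to illustrate the mechanism; the function names are ours, not part of any billing API.

```python
def monthly_bill(price_per_million_tokens: float, tokens_millions: float) -> float:
    """Total inference bill: unit price times volume."""
    return price_per_million_tokens * tokens_millions


def breakeven_growth(price_ratio: float) -> float:
    """Usage multiplier at which the bill is unchanged for a given price ratio."""
    return 1.0 / price_ratio


# Hypothetical figures: price per million tokens and monthly volume in millions.
before = monthly_bill(price_per_million_tokens=10.0, tokens_millions=500)

# The price halves, but new features and longer contexts triple the volume.
after = monthly_bill(price_per_million_tokens=5.0, tokens_millions=1500)

print(before, after)          # 5000.0 7500.0
print(breakeven_growth(0.5))  # 2.0
```

Under these assumptions, any usage growth beyond 2x after a 50 percent price cut leaves the inference bill higher than it was before the cut.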

Cheaper tokens do not reduce your AI bill. They increase your appetite for AI. That is worth planning for explicitly.
Max Pinas, Studio Hyra

Where the money actually goes

When we map AI spend for clients, the inference line item is rarely the largest one. Here is where budgets actually accumulate.

Orchestration and glue code. Connecting models to your data, your APIs, and your existing tools takes engineering time. That time does not get cheaper when OpenAI cuts prices.

Evaluation. A serious eval suite for a production AI feature costs real money to build and maintain. Most teams underestimate this by a factor of three or four when they start.

Human review. For anything consequential, someone still reads the output. That person has a salary. Their time is a cost that does not appear in your model billing dashboard.

Rework from rushed architecture. The most expensive AI cost is the one that does not show up until six months in, when you have to re-engineer a feature because you built it around a model capability that shifted, or a prompt pattern that stopped working at scale.

Inference costs are a small fraction of this picture. When they drop, the picture does not get proportionally cheaper. It gets wider.
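
A rough sketch of why a price cut barely moves the total, assuming a made-up monthly budget in which inference is 15 percent of spend. The layer names follow the categories above; the amounts are illustrative, not client data.

```python
# Hypothetical monthly spend by layer, in arbitrary currency units.
spend = {
    "inference": 15_000,
    "orchestration": 30_000,
    "evaluation": 20_000,
    "human_review": 25_000,
    "rework": 10_000,
}


def total_after_inference_cut(spend_by_layer: dict, cut: float) -> float:
    """Total spend if only the inference line shrinks by `cut` (0.5 means halved)."""
    reduced = dict(spend_by_layer, inference=spend_by_layer["inference"] * (1 - cut))
    return sum(reduced.values())


total_before = sum(spend.values())
total_after = total_after_inference_cut(spend, 0.5)
saving_pct = 100 * (total_before - total_after) / total_before

print(total_before, total_after, saving_pct)  # 100000 92500.0 7.5
```

With these numbers, halving inference prices trims the overall budget by 7.5 percent, and only if usage stays flat.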

What faster model economics actually change

That said, the shift is real and it does matter. Here is what it actually changes for agencies and product teams.

The entry threshold for features moves. Use cases that were economically marginal twelve months ago are now worth building: AI-assisted search over large document sets, real-time personalisation at the content block level, multimodal inputs in mobile flows. These were nice-to-haves a year ago. They are now within budget for mid-market products.

The competitive window compresses. If you were waiting for AI to be cheap enough to justify a feature, your competitors were waiting for the same threshold. When the price drops, everyone crosses it at roughly the same time. Speed of implementation matters more than it did when cost was the differentiator.

Model selection logic needs updating. A lot of teams built routing logic, choosing between cheaper and more capable models, based on price points that no longer exist. If you are still running a model router calibrated for 2023 prices, you may be adding latency and complexity for savings that have already materialised at the provider level anyway.
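
One way to check whether a router still earns its keep is to compare the saving it produces with what it costs to run. The sketch below assumes a simple two-model setup; the class, its fields, and all prices are hypothetical placeholders rather than real provider rates, and latency is not modelled.

```python
from dataclasses import dataclass


@dataclass
class RouterEconomics:
    price_capable: float    # cost per 1M tokens on the more capable model
    price_cheap: float      # cost per 1M tokens on the cheaper model
    cheap_share: float      # fraction of traffic the router sends to the cheaper model
    router_overhead: float  # cost per 1M tokens of running the router itself

    def net_saving_per_million(self) -> float:
        """Saving from routing minus the router's own cost, per 1M tokens of traffic."""
        gross = self.cheap_share * (self.price_capable - self.price_cheap)
        return gross - self.router_overhead


# Hypothetical price points from when the router was first calibrated.
old = RouterEconomics(price_capable=60.0, price_cheap=2.0, cheap_share=0.5, router_overhead=1.0)

# Hypothetical price points after provider cuts; traffic mix and overhead unchanged.
new = RouterEconomics(price_capable=5.0, price_cheap=1.0, cheap_share=0.5, router_overhead=1.0)

print(old.net_saving_per_million())  # 28.0
print(new.net_saving_per_million())  # 1.0
```

In this illustration the router's net saving shrinks from 28.0 to 1.0 per million tokens. Whether that remainder justifies the added latency and maintenance is a judgment call, but it is one worth making with current prices rather than old ones.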

The planning cycle problem

Most organisations budget annually. AI model economics are moving on a six-month cadence, maybe faster. That mismatch creates a specific failure mode: teams lock in assumptions at the start of a fiscal year that are materially wrong by Q3.

The fix is not to plan less carefully. It is to separate the layers of your AI investment: what you spend on infrastructure and model access, which will keep shifting, and what you spend on architecture, evaluation, and the people who maintain and improve the system. The latter is stickier. It should be planned and staffed with more stability, not treated as variable cost.

If your AI budget is primarily a line item for API access, you are measuring the wrong thing. The actual investment is in the team and the system design around the models. That is where the durable value sits, and that is the part that does not get cheaper when inference costs drop.

The question is not what inference costs today. It is what your system will cost to maintain when the model it depends on is deprecated in eighteen months.
Max Pinas, Studio Hyra

What to do with this

Three things worth doing now, regardless of where inference prices go next.

First, audit your AI spend by layer. Separate model costs from engineering costs from review costs. If you have not done this, you are probably optimising the wrong number.

Second, revisit feature candidates that were killed on cost grounds in 2023 or early 2024. Some of them are now viable. A quick re-evaluation takes a day and might unlock something useful.

Third, invest in your eval layer before you invest in more features. Cheap inference means you can run more, faster. That is only an advantage if you can tell quickly whether what you are running is actually working. Without a solid evaluation process, speed becomes a liability.
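
As a minimal sketch of what "telling quickly" can look like, the example below runs a fixed set of cases through a model function and gates on the pass rate. The cases, the 0.9 threshold, and `stub_model` are all hypothetical stand-ins; a real suite would call your actual model and use checks suited to the feature.

```python
from typing import Callable, List, Tuple

# Each case pairs a prompt with a substring the output must contain.
# Hypothetical cases; a real suite would be drawn from production examples.
CASES: List[Tuple[str, str]] = [
    ("Summarise: invoice 1042 is overdue", "1042"),
    ("Summarise: meeting moved to Friday", "Friday"),
    ("Summarise: refund approved for order 77", "refund"),
]


def pass_rate(generate: Callable[[str], str], cases: List[Tuple[str, str]]) -> float:
    """Fraction of cases whose output contains the expected substring."""
    passed = sum(1 for prompt, expected in cases if expected in generate(prompt))
    return passed / len(cases)


def gate(rate: float, threshold: float = 0.9) -> bool:
    """Return True when the pass rate meets the (assumed) release threshold."""
    return rate >= threshold


def stub_model(prompt: str) -> str:
    """Stand-in for a real model call: echoes the text after the instruction."""
    return prompt.split(":", 1)[1].strip()


rate = pass_rate(stub_model, CASES)
print(rate, gate(rate))  # 1.0 True
```

Substring checks are the crudest possible assertion. The point of the sketch is the shape: a fixed case set, a single number, and a threshold that blocks a release when the number drops.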

The economics of AI are shifting in your favour. Whether your budget reflects that depends less on what the models cost and more on how deliberately you have structured the system around them.
