What the 16 bugs actually tell us
Security is a useful domain to study because the feedback is unambiguous. A vulnerability either exists or it does not. That clarity makes it a good test for whether multi-agent debate produces real value or just the appearance of thoroughness.
The MDASH result says it produces real value. Sixteen verified findings, in one cycle, on a codebase that has been under continuous professional scrutiny for decades. That is a meaningful signal.
For agencies, the equivalent test is to find a domain in your own work where the feedback is similarly unambiguous. Where being wrong is visible and consequential. Start there. Not with the work that is hardest to evaluate, but with the work where a failure is obvious in retrospect and where you are currently catching fewer of those failures than you know you should be.
Accessibility is one candidate. Performance budgets are another. Consistency between a design system and what actually ships in production is a third. These are all domains where a multi-agent review could catch things a single-pass human review misses, and where the cost of missing them is real.
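To see what "unambiguous" means in one of those domains, a performance budget reduces to a pass/fail check: an asset is either under its limit or it is not. A minimal sketch, assuming a `dist/` directory and made-up budget numbers:

```python
# Illustrative performance-budget check: the feedback is binary.
# Asset names and budget figures here are assumptions, not a standard.

import gzip
import sys
from pathlib import Path

BUDGETS_KB = {  # hypothetical per-asset budgets, gzipped
    "app.js": 170,
    "app.css": 50,
}

def gzipped_kb(path: Path) -> float:
    """Size of the asset after gzip, in kilobytes."""
    return len(gzip.compress(path.read_bytes())) / 1024

def main(dist: str = "dist") -> int:
    failed = False
    for name, budget in BUDGETS_KB.items():
        size = gzipped_kb(Path(dist) / name)
        over = size > budget
        failed = failed or over
        print(f"{name}: {size:.1f} KB (budget {budget} KB)"
              f" {'OVER BUDGET' if over else 'OK'}")
    return 1 if failed else 0  # nonzero exit fails the CI step

if __name__ == "__main__":
    sys.exit(main())
```

A failure here is visible, attributable, and obvious in retrospect, which is exactly the property that makes a domain a good first target.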
The broader point is this: the most interesting thing about MDASH is not that it uses AI. It is that it takes a process humans invented and already trust, structured adversarial debate, and runs it at a scale and speed that changes what is economically feasible. That is the actual opportunity. Not replacing judgment, but making it cheaper to apply more of it.
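To make the shape of that process concrete, here is a minimal sketch of adversarial review in Python. It is not MDASH's actual pipeline, which is not described here; the reviewer stances, the prompts, and the `ask_model` placeholder are all assumptions standing in for whatever model client you use. What it shows is the structure: independent proposals, cross-examination by reviewers with different priorities, and a survivor list that still goes to a human.

```python
# Sketch of structured adversarial debate, not MDASH's pipeline:
# each stance proposes findings, every other stance challenges them,
# and only unrefuted findings reach human verification.

from dataclasses import dataclass, field

def ask_model(prompt: str) -> str:
    """Placeholder: wire in your actual LLM client here."""
    raise NotImplementedError

@dataclass
class Finding:
    stance: str                 # which reviewer raised the claim
    claim: str                  # the suspected defect
    objections: list[str] = field(default_factory=list)

STANCES = [
    "memory-safety reviewer hunting buffer and lifetime bugs",
    "logic reviewer hunting incorrect branch and state handling",
    "input-handling reviewer hunting unvalidated external data",
]

def propose(diff: str) -> list[Finding]:
    """Each stance reviews the code independently."""
    findings = []
    for stance in STANCES:
        answer = ask_model(f"You are a {stance}.\nReview:\n{diff}\n"
                           "List suspected defects, one per line.")
        findings += [Finding(stance, line.strip())
                     for line in answer.splitlines() if line.strip()]
    return findings

def debate(diff: str, findings: list[Finding]) -> list[Finding]:
    """Every other stance tries to refute each claim."""
    survivors = []
    for f in findings:
        for stance in STANCES:
            if stance == f.stance:
                continue  # only adversarial stances challenge a claim
            verdict = ask_model(
                f"You are a {stance}. A colleague claims:\n{f.claim}\n"
                f"Given this code:\n{diff}\n"
                "Reply REFUTED with a reason, or STANDS.")
            if verdict.startswith("REFUTED"):
                f.objections.append(verdict)
        if not f.objections:   # survived every challenge
            survivors.append(f)
    return survivors           # still needs human verification
```

The economics work because the expensive step, human verification, only sees claims that survived every challenge; the cheap loop absorbs the false positives.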