Most AI failures in regulated work are not model failures. They are deployment failures. Someone wired a language model to a live tool — a refund button, a contract clause, a booking — on day one, watched it work three times, and shipped it. Then on day forty it did something nobody had thought to test for, and the clean-up cost more than the project saved. The pattern I keep coming back to, and the one we use across IMPT, Bro AI and the work feeding into Ireland Quantum, is simpler and slower. The swarm shadows the team for ninety days before it is allowed to touch anything live.
What shadow mode actually means
Shadow mode is not a demo. It is not a sandbox. It is the AI doing the real job, on real inputs, at the same time as the human team, with one rule: the AI's output never reaches the customer, the database, or the regulator. It writes its answer to a log. The human writes theirs to the world. Then the two are compared.
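Here is a minimal sketch of that split in Python. The JSONL log, the file name and the caller-supplied `model_answer_fn` are all illustrative; the only thing that matters is that the AI path ends at a log and the human path is the only one that touches the world.

```python
from datetime import datetime, timezone
import json

SHADOW_LOG = "shadow_log.jsonl"   # append-only; nothing in here reaches production

def shadow_case(case_id, case_input, model_answer_fn):
    """The swarm does the real job on the real input, but its answer stops at the log."""
    ai_output = model_answer_fn(case_input)
    record = {
        "case_id": case_id,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "input": case_input,
        "ai_output": ai_output,   # never sent to the customer, the database or the regulator
    }
    with open(SHADOW_LOG, "a") as f:
        f.write(json.dumps(record) + "\n")

def record_human_output(case_id, human_output):
    """The human's answer goes to the world as usual; a copy lands here for comparison."""
    with open(SHADOW_LOG, "a") as f:
        f.write(json.dumps({"case_id": case_id, "human_output": human_output}) + "\n")
```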
That comparison is the entire point. You are not measuring whether the model is clever. You are measuring whether, on the actual distribution of work that walks through your door, it agrees with the people who already do the job well. Disagreement is the signal. Where it disagrees, you read both answers and decide who was right. Often it is the human. Sometimes it is the model. Sometimes both are wrong in different ways, which is the most useful finding of all because it tells you the process itself is under-specified.
The discipline of shadow-mode AI is the discipline of writing things down. Every disagreement becomes a row in a spreadsheet. Every row gets a category. Categories become the backlog. The backlog becomes the prompt, the tool definitions, the guardrails, and the training data. None of this is glamorous. All of it is what separates an AI rollout pattern that earns trust from one that burns it.
Why ninety days, and not thirty
Thirty days catches the obvious failures. The model hallucinates a policy that does not exist. It cites a clause from the wrong jurisdiction. It books a refund in the wrong currency. You will find these in the first week and fix them in the second. By week four the system looks ready.
It is not ready. Day-thirty confidence is the most expensive feeling in cautious AI deployment. What ninety days buys you is the long tail: the month-end close, the quarterly filing, the bank holiday weekend when staffing is thin and edge cases queue up, the customer with the unusual surname that breaks a regex you didn't know was there, the supplier who changes their invoice format without telling anyone. None of that shows up in the first month. Most of it shows up between days forty-five and seventy-five.
Ninety days is also long enough that the team stops performing for the AI. In the first weeks people know they are being watched and they sharpen up. Their work gets better, which makes the model look worse by comparison. By month three the team has gone back to being themselves, and the comparison gets honest.
The three gates
I structure the ninety days as three gates, each thirty days, each with a different question.
Days 1–30: does it agree?
The only metric that matters in the first month is agreement rate on the work the team already does well. If the team's senior people would have written substantially the same answer the model wrote, that is a pass. You are not yet asking whether the model is good. You are asking whether it is sane. A model that agrees with your seniors most of the time is a candidate. A model that does not is not, and no amount of prompt engineering in month two will save it. Change the model.
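In code, the gate is almost embarrassingly small. A sketch, assuming graded rows shaped like the spreadsheet described later; the 90% threshold is mine for illustration, not a number handed down from anywhere.

```python
from collections import Counter

def gate_one(rows, threshold=0.90):
    """Gate 1: how often the model's answer matches what the senior team actually did.

    Each row carries an 'agreement' field of 'yes', 'partial' or 'no'.
    The 0.90 threshold is illustrative only.
    """
    counts = Counter(row["agreement"] for row in rows)
    total = sum(counts.values())
    rate = counts["yes"] / total if total else 0.0
    return rate, rate >= threshold

rate, passed = gate_one([{"agreement": "yes"}, {"agreement": "yes"}, {"agreement": "no"}])
print(f"agreement {rate:.0%}, gate one {'passed' if passed else 'failed'}")
```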
Days 31–60: where does it break?
Once agreement is reasonable, you stop celebrating it and start hunting failures. This is the month where you push the hard cases through deliberately. The unusual claim. The non-standard contract. The customer with three accounts and two addresses. You read every disagreement, not just the sampled ones. You categorise the failures into three buckets: the model is wrong, the human is wrong, or the policy itself is unclear. The third bucket is the gold. Every entry in it is a place where your business has been making decisions on vibes, and now you have a chance to write the rule down.
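The bucketing itself is trivial once the grading is done. A sketch, assuming each graded row already carries an agreement and a category field (field names illustrative):

```python
from collections import defaultdict

def bucket_disagreements(rows):
    """Group every disagreement by its category; the 'policy_gap' bucket is the gold."""
    buckets = defaultdict(list)
    for row in rows:
        if row["agreement"] != "yes":
            buckets[row["category"]].append(row)
    return buckets

buckets = bucket_disagreements([
    {"agreement": "no", "category": "model_error"},
    {"agreement": "partial", "category": "policy_gap"},
])
for category, cases in buckets.items():
    print(category, len(cases))   # every policy_gap case is a rule nobody has written down
```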
Days 61–90: would you sign it?
The final gate is not technical. It is personal. You take a random sample of the model's outputs from the last thirty days and you ask the people who would carry the regulatory risk — the head of compliance, the senior solicitor, the clinical lead, whoever owns the licence — to read them. The question is not "is this good?" The question is "would you have signed this?" If the answer is yes for the overwhelming majority, and the failures in the remainder are ones the human reviewers would also have made, the system has earned the right to act. If the answer is no, you are not ready, and another month of prompt tweaks will not change that. You have a deeper problem.
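The sampling behind that question can and should be boring. A sketch, assuming the timestamps written by the shadow logger earlier; the sample size, the thirty-day window and the fixed seed are illustrative choices, the seed there so the same sample can be re-drawn for anyone who asks.

```python
import random
from datetime import datetime, timedelta, timezone

def sign_off_sample(rows, sample_size=50, window_days=30, seed=7):
    """Gate 3: a reproducible random sample of recent AI outputs for the people who
    carry the regulatory risk to read and answer: would you have signed this?"""
    cutoff = datetime.now(timezone.utc) - timedelta(days=window_days)
    recent = [r for r in rows if datetime.fromisoformat(r["timestamp"]) >= cutoff]
    random.Random(seed).shuffle(recent)
    return recent[:sample_size]
```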
What "earning the right to act" actually unlocks
At day ninety, if the gates are passed, the swarm gets one tool. Not all the tools. One. The lowest-risk action it has been shadowing — usually something reversible, audited, and rate-limited. It might be drafting the email, not sending it. Suggesting the refund, not issuing it. Booking the hold, not the confirmation. The human still signs. But now the AI's output is the default, and the human's job changes from author to reviewer.
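What "one tool" might look like in code, as a sketch rather than a prescription; the class name, the hourly cap and the draft-only contract are illustrative, but the shape is the point: reversible, audited, rate-limited, and missing any method that could act without a human signature.

```python
import time

class DraftEmailTool:
    """The one tool unlocked at day ninety: it drafts, it never sends."""

    def __init__(self, max_per_hour=20):   # illustrative cap
        self.max_per_hour = max_per_hour
        self.call_times = []
        self.audit_log = []                 # every call, kept for the reviewer

    def draft(self, case_id, body):
        now = time.time()
        self.call_times = [t for t in self.call_times if now - t < 3600]
        if len(self.call_times) >= self.max_per_hour:
            raise RuntimeError("rate limit reached: the swarm waits, it does not push")
        self.call_times.append(now)
        entry = {"case_id": case_id, "body": body, "status": "awaiting_human_signature"}
        self.audit_log.append(entry)
        return entry   # the human reviews and sends; the tool has no send method
```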
This matters because the failure mode of skipping shadow mode is not that the AI does something catastrophic on day one. The failure mode is subtler. The AI does something almost right on day one, the team waves it through, and a slow drift sets in where nobody is really reviewing because everyone assumes someone else is. Shadow mode forces the review muscle to develop before the AI is given any authority. By the time the swarm is acting, the team already knows what good and bad output looks like, because they have been grading it for three months.
Where this fits in regulated work
The 90-day AI deployment pattern is overkill for a marketing copy generator. It is the right shape for anything where a wrong answer creates a duty: financial services, healthcare, legal, anything carbon-accounted, anything with a regulator who can ask to see your working. At IMPT every booking offsets a tonne of CO₂ on-chain — that is an auditable claim, and the agent layer that will eventually make those bookings cannot be allowed near a live transaction until it has shadowed the human booking team and a compliance review has signed off on the pattern of its decisions, not just the headline accuracy.
The same logic applies to Bro AI, where the stakes are different but no less serious. A grief companion that shadows trained human responders for ninety days, with every reply read and graded by clinical reviewers, is a different product from one that goes live on a launch date because the launch date was announced. We picked 24 April 2026 for a reason. The months before it are the shadow.
What the spreadsheet looks like
If you take nothing else from this, take the spreadsheet. Six columns, with a rough code sketch after the list:
- Input — the case the AI and the human both saw.
- Human output — what the team did.
- AI output — what the swarm would have done.
- Agreement — yes, no, or partial.
- Who was right — human, AI, both, neither, or unclear.
- Category — model error, human error, policy gap, or noise.
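As a data record, roughly (field and file names illustrative; a real spreadsheet works just as well, as long as someone fills it in):

```python
import csv
from dataclasses import dataclass, asdict, fields

@dataclass
class ShadowRow:
    input: str           # the case the AI and the human both saw
    human_output: str    # what the team did
    ai_output: str       # what the swarm would have done
    agreement: str       # yes / no / partial
    who_was_right: str   # human / ai / both / neither / unclear
    category: str        # model_error / human_error / policy_gap / noise

def append_row(path, row: ShadowRow):
    """Append one graded case; write the header only when the file is empty."""
    with open(path, "a", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=[fld.name for fld in fields(ShadowRow)])
        if f.tell() == 0:
            writer.writeheader()
        writer.writerow(asdict(row))
```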
Run it daily. Read it weekly. Theme it monthly. At day ninety this spreadsheet is the document you take to the regulator, the board, or the insurer. It is the evidence that AI risk management was a thing you did, not a thing you said. No dashboard replaces it. No vendor produces it for you. It is the work.
What to do this week
If you are anywhere near deploying AI into regulated work, do not ship anything live this week. Instead, pick one workflow, stand the model up beside the team in shadow mode, and start the spreadsheet. Put a date ninety days out in the calendar and write next to it: "decide whether the swarm has earned a tool." That is what we are doing across IMPT's booking agent, Bro AI's response layer, and the operational systems we are designing now for the Ireland Quantum facility. None of those AIs gets to act until they have spent ninety days proving they would have agreed with the people who already do the job. If that sounds slow, it is. It is also the only deployment pattern I have seen survive contact with a regulator and a real customer base at the same time.