
The shadow-mode rollout explained — week by week


Most AI rollouts fail the same way. Someone signs a contract, a tool gets switched on, staff are told to use it, and within six weeks half the firm is ignoring it and the other half is quietly pasting client data into a public chatbot. The shadow-mode rollout exists because that pattern is avoidable. You run the AI alongside the humans for long enough that you can see what it actually does — on your data, with your edge cases — before a single decision depends on it. This is how I phase the Intelligence Brain into a regulated firm over roughly ninety days, and why each week is shaped the way it is.

What "shadow mode" actually means

Shadow mode is borrowed from autonomous-vehicle engineering. The self-driving stack runs the whole way through — perception, planning, the lot — but its outputs never touch the steering wheel. A human drives. The system's decisions are logged and compared to what the human did. You get full coverage of real-world conditions without the system ever being trusted with consequence.

Applied to a law firm, an accountancy practice, or a medical group, the principle is identical. The AI ingests the same matters, files, emails, and documents the team is already working on. It produces summaries, classifications, drafts, flags, retrieval results — whatever the use case calls for. None of it is sent to clients, none of it is filed, and none of it is used to make a decision. It sits in a parallel pane that only the staff can see, and they mark it: useful, wrong, dangerous, or irrelevant.
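Concretely, the per-output record the reviewers produce could look something like this minimal sketch (the class and field names here are illustrative, not the product's actual schema):

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from enum import Enum
from typing import Optional

class Verdict(Enum):
    USEFUL = "useful"
    WRONG = "wrong"
    DANGEROUS = "dangerous"
    IRRELEVANT = "irrelevant"

@dataclass
class ShadowOutput:
    matter_id: str                      # the live matter the output relates to
    use_case: str                       # e.g. "summary", "triage", "anomaly-flag"
    produced_at: datetime
    verdict: Optional[Verdict] = None   # set by the reviewing fee-earner, never auto-filled

# The output is generated silently; the verdict arrives later, from a human.
record = ShadowOutput("M-1042", "summary", datetime.now(timezone.utc))
record.verdict = Verdict.DANGEROUS
```

The important design choice is that the verdict field starts empty: the system never scores itself, and an unreviewed output is simply data you cannot yet use.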

That mark is the most valuable data you will ever collect about an AI system. It tells you, on your own corpus, where the model earns trust and where it does not. Vendor benchmarks will not tell you this. Demos certainly will not.

Weeks 1–2: ingestion, boundaries, and the read-only baseline

The first fortnight is unglamorous and deliberately so. The Intelligence Brain is installed on-premise or in the firm's own tenancy. Nothing is sent outside the boundary. The first job is connecting it to the document stores and the matter or practice management system (PMS) — read-only, with audit logging on every fetch. If the IT team cannot tell me, by the end of week one, exactly which folders the system can see and which it cannot, we stop and fix that before going further.
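"Read-only, with audit logging on every fetch" can be enforced at a single choke point. A sketch, assuming a filesystem-backed store and purely illustrative paths:

```python
import logging
from pathlib import Path

audit = logging.getLogger("brain.audit")

# Illustrative boundary: the folders the firm has signed off on, and nothing else.
ALLOWED_ROOTS = [Path("/data/matters"), Path("/data/precedents")]

def audited_read(path: str) -> bytes:
    """Read-only fetch: refuse anything outside the declared boundary,
    and log every access attempt whether it succeeds or not."""
    p = Path(path).resolve()
    inside = any(p.is_relative_to(root) for root in ALLOWED_ROOTS)
    audit.info("fetch path=%s allowed=%s", p, inside)
    if not inside:
        raise PermissionError(f"{p} is outside the indexing boundary")
    return p.read_bytes()
```

Because every read funnels through one function, the audit log and the boundary statement describe the same thing, which is what makes the week-one "which folders can it see" question answerable.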

During this period the system is not producing anything user-facing. It is indexing, embedding, and building the internal graph of who works on what, which documents relate to which matters, and which terminology is house style. I want the indexing to settle before any human sees an output, because half-indexed retrieval produces confidently wrong answers, and confidently wrong first impressions are very hard to recover from.

Two artefacts come out of weeks 1–2: a data-flow map showing every source the system reads and every place its outputs land (initially: only its own audit log), and a written boundary statement that the partners or directors sign. That document becomes part of the firm's AI governance file when the regulator eventually asks for it.

Weeks 3–4: silent generation against live work

Now the system starts producing. Whatever the agreed use cases are — matter summaries, file-note drafts, ledger anomaly flags, triage classifications — it generates them on real, current work. The outputs go to a single review pane. The humans doing the actual work see them; nobody else does. Clients see nothing. The matter file sees nothing.

This is the most important phase and the one most often skipped. The temptation is to run pilots on synthetic data or on closed historical matters. Both are misleading. Synthetic data has no edge cases, and closed matters have no time pressure. You need the system seeing the messy, half-finished, ambiguous work that comes through on a Tuesday afternoon when somebody is also on the phone.

The review pane is not a star rating. It is structured: would I have sent this, what would I have changed, did it miss anything material, did it invent anything. The fourth question matters most. Hallucinations in a regulated context are not annoyances; they are the single thing that can end the project. You need to know, by the end of week four, the rate at which the system fabricates and the categories in which it fabricates.
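Tallying the fourth question per use case gives the fabrication rate directly. A minimal sketch with made-up review data (the record shape is hypothetical):

```python
from collections import defaultdict

# Hypothetical review records: (use_case, did_it_invent_anything)
reviews = [
    ("summary", False), ("summary", True), ("summary", False),
    ("triage", False), ("triage", False),
]

counts = defaultdict(lambda: [0, 0])  # use_case -> [fabrications, total reviewed]
for use_case, invented in reviews:
    counts[use_case][0] += int(invented)
    counts[use_case][1] += 1

# Fabrication rate per use case, on the firm's own corpus
rates = {uc: fab / total for uc, (fab, total) in counts.items()}
```

By week four the interesting output is not one overall number but this per-category breakdown: a use case with a nonzero fabrication rate in a material category is a candidate for killing, not tuning.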

Weeks 5–8: pattern analysis and the first scoped trust

By the start of week five there should be a few hundred reviewed outputs across the use cases. Now you stop generating new things and start analysing. Where does the system perform well enough that a human reviewer is mostly rubber-stamping it? Where does it consistently miss? Where is it dangerous — not just wrong, but wrong in a way the reviewer might not have caught without doing the work themselves?

The answers cluster more cleanly than people expect. A retrieval system might be excellent on contracts and poor on email threads. A summarisation pipeline might handle litigation files well and butcher tax advice. An anomaly flagger might catch duplicate invoices reliably and miss timing fraud entirely. These patterns are what you build the deployment around. You do not deploy the AI; you deploy the parts of the AI that earned it.
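"Deploy the parts that earned it" reduces to a per-segment tally against an agreed threshold. A sketch with invented numbers and an illustrative cutoff:

```python
# Hypothetical per-segment review tallies: (use_case, doc_type) -> (useful, total)
tallies = {
    ("retrieval", "contracts"): (92, 100),
    ("retrieval", "email-threads"): (41, 100),
    ("summary", "litigation"): (88, 100),
    ("summary", "tax-advice"): (30, 100),
}

GRADUATE_AT = 0.85  # illustrative threshold, agreed per firm, per use case

# Only the segments that cleared the bar move towards assisted mode.
graduating = sorted(
    seg for seg, (useful, total) in tallies.items()
    if useful / total >= GRADUATE_AT
)
```

The threshold itself is a governance decision, not a technical one; what the shadow phase contributes is the denominator — enough reviewed outputs per segment that the rate means something.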

This is also where you meet the people who will use it daily and decide who is in the first cohort. Not the loudest enthusiast and not the most sceptical partner. Pick the people whose judgement you would trust to overrule the system, because that is exactly what you are asking them to do. The methodology around cohort selection and rollout sequencing is something I've written about separately in the Intelligence Brain methodology notes, but the short version: start with the people who are already careful.

Weeks 9–10: assisted mode for one cohort

Around week nine, for the use cases and the cohort that earned it, the system moves from shadow mode to assisted mode. The difference is small but real: outputs are now visible inside the workflow rather than in a separate pane. A draft appears in the email composer. A summary appears at the top of the matter. A flag appears next to the ledger entry. The human still does everything; the AI is now in their line of sight.

Logging continues at the same density. Every accept, every edit, every reject is captured. You are now collecting a different kind of data: not "is this output any good" but "does having this output change what the human does, and does it change it for the better". Sometimes the answer is no — the AI is accurate but the human ignores it, or the AI is accurate and the human becomes lazy. Both are problems and both show up in the logs if you are watching.
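The accept/edit/reject log is what makes the "human becomes lazy" failure visible. A sketch with hypothetical events and an illustrative complacency heuristic:

```python
from collections import Counter

# Hypothetical assisted-mode events: (reviewer, action), action in accept/edit/reject
events = [
    ("amira", "accept"), ("amira", "edit"), ("amira", "reject"),
    ("ben", "accept"), ("ben", "accept"), ("ben", "accept"),
]

by_user: dict[str, Counter] = {}
for user, action in events:
    by_user.setdefault(user, Counter())[action] += 1

# A near-total accept rate with zero edits can signal rubber-stamping,
# regardless of how accurate the AI actually is.
flagged = [
    u for u, c in by_user.items()
    if c["accept"] / sum(c.values()) > 0.95 and c["edit"] == 0
]
```

A flag here is a conversation starter, not a verdict — the point is that the pattern shows up in the logs at all, rather than surfacing months later as an unreviewed error in a client file.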

Crucially, the rest of the firm is still in shadow mode during these two weeks. You are running two regimes in parallel because you want a control group. If something goes sideways, you can compare cohorts and see whether the AI caused it or whether it was going to happen anyway.

Weeks 11–13: scoped production and the handover

The final phase is widening the assisted mode to the rest of the firm, use case by use case, with a written go/no-go for each. Some use cases will not graduate. That is the point — the methodology is supposed to kill the bad ones. A cautious AI deployment is one where roughly a third of what you originally scoped never ships, and the things that do ship are the things you can defend in front of a regulator, a client, or a court.

Handover to the firm's own people happens here too. The system is theirs; I am not. The audit logs, the boundary statement, the cohort feedback, the use-case decisions, the prompts, the retrieval configurations — all of it lives inside the firm. If I disappeared tomorrow, the firm would still know what their AI was doing and why. That is the difference between a 90-day AI rollout that becomes part of how the firm works and one that becomes a renewal conversation. You can read more about how the Intelligence Brain is structured to support this kind of on-premise, fully auditable deployment if it is useful.

Where to start this week

If you are thinking about AI for a regulated firm, do one thing this week: write down, on a single page, every use case you are considering and what would have to be true for you to trust the system's output unsupervised. Do not list features. List the conditions for trust. If you cannot articulate those conditions, you are not ready to deploy anything yet — and that is genuinely the most useful thing a shadow-mode rollout will tell you. Start there, and the next ninety days plan themselves.

Book a 30-minute assessment

Direct with Michael. No charge. No pitch deck.

Pick a slot →