Most "AI platforms" sold into regulated Irish firms are a thin wrapper around someone else's API, with a vector database bolted on the side and a login screen painted over the top. That's fine for a demo. It falls apart the moment a partner asks where the data sat at 03:14 last Tuesday, or a regulator wants to see the audit trail for a specific inference. The Michael English Intelligence Brain is built backwards from those questions. This article walks through the actual architecture — the layers, the boundaries, the trade-offs — so you can judge it on engineering merit rather than marketing copy.
The design constraint that shapes everything else
Before any architecture diagram, there is one constraint: client data does not leave client infrastructure. Not to OpenAI, not to Anthropic, not to a vendor cloud in Frankfurt, not to a "private endpoint" that is still someone else's tenancy. That single rule cascades through every other decision. It rules out most of the convenient SaaS plumbing the wider AI industry takes for granted. It forces local model hosting, local vector storage, local orchestration, and local observability. It also rules out a lot of clever features that depend on a vendor's hosted retrieval or fine-tuning service.
I made this choice for two reasons. First, the firms I work with — solicitors, accountants, medical practices, schools, public bodies — operate under professional duties and regulatory frameworks (the Data Protection Acts, GDPR, sectoral codes, Law Society guidance, and so on) where "the data left the building" is a sentence that ends careers. Second, on-premises deployment is honest. You can audit it. You can pull the plug on it. You can hand it to a forensic accountant and they can tell you exactly what it did. Hosted AI cannot offer that, no matter what the marketing says.
So the architecture is not a hosted product with an on-prem option. It is an on-prem product with no hosted option.
The four layers
The Intelligence Brain decomposes into four layers, each with a clear contract:
- Ingestion — connectors that pull from the firm's existing systems (file shares, document management, practice management, email archives, accounting ledgers, case management) and normalise them into a canonical document model.
- Knowledge — the indexed representation: chunking, embeddings, a vector store, a keyword index, a graph of entities and relationships, and a metadata catalogue that knows who owns what and who can see what.
- Inference — locally hosted language models (small for routing, larger for synthesis), the retrieval pipeline that feeds them, and the prompt and tool layer that constrains their behaviour.
- Interface — the surfaces a user actually touches: a web UI, a Word/Outlook plug-in pattern, an API for the firm's internal tooling, and an admin console for the partner or practice manager.
The contracts between layers matter more than the layers themselves. Ingestion never talks to inference directly. Knowledge never returns a chunk without its provenance. Inference never sees a document the requesting user is not entitled to see. Interface never trusts the user — every request is re-authorised against the knowledge layer's permission model. These are boring rules, and they are the entire point.
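To make those contracts concrete, here is a minimal sketch in Python. The names and types are illustrative, not the production schema; the point is that provenance and permissions live in the contract itself, so a chunk without provenance cannot exist and an unauthorised chunk cannot be retrieved.

```python
# Illustrative sketch of the layer contracts; names and fields are assumptions.
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Chunk:
    """Knowledge never returns a chunk without its provenance."""
    text: str
    document_id: str                 # pointer back to the source document
    location: str                    # e.g. "clause 4.2" or "page 7 of the scan"
    allowed_groups: frozenset[str]   # ACL mirrored from the originating system


class KnowledgeLayer(Protocol):
    def retrieve(self, query: str, user_groups: frozenset[str]) -> list[Chunk]:
        """Returns only chunks the requesting user is entitled to see."""
        ...


class SynthesisModel(Protocol):
    def synthesise(self, prompt: str) -> str:
        """Sees prompts built from authorised chunks, never raw documents."""
        ...
```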
Ingestion: meeting the firm where it is
No firm I have worked with has tidy data. They have twenty years of Word documents, scanned PDFs of varying quality, an email archive that nobody fully understands, a case management system on its third vendor, and a shared drive that grew like a hedge. Ingestion has to cope with all of it without forcing a migration project that would kill the engagement before it started.
The connector pattern is straightforward: each source system gets a connector that runs inside the client's network, pulls deltas on a schedule, and writes into a staging area. From staging, a normalisation pipeline produces canonical documents — text, structure, metadata, source pointers, ACL information from the originating system. OCR runs over scans, with the original image preserved so a later reviewer can see what the model actually saw. Email threads are reconstructed properly, not flattened into individual messages. Spreadsheets are parsed cell-by-cell where they carry semantic structure, and as flat text where they don't.
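The canonical model itself is small. A rough sketch, with every field name an assumption drawn from the description above rather than the actual schema:

```python
# Illustrative canonical document model produced by the normalisation pipeline.
from dataclasses import dataclass, field
from datetime import datetime


@dataclass
class CanonicalDocument:
    doc_id: str                     # stable identifier assigned at ingestion
    source_system: str              # e.g. "dms", "email-archive", "shared-drive"
    source_pointer: str             # path or record ID in the originating system
    text: str                       # extracted or OCR'd text
    structure: dict                 # headings, clauses, sheets: whatever survives parsing
    metadata: dict                  # author, matter number, dates, document type
    acl_groups: frozenset[str]      # who may see this, copied from the source system
    original_blob_ref: str | None = None   # the scan image, kept for later review
    ingested_at: datetime = field(default_factory=datetime.utcnow)
```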
The non-obvious part is the ACL mirroring. If a document in the DMS is restricted to the conveyancing team, that restriction has to follow it into the knowledge layer, survive chunking, survive embedding, and be enforced at query time. Most RAG tutorials hand-wave this. In a regulated firm it is the whole game.
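Reusing the illustrative types from the sketches above, the mechanics are short enough to state in full: the chunk inherits the document's ACL at chunking time, and the filter runs on every query before anything reaches a model.

```python
# Illustrative ACL mirroring: the restriction follows the content through
# chunking, and is enforced again at query time.
def chunk_with_acl(doc: CanonicalDocument, split) -> list[Chunk]:
    # `split` is a document-type-aware chunker returning (text, location) pairs
    return [
        Chunk(text=piece, document_id=doc.doc_id, location=where,
              allowed_groups=doc.acl_groups)
        for piece, where in split(doc.text)
    ]


def authorised(chunks: list[Chunk], user_groups: frozenset[str]) -> list[Chunk]:
    # A chunk with no overlap with the user's groups never reaches inference,
    # however high it scored on similarity.
    return [c for c in chunks if c.allowed_groups & user_groups]
```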
Knowledge: retrieval that respects the firm
The knowledge layer is where most "AI" projects quietly fail. They embed everything into a single vector store, run cosine similarity, return the top-k chunks, and call it retrieval. It works for a tech demo and falls over on real corpora because legal, accounting and medical documents are not interchangeable bags of prose — they have structure, hierarchy, temporal validity, and conflicting versions.
The Intelligence Brain runs three indices in parallel: a dense vector index for semantic similarity, a sparse keyword index (BM25-family) for exact-match recall on names, citations, statute references and identifiers, and an entity graph that links people, matters, companies, properties and dates. A query hits all three, and a re-ranking step combines the signals before anything goes to a model. The graph is the part that earns its keep on hard questions: "what advice did we give the Murphy estate about the Coolmore site before the 2019 amendment" is a graph query first and a similarity query second.
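As an illustration of that re-ranking step, here is reciprocal rank fusion, one standard way to merge rankings from indices whose scores are not directly comparable. Treat it as a sketch of the idea rather than the exact scoring the system uses.

```python
# Reciprocal rank fusion over the three indices' result lists (illustrative).
from collections import defaultdict


def fuse(ranked_lists: list[list[str]], k: int = 60) -> list[str]:
    """Each inner list is chunk IDs from one index (dense, sparse, graph),
    already ordered by that index's own score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking):
            scores[chunk_id] += 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)


# e.g. candidates = fuse([dense_hits, bm25_hits, graph_hits])[:20]
```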
Chunking is document-type-aware. A statute is chunked by section. A contract by clause. A medical letter by paragraph but with the header always attached. A spreadsheet by logical block. The chunking strategy is part of the connector contract, not a global setting, because there is no global setting that works.
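One natural way to make the strategy part of the connector contract is a registry keyed by document type. The splitters below are deliberately crude, hypothetical sketches; the real ones are far more careful, but the shape is the point.

```python
# Illustrative chunker registry; each connector registers splitters for the
# document types it produces.
import re
from typing import Callable

Splitter = Callable[[str], list[tuple[str, str]]]   # returns (text, location) pairs

CHUNKERS: dict[str, Splitter] = {}


def register(doc_type: str):
    def wrap(fn: Splitter) -> Splitter:
        CHUNKERS[doc_type] = fn
        return fn
    return wrap


@register("statute")
def by_section(text: str) -> list[tuple[str, str]]:
    parts = re.split(r"\n(?=Section \d+)", text)
    return [(p, p.splitlines()[0][:40]) for p in parts if p.strip()]


@register("contract")
def by_clause(text: str) -> list[tuple[str, str]]:
    parts = re.split(r"\n(?=\d+\.\d+\s)", text)      # e.g. "4.2 Assignment"
    return [(p, p.split()[0]) for p in parts if p.strip()]
```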
Inference: small models doing specific jobs
The inference layer is where I have changed my mind most often. The early instinct is to run the largest model you can fit and route everything through it. The mature instinct is to run a small fast model for routing, classification and extraction, and reserve the larger model for synthesis where its capability actually shows up in the output.
A typical request flow: the router classifies intent (lookup, summarise, draft, compare, extract), selects a retrieval strategy, runs retrieval, assembles a prompt with explicit citation requirements, calls the synthesis model, and then a verification step checks that every claim in the output traces to a retrieved chunk. Claims that don't trace get flagged or stripped, depending on the policy the firm has chosen. The model never invents a citation, because the citation layer is deterministic — it is built from the chunks that were actually retrieved, not generated by the model.
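Compressed into a sketch, the flow looks roughly like this. The helpers (crude sentence splitting, word-overlap tracing) are stand-ins for the real verification logic, and every name is illustrative.

```python
# Illustrative request flow: route, retrieve, synthesise, verify, cite.
def build_prompt(question: str, intent: str, chunks: list) -> str:
    evidence = "\n\n".join(f"[{i}] {c.text}" for i, c in enumerate(chunks))
    return (f"Task: {intent}\nQuestion: {question}\n\nEvidence:\n{evidence}\n\n"
            "Answer using only the evidence above. Cite sources as [n].")


def traces_to(claim: str, chunks: list) -> bool:
    # Stand-in check: does most of the claim's vocabulary appear in some chunk?
    words = {w.lower() for w in claim.split() if len(w) > 3}
    return bool(words) and any(
        len(words & set(c.text.lower().split())) >= 0.6 * len(words)
        for c in chunks)


def handle(question: str, user_groups: frozenset, router, knowledge, model,
           policy: str = "flag"):
    intent = router.classify(question)                  # lookup / summarise / draft / compare / extract
    chunks = knowledge.retrieve(question, user_groups)  # already permission-filtered
    draft = model.synthesise(build_prompt(question, intent, chunks))

    checked = []
    for claim in draft.split(". "):                     # crude claim splitting
        if traces_to(claim, chunks):
            checked.append(claim)
        elif policy == "flag":
            checked.append(f"[UNVERIFIED] {claim}")
        # policy == "strip": the untraceable claim is dropped entirely

    # Citations are deterministic: built from what was actually retrieved,
    # never from anything the model generated.
    citations = [(c.document_id, c.location) for c in chunks]
    return ". ".join(checked), citations
```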
Models run on GPU hardware sized to the firm. For smaller practices that means a single workstation-class box. For larger firms, a small cluster. Either way, the inference layer is replaceable: today's open-weights model is not tomorrow's, and the architecture treats the model as a swappable component behind a stable interface. If you want a deeper view of how this is packaged for specific verticals, the Intelligence Brain overview walks through the deployment shapes.
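Behind that stable interface, an adapter for a locally served model is a handful of lines. The endpoint and payload below assume a llama.cpp-style local server and are purely illustrative; the real adapter depends entirely on how the firm serves its models.

```python
# Hypothetical adapter implementing the SynthesisModel protocol from the
# contracts sketch above, against a locally hosted model server.
import json
import urllib.request


class LocalModel:
    def __init__(self, endpoint: str = "http://localhost:8080/completion"):
        self.endpoint = endpoint   # assumed llama.cpp-style server; adjust to the deployment

    def synthesise(self, prompt: str) -> str:
        body = json.dumps({"prompt": prompt, "n_predict": 1024}).encode()
        req = urllib.request.Request(self.endpoint, data=body,
                                     headers={"Content-Type": "application/json"})
        with urllib.request.urlopen(req) as resp:
            return json.loads(resp.read())["content"]
```

Swapping models is then a deployment change, not a code change: the router, the retrieval pipeline and the verification step never see anything but the interface.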
Audit, observability and the things regulators actually ask about
Every request, every retrieval, every model call, every document touched, every permission check — logged, with a stable identifier that lets you reconstruct any single inference months later. This is not a feature bolted on at the end. It is wired through every layer because retrofitting auditability is how you end up with the kind of "we think it probably did X" answers that fail a professional indemnity review.
The audit log is append-only, signed, and stored on the client's infrastructure. A partner can ask "show me everything the system did for matter 2024-0381" and get a complete trace: the documents ingested, the queries run, the chunks retrieved, the prompts sent, the outputs produced, who saw them, and what was done next. That trace is the artefact you hand to a regulator, an insurer, or an internal reviewer when something goes wrong — and something will eventually go wrong, because that is the nature of running real systems against real data.
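A hash-chained log is one way to get the tamper evidence that makes "append-only" mean something. The sketch below uses an HMAC purely to illustrate the idea; it is not a description of the production signing scheme.

```python
# Illustrative append-only, tamper-evident audit log: each record carries a
# hash of the previous record, so any later edit breaks the chain.
import hashlib
import hmac
import json
import time


class AuditLog:
    def __init__(self, key: bytes):
        self._key = key
        self._records: list[dict] = []
        self._prev_hash = "0" * 64

    def append(self, request_id: str, event: str, detail: dict) -> None:
        record = {"ts": time.time(), "request_id": request_id,
                  "event": event, "detail": detail, "prev": self._prev_hash}
        payload = json.dumps(record, sort_keys=True).encode()
        record["sig"] = hmac.new(self._key, payload, hashlib.sha256).hexdigest()
        self._prev_hash = hashlib.sha256(payload).hexdigest()
        self._records.append(record)

    def trace(self, request_id: str) -> list[dict]:
        """Everything the system did for one request, in order."""
        return [r for r in self._records if r["request_id"] == request_id]
```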
Where to start this week
If you are a partner or practice manager weighing this up, do one concrete thing this week: pick a single, well-bounded internal question your team asks repeatedly — a precedent lookup, a client history summary, a compliance check — and write down what a good answer looks like, including the sources it would cite. That document is the spec. From there, the architecture above is just the means of producing that answer reliably, on your hardware, with a trail you can defend. Everything else is implementation detail.