People ask me this every week. "Mike, which model do you use in the Intelligence Brain — Claude or GPT? Are you running open-source? What about the new one that came out yesterday?" The honest answer is that the question itself is wrong. You don't pick a model. You pick a routing strategy, a privacy posture, and a fallback path, and you let the workload tell you which model belongs where. Below is how I actually think about AI model choice when I'm building the Intelligence Brain for a regulated Irish firm.
The model is not the product. The pipeline is.
The first thing I tell founders and CTOs is to stop benchmark-chasing. The leaderboard you read on a Tuesday is irrelevant by Friday. What matters is whether the system in front of your users does the right thing on your data, with your latency budget, under your compliance constraints. The model is a component. The pipeline — retrieval, prompt construction, tool use, validation, audit logging, fallback — is the product.
In the Intelligence Brain, the model layer is deliberately swappable. I treat the LLM as I'd treat a database driver: an interface, with concrete implementations I can substitute without touching the surrounding code. That means a Claude call, a GPT call, and a local Llama call all return the same shape, go through the same redaction layer on the way out, and write to the same audit log on the way back. If a model gets deprecated, retired, or repriced overnight, I change a config line, not an architecture.
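Concretely, that interface looks something like this. This is a minimal sketch, not the actual Intelligence Brain code; the provider classes and the `CompletionResult` fields are illustrative:

```python
# Sketch of a swappable model layer. Provider classes and fields are
# illustrative; real implementations would call the vendor SDK or a
# local inference server, then pass output through redaction + audit.
from dataclasses import dataclass
from typing import Protocol


@dataclass
class CompletionResult:
    text: str          # model output, post-redaction
    provider: str      # which backend produced it
    model_id: str      # exact model version, recorded in the audit log


class ModelProvider(Protocol):
    def complete(self, prompt: str) -> CompletionResult: ...


class ClaudeProvider:
    def complete(self, prompt: str) -> CompletionResult:
        # a real implementation would call the Anthropic API here
        return CompletionResult(text="...", provider="claude", model_id="claude-x")


class LocalLlamaProvider:
    def complete(self, prompt: str) -> CompletionResult:
        # a real implementation would hit a local inference server here
        return CompletionResult(text="...", provider="local", model_id="llama-70b")


PROVIDERS: dict[str, ModelProvider] = {
    "claude": ClaudeProvider(),
    "local": LocalLlamaProvider(),
}


def ask_llm(prompt: str, provider: str = "claude") -> CompletionResult:
    """One entry point. Swapping models is a config change, not a rewrite."""
    return PROVIDERS[provider].complete(prompt)
```

Everything upstream calls `ask_llm` and nothing else; the config line I mentioned is just the default value of that `provider` parameter.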
This sounds obvious. It is not what most teams do. Most teams hardcode one provider's SDK into their application logic, then panic when that provider has an outage or changes terms. Don't be that team.
Claude vs GPT — what they're actually good at
I use both. They are not interchangeable, and the differences matter once you're past toy demos.
Claude is, in my experience, better at long-document reasoning where you need the model to hold context across a hundred pages and not lose the thread. It's calmer about admitting it doesn't know something. For a legal or accounting workflow where I'm asking "summarise the indemnity provisions across these twelve contracts and flag inconsistencies", Claude tends to produce output I can actually defend in front of a partner. It also handles structured output instructions more obediently — when I say "return JSON with these exact fields, no prose", it does that.
GPT (the frontier OpenAI models) is generally faster on shorter, more transactional tasks, and its tool-use ecosystem is more mature if you're orchestrating function calls across multiple internal APIs. For "classify this email, extract the invoice number, route to the right inbox", GPT is usually my pick. In practical terms, its image and document parsing has also tended to be ahead.
The Claude vs GPT debate is mostly tribal noise. The real question is: which one for which step in your pipeline? In a typical Intelligence Brain deployment I might use Claude for the synthesis step, GPT for the classification step, and a small local model for the embedding and re-ranking steps. Three models, one pipeline, each doing what it's good at.
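As config, that three-model split might look like this. The step names, provider names, and model names below are placeholders for illustration, not the shipped Intelligence Brain defaults:

```python
# Illustrative step-to-model mapping: each pipeline step names the model
# best suited to it. All values here are placeholders.
PIPELINE = {
    "embed":     {"provider": "local",     "model": "small-embedder"},
    "rerank":    {"provider": "local",     "model": "small-reranker"},
    "classify":  {"provider": "openai",    "model": "gpt-fast"},
    "synthesis": {"provider": "anthropic", "model": "claude-long-context"},
}


def model_for(step: str) -> str:
    """Resolve a pipeline step to its provider/model pair."""
    cfg = PIPELINE[step]
    return f'{cfg["provider"]}/{cfg["model"]}'
```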
Open source — when it's the right answer in Ireland
Open-source models — Llama, Mistral, Qwen, the various fine-tunes on Hugging Face — have become genuinely viable for production work in the last eighteen months. For a certain class of Irish customer, they're not just viable, they're the only option that passes the compliance review.
If you're a credit union, a medical practice, a state body, or a law firm with sensitive client data, the conversation about sending content to a US-hosted API is short and uncomfortable. The DPC, your insurer, and your clients all want to know where the data went, who held it, and for how long. "It went to a US LLM provider under their data processing terms" is an answer that sometimes works and sometimes doesn't. "It never left this server in Clonmel" is an answer that always works.
That's the case for open source AI in Ireland. Not because the models are better — frontier closed models are still ahead on raw capability — but because the deployment story is unbeatable for regulated workloads. A 70B-class open model running on a single workstation with a decent GPU will do most enterprise summarisation and Q&A tasks at a quality level that surprises people who haven't tried it lately. For document classification, entity extraction, and retrieval-augmented generation against your own corpus, you do not need a frontier model. You need a competent one that lives inside your firewall.
The trade-offs are real. You take on operational burden. You need someone who can run inference servers, manage GPU memory, deal with quantisation, and update models when better ones drop. That's part of what I do for customers in the Intelligence Brain product line — the on-prem deployment is the point, not an afterthought.
Routing — the hard engineering bit
Once you accept that you'll use multiple models, you need a router. This is where the engineering gets real.
A model router takes an incoming request and decides which model handles it. The decision can be based on:
- Sensitivity — does this request contain PII, client data, health information? If yes, local model only.
- Complexity — is this a one-line classification, or a multi-step reasoning task? Cheap model versus expensive model.
- Latency budget — is the user waiting on a screen, or is this a background job? Sync versus async paths.
- Cost ceiling — has this tenant already burned through their monthly inference budget? Downgrade path.
- Availability — is the primary provider returning 5xx errors right now? Fallback path.
I implement this as a small rules engine in front of the model interface. Nothing fancy. A YAML config per tenant, a handful of Python functions, a circuit breaker, and structured logging on every routing decision so I can explain after the fact why a particular query went to a particular model. That last bit — explainability of routing — is what regulators actually ask about when they ask about "AI governance". They don't care whether you used Claude or GPT. They care whether you can tell them which one you used, and why, on a specific request, on a specific day.
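Here's a cut-down sketch of what that rules engine looks like. Field names, model names, and rule order are invented for illustration; the real thing reads per-tenant YAML and sits behind a circuit breaker:

```python
# Sketch of a routing rules engine. Every decision is logged with a
# reason, so a specific query on a specific day can be explained later.
# All field names and model names are illustrative.
import logging
from dataclasses import dataclass

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("router")


@dataclass
class Request:
    text: str
    contains_pii: bool = False
    complexity: str = "low"       # "low" or "high"
    budget_exhausted: bool = False


def route(req: Request, primary_healthy: bool = True) -> str:
    """Return a model name, and log a structured reason for the choice."""
    if req.contains_pii:
        decision, reason = "local-llama", "sensitivity: PII stays on-prem"
    elif not primary_healthy:
        decision, reason = "fallback-gpt", "availability: primary returning 5xx"
    elif req.budget_exhausted:
        decision, reason = "cheap-model", "cost: tenant budget exhausted"
    elif req.complexity == "high":
        decision, reason = "claude", "complexity: multi-step reasoning"
    else:
        decision, reason = "gpt-mini", "default: short transactional task"
    log.info("routed to %s (%s)", decision, reason)
    return decision
```

Note the ordering: sensitivity outranks everything else, so a PII-bearing request never reaches a downgrade or fallback path that leaves the building.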
Evaluation — stop trusting benchmarks, start trusting your own evals
The single highest-leverage thing you can build for any LLM system is an internal evaluation harness. Not LMSYS, not MMLU, not whatever leaderboard launched this month. Your own evals, on your own data, scoring the things your users actually care about.
For each customer deployment of the Intelligence Brain, I build a small set of golden examples — usually a few dozen real queries with the answers a human expert would give. When a new model lands, or when I'm considering swapping a step in the pipeline, I run the eval set against the new configuration and compare. The output is a boring spreadsheet. The boring spreadsheet is what lets me confidently tell a managing partner "yes, we can move from model A to model B next month, and here's the regression test that proves it won't break the contract review workflow".
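The harness itself is deliberately boring. A sketch, with exact-match scoring standing in for the more nuanced scoring a contract-review workflow actually needs:

```python
# Minimal golden-example eval loop. score() is a trivial exact-match
# placeholder; real deployments score with expert-written rubrics.
def score(expected: str, actual: str) -> float:
    return 1.0 if expected.strip().lower() == actual.strip().lower() else 0.0


def run_evals(golden: list[dict], ask) -> float:
    """golden: [{"query": ..., "expected": ...}]; ask: callable query -> answer.
    Returns the mean score across the golden set."""
    results = [score(ex["expected"], ask(ex["query"])) for ex in golden]
    return sum(results) / len(results)
```

Run `run_evals` once per candidate configuration and the comparison of the returned scores is the boring spreadsheet.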
Without your own evals, you are vibing. Vibing is fine for a demo and unacceptable for a production system that a firm has bet its compliance posture on.
Cost, lock-in, and the boring durability question
Frontier model pricing has fallen consistently and will keep falling. I don't optimise heavily for cost on day one, because whatever I pay today is likely to be cheaper in a year. What I do optimise for is lock-in avoidance. Every prompt I write is portable. Every tool definition follows a normalised schema I translate at the edge into whichever provider's format is needed. Every embedding I generate is stored alongside the model identifier and version that produced it, so I can re-embed when I migrate.
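The embedding point is worth making concrete. A sketch of the schema idea, with illustrative field names; the key is that every vector row carries enough provenance to detect staleness during a migration:

```python
# Sketch: store each embedding alongside the model and version that
# produced it, so a migration can find and re-embed stale vectors.
from dataclasses import dataclass


@dataclass
class StoredEmbedding:
    doc_id: str
    vector: list[float]
    embed_model: str     # model family that produced the vector
    embed_version: str   # exact version, for staleness checks


def needs_reembed(row: StoredEmbedding, current_model: str,
                  current_version: str) -> bool:
    """True if this row was embedded by anything other than the current model."""
    return (row.embed_model, row.embed_version) != (current_model, current_version)
```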
Durability over cleverness. The Intelligence Brain customers I have today will still be running it in five years. The model they're using in five years almost certainly does not exist yet. The architecture has to absorb that.
Where to start this week
If you're building anything serious with LLMs and you've been picking one model and hoping, do this in the next seven days: write an interface in front of your model calls so you have one function for "ask the LLM" with a provider parameter. Pick ten real queries from your actual users or workflows. Run them through Claude, through GPT, and through one local open-source model. Score the answers yourself. You will learn more in that one afternoon than from a month of reading benchmark posts. That spreadsheet is the foundation of every grown-up AI system I've ever built — and it's the same place I'd start if I were standing your Intelligence Brain up tomorrow.
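That afternoon exercise fits in a dozen lines. A sketch, where the provider callables stand in for real Claude, GPT, and local calls behind your single "ask the LLM" function:

```python
# Sketch of the week-one bake-off: run each query through every provider
# and emit a CSV to score by hand. Provider callables are stand-ins.
import csv
import io


def bake_off(queries, providers):
    """providers: {name: callable(query) -> answer}.
    Returns CSV text with one row per query and one column per provider."""
    buf = io.StringIO()
    writer = csv.writer(buf)
    writer.writerow(["query"] + list(providers) + ["winner (fill in)"])
    for q in queries:
        writer.writerow([q] + [fn(q) for fn in providers.values()] + [""])
    return buf.getvalue()
```

Open the output in a spreadsheet, score each row yourself, and you have the first version of your eval set.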