
The prompt layer of the Intelligence Brain — system prompts and tool definitions


Most teams I talk to think the hard part of building an AI product is the model. It isn't. The model is a commodity you rent. The hard part — the part that decides whether your system is useful, safe, and defensible — is the prompt layer. That's the system prompts, the tool definitions, the schemas, and the routing logic that sit between a user's question and whatever the model eventually produces. Get that layer right and a mid-tier model will outperform a frontier model wired up badly. Get it wrong and no amount of GPU spend will save you.

Why the prompt layer is the real product

When I started building the Intelligence Brain, I expected the differentiation to come from clever retrieval or fine-tuning. After enough nights debugging hallucinations and tool misfires, I changed my mind. The prompt layer is where your domain knowledge actually lives in code. It's where you encode the firm's policies, the regulator's expectations, the house style, the escalation rules, and the boundaries of what the assistant can and cannot do.

A model is generic. Your prompt layer is what makes it yours. If a competitor copied your model choice, your vector database, and your UI, but not your prompts and tool definitions, they wouldn't have your product. That's the asset. Treat it like one.

This is also where regulated firms in Ireland and the UK get tripped up. They buy a chatbot, point it at SharePoint, and discover six weeks later that it's confidently citing a 2019 policy that was superseded twice. The fix is almost never "buy a better model". The fix is in the prompt layer.

System prompts: structure, not poetry

A system prompt is not a creative writing exercise. It's a contract. The model is the contractor. Your system prompt tells it: who you are, what you can do, what you must not do, what shape your output must take, and what to do when things go wrong.

I structure system prompts in the Intelligence Brain into roughly six blocks, in this order (a skeleton example follows the list):

  • Identity — one or two lines. "You are the assistant for [firm]. You help staff with [scope]." No personality flourishes. Personality leaks into output and burns tokens.
  • Operating context — what data sources are available, what tools exist, what the user's role is, what the current date is. The model has no clock and no memory across sessions; you give it both.
  • Hard rules — the things that must never happen. "Never give legal advice." "Never quote a price." "Never email externally without confirmation." Frame these as prohibitions, not preferences. Models follow "must not" better than "try not to".
  • Soft rules — house style, tone, formatting preferences, citation requirements.
  • Output contract — exactly what the response should look like. If you want JSON, say JSON and give a schema. If you want prose with citations, give an example.
  • Failure handling — what to do when the answer isn't in the retrieved context, when a tool errors, when the question is out of scope.
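
Put together, a skeleton of that contract looks something like this. The firm name, scope, tool names, and placeholder slots are illustrative, not the production prompt:

```python
# A minimal six-block skeleton. The firm, scope, tool names, and the {today} /
# {user_role} slots are illustrative assumptions, not the real system prompt.
SYSTEM_PROMPT = """\
## Identity
You are the internal assistant for Example Firm. You help staff with questions
about client matters and first drafts of routine documents. Nothing else.

## Operating context
Today's date: {today}. User role: {user_role}.
Available tools: search_client_matters_by_reference, policy_lookup, draft_letter.

## Hard rules
- You must not give legal advice.
- You must not quote a price.
- You must not send anything externally without explicit user confirmation.

## Soft rules
Plain English, UK spelling. Cite the source document for every factual claim.

## Output contract
A short answer, then a "Sources" list of document references. If nothing was
retrieved, the Sources list says "none".

## Failure handling
If the answer is not in the retrieved context, say so and name who to ask.
If a tool call fails, report the failure. Do not guess.
"""
```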

The mistake I see most often is mixing these blocks together — sticking a hard rule halfway down a paragraph about tone. Models attend to position. Put the non-negotiables near the top and at the bottom. The middle of a long prompt is where instructions go to die.

Tool definitions: the model is only as good as its API

Tool definitions — function calling, in OpenAI's terminology — are where most teams underspend their attention. A tool definition is a schema you hand to the model that says "here is something you can do, here are its parameters, here is what each parameter means." The model decides when to call the tool and what to put in the arguments.

The quality of the tool description directly determines the quality of the call. Three things I've learned the hard way:

Names matter more than you think. A tool called search will be called for everything. A tool called search_client_matters_by_reference will be called when the user mentions a reference number. The model uses the name as the first signal of relevance. Be specific.

Descriptions should describe when to use, not just what it does. Bad: "Searches the document store." Good: "Use this when the user asks about a specific document, contract, or matter file. Do not use for general policy questions — use policy_lookup for those." You're writing for a reader who will scan, not study.

Parameter descriptions are prompts too. Every parameter description is a chance to constrain the model's behaviour. "client_reference: the client reference code, format ABC-12345. If the user gives a name instead, ask for the reference before calling this tool." That sentence prevents an entire class of bad calls.
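
Here is what those three lessons look like rolled into a single definition, in the shape OpenAI-style function calling expects. The name, description, and reference format are the illustrative ones from above, not a schema lifted from the product:

```python
# Illustrative tool definition in the OpenAI function-calling shape. The name,
# wording, and ABC-12345 reference format are examples, not the production schema.
SEARCH_CLIENT_MATTERS = {
    "type": "function",
    "function": {
        "name": "search_client_matters_by_reference",  # specific, not just "search"
        "description": (
            "Use this when the user asks about a specific document, contract, or "
            "matter file and has a client reference. Do not use for general policy "
            "questions; use policy_lookup for those."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "client_reference": {
                    "type": "string",
                    "description": (
                        "The client reference code, format ABC-12345. If the user "
                        "gives a name instead of a reference, ask for the reference "
                        "before calling this tool."
                    ),
                }
            },
            "required": ["client_reference"],
        },
    },
}
```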

I keep tool definitions in a separate registry, version-controlled, with the same review process as application code. They are application code. Treat changes as a release event with a changelog and a regression test.
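
In practice the registry can be as simple as one dictionary in one module, with agents resolving their tool lists by name. A sketch, reusing the definition above:

```python
# Sketch of a version-controlled registry module. SEARCH_CLIENT_MATTERS is the
# definition from the previous example; the commented entries are placeholders.
TOOL_REGISTRY: dict[str, dict] = {
    "search_client_matters_by_reference": SEARCH_CLIENT_MATTERS,
    # "policy_lookup": POLICY_LOOKUP,
    # "draft_letter": DRAFT_LETTER,
}

def resolve_tools(names: list[str]) -> list[dict]:
    """Look up tool definitions by name. An unknown name raises KeyError
    instead of silently shipping the wrong tool list."""
    return [TOOL_REGISTRY[name] for name in names]
```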

Routing and composition: when one prompt isn't enough

For anything beyond a toy assistant, a single system prompt is insufficient. You'll end up with a router — a small first-pass call that classifies the request and dispatches it to a specialised prompt with its own tool set.

In the Intelligence Brain product layer, I split the work into a thin router and a set of focused agents. The router does one thing: decide which agent should handle this. Each agent has a tightly scoped system prompt and a tool list it can actually use. A policy agent doesn't need access to the email tool. A drafting agent doesn't need access to the audit log query tool.
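
Stripped to the bone, the pattern looks roughly like this. It assumes the OpenAI Python SDK; the agent names, prompts, models, and empty tool lists are placeholders, not the production setup:

```python
# A stripped-down router/agent split, assuming the OpenAI Python SDK. Agent names,
# prompts, model choices, and tool lists are illustrative, not the production set.
from openai import OpenAI

client = OpenAI()

AGENTS = {
    "policy":  {"system": "You answer internal policy questions. Always cite the policy.",
                "tools": []},   # e.g. resolve_tools(["policy_lookup"])
    "matters": {"system": "You answer questions about specific client matters.",
                "tools": []},   # e.g. resolve_tools(["search_client_matters_by_reference"])
}

ROUTER_PROMPT = (
    "Classify the user's request as exactly one of: policy, matters. "
    "Reply with the label only."
)

def route(user_message: str) -> str:
    """Thin first-pass call: cheap model in, one label out, default path for anything odd."""
    reply = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": ROUTER_PROMPT},
                  {"role": "user", "content": user_message}],
    )
    label = reply.choices[0].message.content.strip().lower()
    return label if label in AGENTS else "policy"

def handle(user_message: str) -> str:
    """Dispatch to the chosen agent with its own scoped prompt and tool list."""
    agent = AGENTS[route(user_message)]
    kwargs = {"tools": agent["tools"]} if agent["tools"] else {}
    reply = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "system", "content": agent["system"]},
                  {"role": "user", "content": user_message}],
        **kwargs,
    )
    return reply.choices[0].message.content
```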

This matters for three reasons. First, smaller tool lists produce more accurate tool selection — models get noticeably worse at choosing tools once you cross fifteen or twenty of them. Second, smaller system prompts leave more context window for actual work. Third, scoped agents are auditable. When something goes wrong, you can point at exactly which agent ran and why.

The trade-off is latency. Each routing step is another round trip. You can mitigate that with parallel tool calls, speculative routing, or by collapsing simple cases into a default path. But don't optimise latency before correctness. A fast wrong answer is worse than a slow right one, especially in regulated work.

Testing prompts like code

If your prompt layer is your product, then your prompt tests are your most important test suite. I use a mix of three things (a golden-test sketch follows the list):

  • Golden tests — fixed inputs with expected outputs or expected tool calls. Run on every prompt change. Catches regressions.
  • Adversarial tests — inputs designed to break the rules. Prompt injection attempts, out-of-scope questions, ambiguous references, things the model is likely to handle wrong. These are the tests that build trust.
  • Live sampling — a small percentage of real production traffic, reviewed by humans, fed back into the test set when interesting cases appear.
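
Golden tests are where I'd start. A minimal harness, assuming pytest and the route() sketch above, is only a few lines; the questions and expected labels are made up:

```python
# Minimal golden-test sketch, assuming pytest and the route() function from the
# routing sketch above, here assumed to live in a router.py module.
import pytest

from router import route  # hypothetical module holding the routing sketch

GOLDEN_ROUTING_CASES = [
    ("What is our policy on client gifts?", "policy"),
    ("Pull up matter ABC-12345 for me", "matters"),
    ("What does the file for ABC-99231 say about the lease term?", "matters"),
]

@pytest.mark.parametrize("question,expected_agent", GOLDEN_ROUTING_CASES)
def test_router_golden(question, expected_agent):
    # The model underneath is non-deterministic, so run this on every prompt
    # change and treat a newly flaky case as a signal, not noise.
    assert route(question) == expected_agent
```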

The thing nobody warns you about: prompts are non-deterministic, and small changes have non-local effects. A wording tweak in the identity block can change tool-call behaviour in ways that are hard to predict. This is why you need the test harness. You cannot reason your way to a good prompt. You measure your way there.

For prompt engineering work in Ireland and the UK, where firms have to evidence their controls to a regulator, this test discipline is also your audit story. "We changed this prompt on this date, here are the tests that ran, here is the human review, here is the rollback plan." That's a defensible posture. "Sarah tweaked the chatbot last Tuesday" is not.

What to keep out of the prompt layer

A quick list of things that don't belong in a system prompt, even though they often end up there:

  • Secrets. API keys, passwords, internal URLs that aren't meant to be reflected back. Assume anything in the prompt can leak.
  • Per-request data. The current user's name, today's matters, the document they're looking at — these go in the user message or as tool results, not the system prompt (see the sketch after this list). The system prompt is the stable contract; per-request data is the variable.
  • Long policy documents. If your "system prompt" is fifteen pages of the firm handbook, you've built a retrieval problem and disguised it as a prompt problem. Move the handbook to the vector store and let the model fetch the relevant section.
  • Examples that contradict your rules. Few-shot examples are powerful and dangerous. The model will copy patterns from examples even when they break the explicit rules above them. Audit your examples as carefully as your rules.
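
On the per-request point, the split is easiest to see in code: the system prompt is the same template on every call, and the variables ride in the user turn. Names here are made up:

```python
# Sketch of the stable-contract / variable-data split. SYSTEM_PROMPT is the
# skeleton template from earlier; everything passed in here is per-request data.
from datetime import date

def build_messages(user_name: str, document_title: str, user_role: str, question: str) -> list[dict]:
    return [
        # The contract is the same template every time; only its context slots vary.
        {"role": "system", "content": SYSTEM_PROMPT.format(
            today=date.today().isoformat(), user_role=user_role)},
        # Per-request data rides in the user message, not in the contract.
        {"role": "user", "content": (
            f"User: {user_name}\n"
            f"Open document: {document_title}\n\n"
            f"Question: {question}"
        )},
    ]
```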

Where to start this week

If you're building or buying an AI system for a regulated firm, do this before you do anything else: get the system prompt and the full list of tool definitions on a single screen, in front of the people who actually own the risk. Read them out loud. If a paragraph doesn't earn its place, cut it. If a tool doesn't have a clear "when to use" sentence, write one. If there's no failure-handling block, add it. That single hour will move your product more than a week of model tuning. The prompt layer is where the work is. Start there.

Book a 30-minute assessment

Direct with Michael. No charge. No pitch deck.

Pick a slot →