Prompt Harder Isn't a Strategy
A few weeks ago I overheard a man telling a woman she only needed ChatGPT to do all her research. Just tell it to cite sources, don’t hallucinate, he said. I had to interject.
His advice was a version of an idea I keep meeting in the wild: that AI’s shortcomings are a prompt problem. That with a long enough system prompt, a clever enough preamble, the right fine-tune on the right examples, the model will get it right. Lawyers pre-prompting their way to draft contracts. Researchers leaning on the model for source-grounded analysis. Engineering teams that have stopped writing tests because the agent writes the code. You just need a better prompt.
It’s the most confident advice in AI right now. It’s also a category error.
LLMs are probabilistic. That isn’t a slogan; it’s the mathematics. At every token the model produces a probability distribution over what comes next, and the output is drawn from it. A longer prompt narrows that distribution; it doesn’t replace it with a function guaranteed to return the right answer. Even with sampling turned off, you get the most likely continuation, not a verified one. The chance the next word is wrong falls. It does not fall to zero. Better prompts and better models change the rate at which the model is wrong. They don’t change the kind of system it is.
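A rough sketch of that arithmetic, in Python. The per-token error rates and the 500-token answer length are illustrative assumptions rather than measurements, and treating each token as independent is a simplification. The point is only the shape of the curve: a lower rate shrinks the chance of an error somewhere in the answer, but never removes it.

```python
def p_at_least_one_error(per_token_error: float, n_tokens: int) -> float:
    """Chance that at least one of n_tokens is wrong, assuming independent tokens."""
    return 1.0 - (1.0 - per_token_error) ** n_tokens


# Illustrative rates only: each step is a 10x "better prompt" or "better model".
for rate in (0.01, 0.001, 0.0001):
    chance = p_at_least_one_error(rate, 500)
    print(f"per-token error {rate}: chance of an error in 500 tokens = {chance:.1%}")
```

Under these assumptions, a hundredfold improvement in the per-token rate still leaves roughly a one-in-twenty chance that a 500-token answer contains a mistake.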
Last week was my turn on the incident management rota. I’d built a workflow I was quietly proud of: a long-running Claude chat thread, Opus 4.7 on high effort, MCP servers (still in beta) wired into our incident and logging tools for live data, every monitor and stack trace from the week fed into the same context. As much grounding as I could reasonably give it.
It hallucinated. Twice. The same fabricated detail both times — confident, plausible, completely invented. I caught it the first time because something didn’t smell right and I probed for evidence. I caught it the second time because I was already looking for it. If I hadn’t been suspicious, that line goes into my incident report. Stakeholders read it. Decisions get made on it.
And that was with the best model I have, the best context I could assemble, and live tool access to ground-truth data.
If you want a sharper version of the same point, look at what security engineer Ron Stoner did last month, as covered in The Register. He spent $12 on a domain, 6nimmt.com, made a single Wikipedia edit citing it as a source, and convinced several frontier LLMs that he was the 2025 world champion of a German card game with no championship. The chatbots dutifully cited his domain back to him. “My site has no independent corroboration,” Stoner wrote. “It’s totally made up.”
The model can’t tell your sources from his. It pattern-matches retrieved text against what it has seen before, with no sense of whether the page was authored by a standards body or by someone bored on a Tuesday morning with a coffee.
There’s a comic version too. The Instagram account @sergiocilli posts AI-generated audition reels that are recognisably generative and played for humour. On one reel, a self-described prompting expert stepped in to help fix the output. Pages of rewritten prompts later, it was no better and noticeably weirder.
To be clear: the models are getting better. Outputs improve every few months. A year ago, the failure rate on my incident report would have been worse, and best-in-class answers were further out of reach than they are now. None of that’s in question.
What changes more slowly is the shape of the failure. The wrong answer still arrives looking exactly like the right one — same tone, same confidence, same citations. There’s no banner that lights up red. It’s a probabilistic system, and its mistakes still look like its correct answers.
So the interesting question isn’t “how do we prompt our way to better answers?” It’s “what fails loudly when the answer is wrong?”
Software engineering has been answering that question for decades. We don’t trust ourselves either. We build CI pipelines, type checks, unit tests, integration tests, golden tests. Those are walls. They don’t get smarter as the codebase grows; they get more unforgiving. They are the part of the system that doesn’t lie.
And here’s the part that flips the script: AI is exceptionally good at building those walls. Tests are syntactically heavy, boilerplate-rich, and context-bound — exactly the shape of work that maps cleanly onto an LLM with access to the codebase and the acceptance criteria. We use the probabilistic tool to build the deterministic scaffolding that catches it when it’s wrong. That isn’t a contradiction; it’s a division of labour.
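As a minimal sketch of what such a wall looks like, here is a small helper of the kind an agent might write, with the checks that gate it. The function name `parse_duration_seconds` and its input format are hypothetical, invented for this illustration. What matters is that the checks return the same verdict on every run, whoever or whatever wrote the code.

```python
import re


def parse_duration_seconds(text: str) -> int:
    """Hypothetical agent-written helper: '1h30m' -> 5400, '45s' -> 45."""
    units = {"h": 3600, "m": 60, "s": 1}
    parts = re.findall(r"(\d+)([hms])", text)
    # Reject input that contains anything the pattern did not consume.
    if not parts or "".join(n + u for n, u in parts) != text:
        raise ValueError(f"unrecognised duration: {text!r}")
    return sum(int(n) * units[u] for n, u in parts)


def run_checks() -> None:
    """Deterministic checks: same input, same verdict, every run."""
    assert parse_duration_seconds("45s") == 45
    assert parse_duration_seconds("1h30m") == 5400
    for bad in ("", "90", "1h 30m", "soon"):
        try:
            parse_duration_seconds(bad)
        except ValueError:
            continue
        raise AssertionError(f"accepted bad input: {bad!r}")


run_checks()
print("all checks passed")
```

If a later change breaks the parser, `run_checks` raises and the pipeline stops. Nothing about that depends on how confident the change looked.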
Where it gets harder is outside code. Law has rules, but the rules are interpreted — by lawyers, by judges, by precedent. There’s no cargo test for a contract. Research has rules too, but the closest analogue to a test pipeline is source verification — and Stoner’s $12 domain ought to tell you how robust that is when the verifier is also the model.
For those domains the answer is a human in the loop, narrower scope, smaller pieces. The point isn’t that deterministic walls exist for everything. It’s that where they can be built, they’re the answer — and where they can’t, a better prompt isn’t either.
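One example of a narrow wall that can be built outside code: checking that every direct quote in a draft appears in source text you already hold. This is a sketch under stated assumptions. The function names, the log line, and the draft are all invented for illustration. It does nothing about whether a source deserves trust, so Stoner’s domain would pass it; it only catches quotes that appear nowhere in the material supplied.

```python
import re


def normalise(text: str) -> str:
    """Straighten curly quotes, collapse whitespace, lowercase."""
    text = text.replace("\u201c", '"').replace("\u201d", '"').replace("\u2019", "'")
    return re.sub(r"\s+", " ", text).strip().lower()


def unsupported_quotes(draft: str, sources: list[str]) -> list[str]:
    """Return quoted passages in the draft that appear in none of the sources."""
    haystacks = [normalise(s) for s in sources]
    quotes = re.findall(r'"([^"]+)"', normalise(draft))
    return [q for q in quotes if not any(q in h for h in haystacks)]


# Invented example data: one log line and a two-sentence draft report.
sources = ["2024-05-01 12:03:11 ERROR payment-api: connection pool exhausted (max=50)"]
draft = (
    'The logs show "connection pool exhausted (max=50)" at 12:03. '
    'The service then reported "disk quota exceeded on node-7".'
)

missing = unsupported_quotes(draft, sources)
if missing:
    print("unsupported quotes:", missing)
```

In a real workflow the last step would raise or exit non-zero instead of printing, so the draft cannot move forward until a person has looked at the flagged lines.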
If you’re a CTO whose AI consultants are telling you the team just needs better prompts, here’s the question to put back to them: what fails loudly when the model is wrong? If nothing does, you have a probabilistic process pretending to be a deterministic one. Dressing one up as the other is the bet you’re making, knowingly or otherwise.
If you’d like a second pair of eyes on where that bet is sitting in your workflow, that’s a conversation I have most weeks.