How we build AI systems that actually run.
An essay on why most AI projects stall, the four principles we engineer by, and what an engagement with us actually looks like.
Most AI projects do not fail for the reason people expect. The model was fine. The prompt was fine. The demo worked. Something quieter went wrong — usually months earlier, and almost always in a part of the stack no one called AI.
We started Trace because we kept seeing the same pattern. A company runs a hackathon, builds a convincing prototype, shows the board, and commissions a build. Six months later, the system is either shelved or limping along with a person behind it pretending to be a pipeline. The post-mortem always names the model or the vendor. It is almost never the model or the vendor.
Why most AI projects stall
Three failure modes account for most of what we see. They are not glamorous. They are not contested at the frontier of the field. They are boring, operational, and almost entirely solvable.
1. The data layer is not ready.
The model will be deployed against data that is partially wrong, sporadically updated, and owned by a team that doesn't know it is being read. There is no contract on what the data means. The schema drifts on Tuesdays. The metric changes definition when one person leaves. In production, the model produces plausible answers from quietly incoherent inputs, and the failure mode is silent — which is the worst kind.
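To make the fix concrete, a data contract can start as a handful of checks that run before anything downstream is allowed to read the table. A minimal sketch, assuming a pandas pipeline and an invented orders table; the column names, thresholds, and rules are placeholders, not prescriptions:

```python
# A minimal sketch of a data contract check, assuming a pandas DataFrame read
# from a hypothetical `orders` table. Names and thresholds are illustrative.
import pandas as pd


def check_orders_contract(df: pd.DataFrame) -> list[str]:
    """Return human-readable contract violations; an empty list means pass."""
    violations = []

    # Schema: the columns downstream consumers depend on must exist.
    required = {"order_id", "customer_id", "amount", "updated_at"}
    missing = required - set(df.columns)
    if missing:
        violations.append(f"missing columns: {sorted(missing)}")
        return violations  # nothing below is meaningful without the schema

    # Semantics: the metric's definition is written down, not implied.
    if (df["amount"] < 0).any():
        violations.append("negative amounts present; refunds belong in their own column")

    # Freshness: stale inputs should fail loudly instead of feeding the model silently.
    newest = pd.to_datetime(df["updated_at"], utc=True).max()
    if pd.Timestamp.now(tz="UTC") - newest > pd.Timedelta(days=2):
        violations.append(f"table looks stale; newest row is from {newest}")

    return violations
```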
2. There is no evaluation discipline.
"The demo looked good" is not an evaluation. Neither is "the team is happy with the answers." A production AI system needs a dataset that represents the hard cases, a harness that runs on every change, and a number that lives on a wall somewhere. Without that, the team will quietly optimise for the wrong thing, and no one will notice until it is in the hands of the customer.
3. Prototype-to-production is treated as a handoff, not a rewrite.
The notebook that worked on Wednesday is not the system that runs on Sunday. It calls an API synchronously. It reads a CSV someone dropped in a Drive folder. It has no idempotency, no retries, no observability, no tests, and no way to back out a bad response. When the team tries to "productionise" it, they discover that nearly none of it survives contact with the real infrastructure — and that the interesting work was never in the notebook anyway.
Four principles
These are the rules we engineer by. They are not negotiable inside our engagements — not because we are dogmatic, but because every time we have compromised on one, we have regretted it within a quarter.
The data layer comes first.
Before we build a model or a prompt, we want to see your pipelines, your warehouse, your contracts, and your quality checks. If the data is broken, we will fix the data. If the data is fine, we will say so and move faster. Either way, we refuse to deploy a model on top of a layer we do not trust. Deploying onto one anyway is how the silent failures happen.
Evaluation is a first-class deliverable.
Every engagement produces an evaluation harness — a dataset curated from real cases, a runner that executes on every change, and a dashboard with the number that matters. The harness is yours. It outlives the engagement. If you can't measure the system, you can't safely change it, and you will slowly be unable to change it at all.
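A minimal sketch of the runner half of that harness, assuming cases shaped like the ones sketched earlier and two functions the engagement actually defines: `run_system`, the system under test, and `grade`, whichever scoring rule the team trusts (exact match, a rubric, an LLM judge). The stubs in the usage block exist only so the sketch runs.

```python
# A minimal evaluation runner. `run_system` and `grade` are placeholders for
# whatever the engagement builds; nothing here is specific to one model or vendor.
from statistics import mean
from typing import Callable


def evaluate(cases: list[dict],
             run_system: Callable[[str], str],
             grade: Callable[[dict, str], float]) -> dict:
    results = [{"id": c["id"], "score": grade(c, run_system(c["input"]))} for c in cases]
    return {
        # The number that lives on the dashboard...
        "score": mean(r["score"] for r in results),
        # ...and the cases a human should read before the next change ships.
        "failures": [r["id"] for r in results if r["score"] < 1.0],
    }


if __name__ == "__main__":
    # Trivial stand-ins so the sketch runs end to end; wire in the real system
    # and grader, then run this on every change, not just before the demo.
    cases = [{"id": "example-001",
              "input": "Order status for an unnamed item.",
              "expected": "Ask a clarifying question."}]

    def run_system(text: str) -> str:
        return "Could you tell me which item you mean?"

    def grade(case: dict, output: str) -> float:
        return 1.0 if output.strip().endswith("?") else 0.0

    print(evaluate(cases, run_system, grade))
```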
The operator's Sunday matters more than the demo's Wednesday.
We design for the person on-call at 3am. That means idempotent jobs, structured logs, traceable failures, sensible retry policy, cost alarms, and runbooks that a reasonable engineer can follow without waking us. Everything we build is operable by a team that is not us.
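The skeleton of that rarely changes. A sketch, with three placeholder functions standing in for the engagement's real extract, transform, and state-store steps:

```python
# Sketch of the shape we mean by "operable": an idempotent, retried, logged job.
# The three functions below are placeholders for the real pipeline steps.
import json
import logging
import time

log = logging.getLogger("nightly_job")


def already_processed(batch_date: str) -> bool:
    return False  # placeholder: check the state store for this batch


def fetch_batch(batch_date: str) -> list[dict]:
    return []  # placeholder: the real extract step


def process(records: list[dict]) -> None:
    pass  # placeholder: the real transform/load step


def run_nightly(batch_date: str, max_attempts: int = 3) -> None:
    # Idempotency: re-running the same date is safe and does nothing twice.
    if already_processed(batch_date):
        log.info(json.dumps({"event": "skipped", "batch_date": batch_date}))
        return

    for attempt in range(1, max_attempts + 1):
        try:
            records = fetch_batch(batch_date)
            process(records)
            log.info(json.dumps({"event": "done", "batch_date": batch_date,
                                 "records": len(records), "attempt": attempt}))
            return
        except Exception as exc:
            # Structured logs: the 3am reader greps a batch date, not prose.
            log.error(json.dumps({"event": "failed", "batch_date": batch_date,
                                  "attempt": attempt, "error": str(exc)}))
            if attempt == max_attempts:
                raise  # fail loudly; the alert and the runbook take it from here
            time.sleep(2 ** attempt)  # simple backoff; tune per dependency
```

None of it is clever, and that is the point: the on-call engineer should recognise every piece of it.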
Boring wins.
When there is a boring option and a clever one, we take the boring one. A scheduled job beats a novel orchestrator. A well-indexed Postgres table beats a vector database the team has never run. A deterministic classifier beats an LLM call if the LLM call is not earning its keep. We are not allergic to new tools — we are allergic to new tools selected for their novelty.
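One illustration of the trade, with an invented documents table: Postgres full-text search often covers the retrieval need that a new vector database would otherwise be introduced for, on infrastructure the team already runs.

```python
# Sketch of the "boring" retrieval baseline: Postgres full-text search via
# psycopg2. The table, columns, and connection string are hypothetical; the
# point is the shape of the query, not this exact schema.
import psycopg2

QUERY = """
    SELECT id, title,
           ts_rank(search_vector, websearch_to_tsquery('english', %(q)s)) AS rank
    FROM documents
    WHERE search_vector @@ websearch_to_tsquery('english', %(q)s)
    ORDER BY rank DESC
    LIMIT 10;
"""


def search(conn, question: str) -> list[tuple]:
    # Assumes a precomputed tsvector column with a GIN index, e.g.
    #   CREATE INDEX ON documents USING gin (search_vector);
    with conn.cursor() as cur:
        cur.execute(QUERY, {"q": question})
        return cur.fetchall()


if __name__ == "__main__":
    conn = psycopg2.connect("dbname=app")  # illustrative connection details
    for row in search(conn, "refund policy for late deliveries"):
        print(row)
```

If the evaluation set says the boring baseline is not good enough, that is when the cleverer tool earns its place.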
What an engagement looks like
Most clients start with a readiness audit; some skip it because they've already done the equivalent. Here is the shape of a typical end-to-end engagement that includes both the audit and the build.
A call. We ask you to describe what you are trying to change about the business. We do not ask for a tech stack survey. If there is a fit, we send a proposal in writing.
Readiness audit. We walk the data estate, interview operators, read the existing code, and test the state of the evaluation discipline. We produce a written recommendation: what to build, what not to build, and what to fix first.
Phase one of the build. Almost always the data foundations: pipelines, contracts, quality checks, the dull work that makes everything after it possible. You get a system you can operate, not a slide.
Phase two. The AI or ML system itself — retrieval, agents, models, whichever shape fits — deployed behind an evaluation harness, observable, and wired into the application. Weekly review, weekly ship.
Operate and hand over. Runbooks, monitoring, a written handover document, and an optional retainer for the first months of life in production. We leave your team able to own it. Or we stay embedded, if that's the shape you want.
What we say no to
A firm is as defined by what it refuses as by what it builds. A partial list, in case it is useful:
- Chatbots no one asked for. If the operators would not use it and the customers have not requested it, the answer is almost always a better search bar.
- One-week proofs of concept. A week is not enough time to find the interesting failure modes, which means a week-long POC reliably produces false confidence. We'd rather not start than start that way.
- Wrapping a prompt around an existing problem. If the underlying process is broken, an LLM will make the breakage faster and harder to trace. We fix the process first.
- Black-box handovers. If we leave and no one at your company can operate what we built, we have not done the work.
- Hype-driven roadmaps. If the reason for building is that a competitor announced something, we will usually suggest waiting one quarter. It is almost always the right call.
The quiet result
If we've done the job well, the outcome is unglamorous. There is a system. It runs on a schedule. The numbers on the dashboard move in the direction the business needs. No one on the operations team has to think about it on a Sunday. The company does not issue a press release. The engagement ends, and a piece of infrastructure continues.
That is the bar we've set. It is an unshowy bar. It is the one we think matters.