I threw everything into one massive prompt and hoped for the best. Forty dollars later, I had hallucinations, broken code, and a useless output.

That was the day I stopped using AI and started engineering with it.

This post is a thesis I've been writing for a year in fragments — in LinkedIn drafts, in production Slack threads, in arguments with senior engineers who think "AI coding" means typing prompts and hoping. It doesn't. Or rather, it can, and it will fail you predictably, and the failure mode is expensive.

Here's the version I've earned the right to publish.

The shift

When AI coding tools first surfaced, the dominant framing was magic: "the AI will write your code, you'll review it, ship faster." That framing burned a lot of people. Including me. Because magic is not a thing you can debug, observe, budget, or staff against. Magic doesn't have a failure mode. Magic doesn't have a quality bar.

What I came around to is the opposite framing: AI is just another system in your stack. It has latency, error rates, cost per call, blast radius on failure, dependency surface area, observability requirements. The engineers who treat it that way are shipping. The engineers who treat it as a vibes-based productivity boost are producing the same kind of debt as the people who skipped TDD in 2010 because "tests slow you down."

Stop using AI. Start engineering with it.

Four principles I run on

1. Context is memory. Finite and expensive.

Every token you put into an agent's context window displaces something else. The window isn't infinite, the cost is real, and the quality of reasoning falls off well before the limit. I treat context like RAM: pack it deliberately, evict aggressively, never assume the agent remembers what you said two turns ago unless you put it in the system prompt.

The practical version: short, specific tasks beat long, vague ones. Foundation-first orchestration (one agent researches, others fan out from the same brief) beats parallel cold-starts. Compaction at well-chosen boundaries beats running until the window saturates and quality silently degrades.
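Here is a minimal sketch of what "pack deliberately, evict aggressively" can look like in code. The names (ContextItem, pack_context) are hypothetical, and the four-characters-per-token estimate is a rough assumption; a real setup would use the provider's tokenizer.

```python
from dataclasses import dataclass


@dataclass
class ContextItem:
    name: str
    text: str
    priority: int  # higher means more important to keep


def estimate_tokens(text: str) -> int:
    # Assumption: roughly 4 characters per token. Swap in a real tokenizer.
    return max(1, len(text) // 4)


def pack_context(items: list[ContextItem], budget_tokens: int) -> list[ContextItem]:
    """Keep the highest-priority items that fit the budget; evict the rest."""
    by_priority = sorted(
        range(len(items)), key=lambda i: items[i].priority, reverse=True
    )
    kept: set[int] = set()
    used = 0
    for i in by_priority:
        cost = estimate_tokens(items[i].text)
        if used + cost <= budget_tokens:
            kept.add(i)
            used += cost
    # Return survivors in their original order so the prompt still reads coherently.
    return [items[i] for i in sorted(kept)]
```

The point of the sketch is the shape, not the heuristic: the budget is explicit, and what gets dropped is a decision rather than an accident.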

2. One agent, one job.

A code reviewer that only reviews beats a generalist that sometimes reviews. The general-purpose agent is the seductive trap: you give it everything, it does mediocre work on all of it. The specialist agent has a focused system prompt, a narrow tool surface, and a clear handoff to the next specialist. It's slower to set up. It's better in production.

This is the same principle as microservices, but for AI capability. The architectural taste you've already developed for service decomposition is the taste you need for agent decomposition. People who skipped that taste write monolithic agents. People who developed it write fleets.
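One way to make "focused system prompt, narrow tool surface, clear handoff" concrete is to declare it as data. This is a sketch under my own assumptions: AgentSpec, the tool names, and the "fixer" handoff target are all hypothetical, not any framework's API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AgentSpec:
    name: str
    system_prompt: str
    allowed_tools: frozenset[str]
    hands_off_to: Optional[str] = None  # name of the next specialist, if any


# Hypothetical reviewer: it can read and comment, and nothing else.
REVIEWER = AgentSpec(
    name="reviewer",
    system_prompt="You review diffs for correctness. You do not write code.",
    allowed_tools=frozenset({"read_file", "post_comment"}),
    hands_off_to="fixer",
)


def authorize_tool(spec: AgentSpec, tool: str) -> None:
    """Reject any tool call outside the specialist's declared surface."""
    if tool not in spec.allowed_tools:
        raise PermissionError(f"{spec.name} may not call {tool}")
```

Enforcing the tool surface in the orchestrator, rather than asking for it in the prompt, is what keeps a specialist from drifting back into a generalist.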

3. Adversarial evaluation. Always.

If an agent passes tests it has read, the code might be structurally sound — or it might be pattern-matched to pass. The only way to tell is to validate against tests the agent has never seen.

I run agents in isolated workspaces, then validate their output against the project's actual test command in a separate container the agent doesn't have access to. Pass = structurally sound. Fail = retry with the failure output as feedback, still blind. Three attempts, then human review. This catches reward-hacking that "agent says it's done" never will.
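The retry loop described above can be sketched roughly like this. The function names are hypothetical, and the isolation itself is not shown: run_validator runs the command as a local subprocess, where a real setup would run it in the separate container.

```python
import subprocess
from typing import Callable, Optional

MAX_ATTEMPTS = 3


def run_validator(command: list[str], workdir: str) -> tuple[bool, str]:
    """Run the project's test command; return (passed, combined output)."""
    proc = subprocess.run(command, cwd=workdir, capture_output=True, text=True)
    return proc.returncode == 0, proc.stdout + proc.stderr


def attempt_until_valid(
    run_agent: Callable[[Optional[str]], str],
    validate: Callable[[str], tuple[bool, str]],
) -> tuple[bool, int]:
    """Call the agent up to MAX_ATTEMPTS times, feeding failures back.

    run_agent(feedback) returns the workspace path holding the agent's output.
    The agent only ever sees the validator's output, never the tests.
    Returns (passed, attempts_used); (False, MAX_ATTEMPTS) means human review.
    """
    feedback: Optional[str] = None
    for attempt in range(1, MAX_ATTEMPTS + 1):
        workdir = run_agent(feedback)
        passed, output = validate(workdir)
        if passed:
            return True, attempt
        feedback = output
    return False, MAX_ATTEMPTS
```

In use, validate would be something like `lambda wd: run_validator(["pytest", "-q"], wd)`, with the agent call wrapped in whatever client the orchestrator uses.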

4. Cost per task, not cost per month.

"We spent $2,000 on AI this month" means nothing. "$6 per ticket with zero defects" means everything. The first number tells you what you spent. The second tells you whether to scale.

I tag every agent invocation with a task ID and track tokens-in, tokens-out, model used, success or failure, and downstream defect rate. Then I can answer the only question that actually matters: is this task cheaper, faster, and better when I run it through an agent than when a human does it? If yes, scale. If no, fix the prompt, fix the validator, or pull the task back to humans.

Engineers who can't answer that question are still in the "spend money and hope" phase. Engineers who can are running infrastructure.
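As a rough illustration of the bookkeeping, here is a sketch that rolls tagged invocations up into cost per task. The model names and per-million-token prices are placeholders I made up for the example, and downstream defect rate is left out because it has to come from the issue tracker, not the agent log.

```python
from collections import defaultdict
from dataclasses import dataclass

# Hypothetical prices in USD per million tokens: (input, output).
PRICES = {"big-model": (3.00, 15.00), "small-model": (0.25, 1.25)}


@dataclass
class Invocation:
    task_id: str
    model: str
    tokens_in: int
    tokens_out: int
    success: bool


def invocation_cost(inv: Invocation) -> float:
    """Dollar cost of one call, from its token counts and model price."""
    price_in, price_out = PRICES[inv.model]
    return (inv.tokens_in * price_in + inv.tokens_out * price_out) / 1_000_000


def cost_per_task(invocations: list[Invocation]) -> dict[str, float]:
    """Sum spend by task ID, failed retries included."""
    totals: dict[str, float] = defaultdict(float)
    for inv in invocations:
        totals[inv.task_id] += invocation_cost(inv)
    return dict(totals)
```

Counting failed attempts against the task is deliberate: a ticket that took three tries cost three tries.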

The boring tools that actually work

The infrastructure framing forces unsexy answers. Here's the stack I default to now:

  • Tiered model assignment. Complex reasoning to Claude Sonnet or Opus. Cost-sensitive tasks to local Ollama (Qwen 2.5 7B handles classification, formatting, and simple validation at zero marginal cost). Don't pay API prices for work a 7B model nails on a laptop.
  • Budget-aware degradation. When spend hits 70% of the budget, silently downgrade specialists from Sonnet to Haiku. Front-load expensive models on architecture and complex implementation. Late-stage cleanup doesn't need the best model.
  • Quality gates as shell commands. After an agent finishes, a configurable hook runs. pytest, eslint, a custom validator. Exit 0 = ship. Exit non-zero = the hook's stdout becomes feedback, agent retries with that context. Three attempts, then human review.
  • Provider abstraction. 200 lines of code so the orchestrator doesn't know whether it's talking to Claude, Ollama, or an OpenAI-compatible endpoint. The day a provider deprecates a model or changes pricing, you swap one line of config. Build for optionality.
  • Structured observability. Every agent action emits a typed event. Run ID, agent name, agent ID, type, timestamp. In multi-agent systems, you can't debug what you can't see. Observability is the entire debugging strategy.
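Two of the items above, tier selection with budget-aware degradation and typed events, fit in a short sketch. Everything here is illustrative: the tier names, the task kinds, and the helper functions are my assumptions, and the mapping from a tier to an actual model would live in config.

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone
from enum import Enum


class Tier(Enum):
    LOCAL = "local"      # small model served locally
    CHEAP = "cheap"      # small hosted model
    PREMIUM = "premium"  # large hosted model


DEGRADE_AT = 0.70  # fraction of budget spent before downgrading
LOCAL_TASKS = {"classification", "formatting", "simple_validation"}


def pick_tier(task_kind: str, spent: float, budget: float) -> Tier:
    """Route cheap work locally; downgrade the rest once spend hits the threshold."""
    if task_kind in LOCAL_TASKS:
        return Tier.LOCAL
    if budget <= 0 or spent / budget >= DEGRADE_AT:
        return Tier.CHEAP
    return Tier.PREMIUM


@dataclass(frozen=True)
class AgentEvent:
    run_id: str
    agent_name: str
    agent_id: str
    type: str
    timestamp: str  # ISO 8601, UTC


def make_event(run_id: str, agent_name: str, agent_id: str, type_: str) -> AgentEvent:
    """Build an event stamped with the current UTC time."""
    now = datetime.now(timezone.utc).isoformat()
    return AgentEvent(run_id, agent_name, agent_id, type_, now)


def to_json_line(event: AgentEvent) -> str:
    """Serialize one event as a single JSON line for a log stream."""
    return json.dumps(asdict(event))
```

One JSON line per event is enough to grep a run by ID, which covers most of the debugging I actually do.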

None of this is fancy. All of it is what I'd expect from any production system. That's the point.

Where this leaves us

The teams winning the next decade aren't the ones with the most AI tools. They're the ones whose senior engineers learn to orchestrate agents the way they used to mentor juniors: with clear specs, tight feedback loops, and architectural taste.

The skill ceiling rose. The work hasn't disappeared; it has shifted. The mechanical translation phase (taking a mental model and writing the syntax) compressed dramatically. What expanded is everything that requires judgment: scoping the problem, evaluating the output, knowing when to trust a result and when to throw it away, designing the system the agents operate inside.

That judgment is what the next decade of senior engineering hiring will select for. Not "can you write code faster." That question is already answered. The question is "can you build the system that gets correct code shipped, repeatedly, at scale, when most of the typing is being done by something other than a human."

That's engineering. Not magic. Infrastructure.


I'm available for Staff or Principal roles where this thesis matches the work. If your team is shipping AI-native software at scale — or wants to — say hello.