What we learned across all phases & steps
Let’s start with the good news: Spec Kit won’t replace developers, designers, or architects anytime soon. What it changes is how they work. It automates mechanical, structured tasks and leaves humans to make the contextual, architectural, and design judgments that AI still can’t replicate. After running multiple features through all five phases, five consistent lessons emerged:
The 80/20 rule for automation
Across every phase, Spec Kit handled about 80% of the structured work automatically. The remaining 20% demanded human input: decisions involving trade-offs, domain context, and project-specific nuances. Spec Kit’s job is to handle predictable, repetitive work so engineers can focus on logic, quality, and intent. The best outcomes came when we treated it as collaboration, not delegation.
The cumulative quality problem
Each phase independently delivered roughly 80 percent accuracy. But when you chain five phases together, small imperfections multiply:
0.8 × 0.8 × 0.8 × 0.8 × 0.8 ≈ 0.33
Without intermediate checks, cumulative quality would drop to about 33 percent by the end of implementation. This is why review gates are not bureaucratic hurdles but essential safeguards backed by the numbers. Every phase requires validation before moving forward; otherwise, early inaccuracies propagate through the workflow and erode the quality of the final output.
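The compounding effect is easy to reproduce. Here is a minimal sketch of the arithmetic above; the `review_recovery` parameter and its value of 0.9 are our own illustrative assumptions (the share of a phase's errors a review gate catches and fixes), not measured figures.

```python
def chained_quality(per_phase_accuracy: float, phases: int) -> float:
    """Quality left after chaining phases with no review in between."""
    return per_phase_accuracy ** phases


def gated_quality(per_phase_accuracy: float, phases: int, review_recovery: float) -> float:
    """Quality when a review gate after each phase repairs part of that phase's errors.

    review_recovery is a hypothetical fraction (0.0 to 1.0) of errors fixed at each gate.
    """
    after_review = per_phase_accuracy + (1 - per_phase_accuracy) * review_recovery
    return after_review ** phases


unreviewed = chained_quality(0.8, 5)
gated = gated_quality(0.8, 5, review_recovery=0.9)  # 0.9 is an assumed value

print(f"No review gates:   {unreviewed:.2f}")  # 0.33
print(f"With review gates: {gated:.2f}")       # 0.90
```

The model treats each phase's errors as independent, which is a simplification, but it shows why catching errors at every gate matters more than a single review at the end.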
Review fatigue is an unglamorous truth
Reviewing AI-generated Markdown files is necessary but cognitively tiring. Unlike code or documentation, AI-written text appears grammatical and seemingly reasonable but demands constant scrutiny for factual and architectural accuracy.
Naturally, your focus will drift, and errors start to hide behind plausible wording.
The best mitigation is structured review: treat each file like a code review, rotate reviewers, and take short breaks. The mental workload doesn’t disappear; it shifts from writing to validation.
We don’t have a perfect solution for this. Engineers still have to read every .md file carefully, as the cumulative quality math shows, and they should expect it to be boring. It isn’t like reading code, where patterns jump out, or like reading documentation, where you’re learning something. That is the tax you pay for AI assistance.
Model quality is a productivity decision, not a cost choice
Model intelligence directly influences the 80/20 ratio. Upgrading from Claude Sonnet 4.0 to 4.5 cut our rework time nearly in half. Smarter models hold context through longer implementations, detect code reuse better, follow constitution rules more strictly, and make stronger architectural choices.
The economics are clear: a more capable model costs more per token but saves far more in engineering hours. Using cheaper, outdated models often means spending the difference on fixing and refactoring their output.
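To make that trade-off concrete for your own team, a back-of-the-envelope comparison is enough. The sketch below is illustrative only: the function name and every number in it are placeholders we made up, so substitute your own token spend, rework estimates, and rates.

```python
def net_saving(extra_token_cost: float, rework_hours_saved: float, hourly_rate: float) -> float:
    """Money saved per feature by the stronger model, after paying its extra token cost."""
    return rework_hours_saved * hourly_rate - extra_token_cost


# Placeholder inputs, not measurements: replace with your own figures.
saving = net_saving(extra_token_cost=50.0, rework_hours_saved=4.0, hourly_rate=80.0)
print(saving)  # 270.0
```

A positive result means the stronger model pays for itself on that feature; a negative one means the cheaper model was the better deal.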
Fixing output: when to iterate vs. when to restart
Through multiple implementations, we identified two effective strategies for correcting workflow errors in Spec Kit:
- Iterate to correct small deviations: Re-run the same command to regenerate partial outputs when dealing with limited-scope issues like clarifying specifications, refining edge cases, or adjusting minor design details. Iteration helps align outputs without disrupting downstream phases. Be selective, though. Routine technical maintenance (for example, code cleanups or deprecating old APIs) rarely benefits from Spec Kit. For such tasks, local tools or Copilot’s agent mode are faster and more precise. Focus Spec Kit on medium-to-large features where design structure and context matter.
- Restart to recover from foundational flaws: When architectural assumptions, system requirements, or constitution definitions are wrong, iteration magnifies the error chain. Restart the step after manually fixing the last validated .md file, or start the entire process from the beginning if upstream artifacts are compromised. Clean restarts realign design logic and restore phase consistency.
Rule of thumb: After three or four iterative attempts on the same artifact, stop fixing and restart the phase. Fresh context is often the cleanest path to restoring code and architectural integrity.
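The two strategies and the rule of thumb can be written down as a small decision helper. This is a sketch of how we reason about it, not part of Spec Kit: the `Issue` fields, the `Action` names, and the threshold of three attempts (the lower end of "three or four") are our own assumptions.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    ITERATE = "re-run the same command"
    RESTART_PHASE = "fix the last validated .md file and restart the phase"
    RESTART_ALL = "start the entire process from the beginning"


@dataclass
class Issue:
    foundational: bool          # wrong architecture, requirements, or constitution
    upstream_compromised: bool  # earlier artifacts are affected too
    attempts: int               # iterative attempts already made on this artifact


MAX_ATTEMPTS = 3  # assumed threshold, taken from the "three or four" rule of thumb


def next_action(issue: Issue) -> Action:
    """Pick a correction strategy for one problematic artifact."""
    if issue.upstream_compromised:
        return Action.RESTART_ALL
    if issue.foundational or issue.attempts >= MAX_ATTEMPTS:
        return Action.RESTART_PHASE
    return Action.ITERATE


print(next_action(Issue(foundational=False, upstream_compromised=False, attempts=1)).name)
# ITERATE
```

Tracking the attempt count explicitly is the useful part: it is easy to lose track of how many times you have already re-run a command on the same file.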
What’s next? Build reliable engineering features with Spec Kit
Spec Kit isn’t magic. It’s a serious engineering tool that rewards preparation, not blind trust. With the right setup (a clear constitution, strong architecture, and task-level execution), it makes AI-assisted development measurable, predictable, and scalable. The real shift isn’t in removing humans but in redefining their role. Spec Kit gives AI the structure it needs to behave like part of the team rather than a code generator. It turns good engineering discipline into repeatable, AI-ready workflows.
If you’re exploring Spec-Driven Development or looking to get started, the AI/Run team is here to help you design that foundation and guide your adoption journey.