What Made Visual Fidelity Possible
Simple prompting and vision-enabled models are not enough. Asking an agent to "match the Figma design" — even with screenshots — produces mediocre results. Getting to 90% required a specific combination:
- The native Figma MCP with a DEV seat. Non-negotiable. It provides design elements as React components with Tailwind classes — structured, implementation-ready code specs. Open-source alternatives that extract CSS properties or structural metadata give the agent raw measurements to interpret; the native MCP gives it code to translate. The free plan’s rate limit of six requests per month makes this impractical — the DEV seat is a real cost that pays for itself in output quality.
- The implement-design skill from Figma. A specialized skill guiding agents through pixel-perfect implementation from MCP data. Without it, agents take shortcuts in translating design specs to code.
- A dedicated asset download skill. Without explicit instructions, agents frequently fail to retrieve icons and images — sometimes reporting the MCP didn’t support it, sometimes substituting placeholders, sometimes adding external links to MCP downloads.
- Project-specific component guidelines. Skills describing how your project structures components, which shared elements to use, and which styling conventions to follow. Generic React knowledge isn’t enough when you need code that matches an existing codebase.
- A dedicated Visual Tester agent running after implementation. Not an afterthought — a full adversarial agent whose only job is to find discrepancies between the live page and the Figma source, with no incentive to confirm quality.
Remove any one of these, and the output quality drops noticeably.
What We Learned
Stability is the unsolved problem
Our stability — how consistently the same task produces the same quality — is well below 95%. Different runs of identical input produce different decompositions, different implementation plans, different final output. One run nails icon placement and misses typography. The next gets typography right and invents a wrapper element.
Many teams boast of their own agent-powered factories. We have never seen proof that any of them produces stable results across multiple repetitions of the same task. This variance appears inherent to the approach, not fixable by prompt tuning alone.
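One way to make "stability" concrete is to repeat the same task several times, score each run, and report the fraction of runs that clear a quality bar. The sketch below assumes a per-run quality score between 0 and 1 already exists. The function name, the threshold, and the sample scores are all hypothetical and are not measurements from our runs.

```python
def stability(scores: list[float], threshold: float) -> float:
    """Return the fraction of repeated runs whose score meets the threshold."""
    if not scores:
        raise ValueError("need at least one run to compute stability")
    passed = sum(1 for score in scores if score >= threshold)
    return passed / len(scores)


# Hypothetical scores from five runs of the same task (illustrative only).
example_runs = [0.92, 0.81, 0.95, 0.78, 0.90]
example_stability = stability(example_runs, threshold=0.90)
print(f"stability: {example_stability:.0%}")  # 3 of 5 runs pass
```

A single number like this hides which aspect failed in each run, so in practice it would probably be tracked per criterion (icons, typography, structure) rather than once per page.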
Code is cheap — testing is expensive
In our runs, actual code implementation accounts for less than a third of total execution time. The majority is spent on review, functional testing, visual comparison, and fix cycles. And we’d like to add more testing — unit tests, validation of intermediate artifacts — which would burn more tokens and add more time.
This inverts the traditional cost structure. These days the bottleneck isn’t writing the code. It’s verifying that the code is correct.
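To check a claim like this on your own pipeline, it may help to log the duration of each phase and compute its share of the total. The following is a minimal sketch. The phase names and the minutes are invented for illustration and do not come from our logs.

```python
def phase_shares(durations_min: dict[str, float]) -> dict[str, float]:
    """Convert per-phase durations (minutes) into fractions of the total."""
    total = sum(durations_min.values())
    if total <= 0:
        raise ValueError("total duration must be positive")
    return {phase: minutes / total for phase, minutes in durations_min.items()}


# Hypothetical durations for one run (illustrative only, not measured data).
example_run = {
    "implementation": 30,
    "review": 20,
    "functional_testing": 25,
    "visual_comparison": 15,
    "fix_cycles": 10,
}

for phase, share in phase_shares(example_run).items():
    print(f"{phase:20s} {share:.0%}")
```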
Context management is the real engineering problem
Agents have system prompts, MCP tool definitions, skills, project rules, and subagent configurations. All of this occupies context. Providing clear, concise, non-contradictory instructions across a team of seven agents is a genuine engineering challenge — one that requires semantic analysis and time from a human.
Agents reference other agents and skills. Skills reference project rules. Updating one agent’s prompt can introduce semantic inconsistency with three others — inconsistencies nearly impossible to spot without re-reading the entire instruction set. Proofreading with LLMs isn’t reliable; they tend to overgeneralize or fixate on specifics.
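A reference graph can at least tell you which instructions to re-read after a change, even though it cannot judge whether they still agree. The sketch below assumes you maintain a map from each agent or skill to the items it references. Apart from implement-design, every name in the example map is hypothetical, and the traversal is a plain breadth-first search over reversed edges.

```python
from collections import deque


def affected_by(changed: str, references: dict[str, set[str]]) -> set[str]:
    """Return every item that directly or transitively references `changed`."""
    # Invert the graph: for each target, record who points at it.
    referenced_by: dict[str, set[str]] = {}
    for source, targets in references.items():
        for target in targets:
            referenced_by.setdefault(target, set()).add(source)

    affected: set[str] = set()
    queue = deque([changed])
    while queue:
        current = queue.popleft()
        for dependant in referenced_by.get(current, ()):
            if dependant != changed and dependant not in affected:
                affected.add(dependant)
                queue.append(dependant)
    return affected


# Hypothetical instruction set: "X references Y" is stored as X -> {Y}.
example_references = {
    "orchestrator": {"implementer", "visual-tester"},
    "implementer": {"implement-design", "component-guidelines"},
    "visual-tester": {"component-guidelines"},
    "component-guidelines": {"project-rules"},
}

print(sorted(affected_by("project-rules", example_references)))
```

In this example, editing the project rules touches everything upstream of them, which is the re-reading burden described above. The semantic check itself still needs a human.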
Agents won't use the right tools unless forced
You cannot rely on an LLM to choose the appropriate skill at the right moment. An agent with access to an implement-design skill will sometimes use it, sometimes improvise. Explicit skill references in prompts — "for UI tasks, use this skill first" — are required for stability.
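Since the explicit references are what carries the stability, it can be worth checking mechanically that they are still present after each prompt edit. This is a minimal sketch that assumes prompts are available as plain strings and skills are mentioned by name. The agent names and every skill name except implement-design are hypothetical.

```python
def missing_skill_references(
    prompts: dict[str, str],
    required: dict[str, list[str]],
) -> dict[str, list[str]]:
    """Map each agent to the required skill names its prompt never mentions."""
    problems: dict[str, list[str]] = {}
    for agent, skills in required.items():
        text = prompts.get(agent, "")  # a missing prompt counts as empty
        missing = [skill for skill in skills if skill not in text]
        if missing:
            problems[agent] = missing
    return problems


# Hypothetical prompts and requirements (illustrative only).
example_prompts = {
    "implementer": (
        "For UI tasks, use the implement-design skill first. "
        "Download icons and images with the asset-download skill."
    ),
    "visual-tester": "Compare the live page with the Figma source.",
}
example_required = {
    "implementer": ["implement-design", "asset-download"],
    "visual-tester": ["component-guidelines"],
}

print(missing_skill_references(example_prompts, example_required))
```

A substring match only proves the skill is named, not that the instruction to use it is clear, so this catches accidental deletions rather than weak wording.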
The machines fill every gap you leave them
Agents find ambiguities you didn’t know existed and resolve them with confident, plausible, wrong decisions. Our orchestrator decided a subagent couldn’t start the application — because the orchestrator itself lacked terminal access. It projected its own limitations onto another agent.
An E2E tester, unable to reach a running app, quietly pivoted to reading source code and declared tests passed "as a proxy."
Adding prompt constraints fixes each specific case but doesn’t scale. The next run may invent a different interpretation.
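One complement to prompt constraints is a precondition enforced outside the agent, so that a missing prerequisite stops the run instead of inviting a creative workaround. The sketch below is not something we built. It assumes the application listens on a TCP port, and the host, port, and exception name are hypothetical.

```python
import socket


class AppNotReachable(RuntimeError):
    """Raised when the application under test cannot be contacted."""


def is_reachable(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def require_running_app(host: str, port: int) -> None:
    """Fail loudly before any E2E step if the app is not listening."""
    if not is_reachable(host, port):
        raise AppNotReachable(
            f"nothing is listening on {host}:{port}; refusing to run E2E tests"
        )
```

A harness could call `require_running_app("127.0.0.1", 3000)` before handing control to the E2E tester, so a failure to reach the app surfaces as an error in the run rather than as tests passed "as a proxy." It covers only this one gap; the next invented interpretation needs its own check.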
This requires a specific skill profile
Designing and maintaining a team of agents is not a task that every senior developer can simply pick up. It requires fluency in software engineering concepts, system design thinking, and the patience to iterate through runs that fail in novel ways each time. It’s closer to specification writing or API design than to coding.
Platform quirks that cost hours
Two discoveries, each costing significant debugging time:
VSCode Copilot has an experimental flag — "Custom Agent in Subagent" — that determines whether subagents receive your actual prompt or a vague AI-generated summary. With the flag off, your instructions are silently replaced. Even with it on, the orchestrator appends its own interpretation as a user-level message, which can subtly override your system prompt.
Separately, unclosed editor tabs in VSCode pollute the agent’s context. Deleted files that remain open appear as phantom content that agents read and act on. We learned to close all tabs before every run — a ritual that belongs in no engineering playbook but turned out to be essential.
Looking Forward
Six months ago, this wouldn't have worked. Copilot had no mechanism to spawn subagents. Agent sessions couldn't sustain hours-long tasks — the model would lose track of its own instructions partway through. Enforcing looped workflows — implement, review, fix, re-review — was unreliable at best. You could prototype the idea, but you could hardly run it on a real codebase and expect consistent results.
The tooling has caught up faster than we expected. Custom agents with subagent spawning, reliable MCP integrations, models that maintain coherence across multi-hour sessions — the pieces are now in place. Still rough at the edges — experimental flags, platform quirks, workarounds that belong in no engineering playbook. But functional enough to build real workflows on a production project.
What the industry doesn't yet have are comprehensive answers to the questions this progress raises.
Will this actually save developers time? The code arrives without the developer having written it — but they still need to own it. Reading, understanding, and taking responsibility for code you didn't create carries cognitive load that may not be less than writing it yourself. We haven't yet measured developer sentiment on this.
How durable is what you build? Models change between versions — behavior that worked reliably on one release may break on the next. Every prompt you've tuned is implicitly coupled to the current model's interpretation, with no guarantee of forward compatibility. The system depends on a third-party provider whose pricing, capabilities, and availability can shift. You're building on infrastructure you don't fully control — though that's increasingly true of most modern software.
What are the long-term risks nobody's measuring yet? Generated code accumulates. If generation outpaces human ability to review and comprehend, verification strategies are unclear. Developers who spend months reviewing agent output instead of writing code may lose hands-on expertise. The instinct is to split the codebase into isolated, thoroughly tested pieces — but whether that's achievable in complex production systems is unproven.
None of these are reasons to wait. The tools are here, they work, and they're improving at a pace that makes today's limitations feel temporary. But they are reasons to go in with clear eyes, treating this as engineering that requires investment, not magic that requires faith.