EPAM

From Figma to Production Code: Building a Dark Factory with AI Agents  

Building a team of AI agents that implements Figma designs into production code took us one day. Making it work reliably took weeks. And we're still not done.

This is a report from somewhere in the middle — an experiment that's working well enough to be useful and failing often enough to be honest about. Some call this kind of setup a dark factory, borrowing the fancy manufacturing term for a facility where robots work with the lights off. We'll stick with a less glamorous description: a team of agents that takes a Figma design and a feature specification, then produces production frontend code without a human writing a single line.

The numbers: a feature that typically takes a developer about 2 days arrives after about 2 hours of autonomous agent work, with 30+ files changed and roughly 90% visual fidelity to the original Figma design. The code is structured well enough that it doesn't need significant rework, though it's not merge-ready as-is. A developer then spends another 2–4 hours reviewing and refining: mostly code review, followed by visual polish and code adjustments. It's a meaningful time saving. Not the order-of-magnitude leap some might expect, but real enough to justify the investment.
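
As a rough check on those figures, the arithmetic can be written out. This is a back-of-the-envelope sketch that assumes an 8-hour working day, which is an assumption made for illustration and not a measured value. The constant and function names are hypothetical.

```python
HOURS_PER_DAY = 8  # assumption: one developer day is 8 working hours

MANUAL_HOURS = 2 * HOURS_PER_DAY  # "about 2 days" of manual implementation
AGENT_HOURS = 2                   # unattended agent run


def wall_clock_speedup(review_hours):
    """Manual time divided by elapsed time with agents (run plus review)."""
    return MANUAL_HOURS / (AGENT_HOURS + review_hours)


def developer_hours_saved(review_hours):
    """Hands-on developer hours saved; the agent run itself is unattended."""
    return MANUAL_HOURS - review_hours


fast = wall_clock_speedup(2)   # 16 / (2 + 2) = 4.0
slow = wall_clock_speedup(4)   # 16 / (2 + 4), roughly 2.67
print(f"speedup between {slow:.2f}x and {fast:.2f}x")
print(f"developer hours saved: {developer_hours_saved(4)} to {developer_hours_saved(2)}")
```

Under that assumption the feature lands roughly 2.7 to 4 times faster in elapsed time, which matches the framing above: a real saving, but well short of an order of magnitude.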

The number that tells the real story is 90%. Getting to 70% took a day. Getting to 90% took weeks. And we're not sure if the remaining 10% is reachable with the current approach at all.

Why We Did This

The motivation was straightforward. Frontend developers spend a significant share of their time on work that is skilled but repetitive: taking a Figma design and faithfully translating it into React components. The layouts, the spacing, the icon placement, the responsive breakpoints — all specified in the design, all requiring careful manual implementation.

We wanted to know three things:

Could a team of AI agents produce a visual feature from a Figma design and specification without human intervention? Could it work reliably enough, given the current state of AI models and tooling? And could it relieve developers of the monotonous parts of this work?

The answer to all three turned out to be yes, but — and the but is where everything interesting lives.

The Hard Problem: Looking Right

Most writing about AI agents focuses on code generation — scaffolding components, wiring APIs, handling routing. These are largely approachable problems today.

We were after something harder: visual fidelity. Take a Figma design — specific paddings, exact icon variants, precise typography, consistent color values — and reproduce it so a designer looking at the browser can't tell the difference.

The secondary challenge was code quality — not just code that works, but code that follows project conventions, uses the right shared components, and won't become a maintenance burden as the codebase grows.

Agents are confidently, quietly bad at visual fidelity — and unreliable on code quality.

From our runs: the agent wrapped an entire page in a shadowed box with rounded corners that the design never specified, inferred from patterns elsewhere in the codebase. Icons came out in a subtly wrong shade of gray compared to the design. A download button that was supposed to contain two icons showed up with only one. And our favorite: a back-arrow navigation icon rendered in blue instead of black. We ran multiple automated fix cycles. The agents couldn't detect the difference.

None of these broke the feature. The page loaded, the data displayed, the interactions worked. But every one of them is the difference between a prototype and a shipped product. And the engineering challenge that interested us was precisely this: how close can agents get autonomously, and what does it take to push the quality boundary higher?

The Stack and Why

A likely question: why this particular toolset?

The pragmatic answer is that everyone on the project already had GitHub Copilot subscriptions with a generous allowance of premium model requests. No new procurement, no new accounts, no budget approval. We could start experimenting immediately.

The stack: VSCode with GitHub Copilot custom agents — markdown files in .github/agents/, each with a system prompt, permitted tools, and model assignment. Figma MCP (native, requiring a DEV seat) as the design input layer. Chrome DevTools MCP for testing agents to interact with the live application. Custom skills — reusable prompt-based instructions for recurring tasks. And project-level rules in AGENTS.md that all agents reference for coding conventions.

A run typically consumes about 3–4% of the monthly premium request allowance. We don't have access to absolute cost figures, but for a team already paying for Copilot, the marginal cost of experimentation is effectively zero.

One platform constraint shaped our agent team: at the time of the experiment, Copilot allowed exactly one level of agent hierarchy. An orchestrator spawned subagents. Subagents couldn't spawn their own. (This limitation has since been lifted: subagents can now spawn subagents, even recursively, though our architecture predates the change.)

The First Team: Assembled in Hours, Working in One

The first team of agents was generated with an LLM in a couple of hours. We described what we needed — a group of agents that could implement a frontend feature from a Figma design — and iterated on the output until it matched our vision.

The result was five agents, each with a minimal prompt: a role, five lines of responsibilities, a few constraints, and an output artifact description.

Agent | Role
Architect-Planner | Central orchestrator: receives requirements, creates plans, delegates tasks
Engineer | Implements features from plans; escalates architectural concerns
UI Expert | Translates Figma/design assets into component specs for the engineer
Code Reviewer | Reviews engineer's code; routes fixes back to engineer or architect
Tester | Runs tests; routes failures to engineer, architect, or UI expert

And it worked. On the first run, the team produced a functional feature within an hour. Components rendered. All interactive. The page even looked similar to the original design.

Many people experience this exact euphoria when they first point AI at a real task. A prototype appears, it looks reasonable, a couple of follow-up prompts clean it up. You start to believe the hard part is over. But production work doesn't ask for "looks reasonable." It asks for exact icon variants, precise spacing, adherence to a component library, code that won't confuse the next developer who opens the file. The gap between a working page and a shippable feature is where the first team — with its five-line prompts and minimal constraints — had nothing to offer.

So we started adjusting...

Where We Are Now

We want to be precise about framing: this is not a proven final workflow. It's the best version we've arrived at through many iterations, and it still has real limitations.

That said — it produces genuinely useful results. A 2-hour autonomous run that delivers a feature at 90% visual fidelity with structurally sound code is, by any honest measure, a strong outcome for an approach that didn't exist a year ago.

The team has seven agents across three phases.

Phase 1 — Understand the design. A Design Decomposer reads the spec and the Figma design through the MCP, breaks the UI into implementable components, identifies reusable patterns, and lists assets to download. Separately, a Functional Test Cases Creator extracts every testable behavior from the spec and writes scenarios. Neither writes code.

Phase 2 — Build, with inspection after every piece. A Coder implements one task at a time, querying the Figma MCP for exact component specs. After each task, a Code Reviewer — framed explicitly as adversarial — checks the code against project guidelines and runs the linter. If the review fails, the Coder fixes and resubmits. This loop repeats until the review passes.
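
The implement, review, fix cycle can be sketched in a few lines. This is a minimal illustration, not the actual agent wiring: `implement` and `review` are hypothetical stand-ins for the Coder and Code Reviewer subagents, and the `max_rounds` cap is an added safety assumption that the workflow described here does not state.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewResult:
    passed: bool
    findings: list = field(default_factory=list)


def build_task(task, implement, review, max_rounds=5):
    """Run implement -> review -> fix until the review passes.

    Returns the accepted code and the number of rounds it took.
    """
    findings = []
    for round_no in range(1, max_rounds + 1):
        code = implement(task, findings)
        result = review(code)
        if result.passed:
            return code, round_no
        findings = result.findings  # fed back into the next attempt
    raise RuntimeError(f"{task!r} still failing review after {max_rounds} rounds")


# Stub agents for demonstration only.
def fake_coder(task, findings):
    # Pretends to fix whatever the reviewer complained about.
    return f"{task}:lint-clean" if findings else f"{task}:draft"


def fake_reviewer(code):
    if code.endswith("lint-clean"):
        return ReviewResult(True)
    return ReviewResult(False, ["linter: unused import"])


code, rounds = build_task("DownloadButton", fake_coder, fake_reviewer)
print(code, rounds)  # DownloadButton:lint-clean 2
```

The point of the structure is that review findings are the only input to the next attempt, so each task is closed out before the next one starts.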

Moving from end-of-pipeline testing to test-after-every-step was one of the most impactful changes we made. When errors are caught at the station that produced them, they're small and fixable. When they accumulate across ten tasks, they're tangled beyond repair.

Phase 3 — Test against reality. A Functional Tester executes scenarios against the live application through Chrome DevTools MCP. Failures generate fix tasks cycling back through Phase 2. Then a Visual Tester retrieves both the Figma design layout and the live DOM and runs a structured comparison: colors, spacing, typography, sizing, icon presence, missing elements. Visual bugs trigger another fix cycle.
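
To make the idea of a structured comparison concrete, here is a small sketch that diffs expected design values against rendered values. It assumes both sides have already been reduced to plain dictionaries; how that data is extracted from the Figma MCP or Chrome DevTools MCP is out of scope. The element names, property names, and color values are illustrative and loosely modeled on the bugs described earlier.

```python
def compare_styles(design, live):
    """Return sorted, human-readable mismatches between design and page.

    Both arguments map element name -> {property: value}.
    """
    bugs = []
    for element, expected in design.items():
        actual = live.get(element)
        if actual is None:
            bugs.append(f"{element}: missing from page")
            continue
        for prop, want in expected.items():
            got = actual.get(prop)
            if got != want:
                bugs.append(f"{element}.{prop}: expected {want}, got {got}")
    # Elements the page invented that the design never asked for.
    for element in set(live) - set(design):
        bugs.append(f"{element}: not in design")
    return sorted(bugs)


# Illustrative values only.
design = {
    "back-arrow": {"color": "#000000"},
    "download-button": {"icon-count": 2},
}
live = {
    "back-arrow": {"color": "#0000ff"},
    "download-button": {"icon-count": 1},
    "page-wrapper": {"box-shadow": "0 1px 4px rgba(0,0,0,0.2)"},
}

bugs = compare_styles(design, live)
for bug in bugs:
    print(bug)
```

Exact equality is deliberately strict here. A real comparison would likely need tolerances for sub-pixel spacing and equivalent color notations.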

The orchestrator is a pure state machine driver: it owns the task file, never touches code, and decides what runs next based on subagent outputs. If interrupted, it re-reads the state file and resumes from where it stopped. This is the setup that has worked for us; we have not yet experimented with other orchestration techniques.
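
A state machine driver with resume-from-file behavior might look roughly like this. It is a sketch under assumptions: the phase names, the JSON file layout, and the `run_subagent` callback are all hypothetical, and Phase 3 is split into two steps for clarity.

```python
import json
import os
import tempfile

# Hypothetical phase names; test failures cycle back to "build".
PHASES = ["decompose", "build", "functional_test", "visual_test", "done"]


def load_state(path):
    """Re-read the state file, or start fresh if none exists."""
    if os.path.exists(path):
        with open(path) as fh:
            return json.load(fh)
    return {"phase": PHASES[0], "history": []}


def save_state(path, state):
    with open(path, "w") as fh:
        json.dump(state, fh)


def next_phase(phase, outcome):
    """Pure transition function: failed tests send work back to build."""
    if outcome == "fail" and phase in ("functional_test", "visual_test"):
        return "build"
    return PHASES[PHASES.index(phase) + 1]


def step(path, run_subagent):
    """Advance one phase, persisting state so a restart can resume."""
    state = load_state(path)
    if state["phase"] == "done":
        return state
    outcome = run_subagent(state["phase"])  # expected: "ok" or "fail"
    state["history"].append([state["phase"], outcome])
    state["phase"] = next_phase(state["phase"], outcome)
    save_state(path, state)
    return state


# Demo: three successful phases, then a failed visual test.
path = os.path.join(tempfile.mkdtemp(), "state.json")
outcomes = iter(["ok", "ok", "ok", "fail"])
for _ in range(4):
    state = step(path, lambda phase: next(outcomes))

print(state["phase"])             # build
print(load_state(path)["phase"])  # build (what a restarted driver would see)
```

Because every call to `step` starts by reading the file, nothing needs to be kept in memory between steps, which is what makes resuming after an interruption straightforward.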

The adversarial framing in testing is essential. Without it, testers default to optimism — in early runs, our tester reported "all good" while eight visual mismatches were visible on screen.

What Made Visual Fidelity Possible

Simple prompting and vision-enabled models are not enough. Asking an agent to "match the Figma design" — even with screenshots — produces mediocre results. Getting to 90% required a specific combination:

  • The native Figma MCP with a DEV seat. Non-negotiable. It provides design elements as React components with Tailwind classes — structured, implementation-ready code specs. Open-source alternatives that extract CSS properties or structural metadata give the agent raw measurements to interpret; the native MCP gives it code to translate. The free plan’s rate limit of six requests per month makes this impractical — the DEV seat is a real cost that pays for itself in output quality.
  • The implement-design skill from Figma. A specialized skill guiding agents through pixel-perfect implementation from MCP data. Without it, agents take shortcuts in translating design specs to code.
  • A dedicated asset download skill. Without explicit instructions, agents frequently fail to retrieve icons and images — sometimes reporting the MCP didn’t support it, sometimes substituting placeholders, sometimes adding external links to MCP downloads.
  • Project-specific component guidelines. Skills describing how your project structures components, which shared elements to use, and which styling conventions to follow. Generic React knowledge isn’t enough when you need code that matches an existing codebase.
  • A dedicated Visual Tester agent running after implementation. Not an afterthought — a full adversarial agent whose only job is to find discrepancies between the live page and the Figma source, with no incentive to confirm quality.

Remove any one of these, and the output quality drops noticeably.

What We Learned

Stability is the unsolved problem

Our stability — how consistently the same task produces the same quality — is well below 95%. Different runs of identical input produce different decompositions, different implementation plans, different final output. One run nails icon placement and misses typography. The next gets typography right and invents a wrapper element.
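
One way to put a number on this is to score repeated runs of the same input and count how many clear a quality bar. This is only one possible definition, offered as an assumption; the scoring scale, the threshold, and the sample scores below are made up for illustration and are not measurements from our runs.

```python
def stability(scores, threshold):
    """Fraction of repeated runs whose quality score reaches `threshold`."""
    if not scores:
        raise ValueError("need at least one run")
    return sum(score >= threshold for score in scores) / len(scores)


def spread(scores):
    """Gap between the best and worst run of the same input."""
    return max(scores) - min(scores)


# Made-up fidelity scores (percent) for four runs of one identical task.
example_scores = [90, 70, 90, 80]

print(stability(example_scores, threshold=90))  # 0.5
print(spread(example_scores))                   # 20
```

Tracking the spread alongside the pass rate helps separate "usually good, occasionally terrible" from "consistently mediocre".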

Many boast of their own agent-powered factories. We have never seen proof that any of them produces stable results across multiple repetitions of the same task. This variance appears inherent to the approach, not fixable by prompt tuning alone.

Code is cheap — testing is expensive

In our runs, actual code implementation accounts for less than a third of total execution time. The majority is spent on review, functional testing, visual comparison, and fix cycles. And we’d like to add more testing — unit tests, validation of intermediate artifacts — which would burn more tokens and add more time.

This inverts the traditional cost structure. These days the bottleneck isn’t writing the code. It’s verifying that the code is correct.

Context management is the real engineering problem

Agents have system prompts, MCP tool definitions, skills, project rules, and subagent configurations. All of this occupies context. Providing clear, concise, non-contradictory instructions across a team of seven agents is a genuine engineering challenge — one that requires semantic analysis and time from a human.

Agents reference other agents and skills. Skills reference project rules. Updating one agent’s prompt can introduce semantic inconsistency with three others — inconsistencies nearly impossible to spot without re-reading the entire instruction set. Proofreading with LLMs isn’t reliable; they tend to overgeneralize or fixate on specifics.
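
The semantic part of this problem needs a human, but the mechanical part can be scripted: finding broken references and listing which files need re-reading after a change. The sketch below assumes references are written as `@name` inside instruction files. That syntax, the file names, and the file contents are all hypothetical.

```python
import re

REF = re.compile(r"@([\w-]+)")  # assumption: references are written as @name


def references(text):
    """Names mentioned in one instruction file."""
    return set(REF.findall(text))


def dangling(files):
    """Map each file to the references in it that point at no known file."""
    known = set(files)
    report = {}
    for name, text in files.items():
        missing = references(text) - known
        if missing:
            report[name] = sorted(missing)
    return report


def affected_by(changed, files):
    """Every file that directly or transitively references `changed`."""
    hit, frontier = set(), {changed}
    while frontier:
        frontier = {
            name for name, text in files.items()
            if name not in hit and name != changed
            and references(text) & frontier
        }
        hit |= frontier
    return sorted(hit)


# Hypothetical instruction set.
files = {
    "orchestrator": "Delegate to @coder and @reviewer.",
    "coder": "Follow @component-guidelines.",
    "reviewer": "Check against @component-guidelines and @project-rules.",
    "component-guidelines": "See @project-rules for styling.",
    "project-rules": "Use shared components.",
}

print(dangling(files))                      # {}
print(affected_by("project-rules", files))
# ['coder', 'component-guidelines', 'orchestrator', 'reviewer']
```

The second result is the re-reading list: touching the project rules in this example means every other file may now be inconsistent, which is exactly the cost described above.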

Agents won't use the right tools unless forced

You cannot rely on an LLM to choose the appropriate skill at the right moment. An agent with access to an implement-design skill will sometimes use it, sometimes improvise. Explicit skill references in prompts — "for UI tasks, use this skill first" — are required for stability.

The machines fill every gap you leave them

Agents find ambiguities you didn’t know existed and resolve them with confident, plausible, wrong decisions. Our orchestrator decided a subagent couldn’t start the application — because the orchestrator itself lacked terminal access. It projected its own limitations onto another agent.

An E2E tester, unable to reach a running app, quietly pivoted to reading source code and declared tests passed "as a proxy."

Adding prompt constraints fixes each specific case but doesn’t scale. The next run may invent a different interpretation.

This requires a specific skill profile

Designing and maintaining a team of agents is not a task that every senior developer can simply pick up. It requires fluency in software engineering concepts, system design thinking, and the patience to iterate through runs that fail in novel ways each time. It’s closer to specification writing or API design than to coding.

Platform quirks that cost hours

Two discoveries, each costing significant debugging time:

VSCode Copilot has an experimental flag — "Custom Agent in Subagent" — that determines whether subagents receive your actual prompt or a vague AI-generated summary. With the flag off, your instructions are silently replaced. Even with it on, the orchestrator appends its own interpretation as a user-level message, which can subtly override your system prompt.

Separately, unclosed editor tabs in VSCode pollute the agent’s context. Deleted files that remain open appear as phantom content that agents read and act on. We learned to close all tabs before every run — a ritual that belongs in no engineering playbook but turned out to be essential.

Looking Forward

Six months ago, this wouldn't have worked. Copilot had no mechanism to spawn subagents. Agent sessions couldn't sustain hours-long tasks — the model would lose track of its own instructions partway through. Enforcing looped workflows — implement, review, fix, re-review — was unreliable at best. You could prototype the idea, but you could hardly run it on a real codebase and expect consistent results.

The tooling has caught up faster than we expected. Custom agents with subagent spawning, reliable MCP integrations, models that maintain coherence across multi-hour sessions — the pieces are now in place. Still rough at the edges — experimental flags, platform quirks, workarounds that belong in no engineering playbook. But functional enough to build real workflows on a production project.

What the industry doesn't yet have are comprehensive answers to the questions this progress raises.

Will this actually save developers time? The code arrives without the developer having written it — but they still need to own it. Reading, understanding, and taking responsibility for code you didn't create carries cognitive load that may not be less than writing it yourself. We haven't yet measured developer sentiment on this.

How durable is what you build? Models change between versions — behavior that worked reliably on one release may break on the next. Every prompt you've tuned is implicitly coupled to the current model's interpretation, with no guarantee of forward compatibility. The system depends on a third-party provider whose pricing, capabilities, and availability can shift. You're building on infrastructure you don't fully control — though that's increasingly true of most modern software.

What are the long-term risks nobody's measuring yet? Generated code accumulates. If generation outpaces human ability to review and comprehend, verification strategies are unclear. Developers who spend months reviewing agent output instead of writing code may lose hands-on expertise. The instinct is to split the codebase into isolated, thoroughly tested pieces — but whether that's achievable in complex production systems is unproven.

None of these are reasons to wait. The tools are here, they work, and they're improving at a pace that makes today's limitations feel temporary. But they are reasons to go in with clear eyes, treating this as engineering that requires investment, not magic that requires faith.