AI-Native Delivery Teams: Are We Navigating Complexity or Locking It In?

Dmitry Tovpeko

Vice President, AI Engineering and Modernization Global Co-Head

DATE

Nov 26, 2025

TL;DR

AI-first delivery demands something humans traditionally skip: formalization. Not because formalization is valuable for its own sake, but because agents need it, and humans ultimately benefit from it. Four components make it work: formalized stages, specialized agents, curated knowledge, and human validation at boundaries.

The mechanism: agents produce detailed artifacts with a predefined structure that subsequent agents then consume. Quality compounds when humans validate at handoffs; it degrades when autonomous workflows run unchecked.

But here’s the catch: we’re building infrastructure to navigate complexity, not reduce it. Are we solving for the right capability?

AI-first isn’t a commitment. It’s a task-level choice

Teams operate in two modes, and the better ones bounce between them:

AI-assisted: You drive. AI provides inline suggestions, autocomplete, and chat responses as you code. You’re steering; AI is following your lead.

AI-first: You frame the task and validate what comes out. AI handles most of the middle: implementation, test generation, documentation. AI executes; you review.

Some teams chase “maxi mode”: delegating everything, all the time. The pitch sounds right: if AI can handle it, why wouldn’t you? In practice, balanced teams make a different choice. They look at each task and ask: which mode yields the best result here, given what we’re working on and the constraints we’re operating under? Simple, well-scoped work runs AI-first. Core logic spread across multiple parts of the application, touching different layers and numerous files, or requirements that aren’t nailed down yet? AI-assisted makes more sense.
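The per-task question above can be written down as a rough heuristic. This is a minimal sketch, not a rule any team is known to use: the Task fields, the thresholds (max_files, max_layers), and the choose_mode name are all hypothetical and would need tuning per project.

```python
from dataclasses import dataclass


@dataclass
class Task:
    well_scoped: bool    # requirements are nailed down
    files_touched: int   # rough estimate of files the change involves
    layers_touched: int  # e.g. UI, service, data access


def choose_mode(task: Task, max_files: int = 5, max_layers: int = 1) -> str:
    """Pick a mode per task; the thresholds are illustrative assumptions."""
    if not task.well_scoped:
        # Open requirements: keep a human steering.
        return "ai-assisted"
    if task.files_touched > max_files or task.layers_touched > max_layers:
        # Logic spread across many files or layers: keep a human steering.
        return "ai-assisted"
    return "ai-first"


print(choose_mode(Task(well_scoped=True, files_touched=2, layers_touched=1)))
print(choose_mode(Task(well_scoped=True, files_touched=20, layers_touched=3)))
```

The point is less the thresholds than the habit: the mode is decided per task, not per team.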

The catch, however, is that AI-first isn’t plug-and-play. Most teams lack the infrastructure to support this mode, even where it’s feasible. What follows are the four components that separate teams running AI-first work in production from teams stuck celebrating one-off wins.

Why the intuitive setup fails

The intuitive flow sounds reasonable: configure your AI code assistant, add a couple of MCP servers for Jira and Confluence, let agents search for implementation scope and navigate architecture guidelines, iterate until you have working code—boom, done. On large brownfield projects, this rarely works smoothly.

The gap, however, is that MCP servers enable access but don’t solve navigation. Point an agent at your full Jira backlog and Confluence space, and it drowns. Retrieval returns dozens of potentially relevant, unstructured tickets. Context windows fill with outdated architecture guidelines and security policies. Agents struggle with the volume of information and lack a proper junior-developer-like onboarding process to grasp undocumented or scattered project knowledge.

Four components that work

Here’s how you can pave the way for AI-first delivery teams:

1. Formalized SDLC artifacts (Spec-driven development as the latest implementation example)

The SDLC stages stay the same: Idea → Feature/User Story → System Design/Implementation Plan → Code Implementation → Testing. What changes is formalization: strict standardization of inputs and outputs at each stage.

Traditional flow: Developers look at a story and, if the scope seems clear, jump straight to implementation. They might sketch a partial design without all the details. They skip formal planning and just code their way through. Established teams got away with this, and they skipped the artifacts for a reason: creating them felt tedious and there was no real consumer for the effort.

AI-first flow: Agents generate detailed artifacts at every stage. Next-stage agents consume those artifacts. The mechanism: detailed specs reduce ambiguity, agents execute more reliably, and humans validate at boundaries instead of babysitting every line. SDD—spec-driven development—captures the pattern.

Here’s why formalization matters: vague stories lead to one of two failure modes. Either you get an incorrect implementation, or you get correct logic wrapped in inconsistent architecture or code-level design, which tanks maintainability. Detailed specs should include acceptance criteria, dependencies, edge cases, and architecture/coding guidelines: the stuff developers used to carry in their heads or infer from tribal knowledge.
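One way to picture a "predefined structure" is a typed spec that is checked for completeness before the next agent consumes it. The sketch below is an assumption for illustration, not a standard SDD schema: FeatureSpec, REQUIRED_SECTIONS, and missing_sections are hypothetical names, and the example story is invented.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureSpec:
    title: str
    acceptance_criteria: list[str] = field(default_factory=list)
    dependencies: list[str] = field(default_factory=list)
    edge_cases: list[str] = field(default_factory=list)
    guidelines: list[str] = field(default_factory=list)  # architecture/coding rules


# Dependencies may legitimately be empty, so they are not required here.
REQUIRED_SECTIONS = ("acceptance_criteria", "edge_cases", "guidelines")


def missing_sections(spec: FeatureSpec) -> list[str]:
    """Return the required sections that are still empty."""
    return [name for name in REQUIRED_SECTIONS if not getattr(spec, name)]


vague = FeatureSpec(title="Export orders to CSV")
print(missing_sections(vague))
# ['acceptance_criteria', 'edge_cases', 'guidelines']
```

A spec with missing sections would go back to the author (human or agent) rather than forward to implementation.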

AI handles the discipline and mechanics of generating these artifacts. Humans review for correctness and alignment. This division of labor is more likely to stick than previous attempts at process formalization, precisely because AI removes the tedium that made everyone skip steps before.

The trade-off: more upfront structure for faster, more consistent execution. Teams gain throughput but sacrifice the flexibility to shortcut from vague story to working code in a single leap.

2. Specialized, swappable agents

Productive teams don’t standardize on a single AI tool. They run heterogeneous portfolios: Cursor for agentic coding, Claude Code for autonomous tasks, GitHub Copilot for inline suggestions, custom agents for testing and business analysis tasks. The pattern that works: properly scoped, single-responsibility agents at the edges, coordinated through shared workflows and context.

The volatility argument matters. The field shifts rapidly. Today’s benchmark leader might fall behind within months. Committing long-term to one platform risks lock-in to yesterday’s capabilities. But here’s the tension: chasing every new release burns time without payoff. Switching from Claude Code to Codex or vice versa just because it gained a few percentage points on benchmarks wastes effort without clear benefit.

The balance teams are finding: maintain a limited selection per task type/role, two or three endorsed tools. Keep a small team continuously experimenting with the latest releases. They anchor the broader team to market trends without forcing everyone into permanent experimentation mode. Review the portfolio regularly: retire what’s falling behind, double down on what’s delivering.

Why understanding internals matters: Agents are black boxes to varying degrees. Prompts, tools, context management, context retrieval, low-latency sync engines, and model selection all influence outcomes. Different platforms expose different levels of control. The farther you are from a complete black-box solution, the more you can tune behavior when (not if) results degrade. Competitive moats are built from these combined mechanisms, which means transparency will increasingly become a challenge.

Heterogeneity requires standardized integration points. Agents coordinate through shared systems of record (VCS, issue tracker, design files, source code), a layer of curated markdown files maintained by teams to direct agents on how to distill project knowledge from project data, and versioned workflows.

3. Project knowledge layer (Curated context)

Enterprises carry mountains of data: thousands of Jira epics, hundreds of Confluence pages, architecture diagrams in various formats, test cases, bug reports, and documents nobody’s looked at in months. All of it created by humans, for humans. Agents can’t parse the volume or extract implicit context at scale.

The distinction that matters: data exists, but agents need knowledge, meaning a condensed, structured subset of project data along with meta-information that points to detailed sources. Project knowledge includes application architecture components, dependencies, project structure, naming conventions, code style guides, service/front-end/data-access layer implementation patterns, shared components, external library usage, security protocols, and more. These are the things any engineer should know to be productive on the project.

The mechanism behind agent effectiveness: Context engines ingest, normalize, search, rank, and retrieve knowledge to shape context. Advanced companies invest in proprietary context engines or adopt platforms that offer tighter control. But every team can benefit from a simpler approach—manually curated markdown files defining key concepts.
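To make the ingest, normalize, search, rank, retrieve loop concrete, here is a deliberately tiny version over a single curated markdown file. It is a toy sketch under stated assumptions: real context engines use far richer ranking than word overlap, and the function names and the sample KNOWLEDGE content are hypothetical.

```python
import re


def split_sections(markdown: str) -> dict[str, str]:
    """Ingest: split a curated markdown file into {heading: body} sections."""
    sections: dict[str, str] = {}
    heading = "preamble"
    for line in markdown.splitlines():
        if line.startswith("#"):
            heading = line.lstrip("#").strip()
            sections[heading] = ""
        else:
            sections[heading] = sections.get(heading, "") + line + "\n"
    return sections


def tokens(text: str) -> set[str]:
    """Normalize: lowercase the text and keep alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(markdown: str, task: str, top_k: int = 2) -> list[str]:
    """Search and rank sections by word overlap with the task description."""
    query = tokens(task)
    scored = []
    for heading, body in split_sections(markdown).items():
        score = len(query & tokens(heading + " " + body))
        if score > 0:
            scored.append((score, heading))
    # Highest overlap first; ties broken alphabetically by heading.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [heading for _, heading in scored[:top_k]]


KNOWLEDGE = """# Naming conventions
Services end with Service. Repositories end with Repository.

# Security protocols
All endpoints require token authentication.

# Data access patterns
Use the repository layer; never query the database from controllers.
"""

print(retrieve(KNOWLEDGE, "add a repository for the orders database"))
# ['Data access patterns', 'Naming conventions']
```

Even this naive version shows why curation pays: the agent receives two short, relevant sections instead of the whole Confluence space.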

Why curate manually: This is tribal knowledge that was never captured in writing or quickly became obsolete because nobody had ongoing incentive to maintain it. Now there’s a buyer. Agents need this context for every transaction. Humans need it too, to achieve the desired outcome or to understand why an agent ignored a constraint and repeated the same mistake twice.

The trade-off: upfront curation effort for sustained agent quality. Teams invest time structuring knowledge once, then maintain it incrementally as it evolves. The payoff: agents make fewer errors, humans spend less time debugging annoyingly poor outputs.

4. Human gates at stage boundaries

AI-first sounds autonomous. In practice, teams insert human gates after every major stage. The reason: on varied, contextual work, unsupervised workflows compound errors. It is pure probability: the chance that every stage is correct shrinks as stages multiply. Unverified agent output passed to the next stage accumulates inaccuracies; quality degrades faster than it improves.
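A back-of-the-envelope calculation shows the compounding. The 90% per-stage figure below is purely illustrative, not a measured number, and the model assumes stages fail independently, which is a simplification.

```python
def chain_success(per_stage_accuracy: float, stages: int) -> float:
    """Probability that every stage of an unchecked chain is correct,
    assuming each stage fails independently of the others."""
    return per_stage_accuracy ** stages


for n in (1, 3, 5):
    print(n, round(chain_success(0.9, n), 3))
# 1 0.9
# 3 0.729
# 5 0.59
```

Under these assumptions, five unchecked stages at 90% each leave roughly a 59% chance of a fully correct result. A human gate that catches and corrects errors at a boundary effectively restarts the chain from that point.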

The framing: humans anchor the edges. They’re responsible for planning agent inputs and validating outputs. The middle (implementation) runs autonomously.

The exception: highly repetitive backlogs with narrow scope—straightforward migrations, bulk refactoring, test generation for uniform patterns. Workflows can run longer here without intervention. Insert validation gates elsewhere; quality resets at each boundary.
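The gating pattern can be sketched as a pipeline that pauses for approval after each stage, with an opt-out for narrow, repetitive stages. This is a minimal illustration, not a workflow engine: run_with_gates, the ungated parameter, and the lambda stages are hypothetical stand-ins, and in practice the gate would be a human review step rather than a function call.

```python
from typing import Callable

Stage = Callable[[str], str]       # takes an artifact, returns the next artifact
Gate = Callable[[str, str], bool]  # (stage name, artifact) -> approved?


def run_with_gates(
    artifact: str,
    stages: list[tuple[str, Stage]],
    gate: Gate,
    ungated: frozenset = frozenset(),
) -> str:
    """Run stages in order, asking the gate for approval after each one.

    Stage names listed in `ungated` skip review, e.g. uniform bulk work.
    """
    for name, stage in stages:
        artifact = stage(artifact)
        if name in ungated:
            continue
        if not gate(name, artifact):
            raise RuntimeError(f"Rejected at gate after stage: {name}")
    return artifact


stages = [
    ("spec", lambda a: a + " -> spec"),
    ("plan", lambda a: a + " -> plan"),
    ("code", lambda a: a + " -> code"),
]

# Stand-in reviewer that approves everything.
print(run_with_gates("story", stages, gate=lambda name, artifact: True))
# story -> spec -> plan -> code
```

A rejected artifact stops the run at the boundary where the problem was found, so the error never reaches the next stage.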

True efficiency comes not from removing humans entirely, but from shifting where human effort is applied. AI handles tedious implementation, while humans focus on judgment calls, thoughtful preparation (clear specifications and scoped tasks), and rigorous review (validating logic, checking architectural alignment, and confirming edge cases).

Are we navigating complexity or locking it in?

This setup doesn’t replace engineers; it shifts where engineering effort concentrates: from typing code to framing problems, curating knowledge, and validating solutions. The four components formalize what productive teams were already doing informally; agents create the first real incentive to maintain that discipline. The work changes; the need for professional judgment doesn’t.

Here’s what the setup reveals: enterprise complexity isn’t decreasing. We’re building crutches around AI to navigate complexity, not reduce it. High complexity becomes sustainable. The better we get at accommodating complexity, the less pressure we feel to simplify. Are we solving for the right capability—navigating what exists—or locking in a trajectory where complexity only compounds?
