Are AI Agents the Future of Penetration Testing?

In today’s complex threat landscape, cybersecurity must be intelligent and adaptive enough to respond to AI. However, AI is not currently creating a new category of cyberattack. Instead, it is amplifying existing ones. Techniques such as phishing, fraud and identity-based attacks have existed for years, but AI is making them faster, more scalable, and significantly more convincing, while lowering the barrier to entry for a broader pool of attackers. The National Cyber Security Centre warns that AI will increase the effectiveness, efficiency, access, frequency and intensity of cyber threats. Across industries, IDC predicts that by 2027, 80% of organizations will face phishing attacks from criminals using synthetic identities that blend real information with AI-generated data to appear legitimate.

As attacks become faster, more automated and more accessible, the question is no longer whether organizations can defend against known threats, but whether they can keep pace with the industrialization of cybercrime.

One of the most discussed responses is AI-driven penetration testing (“pentesting”), where autonomous agents simulate real-world attacks, identify vulnerabilities and continuously test systems at scale. The concept is compelling. But how effective are these systems without human intervention?

To answer that, we tested them.

Real-World Evaluation Criteria

Capture-the-Flag (CTF) challenges, including the popular XBOW Validation Benchmarks, are among the most common types of penetration tests for these agents, requiring them to find hidden flags. This approach has limitations: Production web applications do not contain flags or clear targets, some solutions use source code rather than a black-box approach, and some widely used benchmarks like DVWA and OWASP Juice Shop are already present in the training data of the latest LLMs, essentially giving agents answers before the test begins.

We decided to make the task more challenging. We tested autonomous AI penetration testing agents against two internal web applications designed to mimic real-world scenarios, each containing approximately 40 vulnerabilities, simulating the complexities of production systems.

The Experiment: Testing AI Pentesters

We selected five of the most popular open-source solutions and one commercial option for this evaluation. Here’s how they performed.

Open-Source AI Pentesting

  1. GH05TCREW PentestAgent
    • Overview: Pentest agent framework with Crew mode that runs multiple specialist agents and builds a knowledge graph of findings.
    • Performance: Struggled to function in Docker mode and found only a few vulnerabilities.
    • Analysis: The agent is barebones: methodologies, CVEs and wordlists must be added manually. Without this data, the results are underwhelming.
  2. Crossbow
    • Overview: A hobby project attempting to build a fully autonomous AI security engineer.
    • Performance: In autonomous mode, it couldn’t even browse the target application, let alone find vulnerabilities.
    • Analysis: Falls short on promise. Crossbow is a simple tool that’s mostly designed as an AI assistant to learn cybersecurity and requires input from a human at every step.
  3. Shannon
    • Overview: Shannon is a tool for white-box pentesting. It uses Claude’s code understanding to analyze the source code and a browser-using agent to test the application at runtime.
    • Performance: Delivered the best results among open-source tools, finding 13 vulnerabilities in Test App v1 and seven in v2.
    • Analysis: Splitting the pentest into phases and using subagents are the main reasons for the good results. Shannon’s structured configuration approach is a standout feature.
  4. Strix
    • Overview: Strix uses both a team of reasoning agents and traditional penetration testing tools to find vulnerabilities and suggest auto-fixes for them.
    • Performance: It found seven vulnerabilities in Test App v1 but only one in v2, making it best suited for simpler applications.
    • Analysis: This agent is ineffective because it tries to do everything in a single system prompt: map the application, define subagents, check every type of vulnerability, and scan the source code.
  5. PentAGI
    • Overview: PentAGI is a platform for running a team of AI agents that spawns a Kali Linux instance with the pentesting tools. It integrates with search engines and observability instruments.
    • Performance: Despite its impressive architecture, it found only a few vulnerabilities.
    • Analysis: Currently more of a playground for testing AI agents than a production-ready tool. The main cause of its failure is that a single simple prompt is used for all phases of the penetration test.
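
The contrast drawn above between phased testing and a single all-purpose prompt can be illustrated with a minimal sketch. It is not how Shannon, Strix or PentAGI are actually implemented: the phase names, the prompts and the `fake_agent` stand-in for an LLM call are all assumptions made for illustration. The only point is that each phase gets a narrow task and receives the notes produced by earlier phases.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PhaseResult:
    phase: str
    notes: list[str] = field(default_factory=list)


# Hypothetical stand-in for an LLM call. A real agent would send `prompt`
# and `context` to a model and parse its answer.
def fake_agent(prompt: str, context: list[str]) -> list[str]:
    phase_name = prompt.split(":")[0]
    return [f"{phase_name} saw {len(context)} prior notes"]


# Assumed phase breakdown; real tools may split the work differently.
PHASES = [
    ("recon", "recon: map routes, forms and parameters"),
    ("auth", "auth: test login, sessions and access control"),
    ("injection", "injection: probe the inputs found during recon"),
    ("report", "report: deduplicate and summarize findings"),
]


def run_pentest(agent: Callable[[str, list[str]], list[str]]) -> list[PhaseResult]:
    """Run each phase with its own narrow prompt, passing earlier notes forward."""
    context: list[str] = []
    results: list[PhaseResult] = []
    for name, prompt in PHASES:
        notes = agent(prompt, list(context))  # copy so a phase cannot mutate history
        results.append(PhaseResult(name, notes))
        context.extend(notes)
    return results


if __name__ == "__main__":
    for result in run_pentest(fake_agent):
        print(result.phase, result.notes)
```

A single-prompt design would collapse `PHASES` into one instruction, leaving the model to track mapping, testing and reporting at once, which is the weakness noted in the Strix and PentAGI analyses.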

Commercial AI Pentesting

AWS Security Agent

  • Overview: A commercial tool from AWS with extensive configuration options, including source code scanning, complex authentication, and domain ownership verification.
  • Performance: In black-box mode, it found 14 vulnerabilities in Test App v1 and 15 in v2, short of the goal but clearly better than open-source tools.
  • Analysis: The difference between this commercial solution and the open-source tools is immediately apparent: AWS Security Agent offers support for source code scanning, complex authentication, and uploading documents with knowledge about the target application. While not a replacement for human testers, it performs better than open-source options.
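
To put the reported counts in perspective, the short calculation below converts them into rough detection rates. It assumes exactly 40 vulnerabilities per application, whereas the article states only "approximately 40", so the percentages are indicative rather than exact. Only the tools with exact counts are included, and the `FOUND` table and function names are illustrative.

```python
# Assumption: exactly 40 vulnerabilities per app (the article says "approximately 40").
TOTAL_PER_APP = 40

# Findings reported in this article as (Test App v1, Test App v2).
FOUND = {
    "Shannon": (13, 7),
    "Strix": (7, 1),
    "AWS Security Agent": (14, 15),
}


def detection_rate(found: int, total: int = TOTAL_PER_APP) -> float:
    """Share of known vulnerabilities found, as a percentage with one decimal."""
    return round(100 * found / total, 1)


def summarize(found: dict[str, tuple[int, int]]) -> dict[str, tuple[float, float]]:
    return {tool: (detection_rate(v1), detection_rate(v2)) for tool, (v1, v2) in found.items()}


if __name__ == "__main__":
    for tool, (v1, v2) in summarize(FOUND).items():
        print(f"{tool}: v1 {v1}%, v2 {v2}%")
```

Under that assumption, even the best result stays below 40% of the known vulnerabilities. The figures are also not strictly comparable: Shannon was run as a white-box tool, while the AWS Security Agent numbers come from black-box mode.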

Why AI Pentesters Fall Short

Our analysis revealed several reasons why autonomous AI agents struggle with penetration testing:

  1. Limited Understanding of Application Logic
    AI agents struggle with non-standard application logic. Having been trained on millions of typical eCommerce applications, they can stumble when they encounter unique functionality or architecture, which gives humans a significant edge.

  2. Difficulty with Multi-Step Exploits
    Many vulnerabilities require chaining multiple steps, such as registering a user with an XSS payload that triggers on a profile page. AI agents, like traditional vulnerability scanners, often miss these.

  3. Inability to Handle Application Inconsistencies
    Agents falter when faced with inconsistencies, such as locked accounts or unexpected error messages. Unlike humans, they don’t adapt well to such scenarios.
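
The multi-step case from item 2 can be made concrete with a toy example. The `ToyApp` class below is a hypothetical in-memory stand-in, not one of our test applications, and the marker string is arbitrary. It shows why a check that inspects only the response to the request carrying the payload misses a stored XSS that surfaces later on a different page.

```python
import html


class ToyApp:
    """Hypothetical in-memory app: registration stores a name, the profile page renders it."""

    def __init__(self, escape_output: bool) -> None:
        self.users: dict[int, str] = {}
        self.escape_output = escape_output

    def register(self, display_name: str) -> tuple[int, str]:
        user_id = len(self.users) + 1
        self.users[user_id] = display_name
        # The registration response never echoes the submitted name.
        return user_id, "<p>Account created</p>"

    def profile(self, user_id: int) -> str:
        name = self.users[user_id]
        if self.escape_output:
            name = html.escape(name)
        return f"<h1>Profile of {name}</h1>"


# Arbitrary marker payload used only to detect unescaped output.
MARKER = "<script>probe_7f3a()</script>"


def single_step_check(app: ToyApp) -> bool:
    """Scanner-style check: looks only at the response to the injecting request."""
    _, response = app.register(MARKER)
    return MARKER in response


def chained_check(app: ToyApp) -> bool:
    """Two-step check: inject during registration, then inspect the profile page."""
    user_id, _ = app.register(MARKER)
    return MARKER in app.profile(user_id)


if __name__ == "__main__":
    vulnerable = ToyApp(escape_output=False)
    print("single-step:", single_step_check(vulnerable))
    print("chained:", chained_check(vulnerable))
```

The chained check succeeds only because it knows where the stored value is rendered. In a real application, the agent has to discover that link between the registration form and the profile page on its own.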

The Role of AI in Penetration Testing

Based on our testing, most open-source AI penetration testing agents are not yet ready for production use, let alone for replacing human testers. While some options stand out, like Shannon for white-box testing, the most effective use of AI in penetration testing is as an assistant, not an autonomous agent. Human testers can and should leverage AI to handle repetitive tasks while applying their expertise and judgment to guide the process.

Our findings also mirror a broader consensus: AI agents are powerful accelerators for pentesting, but they work best alongside humans, not instead of them. Companies like Synack provide AI-powered pentesting platforms that are explicitly built as human-in-the-loop, fusing autonomous agents with expert researchers to catch logic flaws, chained exploits, and nuanced vulnerabilities. NIST’s AI Risk Management Framework emphasizes governance, oversight, and lifecycle risk management for high-risk AI, making unsupervised offensive agents on production systems hard to justify.

That said, the rapid advancements in LLMs and emerging techniques, like agent "skills," hold promise. We’re optimistic that a new wave of AI pentesting agents will soon emerge, and we look forward to revisiting this topic in the future.
