- Can they navigate large codebases without losing context?
- Can they reconstruct a real vulnerability chain across multiple files and services?
- Can they separate meaningful signals from operational noise?
- And most importantly, can they reason like security engineers instead of behaving like vulnerability autocomplete systems?
To find out, I benchmarked 15 models across 21 GitHub Copilot CLI agent runs against a vulnerable commit hidden inside a repository containing more than 2,000 files and roughly 350,000 lines of code.
The AI security agent benchmark: 15 AI agents, one real vulnerability, 350,000 lines of production code
The benchmark sat in a deliberately uncomfortable middle ground. I did not hand agents isolated vulnerable snippets, but I also did not ask them to autonomously execute polished exploit chains against live infrastructure.
Agents had to traverse the repository, reconstruct the vulnerable path, connect frontend exposure to backend trust assumptions, and explain the consequence clearly enough that a human reviewer would prioritize the correct remediation.
The vulnerability: How auth-boundary drift creates a real exploit chain
The vulnerability was an auth-boundary failure created through ordinary product drift. A backend API key started as a narrow, low-impact internal mechanism. Over time, the key accumulated more responsibility across multiple services and was eventually shipped into the browser build, where frontend-originated requests began using it directly, even though the application already supported JWT-based user authentication elsewhere.
On the backend, service-auth decorators treated possession of that static key as proof that the caller was a trusted internal service. Once browser-accessible code exposes a credential that backend systems treat as service identity, the security conclusion follows directly: anyone who can load the frontend can extract the key and present themselves to the backend as a trusted internal service.
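To make the trust assumption concrete, here is a minimal sketch of what such a decorator might look like. It is not the benchmark repository's code: the names `SERVICE_API_KEY`, `require_service_auth`, and `export_accounts`, the `X-Api-Key` header, and the plain-dict request shape are all hypothetical stand-ins.

```python
import hmac
from functools import wraps

# Hypothetical value standing in for the static backend API key.
SERVICE_API_KEY = "example-static-key"


class Forbidden(Exception):
    """Raised when a request fails the service-auth check."""


def require_service_auth(handler):
    """Illustrative decorator: a matching key is taken as proof of service identity."""
    @wraps(handler)
    def wrapper(headers, *args, **kwargs):
        presented = headers.get("X-Api-Key", "")
        if not hmac.compare_digest(presented, SERVICE_API_KEY):
            raise Forbidden("invalid service key")
        # The flaw: possession of the key says nothing about who the caller is.
        # If the key ships in the browser bundle, any user can reach this line.
        return handler(headers, *args, caller="internal-service", **kwargs)
    return wrapper


@require_service_auth
def export_accounts(headers, caller):
    # Hypothetical privileged endpoint meant only for internal services.
    return f"accounts exported for {caller}"


# A browser user who copied the key out of the shipped bundle is
# indistinguishable from a legitimate internal service.
browser_request = {"X-Api-Key": "example-static-key"}
result = export_accounts(browser_request)
```

The comparison itself is done carefully (constant-time), which is the point: the check can be implemented correctly and still be the wrong boundary.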
Remediation comes down to three steps:
- Remove the service credential from browser paths,
- Move browser-originated requests onto the user-auth boundary,
- Stop treating browser-reachable static keys as trusted service identity.
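The first of these steps can be backed by an automated guard. The sketch below is one possible approach, not something the benchmark itself used: it scans a frontend build directory for known secret values so a CI job can fail when a service credential lands in browser-shipped files. The function name `find_leaked_secrets` and the idea of passing secrets as a name-to-value mapping are assumptions made for illustration.

```python
from pathlib import Path


def find_leaked_secrets(build_dir, secrets):
    """Return (relative_path, secret_name) pairs for secret values found in build output.

    `secrets` maps a label (e.g. an environment variable name) to its value.
    """
    leaks = []
    root = Path(build_dir)
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # Decode leniently so binary assets do not abort the scan.
        text = path.read_text(encoding="utf-8", errors="ignore")
        for name, value in secrets.items():
            # Skip empty values, which would match every file.
            if value and value in text:
                leaks.append((path.relative_to(root).as_posix(), name))
    return leaks
```

A CI step could call this against the bundler's output directory and exit non-zero when the returned list is not empty. It only catches the literal key value, so it complements the other two steps rather than replacing them.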
The repository itself was operationally realistic and intentionally noisy. It contained:
- Python backend services,
- frontend code,
- Docker and CI/CD configs,
- YAML infrastructure,
- gRPC routing,
- Envoy configs,
- deployment scripts,
- README and local configuration examples.
Evaluation criteria: What actually separates a useful security report from a noisy one
The benchmark was never about whether a model could merely mention a vulnerability. The harder question was whether the model could reconstruct the exploit chain, understand the consequence hierarchy, and surface the issue clearly enough that a security engineer would prioritize the right remediation.