There’s a version of the case against AI-generated code that comes with receipts. A recent Stanford study tracked nearly 6,000 coding sessions across public GitHub repos. Developers kept only 44% of AI-written code, rejecting or rewriting the rest. Vibe-coded commits introduced roughly 9x more security vulnerabilities per 1,000 lines than code written by humans.
The standard takeaway: AI writes worse code, especially when it comes to security.
Here’s a different one: those numbers come from sessions without standards.
I had Claude score four codebases I work with against a single rubric: architecture, code quality, error handling, security, testing, async patterns, dependency management, documentation, and modernization readiness. Same rubric. Same model. Same prompt. Here are the overall numbers.
| Codebase | Score |
|---|---|
| 110K LOC .NET 4.8 legacy web app | 2.3/10 |
| 190K LOC .NET 4.8 legacy web app | 2.3/10 |
| 95K LOC .NET 10 greenfield web app | 9.1/10 |
| 250K LOC .NET 10 greenfield library | 9.2/10 |
## The Reveal
The two at the top, the ones that scored 2.3, are 100% human-written.
The two at the bottom are 99%+ written by AI agents.
Let that sit for a second.
## What 2.3/10 Looks Like
The 2.3 didn’t come from one bad afternoon. It came from years of decisions.
Plaintext passwords sitting in `Web.config`. ELMAH wide open to anyone who finds the URL. Anti-forgery tokens on just 24% of POST endpoints. `throw ex` everywhere, destroying stack traces. A 64KB `BaseController` doing the work of a small framework. A 125KB `BaseProfileController` God class. Radically outdated libraries holding the front end together. Zero unit tests across the entire application.
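To make one of those concrete: `throw ex` inside a catch block re-throws with the stack trace reset to the rethrow site, so every log afterward points at the handler instead of the actual failure. A minimal, self-contained sketch (the class and method names are made up for illustration):

```csharp
using System;

class RethrowDemo
{
    // Stand-in for real application code that fails deep in a call chain.
    static void DoWork() => throw new InvalidOperationException("original failure");

    static void Main()
    {
        try
        {
            DoWork();
        }
        catch (Exception ex)
        {
            // `throw ex;` re-throws with the stack trace reset to this line,
            // so the real origin of the failure disappears from the logs.
            // Bare `throw;` re-throws the same exception, trace intact.
            throw ex;
        }
    }
}
```

The one-character fix, bare `throw;`, is exactly the kind of rule a written standard plus an analyzer catches automatically.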
These weren’t AI mistakes. Senior engineers wrote this. Code reviewers approved it. Releases shipped. The codebase grew, and nobody pumped the brakes because the standards that would have flagged any of it didn’t exist.
The 2.3 isn’t a failure of skill. It’s a failure of standards.
## What 9.1/10 Looks Like
The 9.1 didn’t come from a better model. It came from an environment where the model couldn’t fail quietly.
1,781 tests, zero failures. Every secret in AWS Secrets Manager. RFC 7807 problem details on every error response. Nullable reference types enforced everywhere. Zero TODOs, zero pragmas, zero empty catch blocks. 49 centrally managed packages, all consistent, all current. A global exception handler that strips production details before they ever leak.
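For a sense of what that last piece can look like, here’s a minimal sketch of a global exception handler in modern ASP.NET Core, using the built-in `IExceptionHandler` hook and `ProblemDetails` (RFC 7807). It’s an illustration of the pattern, not the app’s actual code:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Diagnostics;
using Microsoft.AspNetCore.Http;
using Microsoft.AspNetCore.Mvc;
using Microsoft.Extensions.Hosting;

// Turns every unhandled exception into an RFC 7807 problem-details response
// and keeps exception internals out of production responses.
public sealed class GlobalExceptionHandler(IHostEnvironment env) : IExceptionHandler
{
    public async ValueTask<bool> TryHandleAsync(
        HttpContext context, Exception exception, CancellationToken cancellationToken)
    {
        var problem = new ProblemDetails
        {
            Status = StatusCodes.Status500InternalServerError,
            Title = "An unexpected error occurred.",
            // Surface the real message only outside production.
            Detail = env.IsProduction() ? null : exception.Message
        };

        context.Response.StatusCode = problem.Status.Value;
        await context.Response.WriteAsJsonAsync(problem, cancellationToken);
        return true; // handled; nothing leaks past this point
    }
}

// Wired up once in Program.cs:
//   builder.Services.AddExceptionHandler<GlobalExceptionHandler>();
//   builder.Services.AddProblemDetails();
//   app.UseExceptionHandler();
```

Because it sits at the pipeline level, no individual endpoint can quietly opt out. The standard is structural, not voluntary.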
The agent didn’t decide to do any of that. The standards did. The agent’s instructions file pointed to architecture docs, naming conventions, error-handling patterns, and security requirements. The agent loaded them at the start of every session and checked its work against them. When the agent drifted, the linters caught it. When the linters missed something, code review caught it.
The variable that produced 9.1 wasn’t the model. It was the bar.
## Quality Is Not a Human Trait
Quality is not a property of who types the code. It’s a property of the standards in force when the code gets written.
A human without standards produces 2.3/10. An AI with standards produces 9.1/10. Same rubric. Different bar.
This is uncomfortable, and it should be. The “AI writes slop, humans write craft” frame is convenient. It protects the assumption that experience and judgment automatically produce quality output. The four scorecards on my desk say experience and judgment, on their own, do not.
If you can look at AI-generated code and recognize slop, run the same audit on the last codebase you inherited. Ask what it would score against the same rubric. Most teams won’t like the answer.
The teams shipping clean code with AI aren’t using better models. They’re using the same models everyone else has. They already knew what good looked like, wrote it down, and put it in the path of every line of code their agents produce.
AI didn’t lower the bar for software quality. It made the absence of a bar visible.
If your team has standards, AI will ship them faster than you ever could. If your team doesn’t, AI will reveal that fact at the speed of light.
The four codebases on the scorecards weren’t graded on who wrote them. They were graded on what the writers were held to. Two had a high bar. Two didn’t.
That’s the whole story.
## Full Receipts
### 110K LOC .NET 4.8 legacy web app
| Category | Score |
|---|---|
| Architecture & SoC | 2/10 |
| SOLID Compliance | 1/10 |
| Code Quality | 3/10 |
| Error Handling & Logging | 3/10 |
| Security | 2/10 |
| Testability | 1/10 |
| Async/Await | 1/10 |
| Configuration | 4/10 |
| API/Route Design | 3/10 |
| Documentation | 3/10 |
| Dependencies | 3/10 |
| Modernization Readiness | 2/10 |
| Overall | 2.3/10 |
### 190K LOC .NET 4.8 legacy web app
| Category | Score |
|---|---|
| Architecture & SOLID | 2/10 |
| Separation of Concerns | 2/10 |
| Error Handling | 2/10 |
| Security | 3/10 |
| Testability | 1/10 |
| Async / Performance | 3/10 |
| Dependency Management | 2/10 |
| Code Maintainability | 3/10 |
| Overall | 2.3/10 |
### 95K LOC .NET 10 greenfield web app
| Category | Score |
|---|---|
| Architecture & Design | 9.5/10 |
| Testing | 9.0/10 |
| Security | 8.5/10 |
| Code Quality | 10.0/10 |
| Performance | 9.0/10 |
| Dependencies | 9.5/10 |
| CI/CD | 9.0/10 |
| Documentation | 8.5/10 |
| Frontend | 8.0/10 |
| Overall | 9.1/10 |
### 250K LOC .NET 10 greenfield library
| Category | Score |
|---|---|
| Architecture & Design | 9.5/10 |
| Code Quality | 9.0/10 |
| Testing | 8.5/10 |
| Security | 10.0/10 |
| Documentation | 8.5/10 |
| CI/CD | 9.5/10 |
| Dependency Management | 10.0/10 |
| Error Handling | 9.0/10 |
| Performance | 9.0/10 |
| Maintainability | 9.0/10 |
| Overall | 9.2/10 |