Two posts on X kicked this off. First, Cory House shared a scorecard prompt he was running on his codebase. I built my own version that weekend: fully agent-driven, qualitative scoring against a fixed rubric, the same nine-dimension shape that produced the numbers in I Scored Four Codebases. The Humans Lost. Second, a clip of Uncle Bob saying he no longer reads code. He measures it.
Two ideas pulling in the same direction. One says: let an agent score your code against a rubric you trust. The other says: the signal you are hunting for is in the metrics, not the lines. The combination sent me looking at what code metrics could actually tell us about an AI-generated codebase, and whether they could be made deterministic enough to take the human reviewer out of the loop entirely.
Most of the conversation around both posts split predictably. The static-analysis camp claimed vindication. The craftsmanship camp claimed senility. Both missed it. The interesting question is not read or measure. It was never read or measure. It is human or agent. Once you accept that the reviewer does not have to be you, the rest of the argument falls out.
The Reframe
Reading code was never the goal. The goal was a confident answer to two questions: is this clean, and does it work. Reading was the cheapest tool we had to answer them. It is no longer the cheapest tool, and for AI-generated code it is no longer even a good tool. Volume defeats it. Attention does not scale. A senior engineer reviewing thirty AI-generated PRs a day is not doing review. They are skimming.
The replacement is not metrics. The replacement is an agent running an explicit rubric. Metrics are part of the rubric, not all of it. Architecture, security, testing strategy, error handling, documentation, dependency hygiene, and async patterns are still scored qualitatively against fixed anchors. The agent reads the code and applies the anchors. Reading still happens. It just does not have to be human reading.
That is the move. Human or agent. Once it is an agent, measurement becomes part of the rubric, not a substitute for the rubric.
The Rubric
Nine dimensions. Architecture and SOLID. Code Quality. Testing. Security. Error Handling. Documentation. Dependency Management. Performance and Async. Maintainability.
Two are deterministic today: Code Quality and Maintainability. They compute from a CSV of MSBuild-emitted metrics. The other seven are qualitative, scored against fixed 0-to-10 anchors (10 = best-in-class, 8 = strong, 6 = adequate, 4 = weak, 2 = poor, 0 = absent or harmful). Reading still happens. The agent reads source files to apply the anchors, names the offenders, and cites the evidence.
The roadmap is more dimensions deterministic over time. The next two pieces under active design are test-coverage metrics (counts and ratios for unit, integration, and architecture tests) and architecture-test enforcement. The qualitative slice shrinks. It does not disappear, but it shrinks.
The output is the part that turns the rubric into a conversation. Five sections: a scorecard table, deterministic detail with raw measurements, top offenders by metric (worst five per metric, archetype-tagged), top three issues with score-lift projections, and projected scores after addressing them. The score-lift projection is the interesting one. Refactor these three classes, and Code Quality goes from 6.3 to 7.3. That is no longer a vibe. That is a target with a delta and a list of files.
Last week I had Claude score four codebases against this rubric. All qualitative. This is the version where two of the nine compute themselves.
Three Signals, Not One
A single threshold misses different shapes of badness. Each metric is evaluated on three signals.
Population. The percentage of classes past a fixed threshold. Catches broadly degraded codebases.
Tail. The 90th-percentile complexity ratio (or 10th-percentile MI). Catches the shape of the worst slice, not just outliers.
Extreme. The count of classes past an extreme threshold, expressed as a rate. Catches spot-degraded codebases where most code is fine but a handful of classes are unsalvageable. This is the signal that distinguishes needs refactoring from needs replacement.
A codebase scoring 6/6/2 is broadly fine but has landmines. 2/6/6 is broadly degraded but salvageable. 2/2/2 is broken throughout. The three signals tell different stories that aggregate scores hide.
This move is what makes “score the code” defensible against Goodhart, before Goodhart even gets named. To game three signals at once you have to actually fix the code. One number is a target. Three numbers in tension are a description.
Archetype Awareness
A Validator at decomposition ratio 5 is not the same problem as an Entity at decomposition ratio 5. The skill knows. It tags every class with an archetype based on naming patterns (Repository, Service, Controller, Validator, Mapper, Entity, DTO, Config, DbContext, and so on), and verifies the ambiguous ones by reading the source file.
Headline scores are archetype-agnostic. The prose is not. “This controller has too much logic” hits differently than “this validator is doing imperative work and should be split.” The number is the same. The fix is not.
Archetypes are how the rubric earns its keep against the obvious “context-free metrics are dumb” critique. The metrics are not context-free. The agent that interprets them is not context-free either.
What This Misses
Four counters worth taking seriously.
1. Goodhart’s law. Metrics as targets become bad metrics. Three signals per metric, not one. Population, tail, extreme. To game all three you have to actually fix the code. And the seven qualitative dimensions have no numerical targets to game. They are scored against fixed anchors by an agent reading the source. There is no number to optimize toward. There is a description to live up to.
2. You cannot measure architecture or design intent. Today, in this rubric, that is true. Architecture is qualitative. The agent reads the source, applies the anchors, names the offenders, and cites the evidence. That is reading. Just not human reading.
The bigger answer is that this counter has a shorter shelf life than people assume. Architecture tests (NetArchTest in .NET, ArchUnit elsewhere) already enforce a meaningful chunk of what architecture means in practice. Layering. Dependency direction. Public-API contracts. Naming conventions. Milan Jovanovic’s piece on architecture tests is the best short read on this. These tests run in CI. They fail builds. They are not opinions, and they are not the agent’s read of the rubric. They are deterministic, just like cyclomatic complexity is deterministic.
What architecture tests do not capture is the genuinely subjective slice: is the abstraction right for the problem, is the layering pragmatic, does the design fit the team. That part stays qualitative, maybe forever. But the structural slice (the part that used to show up in PR comments as “this controller should not reference the repository directly”) is moving into the deterministic column. It is on the roadmap for this rubric. Enforcing architectural rules by reading code is the next thing that disappears from human reviewers’ jobs.
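To make the deterministic claim concrete, here is roughly the shape of such a test, using NetArchTest with xUnit. A minimal sketch, not this rubric’s actual enforcement: the assembly name and namespaces are hypothetical placeholders for whatever layering a codebase declares.

```csharp
using System.Reflection;
using NetArchTest.Rules;
using Xunit;

public class ArchitectureTests
{
    [Fact]
    public void Controllers_do_not_reference_the_persistence_layer()
    {
        // "MyApp.Web" and the namespaces below are placeholders for your own layering.
        var result = Types.InAssembly(Assembly.Load("MyApp.Web"))
            .That().ResideInNamespace("MyApp.Web.Controllers")
            .ShouldNot().HaveDependencyOn("MyApp.Infrastructure.Persistence")
            .GetResult();

        // A violation fails the build in CI. No reviewer comment required.
        Assert.True(result.IsSuccessful);
    }
}
```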
3. Tests only prove what you thought to test. Novel bugs slip past. Two responses. First, test metrics themselves are next on the deterministic roadmap: coverage, integration coverage, and architecture-test counts. Second, the outcome check is the failsafe. The human still has to verify the software does what it is intended to do. That is where novel-bug risk gets caught, at the level of behavior, not lines.
4. Calibration is fragile. New codebases will break the thresholds. The thresholds were calibrated against four codebases. The point was never that they are final. The point is the ranking held. The system put the bad codebases at the bottom and the good ones at the top with a tighter spread than a human reviewer produces. New codebases extend the calibration set. They do not break the system.
A Note on Status
I built this over a weekend. The team is not running it in production. Two of nine dimensions are deterministic; seven are not. The roadmap is more dimensions deterministic over time, particularly testing metrics and architecture tests, and the framing of the rubric will keep evolving.
None of that changes the argument. The argument is structural. The reviewer does not have to be you. That does not depend on the package shipping. It depends on the rubric being scoreable by an agent, and four codebases say it is.
The Last Thing You Verify
There is one dimension that is not in the rubric: does the software do what it is intended to do? You can pass every test, hit every metric, and still fail if the software does not deliver value. Tests verify the rules you wrote down. They do not verify you wrote down the right rules. Metrics verify the shape of the code. They do not verify the code is solving the right problem.
That is the one verification step that does not move to the agent. The human role does not disappear. It moves up the stack. From did this PR introduce a complexity bomb to did this feature change the metric we actually cared about. From line-by-line review to outcome verification. From craftsmanship police to product judgment.
Quality is not a human trait. Reviewing AI-generated code is not a human job either. The job that is left is making sure the software does what it should. That job, for now, is still yours.
Appendix: The Scoring System
Skip this section if you trust the rubric and just want the argument. Read it if you want the receipts.
A. The Nine Dimensions
| # | Dimension | Type | What it covers |
|---|---|---|---|
| 1 | Architecture and SOLID | Qualitative | Layering, boundaries, interfaces, dependency inversion, and single responsibility. |
| 2 | Code Quality | Deterministic | Concentrated complexity and single-method bombs. |
| 3 | Testing | Qualitative | Coverage, test quality, and integration coverage. |
| 4 | Security | Qualitative | Secrets, auth, input validation, and CVE exposure. |
| 5 | Error Handling | Qualitative | Exception strategy, logging, and observability. |
| 6 | Documentation | Qualitative | README, ADRs, runbooks, and doc hygiene. |
| 7 | Dependency Management | Qualitative | Currency, central management, and version consistency. |
| 8 | Performance and Async | Qualitative | Async usage, query efficiency, and caching. |
| 9 | Maintainability | Deterministic | Maintainability index distribution and bottom-tail health. |
The overall score is the unweighted mean of the nine dimensions, reported to one decimal place.
B. Qualitative Scoring Anchors
The same anchors apply across every qualitative dimension and across the deterministic ones, which compute to the same 0-to-10 scale via threshold tables instead of judgment.
| Score | Meaning |
|---|---|
| 10 | Best-in-class. Industry exemplar. |
| 8 | Strong. Minor gaps, no systemic issues. |
| 6 | Adequate. Inconsistent in places but functional. |
| 4 | Weak. Real problems that will compound under change. |
| 2 | Poor. Will block scaling, onboarding, or safe modification. |
| 0 | Absent or actively harmful. |
C. The Three Deterministic Metrics
Three per-class metrics, derived from a CSV emitted by an MSBuild target.
Decomposition ratio (class_cc / member_count). Captures concentrated complexity. A class with cyclomatic complexity 100 spread across 50 members (ratio 2.0) is a healthy DTO or dispatcher. The same total spread across 3 members (ratio 33) is a god class waiting to happen. The ratio captures something raw totals cannot: whether complexity is decomposed or concentrated.
Max member cyclomatic complexity (max(member_cc)). Captures single-method complexity bombs. A 50-line method with CC of 30 is a problem regardless of whether the rest of the class is clean. This metric finds the worst method per class.
Maintainability Index (Microsoft’s MI, taken from the type-level row). The composite formula combining cyclomatic complexity, Halstead volume, and lines of code into a 0-to-100 score where higher is healthier. Used to capture overall headroom.
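The derivation of the first two is mechanical. A minimal sketch, assuming a hypothetical per-member CSV with columns TypeName, MemberName, MemberCyclomaticComplexity; the MI is not derived here because it comes straight off the type-level row the metrics output already provides.

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Linq;

// Hypothetical per-member row; the real CSV layout may differ.
record MemberRow(string TypeName, string MemberName, int MemberCc);

// Per-class view the three-signal model consumes.
record ClassMetrics(string TypeName, double DecompositionRatio, int MaxMemberCc);

static class PerClassMetrics
{
    public static List<ClassMetrics> FromCsv(string path) =>
        File.ReadLines(path)
            .Skip(1)                                              // header row
            .Select(line => line.Split(','))
            .Select(c => new MemberRow(c[0], c[1],
                int.Parse(c[2], CultureInfo.InvariantCulture)))
            .GroupBy(r => r.TypeName)
            .Select(g => new ClassMetrics(
                g.Key,
                (double)g.Sum(r => r.MemberCc) / g.Count(),       // class_cc / member_count
                g.Max(r => r.MemberCc)))                          // worst single method
            .ToList();
}
```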
D. The Three-Signal Model
Each metric is evaluated on three signals rather than a single number.
- Population. How widespread is the issue? Percentage of classes past a fixed threshold.
- Tail. How bad does typical bad code get? The p90 complexity ratio, or the p10 MI.
- Extreme. How many outright disasters exist? Count of classes past an extreme threshold, expressed as a rate.
A codebase scoring 6/6/2 is broadly fine but has landmines. 2/6/6 is broadly degraded but salvageable. 2/2/2 is broken throughout. Three signals tell different stories that aggregate scores hide.
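A minimal sketch of the signal computation, assuming a flat list of per-class values for one metric. The percentile is a plain nearest-rank estimate, and the thresholds in the comments are examples pulled from the tables below, not the only valid ones.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

record Signals(double PopulationPct, double Tail, double ExtremePct);

static class SignalModel
{
    // For decomposition ratio: Evaluate(ratios, populationThreshold: 4, extremeThreshold: 15).
    // For MI, pass higherIsWorse: false so the comparisons and the percentile flip.
    public static Signals Evaluate(IReadOnlyList<double> values,
                                   double populationThreshold,
                                   double extremeThreshold,
                                   bool higherIsWorse = true)
    {
        int n = values.Count;
        double populationPct = 100.0 * values.Count(v =>
            higherIsWorse ? v > populationThreshold : v < populationThreshold) / n;
        double extremePct = 100.0 * values.Count(v =>
            higherIsWorse ? v > extremeThreshold : v < extremeThreshold) / n;

        // Tail: p90 when higher is worse, p10 for MI where lower is worse.
        double tail = Percentile(values, higherIsWorse ? 0.90 : 0.10);
        return new Signals(populationPct, tail, extremePct);
    }

    static double Percentile(IReadOnlyList<double> values, double p)
    {
        var sorted = values.OrderBy(v => v).ToList();
        int rank = (int)Math.Ceiling(p * sorted.Count) - 1;
        return sorted[Math.Clamp(rank, 0, sorted.Count - 1)];
    }
}
```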
E. Threshold Tables (Full Receipts)
These are the load-bearing numbers that turn raw measurements into 0-to-10 scores.
Decomposition ratio
| Score | % of classes > 4 (population) | p90 (tail) | classes > 15, as % of n (extreme) |
|---|---|---|---|
| 10 | ≤ 1% | ≤ 2.0 | ≤ 0.1% |
| 8 | ≤ 3% | ≤ 2.5 | ≤ 0.5% |
| 6 | ≤ 6% | ≤ 3.5 | ≤ 1.0% |
| 4 | ≤ 10% | ≤ 5.0 | ≤ 2.0% |
| 2 | ≤ 15% | ≤ 7.0 | ≤ 4.0% |
| 0 | > 15% | > 7.0 | > 4.0% |
Max member cyclomatic complexity
| Score | % of classes > 15 (population) | p90 (tail) | classes > 30, as % of n (extreme) |
|---|---|---|---|
| 10 | ≤ 0.5% | ≤ 4 | ≤ 0.2% |
| 8 | ≤ 2% | ≤ 6 | ≤ 0.6% |
| 6 | ≤ 4% | ≤ 9 | ≤ 1.2% |
| 4 | ≤ 7% | ≤ 12 | ≤ 2.5% |
| 2 | ≤ 10% | ≤ 16 | ≤ 4.0% |
| 0 | > 10% | > 16 | > 4.0% |
Maintainability Index (higher is better, so the tail signal uses p10 instead of p90)
| Score | % of classes < 60 (population) | p10 (tail) | classes < 40, as % of n (extreme) |
|---|---|---|---|
| 10 | ≤ 1% | ≥ 75 | ≤ 0.2% |
| 8 | ≤ 3% | ≥ 70 | ≤ 0.5% |
| 6 | ≤ 6% | ≥ 65 | ≤ 1.2% |
| 4 | ≤ 10% | ≥ 58 | ≤ 2.5% |
| 2 | ≤ 15% | ≥ 52 | ≤ 4.0% |
| 0 | > 15% | < 52 | > 4.0% |
F. How Metrics Aggregate
Each metric’s score is the mean of its three signal scores, to one decimal.
- Code Quality (Dimension 2) = mean of the decomposition score and the max member CC score.
- Maintainability (Dimension 9) = MI score (the only metric folded into this dimension).
The split reflects different concerns. Code Quality is about complexity shape. Maintainability is about change capacity.
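A minimal sketch of the lookup and the aggregation, hard-coding the decomposition-ratio limits from the section E table. The MI tail column uses “at least” limits rather than “at most,” so its lookup flips the comparison; that variant is omitted here.

```csharp
using System;

static class Scoring
{
    // One signal: limits are the "at most" values for scores 10, 8, 6, 4, 2 in order;
    // past the last limit the score is 0.
    static double ScoreSignal(double value, double[] limits)
    {
        int[] scores = { 10, 8, 6, 4, 2 };
        for (int i = 0; i < limits.Length; i++)
            if (value <= limits[i]) return scores[i];
        return 0;
    }

    // Decomposition-ratio thresholds copied from the section E table.
    public static double DecompositionScore(double populationPct, double tailP90, double extremePct)
    {
        double pop     = ScoreSignal(populationPct, new[] { 1.0, 3.0, 6.0, 10.0, 15.0 });
        double tail    = ScoreSignal(tailP90,       new[] { 2.0, 2.5, 3.5, 5.0, 7.0 });
        double extreme = ScoreSignal(extremePct,    new[] { 0.1, 0.5, 1.0, 2.0, 4.0 });
        return Math.Round((pop + tail + extreme) / 3.0, 1);   // mean of three signals, one decimal
    }

    // Dimension 2: mean of the two metric scores.
    public static double CodeQuality(double decompositionScore, double maxMemberCcScore)
        => Math.Round((decompositionScore + maxMemberCcScore) / 2.0, 1);
}
```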
G. Class Archetype Tagging
The skill tags every class with an archetype based on naming patterns and namespace. The headline scores are archetype-agnostic; the prose interpretation is archetype-aware. Why archetypes matter: a controller with decomposition 5 is concerning; an entity with the same ratio is alarming; a DTO with that ratio means somebody put logic in a data class.
| Archetype | Naming patterns | Threshold expectation |
|---|---|---|
| Entity | DAL classes not matching other patterns | Near-zero logic. Any single-method CC > 10 is severe. |
| DTO / Model | *Request, *Response, *Dto, *Model, *ViewModel, and *Result | Same as Entity. Should have no logic. |
| Config | *Options, *Settings, and *Config (non-DAL) | Same. Should have no logic. |
| Repository | *Repository | Max member CC up to 15 is normal (complex queries). Coupling naturally elevated. |
| Mapper | *Mapper, *Mapping | Inherently branchy. Max member CC up to 20 acceptable. |
| Validator | *Validator | Two shapes: declarative (FluentValidation chains) or imperative (reclassified to Service). |
| Service | *Service, *Manager, *Job, *Worker, *Strategy, *Policy, and *Rule | Decomp over 4 = fat methods. Max member CC over 10 = insufficient decomposition. |
| Controller | *Controller, *Hub | Should be thin. Max member CC over 8 or decomposition over 3 is concerning. |
| Infrastructure | *Filter, *Handler, *Middleware, *Attribute, *Provider, *Resolver, *Decorator, and *Interceptor | Max member CC up to 12 acceptable. |
| Helper / Extension | *Extensions, *Helper | Variable. Large helpers should be split by concern. |
| Builder / Factory | *Builder, *Factory | Max member CC up to 10. |
| DbContext | *Context in DAL | Coupling is meaningless (touches every entity by design). |
| God / Legacy | Reclassified during verification | Auto-promoted to top-3 issue regardless of numerical rank. |
Two-pass tagging: heuristic first (naming rules, first match wins), then verification by reading source files for classes whose ratios are anomalous for their tag.
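A minimal sketch of that first pass, assuming tagging works purely on type-name suffixes. The suffix lists mirror the table above; the second pass, where the agent reads the source of anomalous classes and reclassifies them, is judgment and is not shown.

```csharp
using System.Linq;

static class ArchetypeTagger
{
    // First match wins, so more specific suffixes come before broader ones.
    static readonly (string Archetype, string[] Suffixes)[] Rules =
    {
        ("DTO",            new[] { "Request", "Response", "Dto", "Model", "ViewModel", "Result" }),
        ("Config",         new[] { "Options", "Settings", "Config" }),
        ("Repository",     new[] { "Repository" }),
        ("Mapper",         new[] { "Mapper", "Mapping" }),
        ("Validator",      new[] { "Validator" }),
        ("Service",        new[] { "Service", "Manager", "Job", "Worker", "Strategy", "Policy", "Rule" }),
        ("Controller",     new[] { "Controller", "Hub" }),
        ("Infrastructure", new[] { "Filter", "Handler", "Middleware", "Attribute", "Provider",
                                   "Resolver", "Decorator", "Interceptor" }),
        ("Helper",         new[] { "Extensions", "Helper" }),
        ("Builder",        new[] { "Builder", "Factory" }),
        ("DbContext",      new[] { "Context" }),
    };

    public static string Tag(string typeName, bool inDataAccessLayer)
    {
        foreach (var (archetype, suffixes) in Rules)
            if (suffixes.Any(s => typeName.EndsWith(s)))
                return archetype;

        // Fallback from the table: DAL classes matching nothing else are Entities.
        return inDataAccessLayer ? "Entity" : "Unclassified";
    }
}
```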
H. Calibration Baseline
The threshold tables were calibrated against the same four codebases scored qualitatively in I Scored Four Codebases. The Humans Lost. Each codebase was scored qualitatively first; the threshold tables were then fit to the measured ratios. The published version uses the same anonymized descriptors as the prequel.
| Codebase | Type | Qualitative score | Deterministic score |
|---|---|---|---|
| ~250K LOC .NET 10 greenfield library | active | 9.2 | 9.8 |
| ~95K LOC .NET 10 greenfield web app | active | 9.1 | 6.3 |
| ~190K LOC .NET 4.8 legacy web app | legacy | 2.3 | 4.0 |
| ~110K LOC .NET 4.8 legacy web app | legacy | 2.3 | 1.8 |
The ranking holds across both systems. Two rows do interesting work in this table.
The 95K greenfield web app scored 9.1 qualitatively and 6.3 deterministically. A 2.8-point compression. The deterministic system is more honest about complexity shape than a human reviewer rounding upward on a codebase that “feels good.” The qualitative score answers should we keep this? The deterministic score answers how healthy is the shape of this code right now? They are different questions and they should not always agree.
The two legacy web apps tied at 2.3 qualitatively but separated by 2.2 points deterministically (4.0 vs 1.8). The deterministic system sees concentrated complexity that the qualitative score smoothed over. Same qualitative verdict, different remediation cost.
Both of those are arguments for the deterministic system. They are evidence that measurement catches what judgment-against-anchors compresses out.
I. Filters and Output
Before computation, the skill excludes test projects, Aspire orchestration (*.AppHost, *.ServiceDefaults), benchmarks and samples, generated code (Migrations, Razor view types, and anonymous types), composition roots (Program, Startup), and classes with zero members.
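A minimal sketch of those exclusions as a single predicate. The suffix and namespace conventions are the ones listed above; anything project-specific (source generators, custom generated-code markers) would need its own rule.

```csharp
using System.Linq;

static class ScopeFilter
{
    static readonly string[] ExcludedProjectSuffixes =
        { ".Tests", ".AppHost", ".ServiceDefaults", ".Benchmarks", ".Samples" };

    public static bool Include(string projectName, string typeNamespace, string typeName, int memberCount)
    {
        if (memberCount == 0) return false;                                         // classes with zero members
        if (ExcludedProjectSuffixes.Any(s => projectName.EndsWith(s))) return false;
        if (typeNamespace.Contains(".Migrations")) return false;                    // generated EF migrations
        if (typeNamespace.StartsWith("AspNetCoreGeneratedDocument")) return false;  // compiled Razor view types
        if (typeName is "Program" or "Startup") return false;                       // composition roots
        if (typeName.StartsWith("<")) return false;                                 // anonymous / compiler-generated types
        return true;
    }
}
```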
Output is five sections: scorecard table, deterministic detail with raw measurements, top offenders by metric (worst 5 per metric, archetype-tagged), top three issues with score-lift projections, and projected scores after addressing the named offenders.
The score-lift projection is the part that turns numbers into action: refactor CustomersController and ContractCreationService, and Code Quality goes from 6.3 to 7.3.
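A minimal sketch of how a projection like that can be computed: re-score the metric with the named offenders assumed fixed, modeled crudely here as capping them at a healthy value. The scoring delegate stands in for the full pipeline (signals, then the section F lookup); the class names in the prose above are illustrative, not prescriptive.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

static class ScoreLift
{
    // perClassValues: one metric's value per class (e.g. decomposition ratio keyed by type name).
    // scoreMetric: whatever turns a distribution into a 0-10 score.
    public static (double Before, double After) Project(
        IReadOnlyDictionary<string, double> perClassValues,
        IEnumerable<string> offendersToFix,
        double healthyValue,
        Func<IReadOnlyList<double>, double> scoreMetric)
    {
        double before = scoreMetric(perClassValues.Values.ToList());

        var fixedSet = new HashSet<string>(offendersToFix);
        var projected = perClassValues
            .Select(kv => fixedSet.Contains(kv.Key) ? healthyValue : kv.Value)
            .ToList();

        return (before, scoreMetric(projected));
    }
}
```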