LN: CodeRabbit (2025) — State of AI vs Human Code Generation Report

Bibliographic Reference

Citation: Loker, D. & CodeRabbit. (2025, December 17). State of AI vs Human Code Generation Report. CodeRabbit Blog. https://www.coderabbit.ai/blog/state-of-ai-vs-human-code-generation-report
Author: David Loker | Read time: ~8 min | Type: Industry empirical report


Pass 1 — Bird’s Eye View (5 Cs)

| C | Assessment |
| --- | --- |
| Category | Empirical industry report / benchmarking study |
| Context | CodeRabbit (automated code review platform) analysed 470 open-source GitHub PRs (320 AI co-authored + 150 human-only) to quantify quality differences between AI and human code contributions |
| Correctness | Industry-authored; no formal peer review, but the methodology is transparent and the findings are consistent with 2025 postmortem literature. The authors acknowledge that AI-labelled PRs were identified via signal detection, not direct confirmation |
| Contributions | (1) First large-scale quantitative comparison of AI vs human PR quality across 10 issue categories; (2) Establishes per-100-PR rate ratio methodology; (3) Identifies readability as the single largest AI-specific failure mode (3×); (4) Provides 7 actionable mitigation strategies |
| Clarity | Excellent. Dense with numbers but well-organised. Findings are clearly separated from interpretation. |

Relevance: ⭐⭐⭐⭐⭐

Directly relevant to PUMA: establishes empirical baselines for AI agent output quality that inform PUMA’s QA Agent design (BMAD), HITL thresholds, and the PUMA SmartPMO automated review pipeline.


Pass 2 — Content

Study Scope & Methodology

Dataset:

| Parameter | Value |
| --- | --- |
| Total PRs analysed | 470 |
| AI co-authored PRs | 320 |
| Human-only PRs | 150 |
| Normalisation method | Per-100-PR statistical rate ratios |
| Source | Open-source GitHub pull requests |

Methodological Caveat

AI-authored PRs were identified via signal detection, not direct confirmation. The authors acknowledge they cannot guarantee all human-labelled PRs were exclusively human-authored. Results should be interpreted as “AI co-authored vs primarily human” rather than a strict binary.

Context Framing: A separate benchmark referenced in the report found that while PRs per author increased 20% year-over-year (driven by AI assistance), incidents per PR increased 23.5% simultaneously — establishing the motivation for this study.
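
The two percentages above compound when read per author. The following is a rough worked calculation, not a figure from the report; it assumes both changes describe the same population over the same period, which the note does not confirm.

```latex
\[
\frac{\text{incidents}}{\text{author}}
  = \frac{\text{PRs}}{\text{author}} \times \frac{\text{incidents}}{\text{PR}}
\quad\Longrightarrow\quad
1.20 \times 1.235 = 1.482
\]
```

Under that assumption, incidents per author would rise by roughly 48% year-over-year, which is larger than either headline number taken alone.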


10 Key Findings (AI vs Human, per 100 PRs)

Finding 1 — Overall Issue Volume

| Metric | AI PRs | Human PRs | Ratio |
| --- | --- | --- | --- |
| Issues per PR | 10.83 | 6.45 | ~1.7× |

High-issue outliers are more prevalent in AI PRs, creating elevated review workloads. The distribution is skewed rather than uniform, meaning some AI PRs account for disproportionate review burden.
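
The ratio in the table above can be reproduced with a few lines of Python. This is a minimal sketch of a per-100-PR rate and a rate ratio; the function names are hypothetical, and the report's exact statistical procedure is not described in this note.

```python
def rate_per_100(issue_count: int, pr_count: int) -> float:
    """Issues per 100 PRs for one group of pull requests."""
    if pr_count <= 0:
        raise ValueError("pr_count must be positive")
    return 100.0 * issue_count / pr_count


def rate_ratio(ai_rate: float, human_rate: float) -> float:
    """How many times higher the AI rate is than the human rate."""
    if human_rate <= 0:
        raise ValueError("human_rate must be positive")
    return ai_rate / human_rate


# Per-PR averages reported under Finding 1 (10.83 vs 6.45).
overall = rate_ratio(10.83, 6.45)
print(f"{overall:.2f}x")  # prints 1.68x, which the note rounds to ~1.7x
```

Because both groups are normalised before dividing, the unequal group sizes (320 vs 150 PRs) do not bias the ratio.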


Finding 2 — Severity Escalation

  • AI PRs show 1.4–1.7× more critical and major findings compared to human PRs
  • The severity escalation means AI issues are not just more numerous but more consequential

Finding 3 — Logic & Correctness Issues

  • 75% more common in AI-generated code
  • Categories:
    • Business logic mistakes
    • Incorrect dependencies
    • Flawed control flow
    • Misconfigurations
  • Flagged as the most expensive category to fix — most likely to cause downstream incidents and production failures

High-Risk Category

Logic and correctness errors are expensive because they pass surface-level checks (compilation, unit tests) but manifest as subtle functional failures in production.


Finding 4 — Readability Issues

  • 3× spike in AI contributions — the single largest difference across the entire dataset
  • Characteristics:
    • Code looks consistent but violates local naming patterns
    • Clarity conventions drift toward generic training-data defaults
    • Structural norms (file organisation, section grouping) are inconsistent with repo idioms

Key Insight

Readability is not about formatting (which is separately tracked). It is about cognitive load — AI code requires more reviewer effort to understand even when syntactically valid.


Finding 5 — Error Handling Gaps

  • Nearly 2× more common in AI PRs
  • Specific deficiencies:
    • Missing null checks
    • Absent early returns
    • Inadequate guardrails
    • Incomplete exception handling logic
  • Directly correlated with real-world outage causes — postmortems from 2025 incidents frequently cite missing error paths
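
To make the deficiencies listed above concrete, here is an illustrative Python sketch contrasting an unguarded function with a guarded one. The `order` structure and both function names are invented for this example and do not come from the report.

```python
from typing import Optional


def discount_unguarded(order: dict) -> float:
    # Assumes every key is present and non-null; fails on partial data.
    return order["customer"]["tier"]["discount"] * order["total"]


def discount_guarded(order: Optional[dict]) -> float:
    """Return the discount amount, or 0.0 when required data is missing."""
    if not order:
        return 0.0  # early return: nothing to price
    total = order.get("total")
    if total is None:
        return 0.0  # null check on the amount
    # Tolerate a missing or null customer / tier instead of raising.
    tier = (order.get("customer") or {}).get("tier") or {}
    rate = tier.get("discount", 0.0)
    if not 0.0 <= rate <= 1.0:
        # Guardrail: reject values that would silently corrupt pricing.
        raise ValueError(f"discount rate out of range: {rate}")
    return rate * total
```

Whether missing data should yield `0.0` or raise is a business decision; the point of the sketch is only that each failure path is handled explicitly.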

Finding 6 — Security Vulnerabilities

  • Up to 2.74× higher in AI-generated code
  • Primary patterns: improper password handling and insecure object references
  • No vulnerability type was unique to AI — all categories showed amplification rather than novel attack surfaces
  • Models recreate legacy or outdated security patterns from training data when not given explicit security guidance
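
As an illustration of the password-handling point, the sketch below uses only the Python standard library to store a salted, slow hash instead of a fast unsalted digest. The legacy pattern shown in the comment is an assumed example of "outdated", not one quoted from the report, and the iteration count is an assumption to tune per deployment.

```python
import hashlib
import hmac
import secrets

# Assumption: a work factor in the range commonly recommended for
# PBKDF2-HMAC-SHA256; adjust to your own latency budget.
ITERATIONS = 600_000

# Legacy pattern to avoid (fast, unsalted):
#   hashlib.md5(password.encode()).hexdigest()


def hash_password(password: str, iterations: int = ITERATIONS) -> tuple[bytes, bytes]:
    """Return (salt, digest) using a per-password random salt."""
    salt = secrets.token_bytes(16)
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return salt, digest


def verify_password(password: str, salt: bytes, expected: bytes,
                    iterations: int = ITERATIONS) -> bool:
    """Recompute the digest and compare in constant time."""
    candidate = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"), salt, iterations)
    return hmac.compare_digest(candidate, expected)
```

In practice a team would more likely wrap a vetted library behind one approved helper; the sketch only shows what such a helper has to guarantee.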

Finding 7 — Performance Regressions

  • Small in absolute count but heavily skewed toward AI
  • Excessive I/O operations: ~8× more common in AI PRs
  • Root cause: AI models prefer simple, readable patterns over resource-efficient ones
    • Simple loops instead of batched operations
    • Repeated I/O calls instead of cached reads
    • Unoptimised data structures (list scans instead of hash lookups)
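
The patterns listed above can be sketched in Python as follows. The function names are hypothetical, and the cached reader assumes the file does not change while the process runs.

```python
from functools import lru_cache
from pathlib import Path


def find_known_slow(ids: list, known_ids: list) -> list:
    # List scan: each membership test walks known_ids, so cost is O(n * m).
    return [i for i in ids if i in known_ids]


def find_known_fast(ids: list, known_ids: list) -> list:
    # Hash lookup: build the set once, then each test is O(1) on average.
    known = set(known_ids)
    return [i for i in ids if i in known]


@lru_cache(maxsize=None)
def read_config(path: str) -> str:
    # Cached read: the file is opened once per distinct path.
    # Assumption: contents are static for the lifetime of the process.
    return Path(path).read_text(encoding="utf-8")
```

Both versions of the lookup return the same result; only the cost differs, which is why this class of issue tends to pass tests and surface later under load.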

Finding 8 — Concurrency & Dependency Issues

  • ~2× more frequent in AI PRs
  • Problems:
    • Incorrect ordering of dependent operations
    • Faulty dependency flow
    • Concurrency primitive misuse (e.g., mutex acquisition order, race conditions)
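
The mutex acquisition order problem mentioned above can be illustrated with a small Python sketch. `Account` and `transfer` are invented for this example; the technique shown is a fixed global lock order, assuming account ids are unique.

```python
import threading


class Account:
    def __init__(self, account_id: int, balance: int) -> None:
        self.account_id = account_id
        self.balance = balance
        self.lock = threading.Lock()


def transfer(src: Account, dst: Account, amount: int) -> None:
    if src is dst:
        # A plain Lock is not reentrant; locking it twice would deadlock.
        raise ValueError("source and destination must differ")
    # Always lock the lower id first. Two opposite transfers then request
    # the locks in the same order and cannot wait on each other forever.
    first, second = sorted((src, dst), key=lambda a: a.account_id)
    with first.lock, second.lock:
        if src.balance < amount:
            raise ValueError("insufficient funds")
        src.balance -= amount
        dst.balance += amount
```

Locking `src` then `dst` in argument order would look equally plausible in review, yet it can deadlock when `transfer(a, b, ...)` and `transfer(b, a, ...)` run concurrently.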

Finding 9 — Formatting Problems

  • 2.66× more common in AI-generated code
  • Occurs despite teams using formatters and linters — issues survive automated tooling
  • Types:
    • Spacing inconsistencies
    • Indentation errors
    • Structural drift
    • Style violations specific to repo conventions

Finding 10 — Naming Inconsistencies

  • Nearly 2× more frequent in AI contributions
  • Characteristics:
    • Unclear naming (generic variable names from training patterns)
    • Mismatched terminology (domain-specific terms replaced with generic equivalents)
    • Identifiers that don’t match existing codebase vocabulary
  • Impact: increased cognitive load, making code review slower and less reliable

Summary Table — AI vs Human Issue Rates

| Category | AI vs Human Ratio | Severity |
| --- | --- | --- |
| Overall issue volume | ~1.7× | Medium |
| Critical/major findings | 1.4–1.7× | High |
| Logic & correctness | ~1.75× (75% more) | Critical |
| Readability | 3× | High |
| Error handling | ~2× | Critical |
| Security vulnerabilities | up to 2.74× | Critical |
| Excessive I/O operations | ~8× | Medium |
| Concurrency/dependency | ~2× | High |
| Formatting | 2.66× | Low |
| Naming inconsistencies | ~2× | Medium |

Root Cause Analysis

The report attributes AI code quality issues to five structural causes:

| Root Cause | Mechanism | Affected Categories |
| --- | --- | --- |
| Lack of local business context | Models operate statistically, miss system-specific rules that senior engineers internalise | Logic, naming, readability |
| Surface-level correctness | Code appears functional but skips control-flow protections and misuses dependency ordering | Error handling, concurrency |
| Weak repository idiom adherence | Naming patterns, architectural norms, and formatting drift toward training data defaults | Naming, formatting, readability |
| Degraded security practices | Without explicit guidance, models reproduce legacy/outdated security patterns | Security vulnerabilities |
| Efficiency trade-offs | Models default to simple, readable patterns over resource-efficient implementations | Performance, I/O |

Mitigation Strategies

Strategy 1 — Enhanced Context

  • Provide prompt snippets with repository-specific conventions
  • Include instruction capsules with business logic constraints
  • Attach configuration schemas to reduce misconfigurations and logic drift
  • PUMA mapping: CLAUDE.md, puma-core skill, context engineering for agent tasks

Strategy 2 — Policy-as-Code

  • CI-enforced formatters and linters
  • Style guides as machine-readable constraints
  • Eliminate entire categories of AI-driven issues before code review
  • PUMA mapping: SDD Constitution §§, automated checks in experiment pipeline

Strategy 3 — Correctness Safety Rails

  • Mandate tests for all non-trivial control flow
  • Require nullability and type assertions
  • Standardise exception-handling rules across the codebase
  • Explicitly prompt for guardrails in AI coding prompts
  • PUMA mapping: PUMA Constitution §§ quality requirements, BMAD QA Agent audit

Strategy 4 — Security Hardening

  • Centralise credential handling via approved patterns
  • Block ad-hoc password usage with automated detection
  • Automate SAST (Static Application Security Testing) and security linting
  • PUMA mapping: HITL escalation for security-sensitive code paths
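
A very small version of "automated detection" for ad-hoc passwords might look like the sketch below. The pattern is deliberately narrow and the names are hypothetical; it is meant to show the idea of a CI check, not to replace a real secret scanner or SAST tool.

```python
import re

# Assumption: an illustrative rule only. It flags identifiers ending in a
# credential-like word that are assigned a quoted literal of 4+ characters.
HARDCODED_SECRET = re.compile(
    r"""(?i)\w*(?:password|passwd|secret|api_key|token)\s*[:=]\s*(['"])[^'"]{4,}\1"""
)


def find_hardcoded_secrets(source: str) -> list[tuple[int, str]]:
    """Return (line_number, stripped_line) for each suspicious assignment."""
    findings = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        if HARDCODED_SECRET.search(line):
            findings.append((lineno, line.strip()))
    return findings


sample = 'db_password = "hunter22"\npassword = os.environ["DB_PASSWORD"]\n'
print(find_hardcoded_secrets(sample))  # [(1, 'db_password = "hunter22"')]
```

The second line of the sample reads the credential from the environment and is not flagged, which is the behaviour an approved-pattern policy would want.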

Strategy 5 — Performance Optimisation

  • Provide guidelines for batching I/O operations
  • Document appropriate data structure choices per use case
  • Include performance hints in system prompts for compute-sensitive tasks
  • PUMA mapping: CodeCarbon integration, PN-ComputationalSustainability

Strategy 6 — AI-Aware PR Checklists

Reviewers should specifically verify:

  • Error path coverage (all failure modes handled)
  • Concurrency primitive correctness (mutex/async ordering)
  • Configuration value validation (no hardcoded magic values)
  • Approved credential handling (no ad-hoc password patterns)
  • PUMA mapping: BMAD QA Agent prompt checklist; Marco Veritas compliance review

Strategy 7 — Automated Code Review

  • Deploy AI code review tools (e.g., CodeRabbit itself) to standardise quality across different AI coding tools
  • Reduces reviewer fatigue from elevated AI PR review burden
  • Creates a feedback loop: AI reviews AI, with human oversight on flagged items
  • PUMA mapping: SmartPMO automated monitoring pipeline (Stage 5)

Core Conclusion

“AI accelerates output, but it also amplifies certain categories of mistakes.”

The report establishes three empirical properties of AI-generated code:

  1. Consistently more variable in quality
  2. More error-prone across all 10 measured categories
  3. More likely to introduce high-severity issues without proper safeguards

Report's closing statement

“Quality isn’t automatic. It requires deliberate engineering. Even when using AI tools.”


PUMA Integration

Evidence Base for PUMA Quality Design

The CodeRabbit report provides the empirical foundation for several PUMA architectural decisions:

| CodeRabbit Finding | PUMA Application |
| --- | --- |
| 1.7× issue volume in AI PRs | Justifies HITL threshold: AI outputs require systematic human review, not spot-checking |
| 2.74× security vulnerabilities | Supports BMAD QA Agent as a non-optional pipeline step |
| 75% more logic errors | Mandates Correctness Safety Rails in PUMA experiment protocol |
| PRs per author up 20%, incidents per PR up 23.5% | Quantifies the cost of AI acceleration without quality guardrails |
| ~8× I/O performance regressions | Informs PUMA sustainability protocol (CodeCarbon + efficiency prompting) |
| 3× readability issues | Highlights need for author’s-voice rewriting (Marco Veritas §3) |

BMAD QA Agent — Informed Checklist

The BMAD QA Agent prompt should incorporate the AI-Aware PR Checklist (Strategy 6) as a structured audit:

## AI Code Quality Audit (CodeRabbit 2025)
- [ ] Logic & control flow: all branching paths verified
- [ ] Error handling: null checks, early returns, exception handlers present
- [ ] Security: no ad-hoc credential handling; SAST clean
- [ ] Performance: no unnecessary I/O loops; appropriate data structures
- [ ] Naming: matches repository vocabulary, not generic AI defaults
- [ ] Formatting: passes repo-specific linter, not just universal formatter
- [ ] Concurrency: dependency ordering verified if async operations present

SmartPMO Stage 5 — Automated Review Pipeline

The 1.7× issue escalation finding directly motivates SmartPMO’s code quality monitoring:

  • Ingest: Each AI-authored change triggers quality metric collection (issues per PR rate)
  • Query: PM asks “What is the AI vs human issue ratio for this sprint?” → wiki answers from accumulated data
  • Alert: If AI issue rate exceeds 2× human baseline, auto-escalate to human review
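
The Alert rule above could be expressed as a small check like the following. This is a sketch under assumptions: `SprintStats`, `issue_ratio`, and `should_escalate` are hypothetical names, and how SmartPMO actually stores its metrics is not specified in this note.

```python
from dataclasses import dataclass
from typing import Optional

ESCALATION_FACTOR = 2.0  # threshold taken from the Alert rule above


@dataclass
class SprintStats:
    ai_issues: int
    ai_prs: int
    human_issues: int
    human_prs: int


def issue_ratio(stats: SprintStats) -> Optional[float]:
    """AI issues-per-PR divided by human issues-per-PR, or None if undefined."""
    if stats.ai_prs == 0 or stats.human_prs == 0 or stats.human_issues == 0:
        return None  # no usable baseline for this sprint
    ai_rate = stats.ai_issues / stats.ai_prs
    human_rate = stats.human_issues / stats.human_prs
    return ai_rate / human_rate


def should_escalate(stats: SprintStats, factor: float = ESCALATION_FACTOR) -> bool:
    """True when the AI issue rate exceeds `factor` times the human baseline."""
    ratio = issue_ratio(stats)
    return ratio is not None and ratio > factor
```

One design choice left open here is what to do when there is no human baseline in a sprint; this sketch does not escalate in that case, but a more conservative pipeline might.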

Thesis Relevance (PUMA Chapter 1)

The finding that PRs per author increased 20% while incidents per PR increased 23.5% is a precise quantification of the AI quality-speed tension — directly supporting PUMA’s thesis argument that:

“AI acceleration of PM tasks requires quality safeguards to avoid translating velocity gains into incident amplification.”

