A Codebase Health Scorecard: Metrics That Tell You When to Refactor vs Rewrite

Sandor Farkas - Founder & Lead Developer at Wolf-Tech

Expert in software development and legacy code optimization

Stop Arguing on Instinct

Every few months, the same meeting happens somewhere in a software company. An engineer complains the codebase is impossible to work with. A manager asks if they should rewrite it. Someone senior says rewrites always fail. The argument runs in circles for an hour and ends where it started - with a vague agreement to "pay down some technical debt" that nobody acts on.

The core problem is not that the codebase is bad. The core problem is that nobody has measured it. The decision to refactor or rewrite is one of the most expensive a development team makes, and in most organisations it gets made without a single piece of quantitative data.

A codebase health scorecard changes that. It replaces opinion with measurement and gut feel with a defensible number. This post walks through the metrics that go into a practical scorecard, how to collect them without weeks of work, and what the scores actually tell you.

What a Codebase Health Scorecard Is Not

Before getting into the metrics, a clarification: a health scorecard is not a test coverage report, a static analysis summary, or a SonarQube dashboard. Those are inputs. The scorecard synthesises those inputs into a structured assessment across multiple dimensions, weighted by their impact on the actual decision you are trying to make.

The goal is a single page that a CTO or technical lead can walk into a stakeholder meeting with and say: here is where we are, here is how we got here, and here is what the data says we should do next.

The Five Dimensions of Codebase Health

1. Complexity and Cognitive Load

The most direct measure of how hard a codebase is to work with is cyclomatic complexity - the number of independent paths through a function. A function with complexity above 10 is hard to reason about. Above 20, it is nearly impossible to modify safely. A codebase where the average complexity across all functions sits above 8 is signalling real structural problems.

Cognitive complexity, a refinement used by tools like SonarQube, weights nested structures more heavily than simple branches. It is a better proxy for developer frustration than raw cyclomatic complexity, because deeply nested code is harder to hold in your head even when the raw branch count is moderate.

How to measure: run phploc for PHP, radon for Python, or the built-in complexity analysis in SonarQube. For JavaScript and TypeScript, eslint with the complexity rule or code-complexity npm packages give you function-level data quickly.

Scorecard bands for average complexity across the codebase:

  • Below 5: healthy
  • 5 to 8: manageable, monitor the trend
  • 8 to 12: high - refactoring priority
  • Above 12: rewrite signal
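To make the bands above concrete, here is a minimal Python sketch that estimates cyclomatic complexity per function with the standard-library ast module and maps an average onto a band. It is a rough approximation for illustration, not a replacement for radon or SonarQube. The function names are hypothetical, and because the bands above share their boundary values, the sketch assumes each upper bound belongs to the lower band.

```python
import ast

# Node types that add a decision point. This is an approximation of
# McCabe's metric, not a complete implementation.
BRANCH_NODES = (ast.If, ast.For, ast.While, ast.ExceptHandler,
                ast.IfExp, ast.comprehension)


def function_complexities(source: str) -> dict[str, int]:
    """Return an approximate cyclomatic complexity keyed by function name.

    Functions that share a name overwrite each other, and nested functions
    also count towards their enclosing function; both are acceptable for a
    first-pass scan.
    """
    result = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            score = 1  # a function with no branches has exactly one path
            for child in ast.walk(node):
                if isinstance(child, BRANCH_NODES):
                    score += 1
                elif isinstance(child, ast.BoolOp):
                    # every extra operand of `and` / `or` adds a path
                    score += len(child.values) - 1
            result[node.name] = score
    return result


def complexity_band(average: float) -> str:
    """Map an average complexity onto the scorecard bands.

    Assumption: a value exactly on a boundary (8 or 12) falls into the
    lower band.
    """
    if average < 5:
        return "healthy"
    if average <= 8:
        return "manageable, monitor the trend"
    if average <= 12:
        return "high - refactoring priority"
    return "rewrite signal"
```

Feeding each source file through function_complexities and averaging the values gives the number to pass to complexity_band. Sorting the per-function results in descending order also produces a ready-made list of refactoring candidates.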

2. Test Coverage and Test Quality

Raw coverage percentage is the most commonly cited metric and also the most misunderstood. A codebase at 85% coverage can still be untestable if that coverage consists of tests that assert trivial behaviour while leaving business logic completely unchecked.

What matters more than the percentage is coverage of the critical paths - the code that processes money, handles authentication, performs data mutations, or drives the core business logic. You can have 40% overall coverage and a healthy codebase if every critical path is covered. You can have 90% overall coverage and a dangerously brittle codebase if the tests are shallow.

Secondary signals that matter alongside coverage:

  • Mutation score: what percentage of deliberate code mutations cause at least one test to fail? A mutation score below 50% suggests your tests are not actually verifying logic.
  • Test-to-code ratio: in a mature codebase, a test suite with fewer lines of code than the production code it covers is usually a warning sign.
  • Flaky tests: a test suite with more than 2-3% flaky tests is actively eroding developer trust and masking real failures.

For a codebase with near-zero test coverage, the question to ask is whether the code is even testable. Highly coupled, dependency-injected-nowhere, database-calls-inside-business-logic code cannot be tested without a rewrite of the architecture - not just a test-writing sprint.

Scorecard bands:

  • Critical path coverage above 80%, mutation score above 60%: healthy
  • Critical path coverage 50 to 80%, mutation score 40 to 60%: manageable
  • Critical path coverage below 50% or mutation score below 40%: high risk
  • Near-zero coverage on code that cannot be unit tested without structural changes: rewrite signal
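The bands above can be sketched as a small Python helper. The function names and the structurally_testable flag are hypothetical. The flag stands in for the human judgement about whether the code can be unit tested without architectural changes, which no tool reports directly.

```python
def mutation_score(killed: int, total: int) -> float:
    """Percentage of generated mutants that made at least one test fail."""
    if total <= 0:
        raise ValueError("total mutants must be positive")
    return 100.0 * killed / total


def quality_band(critical_coverage: float, mutation: float,
                 structurally_testable: bool = True) -> str:
    """Map critical-path coverage and mutation score (both in percent)
    onto the scorecard bands.

    Assumption: `structurally_testable=False` is set by a reviewer who has
    concluded that near-zero coverage cannot be fixed by writing tests alone.
    """
    if not structurally_testable:
        return "rewrite signal"
    # Either metric falling below its floor is enough for high risk.
    if critical_coverage < 50 or mutation < 40:
        return "high risk"
    # Healthy requires both metrics to clear their thresholds.
    if critical_coverage > 80 and mutation > 60:
        return "healthy"
    return "manageable"
```

Note the asymmetry: one weak metric is enough to land in the high-risk band, while the healthy band needs both. A codebase with 90% critical-path coverage and a mutation score of 35% therefore still scores as high risk, which matches the brittle high-coverage case described above.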

3. Dependency Health

Dependencies are the slow poison in most long-lived codebases. A package that was current in 2018 is not just outdated - it may contain known security vulnerabilities, may be incompatible with current versions of the framework, and may be abandoned with no maintainer to patch future issues.

Metrics to collect:

  • Outdated direct dependencies: how many of your direct dependencies are more than one major version behind their current release?
  • Known CVEs: run npm audit, composer audit, or a Python scanner such as pip-audit or safety to get a count of known vulnerabilities, broken down by severity.
  • Abandoned packages: for each dependency, check if the upstream repository has had a commit in the last 24 months. A widely-used package with no recent activity is a risk.
  • Dependency depth: deeply nested transitive dependencies are hard to audit and often impossible to update independently. A node_modules tree with 1,400 packages for a medium-sized application is a maintenance liability.

Dependency health is especially useful as a leading indicator. A codebase with healthy application logic but 40 outdated direct dependencies and three critical CVEs is heading toward a crisis even if it is currently functional.

Scorecard bands:

  • No critical CVEs, fewer than 20% of direct dependencies more than one major version behind: healthy
  • One or two critical CVEs with planned remediation, 20 to 40% of dependencies outdated: manageable
  • Multiple critical CVEs, more than 40% of dependencies outdated, or key dependencies abandoned: high risk
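As an illustration of how the bands above might be applied, the following Python sketch counts direct dependencies that are more than one major version behind and combines that share with a CVE count. The names are hypothetical and the sketch rests on several assumptions: versions are plain "X.Y.Z" strings, "multiple" critical CVEs means more than two, and the "planned remediation" condition is left to the reviewer rather than modelled.

```python
def majors_behind(installed: str, latest: str) -> int:
    """Number of major versions between installed and latest.

    Assumes plain 'X.Y.Z' version strings; prefixes such as 'v' or
    pre-release tags in the major component are not handled.
    """
    return max(0, int(latest.split(".")[0]) - int(installed.split(".")[0]))


def dependency_band(deps: dict[str, tuple[str, str]], critical_cves: int,
                    key_dependency_abandoned: bool = False) -> str:
    """Map dependency data onto the scorecard bands.

    `deps` maps a package name to (installed_version, latest_version).
    """
    outdated = sum(1 for installed, latest in deps.values()
                   if majors_behind(installed, latest) > 1)
    share = outdated / len(deps) if deps else 0.0

    if critical_cves > 2 or share > 0.40 or key_dependency_abandoned:
        return "high risk"
    if critical_cves == 0 and share < 0.20:
        return "healthy"
    return "manageable"
```

The installed and latest versions can usually be exported from the package manager itself (for example npm outdated or composer outdated), so the only manual inputs are the critical CVE count and the judgement about abandoned packages.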

4. Change Failure Rate and Deployment Pain

The health of a codebase shows up directly in deployment behaviour. A codebase that is well-structured, well-tested, and has clear separation of concerns will have a low change failure rate - the percentage of deployments that cause an incident or require a rollback. A codebase that is tightly coupled, opaquely structured, and poorly tested will fail regularly.

DORA metrics give you the quantitative framework here:

  • Change failure rate: percentage of deployments that result in a production incident. Industry benchmark for high-performing teams is below 5%.
  • Mean time to recovery: how long between a production failure and its resolution. Above two hours is a signal that the codebase is hard to debug and isolate.
  • Deployment frequency: high-performing teams deploy multiple times per day. If your team deploys once a fortnight because deployments are risky, that is a codebase health problem.

The most direct way to measure this is from your incident log and deployment pipeline. If that data is not being collected systematically, its absence is itself a signal.

Scorecard bands:

  • Change failure rate below 5%, MTTR under one hour: healthy
  • Change failure rate 5 to 15%, MTTR one to four hours: manageable
  • Change failure rate above 15%, or MTTR regularly above four hours: high risk
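If the deployment pipeline can export one record per deployment, the two numbers behind these bands take only a few lines to compute. The Python sketch below assumes a hypothetical Deployment record with a failure flag and a recovery timestamp. Real pipelines will name and shape this data differently.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record shape; adapt it to your pipeline's export."""
    deployed_at: datetime
    failed: bool = False
    recovered_at: Optional[datetime] = None


def change_failure_rate(deployments: list[Deployment]) -> float:
    """Percentage of deployments that caused an incident or rollback."""
    if not deployments:
        raise ValueError("no deployments to analyse")
    failures = sum(1 for d in deployments if d.failed)
    return 100.0 * failures / len(deployments)


def mean_time_to_recovery(deployments: list[Deployment]) -> Optional[timedelta]:
    """Average time from a failed deployment to its recovery.

    Returns None when there are no failed deployments with a recorded
    recovery time.
    """
    durations = [d.recovered_at - d.deployed_at
                 for d in deployments if d.failed and d.recovered_at]
    if not durations:
        return None
    return sum(durations, timedelta()) / len(durations)


def delivery_band(cfr_percent: float, mttr_hours: float) -> str:
    """Map change failure rate and MTTR onto the scorecard bands."""
    if cfr_percent > 15 or mttr_hours > 4:
        return "high risk"
    if cfr_percent < 5 and mttr_hours < 1:
        return "healthy"
    return "manageable"
```

This sketch measures recovery from the deployment timestamp. If your incident log records when the failure was detected, use that timestamp instead, since it is closer to the definition given above.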

5. Architectural Coupling

Coupling is the hardest metric to measure automatically but often the most informative. A codebase where every component knows about every other component cannot be changed safely - modifying one piece causes ripples across the system that are impossible to predict or test in advance.

Useful proxies for coupling:

  • God classes: classes or modules with more than 500 lines, more than 20 public methods, or dependencies on more than 10 other classes. Run a simple line-count scan and flag anything over 400 lines for closer inspection, so that files approaching the 500-line threshold are caught as well.
  • Circular dependencies: madge for JavaScript, deptrac for PHP, and similar tools will graph your dependency tree and flag circular imports. Even a handful of circular dependencies indicates structural problems.
  • Feature envy: code in one module that primarily manipulates data from another module is a signal that the module boundaries are wrong.
  • Cross-cutting database access: if your controllers, services, and repositories all have direct database calls, the data layer is not isolated and cannot be replaced or tested independently.
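Tools such as madge and deptrac handle circular dependency detection for you, but the underlying check is simple enough to sketch. The Python example below runs a depth-first search over a module dependency graph and returns the first cycle it finds. The graph format (a dict mapping each module to the modules it imports) and the function name are assumptions made for illustration.

```python
from typing import Optional


def find_cycle(graph: dict[str, list[str]]) -> Optional[list[str]]:
    """Return one dependency cycle as a list of module names, or None.

    The returned list starts and ends with the same module, for example
    ['a', 'b', 'a']. Uses recursion, so extremely deep graphs may need an
    iterative rewrite.
    """
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state: dict[str, int] = {}
    path: list[str] = []

    def visit(node: str) -> Optional[list[str]]:
        state[node] = IN_PROGRESS
        path.append(node)
        for dep in graph.get(node, []):
            dep_state = state.get(dep, UNVISITED)
            if dep_state == IN_PROGRESS:
                # dep is already on the current path: a cycle is closed
                return path[path.index(dep):] + [dep]
            if dep_state == UNVISITED:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        state[node] = DONE
        return None

    for module in graph:
        if state.get(module, UNVISITED) == UNVISITED:
            found = visit(module)
            if found:
                return found
    return None
```

For a graph in which billing imports orders, orders imports customers, and customers imports billing, the function returns that chain with billing repeated at the end. Reading the chain aloud is often the quickest way to see which module boundary is drawn in the wrong place.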

For architecturally coupled codebases, the key question is whether the coupling is incidental - poor discipline that can be fixed with refactoring - or structural, baked into the fundamental design choices of the system. Structural coupling usually cannot be resolved without a partial or full rewrite of the affected subsystem.

Calculating the Score

Assign each dimension a score from 1 to 4 based on the bands above:

  • 4: healthy
  • 3: manageable
  • 2: high risk
  • 1: rewrite signal

Weight the dimensions by their relevance to your specific situation. For a product under active development, architectural coupling and change failure rate deserve higher weight. For a system that primarily serves as a data store, dependency health and test coverage of data mutation paths matter more.

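With five dimensions scored from 1 to 4, the unweighted total runs from 5 to 20. To keep weighted totals on that same scale, normalise the weights so that they sum to 5, which is an average weight of 1 per dimension. A total weighted score above 14 out of 20 suggests the codebase is fundamentally sound and targeted refactoring will be productive. Between 8 and 14, you are looking at a structured remediation plan: specific subsystems or patterns that need systematic attention. Below 8, the data is pointing toward a partial or full rewrite of the affected areas.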
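The arithmetic can be sketched in a few lines of Python. The dimension keys, the example weights, and the function names are hypothetical. The sketch rescales whatever weights it is given so that the total stays on the 20-point scale, which is one reasonable reading of "weighted score out of 20" rather than a prescribed formula.

```python
def weighted_score(scores: dict[str, int], weights: dict[str, float]) -> float:
    """Combine per-dimension scores (1 to 4) into a weighted total.

    Weights are rescaled so that they average 1.0 per dimension, which
    keeps the result between 5 and 20 for five dimensions.
    """
    total_weight = sum(weights[name] for name in scores)
    if total_weight <= 0:
        raise ValueError("weights must sum to a positive number")
    scale = len(scores) / total_weight
    return sum(scores[name] * weights[name] * scale for name in scores)


def recommendation(total: float) -> str:
    """Map a weighted total onto the three responses described above."""
    if total > 14:
        return "targeted refactoring"
    if total >= 8:
        return "structured remediation plan"
    return "partial or full rewrite of the affected areas"


# Example: a product under active development, so coupling and
# change failure rate ("delivery") carry more weight.
scores = {"complexity": 3, "tests": 2, "dependencies": 4,
          "delivery": 2, "coupling": 1}
weights = {"complexity": 0.5, "tests": 0.5, "dependencies": 0.5,
           "delivery": 1.5, "coupling": 2.0}
print(weighted_score(scores, weights), recommendation(weighted_score(scores, weights)))
```

With equal weights the same scores total 12. Weighting coupling and delivery more heavily pulls the total down to 9.5, which shows how the choice of weights moves a borderline codebase closer to the rewrite threshold.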

The score is a starting point for a conversation, not a verdict. Two codebases can share the same score while requiring very different responses depending on team capacity, business context, and the cost of change.

Common Patterns the Scorecard Surfaces

Running this scorecard across different codebases, a few patterns come up repeatedly.

The brittle high-coverage codebase: test coverage looks fine, but mutation score is low and change failure rate is high. The tests are testing implementation details rather than behaviour. Refactoring is often the right answer, but it starts with improving the tests before touching the application code.

The abandoned dependency trap: application logic is reasonable, but the dependency tree is years out of date and full of known CVEs. This is a highly tractable problem that looks worse than it is. A dependency audit and upgrade sprint, combined with good test coverage, usually resolves it without any architectural changes.

The tangled monolith: complexity is high, coupling is severe, test coverage is low, and the codebase has no clear module boundaries. This is the genuinely hard case. Targeted refactoring often has diminishing returns here because each change surface is so large. Phased extraction of well-bounded domains - starting with the ones that change most frequently - is usually more effective than either a full rewrite or unfocused refactoring.

Turning the Scorecard into a Conversation

The scorecard's real value is not the number - it is making the conversation with non-technical stakeholders precise. Instead of "the code is hard to work with," you can say: our change failure rate is 18%, our critical-path test coverage is 34%, and we have three critical CVEs that have been unresolved for four months. That is a different conversation.

If you are assessing a codebase for code quality consulting purposes, as a pre-acquisition technical due diligence review, or as part of a legacy code modernisation programme, running this scorecard before scoping the work gives you a baseline to measure progress against and a clear story about why certain changes are being prioritised.

The output should be a one-page summary: dimension scores, the two or three highest-risk findings, and a recommended response - refactor, partial rewrite of specific subsystems, or full rewrite. That one page makes the decision defensible and turns an emotional debate into an engineering problem.

Getting Started

If you have not measured your codebase systematically before, the fastest starting point is:

  1. Run a complexity scan with a tool appropriate to your language stack. Flag the top 20 most complex files.
  2. Pull your last 90 days of deployment data and count failures and rollbacks.
  3. Run a dependency audit and get a CVE count.

Those three steps take a few hours and will tell you more about the actual state of your codebase than most half-day workshops.

If you would like a structured assessment with benchmarks against comparable systems, get in touch at hello@wolf-tech.io or visit wolf-tech.io - we have run this kind of review dozens of times and can usually turn around a written scorecard and recommendation within a week.


A codebase health scorecard does not make the refactor-versus-rewrite decision for you. But it gives you the evidence to make that decision well, defend it to stakeholders, and track whether the work you do next is actually moving the needle.