AI-Generated Tests That Hide Bugs: What We Found Auditing Vibe-Coded Codebases

#AI-generated tests
Sandor Farkas - Founder & Lead Developer at Wolf-Tech

Expert in software development and legacy code optimization

A green test suite is supposed to mean something. When you open a vibe-coded codebase for the first time and see four hundred passing tests, your first instinct is relief. Look closer, and that relief evaporates. AI-generated tests have a distinct failure signature: they are syntactically plausible, structurally familiar, and functionally hollow. They pass because the tooling records a test run, not because anything real was verified.

We see this pattern in nearly every codebase audit we run on projects that started with a language model at the keyboard. This post catalogs what we actually find — not theoretical concerns, but recurring patterns from real codebases — and explains why these failures are harder to spot than classic test anti-patterns.

What Makes AI-Generated Tests Different From Ordinary Bad Tests

Before cataloging the failure modes, it is worth naming what makes this problem unusual. Bad tests written by humans tend to be bad in legible ways: a missing assertion, a test that was never hooked up to CI, an integration test that somebody disabled after it became flaky. Bad tests written by language models are bad in invisible ways, because the model's goal is to produce output that looks correct to another model, to a linter, and to a reviewer who is moving quickly.

A human developer who skips a real assertion usually does so because of time pressure or inattention. A language model skips real assertions because asserting nothing is the path of least resistance when the model does not know the expected output of the function it is testing. The test still looks like a test. It imports the class, instantiates the object, calls the method, and writes an expect or assert line at the end. The assertion just happens to assert something that is always true.
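
A minimal sketch of the shape, with hypothetical class names; note that the final assertion cannot fail no matter what the processor does:

public function testProcessPayment(): void
{
    $gateway = $this->createMock(PaymentGateway::class);
    $processor = new PaymentProcessor($gateway);

    $processor->process(new Payment(100.00));

    // Always true: the processor was constructed three lines earlier.
    $this->assertInstanceOf(PaymentProcessor::class, $processor);
}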

This is the core of the problem: AI-generated test suites optimize for surface-level plausibility rather than actual verification. You get tests that would pass a PR review at a glance but provide no regression protection.

Failure Mode 1: Assertions Against Mocked Return Values

The most common pattern we find is a test that mocks a dependency, configures the mock to return a specific value, calls the code under test, and then asserts that the return value matches what the mock was configured to return. The assertion is circular — it can only fail if the mock itself is broken, which is not possible because the mock is controlled by the test.
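
In its purest form the circularity is visible in a few lines; the shipping service here is a hypothetical stand-in:

public function testGetShippingCost(): void
{
    $rateProvider = $this->createMock(RateProvider::class);
    $rateProvider->method('getRate')->willReturn(12.50);

    $service = new ShippingService($rateProvider);

    // Asserts back the exact number the test configured above.
    // The only thing exercised is the pass-through.
    $this->assertEquals(12.50, $service->getShippingCost());
}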

More often the circularity is dressed up with a little arithmetic. Here is a simplified PHP example in the style we encounter:

public function testCalculateOrderTotal(): void
{
    $pricingService = $this->createMock(PricingService::class);
    $pricingService->method('getUnitPrice')->willReturn(49.99);

    $calculator = new OrderCalculator($pricingService);
    $result = $calculator->calculateTotal(quantity: 2);

    $this->assertEquals(99.98, $result);
}

This test looks reasonable. It mocks an external service, which is good practice. But consider what it actually verifies: a single product of two numbers the test itself supplied. Any implementation that returns 99.98 when the unit price is 49.99 and the quantity is 2 will pass, including one that ignores the quantity and always multiplies by two, or one that adds the price to itself. The expected value was hardcoded by the model to match the output of the implementation it had just written, not derived from what the calculator should do across its range of inputs.

The real test would set up data, exercise the calculator through multiple quantity values, and assert the mathematical relationship — not a single hardcoded product of two hardcoded numbers.
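
A sketch of that stronger test, reusing the hypothetical OrderCalculator and PricingService from above:

public function testTotalScalesLinearlyWithQuantity(): void
{
    $pricingService = $this->createMock(PricingService::class);
    $pricingService->method('getUnitPrice')->willReturn(10.00);

    $calculator = new OrderCalculator($pricingService);

    // Expected values come from the pricing rule itself
    // (total = unit price * quantity), not from mentally running the code.
    $this->assertEqualsWithDelta(0.00, $calculator->calculateTotal(quantity: 0), 0.001);
    $this->assertEqualsWithDelta(10.00, $calculator->calculateTotal(quantity: 1), 0.001);
    $this->assertEqualsWithDelta(70.00, $calculator->calculateTotal(quantity: 7), 0.001);
}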

Failure Mode 2: Tests That Never Execute the Relevant Code Path

Language models write tests for the function they were asked to test. They do not necessarily write tests that exercise the interesting paths through that function. What we commonly see is a test that calls the happy path once, produces a passing assertion, and leaves every conditional branch untested. Coverage tools report the file as covered. The branches that contain bugs — typically the error-handling paths, the edge cases, and the scenarios where an external call fails — are never executed.
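
A sketch of the pattern with a hypothetical feed importer: the method has an invalid-URL guard and a rate-limit branch, and the single generated test is configured so that neither ever runs:

// Production method under test (hypothetical):
public function importFeed(string $url): ImportResult
{
    if (!$this->validator->isValidUrl($url)) {
        throw new InvalidFeedUrlException($url);   // never executed by any test
    }

    $response = $this->client->fetch($url);

    if ($response->isRateLimited()) {
        return ImportResult::retryLater();         // never executed by any test
    }

    return ImportResult::fromPayload($response->payload());
}

// The only test the model generated for it ($this->importer set up with
// mocks that return "happy" values, so only the final return runs):
public function testImportFeed(): void
{
    $result = $this->importer->importFeed('https://example.com/feed.xml');

    $this->assertInstanceOf(ImportResult::class, $result);
}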

In a typical audit, a project might show 78% line coverage but under 30% branch coverage on the same code. The gap is almost entirely made up of AI-generated tests that execute the opening lines of each function and never reach the conditional logic.

This matters in production because the branches that go untested are disproportionately the failure paths. A bug in the happy path usually surfaces during development. A bug in the error-handling path surfaces when a real user encounters an edge case at 2am on a Sunday.

Failure Mode 3: Implementation Mirroring

A subtler failure mode is what we call implementation mirroring: tests that reproduce the implementation logic rather than specifying the expected behavior from the outside. The test and the code under test are logically identical, so they share all the same bugs.

In a JavaScript example:

it('should format currency', () => {
  const amount = 1234.5;
  const result = formatCurrency(amount);
  expect(result).toBe(`$${amount.toFixed(2)}`);
});

This assertion is fine as long as the implementation is correct — but the string template in the assertion is essentially the same logic as the function being tested. If the function has a bug (perhaps it rounds incorrectly in some locales, or drops the dollar sign for negative values), this test misses it because it makes the same assumptions as the buggy implementation.

A real behavioral specification would describe what the output should be for specific inputs, derived from product requirements — not derived by running the function mentally and writing down the output.
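
A behavioral version of the same idea, sketched here in PHPUnit against a hypothetical CurrencyFormatter, writes the expected strings out literally instead of recomputing them:

public function testFormatsAmountsAccordingToTheSpec(): void
{
    $formatter = new CurrencyFormatter();

    // Expected strings are taken from the requirements and written literally,
    // so a rounding or sign bug cannot hide behind shared formatting logic.
    $this->assertSame('$1,234.50', $formatter->format(1234.5));
    $this->assertSame('$0.00', $formatter->format(0));
    $this->assertSame('-$5.00', $formatter->format(-5));
}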

Failure Mode 4: Test Isolation That Breaks in Integration

Language models default to mocking everything. This produces tests that are fast, deterministic, and meaningless. When we run the same codebase against a real database or a real HTTP client, a meaningful share of those tests' assumptions turn out to be false.

We consistently find three categories of integration failures hidden by over-mocked unit tests: SQL queries that work against the mock but fail against the real schema because the mock did not model constraints or column types accurately; HTTP clients fed mock payloads shaped by the model's assumptions about an API rather than the API's actual response format; and service-layer code that assumes a transaction boundary the mock simply ignores.

The irony is that the mocking is often excessive precisely because the model wanted to produce "well-isolated" unit tests. A code quality audit of a vibe-coded codebase will almost always recommend replacing a significant portion of the mock-heavy unit suite with a smaller number of integration tests that actually validate the contracts between layers.

Failure Mode 5: Test Names That Describe Code, Not Behavior

This is the least damaging failure mode individually but the most revealing signal of AI test generation at scale: test names like testCalculateOrderTotal, should_return_value, or it handles the case. These are names that describe a call to a function rather than a behavior that a user or system depends on.

Good test names are specifications: invoice_total_includes_tax_when_customer_is_vat_registered, payment_fails_with_insufficient_funds_error, concurrent_updates_do_not_produce_negative_inventory. When a test with one of these names fails, you know immediately what broke and why it matters. When testCalculateOrderTotal fails, you have to read the test to understand what it was even checking.

Test names are free documentation. AI-generated test suites systematically under-use them.

Why Coverage Numbers Are Not the Answer

The natural response to all of this is to raise the coverage threshold. If tests are shallow, require more coverage. This is the wrong instinct. Coverage measures what lines of code were visited during a test run — it says nothing about what was verified. A test suite that visits every line by calling every function with a single input and making no assertions can achieve 100% coverage.

The useful metrics are branch coverage and mutation score. Branch coverage requires that every conditional in the codebase has been evaluated in both directions. Mutation testing runs hundreds of small automated changes to the source code — flipping comparisons, removing return statements, swapping operators — and checks whether your test suite detects each change. If a mutation goes undetected, you have a test gap. These tools are slower and noisier than line coverage, but they measure something real.
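
To make the mechanism concrete, here is the kind of mutant a mutation tool generates, shown as a comment over hypothetical pricing code, together with the test that kills it:

// Original line in PriceCalculator::priceFor():
//     return $quantity >= $this->threshold ? $this->bulkPrice : $this->unitPrice;
// One generated mutant flips the comparison to ">":
//     return $quantity > $this->threshold ? $this->bulkPrice : $this->unitPrice;

public function testBulkPriceAppliesExactlyAtThreshold(): void
{
    $calculator = new PriceCalculator(threshold: 10, unitPrice: 5.00, bulkPrice: 4.00);

    // Passes against the original, fails against the mutant: mutant killed.
    $this->assertSame(4.00, $calculator->priceFor(quantity: 10));
}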

Before trusting an AI-generated test suite, run it through a mutation framework. For PHP, Infection is the standard tool; for JavaScript, Stryker. The mutation score on a vibe-coded project typically lands between 30% and 50% on the first pass. A production-grade suite should be above 70%.

What a Trustworthy Test Strategy Looks Like

After auditing the existing tests, a rescue engagement typically rebuilds the test strategy around three principles.

The first is behavior over implementation. Tests should be written against the public interface of a component, specifying what the component does for the system and its users, not how it does it. The implementation can be refactored without touching a test suite built on behavior; a test suite built on implementation has to be rewritten every time the internals change.

The second is proportional integration testing. Not everything needs to be an isolated unit test. For database queries, for third-party API calls, for file I/O: integration tests that hit the real system in a controlled environment give far better regression protection per line of test code than unit tests with complex mock setups. A legacy code modernization engagement almost always includes replacing fragile mock towers with lean, database-backed integration tests that validate the contracts between layers.
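
A lean version of such a test runs the real repository against a throwaway database. The sketch below uses in-memory SQLite and an illustrative CustomerRepository for brevity; in practice you would point it at the same database engine production uses:

public function testFindsOnlyActiveCustomers(): void
{
    $pdo = new PDO('sqlite::memory:');
    $pdo->exec('CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT NOT NULL, active INTEGER NOT NULL)');
    $pdo->exec("INSERT INTO customers (name, active) VALUES ('Anna', 1), ('Bela', 0)");

    $repository = new CustomerRepository($pdo);
    $active = $repository->findActive();

    // The real SQL runs against a real schema, so column names, types,
    // and constraints are actually exercised instead of assumed by a mock.
    $this->assertCount(1, $active);
    $this->assertSame('Anna', $active[0]->name);
}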

The third is explicit failure specification. Every test for error-handling code should verify that the right error is produced for the right input — not just that the function returns something without throwing. The failure paths are where real bugs live.
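
In PHPUnit terms, with a hypothetical transfer service, that means pinning the exception type and, where it matters, its message:

public function testTransferFailsWithInsufficientFundsError(): void
{
    $account = new Account(balance: 50.00);
    $service = new TransferService();

    // Verify the specific failure, not merely that something was thrown.
    $this->expectException(InsufficientFundsException::class);
    $this->expectExceptionMessage('balance 50.00 is below requested 80.00');

    $service->transfer(from: $account, amount: 80.00);
}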

Getting an Honest Assessment

If your project was built with significant AI assistance, a passing CI pipeline and a healthy-looking test file count do not tell you much: the test suite almost certainly contains a meaningful share of the failure modes described here. That is not a reason for alarm; it is a reason for a structured review before you hit a scaling milestone, enterprise procurement, or investor due diligence that involves a technical review.

The fastest way to get an honest read is to run mutation testing on a representative module and look at the score. If it is below 50%, you have a coverage theater problem. A code quality consulting engagement can scope what a realistic fix looks like, estimate the remediation effort, and prioritize the modules where test gaps carry the most production risk.

If you want to talk through what you are seeing in your own codebase, reach out at hello@wolf-tech.io or visit wolf-tech.io to start a conversation. A 30-minute call is enough to tell you whether you have a problem worth solving now or one that can wait.