The Vibe Code Audit: A Structured Review Process for AI-Generated Codebases Before They Go to Production

Sandor Farkas - Founder & Lead Developer at Wolf-Tech

Expert in software development and legacy code optimization

You hired an AI pair programmer — Cursor, Claude Code, GitHub Copilot — and in two weeks you had a working SaaS prototype. Fast feature delivery, reasonable structure, tests that turn green. Then real users showed up.

A vibe code audit is how you find the problems before they find your users.

This is the structured review process Wolf-Tech uses when a founder or CTO brings us an AI-assisted codebase and asks: "Can we actually ship this?" The audit covers seven dimensions. Each one exposes a different category of failure that AI code generation produces at scale.

Why AI-generated code needs a dedicated audit framework

Standard code review checklists were written for human-authored code. They assume a developer made deliberate architectural choices and understood the consequences. AI assistants work differently: they optimise for plausible-looking code that satisfies the immediate prompt, not for long-term coherence.

The patterns that result are distinct enough to deserve their own checklist. You will rarely find a single catastrophic bug. Instead you will find dozens of small decisions — each technically defensible in isolation — that combine into a system that is hard to extend, fragile under load, and expensive to secure after the fact.

A vibe code audit is not a general code review. It is a targeted search for the specific failure modes that LLM-generated code produces most reliably.

Dimension 1: Architectural coherence

The first thing we examine is whether the codebase has a consistent internal logic or whether it looks like it was written by several different developers who never spoke to each other.

AI assistants have no memory across sessions. Ask Cursor to build a user authentication module on Monday, and it will make ten small decisions — where to store session state, how to name service classes, how to handle errors. Ask it to add a billing module on Friday, and it will make ten different small decisions. By the time you have a dozen features, the codebase contains three or four implicit architectural styles fighting each other.

What to look for:

  • Inconsistent module boundaries. Does the authentication code call the billing service directly? Does the notification module have database access? Modules that violate their own boundaries are a sign of prompt-by-prompt development with no overarching architecture.
  • Contradictory naming conventions. A codebase that mixes UserService, user_repository, UserManager, and UserHelper across equivalent abstractions was not designed — it was assembled.
  • Multiple error-handling strategies. Exceptions in some places, return codes in others, bare console.error calls in a third. AI will use whatever pattern appeared in the nearest context.

Remediation at this stage is architectural: you need to decide which pattern wins, document it, and systematically refactor toward it. Tools like PHPStan (for PHP), TypeScript strict mode, and ESLint enforce consistency going forward but cannot fix the accumulated divergence retroactively.
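
As a rough illustration of how naming divergence can be surfaced before deciding which pattern wins, the sketch below counts role suffixes across a list of class names. The suffix list and the sample class names are assumptions made for the example, not a fixed rule set.

```typescript
// Role suffixes treated here as interchangeable "service-like" abstractions.
// This list is an illustrative assumption; tune it to the codebase under review.
const ROLE_SUFFIXES = ["Service", "Manager", "Helper", "Handler"];

// Count how often each role suffix appears across a list of class names.
function countRoleSuffixes(classNames: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const className of classNames) {
    const suffix = ROLE_SUFFIXES.find((s) => className.endsWith(s));
    if (suffix !== undefined) {
      counts.set(suffix, (counts.get(suffix) ?? 0) + 1);
    }
  }
  return counts;
}

// More than one suffix in use suggests competing conventions for the same role.
function hasCompetingConventions(classNames: string[]): boolean {
  return countRoleSuffixes(classNames).size > 1;
}

// Hypothetical class names for demonstration.
const sampleClassNames = ["UserService", "BillingManager", "InvoiceHelper", "OrderService"];
const suffixCounts = countRoleSuffixes(sampleClassNames);
const divergent = hasCompetingConventions(sampleClassNames);
```

A count like this only tells you where to look. Whether two suffixes really name the same kind of abstraction is still a judgment for the reviewer.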

Dimension 2: Security surface

This is the dimension where vibe-coded projects carry the highest risk of immediate, serious harm.

LLMs generate code that works under the happy path. Security is almost always a constraint added by a separate prompt — which means it only gets applied where the developer explicitly asked for it. The gaps are predictable:

SQL injection remains the most common finding. AI assistants frequently use query builders or ORMs incorrectly, falling back to string interpolation when a query gets complex. Even one raw query with user-controlled input is enough.
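
The difference between interpolation and parameterisation can be sketched with a small tagged template. The `sql` helper, the table and column names, and the `$1`-style placeholders (as used by PostgreSQL drivers) are assumptions for illustration; in practice you would rely on your driver's or ORM's own parameter binding.

```typescript
// A parameterised query: SQL text with placeholders, plus the values bound to them.
interface ParameterisedQuery {
  text: string;
  values: unknown[];
}

// Hypothetical helper: a tagged template that turns each interpolated value into
// a numbered placeholder ($1, $2, ...) instead of splicing it into the SQL text.
function sql(strings: TemplateStringsArray, ...values: unknown[]): ParameterisedQuery {
  const text = strings.reduce(
    (acc, part, i) => (i === 0 ? part : `${acc}$${i}${part}`),
    ""
  );
  return { text, values };
}

const hostileInput = "x' OR '1'='1";

// Risky: the input becomes part of the SQL statement itself.
const unsafeQuery = `SELECT id FROM users WHERE email = '${hostileInput}'`;

// Safer: the input travels separately from the SQL text.
const safeQuery = sql`SELECT id FROM users WHERE email = ${hostileInput}`;
```

In the unsafe string the attacker's quote characters change the structure of the statement. In the parameterised form the statement text never contains user input.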

CSRF protection is frequently absent on state-changing endpoints. Typically the developer told the AI it was building an API, and the AI assumed APIs do not need CSRF protection. That assumption only holds for stateless, token-authenticated APIs, not for session-based hybrid apps where the browser attaches the session cookie automatically.

Hardcoded credentials and secrets appear regularly in AI-generated code because the fastest way to make a proof-of-concept work is to put the API key directly in the source. These frequently survive the transition from prototype to production unchanged.

Permission escalation is common in multi-tenant applications. AI assistants often generate resource lookup code that fetches by primary key without verifying that the authenticated user has access to that resource. The check gets added for the most obvious endpoints and forgotten on the rest.
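
A minimal sketch of the pattern, using an in-memory array in place of a real database. The `InvoiceRecord` shape, the function names, and the tenant identifiers are hypothetical.

```typescript
interface InvoiceRecord {
  id: number;
  tenantId: string;
  amountCents: number;
}

// In-memory stand-in for a database table.
const invoiceTable: InvoiceRecord[] = [
  { id: 1, tenantId: "tenant-a", amountCents: 1200 },
  { id: 2, tenantId: "tenant-b", amountCents: 9900 },
];

// Risky pattern: lookup by primary key only. Any authenticated user who
// guesses an id can read another tenant's record.
function findInvoiceUnscoped(id: number): InvoiceRecord | undefined {
  return invoiceTable.find((invoice) => invoice.id === id);
}

// Safer pattern: the tenant is part of the lookup itself, so a foreign id
// behaves exactly like a missing record.
function findInvoiceForTenant(id: number, tenantId: string): InvoiceRecord | undefined {
  return invoiceTable.find(
    (invoice) => invoice.id === id && invoice.tenantId === tenantId
  );
}

// A user from tenant-a asks for invoice 2, which belongs to tenant-b.
const leaked = findInvoiceUnscoped(2);
const scoped = findInvoiceForTenant(2, "tenant-a");
```

Putting the tenant condition into the query itself, rather than into a separate check after loading, makes it harder to forget on a secondary endpoint.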

Dependency vulnerabilities deserve their own section (Dimension 4), but insecure defaults in direct dependencies — outdated cryptographic algorithms, permissive CORS configuration, missing security headers — are security issues that arrive with the first npm install.

We run automated scanners (Semgrep, npm audit, Rector security rules) as a baseline, but the manual review of authentication flows and multi-tenant resource access is not replaceable. For clients considering Wolf-Tech's code quality consulting, security surface is always the first deliverable.

Dimension 3: Test coverage audit

Test suites in vibe-coded projects frequently report 80–90% coverage while providing minimal protection. The reason: AI assistants write tests that verify the implementation they just wrote, not the behaviour the system is supposed to have.

A test that calls calculateTotal() and asserts it equals the value returned by the same function called with the same inputs is not a test — it is a tautology. We see this pattern constantly. The coverage number is real; the safety net is not.
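
The contrast can be shown in a few lines. The `LineItem` shape and the basket values below are hypothetical; only the `calculateTotal` name comes from the text above.

```typescript
interface LineItem {
  unitPriceCents: number;
  quantity: number;
}

function calculateTotal(items: LineItem[]): number {
  return items.reduce((sum, item) => sum + item.unitPriceCents * item.quantity, 0);
}

const basket: LineItem[] = [
  { unitPriceCents: 250, quantity: 2 },
  { unitPriceCents: 999, quantity: 1 },
];

// Tautological check: compares the function with itself, so it keeps
// passing even if the formula inside calculateTotal is wrong.
const tautologyPasses = calculateTotal(basket) === calculateTotal(basket);

// Behavioural check: compares against a value worked out independently
// from the business rule (2 x 250 + 1 x 999 = 1499).
const behaviourPasses = calculateTotal(basket) === 1499;
```

Both checks execute the same lines and contribute the same coverage. Only the second one would fail if the calculation were broken.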

What a genuine test coverage audit examines:

  • Behavioural vs structural coverage. Does the test suite cover the business rules — the conditions under which a charge is refunded, the state transitions that must not happen, the invariants that the domain enforces? Or does it cover code paths?
  • Edge case coverage. Empty inputs, boundary values, concurrency scenarios. AI tends to test the happy path.
  • Mutation testing scores. Running a mutation testing tool (Infection for PHP, Stryker for JS/TS) reveals whether tests actually catch code changes. The tool applies small changes (mutants) to the code one at a time and reruns the suite against each. A suite that still passes against 40% of the mutants in conditional logic is not protecting you.
  • Integration vs unit mix. Vibe-coded projects often have many unit tests and no integration tests. A system can have perfect unit test coverage and completely broken inter-module behaviour.
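
To make the mutation testing point concrete, here is a hand-written mutant of the kind a mutation tool generates automatically. The free-shipping rule and its threshold are hypothetical, and the "suites" are plain functions rather than a real test runner.

```typescript
// Original rule (hypothetical): orders of 100 or more qualify for free shipping.
function qualifiesForFreeShipping(orderTotal: number): boolean {
  return orderTotal >= 100;
}

// Hand-written mutant: ">=" changed to ">", a typical boundary mutation.
function qualifiesMutant(orderTotal: number): boolean {
  return orderTotal > 100;
}

type ShippingRule = (orderTotal: number) => boolean;

// Weak suite: only checks values far from the boundary.
function weakSuitePasses(rule: ShippingRule): boolean {
  return rule(500) === true && rule(10) === false;
}

// Stronger suite: also pins the boundary value itself.
function boundarySuitePasses(rule: ShippingRule): boolean {
  return weakSuitePasses(rule) && rule(100) === true;
}

// The mutant "survives" the weak suite and is "killed" by the boundary suite.
const mutantSurvivesWeakSuite = weakSuitePasses(qualifiesMutant);
const mutantSurvivesBoundarySuite = boundarySuitePasses(qualifiesMutant);
```

Both suites give full line coverage of the rule. Only the one that exercises the boundary notices when the rule changes.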

The audit does not recommend rewriting the test suite. It identifies the specific behavioural gaps — the untested business rules and the integration boundaries with no coverage — and prioritises them by risk.

Dimension 4: Dependency hygiene

AI assistants resolve dependencies optimistically: they pick whatever version they were trained on, install it with --save, and move on. By the time you ship, the dependency landscape has changed.

The dependency audit covers:

  • Abandoned packages. Dependencies with no releases in 18+ months, deprecated by their maintainers, or superseded by a different package entirely. AI assistants trained on 2023–2024 data still recommend packages that have since been archived.
  • Duplicate dependencies. In JavaScript projects particularly, AI-generated codebases frequently install multiple packages that do the same thing — two HTTP clients, two date libraries, two validation schemas — because different AI sessions picked different solutions to the same problem.
  • Known vulnerabilities. npm audit and composer audit give you the baseline; the manual work is assessing whether your code actually exercises the vulnerable paths.
  • Licence compatibility. AI assistants do not check licences. A GPL-licensed package pulled into a proprietary SaaS can create legal exposure. This comes up more often than most founders expect.
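
The duplicate-dependency check in particular is easy to approximate with a script. The sketch below groups a few well-known npm packages by purpose and reports any group with more than one member installed. The grouping table and the sample dependency list are illustrative assumptions, not a complete catalogue.

```typescript
interface PurposeGroup {
  purpose: string;
  packages: string[];
}

// Illustrative grouping of well-known npm packages by purpose (an assumption
// for this sketch; extend it for the ecosystem you are auditing).
const PURPOSE_GROUPS: PurposeGroup[] = [
  { purpose: "http-client", packages: ["axios", "node-fetch", "got"] },
  { purpose: "date-library", packages: ["moment", "dayjs", "date-fns"] },
  { purpose: "validation", packages: ["zod", "yup", "joi"] },
];

// Given the "dependencies" object from package.json, return every purpose
// that is served by more than one installed package.
function findDuplicatePurposes(dependencies: Record<string, string>): PurposeGroup[] {
  const installed = Object.keys(dependencies);
  const duplicates: PurposeGroup[] = [];
  for (const group of PURPOSE_GROUPS) {
    const present = group.packages.filter((pkg) => installed.indexOf(pkg) !== -1);
    if (present.length > 1) {
      duplicates.push({ purpose: group.purpose, packages: present });
    }
  }
  return duplicates;
}

// Hypothetical dependency list: two HTTP clients, one date library, one validator.
const sampleDependencies: Record<string, string> = {
  axios: "^1.6.0",
  got: "^13.0.0",
  dayjs: "^1.11.0",
  zod: "^3.22.0",
};
const duplicatePurposes = findDuplicatePurposes(sampleDependencies);
```

For the sample list this reports one duplicated purpose, the HTTP client, served by both axios and got.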

Dependency hygiene is the quickest dimension to remediate and often produces immediate performance improvements simply from removing duplicate dependencies and replacing abandoned packages with maintained alternatives.

Dimension 5: TypeScript and PHPStan strictness gaps

AI assistants write type-suppression-heavy code when types get difficult. The tell-tale signs:

In TypeScript: as any, @ts-ignore, as unknown as SomeType, and broad catch (e: any) blocks. Each suppression is a place where the type system stops protecting you and a runtime error starts becoming possible.

In PHP: @phpstan-ignore, mixed return types, arrays used as untyped bags of data, and nullable returns on methods that are documented as always returning a value.

We run PHPStan at maximum strictness and TypeScript in strict mode and treat every suppression as a finding that needs a documented justification. Some suppressions are legitimate (third-party library types that are simply wrong); most are not. The share of suppressions that carry a valid justification gives you a rough measure of how much the type system is actually protecting you.
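
A first inventory of TypeScript suppressions can be taken with a simple text scan, sketched below. Regex matching is a rough heuristic chosen for brevity (it also matches inside strings and comments); an AST-based lint rule would be more precise. The labels and the sample source are made up for the example.

```typescript
// Patterns for the TypeScript suppressions named above.
const SUPPRESSION_PATTERNS: { label: string; pattern: RegExp }[] = [
  { label: "as-any", pattern: /\bas any\b/g },
  { label: "ts-ignore", pattern: /@ts-ignore\b/g },
  { label: "double-cast", pattern: /\bas unknown as\b/g },
];

// Count occurrences of each suppression pattern in a source text.
function countSuppressions(source: string): Record<string, number> {
  const counts: Record<string, number> = {};
  for (const { label, pattern } of SUPPRESSION_PATTERNS) {
    counts[label] = (source.match(pattern) ?? []).length;
  }
  return counts;
}

// Hypothetical file contents, held in a string so the scan has something to read.
const sampleSource = [
  "const user = payload as any;",
  "// @ts-ignore",
  "const total = order.total.toFixed(2);",
  "const cfg = raw as unknown as AppConfig;",
  "const other = value as any;",
].join("\n");

const suppressionCounts = countSuppressions(sampleSource);
```

Each hit then needs a human decision: justified and documented, or a finding.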

This dimension often uncovers actual bugs — places where an AI assistant used a type suppression to silence a type error that was correctly identifying a runtime problem.

Dimension 6: Performance hotspots

Performance problems in vibe-coded applications fall into three categories, all predictable.

N+1 queries are the most common. AI assistants generate clean-looking ORM code that hides loops over database calls. A product listing that loads 50 items and makes 51 queries looks fine in development and crawls in production. We profile with query logging enabled and look for query counts that scale linearly with result set size.
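
The 50-items, 51-queries case can be reproduced with an in-memory stand-in for a database that counts the calls it receives. The `CountingDb` class, its method names, and the product/category schema are hypothetical.

```typescript
interface ProductRow {
  id: number;
  categoryId: number;
}

interface CategoryRow {
  id: number;
  label: string;
}

// In-memory stand-in for a database that counts the queries it receives.
class CountingDb {
  queryCount = 0;
  private products: ProductRow[] = [];
  private categories: CategoryRow[] = [];

  constructor(productTotal: number) {
    for (let i = 1; i <= productTotal; i++) {
      this.products.push({ id: i, categoryId: i });
      this.categories.push({ id: i, label: `Category ${i}` });
    }
  }

  listProducts(): ProductRow[] {
    this.queryCount++;
    return this.products;
  }

  categoryById(id: number): CategoryRow | undefined {
    this.queryCount++;
    return this.categories.find((c) => c.id === id);
  }

  categoriesByIds(ids: number[]): CategoryRow[] {
    this.queryCount++;
    return this.categories.filter((c) => ids.indexOf(c.id) !== -1);
  }
}

// N+1 shape: one query for the list, then one more per row.
function loadListingNaive(db: CountingDb): number {
  db.listProducts().forEach((product) => db.categoryById(product.categoryId));
  return db.queryCount;
}

// Batched shape: one query for the list, one for all related rows.
function loadListingBatched(db: CountingDb): number {
  const products = db.listProducts();
  db.categoriesByIds(products.map((product) => product.categoryId));
  return db.queryCount;
}

const naiveQueries = loadListingNaive(new CountingDb(50)); // 51
const batchedQueries = loadListingBatched(new CountingDb(50)); // 2
```

The naive count grows with the number of rows while the batched count stays constant, which is exactly the signal to look for in query logs.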

Missing database indexes follow naturally from prompt-by-prompt development. The AI creates a table schema and moves on. When a query is later written against that table, the AI does not look back at the schema — it just writes the query. Indexes are rarely added unless explicitly requested.

Synchronous I/O in async paths is a problem in Node.js and in event-loop-based PHP runtimes. Blocking calls (synchronous file reads, synchronous HTTP requests, blocking database calls inside event handlers) collapse throughput under load in ways that are not visible until traffic arrives.

We do not expect performance perfection at audit time. We look for the issues that will cause failures at modest scale (100–1,000 concurrent users), because those are the ones that hit without warning and require emergency remediation at the worst possible time.

If the codebase has patterns that require deeper architectural intervention (deep coupling, no query optimisation layer, monolithic data access), we flag them separately under Wolf-Tech's legacy code optimisation service, since the remediation path differs from a greenfield project.

Dimension 7: Operational readiness

The final dimension is the one founders most consistently miss: can you actually run this in production?

Logging. AI-generated code often has no structured logging at all, or inconsistent console.log calls scattered across the codebase with no log levels, no correlation IDs, and no way to trace a request through multiple services. When something breaks at 2am, you need logs that tell you what happened, not noise.
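
As a minimal sketch of what "structured" means here, the function below emits one JSON object per log line with a level and a correlation ID. The field names and the `formatLogLine` helper are assumptions; a production system would normally use an established logging library rather than hand-rolled formatting.

```typescript
type LogLevel = "debug" | "info" | "warn" | "error";

// Every log line carries a correlation ID so one request can be traced
// across modules and services. Extra fields are allowed.
interface LogContext {
  correlationId: string;
  [key: string]: unknown;
}

// Build one machine-parseable JSON log line. The timestamp is passed in
// so the function stays deterministic and easy to test.
function formatLogLine(
  level: LogLevel,
  message: string,
  context: LogContext,
  timestamp: Date
): string {
  return JSON.stringify({
    timestamp: timestamp.toISOString(),
    level,
    message,
    ...context,
  });
}

// Hypothetical usage: the correlation ID would come from the incoming request.
const logLine = formatLogLine(
  "error",
  "payment capture failed",
  { correlationId: "req-123", orderId: 42 },
  new Date(0)
);
```

With lines in this shape, filtering by level or following a single correlation ID through the logs becomes a query rather than a text search.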

Error handling. Unhandled promise rejections, uncaught exceptions, swallowed errors that return HTTP 200 with an empty body. AI assistants frequently generate code where error states are technically handled (the error is caught) but not communicated (the user sees nothing, the developer hears nothing).

Graceful shutdown. Does the application handle SIGTERM correctly? Does it finish in-flight requests before closing? Does it flush message queues? Cloud deployments restart containers constantly; a process that cannot shut down cleanly loses requests on every deploy.
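
The core of graceful shutdown is bookkeeping: stop accepting new work, then wait for in-flight work to finish. The sketch below shows that bookkeeping in isolation. The `ShutdownGate` class is hypothetical; in a real service `beginShutdown` would be called from the SIGTERM handler and the drained callback would close the server and flush queues.

```typescript
// Tracks in-flight work so shutdown can wait for it to finish.
class ShutdownGate {
  private inFlight = 0;
  private shuttingDown = false;
  private onDrained: (() => void) | null = null;

  // Returns false once shutdown has begun, so the caller can reject new work.
  tryBegin(): boolean {
    if (this.shuttingDown) return false;
    this.inFlight++;
    return true;
  }

  // Call when a unit of work completes.
  finish(): void {
    this.inFlight--;
    this.notifyIfDrained();
  }

  // Stop accepting work; run the callback once in-flight work reaches zero.
  beginShutdown(onDrained: () => void): void {
    this.shuttingDown = true;
    this.onDrained = onDrained;
    this.notifyIfDrained();
  }

  private notifyIfDrained(): void {
    if (this.shuttingDown && this.inFlight === 0 && this.onDrained !== null) {
      const callback = this.onDrained;
      this.onDrained = null; // fire at most once
      callback();
    }
  }
}

const gate = new ShutdownGate();
const shutdownEvents: string[] = [];

gate.tryBegin(); // one request is in flight
gate.beginShutdown(() => shutdownEvents.push("drained")); // e.g. on SIGTERM
const acceptedDuringShutdown = gate.tryBegin(); // new work is refused
shutdownEvents.push("request finished");
gate.finish(); // last request completes, the drained callback runs
```

A real implementation would also enforce a deadline, since the orchestrator will force-kill a container that takes too long to exit.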

Configuration management. Are all environment-specific values in environment variables? Is there a startup check that fails loudly if required configuration is missing, rather than running half-configured and producing mysterious behaviour?
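
A fail-loud startup check can be very small. In the sketch below the required key names are hypothetical, and the environment is passed in as a plain object so the function can be tested; in a Node.js service you would pass `process.env`.

```typescript
// Hypothetical list of settings the application cannot run without.
const REQUIRED_KEYS = ["DATABASE_URL", "PAYMENT_API_KEY", "SESSION_SECRET"];

// Validate configuration at startup and report every missing key at once,
// instead of failing later on first use.
function loadConfigOrThrow(
  env: Record<string, string | undefined>
): Record<string, string> {
  const config: Record<string, string> = {};
  const missing: string[] = [];

  for (const key of REQUIRED_KEYS) {
    const value = env[key];
    if (value === undefined || value === "") {
      missing.push(key);
    } else {
      config[key] = value;
    }
  }

  if (missing.length > 0) {
    throw new Error(`Missing required configuration: ${missing.join(", ")}`);
  }
  return config;
}
```

Calling this once before the server starts listening turns a half-configured deployment into an immediate, readable startup failure.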

Health checks. Does the application expose /health or /ready endpoints that return meaningful status? Does the health check actually test the database connection, or just return HTTP 200?

Operational readiness is the dimension that determines whether you can sleep during your first week of real traffic. It is also the dimension most AI assistants simply skip — they were never asked.

The audit questionnaire

Before starting a vibe code audit engagement, Wolf-Tech sends clients a short questionnaire. The answers shape where we spend time:

  1. How was the codebase generated? (Single AI assistant, multiple sessions, human + AI hybrid?)
  2. Has it been deployed to production? If yes, what breakage has already occurred?
  3. Does a test suite exist? What does the coverage report show?
  4. Are there any known performance problems or scaling concerns?
  5. What are the highest-risk features? (Payments, authentication, data export, multi-tenant data isolation?)
  6. Is there a security requirement from a customer, investor, or regulator driving this review?

The answers determine the audit sequence. A codebase being reviewed before Series A due diligence gets security and architecture first. A codebase that already has a performance incident gets performance hotspots as the first deliverable. A codebase that needs to pass an enterprise security questionnaire gets an output mapped directly to that questionnaire's categories.

Remediation priority framework

Not everything found in a vibe code audit gets fixed immediately. A pragmatic remediation sequence:

Immediate (before production): SQL injection and other injection vulnerabilities, hardcoded credentials, authentication bypass paths, missing tenant isolation checks.

Within two weeks: CSRF gaps, missing permission checks on secondary endpoints, abandoned dependencies with known CVEs, N+1 queries on high-traffic paths.

Within a sprint: Type suppressions hiding real bugs, missing indexes on query paths used by paying customers, error handling gaps on payment and authentication flows.

Planned technical debt: Architectural inconsistencies, test coverage gaps in non-critical areas, operational readiness improvements (structured logging, health checks, graceful shutdown).
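
The four tiers above can be encoded as a lookup table so that a list of findings sorts itself into remediation order. The finding identifiers below are hypothetical labels for a subset of the categories named in the text. Unknown findings sort last in this sketch, which is a simplification; in practice they should be triaged manually.

```typescript
type RemediationTier =
  | "immediate"
  | "within-two-weeks"
  | "within-a-sprint"
  | "planned-debt";

// Tiers in the order they should be worked through.
const TIER_ORDER: RemediationTier[] = [
  "immediate",
  "within-two-weeks",
  "within-a-sprint",
  "planned-debt",
];

// Hypothetical finding identifiers mapped to the tiers described above.
const TIER_BY_FINDING: Record<string, RemediationTier> = {
  "sql-injection": "immediate",
  "hardcoded-credentials": "immediate",
  "auth-bypass": "immediate",
  "missing-tenant-isolation": "immediate",
  "csrf-gap": "within-two-weeks",
  "n-plus-one-high-traffic": "within-two-weeks",
  "type-suppression-hiding-bug": "within-a-sprint",
  "missing-index": "within-a-sprint",
  "architectural-inconsistency": "planned-debt",
  "missing-structured-logging": "planned-debt",
};

// Rank of a finding; unknown findings sort after all known tiers.
function tierRank(findingKind: string): number {
  const tier = TIER_BY_FINDING[findingKind];
  return tier === undefined ? TIER_ORDER.length : TIER_ORDER.indexOf(tier);
}

// Return a copy of the findings sorted into remediation order.
function sortByTier(findingKinds: string[]): string[] {
  return findingKinds.slice().sort((a, b) => tierRank(a) - tierRank(b));
}

const remediationOrder = sortByTier([
  "missing-index",
  "csrf-gap",
  "sql-injection",
  "architectural-inconsistency",
]);
```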

The priority framework is not a licence to defer security issues. It is a tool for communicating to stakeholders that "we have findings" does not mean "we cannot ship." Most vibe-coded codebases that go through this audit are shippable with two to four weeks of targeted remediation on the immediate category.

When to commission a vibe code audit

The right time is before you have paying customers — before the cost of remediation includes reputational risk, SLA obligations, and the distraction of fighting fires in production.

The second-best time is when you have paying customers but before you onboard your first enterprise client, process your first significant payment volume, or start a SOC 2 audit.

If you have a codebase that was substantially AI-generated and is approaching any of these milestones, a structured vibe code audit is the fastest way to understand your actual risk position.

To discuss what a review engagement looks like for your specific codebase, reach out at hello@wolf-tech.io or visit wolf-tech.io.