Load Testing Your SaaS Before It Breaks in Production
The classic SaaS outage happens on a Tuesday morning. The product lands on a popular newsletter, a customer launches a campaign to their whole list, or the sales team closes a deal that onboards five hundred seats at once. Requests queue up, the database starts spilling to disk, worker queues back up, and the dashboard that showed 99.9% uptime last month shows a red bar for the next ninety minutes. Nobody wrote bad code. The application just met a traffic shape it had never been measured against.
Load testing SaaS is the discipline of measuring that shape on purpose, in a controlled environment, before it happens in production. It is not a one-time checkbox for pre-launch. It is a recurring practice that answers three concrete questions: how much traffic can this system handle today, where does it break first, and does it degrade gracefully or catastrophically. Teams that treat load testing as ongoing engineering work—not a QA ritual—are the ones that scale without the Tuesday morning surprise.
This post covers how to design realistic load tests, the two tools that cover almost every need (k6 and Locust), how to interpret the results, and the bottleneck patterns we see repeatedly in PHP/Symfony and Next.js applications.
Why Most Load Tests Miss the Real Limits
A load test is only useful if it resembles the traffic that actually hits production. Three patterns sink most attempts.
The first is hitting a single endpoint in a tight loop. A hammering script pointed at GET /api/products will happily show ten thousand requests per second because a single query plan and a single response fit neatly in MySQL's buffer pool and PHP's opcode cache. Real traffic spreads across dozens of endpoints with different query costs, cache hit rates, and downstream dependencies. The cache-friendly endpoint tells you nothing about the authenticated search endpoint that joins six tables.
The second is using synthetic data that does not reflect production shape. A test database with one thousand users and ten thousand orders will not reveal the ORDER BY created_at DESC LIMIT 20 query that becomes a full table scan once the orders table crosses ten million rows. Load tests need to run against a dataset whose size and distribution matches production or exceeds it.
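One way to approximate production skew when seeding a test database is a Zipf-like allocation, where a few heavy accounts own most of the rows. A sketch (the row counts, user counts, and exponent are illustrative, not taken from any real dataset):

```javascript
// Sketch: allocate orders across users with a Zipf-like (power-law) skew,
// so a few "whale" accounts hold most of the rows -- as in real SaaS data.
function zipfAllocation(totalOrders, userCount, exponent = 1.0) {
  // weight of the user at rank r is 1 / r^exponent
  const weights = Array.from({ length: userCount }, (_, i) => 1 / Math.pow(i + 1, exponent));
  const sum = weights.reduce((a, b) => a + b, 0);
  return weights.map((w) => Math.round((w / sum) * totalOrders));
}

const perUser = zipfAllocation(10_000_000, 1000);
console.log('top user orders:', perUser[0]);      // heaviest account, ~1.3M rows
console.log('median user orders:', perUser[499]); // typical account, a few thousand
```

A uniform split (ten thousand orders per user) would never surface the pagination and index problems that the top account triggers.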
The third is ignoring think time and session structure. Real users authenticate, navigate, pause to read, submit a form, and log out. A load test that fires requests with no pauses measures your system under a kind of load that never occurs. Worse, it often masks bottlenecks that only appear when connection pools drain and sessions accumulate in Redis.
A realistic load test models user journeys, not endpoints. It uses production-representative data. It includes the authentication flow, the long-running report export, and the webhook that fires on every update. It distributes virtual users across the scenarios in roughly the same ratio your analytics show for real users.
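In k6, that traffic ratio can be encoded directly in the options. A sketch (the scenario names, exec functions, and the 70/20/10 split are illustrative; the executor names are real k6 executors):

```javascript
// Sketch: distribute 200 VUs across journeys in the ratio analytics reports.
// The journey names and the 70/20/10 split are hypothetical.
export const options = {
  scenarios: {
    browse_and_order: {
      executor: 'constant-vus',
      vus: 140, // ~70% of real traffic
      duration: '10m',
      exec: 'browseAndOrder',
    },
    search: {
      executor: 'constant-vus',
      vus: 40, // ~20%
      duration: '10m',
      exec: 'search',
    },
    report_export: {
      executor: 'constant-vus',
      vus: 20, // ~10%
      duration: '10m',
      exec: 'reportExport',
    },
  },
};

export function browseAndOrder() { /* journey steps */ }
export function search() { /* journey steps */ }
export function reportExport() { /* journey steps */ }
```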
Choosing a Tool: k6 vs. Locust
For most teams, the choice is between k6 and Locust. Both are open source, both produce credible results, and both scale to tens of thousands of virtual users on modest hardware.
k6 is written in Go, scripted in JavaScript, and designed around a streaming metrics pipeline. It has excellent CI/CD integration, produces structured output that ships directly to Prometheus or Grafana Cloud, and scales efficiently because each virtual user is a goroutine rather than a process. We default to k6 for teams that already use JavaScript heavily, want results in their existing observability stack, or need to run tests from CI pipelines as part of release gates.
Locust is written in Python, scripted in Python, and ships with a web UI that lets non-engineers trigger runs and watch live charts. It is the pragmatic choice when your QA team writes Python, when you want complex scenario logic (conditional branching, parsing responses to drive subsequent requests), or when the test framework needs to be approachable for people outside the platform team.
Both tools are capable of performance testing against REST APIs, GraphQL, WebSockets, and traditional HTML applications. The decision should be driven by the skillset of the people who will maintain the tests six months from now, not by theoretical throughput differences.
A minimal but useful k6 scenario looks like this:
```javascript
import http from 'k6/http';
import { check, sleep, group } from 'k6';

export const options = {
  stages: [
    { duration: '2m', target: 50 },   // ramp up
    { duration: '10m', target: 200 }, // steady load
    { duration: '2m', target: 500 },  // spike
    { duration: '5m', target: 0 },    // ramp down
  ],
  thresholds: {
    http_req_failed: ['rate<0.01'],
    http_req_duration: ['p(95)<800', 'p(99)<2000'],
  },
};

export default function () {
  group('login and browse', () => {
    // send the credentials as JSON; a bare object body would go out
    // as application/x-www-form-urlencoded in k6
    const login = http.post(
      `${__ENV.BASE_URL}/api/login`,
      JSON.stringify({
        email: `user_${__VU}@loadtest.local`,
        password: 'test-password',
      }),
      { headers: { 'Content-Type': 'application/json' } },
    );
    check(login, { 'login 200': (r) => r.status === 200 });

    const token = login.json('token');
    const headers = { Authorization: `Bearer ${token}` };

    http.get(`${__ENV.BASE_URL}/api/dashboard`, { headers });
    sleep(3);
    http.get(`${__ENV.BASE_URL}/api/orders?limit=20`, { headers });
    sleep(5);
  });
}
```
The shape matters as much as the number: ramp up, hold, spike, ramp down. The thresholds turn the test into a pass/fail check that CI can enforce. Without thresholds, the test is a graph; with thresholds, it is a gate.
Interpreting Results: Percentiles, Not Averages
Average response time is the wrong headline metric. A service that responds in 100ms for 99% of requests and 30 seconds for the other 1% has a 400ms average—which looks fine—while 1% of users experience catastrophic latency. The metrics that matter are p50 (median, the typical user experience), p95 (the slow user experience), p99 (the slowest real experience), and error rate.
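The arithmetic is easy to verify with a self-contained sketch (the latency values are the hypothetical ones from the example above):

```javascript
// Sketch: 99% of requests at 100ms, 1% at 30s -- the average hides the tail.
const latencies = [
  ...Array(99).fill(100), // 99 fast requests (ms)
  30_000,                 // 1 catastrophic request (ms)
];

// nearest-rank percentile: smallest value covering p% of observations
function percentile(values, p) {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

const avg = latencies.reduce((a, b) => a + b, 0) / latencies.length;
console.log('average:', avg);                    // 399 ms -- looks fine
console.log('p50:', percentile(latencies, 50));  // 100 ms
console.log('p95:', percentile(latencies, 95));  // 100 ms
console.log('max:', Math.max(...latencies));     // 30000 ms
// With a 1-in-100 tail, only the extreme percentiles and the max reveal it --
// which is why dashboards need p99 and beyond, not just the average.
```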
A healthy load test result has a flat p95 line as virtual users ramp up, a small widening of the gap between p50 and p99 under peak load, and an error rate close to zero. A failing result shows one of three shapes. A hockey stick in p95 around a specific concurrency level points to an exhausted connection pool, worker pool, or thread pool. A rising error rate with stable latency points to timeouts, upstream dependencies failing, or rate limits. A cliff—latency fine up to a threshold, then requests simply stop succeeding—usually points to the event loop, FPM worker pool, or queue workers saturating completely.
Run tests long enough to reveal garbage collection pauses, log rotation effects, and cache eviction patterns. A ten-minute test will miss many problems that only appear at thirty minutes.
The Bottleneck Patterns We See Repeatedly
Across code quality audits and performance reviews on PHP/Symfony and Next.js applications, the same bottleneck patterns appear with striking consistency. Knowing where to look first saves days of investigation.
The N+1 query problem in Doctrine ORM is the most frequent root cause of PHP backend slowdowns under load. A controller that renders a list of orders, each of which triggers a lazy load of the customer, each of which lazy-loads the billing address, produces hundreds of queries for a single page. At low traffic the pool absorbs it; at high traffic the connection pool drains and every request starts waiting for a free connection. The fix is to hydrate the associations up front—a DQL fetch join with an added select, or an eager fetch mode on the mapping—or repository methods that return fully hydrated entities, and it is usually a two-line change.
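The effect is easy to see with a toy query counter. A sketch that models the behaviour, not Doctrine code (entity names and the page size are illustrative):

```javascript
// Sketch: count queries issued by a lazy-loading page vs a fetch-joined one.
// This models the N+1 pattern; it is not Doctrine itself.
let queryCount = 0;
const query = (sql) => { queryCount++; /* pretend to hit MySQL */ };

function renderOrdersLazy(orderCount) {
  queryCount = 0;
  query('SELECT * FROM orders LIMIT ?');           // 1 query for the list
  for (let i = 0; i < orderCount; i++) {
    query('SELECT * FROM customers WHERE id = ?'); // +1 per order (lazy load)
    query('SELECT * FROM addresses WHERE id = ?'); // +1 per customer
  }
  return queryCount;
}

function renderOrdersEager(orderCount) {
  queryCount = 0;
  // one fetch join hydrates orders, customers, and addresses together
  query('SELECT o.*, c.*, a.* FROM orders o JOIN customers c ON ... JOIN addresses a ON ...');
  return queryCount;
}

console.log('lazy:', renderOrdersLazy(100));   // 201 queries for one page
console.log('eager:', renderOrdersEager(100)); // 1 query
```

Two hundred extra queries per page render is invisible on a developer laptop and fatal at two hundred concurrent users.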
PHP-FPM worker exhaustion is the second most common failure mode. A server configured with fifty FPM workers can only handle fifty concurrent requests. If one endpoint waits four seconds for an upstream service, fifty concurrent users of that endpoint consume every worker and every other request queues. The symptom is a sudden cliff in the load test around the worker count. The fix is a combination of increasing workers, moving slow calls to background jobs, and adding timeouts so a misbehaving upstream cannot eat the entire pool.
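The height of that cliff is predictable from the worker count and the slow call's latency. A back-of-the-envelope sketch using the numbers from the example above (the 50ms figure for the fixed path is an assumption):

```javascript
// Sketch: the max sustainable throughput of a blocking worker pool is
// workers / service time (Little's law). Past it, requests queue and the
// load test shows a cliff.
function maxThroughput(workers, serviceTimeSeconds) {
  return workers / serviceTimeSeconds;
}

// 50 FPM workers, endpoint blocked 4s on an upstream call:
console.log(maxThroughput(50, 4));    // 12.5 req/s ceiling for the whole pool
// Same pool after moving the slow call to a background job (~50ms in-request):
console.log(maxThroughput(50, 0.05)); // 1000 req/s
```

If your load test cliffs at roughly this computed number, the worker pool is the bottleneck, not the code inside the request.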
Missing or misconfigured caching is the third. OPcache should be enabled in production with generous memory and an opcache.max_accelerated_files high enough to cover the codebase. Symfony's HTTP cache, Doctrine's second-level cache, and Redis for session and query caching should all be load-tested specifically—many teams turn them on and never verify hit rates under realistic load.
Database indexes that are perfect at one million rows are useless at one hundred million. The load test should run against a dataset at the size you expect in twelve to eighteen months, not today's size. Slow query logs captured during the load test point directly at the queries that will break first as data grows.
Background job queue saturation is a quieter failure mode but no less dangerous. If the web tier accepts orders faster than workers can process them, the queue grows unboundedly. Load tests that ignore the worker tier will declare victory even as the backlog silently climbs. Include queue depth in your load test dashboards.
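Whether the backlog climbs is a simple rate comparison. A sketch (the arrival rate, worker count, and per-worker throughput are illustrative):

```javascript
// Sketch: queue depth after t seconds when jobs arrive faster than workers
// drain them. Growth clamps at zero when the workers keep up.
function backlogAfter(arrivalPerSec, processedPerSec, seconds) {
  return Math.max(0, (arrivalPerSec - processedPerSec) * seconds);
}

// 80 orders/s enqueued, 4 workers x 15 jobs/s = 60/s drained:
console.log(backlogAfter(80, 60, 600)); // 12000 jobs behind after 10 minutes
// Scale to 6 workers (90/s) and the backlog never forms:
console.log(backlogAfter(80, 90, 600)); // 0
```

The web tier looks healthy in both cases, which is exactly why queue depth has to be on the load test dashboard.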
Designing for Graceful Degradation
Every system has a breaking point. The engineering goal is not to eliminate it—which is impossible—but to ensure that hitting it does not cascade into a full outage.
Practical patterns that make the difference: circuit breakers on every upstream HTTP call so a slow dependency cannot consume all workers; request timeouts at every layer (HTTP client, database, cache); bulkheads that prevent one heavy endpoint from starving others, typically via separate FPM pools or separate services; rate limiting at the edge so a single client cannot push the system past its measured limits; and read-only fallbacks for dashboards and reports so read traffic survives when writes fail.
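A circuit breaker needs surprisingly little code. A minimal sketch (the thresholds are illustrative, and production code would add half-open probing and per-host state rather than this single in-memory counter):

```javascript
// Sketch: a minimal circuit breaker. After maxFailures consecutive failures
// the circuit opens and calls fail fast for cooldownMs, so a slow or dead
// upstream cannot pin every worker while requests wait on it.
function circuitBreaker(fn, { maxFailures = 5, cooldownMs = 30_000 } = {}) {
  let failures = 0;
  let openedAt = 0;
  return async (...args) => {
    if (openedAt && Date.now() - openedAt < cooldownMs) {
      throw new Error('circuit open: failing fast');
    }
    try {
      const result = await fn(...args);
      failures = 0;   // any success closes the circuit
      openedAt = 0;
      return result;
    } catch (err) {
      failures++;
      if (failures >= maxFailures) openedAt = Date.now();
      throw err;
    }
  };
}

// Usage: wrap the upstream call once, reuse the wrapped function everywhere.
const callPayments = circuitBreaker(
  async () => { throw new Error('upstream down'); }, // stand-in for the real HTTP call
  { maxFailures: 2 },
);
```

The point is the failure mode: after the circuit opens, a request spends microseconds failing instead of four seconds holding a worker.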
The load test is what validates these patterns. Kill a Redis instance mid-test and watch whether the application keeps serving or dies. Add five hundred milliseconds of latency to the payment provider and see if the checkout endpoint times out cleanly or exhausts workers. These are the questions that only a controlled test can answer honestly.
Load Testing as a Continuous Practice
The teams that handle scaling well run load tests on a schedule, not just before launches. A weekly run against staging, with the same scenarios and thresholds, produces a trend line: performance is improving, stable, or regressing. When it regresses, the commit range is small and the cause is easy to find. When the test only runs before a big launch, every regression is a surprise and every fix is done under pressure.
Integrating load tests into CI as a nightly job or a pre-release gate is a one-time investment with a permanent payoff. k6 Cloud, Grafana k6, and Locust with Prometheus export all support this pattern out of the box.
Teams planning for significant traffic growth, preparing for enterprise onboarding, or recovering from a recent incident often benefit from an external perspective on their performance engineering. If your team is approaching that kind of moment, we help mid-size European SaaS companies design load testing programs and remediate the bottlenecks they surface as part of broader custom software development and digital transformation engagements. Contact us at hello@wolf-tech.io or visit wolf-tech.io for a free consultation.

