Streaming LLM Responses in Next.js: SSE Patterns That Stay Stable Under Load

July 1, 2026

#streaming LLM responses Next.js

Sandor Farkas

Founder & Lead Developer

Expert in software development and legacy code optimization

Token-by-token streaming is what makes AI features feel fast. The first character appears almost instantly, progress is visible, and users can start reading before the model finishes generating. But if you have shipped streaming LLM responses in Next.js and called it done at the "hello world" stage, you are one traffic spike away from a wall of connection errors.

This post is about the gap between a streaming demo and a streaming feature that holds up in production.

Why SSE for Streaming LLM Responses in Next.js

WebSockets are the obvious candidate when you think "real-time." But for LLM streaming the traffic is almost entirely one-directional: the server pushes tokens, the client reads them. Server-Sent Events (SSE) are a better fit here. SSE uses plain HTTP, works through most proxies and CDN edge nodes without special configuration, and gives you automatic reconnection for free when the client uses the browser's EventSource API.

The standard pattern in a Next.js Route Handler looks roughly like this:

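// app/api/chat/route.ts
import OpenAI from 'openai';

// The client reads OPENAI_API_KEY from the environment by default.
const openai = new OpenAI();

export async function POST(req: Request) {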
  const { prompt } = await req.json();

  const stream = await openai.chat.completions.create({
    model: 'gpt-4o',
    messages: [{ role: 'user', content: prompt }],
    stream: true,
  });

  const encoder = new TextEncoder();
  const readable = new ReadableStream({
    async start(controller) {
      for await (const chunk of stream) {
        const token = chunk.choices[0]?.delta?.content ?? '';
        controller.enqueue(encoder.encode(`data: ${JSON.stringify({ token })}\n\n`));
      }
      controller.enqueue(encoder.encode('data: [DONE]\n\n'));
      controller.close();
    },
  });

  return new Response(readable, {
    headers: {
      'Content-Type': 'text/event-stream',
      'Cache-Control': 'no-cache',
      'Connection': 'keep-alive',
    },
  });
}

This works in development. It starts failing at scale.
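One detail the handler above implies: it is a POST route, and the browser's EventSource only issues GET requests, so a client for it would typically read the response body through fetch and parse the frames itself. The following is a minimal sketch of such a parser, assuming the exact frame shape emitted above ("data: {...}" frames separated by a blank line, ending with the [DONE] sentinel). The name createSseParser is hypothetical and not part of any library.

```typescript
// Hypothetical helper: incrementally parses the SSE frames the handler emits.
// Assumes "\n\n" frame separators and JSON payloads of the form { token }.
type SseParser = {
  // Feed decoded text; returns the tokens completed by this chunk.
  push(text: string): string[];
  // True once the [DONE] sentinel has been seen.
  isDone(): boolean;
};

function createSseParser(): SseParser {
  let buffer = "";
  let done = false;
  return {
    push(text: string): string[] {
      buffer += text;
      const tokens: string[] = [];
      let boundary = buffer.indexOf("\n\n");
      while (boundary !== -1) {
        const frame = buffer.slice(0, boundary);
        buffer = buffer.slice(boundary + 2);
        for (const line of frame.split("\n")) {
          if (!line.startsWith("data: ")) continue; // skip id:, event:, comments
          const payload = line.slice("data: ".length);
          if (payload === "[DONE]") {
            done = true;
          } else {
            tokens.push((JSON.parse(payload) as { token: string }).token);
          }
        }
        boundary = buffer.indexOf("\n\n");
      }
      return tokens; // an incomplete trailing frame stays in the buffer
    },
    isDone: () => done,
  };
}
```

In a browser you would feed this from the response body, for example by decoding each chunk read from res.body with a TextDecoder and passing the text to push(). Network chunks do not line up with frame boundaries, which is why the parser keeps a buffer between calls.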

The Three Problems Naive SSE Implementations Hit

1. Backpressure Is Ignored

The for await loop above pulls tokens from the OpenAI stream as fast as they arrive and pushes them into the ReadableStream controller without checking whether the client is keeping up. If the client is on a slow connection, the browser's TCP receive window eventually fills up, but the server-side controller keeps enqueueing chunks. Node.js buffers them in memory. Under enough concurrent requests, this eats RAM until the process is killed or the OOM killer steps in.

The fix is to respect the ReadableStream pull model. Instead of start, use the pull method, which is called only when the consumer is ready for more data:

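// Obtain the async iterator once, outside the stream callbacks.
const streamIterator = stream[Symbol.asyncIterator]();

const readable = new ReadableStream({
  async pull(controller) {
    // Keep reading until there is something to enqueue. A pull() that
    // resolves without enqueueing or closing can leave a pending read
    // waiting forever, and some chunks carry an empty delta.
    while (true) {
      const { value, done } = await streamIterator.next();
      if (done) {
        controller.enqueue(encoder.encode('data: [DONE]\n\n'));
        controller.close();
        return;
      }
      const token = value.choices[0]?.delta?.content ?? '';
      if (token) {
        controller.enqueue(encoder.encode(`data: ${JSON.stringify({ token })}\n\n`));
        return;
      }
    }
  },
});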

This gives backpressure for free: the stream pauses when the browser stops consuming.

2. Cancellation Is Not Propagated

When a user closes a tab or navigates away, the browser closes the SSE connection. The Route Handler receives an abort signal on req.signal. The naive implementation ignores it, so the OpenAI request keeps running, consuming tokens and API spend for a response nobody will ever read.

Wire the abort signal through:

export async function POST(req: Request) {
  const { prompt } = await req.json();
  const abortController = new AbortController();

  req.signal.addEventListener('abort', () => {
    abortController.abort();
  });

  const stream = await openai.chat.completions.create(
    { model: 'gpt-4o', messages: [{ role: 'user', content: prompt }], stream: true },
    { signal: abortController.signal }
  );

  // ... rest of the handler
}

Now when the client disconnects, the upstream API call is cancelled promptly instead of running to completion. Because you pay for the tokens the model generates, this makes a measurable difference to your OpenAI bill once abandoned requests become common.
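Two gaps are worth closing around that wiring. An abort listener never fires if the signal was already aborted before the listener was attached, and the response stream has its own cancel() hook that typically runs when the consumer goes away. The sketch below covers both. The helper names linkedAbortController and cancellableSseStream are hypothetical, and the token source is a plain async iterable standing in for the model stream.

```typescript
// Hypothetical helper: derives a controller that follows a parent signal,
// including the case where the parent was aborted before we subscribed.
function linkedAbortController(parent: AbortSignal): AbortController {
  const child = new AbortController();
  if (parent.aborted) {
    child.abort();
  } else {
    parent.addEventListener("abort", () => child.abort(), { once: true });
  }
  return child;
}

// Hypothetical helper: a pull-based SSE stream that aborts the upstream
// request when the consumer cancels the stream.
function cancellableSseStream(
  tokens: AsyncIterable<string>,
  upstream: AbortController,
): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  const iterator = tokens[Symbol.asyncIterator]();
  return new ReadableStream<Uint8Array>({
    async pull(controller) {
      const { value, done } = await iterator.next();
      if (done) {
        controller.close();
        return;
      }
      controller.enqueue(
        encoder.encode(`data: ${JSON.stringify({ token: value })}\n\n`),
      );
    },
    cancel() {
      // Runs when the reader cancels, for example after a client disconnect.
      upstream.abort();
    },
  });
}
```

In the handler, the controller returned by linkedAbortController(req.signal) would be the one whose signal is passed to the OpenAI call and whose abort() is triggered from cancel(). Whether a given runtime calls cancel() on every disconnect is worth verifying in your own deployment rather than assuming.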

3. Reconnection Sends Duplicate Tokens

The browser's EventSource reconnects automatically after a dropped connection. Without extra work, the server has no record of where the client left off, so the new request starts the response from the beginning. Either the client displays duplicate tokens, or you add frontend deduplication logic that is hard to test. (EventSource only issues GET requests, so a client that calls a POST handler through fetch has to implement the same reconnect logic itself.)

The SSE spec provides id: fields for exactly this reason. Assign each event an incrementing ID and handle Last-Event-ID on the server:

// Server: assign event IDs
let eventId = 0;
controller.enqueue(
  encoder.encode(`id: ${eventId++}\ndata: ${JSON.stringify({ token })}\n\n`)
);

// Server: read Last-Event-ID from reconnect
const lastId = req.headers.get('last-event-id');
const startFrom = lastId ? parseInt(lastId, 10) + 1 : 0;

For short-lived one-shot completions you can take a simpler approach: assign a unique session ID per request, buffer completed tokens server-side in a short-lived store (Redis works well; an in-memory LRU cache works for single-instance deploys), and replay only the missed segment on reconnect.
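As a rough illustration of that buffering approach, here is a minimal in-memory sketch for a single-instance deploy. Everything in it is an assumption made for the example: the ReplayBuffer class, the 60-second default TTL, and the session ID being supplied by the caller. A Redis-backed version would keep the same two operations and swap the Map for list commands.

```typescript
type BufferedEvent = { id: number; token: string };
type Session = { events: BufferedEvent[]; expiresAt: number };

// Hypothetical in-memory replay buffer keyed by session ID.
class ReplayBuffer {
  private sessions = new Map<string, Session>();
  private ttlMs: number;

  constructor(ttlMs = 60_000) {
    this.ttlMs = ttlMs;
  }

  // Stores a token and returns it with its event ID (0, 1, 2, ...).
  append(sessionId: string, token: string, now = Date.now()): BufferedEvent {
    let session = this.sessions.get(sessionId);
    if (!session || session.expiresAt <= now) {
      session = { events: [], expiresAt: 0 };
      this.sessions.set(sessionId, session);
    }
    const event: BufferedEvent = { id: session.events.length, token };
    session.events.push(event);
    session.expiresAt = now + this.ttlMs;
    return event;
  }

  // Returns only the events after the given Last-Event-ID header value.
  replayAfter(
    sessionId: string,
    lastEventId: string | null,
    now = Date.now(),
  ): BufferedEvent[] {
    const session = this.sessions.get(sessionId);
    if (!session || session.expiresAt <= now) {
      this.sessions.delete(sessionId);
      return [];
    }
    const parsed = lastEventId === null ? NaN : Number.parseInt(lastEventId, 10);
    const startFrom = Number.isNaN(parsed) ? 0 : Math.max(0, parsed + 1);
    return session.events.slice(startFrom);
  }
}

// Formats a buffered event as an SSE frame with an id: field.
function toSseFrame(event: BufferedEvent): string {
  return `id: ${event.id}\ndata: ${JSON.stringify({ token: event.token })}\n\n`;
}
```

On reconnect, the handler would call replayAfter with the session ID and the last-event-id header, write the returned frames first, and then continue with live tokens. Expired sessions are only dropped lazily here, so a real deployment would also want a periodic sweep or a size cap.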

Handling Concurrency at the Edge

Next.js Route Handlers run on the Node.js runtime by default, but the streaming pattern above also works on the Edge runtime (export const runtime = 'edge'). The Edge runtime has tighter memory limits but lower cold-start latency, which suits short streaming responses.

The thing to watch: the Edge runtime has no persistent memory, so in-memory reconnection buffers do not survive across invocations. If you need reconnection replay and you are deploying to the Edge, the buffer must be external - Redis, Upstash, or a similar low-latency KV store.

For most teams shipping their first AI feature, the Node.js runtime with a short-lived in-process buffer is a reasonable starting point. Move to Edge plus an external store when you need global distribution or sub-100ms time-to-first-token for geographically distributed users.

Error Handling and Graceful Degradation

SSE has no built-in error channel. A common mistake is to close the stream silently on error, leaving the client waiting for tokens that will never arrive.

Send an explicit error event before closing:

try {
  for await (const chunk of stream) {
    // ... enqueue tokens
  }
} catch (err) {
  const message = err instanceof Error ? err.message : 'Upstream error';
  controller.enqueue(
    encoder.encode(`event: error\ndata: ${JSON.stringify({ message })}\n\n`)
  );
  controller.close();
}

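On the client, listen for the error event type explicitly. EventSource also fires its own error event, with no data attached, when the connection drops, so check for a payload before parsing:

source.addEventListener('error', (e) => {
  // A server-sent `event: error` arrives as a MessageEvent; a native connection error does not.
  const message = e instanceof MessageEvent ? JSON.parse(e.data).message : 'Connection lost';
  setError(message);
  source.close();
});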

This gives the UI something actionable to show rather than an indefinite loading spinner, which users correctly interpret as a broken experience.
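Once a stream carries token, error, and possibly id fields, building frames by string concatenation in several places gets error-prone. A small formatter keeps the wire format in one spot. This is a sketch only: formatSseEvent and toErrorEvent are hypothetical names, and the payload shape mirrors the snippets above.

```typescript
type SseEvent = { data: unknown; event?: string; id?: number };

// Serializes one SSE frame. JSON.stringify never emits raw newlines,
// so the payload always fits on a single data: line.
function formatSseEvent({ data, event, id }: SseEvent): string {
  const lines: string[] = [];
  if (event !== undefined) lines.push(`event: ${event}`);
  if (id !== undefined) lines.push(`id: ${id}`);
  lines.push(`data: ${JSON.stringify(data)}`);
  return lines.join("\n") + "\n\n";
}

// Maps any thrown value to the explicit error event described above.
function toErrorEvent(err: unknown): string {
  const message = err instanceof Error ? err.message : "Upstream error";
  return formatSseEvent({ event: "error", data: { message } });
}
```

In the catch block, controller.enqueue(encoder.encode(toErrorEvent(err))) would replace the inline template string. If upstream error messages may contain internal details, consider mapping them to a fixed set of user-facing messages before sending them to the browser.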

Testing Streaming Endpoints Under Load

A few tools worth having in your workflow before you call a streaming endpoint production-ready:

k6 is well suited for SSE load testing, although reading events incrementally typically requires an extension such as xk6-sse rather than the core HTTP client. You can script a virtual user that opens an SSE connection, reads all events, and asserts the [DONE] sentinel arrives within a timeout. Running 50-100 concurrent virtual users before launch surfaces most of the backpressure and memory issues described above.

WireMock can simulate slow upstream responses, which is useful for testing backpressure handling without incurring actual API costs.
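For tests that run inside the same process, a lighter alternative to an HTTP mock server is a fake upstream written as an async generator. This is not WireMock, just a stand-in that plays the same role. The sketch below assumes chunks shaped like the ones the handler reads (choices[0].delta.content), and fakeCompletionStream is a hypothetical name.

```typescript
type FakeChunk = { choices: { delta: { content: string } }[] };

function sleep(ms: number): Promise<void> {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Hypothetical stand-in for the model stream: yields one chunk per token,
// waiting delayMs before each, and stops early if the signal is aborted.
async function* fakeCompletionStream(
  tokens: string[],
  delayMs: number,
  signal?: AbortSignal,
): AsyncGenerator<FakeChunk> {
  for (const token of tokens) {
    await sleep(delayMs);
    if (signal?.aborted) return;
    yield { choices: [{ delta: { content: token } }] };
  }
}
```

Passing this generator where the handler expects the OpenAI stream lets you exercise slow-upstream and cancellation paths deterministically, with the delay and token list under the test's control.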

For unit testing the Route Handler itself, the @edge-runtime/jest-environment package lets you run Edge-compatible tests without a full browser. The ReadableStream pull semantics can be verified in isolation, giving you confidence that the streaming logic behaves correctly before wiring it into an integration test.
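A pull-semantics check can be as small as the sketch below, which needs no test framework. It wraps a counting token source in a pull-based stream, reads a single event, and reports how far the source was advanced. The helpers toSseStream, countingSource, and checkPullIsLazy are hypothetical names introduced for the example.

```typescript
// Hypothetical helper: wraps an async iterable of tokens in a pull-based SSE stream.
function toSseStream(tokens: AsyncIterable<string>): ReadableStream<Uint8Array> {
  const encoder = new TextEncoder();
  const iterator = tokens[Symbol.asyncIterator]();
  return new ReadableStream<Uint8Array>({
    async pull(controller) {
      const { value, done } = await iterator.next();
      if (done) {
        controller.enqueue(encoder.encode("data: [DONE]\n\n"));
        controller.close();
        return;
      }
      controller.enqueue(
        encoder.encode(`data: ${JSON.stringify({ token: value })}\n\n`),
      );
    },
    async cancel() {
      await iterator.return?.();
    },
  });
}

// Fake upstream that records how many tokens were requested from it.
function countingSource(total: number) {
  const state = { produced: 0 };
  async function* generate(): AsyncGenerator<string> {
    for (let i = 0; i < total; i++) {
      state.produced++;
      yield `t${i}`;
    }
  }
  return { state, tokens: generate() };
}

// Reads one event, waits briefly, and reports how much the source produced.
async function checkPullIsLazy(): Promise<{ produced: number; first: string }> {
  const { state, tokens } = countingSource(1000);
  const reader = toSseStream(tokens).getReader();
  const { value } = await reader.read();
  // A push-style start() loop would drain all 1000 tokens during this wait.
  await new Promise((resolve) => setTimeout(resolve, 20));
  const first = new TextDecoder().decode(value);
  await reader.cancel();
  return { produced: state.produced, first };
}
```

With a pull-based stream the produced count should stay within a token or two of what was read, because the stream only fills its small internal queue. A regression to an eager loop shows up as the count jumping to the full total.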

When to Move Beyond a Next.js Route Handler

For a small team building a single AI feature, a Next.js Route Handler is the right starting point. It ships fast, shares authentication middleware, and keeps the infrastructure simple.

When you start hitting limits - multiple concurrent model calls per user, real-time collaboration features, or streams that need to be fanned out to multiple consumers - a dedicated service becomes worth the investment. Technologies like nats.io or Redis Streams handle fan-out naturally and can buffer for reconnecting clients without you managing that logic yourself.

At that point the Next.js layer becomes a thin proxy: it authenticates the request, opens a connection to the streaming service, and forwards events to the browser. The business logic lives elsewhere and can scale independently of the frontend.
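The thin-proxy shape can be sketched in a few lines. The example below is illustrative only: proxyStream, the isAuthorized callback, and the upstream URL are all hypothetical, and the fetch implementation is injectable purely so the function can be exercised without a network.

```typescript
// Hypothetical thin proxy: authenticate, call the streaming service,
// and hand its body to the browser without buffering it.
async function proxyStream(
  req: Request,
  upstreamUrl: string,
  isAuthorized: (req: Request) => boolean | Promise<boolean>,
  fetchImpl: typeof fetch = fetch,
): Promise<Response> {
  if (!(await isAuthorized(req))) {
    return new Response("Unauthorized", { status: 401 });
  }

  const upstream = await fetchImpl(upstreamUrl, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: await req.text(),
    // Forward the client's abort signal so a disconnect cancels the upstream call.
    signal: req.signal,
  });

  if (!upstream.ok || !upstream.body) {
    return new Response("Upstream unavailable", { status: 502 });
  }

  // Passing the upstream body straight through keeps backpressure intact.
  return new Response(upstream.body, {
    headers: {
      "Content-Type": "text/event-stream",
      "Cache-Control": "no-cache",
    },
  });
}
```

A Route Handler would then reduce to something like a single call to proxyStream with the incoming request, the service URL, and your own session check. Because the body is never read in the Next.js layer, the proxy adds little memory pressure of its own.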

Wrapping Up

Streaming LLM responses in Next.js is not complicated at the demo level. Making them reliable under real traffic requires three things the naive pattern skips: respecting backpressure, propagating cancellation, and handling reconnection without duplicate tokens.

None of this is clever engineering. It is the kind of detail that separates a feature your users can depend on from one that works in demos and degrades silently in production.

If you are building AI-powered features into an existing product and want a second opinion on the architecture before you commit to a pattern, reach out at hello@wolf-tech.io. The approach that works in development and the one that holds up at scale are often different, and catching the gap early is cheaper than fixing it after launch.

Find out more about how Wolf-Tech approaches custom software development and web application development with reliability as a first-class concern.