
Optimizing Next.js on EKS: Tips I’ve learned as an SRE



Introduction

I used to be a frontend engineer building Next.js apps; now, as an SRE, I maintain those same apps running on EKS. Next.js is easy to get running locally, but running it reliably in production is a whole different game. You need proper SRE thinking around design, monitoring, and performance.

This post shares my hard-earned lessons about Next.js + EKS specifically, based on real production issues we faced and solved.

1. Designing health checks (readiness / liveness / startup)

Always split readiness and liveness. When your app has external dependencies (like DBs or APIs), check them in readiness so traffic stops being routed to the pod when they fail. Liveness, meanwhile, should be just a process heartbeat and shouldn't be overly aggressive. For SSR apps with slow boot times, a startupProbe adds extra stability.

Handler example (with external deps)

app.get('/health/readiness/', async (_, reply) => {
  const ok = await checkExternalDeps() // check DB/API connectivity
  if (!ok) {
    // Failing readiness takes the pod out of rotation without restarting it
    return reply.code(503).send({ status: 'NOT_READY' })
  }
  reply.code(200).send({ status: 'READY' })
})
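What checkExternalDeps does depends on your stack. A minimal sketch, assuming Node 18+ (global fetch) and a hypothetical db client and internal API URL:

// Hypothetical dependency check; keep it cheap and time-bound
async function checkExternalDeps() {
  try {
    await Promise.all([
      db.query('SELECT 1'), // hypothetical DB client
      fetch('http://internal-api.local/health', { signal: AbortSignal.timeout(1000) }),
    ])
    return true
  } catch {
    return false
  }
}

Keep these checks fast and with short timeouts so the probe itself doesn't become a source of latency or flakiness.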

Handler example (no deps)

For simple apps with no dependencies, combining them is fine.

const healthHandler = (_, reply) => reply.code(200).send({ status: 'READY' })
app.get('/health/liveness/', healthHandler)
app.get('/health/readiness/', healthHandler)

Probe example (EKS)

readinessProbe:
  httpGet: { path: /health/readiness/, port: 3000 }
  initialDelaySeconds: 3
  periodSeconds: 10
livenessProbe:
  httpGet: { path: /health/liveness/, port: 3000 }
  initialDelaySeconds: 10
  periodSeconds: 30
startupProbe:
  httpGet: { path: /health/liveness/, port: 3000 }
  failureThreshold: 30
  periodSeconds: 5

Quick rules of thumb:

- Readiness: include external dependencies; a failure pulls the pod out of traffic without restarting it
- Liveness: process heartbeat only, with generous thresholds, so transient dependency issues don't trigger restarts
- Startup: gives slow-booting SSR apps time before the liveness probe starts judging them

2. Misconceptions about SSR and scaling

More CPU ≠ faster SSR. Node.js runs JavaScript on a single thread, so even if your pod gets 2 cores, your app might only use 1. You'll need the built-in cluster module or a process manager like PM2 to match worker processes to CPU cores.

I go deeper into this here:

https://zoelog.vercel.app/articles/infrastructure/enable-multi-cpu-nodejs
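For reference, here's a minimal sketch using the built-in cluster module; ./server.js is a hypothetical entry point that starts the Next.js custom server:

// Fork one worker per CPU core so a multi-core pod is actually used
import cluster from 'node:cluster'
import os from 'node:os'

if (cluster.isPrimary) {
  for (let i = 0; i < os.cpus().length; i++) {
    cluster.fork()
  }
  cluster.on('exit', () => cluster.fork()) // replace a crashed worker
} else {
  await import('./server.js') // hypothetical entry that starts the Next.js server
}

PM2's cluster mode (pm2 start server.js -i max) achieves the same thing without hand-rolling the fork logic.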

3. SSR vs. CSR chaos testing scenarios

SSR and CSR behave very differently during failures — which means you need to test both paths.

Here’s a test matrix I use:

| Case | Scenario | Expected (SSR) | Expected (CSR) |
| --- | --- | --- | --- |
| API failure | API 5xx | Page 5xx or error fallback | Page 200, but component is hidden / placeholder shown |
| Latency | API 3s → 10s | SSR times out or retries | UI loading shown, then timeout |
| Auth expired | Session dead | Redirect or error page | Client detects and prompts re-login |

In prod, we often accept partial degradation. I recommend visual regression or component visibility checks in Playwright E2E tests.
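As a hedged example of such a check (the page route, API path, and test ids below are hypothetical), Playwright can stub the API failure and assert the CSR fallback:

import { test, expect } from '@playwright/test'

test('page degrades gracefully when the API returns 5xx', async ({ page }) => {
  // Force the component's API call to fail for this page load
  await page.route('**/api/recommendations', (route) =>
    route.fulfill({ status: 500, body: 'upstream error' })
  )

  const response = await page.goto('/products/123')
  expect(response?.status()).toBe(200) // CSR: the page itself still renders

  // The broken component is hidden and a placeholder is shown instead
  await expect(page.getByTestId('recommendations')).toBeHidden()
  await expect(page.getByTestId('recommendations-fallback')).toBeVisible()
})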

4. Cache design (CDN x SSR)

Placing a CDN in front of your SSR app can drastically reduce load — but you need a clear caching strategy.

Rough diagram

Client → CDN (Akamai/CloudFront) → ALB → Next.js SSR
              └── Cache layer goes here
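A minimal sketch of one such strategy, with a hypothetical route and illustrative TTL values: static assets under /_next/static already ship with long-lived immutable cache headers, so the real decision is how long the CDN may hold SSR HTML.

// pages/products/[id].js — hypothetical page; TTLs are illustrative, not production-tuned
export async function getServerSideProps({ res, params }) {
  // Let the CDN cache the SSR HTML briefly and revalidate in the background,
  // so most requests never reach the pod
  res.setHeader('Cache-Control', 'public, s-maxage=30, stale-while-revalidate=60')

  const product = await fetchProduct(params.id) // hypothetical data fetch
  return { props: { product } }
}

Personalized or per-user pages should stay private / no-store instead of being cached at the CDN.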

5. Logging (structured + latency + correlation ID)

Out-of-the-box Next.js logging is limited. In a custom server (like Fastify), log structured request data and latency using the onResponse hook. It helps a ton when debugging.

Minimal Fastify example:

app.addHook('onResponse', (req, reply, done) => {
  req.log.info({
    request_id: req.id, // Fastify's per-request id, used as the correlation ID
    http_method: req.method,
    http_status: reply.statusCode,
    user_agent: req.headers['user-agent'] ?? null,
    latency: reply.getResponseTime() / 1000, // seconds
  })
  done()
})

Docs: https://fastify.dev/docs/latest/Reference/Hooks/
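To make that correlation ID useful across services, Fastify's genReqId option can reuse an upstream request-id header when one is present; the x-request-id header name here is an assumption about what your CDN/ALB forwards:

import Fastify from 'fastify'
import { randomUUID } from 'node:crypto'

const app = Fastify({
  logger: true,
  // Reuse an incoming x-request-id if present, otherwise generate one
  genReqId: (req) => req.headers['x-request-id'] ?? randomUUID(),
})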

What to measure:

- HTTP method and status code
- Per-request latency (the hook above logs seconds)
- User agent
- Correlation ID (req.id), so one request's log lines can be stitched together

Summary (TL;DR)

  1. Use readiness / liveness / startup probes properly to avoid cascading failures
  2. Use cluster or PM2 to increase concurrency within 1 pod
  3. Test SSR/CSR separately and account for partial failures
  4. Cache wisely (long for static, short/no-cache for SSR HTML)
  5. Add structured logs + correlation ID to boost observability
