
Optimizing Next.js on EKS: Tips I’ve learned as an SRE

Tags: Node.js, Kubernetes

Introduction

I used to be a frontend engineer building Next.js apps; now, as an SRE, I keep those same apps running on EKS. Next.js is easy to get running locally, but running it reliably in production is a whole different game: you need proper SRE thinking around design, monitoring, and performance.

This post shares my hard-earned lessons specifically around Next.js + EKS based on real production issues we faced and solved.

1. Designing health checks (readiness / liveness / startup)

Always split readiness and liveness. When your app has external dependencies (like DBs or APIs), check them in readiness so traffic stops flowing to the pod when they fail. Liveness, meanwhile, should be just a process heartbeat and shouldn't be overly aggressive. For SSR apps with slow boot times, a startupProbe adds stability by holding off liveness checks until the app has booted.

Handler example (with external deps)

app.get('/health/readiness/', async (_, reply) => {
  const ok = await checkExternalDeps() // check DB/API
  if (!ok) return reply.code(503).send({ status: 'NOT_READY' }) // pull pod from the LB
  reply.code(200).send({ status: 'READY' })
})

Handler example (no deps)

For simple apps with no dependencies, combining them is fine.

for (const probe of ['liveness', 'readiness']) {
  app.get(`/health/${probe}/`, (_, reply) => {
    reply.code(200).send({ status: 'READY' })
  })
}

Probe example (EKS)

readinessProbe:
  httpGet: { path: /health/readiness/, port: 3000 }
  initialDelaySeconds: 3
  periodSeconds: 10
livenessProbe:
  httpGet: { path: /health/liveness/, port: 3000 }
  initialDelaySeconds: 10
  periodSeconds: 30
startupProbe:
  httpGet: { path: /health/liveness/, port: 3000 }
  failureThreshold: 30
  periodSeconds: 5

Quick rules of thumb:

  • startupProbe: delays liveness until app fully boots
  • readiness: removes pod from LB if dependencies break
  • liveness: restarts pod if it's frozen (too aggressive = noisy restarts)

2. Misconceptions about SSR and scaling

More CPU ≠ faster SSR. Node.js runs your JavaScript on a single thread, so even if your pod gets 2 cores, your app might only use 1. You'll need Node's cluster module or a process manager like PM2 to match processes to CPU cores.

I go deeper into this here:

https://zoelog.vercel.app/articles/infrastructure/enable-multi-cpu-nodejs
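As an illustration, here is a minimal PM2 ecosystem config sketch for cluster mode. The app name and entry script are assumptions — adjust them to your build (Next.js standalone output, for instance, produces a server.js).

```javascript
// ecosystem.config.js — PM2 cluster-mode sketch.
// Assumes server.js starts your Next.js server (hypothetical entry point).
module.exports = {
  apps: [
    {
      name: 'next-ssr',     // hypothetical app name
      script: 'server.js',
      instances: 'max',     // one process per available CPU core
      exec_mode: 'cluster', // PM2 load-balances requests across processes
    },
  ],
}
```

With this, a pod requesting 2 cores actually runs 2 Node processes behind PM2's internal load balancer instead of leaving one core idle.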

3. SSR vs. CSR chaos testing scenarios

SSR and CSR behave very differently during failures — which means you need to test both paths.

  • SSR: if an API fails, the page likely returns HTTP 5xx
  • CSR: even if there's an error, the page might return 200 and silently fail (e.g. empty widget)

Here’s a test matrix I use:

| Case | Scenario | Expected (SSR) | Expected (CSR) |
| --- | --- | --- | --- |
| API failure | API 5xx | Page 5xx or error fallback | Page 200, but component is hidden / placeholder shown |
| Latency | API 3s→10s | SSR times out or retries | UI loading shown, then timeout |
| Auth expired | Session dead | Redirect or error page | Client detects and prompts re-login |

In prod, we often accept partial degradation. I recommend visual regression or component visibility checks in Playwright E2E tests.
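Before reaching for full E2E, a tiny probe script can already tell the two failure modes apart during a chaos test. This is a sketch — the URL and widget marker you pass in are hypothetical and depend on your app's markup:

```javascript
// Sketch: classify a page's failure mode during a chaos test.
// SSR failures surface as HTTP 5xx; CSR failures often return 200
// with the widget silently missing, so we also inspect the HTML body.
async function classifyFailure(url, widgetMarker) {
  const res = await fetch(url) // global fetch (Node 18+)
  if (res.status >= 500) return 'ssr-failure'
  const html = await res.text()
  if (!html.includes(widgetMarker)) return 'csr-silent-failure' // 200, but widget gone
  return 'healthy'
}
```

The 'csr-silent-failure' case is exactly the one that plain HTTP-status monitoring misses, which is why the component-visibility checks above matter.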

4. Cache design (CDN x SSR)

Placing a CDN in front of your SSR app can drastically reduce load — but you need a clear caching strategy.

  • Static assets (/_next/static/*, images, fonts): Use Cache-Control: public, max-age=31536000, immutable → file names include hashes, so cache busting is automatic

  • SSR HTML:
    If personalized or uses auth, use no-store. If you're okay with CDN caching briefly, use short s-maxage (like 30s).
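For SSR HTML, one documented place to set these headers per page is getServerSideProps, since Cache-Control on pages can't reliably be set from next.config.js headers. The route and values below are illustrative:

```javascript
// pages/news/[slug].js sketch — the route is hypothetical.
export async function getServerSideProps({ res }) {
  // Non-personalized SSR HTML: allow brief CDN caching,
  // and serve stale content while revalidating in the background.
  res.setHeader('Cache-Control', 'public, s-maxage=30, stale-while-revalidate=60')
  return { props: {} }
}

// For personalized/authenticated pages, set 'no-store' instead
// so the CDN never caches a user-specific response.
```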

Rough diagram

Client → CDN (Akamai/CloudFront) → ALB → Next.js SSR
              └─ cache layer goes here

5. Logging (structured + latency + correlation ID)

Out-of-the-box Next.js logging is limited. In a custom server (like Fastify), log structured request data and latency with an onResponse hook. It helps a ton when debugging.

Minimal Fastify example:

app.addHook('onResponse', (req, reply, done) => {
  req.log.info({
    http_method: req.method,
    http_status: reply.statusCode,
    user_agent: req.headers['user-agent'] ?? null,
    request_id: req.id, // Fastify's per-request ID, usable as a correlation ID
    latency: reply.getResponseTime() / 1000, // seconds
  })
  done()
})

Docs: https://fastify.dev/docs/latest/Reference/Hooks/

What to measure:

  • Latency (p50, p90, p99)
  • Error rate (4xx / 5xx)
  • Request count (by route)
  • Correlation ID (for tracing)

Summary (TL;DR)

  1. Use readiness / liveness / startup probes properly to avoid cascading failures
  2. Use Node's cluster module or PM2 to increase concurrency within 1 pod
  3. Test SSR/CSR separately and account for partial failures
  4. Cache wisely (long for static, short/no-cache for SSR HTML)
  5. Add structured logs + correlation ID to boost observability