
Optimizing Next.js on EKS: Tips I’ve learned as an SRE

Tags: Node.js, Kubernetes

Introduction

I used to be a frontend engineer building Next.js apps; now, as an SRE, I keep those same apps running on EKS. Next.js is easy to get running locally, but running it reliably in production is a whole different game: you need proper SRE thinking around design, monitoring, and performance.

This post shares my hard-earned lessons specifically around Next.js + EKS based on real production issues we faced and solved.

1. Designing health checks (readiness / liveness / startup)

Always split readiness and liveness. When your app has external dependencies (like DBs or APIs), check them in readiness so traffic stops flowing to the pod when they fail. Liveness, meanwhile, should be just a process heartbeat and shouldn't be overly aggressive. For SSR apps with slow boot times, a startupProbe adds stability by holding off liveness checks until the app has booted.

Handler example (with external deps)

app.get('/health/readiness/', async (_, reply) => {
  const ok = await checkExternalDeps() // check DB/API
  if (!ok) return reply.code(503).send({ status: 'NOT_READY' }) // pull pod from the LB
  reply.code(200).send({ status: 'READY' })
})

Handler example (no deps)

For simple apps with no dependencies, combining them is fine.

for (const probe of ['liveness', 'readiness']) {
  app.get(`/health/${probe}/`, (_, reply) => {
    reply.code(200).send({ status: 'READY' })
  })
}

Probe example (EKS)

readinessProbe:
  httpGet: { path: /health/readiness/, port: 3000 }
  initialDelaySeconds: 3
  periodSeconds: 10
livenessProbe:
  httpGet: { path: /health/liveness/, port: 3000 }
  initialDelaySeconds: 10
  periodSeconds: 30
startupProbe:
  httpGet: { path: /health/liveness/, port: 3000 }
  failureThreshold: 30
  periodSeconds: 5

Quick rules of thumb:

  • startupProbe: delays liveness until app fully boots
  • readiness: removes pod from LB if dependencies break
  • liveness: restarts pod if it's frozen (too aggressive = noisy restarts)

2. Misconceptions about SSR and scaling

More CPU ≠ faster SSR. Node.js runs your JavaScript on a single thread, so even if your pod gets 2 cores, your app might only use 1. You'll need Node's cluster module or a process manager like PM2 to match processes to CPU cores.

I go deeper into this here:

https://zoelog.vercel.app/articles/infrastructure/enable-multi-cpu-nodejs
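As an illustration, here is a minimal PM2 ecosystem config sketch for cluster mode. The app name and entry script are assumptions — adjust them to your build (Next.js standalone output, for instance, produces a server.js).

```javascript
// ecosystem.config.js — PM2 cluster-mode sketch.
// Assumes server.js starts your Next.js server (hypothetical entry point).
module.exports = {
  apps: [
    {
      name: 'next-ssr',     // hypothetical app name
      script: 'server.js',
      instances: 'max',     // one process per available CPU core
      exec_mode: 'cluster', // PM2 load-balances requests across processes
    },
  ],
}
```

With this, a pod requesting 2 cores actually runs 2 Node processes behind PM2's internal load balancer instead of leaving one core idle.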

3. SSR vs. CSR chaos testing scenarios

SSR and CSR behave very differently during failures — which means you need to test both paths.

  • SSR: if an API fails, the page likely returns HTTP 5xx
  • CSR: even if there's an error, the page might return 200 and silently fail (e.g. empty widget)

Here’s a test matrix I use:

| Case | Scenario | Expected (SSR) | Expected (CSR) |
| --- | --- | --- | --- |
| API failure | API 5xx | Page 5xx or error fallback | Page 200, but component is hidden / placeholder shown |
| Latency | API 3s→10s | SSR times out or retries | UI loading shown, then timeout |
| Auth expired | Session dead | Redirect or error page | Client detects and prompts re-login |

In prod, we often accept partial degradation. I recommend visual regression or component visibility checks in Playwright E2E tests.
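Before reaching for full E2E, a tiny probe script can already tell the two failure modes apart during a chaos test. This is a sketch — the URL and widget marker you pass in are hypothetical and depend on your app's markup:

```javascript
// Sketch: classify a page's failure mode during a chaos test.
// SSR failures surface as HTTP 5xx; CSR failures often return 200
// with the widget silently missing, so we also inspect the HTML body.
async function classifyFailure(url, widgetMarker) {
  const res = await fetch(url) // global fetch (Node 18+)
  if (res.status >= 500) return 'ssr-failure'
  const html = await res.text()
  if (!html.includes(widgetMarker)) return 'csr-silent-failure' // 200, but widget gone
  return 'healthy'
}
```

The 'csr-silent-failure' case is exactly the one that plain HTTP-status monitoring misses, which is why the component-visibility checks above matter.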

4. Cache design (CDN x SSR)

Placing a CDN in front of your SSR app can drastically reduce load — but you need a clear caching strategy.

  • Static assets (/_next/static/*, images, fonts): Use Cache-Control: public, max-age=31536000, immutable → file names include hashes, so cache busting is automatic

  • SSR HTML:
    If personalized or uses auth, use no-store. If you're okay with CDN caching briefly, use short s-maxage (like 30s).
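For SSR HTML, one documented place to set these headers per page is getServerSideProps, since Cache-Control on pages can't reliably be set from next.config.js headers. The route and values below are illustrative:

```javascript
// pages/news/[slug].js sketch — the route is hypothetical.
export async function getServerSideProps({ res }) {
  // Non-personalized SSR HTML: allow brief CDN caching,
  // and serve stale content while revalidating in the background.
  res.setHeader('Cache-Control', 'public, s-maxage=30, stale-while-revalidate=60')
  return { props: {} }
}

// For personalized/authenticated pages, set 'no-store' instead
// so the CDN never caches a user-specific response.
```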

Rough diagram

Client → CDN (Akamai/CloudFront) → ALB → Next.js SSR
              └─ cache layer goes here

5. Logging (structured + latency + correlation ID)

Out-of-the-box Next.js logging is limited. In a custom server (like Fastify), log structured request data and latency with an onResponse hook. It helps a ton when debugging.

Minimal Fastify example:

app.addHook('onResponse', (req, reply, done) => {
  req.log.info({
    http_method: req.method,
    http_status: reply.statusCode,
    user_agent: req.headers['user-agent'] ?? null,
    request_id: req.id, // Fastify's per-request ID, usable as a correlation ID
    latency: reply.getResponseTime() / 1000, // seconds
  })
  done()
})

Docs: https://fastify.dev/docs/latest/Reference/Hooks/

What to measure:

  • Latency (p50, p90, p99)
  • Error rate (4xx / 5xx)
  • Request count (by route)
  • Correlation ID (for tracing)

Summary (TL;DR)

  1. Use readiness / liveness / startup probes properly to avoid cascading failures
  2. Use Node's cluster module or PM2 to increase concurrency within 1 pod
  3. Test SSR/CSR separately and account for partial failures
  4. Cache wisely (long for static, short/no-cache for SSR HTML)
  5. Add structured logs + correlation ID to boost observability