On-Demand ISR on self-hosted Next.js

Next.js recently (in version 12.2) stabilized a new feature: On-Demand Incremental Static Regeneration. This is supposed to make it possible to invalidate stale statically rendered pages right when the content source changes.

My current project team fetches content from a headless CMS and runs Next.js on an autoscaling container service. We currently use incremental static regeneration with a fixed interval and considered using On-Demand ISR instead. This surfaced some problems that we were lucky to catch before a release.

Background

Static Rendering is one of the key features of Next.js. For pages where this feature is enabled, Next.js will fetch the necessary data (by executing getStaticProps) and pre-render the page during build time.

Incremental Static Regeneration (ISR) extends this concept to pages where the path is not known at build time. For example, when a new blog post is created and a user visits it, Next.js can fetch its data, render the content on demand and cache it for future requests. It is also possible to specify an interval after which Next.js will automatically rebuild the content.
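
As a rough illustration of the interval-based variant: getStaticProps can return a revalidate value in seconds, and getStaticPaths can use fallback: 'blocking' to render unknown paths on first request. This is only a sketch and not taken from our project. The Post type, the fetchPost helper and the 60 second interval are made up, and in a real page file (something like pages/blog/[slug].tsx) both functions would have to be exported.

```typescript
type Post = { slug: string; title: string }

// Hypothetical stand-in for a call to the headless CMS.
async function fetchPost(slug: string): Promise<Post> {
  return { slug, title: `Post ${slug}` }
}

// No paths are pre-rendered at build time; unknown paths are rendered
// on their first request and cached afterwards.
async function getStaticPaths() {
  return { paths: [], fallback: 'blocking' as const }
}

async function getStaticProps(context: { params: { slug: string } }) {
  const post = await fetchPost(context.params.slug)
  return {
    props: { post },
    // Regenerate the page in the background at most once every 60 seconds.
    revalidate: 60,
  }
}
```

With this setup a visitor may still see the old version for up to the interval plus one render, because the stale page is served while the new one is generated.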

On-Demand ISR is an API to manually invalidate the cached content. With this feature it is possible to have the content backend, e.g. a CMS, call a webhook to rebuild pages immediately when content changes instead of having to

trigger a complete new build on each content change
wait for the revalidation interval until new content comes online.

This works by setting up an API route that will invalidate the relevant content:

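// /pages/api/revalidate.ts
import type { NextApiRequest, NextApiResponse } from 'next'

export default async function handler(req: NextApiRequest, res: NextApiResponse) {
  // do some authentication here
  // re-render the given page; '/blog/my-post' is a placeholder path
  await res.revalidate('/blog/my-post')
  return res.json({ revalidated: true })
}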
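
A slightly fuller sketch of such a route, with the authentication step filled in, could look like the following. It is written against two minimal local interfaces instead of the real Next.js request and response types so that it runs on its own. The secret and path query parameters and the handleRevalidate name are my assumptions, not something Next.js prescribes; res.revalidate(path) is the method Next.js provides on the API response.

```typescript
// Minimal structural stand-ins for NextApiRequest / NextApiResponse.
// Only the members used below are declared.
interface RevalidateRequest {
  query: { [key: string]: string | string[] | undefined }
}
interface RevalidateResponse {
  status(code: number): RevalidateResponse
  json(body: unknown): void
  revalidate(path: string): Promise<void>
}

async function handleRevalidate(
  req: RevalidateRequest,
  res: RevalidateResponse,
  expectedSecret: string, // e.g. read from an environment variable
): Promise<void> {
  // Reject callers that do not know the shared secret.
  if (req.query.secret !== expectedSecret) {
    res.status(401).json({ message: 'Invalid token' })
    return
  }
  const path = req.query.path
  if (typeof path !== 'string' || !path.startsWith('/')) {
    res.status(400).json({ message: 'Missing or invalid path' })
    return
  }
  try {
    // Re-render the page at `path` and replace the cached version
    // on the instance that handles this request.
    await res.revalidate(path)
    res.status(200).json({ revalidated: true })
  } catch {
    res.status(500).json({ message: 'Error revalidating' })
  }
}
```

In a real route file the default export would simply call this function with the secret taken from the environment.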

Self-hosted Next.js

Many of our customers run Next.js on an auto-scaling container platform. Typical candidates are

Kubernetes-based deployments
Google Cloud Run
AWS Fargate
fly.io

These platforms work in a similar way: They spin up a container running the actual Next.js instance. The platform then watches the container's resource usage and spins up additional instances if metrics such as memory or CPU usage go above some threshold. Requests are routed to a load balancer that automatically selects an instance for each request.

Figure: Schematic of how a load balancer works, with multiple instances behind a load balancer

The Problem

Data from ISR will be stored in the local memory and/or file system of each container.

An invalidation request from the content source to /api/revalidate will be routed through the same load balancer as any other request and therefore reach exactly one of the running containers.

Therefore, the invalidation request invalidates only the local cache of that one container, leaving all other containers with an outdated version of the page.

Figure: Schematic of a Next.js instance serving a stale cached response after an invalidation call
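
The effect can be reproduced with a small toy model. The sketch below does not use Next.js at all: the Instance and LoadBalancer classes are made up for illustration, the ISR cache is reduced to a Map, the load balancer is assumed to be round-robin, and revalidation is simplified to dropping the cache entry.

```typescript
// Each instance keeps its own cache, here reduced to a Map from path to content.
class Instance {
  name: string
  private cms: Map<string, string>
  private cache = new Map<string, string>()

  constructor(name: string, cms: Map<string, string>) {
    this.name = name
    this.cms = cms
  }

  // Serve from the local cache, rendering from the CMS on a cache miss.
  render(path: string): string {
    if (!this.cache.has(path)) {
      this.cache.set(path, this.cms.get(path) ?? 'not found')
    }
    return this.cache.get(path)!
  }

  // On-demand revalidation only touches this instance's cache.
  revalidate(path: string): void {
    this.cache.delete(path)
  }
}

// Round-robin load balancer: every request reaches exactly one instance.
class LoadBalancer {
  private instances: Instance[]
  private next = 0

  constructor(instances: Instance[]) {
    this.instances = instances
  }

  pick(): Instance {
    const instance = this.instances[this.next]
    this.next = (this.next + 1) % this.instances.length
    return instance
  }
}

const cms = new Map([['/blog/hello', 'v1']])
const instances = [new Instance('a', cms), new Instance('b', cms), new Instance('c', cms)]
const lb = new LoadBalancer(instances)

// Warm every cache with the old content.
instances.forEach((i) => i.render('/blog/hello'))

// The content changes and the CMS webhook goes through the load balancer.
cms.set('/blog/hello', 'v2')
lb.pick().revalidate('/blog/hello')

// Only the instance that received the webhook serves the new content.
const served = instances.map((i) => `${i.name}=${i.render('/blog/hello')}`)
console.log(served.join(' ')) // a=v2 b=v1 c=v1
```

Two out of three visitors would keep seeing the old version until their instance is replaced or an interval-based revalidation kicks in.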

Mitigation

To avoid this problem there are several strategies:

Figure out a way to route revalidation requests to all containers, not just to one of them. As far as I know this is nontrivial on at least Cloud Run and Fargate deployments. It also comes with its own set of issues, such as what happens if one instance is temporarily unavailable when the invalidation is requested.
Replace the containers instead of just invalidating the cache. This will take a bit longer and require interaction with the container orchestrator but it will work.
Resort to revalidation intervals instead and live with the fact that it'll take a moment for updated content to go online and that your container will re-render your page a couple of times more often than strictly necessary.
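
To give an idea of what the first strategy involves, here is a minimal fan-out sketch. It assumes that the addresses of all running instances are already known (which is exactly the hard part on Cloud Run and Fargate) and that each instance exposes /api/revalidate with path and secret query parameters. The revalidateEverywhere name, the Send type and the parameter names are made up for this example. The function that performs the HTTP call is passed in, so the sketch runs without a network.

```typescript
// Performs one HTTP request; only the success flag of the response is needed.
type Send = (url: string) => Promise<{ ok: boolean }>

interface FanOutResult {
  succeeded: string[]
  failed: string[]
}

// Calls the revalidation route on every known instance and reports which
// ones failed, so the caller can retry them or replace those containers.
async function revalidateEverywhere(
  instanceUrls: string[],
  path: string,
  secret: string,
  send: Send,
): Promise<FanOutResult> {
  const query = `path=${encodeURIComponent(path)}&secret=${encodeURIComponent(secret)}`
  const outcomes = await Promise.all(
    instanceUrls.map((base) =>
      // A rejected request (instance unreachable) counts as a failure, not a crash.
      send(`${base}/api/revalidate?${query}`).then(
        (response) => response.ok,
        () => false,
      ),
    ),
  )
  const result: FanOutResult = { succeeded: [], failed: [] }
  instanceUrls.forEach((base, index) => {
    if (outcomes[index]) {
      result.succeeded.push(base)
    } else {
      result.failed.push(base)
    }
  })
  return result
}
```

In a real deployment send could be something like (url) => fetch(url, { method: 'POST' }). The sketch still does not cover instances that start or stop while the fan-out is running, which is why I consider this approach fragile.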

Summary

In auto-scaling environments, invalidation requests will usually hit only one of the active instances.
Distributing the invalidation request to an unknown number of running instances is hard.

My recommendation is to stick with revalidation intervals for ISR on auto-scaling cloud infrastructure because they provide a seamless experience (with stale-while-revalidate caching) without the pain of distributed cache invalidation.