DevOps Guide | 11 min read | March 22, 2026

Keycloak on Kubernetes: Production-Ready High Availability Guide

Deploy a production-grade Keycloak cluster on Kubernetes with PostgreSQL, Infinispan caching, auto-scaling, monitoring, and zero-downtime upgrades. Architecture patterns for 99.99% uptime.

KeycloakPro Team

Why Single-Node Keycloak Fails in Production

Running Keycloak on a single instance is fine for development. In production, it becomes a liability. Here is what goes wrong:

  • Single point of failure: One node crash means every application relying on Keycloak for authentication goes down simultaneously. No login, no token refresh, no admin console.
  • Session loss: User sessions live in memory by default. A restart wipes them all, forcing every user to re-authenticate at once.
  • No horizontal scaling: A single JVM can only handle so many concurrent token validations. During traffic spikes, latency climbs and timeouts cascade.
  • Upgrade downtime: Every Keycloak version upgrade requires stopping the instance, running database migrations, and restarting. That is a maintenance window your users will feel.

For any workload beyond a proof of concept, you need a multi-node deployment with failover, session replication, and zero-downtime upgrades. Kubernetes is the natural platform for this.

Architecture Overview: The HA Keycloak Stack

A production-grade Keycloak deployment on Kubernetes consists of four layers:

| Layer | Component | Purpose |
|---|---|---|
| Application | Keycloak StatefulSet (3+ pods) | Authentication and authorization |
| Caching | Infinispan (embedded or external) | Distributed session and cache replication |
| Database | PostgreSQL HA (primary + replica) | Persistent storage for realms, users, credentials |
| Observability | Prometheus + Grafana | Metrics, alerting, and dashboards |

Traffic enters through a Kubernetes Ingress or Gateway API resource, hits the Keycloak pods behind a Service, which share sessions via Infinispan and persist state to PostgreSQL. Prometheus scrapes metrics from all layers.
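
The traffic path described above can be sketched as a client-facing Service plus an Ingress. This is a minimal sketch, not part of the original manifests: the Service name keycloak, the nginx ingress class, and the TLS secret name are assumptions, while the hostname and the http port name follow the StatefulSet shown later in this guide.

```yaml
# Sketch only: Service name, ingress class, and TLS secret are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: keycloak                 # hypothetical client-facing Service
  namespace: keycloak
spec:
  type: ClusterIP
  selector:
    app: keycloak                # same pod label the StatefulSet uses
  ports:
    - name: http
      port: 8080
      targetPort: http           # named container port on the Keycloak pods
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: keycloak
  namespace: keycloak
spec:
  ingressClassName: nginx        # assumed ingress controller
  tls:
    - hosts:
        - auth.example.com
      secretName: auth-example-com-tls   # hypothetical TLS secret
  rules:
    - host: auth.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: keycloak
                port:
                  name: http
```

TLS terminates at the Ingress in this sketch, which is why the StatefulSet enables plain HTTP and trusts the X-Forwarded headers.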

Choosing Your Deployment Method

You have three options, each with trade-offs:

| Method | Pros | Cons | Best For |
|---|---|---|---|
| Helm Chart (Bitnami or Codecentric) | Fast setup, community-maintained values | Less control over fine-grained config | Teams wanting quick, opinionated deploys |
| Raw Manifests | Full control, easy to audit | More YAML to maintain | Teams with strong K8s expertise |
| Keycloak Operator | CRD-based, declarative lifecycle | Newer, smaller community | GitOps workflows, OpenShift environments |

For most teams, the Helm chart approach strikes the right balance. The examples in this guide use raw manifests for clarity, but the concepts translate directly to Helm values.
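
For comparison, here is roughly what the Operator route looks like. This is a hedged sketch of a Keycloak custom resource, assuming the Keycloak Operator and its CRDs are already installed in the cluster. The database host and credentials Secret reuse names that appear later in this guide, and the field values are illustrative rather than a tested configuration.

```yaml
# Sketch only: assumes the Keycloak Operator is installed; values are illustrative.
apiVersion: k8s.keycloak.org/v2alpha1
kind: Keycloak
metadata:
  name: keycloak
  namespace: keycloak
spec:
  instances: 3
  db:
    vendor: postgres
    host: keycloak-db-pooler-rw          # PgBouncer Service from the PostgreSQL section
    database: keycloak
    usernameSecret:
      name: keycloak-db-credentials
      key: username
    passwordSecret:
      name: keycloak-db-credentials
      key: password
  http:
    httpEnabled: true                    # TLS is assumed to terminate at the Ingress
  hostname:
    hostname: auth.example.com
  proxy:
    headers: xforwarded
```

The Operator then generates the StatefulSet and Services itself, which is the "declarative lifecycle" trade-off in the table: less YAML to own, but less direct control over the generated objects.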

PostgreSQL HA Setup

Keycloak's database is its single most critical dependency. A standalone PostgreSQL instance is just as dangerous as a single Keycloak node.

Primary-Replica with Streaming Replication

Use a PostgreSQL operator like CloudNativePG or Zalando's Postgres Operator to manage replication automatically:

# CloudNativePG Cluster definition
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: keycloak-db
  namespace: keycloak
spec:
  instances: 3
  primaryUpdateStrategy: unsupervised
  storage:
    size: 20Gi
    storageClass: gp3-encrypted
  postgresql:
    parameters:
      max_connections: "200"
      shared_buffers: "512MB"
      effective_cache_size: "1536MB"
      work_mem: "8MB"
  backup:
    barmanObjectStore:
      destinationPath: "s3://keycloak-backups/postgresql"
      s3Credentials:
        accessKeyId:
          name: s3-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: s3-creds
          key: SECRET_ACCESS_KEY
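
One gap worth closing: without a bootstrap section, CloudNativePG creates its default application database rather than one named keycloak, while the StatefulSet later in this guide connects to a keycloak database using a Secret called keycloak-db-credentials. A possible way to line these up is sketched below. The role name and the placeholder password are assumptions, and in practice the Secret would come from your secret manager.

```yaml
# Sketch only: role name and password are placeholders.
apiVersion: v1
kind: Secret
metadata:
  name: keycloak-db-credentials
  namespace: keycloak
type: kubernetes.io/basic-auth
stringData:
  username: keycloak
  password: change-me            # placeholder, replace before use
---
# Fragment to merge into the keycloak-db Cluster spec above (not a complete manifest)
spec:
  bootstrap:
    initdb:
      database: keycloak         # matches the database name in KC_DB_URL
      owner: keycloak            # must match the username in the Secret
      secret:
        name: keycloak-db-credentials
```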

Connection Pooling with PgBouncer

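Every Keycloak pod maintains its own JDBC connection pool, so the number of PostgreSQL connections grows with the pod count. PgBouncer multiplexes those client connections onto a smaller set of server connections: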

apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: keycloak-db-pooler-rw  # the pooler Service takes this name; it must match the host in KC_DB_URL
  namespace: keycloak
spec:
  cluster:
    name: keycloak-db
  instances: 2
  type: rw
  pgbouncer:
    poolMode: transaction
    parameters:
      default_pool_size: "25"
      max_client_conn: "200"

Point Keycloak at the PgBouncer service (keycloak-db-pooler-rw) rather than the database directly.

Infinispan Distributed Caching

Keycloak uses Infinispan for caching user sessions, authentication sessions, offline tokens, and login failures. In a multi-pod deployment, these caches must be replicated across nodes.

Embedded vs External Infinispan

Embedded (default): Keycloak pods form a JGroups cluster and replicate caches directly between themselves. Simpler to set up, works well for up to 5-8 nodes.

External: A separate Infinispan cluster manages caches. Better for large deployments (10+ Keycloak pods) where you want to scale the cache tier independently.

For most deployments, embedded Infinispan with DNS-based discovery is sufficient:

# Environment variables for JGroups DNS_PING discovery
- name: KC_CACHE
  value: "ispn"
- name: KC_CACHE_STACK
  value: "kubernetes"
- name: JAVA_OPTS_KC_HEAP
  value: "-XX:MaxRAMPercentage=70.0"
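# JGroups reads jgroups.dns.query as a JVM system property, not an environment variable
- name: JAVA_OPTS_APPEND
  value: "-Djgroups.dns.query=keycloak-headless.keycloak.svc.cluster.local"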

The headless Service (shown in the next section) enables pod-to-pod discovery. JGroups uses DNS_PING to find peers and forms the cache cluster automatically.

Cache Tuning

Configure cache owners based on your fault tolerance requirements:

| Cache | Default Owners | Recommended Owners | Rationale |
|---|---|---|---|
| sessions | 1 | 2 | Losing a node should not log users out |
| authenticationSessions | 1 | 2 | In-flight logins survive node failure |
| offlineSessions | 1 | 2 | Offline tokens persist across restarts |
| loginFailures | 1 | 1 | Brute-force counters can afford loss |

Kubernetes Deployment Configuration

Here are the headless Service, the StatefulSet, and the HorizontalPodAutoscaler, with health probes, resource limits, and auto-scaling:

apiVersion: v1
kind: Service
metadata:
  name: keycloak-headless
  namespace: keycloak
  labels:
    app: keycloak
spec:
  type: ClusterIP
  clusterIP: None
  selector:
    app: keycloak
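  ports:
    - name: http
      port: 8080
    - name: jgroups
      port: 7800
    # Exposed so the ServiceMonitor can scrape /metrics through the "management" port name
    - name: management
      port: 9000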
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: keycloak
  namespace: keycloak
spec:
  serviceName: keycloak-headless
  replicas: 3
  selector:
    matchLabels:
      app: keycloak
  template:
    metadata:
      labels:
        app: keycloak
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: keycloak
          image: quay.io/keycloak/keycloak:26.0.0
          args: ["start"]
          ports:
            - containerPort: 8080
              name: http
            - containerPort: 7800
              name: jgroups
            - containerPort: 9000
              name: management
          env:
            - name: KC_DB
              value: "postgres"
            - name: KC_DB_URL
              value: "jdbc:postgresql://keycloak-db-pooler-rw:5432/keycloak"
            - name: KC_DB_USERNAME
              valueFrom:
                secretKeyRef:
                  name: keycloak-db-credentials
                  key: username
            - name: KC_DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: keycloak-db-credentials
                  key: password
            - name: KC_HOSTNAME
              value: "auth.example.com"
            - name: KC_PROXY_HEADERS
              value: "xforwarded"
            - name: KC_HTTP_ENABLED
              value: "true"
            - name: KC_HEALTH_ENABLED
              value: "true"
            - name: KC_METRICS_ENABLED
              value: "true"
            - name: KC_CACHE
              value: "ispn"
            - name: KC_CACHE_STACK
              value: "kubernetes"
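            # JGroups reads jgroups.dns.query as a JVM system property, not an environment variable
            - name: JAVA_OPTS_APPEND
              value: "-Djgroups.dns.query=keycloak-headless.keycloak.svc.cluster.local"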
            - name: JAVA_OPTS_KC_HEAP
              value: "-XX:MaxRAMPercentage=70.0"
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "2"
              memory: "2Gi"
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 9000
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health/live
              port: 9000
            initialDelaySeconds: 60
            periodSeconds: 15
            failureThreshold: 5
          startupProbe:
            httpGet:
              path: /health/started
              port: 9000
            initialDelaySeconds: 15
            periodSeconds: 5
            failureThreshold: 30
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: keycloak-hpa
  namespace: keycloak
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: keycloak
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
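    # Utilization is measured against the pod's memory request (1Gi), not its limit.
    # A JVM rarely returns heap to the OS, so this metric can hold the replica count high.
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80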

Key decisions in this configuration:

  • StatefulSet over Deployment ensures stable network identities for JGroups cluster formation.
  • Separate management port (9000) for health checks keeps probe traffic off the main HTTP port.
  • startupProbe with generous thresholds prevents Kubernetes from killing pods during slow initial startup or database migration.
  • terminationGracePeriodSeconds: 60 gives in-flight requests time to complete during scale-down.
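
Because each scale-down removes a cache member, it can help to slow the autoscaler down so that Infinispan has time to rebalance between pod removals. The fragment below is a sketch of the autoscaling/v2 behavior field. The window and policy values are illustrative assumptions, not tuned recommendations.

```yaml
# Fragment to merge into the keycloak-hpa spec; values are illustrative assumptions.
spec:
  behavior:
    scaleUp:
      stabilizationWindowSeconds: 60
      policies:
        - type: Pods
          value: 2               # add at most 2 pods per minute
          periodSeconds: 60
    scaleDown:
      stabilizationWindowSeconds: 300   # wait for 5 minutes of sustained low load
      policies:
        - type: Pods
          value: 1               # remove at most 1 pod every 2 minutes
          periodSeconds: 120
```

With two owners per session cache entry, removing one pod at a time keeps at least one copy of each entry alive while the remaining members rebalance.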

Zero-Downtime Upgrades

Keycloak upgrades involve database schema migrations that must run exactly once. Here is how to handle them safely:

Rolling Update Strategy

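  1. Keep the StatefulSet's RollingUpdate strategy, which replaces one pod at a time and waits for each replacement to pass its readiness probe before moving on. Note that maxSurge is a Deployment setting; a StatefulSet cannot start extra pods alongside the existing ones.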
  2. Use an init container or a separate Job to run database migrations before the new pods start serving traffic.
  3. Enable sticky sessions on your Ingress to prevent mid-authentication flows from being disrupted.

# Ingress annotation for sticky sessions (nginx)
metadata:
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "KC_ROUTE"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "600"
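
For the StatefulSet used in this guide, a staged rollout can be approximated with the partition field. The fragment below is a sketch assuming three replicas. The partition and minReadySeconds values are illustrative.

```yaml
# Fragment to merge into the keycloak StatefulSet spec (replicas: 3 assumed).
spec:
  minReadySeconds: 30            # a pod must stay Ready this long before the rollout continues
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      partition: 2               # only pods with ordinal >= 2 (keycloak-2) get the new template
```

Once keycloak-2 looks healthy on the new version, lowering partition to 0 lets the controller roll the remaining pods one at a time in reverse ordinal order.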

Blue-Green Deployment

For major version upgrades (e.g., 25.x to 26.x), a blue-green approach is safer:

  1. Deploy the new version as a separate StatefulSet (keycloak-green) pointing to a migrated copy of the database.
  2. Run smoke tests against the green deployment.
  3. Switch the Ingress backend from blue to green.
  4. Monitor for 30 minutes, then decommission blue.

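This approach adds infrastructure cost but greatly reduces risk for upgrades with breaking changes. Because green runs on a copy of the database, anything written to blue after the copy is taken is missing from green, so freeze admin changes or schedule the copy as close to cutover as possible.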
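
Step 3 might look like the sketch below. All names ending in -green are hypothetical, as is the nginx ingress class. The green pods carry their own app label here so that they are not selected by the blue Services, and they would also need their own headless Service for JGroups discovery so the two versions do not join the same cache cluster.

```yaml
# Sketch only: names suffixed -green and the ingress class are assumptions.
apiVersion: v1
kind: Service
metadata:
  name: keycloak-green
  namespace: keycloak
spec:
  selector:
    app: keycloak-green          # distinct label keeps green pods out of blue's Services
  ports:
    - name: http
      port: 8080
      targetPort: http
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: keycloak
  namespace: keycloak
spec:
  ingressClassName: nginx
  rules:
    - host: auth.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: keycloak-green   # previously the blue Service; revert this line to roll back
                port:
                  name: http
```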

Monitoring and Alerting

Keycloak exposes Prometheus metrics at /metrics on the management port (9000) when KC_METRICS_ENABLED=true. Here are the critical metrics to watch:

| Metric | Alert Threshold | Meaning |
|---|---|---|
| keycloak_request_duration_seconds (p99) | > 2s | Authentication latency too high |
| keycloak_request_errors_total | > 1% of total requests | Elevated error rate |
| jvm_memory_used_bytes / jvm_memory_max_bytes | > 85% | Memory pressure, risk of OOM |
| jvm_gc_pause_seconds (p99) | > 500ms | GC pauses affecting response times |
| vendor_cache_manager_default_cache_keycloak_sessions_statistics_stores | dropping | Session store failures |
| pg_stat_activity_count | > 80% of max_connections | Database connection exhaustion |
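
Two of these thresholds could be expressed as Prometheus Operator alert rules along the following lines. This is a sketch: the rule and alert names are hypothetical, the pod label assumes targets discovered through a ServiceMonitor, the 160 figure is 80% of the max_connections value of 200 set earlier, and pg_stat_activity_count assumes a PostgreSQL exporter that publishes that metric.

```yaml
# Sketch only: rule names, durations, and severities are assumptions.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: keycloak-alerts
  namespace: keycloak
spec:
  groups:
    - name: keycloak
      rules:
        - alert: KeycloakHeapPressure
          expr: |
            sum by (pod) (jvm_memory_used_bytes{area="heap"})
              / sum by (pod) (jvm_memory_max_bytes{area="heap"}) > 0.85
          for: 10m
          labels:
            severity: warning
          annotations:
            summary: "Keycloak heap usage above 85% on {{ $labels.pod }}"
        - alert: KeycloakDbConnectionsHigh
          expr: sum(pg_stat_activity_count) > 160   # 80% of max_connections (200)
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "PostgreSQL connections above 80% of max_connections"
```

Depending on how your Prometheus instance selects rules, the PrometheusRule may also need a matching label in its metadata.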

Grafana Dashboard

Use the community Keycloak Grafana dashboard (ID: 19659) as a starting point. Add panels for:

  • Token issuance rate (tokens/sec by grant type)
  • Active sessions per realm
  • Cache hit ratio (should be above 95%)
  • Database query latency (p50, p95, p99)

ServiceMonitor for Prometheus Operator

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: keycloak-monitor
  namespace: keycloak
spec:
  selector:
    matchLabels:
      app: keycloak
  endpoints:
    - port: management
      path: /metrics
      interval: 15s

Disaster Recovery and Backup Strategy

A robust DR plan covers three layers:

Database backups: Use continuous WAL archiving with CloudNativePG's built-in Barman integration (shown in the PostgreSQL section above). This gives you point-in-time recovery to any second within your retention window.
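
Point-in-time recovery needs base backups in addition to archived WAL. The sketch below shows one way to schedule them and then restore into a new cluster. The backup schedule, the restored cluster's name, and the target timestamp are hypothetical, and the object store settings mirror the Cluster definition shown earlier.

```yaml
# Sketch only: schedule, restore cluster name, and targetTime are assumptions.
apiVersion: postgresql.cnpg.io/v1
kind: ScheduledBackup
metadata:
  name: keycloak-db-daily
  namespace: keycloak
spec:
  schedule: "0 0 2 * * *"        # six fields, seconds first: every day at 02:00:00
  backupOwnerReference: self
  cluster:
    name: keycloak-db
---
# Restore into a NEW cluster from the archived backups and WAL
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: keycloak-db-restore
  namespace: keycloak
spec:
  instances: 3
  storage:
    size: 20Gi
    storageClass: gp3-encrypted
  bootstrap:
    recovery:
      source: keycloak-db
      recoveryTarget:
        targetTime: "2026-03-22 08:00:00.00000+00"   # hypothetical recovery point
  externalClusters:
    - name: keycloak-db
      barmanObjectStore:
        destinationPath: "s3://keycloak-backups/postgresql"
        s3Credentials:
          accessKeyId:
            name: s3-creds
            key: ACCESS_KEY_ID
          secretAccessKey:
            name: s3-creds
            key: SECRET_ACCESS_KEY
```

After a restore, Keycloak (or the pooler) has to be pointed at the new cluster's Service, so it is worth rehearsing the full procedure before you need it.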

Realm export: Schedule periodic realm exports via Keycloak's admin CLI. These JSON exports capture realm configuration, client definitions, and role mappings, which is everything needed to rebuild from scratch.

# Export all realms (run as a CronJob in your cluster)
/opt/keycloak/bin/kc.sh export \
  --dir /tmp/keycloak-export \
  --users realm_file

# Upload to object storage
aws s3 sync /tmp/keycloak-export s3://keycloak-backups/realm-exports/$(date +%F)/

Cluster state: Back up your Kubernetes manifests (or Helm values) in version control. If you lose the cluster entirely, you need to be able to recreate the infrastructure, restore the database, and redeploy.

Recovery time targets:

| Scenario | RTO | RPO | Strategy |
|---|---|---|---|
| Single pod failure | < 30s | 0 | Kubernetes auto-restart + session replication |
| Full node failure | < 2min | 0 | Pod rescheduling + persistent sessions |
| Database primary failure | < 60s | 0 | Automatic failover to replica |
| Complete cluster loss | < 1hr | < 5min | Restore from backup to new cluster |

Performance Tuning Tips

JVM Settings

Keycloak 26.x runs on Quarkus, which is significantly lighter than the old WildFly distribution. Key JVM tuning parameters:

- name: JAVA_OPTS_KC_HEAP
  value: "-XX:MaxRAMPercentage=70.0"
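# If JAVA_OPTS_APPEND is already set for JGroups discovery, keep everything in this one value
- name: JAVA_OPTS_APPEND
  value: >-
    -XX:+UseG1GC
    -XX:MaxGCPauseMillis=200
    -XX:+UseStringDeduplication
    -Djgroups.dns.query=keycloak-headless.keycloak.svc.cluster.local
    -Djgroups.thread_pool.min_threads=20
    -Djgroups.thread_pool.max_threads=200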

Connection Pool Sizing

The default connection pool (min 0, max 100) is often too large. Right-size it:

- name: KC_DB_POOL_MIN_SIZE
  value: "10"
- name: KC_DB_POOL_MAX_SIZE
  value: "50"
- name: KC_DB_POOL_INITIAL_SIZE
  value: "10"

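A good rule of thumb from PostgreSQL tuning guidance: the database server performs best with roughly (2 * CPU cores) + effective disk spindles active connections in total. That figure is a budget for the whole database, so divide it across pods rather than applying it per pod. Then check two limits: pod count times KC_DB_POOL_MAX_SIZE must fit within PgBouncer's max_client_conn, and PgBouncer's server-side pools must stay well below PostgreSQL's max_connections. With the values in this guide, 10 pods (the HPA maximum) at 50 connections each comes to 500 client connections, which exceeds the max_client_conn of 200 configured earlier, so lower the pool size or raise the PgBouncer limit before scaling out.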
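
As a worked example, here is one set of pool values that fits the limits used elsewhere in this guide. The numbers are illustrative assumptions chosen to make the arithmetic come out, not measured recommendations.

```yaml
# Illustrative values only; the arithmetic in the comments is the point.
# Client side: maxReplicas (10) x KC_DB_POOL_MAX_SIZE (15) = 150 connections,
#              which fits under a single PgBouncer instance's max_client_conn (200).
# Server side: 2 PgBouncer instances x default_pool_size (25) = 50 connections,
#              well below PostgreSQL's max_connections (200).
- name: KC_DB_POOL_INITIAL_SIZE
  value: "5"
- name: KC_DB_POOL_MIN_SIZE
  value: "5"
- name: KC_DB_POOL_MAX_SIZE
  value: "15"
```

Rerun the same two checks whenever maxReplicas, the pool size, or the PgBouncer settings change.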

Cache Tuning

For high-traffic deployments, increase the Infinispan cache sizes in a custom cache-ispn.xml:

  • sessions cache: Set max entries based on expected concurrent sessions
  • Enable near-caching for read-heavy caches like realms and authorization
  • Use SYNC replication mode for session caches (consistency over speed)

Cost Estimation

Here is what to budget for different scales, using AWS EKS as a reference:

| Scale | Users | Keycloak Pods | DB Instance | Monthly Cost (est.) |
|---|---|---|---|---|
| Starter | < 10K MAU | 3x (1 vCPU, 2GB) | db.t4g.medium (HA) | $350-500 |
| Growth | 10K-100K MAU | 5x (2 vCPU, 4GB) | db.r6g.large (HA) | $800-1,200 |
| Enterprise | 100K-1M MAU | 8-10x (2 vCPU, 4GB) | db.r6g.xlarge (HA) | $2,000-3,500 |
| Large Scale | 1M+ MAU | 10-15x (4 vCPU, 8GB) + external Infinispan | db.r6g.2xlarge (HA) | $5,000-8,000 |

These estimates include compute, storage, and data transfer. They do not include engineering time for setup, maintenance, monitoring configuration, and on-call coverage, which is often the largest cost.

Skip the Infrastructure Work

Building and maintaining a production Keycloak cluster is a serious engineering investment. The Kubernetes manifests in this guide are a starting point, but production reality includes certificate rotation, secret management, log aggregation, upgrade testing, capacity planning, and incident response.

KeycloakPro's managed HA cluster service handles all of this. We deploy and operate production-grade Keycloak clusters with 99.99% SLA, automated backups, zero-downtime upgrades, and 24/7 monitoring, so your team can focus on building your product instead of operating identity infrastructure.

Explore KeycloakPro's HA Cluster plans or talk to our team about your deployment requirements.
