What Keycloak version do you use?

We deploy Keycloak 26.x (latest stable) for all new projects. For existing deployments, we offer migration paths from Keycloak 18+ (including the legacy WildFly-based versions) to the modern Quarkus-based distribution.

How long does a typical project take?

Simple implementations (passkeys, theming) take 5-10 business days. Multi-tenancy and HA clusters typically take 2-3 weeks. Full CIAM overhauls run 4-6 weeks. We provide exact timelines in our fixed-price proposals.

Can you migrate us from Okta, Auth0, or Firebase Auth?

Yes. We have battle-tested migration playbooks for Okta, Auth0, Firebase Auth, AWS Cognito, and Azure AD B2C. We handle user migration, session continuity, and social login re-linking with zero downtime.

What does 'fixed price' mean? Are there hidden costs?

Our quotes are all-inclusive. The price covers discovery, architecture, implementation, testing, deployment, documentation, and 30-day warranty support. Infrastructure costs (cloud hosting) are separate and transparently estimated upfront.

Do you offer ongoing managed services?

Yes. Our Managed Keycloak-as-a-Service starts at $1,800/month and includes 24/7 monitoring, patching, scaling, security updates, and incident response. Think of it as your dedicated Keycloak ops team without the hiring overhead.

Is Keycloak really enterprise-ready?

Absolutely. Keycloak is backed by Red Hat (IBM), powers thousands of enterprise deployments globally, and is the upstream for Red Hat's supported distribution (formerly Red Hat SSO, now the Red Hat build of Keycloak). It supports SAML 2.0, OpenID Connect, OAuth 2.0, and LDAP/Active Directory federation.

What about vendor lock-in?

Zero. Keycloak is 100% open source (Apache 2.0). You own your deployment, your data, and your configuration. Everything we build is yours — full source code, Terraform configs, and documentation included in every project.

Do you work with our existing DevOps team?

Yes. We integrate seamlessly with your existing CI/CD pipelines, cloud infrastructure, and DevOps workflows. We provide Terraform/OpenTofu IaC, Helm charts, and comprehensive runbooks so your team can maintain the deployment independently.

Keycloak on Kubernetes: Production-Ready High Availability Guide

Why Single-Node Keycloak Fails in Production

Running Keycloak on a single instance is fine for development. In production, it becomes a liability. Here is what goes wrong:

Single point of failure: One node crash means every application relying on Keycloak for authentication goes down simultaneously. No login, no token refresh, no admin console.
Session loss: User sessions live in memory by default. A restart wipes them all, forcing every user to re-authenticate at once.
No horizontal scaling: A single JVM can only handle so many concurrent token validations. During traffic spikes, latency climbs and timeouts cascade.
Upgrade downtime: Every Keycloak version upgrade requires stopping the instance, running database migrations, and restarting. That is a maintenance window your users will feel.

For any workload beyond a proof of concept, you need a multi-node deployment with failover, session replication, and zero-downtime upgrades. Kubernetes is the natural platform for this.

Architecture Overview: The HA Keycloak Stack

A production-grade Keycloak deployment on Kubernetes consists of four layers:

Layer	Component	Purpose
Application	Keycloak StatefulSet (3+ pods)	Authentication and authorization
Caching	Infinispan (embedded or external)	Distributed session and cache replication
Database	PostgreSQL HA (primary + replica)	Persistent storage for realms, users, credentials
Observability	Prometheus + Grafana	Metrics, alerting, and dashboards

Traffic enters through a Kubernetes Ingress or Gateway API resource and reaches the Keycloak pods behind a Service. The pods share sessions via Infinispan and persist state to PostgreSQL. Prometheus scrapes metrics from all layers.

Choosing Your Deployment Method

You have three options, each with trade-offs:

Method	Pros	Cons	Best For
Helm Chart (Bitnami or Codecentric)	Fast setup, community-maintained values	Less control over fine-grained config	Teams wanting quick, opinionated deploys
Raw Manifests	Full control, easy to audit	More YAML to maintain	Teams with strong K8s expertise
Keycloak Operator	CRD-based, declarative lifecycle	Newer, smaller community	GitOps workflows, OpenShift environments

For most teams, the Helm chart approach strikes the right balance. The examples in this guide use raw manifests for clarity, but the concepts translate directly to Helm values.

PostgreSQL HA Setup

Keycloak's database is its single most critical dependency. A standalone PostgreSQL instance is just as dangerous as a single Keycloak node.

Primary-Replica with Streaming Replication

Use a PostgreSQL operator like CloudNativePG or Zalando's Postgres Operator to manage replication automatically:

# CloudNativePG Cluster definition
apiVersion: postgresql.cnpg.io/v1
kind: Cluster
metadata:
  name: keycloak-db
  namespace: keycloak
spec:
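  instances: 3
  primaryUpdateStrategy: unsupervised
  # Create the database and owner that KC_DB_URL expects; without this,
  # CloudNativePG bootstraps a database named "app".
  bootstrap:
    initdb:
      database: keycloak
      owner: keycloak
      # Assumes keycloak-db-credentials is a basic-auth Secret whose username is "keycloak"
      secret:
        name: keycloak-db-credentials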
  storage:
    size: 20Gi
    storageClass: gp3-encrypted
  postgresql:
    parameters:
      max_connections: "200"
      shared_buffers: "512MB"
      effective_cache_size: "1536MB"
      work_mem: "8MB"
  backup:
    barmanObjectStore:
      destinationPath: "s3://keycloak-backups/postgresql"
      s3Credentials:
        accessKeyId:
          name: s3-creds
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: s3-creds
          key: SECRET_ACCESS_KEY

Connection Pooling with PgBouncer

Keycloak opens many short-lived database connections. PgBouncer reduces connection overhead dramatically:

apiVersion: postgresql.cnpg.io/v1
kind: Pooler
metadata:
  name: keycloak-db-pooler-rw  # CloudNativePG names the Service after the Pooler
  namespace: keycloak
spec:
  cluster:
    name: keycloak-db
  instances: 2
  type: rw
  pgbouncer:
    poolMode: transaction
    parameters:
      default_pool_size: "25"
      max_client_conn: "200"

Point Keycloak at the PgBouncer service (keycloak-db-pooler-rw) rather than the database directly.
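
To sanity-check the numbers in the Pooler above, the following is a minimal sketch of the connection arithmetic. It assumes client connections spread evenly across the PgBouncer instances and that a single database/user pair is in use; the function and key names are illustrative, not part of any Keycloak or CloudNativePG API.

```python
def pooler_budget(keycloak_pods, pool_max_per_pod, pooler_instances,
                  max_client_conn, default_pool_size, pg_max_connections):
    """Compare connection demand against the PgBouncer and PostgreSQL limits."""
    # Worst case: every Keycloak pod fills its own connection pool.
    client_demand = keycloak_pods * pool_max_per_pod
    # Each PgBouncer instance accepts up to max_client_conn clients.
    client_capacity = pooler_instances * max_client_conn
    # Each instance opens at most default_pool_size server connections
    # for one database/user pair.
    server_connections = pooler_instances * default_pool_size
    return {
        "client_demand": client_demand,
        "client_capacity": client_capacity,
        "clients_fit": client_demand <= client_capacity,
        "server_connections": server_connections,
        "server_fits": server_connections <= pg_max_connections,
    }


# 3 Keycloak pods at the default pool maximum of 100, 2 poolers as configured above
print(pooler_budget(3, 100, 2, 200, 25, 200))
# Same settings after scaling out to 10 pods
print(pooler_budget(10, 100, 2, 200, 25, 200))
```

With three pods the worst-case demand (300) fits under the poolers' combined client limit (400). At ten pods it does not, so either max_client_conn or the per-pod pool maximum needs adjusting before scaling that far.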

Infinispan Distributed Caching

Keycloak uses Infinispan for caching user sessions, authentication sessions, offline tokens, and login failures. In a multi-pod deployment, these caches must be replicated across nodes.

Embedded vs External Infinispan

Embedded (default): Keycloak pods form a JGroups cluster and replicate caches directly between themselves. Simpler to set up, works well for up to 5-8 nodes.

External: A separate Infinispan cluster manages caches. Better for large deployments (10+ Keycloak pods) where you want to scale the cache tier independently.

For most deployments, embedded Infinispan with DNS-based discovery is sufficient:

# Environment variables for JGroups DNS_PING discovery
- name: KC_CACHE
  value: "ispn"
- name: KC_CACHE_STACK
  value: "kubernetes"
- name: JAVA_OPTS_KC_HEAP
  value: "-XX:MaxRAMPercentage=70.0"
- name: jgroups.dns.query
  value: "keycloak-headless.keycloak.svc.cluster.local"

The headless Service (shown in the next section) enables pod-to-pod discovery. JGroups uses DNS_PING to find peers and forms the cache cluster automatically.

Cache Tuning

Configure cache owners based on your fault tolerance requirements:

Cache	Default Owners	Recommendation	Rationale
sessions	1	2	Losing a node should not log users out
authenticationSessions	1	2	In-flight logins survive node failure
offlineSessions	1	2	Offline tokens persist across restarts
loginFailures	1	1	Brute-force counters can afford loss
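
The trade-off behind the owners setting is memory versus fault tolerance. As a rough sketch, assuming entries are spread evenly across nodes (the helper name and the session count are illustrative):

```python
def cache_footprint(total_entries, nodes, owners):
    """Estimate per-node load and fault tolerance for a distributed cache."""
    if not 1 <= owners <= nodes:
        raise ValueError("owners must be between 1 and the number of nodes")
    # Every entry is stored on `owners` nodes, so the cluster holds this many copies.
    copies = total_entries * owners
    entries_per_node = copies / nodes
    # An entry survives as long as at least one of its owners is still alive.
    tolerated_node_losses = owners - 1
    return entries_per_node, tolerated_node_losses


# 30,000 sessions on 3 pods: owners=1 versus owners=2
print(cache_footprint(30_000, nodes=3, owners=1))
print(cache_footprint(30_000, nodes=3, owners=2))
```

Moving from one owner to two doubles the entries each pod holds, in exchange for surviving the loss of one pod without dropping cache entries.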

Kubernetes Deployment Configuration

Here are the headless Service, the StatefulSet with health probes and resource limits, and the HorizontalPodAutoscaler:

apiVersion: v1
kind: Service
metadata:
  name: keycloak-headless
  namespace: keycloak
  labels:
    app: keycloak
spec:
  type: ClusterIP
  clusterIP: None
  selector:
    app: keycloak
  ports:
    - name: http
      port: 8080
    - name: jgroups
      port: 7800
    # Exposed so the ServiceMonitor can scrape /metrics by port name
    - name: management
      port: 9000
---
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: keycloak
  namespace: keycloak
spec:
  serviceName: keycloak-headless
  replicas: 3
  selector:
    matchLabels:
      app: keycloak
  template:
    metadata:
      labels:
        app: keycloak
    spec:
      terminationGracePeriodSeconds: 60
      containers:
        - name: keycloak
          image: quay.io/keycloak/keycloak:26.0.0
          args: ["start"]
          ports:
            - containerPort: 8080
              name: http
            - containerPort: 7800
              name: jgroups
            - containerPort: 9000
              name: management
          env:
            - name: KC_DB
              value: "postgres"
            - name: KC_DB_URL
              value: "jdbc:postgresql://keycloak-db-pooler-rw:5432/keycloak"
            - name: KC_DB_USERNAME
              valueFrom:
                secretKeyRef:
                  name: keycloak-db-credentials
                  key: username
            - name: KC_DB_PASSWORD
              valueFrom:
                secretKeyRef:
                  name: keycloak-db-credentials
                  key: password
            - name: KC_HOSTNAME
              value: "auth.example.com"
            - name: KC_PROXY_HEADERS
              value: "xforwarded"
            - name: KC_HTTP_ENABLED
              value: "true"
            - name: KC_HEALTH_ENABLED
              value: "true"
            - name: KC_METRICS_ENABLED
              value: "true"
            - name: KC_CACHE
              value: "ispn"
            - name: KC_CACHE_STACK
              value: "kubernetes"
            - name: jgroups.dns.query
              value: "keycloak-headless.keycloak.svc.cluster.local"
            - name: JAVA_OPTS_KC_HEAP
              value: "-XX:MaxRAMPercentage=70.0"
          resources:
            requests:
              cpu: "500m"
              memory: "1Gi"
            limits:
              cpu: "2"
              memory: "2Gi"
          readinessProbe:
            httpGet:
              path: /health/ready
              port: 9000
            initialDelaySeconds: 30
            periodSeconds: 10
            failureThreshold: 3
          livenessProbe:
            httpGet:
              path: /health/live
              port: 9000
            initialDelaySeconds: 60
            periodSeconds: 15
            failureThreshold: 5
          startupProbe:
            httpGet:
              path: /health/started
              port: 9000
            initialDelaySeconds: 15
            periodSeconds: 5
            failureThreshold: 30
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: keycloak-hpa
  namespace: keycloak
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: keycloak
  minReplicas: 3
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80

Key decisions in this configuration:

StatefulSet over Deployment ensures stable network identities for JGroups cluster formation.
Separate management port (9000) for health checks keeps probe traffic off the main HTTP port.
startupProbe with generous thresholds prevents Kubernetes from killing pods during slow initial startup or database migration.
terminationGracePeriodSeconds: 60 gives in-flight requests time to complete during scale-down.
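
To get a feel for how the HorizontalPodAutoscaler above reacts to load, here is a simplified sketch of the scaling formula from the Kubernetes documentation: desired = ceil(current replicas * current utilization / target utilization), taking the largest result across metrics. It deliberately ignores the controller's tolerance band and stabilization window, and the function name is illustrative.

```python
from math import ceil


def desired_replicas(current_replicas, metrics, min_replicas=3, max_replicas=10):
    """Approximate the HPA decision for (current %, target %) utilization pairs."""
    proposals = [
        ceil(current_replicas * current / target)
        for current, target in metrics
    ]
    # With several metrics, the HPA picks the largest proposed replica count.
    raw = max(proposals)
    # The result is clamped to the minReplicas/maxReplicas bounds.
    return max(min_replicas, min(max_replicas, raw))


# CPU at 95% (target 70), memory at 60% (target 80), currently 3 pods
print(desired_replicas(3, [(95, 70), (60, 80)]))
# Quiet period: both metrics well under target
print(desired_replicas(3, [(40, 70), (50, 80)]))
```

The first call proposes 5 pods because CPU is the binding metric; the second stays at the floor of 3 even though the raw formula suggests fewer.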

Zero-Downtime Upgrades

Keycloak upgrades involve database schema migrations that must run exactly once. Here is how to handle them safely:

Rolling Update Strategy

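Keep the StatefulSet's default RollingUpdate strategy: Kubernetes replaces pods one at a time, highest ordinal first, and waits for each replacement to become Ready before moving on. (maxSurge and maxUnavailable: 0 are Deployment settings; a StatefulSet has no surge option.)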
Use an init container or a separate Job to run database migrations before the new pods start serving traffic.
Enable sticky sessions on your Ingress to prevent mid-authentication flows from being disrupted.

# Ingress annotation for sticky sessions (nginx)
metadata:
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "KC_ROUTE"
    nginx.ingress.kubernetes.io/session-cookie-max-age: "600"

Blue-Green Deployment

For major version upgrades (e.g., 25.x to 26.x), a blue-green approach is safer:

Deploy the new version as a separate StatefulSet (keycloak-green) pointing to a migrated copy of the database.
Run smoke tests against the green deployment.
Switch the Ingress backend from blue to green.
Monitor for 30 minutes, then decommission blue.

This approach adds infrastructure cost but eliminates risk for upgrades with breaking changes.
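
A smoke test for the green deployment can start with something as small as checking the realm's OpenID Connect discovery document. The sketch below assumes the green pods are reachable at a URL you supply and that the issuer should match the public hostname; the helper names and the example URLs are hypothetical.

```python
import json
from urllib.request import urlopen

REQUIRED_KEYS = ("issuer", "authorization_endpoint", "token_endpoint", "jwks_uri")


def discovery_problems(doc, expected_issuer):
    """Return a list of problems found in an OIDC discovery document."""
    problems = [f"missing {key}" for key in REQUIRED_KEYS if key not in doc]
    if doc.get("issuer") != expected_issuer:
        problems.append(
            f"issuer is {doc.get('issuer')!r}, expected {expected_issuer!r}"
        )
    return problems


def fetch_discovery(base_url, realm, timeout=5):
    """Download the discovery document for one realm."""
    url = f"{base_url}/realms/{realm}/.well-known/openid-configuration"
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Example (not executed here; the green URL is a placeholder):
# doc = fetch_discovery("https://keycloak-green.example.internal", "master")
# print(discovery_problems(doc, "https://auth.example.com/realms/master"))
```

An empty list means the endpoint responded and advertises the expected issuer. A real smoke suite would go on to exercise a login and a token refresh.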

Monitoring and Alerting

Keycloak exposes Prometheus metrics on the /metrics endpoint of the management port (9000) when KC_METRICS_ENABLED=true. Here are the critical metrics to watch:

Metric	Alert Threshold	Meaning
`keycloak_request_duration_seconds` (p99)	> 2s	Authentication latency too high
`keycloak_request_errors_total`	> 1% of total requests	Elevated error rate
`jvm_memory_used_bytes` / `jvm_memory_max_bytes`	> 85%	Memory pressure, risk of OOM
`jvm_gc_pause_seconds` (p99)	> 500ms	GC pauses affecting response times
`vendor_cache_manager_default_cache_keycloak_sessions_statistics_stores`	dropping	Session store failures
`pg_stat_activity_count`	> 80% of max_connections	Database connection exhaustion
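
To make the thresholds concrete, here is a small sketch that applies them to a single snapshot of values. The snapshot keys are made-up names for numbers you would derive from the metrics in the table; the "dropping" condition on session stores is left out because it needs a time series rather than one sample.

```python
def firing_alerts(s):
    """Return the alert conditions from the table that a snapshot breaches."""
    alerts = []
    if s["request_p99_seconds"] > 2.0:
        alerts.append("latency")
    # Guard against division by zero when there has been no traffic.
    if s["requests"] and s["request_errors"] / s["requests"] > 0.01:
        alerts.append("error_rate")
    if s["heap_used_bytes"] / s["heap_max_bytes"] > 0.85:
        alerts.append("memory")
    if s["gc_pause_p99_seconds"] > 0.5:
        alerts.append("gc_pause")
    if s["db_connections"] / s["db_max_connections"] > 0.80:
        alerts.append("db_connections")
    return alerts


snapshot = {
    "request_p99_seconds": 0.8,
    "requests": 120_000,
    "request_errors": 1_800,          # 1.5% of requests
    "heap_used_bytes": 1_200 * 1024**2,
    "heap_max_bytes": 1_434 * 1024**2,
    "gc_pause_p99_seconds": 0.12,
    "db_connections": 170,            # 85% of max_connections
    "db_max_connections": 200,
}
print(firing_alerts(snapshot))
```

For this snapshot the error rate and database connection alerts fire, while latency, memory, and GC stay under their thresholds.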

Grafana Dashboard

Use the community Keycloak Grafana dashboard (ID: 19659) as a starting point. Add panels for:

Token issuance rate (tokens/sec by grant type)
Active sessions per realm
Cache hit ratio (should be above 95%)
Database query latency (p50, p95, p99)

ServiceMonitor for Prometheus Operator

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: keycloak-monitor
  namespace: keycloak
spec:
  selector:
    matchLabels:
      app: keycloak
  endpoints:
    - port: management
      path: /metrics
      interval: 15s

Disaster Recovery and Backup Strategy

A robust DR plan covers three layers:

Database backups: Use continuous WAL archiving with CloudNativePG's built-in Barman integration (shown in the PostgreSQL section above). This gives you point-in-time recovery to any second within your retention window.

Realm export: Schedule periodic realm exports with the kc.sh export command. These JSON exports capture realm configuration, client definitions, and role mappings, which is everything needed to rebuild from scratch.

# Export all realms (run as a CronJob in your cluster)
/opt/keycloak/bin/kc.sh export \
  --dir /tmp/keycloak-export \
  --users realm_file

# Upload to object storage
aws s3 sync /tmp/keycloak-export s3://keycloak-backups/realm-exports/$(date +%F)/

Cluster state: Back up your Kubernetes manifests (or Helm values) in version control. If you lose the cluster entirely, you need to be able to recreate the infrastructure, restore the database, and redeploy.

Recovery time targets:

Scenario	RTO	RPO	Strategy
Single pod failure	< 30s	0	Kubernetes auto-restart + session replication
Full node failure	< 2min	0	Pod rescheduling + persistent sessions
Database primary failure	< 60s	0	Automatic failover to replica
Complete cluster loss	< 1hr	< 5min	Restore from backup to new cluster

Performance Tuning Tips

JVM Settings

Keycloak 26.x runs on Quarkus, which is significantly lighter than the old WildFly distribution. Key JVM tuning parameters:

- name: JAVA_OPTS_KC_HEAP
  value: "-XX:MaxRAMPercentage=70.0"
- name: JAVA_OPTS_APPEND
  value: >-
    -XX:+UseG1GC
    -XX:MaxGCPauseMillis=200
    -XX:+UseStringDeduplication
    -Djgroups.thread_pool.min_threads=20
    -Djgroups.thread_pool.max_threads=200
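
It helps to translate -XX:MaxRAMPercentage into absolute numbers. The sketch below works out the heap ceiling for the 2Gi container limit used earlier in this guide, assuming the JVM sizes its heap from the container's memory limit; the function name is illustrative.

```python
GIB = 1024**3
MIB = 1024**2


def heap_budget(container_limit_bytes, max_ram_percentage):
    """Split a container memory limit into max heap and everything else."""
    heap = container_limit_bytes * max_ram_percentage / 100
    # Metaspace, thread stacks, direct buffers, and JGroups buffers live here.
    non_heap = container_limit_bytes - heap
    return heap, non_heap


heap, non_heap = heap_budget(2 * GIB, 70.0)
print(f"max heap: {heap / MIB:.1f} MiB, left for non-heap: {non_heap / MIB:.1f} MiB")
```

At 70% of a 2Gi limit the heap tops out at roughly 1.4 GiB, leaving about 600 MiB for everything outside the heap. If pods get OOM-killed while the heap looks healthy, that remainder is the place to look.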

Connection Pool Sizing

The default connection pool (min 0, max 100) is often too large. Right-size it:

- name: KC_DB_POOL_MIN_SIZE
  value: "10"
- name: KC_DB_POOL_MAX_SIZE
  value: "50"
- name: KC_DB_POOL_INITIAL_SIZE
  value: "10"

A good rule of thumb: set the per-pod maximum (KC_DB_POOL_MAX_SIZE) to (2 * CPU cores) + available disk spindles, multiply by the pod count, and make sure the total stays well below PostgreSQL's max_connections.
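
Worked through with this guide's numbers, the rule looks like the sketch below. It assumes one disk spindle and reads "well below" as 80% of max_connections; both are illustrative choices, as are the function names. It also compares against max_connections directly, as the rule does; with PgBouncer in the path, the pooler's own limits apply first.

```python
def rule_of_thumb_pool_size(cpu_cores, disk_spindles=1):
    """Per-pod pool maximum: (2 * CPU cores) + disk spindles."""
    return 2 * cpu_cores + disk_spindles


def fits_database(per_pod_max, pods, pg_max_connections, headroom=0.8):
    """Check that all pods together stay under a share of max_connections."""
    return per_pod_max * pods <= pg_max_connections * headroom


per_pod = rule_of_thumb_pool_size(cpu_cores=2)  # CPU limit from the StatefulSet
print(per_pod, fits_database(per_pod, pods=10, pg_max_connections=200))

# The configured maximum of 50 per pod, at 3 pods and at the HPA ceiling of 10
print(fits_database(50, pods=3, pg_max_connections=200))
print(fits_database(50, pods=10, pg_max_connections=200))
```

The rule of thumb gives 5 connections per pod, which fits easily even at 10 pods. A maximum of 50 per pod fits at 3 pods (150 of 160 allowed) but not at 10, so revisit the pool size whenever you raise maxReplicas.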

Cache Tuning

For high-traffic deployments, increase the Infinispan cache sizes in a custom cache-ispn.xml:

sessions cache: Set max entries based on expected concurrent sessions
Enable near-caching for read-heavy caches like realms and authorization
Use SYNC replication mode for session caches (consistency over speed)

Cost Estimation

Here is what to budget for different scales, using AWS EKS as a reference:

Scale	Users	Keycloak Pods	DB Instance	Monthly Cost (est.)
Starter	< 10K MAU	3x (1 vCPU, 2GB)	db.t4g.medium (HA)	$350-500
Growth	10K-100K MAU	5x (2 vCPU, 4GB)	db.r6g.large (HA)	$800-1,200
Enterprise	100K-1M MAU	8-10x (2 vCPU, 4GB)	db.r6g.xlarge (HA)	$2,000-3,500
Large Scale	1M+ MAU	10-15x (4 vCPU, 8GB) + external Infinispan	db.r6g.2xlarge (HA)	$5,000-8,000

These estimates include compute, storage, and data transfer. They do not include engineering time for setup, maintenance, monitoring configuration, and on-call coverage, which is often the largest cost.

Skip the Infrastructure Work

Building and maintaining a production Keycloak cluster is a serious engineering investment. The Kubernetes manifests in this guide are a starting point, but production reality includes certificate rotation, secret management, log aggregation, upgrade testing, capacity planning, and incident response.

KeycloakPro's managed HA cluster service handles all of this. We deploy and operate production-grade Keycloak clusters with 99.99% SLA, automated backups, zero-downtime upgrades, and 24/7 monitoring, so your team can focus on building your product instead of operating identity infrastructure.

Explore KeycloakPro's HA Cluster plans or talk to our team about your deployment requirements.

