# Kubernetes Components and Health Monitoring

> Learn about the Kubernetes pods and jobs in self-hosted Opal and the recommended health monitoring for each.

This guide provides high-level recommendations for self-hosted Opal customers on the various Kubernetes pods and jobs, their schedules, and recommended health monitoring practices.

## Overview

Opal consists of several types of Kubernetes workloads:

* **Deployments**: Long-running pods that handle API requests and background task processing
* **CronJobs**: Scheduled jobs that run periodically to sync data, clean up resources, and process background tasks
* **Jobs**: One-time jobs that run during upgrades

## Long-Running Deployments

These pods run continuously. We recommend checking that each pod stays in the "Running" state and watching for unexpected restarts. Specific deployments and recommendations follow.

### 1. Web Backend `opal-web`

**Purpose**: Main API server handling HTTP requests from the frontend and external integrations.

**Health Monitoring**:

* [Ensure](https://kubernetes.io/docs/tutorials/kubernetes-basics/explore/explore-intro/#check-application-configuration) pods are running and ready (not in [CrashLoopBackOff, Error, or Pending](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#container-restarts) states)
* Monitor pod restart count

**Recommended Alerts**:

* Pods not running or ready
* Pod restart count increasing
* Pods stuck in error or pending states
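
The checks above can be sketched as a small script. This is a minimal illustration, not Opal tooling: it assumes you have already fetched the `opal-web` pods as JSON (for example with `kubectl get pods -o json`), and the pod and container names in the sample are hypothetical. It relies only on standard pod status fields (`status.phase`, `containerStatuses[].ready`, and `state.waiting.reason`).

```python
import json


def unhealthy_pods(pod_list: dict) -> list[tuple[str, str]]:
    """Return (pod name, reason) pairs for pods that are not running and ready."""
    problems = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        phase = status.get("phase", "Unknown")
        if phase != "Running":
            # Covers Pending, Failed, and Unknown phases.
            problems.append((name, f"phase is {phase}"))
            continue
        for cs in status.get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting")
            if waiting:
                # CrashLoopBackOff is reported as a container waiting reason.
                reason = waiting.get("reason", "unknown")
                problems.append((name, f"{cs['name']} waiting: {reason}"))
            elif not cs.get("ready", False):
                problems.append((name, f"{cs['name']} not ready"))
    return problems


# Hypothetical sample shaped like `kubectl get pods -o json` output.
sample = json.loads("""
{"items": [
  {"metadata": {"name": "opal-web-a"},
   "status": {"phase": "Running", "containerStatuses": [
     {"name": "web", "ready": true, "restartCount": 0,
      "state": {"running": {}}}]}},
  {"metadata": {"name": "opal-web-b"},
   "status": {"phase": "Running", "containerStatuses": [
     {"name": "web", "ready": false, "restartCount": 7,
      "state": {"waiting": {"reason": "CrashLoopBackOff"}}}]}},
  {"metadata": {"name": "opal-web-c"}, "status": {"phase": "Pending"}}
]}
""")

for pod_name, reason in unhealthy_pods(sample):
    print(f"{pod_name}: {reason}")
```

Pass only Deployment pods to this function; pods created by completed CronJobs report a `Succeeded` phase and would otherwise be flagged.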

### 2. Event Consumers `opal-web-event-consumers`

**Purpose:** Processes events from external systems and internal event streams.

**Health Monitoring**:

* Ensure pods are running and ready

**Recommended Alerts:**

* Pods not running or ready
* Pod restart count increasing

### 3. Task Workers `opal-web-task-workers`

**Purpose:** Processes background tasks from specialized queues. Different task worker types handle different operations such as general async work, event streaming, sync operations, and propagation tasks.

**Health Monitoring:**

* Ensure all task worker pods are running and ready
* Monitor restart counts across all task worker deployments

**Recommended Alerts:**

* Any task worker pods not running or ready
* Pod restart count increasing
* Task worker pods stuck in error or pending states

<Info>
  Some task workers have longer termination grace periods to allow long-running
  tasks to complete.
</Info>

## Scheduled CronJobs

These jobs run on a fixed schedule to sync data and clean up resources. In general, we recommend checking that each job completes successfully within its expected timeframe (typically 2x the schedule interval) and setting up alerts for job failures. Specific CronJobs follow.

### 1. Sync Jobs

**Purpose:** Performs sync of user and resource data from connected systems (IDPs, cloud providers, HR systems).

**Schedules:**

* Regular sync: Every 4 hours `(0 */4 * * *)`
* Daily sync: Daily at 9:30 AM UTC `(30 9 * * *)`
* High-frequency sync: Every 5 minutes `(*/5 * * * *)`

**Starting Deadline:** 120-180 seconds

**Health Monitoring:**

* [Ensure](https://kubernetes.io/docs/tasks/job/automated-tasks-with-cron-jobs/) regular sync completes successfully within 8 hours (2x schedule interval)
* Ensure daily sync completes successfully within 48 hours (2x schedule interval)
* Ensure high-frequency sync completes successfully within 20 minutes (4x schedule interval)

**Recommended Alerts:**

* No successful regular sync completion in the past 8 hours
* No successful daily sync completion in the past 48 hours
* No successful high-frequency sync completion in the past 20 minutes
* Sync job failure (exit code != 0)
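
One way to implement a "no successful completion" check is to compare each CronJob's last success time against its window. The sketch below is an assumption-laden example rather than Opal's own alerting: it reads the `status.lastSuccessfulTime` field that recent Kubernetes versions report on CronJobs, takes its windows from the Health Monitoring list in this section, and uses hypothetical dictionary keys in place of the real CronJob names.

```python
from datetime import datetime, timedelta, timezone

# Windows taken from the Health Monitoring list for sync jobs.
# The keys are hypothetical labels, not actual CronJob names.
SYNC_WINDOWS = {
    "regular-sync": timedelta(hours=8),
    "daily-sync": timedelta(hours=48),
    "high-frequency-sync": timedelta(minutes=20),
}


def parse_k8s_time(value: str) -> datetime:
    """Parse a Kubernetes timestamp such as 2024-01-01T09:30:00Z."""
    parsed = datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
    return parsed.replace(tzinfo=timezone.utc)


def is_stale(cronjob: dict, window: timedelta, now: datetime) -> bool:
    """True when the CronJob has no success recorded inside the window."""
    last = cronjob.get("status", {}).get("lastSuccessfulTime")
    if last is None:
        # Never succeeded, or the cluster does not report the field.
        return True
    return now - parse_k8s_time(last) > window


# Hypothetical CronJob object: last success was 9 hours before `now`.
now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
cronjob = {"status": {"lastSuccessfulTime": "2024-01-01T03:00:00Z"}}
if is_stale(cronjob, SYNC_WINDOWS["regular-sync"], now):
    print("ALERT: regular sync has not succeeded within 8 hours")
```

Passing `now` explicitly keeps the check deterministic and easy to test; in a real monitor you would pass `datetime.now(timezone.utc)`.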

### 2. Event Streaming Jobs

**Purpose:** Manages event stream operations including publishing events to external systems, requeuing failed messages, deactivating unused streams, and cleaning up old messages.

**Schedules:**

* Event stream producer: Every 1 minute `(*/1 * * * *)`
* Event stream re-queuer: Every 15 minutes `(*/15 * * * *)`
* Event stream deactivator: Every 30 minutes `(*/30 * * * *)`
* Event stream messages cleanup: Daily at 12:00 PM UTC `(0 12 * * *)`
* Event stream notifier: Daily at 1:00 PM UTC `(0 13 * * *)`

**Starting Deadline:** 120 seconds

**Health Monitoring:**

* Ensure event stream producer completes successfully within 2 minutes (2x schedule interval)
* Ensure event stream re-queuer completes successfully within 30 minutes (2x schedule interval)
* Ensure event stream deactivator completes successfully within 60 minutes (2x schedule interval)
* Ensure event stream cleanup and notifier complete successfully within 48 hours (2x schedule interval)

**Recommended Alerts:**

* No successful event stream producer completion in the past 2 minutes
* No successful event stream re-queuer completion in the past 30 minutes
* No successful event stream deactivator completion in the past 60 minutes
* No successful event stream cleanup or notifier completion in the past 48 hours
* Event streaming job failure (exit code != 0)

### 3. Recommendations `recommendations-subscores`

**Purpose:** Calculates and updates recommendation subscores for resources and groups to support access recommendations and risk analysis.

**Schedule:** Every 5 minutes `(*/5 * * * *)`

**Starting Deadline:** 120 seconds

**Health Monitoring:**

* Ensure job completes successfully within 10 minutes (2x schedule interval)

**Recommended Alerts:**

* No successful completion in the past 10 minutes
* Job failure (exit code != 0)

### 4. Metrics Collection `metrics-collector`

**Purpose:** Collects and aggregates metrics for reporting and analytics.

**Schedule:** Daily at 6:30 AM UTC `(30 6 * * *)`

**Starting Deadline:** 120 seconds

**Health Monitoring:**

* Ensure job completes successfully within 48 hours (2x schedule interval)

**Recommended Alerts:**

* No successful completion in the past 48 hours
* Job failure (exit code != 0)

### 5. Scheduled Tasks Cleanup `scheduled-tasks-cleanup`

**Purpose:** Cleans up old completed scheduled tasks from the database.

**Schedule:** Every 5 minutes `(*/5 * * * *)`

**Starting Deadline:** 120 seconds

**Health Monitoring:**

* Ensure job completes successfully within 48 hours (2x schedule interval)

**Recommended Alerts:**

* No successful completion in the past 48 hours
* Job failure (exit code != 0)

## One-Time Jobs

These jobs execute once to perform critical setup tasks. You should monitor that these jobs complete without errors during the upgrade window.

### Oneoff `oneoff`

**Purpose:** Runs one-time database migrations and setup tasks during Helm upgrades.

**Trigger:** Runs automatically as a post-upgrade [Helm hook](https://helm.sh/docs/topics/charts_hooks/)

**Health Monitoring:**

* Monitor job completion status
* Check for job failures during upgrades

**Recommended Alerts:**

* Job failure during upgrades (exit code != 0)
* Job running longer than expected
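
The two alerts above can be derived from a Job's status object. The following is a rough sketch under stated assumptions: the Job JSON comes from something like `kubectl get job -o json`, and `max_runtime` is a threshold you choose yourself, since this guide does not state an expected duration for the `oneoff` job.

```python
from datetime import datetime, timedelta, timezone


def job_state(job: dict, now: datetime, max_runtime: timedelta) -> str:
    """Classify a Job as failed, complete, overdue, running, or not started."""
    status = job.get("status", {})
    for cond in status.get("conditions", []):
        if cond.get("status") != "True":
            continue
        if cond.get("type") == "Failed":
            return "failed"
        if cond.get("type") == "Complete":
            return "complete"
    start = status.get("startTime")
    if start is None:
        return "not started"
    started = datetime.strptime(start, "%Y-%m-%dT%H:%M:%SZ")
    started = started.replace(tzinfo=timezone.utc)
    # max_runtime is an assumed, operator-chosen limit.
    if now - started > max_runtime:
        return "overdue"
    return "running"


# Hypothetical Job that started 30 minutes before `now` and has not finished.
now = datetime(2024, 1, 1, 12, 30, tzinfo=timezone.utc)
job = {"status": {"startTime": "2024-01-01T12:00:00Z"}}
print(job_state(job, now, max_runtime=timedelta(minutes=15)))
```

Alert on `failed` and `overdue`; the other states need no action during a normal upgrade.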

## General Health Monitoring Recommendations

### For All Deployments

1. **[Pod Status](https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-phase):** Monitor for pods in `CrashLoopBackOff`, `Error`, or `Pending` states
2. **[Resource Usage](https://kubernetes.io/docs/tasks/debug/debug-cluster/resource-metrics-pipeline/#measuring-resource-usage):** Alert when CPU usage exceeds 80% of the request or memory usage exceeds 80% of the limit
3. **Restart Count:** Alert if a pod restarts more than 3 times in an hour
4. **Availability:** Ensure pods are running as expected
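
The restart threshold in item 3 needs a rate, not just the current counter. As a minimal sketch, assuming your monitoring system periodically records each pod's `restartCount` (the sampling itself is not shown, and the sample values are hypothetical), the restarts within the last hour can be computed like this:

```python
from datetime import datetime, timedelta


def restarts_in_window(samples, now, window=timedelta(hours=1)):
    """Count restarts inside the window.

    samples is a time-ordered list of (timestamp, restartCount) readings
    for a single pod.
    """
    counts = [count for ts, count in samples if now - window <= ts <= now]
    if len(counts) < 2:
        return 0
    # A recreated pod starts again at zero, so never report a negative delta.
    return max(0, counts[-1] - counts[0])


# Hypothetical readings taken every 30 minutes.
samples = [
    (datetime(2024, 1, 1, 10, 30), 1),
    (datetime(2024, 1, 1, 11, 0), 2),
    (datetime(2024, 1, 1, 11, 30), 4),
    (datetime(2024, 1, 1, 12, 0), 6),
]
now = datetime(2024, 1, 1, 12, 0)
recent = restarts_in_window(samples, now)
if recent > 3:
    print(f"ALERT: {recent} restarts in the last hour")
```

Most metrics backends offer an equivalent "increase over a time range" function; the script only illustrates the arithmetic behind the threshold.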

### For All CronJobs

1. **Execution Status:** Ensure jobs have completed successfully within 2x-4x their schedule interval
2. **Failure Monitoring:** Alert on job failures (exit code != 0)
3. **Concurrency:** All CronJobs use the `Forbid` concurrency policy, so a new run does not start while the previous one is still running. Ensure each run finishes before the next scheduled start.
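
To spot runs that overlap their next scheduled start, one option is to inspect the CronJob object itself. This sketch assumes CronJob JSON from `kubectl get cronjob -o json` and that you supply the schedule interval yourself; it reads the standard `spec.concurrencyPolicy`, `status.active`, and `status.lastScheduleTime` fields and is not an official Opal check.

```python
from datetime import datetime, timedelta, timezone


def concurrency_findings(cronjob: dict, interval: timedelta, now: datetime) -> list[str]:
    """Report a non-Forbid policy or a run that overlaps the next start."""
    findings = []
    spec = cronjob.get("spec", {})
    status = cronjob.get("status", {})
    # Kubernetes treats an unset policy as Allow.
    policy = spec.get("concurrencyPolicy", "Allow")
    if policy != "Forbid":
        findings.append(f"concurrencyPolicy is {policy}, expected Forbid")
    last_schedule = status.get("lastScheduleTime")
    if status.get("active") and last_schedule:
        scheduled = datetime.strptime(last_schedule, "%Y-%m-%dT%H:%M:%SZ")
        scheduled = scheduled.replace(tzinfo=timezone.utc)
        if now - scheduled >= interval:
            findings.append("previous run still active past the next scheduled start")
    return findings


# Hypothetical 5-minute CronJob whose run has been active for 7 minutes.
cronjob = {
    "spec": {"schedule": "*/5 * * * *", "concurrencyPolicy": "Forbid"},
    "status": {
        "active": [{"name": "example-run"}],
        "lastScheduleTime": "2024-01-01T12:00:00Z",
    },
}
now = datetime(2024, 1, 1, 12, 7, tzinfo=timezone.utc)
for finding in concurrency_findings(cronjob, timedelta(minutes=5), now):
    print(finding)
```

An occasional overlap may be harmless for short-interval jobs, so consider alerting only when the finding persists across several checks.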

### Database Migrations

Monitor for database migration failures or delays during pod startup and upgrades, as these may indicate database performance issues.

### Common Issues to Watch For

1. **Database Connection Issues:** All components depend on PostgreSQL. Monitor database connectivity.
2. **Redis Connection Issues:** Task workers and event consumers depend on Redis. Monitor Redis connectivity.
3. **Resource Constraints:** High memory/CPU usage may cause pods to be evicted or OOMKilled.
4. **Network Issues:** Pods need network access to external systems (IDPs, cloud providers) for syncing.
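
Resource-constraint failures from item 3 leave traces in pod status. As an illustrative sketch (assuming pod JSON from `kubectl get pods -o json`; the pod and container names are hypothetical), the function below looks for containers whose last termination reason was `OOMKilled` and for pods whose status reason is `Evicted`:

```python
def resource_pressure_events(pod_list: dict) -> list[tuple[str, str]]:
    """Return (pod name, detail) pairs for OOM kills and evictions."""
    events = []
    for pod in pod_list.get("items", []):
        name = pod["metadata"]["name"]
        status = pod.get("status", {})
        if status.get("reason") == "Evicted":
            events.append((name, "pod evicted"))
        for cs in status.get("containerStatuses", []):
            # lastState holds the previous termination after a restart.
            terminated = cs.get("lastState", {}).get("terminated") or {}
            if terminated.get("reason") == "OOMKilled":
                events.append((name, f"{cs['name']} was OOMKilled"))
    return events


# Hypothetical sample data.
pods = {"items": [
    {"metadata": {"name": "task-worker-a"},
     "status": {"phase": "Running", "containerStatuses": [
         {"name": "worker", "restartCount": 2,
          "lastState": {"terminated": {"reason": "OOMKilled", "exitCode": 137}}}]}},
    {"metadata": {"name": "task-worker-b"},
     "status": {"phase": "Failed", "reason": "Evicted"}},
    {"metadata": {"name": "task-worker-c"},
     "status": {"phase": "Running", "containerStatuses": [
         {"name": "worker", "restartCount": 0, "lastState": {}}]}},
]}

for pod_name, detail in resource_pressure_events(pods):
    print(f"{pod_name}: {detail}")
```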

## Monitoring Best Practices

1. **Set up alerts** for all critical components (web backend, sync jobs)

2. **Monitor logs** for error patterns and exceptions

3. **[Track metrics](https://kubernetes.io/docs/concepts/cluster-administration/system-metrics/)** for processing times and error rates

4. **Set up dashboards** for:

   * Pod health and resource usage
   * CronJob execution status and duration
   * Application metrics (request rates, error rates, latency)

5. **[Use Kubernetes events](https://kubernetes.io/docs/tasks/debug/debug-cluster/#looking-at-events)** to monitor for scheduling issues, pod evictions, etc.

## Additional Notes

* Every pod, including CronJob pods, includes database migrations as init containers
* All components use the same Docker image (`opal-web_backend`) with different command arguments
* Health endpoints (`/api/health` and `/api/readiness`) are configured on all deployments via liveness and readiness probes
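
If you want to probe the health endpoints from outside the kubelet, for example through a port-forward, a small script along these lines may help. It is only a sketch: the base URL you pass in is your own (none is defined in this guide), and treating status codes from 200 to 399 as healthy mirrors how Kubernetes HTTP probes judge success rather than any documented Opal contract.

```python
import urllib.error
import urllib.request

HEALTH_PATHS = ("/api/health", "/api/readiness")


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code for url, or 0 if the request fails."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code
    except OSError:
        # Covers connection errors and timeouts (URLError is an OSError).
        return 0


def probe(base_url: str, fetch=http_status) -> dict[str, bool]:
    """Map each health path to True when it answers with a 2xx or 3xx code."""
    base = base_url.rstrip("/")
    return {path: 200 <= fetch(base + path) < 400 for path in HEALTH_PATHS}
```

Call it as `probe("http://<your-opal-host>")`, substituting an address you can reach. The `fetch` parameter exists so the logic can be exercised without a network connection by passing a stub function.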
