Datalore 2024.1 Help

Healthcheck & Monitoring

Healthcheck

Datalore has a built-in HTTP endpoint that you can use to verify whether an instance is online and responsive.

This endpoint is available at the /health path and returns OK if no issues are detected.

If the default Helm charts are used for the deployment, the same endpoint also serves as the Kubernetes liveness probe.
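A readiness check against this endpoint can be scripted, for example in a deployment pipeline. The following is a minimal sketch rather than an official client. It assumes the health path is /health and treats an HTTP 200 response as healthy, because the page does not specify whether OK refers to the response body or the status. The function names and the base URL in the usage note are hypothetical.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import Callable


def is_healthy(base_url: str, path: str = "/health", timeout: float = 5.0) -> bool:
    """Return True if the health endpoint answers with HTTP 200.

    Assumption: the path is /health and a 200 status means "no issues".
    """
    url = base_url.rstrip("/") + path
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, http.client.HTTPException, OSError):
        # Connection refused, timeout, or a non-2xx status: not healthy yet.
        return False


def wait_until_healthy(check: Callable[[], bool], attempts: int = 30, delay: float = 2.0) -> bool:
    """Call check() until it returns True or the attempts run out."""
    for attempt in range(attempts):
        if check():
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False
```

A call such as `wait_until_healthy(lambda: is_healthy("https://datalore.example.com"))` would then block until the instance responds, where the host name is a placeholder for your own installation.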

Monitoring

Datalore has a built-in metrics exporter. It is disabled by default and, once explicitly enabled, is accessible at the /metrics path.

Two mutually exclusive environment variables of the Datalore server can be used to enable metrics:

• METRICS_AUTH_TOKEN: once defined, enables the exporter and sets the authentication token that must be used to retrieve metrics.

• ENABLE_UNAUTHORIZED_METRICS: once defined, enables the exporter. No authentication is required to read metrics.
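Because the two variables are mutually exclusive, it can help to validate a deployment configuration before it is applied. The sketch below is a hypothetical pre-deployment check, not part of Datalore. The function name `metrics_mode` and its return values are invented for illustration, and how the Datalore server itself reacts when both variables are defined is not described here.

```python
from typing import Mapping


def metrics_mode(env: Mapping[str, str]) -> str:
    """Classify how the metrics exporter would be enabled for a given environment.

    Returns "token", "unauthorized", or "disabled" (hypothetical labels).
    Raises ValueError if both mutually exclusive variables are defined.
    """
    # "Once defined" is taken to mean that the presence of the variable matters.
    token_defined = "METRICS_AUTH_TOKEN" in env
    unauthorized_defined = "ENABLE_UNAUTHORIZED_METRICS" in env

    if token_defined and unauthorized_defined:
        raise ValueError(
            "METRICS_AUTH_TOKEN and ENABLE_UNAUTHORIZED_METRICS are mutually exclusive"
        )
    if token_defined:
        return "token"
    if unauthorized_defined:
        return "unauthorized"
    return "disabled"
```

Passing `os.environ`, or the environment section parsed from a Helm values file, to `metrics_mode` would surface a conflicting configuration early.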

Metrics

1. agent_pool_size: shows how many agents the pool currently has.

  • Prometheus query: sum by (instance_name)(agent_pool_size)

2. agent_waiting_time_bucket: represents how long users have waited for an instance to start.

  • Prometheus query: sum(increase(agent_waiting_time_bucket[10m])) by (le)

3. agent_in_pool_time_bucket: represents how long an agent has been online and idle before being assigned to a specific notebook.

  • Prometheus query: sum(increase(agent_in_pool_time_bucket[10m])) by (le)

4. agents_started_total: counts started agents. The query below shows how many agents have been started per minute.

  • Prometheus query: sum by (instance_name)(rate(agents_started_total[5m])) * 60
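The queries above can also be run programmatically through the Prometheus HTTP API (`/api/v1/query`). The following is a minimal sketch, assuming a Prometheus server already scrapes the Datalore /metrics endpoint. The helper names and the server URL in the usage note are hypothetical, and only instant-vector results are handled.

```python
import json
import urllib.parse
import urllib.request

# The Prometheus queries listed above, keyed by a short descriptive name.
QUERIES = {
    "agent_pool_size": "sum by (instance_name)(agent_pool_size)",
    "agent_waiting_time": "sum(increase(agent_waiting_time_bucket[10m])) by (le)",
    "agent_in_pool_time": "sum(increase(agent_in_pool_time_bucket[10m])) by (le)",
    "agents_started_per_minute": "sum by (instance_name)(rate(agents_started_total[5m])) * 60",
}


def build_query_url(prometheus_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    query_string = urllib.parse.urlencode({"query": promql})
    return prometheus_url.rstrip("/") + "/api/v1/query?" + query_string


def parse_vector(payload: dict) -> dict:
    """Map each label set of an instant-vector response to its float value."""
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload.get('error')}")
    samples = {}
    for item in payload["data"]["result"]:
        labels = tuple(sorted(item["metric"].items()))
        _timestamp, value = item["value"]  # value is returned as a string
        samples[labels] = float(value)
    return samples


def run_query(prometheus_url: str, promql: str, timeout: float = 10.0) -> dict:
    """Execute one query against a Prometheus server and return parsed samples."""
    url = build_query_url(prometheus_url, promql)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return parse_vector(json.load(response))
```

For example, `run_query("http://prometheus.example.com:9090", QUERIES["agent_pool_size"])` would return the current pool size per `instance_name`, where the server address is a placeholder.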

Last modified: 09 April 2024