Add Healthcheck API for Cassini #96

@vonjackets

Cassini’s ListenerManager operates as an mTLS-enabled TCP server that accepts secure client connections and spawns a listener actor for each session. Operators, orchestrators (e.g., the control plane), and CI harnesses currently have no way to verify that Cassini is healthy, responsive, or ready to accept new connections.

This issue proposes adding a healthcheck API for Cassini to expose readiness and liveness information.


Background

Cassini is a long-running actor-based service. It doesn't expose HTTP endpoints; all interactions occur over a secured TCP layer. However, external orchestration systems (local controller, future distributed runners, or Kubernetes) need a way to:

  • Verify that Cassini is up and listening.
  • Confirm it can accept and process new client connections.
  • Optionally query internal actor or system health (e.g., listener counts, backlog size).

A healthcheck mechanism is especially important as we move toward distributed test orchestration, where the control plane will need to determine service readiness programmatically before dispatching test plans.


Questions & Design Considerations

🧩 1. Should the healthcheck be REST/HTTP-based?

Pros (HTTP/REST):

  • Standardized pattern (GET /healthz, /readyz) supported by most infra tooling.
  • Works seamlessly with container health probes and service monitors.
  • Easy to implement with lightweight HTTP server crates (axum, hyper, tiny_http).

Cons:

  • Introduces a separate protocol (HTTP vs. TCP).
  • Slightly more overhead; might feel inconsistent with Cassini’s current architecture.

Alternative: TCP-based healthcheck

  • Simpler but lower fidelity: e.g., open a TCP connection to Cassini’s port and verify that the TLS handshake completes.
  • Confirms the listener is running and certificates are valid.
  • Does not confirm internal actor state (e.g., if ListenerManager crashed internally but socket is still bound).
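
To make the trade-off concrete, here is a minimal sketch of the TCP-level probe using only the standard library. The function name `tcp_probe` is hypothetical, and the TLS handshake step is deliberately left out because it would need a TLS crate and Cassini's client certificates; this covers only the "is the socket accepting" half of the check.

```rust
use std::net::{SocketAddr, TcpStream};
use std::time::Duration;

/// Returns true if a TCP connection to `addr` succeeds within `timeout`.
/// Assumption: this is only the connect step; a full probe would also
/// perform the mTLS handshake. It says nothing about the state of the
/// actors behind the listener.
fn tcp_probe(addr: SocketAddr, timeout: Duration) -> bool {
    TcpStream::connect_timeout(&addr, timeout).is_ok()
}
```

A caller would pass Cassini's listen address and a short timeout. A `true` result still leaves the gap noted above: the socket can stay bound while the ListenerManager is no longer processing connections.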

Recommendation:
✅ Implement a small auxiliary HTTP healthcheck endpoint (listening on localhost or a management port).
This endpoint can internally query the ListenerManager actor to report:

  • status: "ok" | "degraded" | "error"
  • number of active listener actors
  • timestamp of last accepted connection
  • broker connectivity (optional future check)

This keeps the operational interface standard without modifying the existing TCP listener behavior.
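
One possible way to derive the `status` field is sketched below. The `Snapshot` type, its fields, and the classification rules are assumptions made for illustration; the issue does not define when Cassini should count as "degraded" versus "error".

```rust
/// Coarse status values, mirroring the "ok" | "degraded" | "error" proposal.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Status {
    Ok,
    Degraded,
    Error,
}

impl Status {
    /// String form intended for the response body.
    fn as_str(self) -> &'static str {
        match self {
            Status::Ok => "ok",
            Status::Degraded => "degraded",
            Status::Error => "error",
        }
    }
}

/// Hypothetical snapshot of what the ListenerManager reports.
struct Snapshot {
    manager_alive: bool,
    last_error: Option<String>,
}

/// Assumed rules: a dead manager is an error, a recorded error on a live
/// manager is degraded, anything else is ok.
fn classify(s: &Snapshot) -> Status {
    if !s.manager_alive {
        Status::Error
    } else if s.last_error.is_some() {
        Status::Degraded
    } else {
        Status::Ok
    }
}
```

Keeping the classification in a pure function like this makes the thresholds easy to unit test and to change once the real rules are agreed on.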


Proposed Implementation Plan

  1. Create a HealthState struct (stored in the supervisor or updated via message from ListenerManager):

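    use std::time::Instant;

    /// Snapshot of listener health, cached for the healthcheck endpoint.
    #[derive(Debug, Clone, Default)]
    struct HealthState {
        last_connection: Option<Instant>, // when the last connection was accepted
        active_listeners: usize,          // number of live listener actors
        last_error: Option<String>,       // most recent error, if any
    }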
  2. Expose an HTTP endpoint (e.g., port 9090 or configurable):

    • /healthz → returns 200 OK if listener is active and responsive.
    • /metrics (optional) → returns system stats in JSON or Prometheus format.
  3. Implement a simple healthcheck actor:

    • Periodically queries the ListenerManager for its internal state via an actor message.
    • Updates the HealthState cache.
  4. Update deployment configs (optional) to include readiness/liveness probes hitting localhost:9090/healthz.
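
The plan above could be prototyped roughly as follows. This is a dependency-free sketch on `std::net`, not a proposal to avoid axum or hyper; a real implementation would more likely use one of those crates. The names `respond` and `serve_one`, the JSON field names, the 503 status code, and the rule that "healthy" means no recorded error are all assumptions. The `HealthState` cache is assumed to be updated elsewhere, for example by the healthcheck actor from step 3.

```rust
use std::io::{BufRead, BufReader, Write};
use std::net::TcpListener;
use std::sync::{Arc, Mutex};
use std::time::Instant;

// Mirrors the HealthState struct proposed in this issue.
#[derive(Debug, Clone, Default)]
struct HealthState {
    last_connection: Option<Instant>,
    active_listeners: usize,
    last_error: Option<String>,
}

/// Maps an HTTP request line to a status code and a JSON body.
/// Assumption: "healthy" means no error has been recorded.
fn respond(request_line: &str, state: &HealthState) -> (u16, String) {
    let path = request_line.split_whitespace().nth(1).unwrap_or("");
    if path != "/healthz" {
        return (404, r#"{"status":"not_found"}"#.to_string());
    }
    let (code, status) = if state.last_error.is_none() {
        (200, "ok")
    } else {
        (503, "error")
    };
    // Instant has no wall-clock form, so report elapsed seconds instead.
    let since_last = state
        .last_connection
        .map_or("null".to_string(), |t| t.elapsed().as_secs().to_string());
    let body = format!(
        r#"{{"status":"{}","active_listeners":{},"secs_since_last_connection":{}}}"#,
        status, state.active_listeners, since_last
    );
    (code, body)
}

/// Accepts one connection on `listener` and answers it from the shared cache.
fn serve_one(listener: &TcpListener, state: &Arc<Mutex<HealthState>>) -> std::io::Result<()> {
    let (mut stream, _) = listener.accept()?;
    let mut reader = BufReader::new(stream.try_clone()?);
    let mut request_line = String::new();
    reader.read_line(&mut request_line)?;
    // Drain the remaining headers (up to the blank line) before replying.
    let mut header = String::new();
    while reader.read_line(&mut header)? > 2 {
        header.clear();
    }
    let snapshot = state.lock().unwrap().clone();
    let (code, body) = respond(&request_line, &snapshot);
    let reason = match code {
        200 => "OK",
        404 => "Not Found",
        _ => "Service Unavailable",
    };
    write!(
        stream,
        "HTTP/1.1 {} {}\r\nContent-Type: application/json\r\nContent-Length: {}\r\nConnection: close\r\n\r\n{}",
        code,
        reason,
        body.len(),
        body
    )?;
    Ok(())
}
```

In use, a management thread would bind a `TcpListener` to the management address (e.g., `127.0.0.1:9090`) and call `serve_one` in a loop, while the healthcheck actor writes fresh values into the same `Arc<Mutex<HealthState>>`.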


Acceptance Criteria

  • Cassini exposes a healthcheck endpoint that returns 200 when the listener is healthy.
  • A failed TCP bind or a crashed supervised actor results in a non-200 response.
  • The polar harness controller can use this endpoint to block until Cassini is ready before starting test runs.
  • The response body (HealthCheckResponse) includes data such as uptime and any existing topics.
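
For the "block until ready" criterion, a harness-side wait loop might look like the sketch below. The function names, the timeouts, and the use of a raw `TcpStream` instead of an HTTP client crate are assumptions made to keep the example self-contained; the only contract it relies on is that `/healthz` answers with status 200 when Cassini is ready.

```rust
use std::io::{BufRead, BufReader, Write};
use std::net::{SocketAddr, TcpStream};
use std::thread;
use std::time::{Duration, Instant};

/// Polls `check` every `interval` until it returns true or `timeout` elapses.
fn wait_until_ready<F: FnMut() -> bool>(mut check: F, interval: Duration, timeout: Duration) -> bool {
    let deadline = Instant::now() + timeout;
    loop {
        if check() {
            return true;
        }
        if Instant::now() >= deadline {
            return false;
        }
        thread::sleep(interval);
    }
}

/// Sends `GET /healthz` to `addr` and reports whether the status code is 200.
/// Any connection or I/O failure counts as "not ready".
fn healthz_is_ok(addr: SocketAddr) -> bool {
    let mut stream = match TcpStream::connect_timeout(&addr, Duration::from_secs(1)) {
        Ok(s) => s,
        Err(_) => return false,
    };
    let _ = stream.set_read_timeout(Some(Duration::from_secs(1)));
    let request = b"GET /healthz HTTP/1.1\r\nHost: localhost\r\nConnection: close\r\n\r\n";
    if stream.write_all(request).is_err() {
        return false;
    }
    let mut status_line = String::new();
    if BufReader::new(stream).read_line(&mut status_line).is_err() {
        return false;
    }
    // Status line shape: "HTTP/1.1 200 OK"
    status_line.split_whitespace().nth(1) == Some("200")
}
```

The controller would then call something like `wait_until_ready(|| healthz_is_ok(addr), Duration::from_millis(250), Duration::from_secs(30))` and abort the test run if it returns `false`.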

Labels

documentation, enhancement, good first issue
