Description
Cassini’s ListenerManager currently operates as an mTLS-enabled TCP server that accepts secure client connections and spawns a listener actor for each session. There is no way for operators, orchestrators (e.g., the control plane), or CI harnesses to verify that Cassini is healthy, responsive, or ready to accept new connections.
This issue proposes adding a healthcheck API for Cassini to expose readiness and liveness information.
Background
Cassini is a long-running actor-based service. It doesn't expose HTTP endpoints; all interactions occur over a secured TCP layer. However, external orchestration systems (local controller, future distributed runners, or Kubernetes) need a way to:
- Verify that Cassini is up and listening.
- Confirm it can accept and process new client connections.
- Optionally query internal actor or system health (e.g., listener counts, backlog size).
A healthcheck mechanism is especially important as we move toward distributed test orchestration, where the control plane will need to determine service readiness programmatically before dispatching test plans.
Questions & Design Considerations
🧩 1. Should the healthcheck be REST/HTTP-based?
Pros (HTTP/REST):
- Standardized pattern (`GET /healthz`, `/readyz`) supported by most infra tooling.
- Works seamlessly with container health probes and service monitors.
- Easy to implement with lightweight HTTP server crates (`axum`, `hyper`, `tiny_http`).
Cons:
- Introduces a separate protocol (HTTP vs. TCP).
- Slightly more overhead; might feel inconsistent with Cassini’s current architecture.
Alternative: TCP-based healthcheck
- Simpler but lower fidelity: e.g., attempt to open a TCP connection to Cassini’s port and verify the TLS handshake.
- Confirms the listener is running and certificates are valid.
- Does not confirm internal actor state (e.g., if ListenerManager crashed internally but socket is still bound).
Recommendation:
✅ Implement a small auxiliary HTTP healthcheck endpoint (listening on localhost or a management port).
This endpoint can internally query the ListenerManager actor to report:
- `status: "ok" | "degraded" | "error"`
- number of active listener actors
- timestamp of last accepted connection
- broker connectivity (optional future check)
This keeps the operational interface standard without modifying the existing TCP listener behavior.
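For illustration only, a `/healthz` response body carrying the fields above might look like the following (the exact schema is not yet decided, and every field name here is a placeholder):

```json
{
  "status": "ok",
  "active_listeners": 3,
  "last_connection": "2024-01-01T00:00:00Z",
  "broker_connected": true
}
```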
Proposed Implementation Plan
- Create a `HealthState` struct (stored in the supervisor or updated via a message from `ListenerManager`):

  ```rust
  struct HealthState {
      last_connection: Option<Instant>,
      active_listeners: usize,
      last_error: Option<String>,
  }
  ```
- Expose an HTTP endpoint (e.g., port `9090` or configurable):
  - `/healthz` → returns `200 OK` if the listener is active and responsive.
  - `/metrics` (optional) → returns system stats in JSON or Prometheus format.
- Implement a simple healthcheck actor:
  - Periodically queries the `ListenerManager` for its internal state via an actor message.
  - Updates the `HealthState` cache.
- Update deployment configs (optional) to include readiness/liveness probes hitting `localhost:9090/healthz`.
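For the Kubernetes case, the probe stanza in a container spec might look like the following (port, paths, and timings are assumptions based on the defaults proposed above):

```yaml
# Hypothetical probe config for a Cassini container.
livenessProbe:
  httpGet:
    path: /healthz
    port: 9090
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet:
    path: /healthz
    port: 9090
  periodSeconds: 5
```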
Acceptance Criteria
- Cassini exposes a healthcheck endpoint returning 200 when listener is healthy.
- Failing TCP bind or actor supervision crash results in non-200 response.
- The polar harness controller can use this endpoint to block until Cassini is ready before starting test runs.
- `HealthCheckResponse` contains data such as uptime and existing topics (if any).