Open
Description
Monitoring is required for non-ephemeral runners, and it serves several purposes:
- Evaluating the stability of new runner pools. Before runners are used in jobs that may impact CI/CD, the CI infra team can observe their stability using jobs that do not affect CI/CD.
- Responding to incidents in a timely manner. For example, if a runner pool goes offline or its capacity is reduced, alert the relevant team and the community, and possibly disable the runners until the issue is fixed.
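As a rough illustration of the incident-response case, the decision could be sketched as below. The `PoolStatus` fields, the `Action` values, and the 80% `degraded_ratio` threshold are all hypothetical and not part of any existing tooling; the real thresholds would come from the documented availability expectations.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"
    ALERT = "alert"
    ALERT_AND_DISABLE = "alert_and_disable"


@dataclass
class PoolStatus:
    name: str
    expected_runners: int  # capacity the pool is supposed to provide
    online_runners: int    # runners currently reporting as online


def evaluate_pool(status: PoolStatus, degraded_ratio: float = 0.8) -> Action:
    """Decide how to react to the observed capacity of a runner pool.

    degraded_ratio is an assumed threshold: below this fraction of the
    expected capacity, the pool is treated as degraded.
    """
    if status.online_runners == 0:
        # Pool is offline: alert and consider disabling it.
        return Action.ALERT_AND_DISABLE
    if status.online_runners < degraded_ratio * status.expected_runners:
        # Pool is online but running with reduced capacity.
        return Action.ALERT
    return Action.NONE
```

For example, `evaluate_pool(PoolStatus("example-pool", 10, 7))` would return `Action.ALERT` under the assumed 80% threshold.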
Metrics are required for all types of runners and should describe the lifecycle of runners.
Most metrics in this area may already be available from GitHub, but this has to be verified.
Both monitoring data and metrics should be available in the OpenTelemetry format.
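To make the idea of lifecycle metrics more concrete, here is a minimal sketch that derives two durations from runner lifecycle timestamps and wraps one of them in a structure modeled on the OTLP JSON gauge encoding. The lifecycle states (`provisioned`, `online`, `offline`), the metric name, and the `pool` attribute are assumptions made for illustration; the actual set of metrics and their names still have to be defined, and a real setup would more likely use an OpenTelemetry SDK or collector than hand-built payloads.

```python
def lifecycle_durations(events: dict) -> dict:
    """Compute durations from lifecycle timestamps given in Unix seconds.

    The state names used here are hypothetical placeholders.
    """
    return {
        "startup_seconds": events["online"] - events["provisioned"],
        "uptime_seconds": events["offline"] - events["online"],
    }


def to_otlp_gauge(name: str, value: float, unit: str,
                  attributes: dict, time_unix_nano: int) -> dict:
    """Build a single-point gauge metric shaped like the OTLP JSON encoding."""
    return {
        "name": name,
        "unit": unit,
        "gauge": {
            "dataPoints": [
                {
                    "asDouble": value,
                    # 64-bit integers are carried as strings in OTLP JSON.
                    "timeUnixNano": str(time_unix_nano),
                    "attributes": [
                        {"key": key, "value": {"stringValue": val}}
                        for key, val in sorted(attributes.items())
                    ],
                }
            ]
        },
    }
```

A caller could compute `lifecycle_durations({"provisioned": 100.0, "online": 130.0, "offline": 730.0})` and pass the resulting `startup_seconds` value to `to_otlp_gauge` together with the pool name as an attribute.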
As part of this, we should:
- document the expectation for runners in terms of service availability
- design and document the setup required for PyTorch and community runners
- [optional] integrate OpenTelemetry data sources into Hud or build equivalent dashboards using tools like Grafana or similar
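One way the availability expectation from the first item could be expressed is as a fraction of time a runner pool is available within a reporting window. The sketch below assumes outages are recorded as non-overlapping `(start, end)` intervals in seconds; the 99% default target is a placeholder, not an agreed value.

```python
def availability(outages, window_start: float, window_end: float) -> float:
    """Return the fraction of the window during which the pool was available.

    outages is a list of (start, end) tuples, assumed not to overlap.
    Outages are clipped to the window boundaries.
    """
    window = window_end - window_start
    if window <= 0:
        raise ValueError("window_end must be greater than window_start")
    downtime = sum(
        max(0.0, min(end, window_end) - max(start, window_start))
        for start, end in outages
    )
    return 1.0 - downtime / window


def meets_target(value: float, target: float = 0.99) -> bool:
    """Check an availability value against a (hypothetical) target."""
    return value >= target
```

With a window of 100 seconds and a single 10 second outage, `availability` yields roughly 0.9, which would not meet the placeholder 99% target.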