In the second workstream, we will focus on extending monitoring and observability to all runners, regardless of whether they run in AWS, a different public cloud, or a self-hosted pool.
Work on this stream can start in parallel with stream 1.
However, it should not be prioritised over it.
Monitoring and observability must be available at various levels:
- Monitoring of services (autoscaler, caches, etc.)
- Monitoring of long-lived runners
- Autoscaler metrics
- Job-level metrics and logs
- Test-level metrics
- Billing data
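As an illustration of the job-level metrics layer, a runner could emit structured records that a collector forwards to a dashboard such as Hud. The sketch below is a minimal example of that idea; the schema and field names are assumptions for illustration, not an actual PyTorch or Hud format.

```python
import json
import time
from dataclasses import dataclass, asdict

@dataclass
class JobMetric:
    """One job-level observation emitted by a runner (illustrative schema)."""
    job_id: str
    runner_pool: str   # e.g. "aws" or "self-hosted"; assumed label, not a real field name
    duration_s: float
    status: str        # "success" or "failure"

def emit(metric: JobMetric) -> str:
    """Serialise the metric as a JSON log line for downstream ingestion."""
    record = asdict(metric)
    record["ts"] = int(time.time())  # attach an ingestion timestamp
    return json.dumps(record, sort_keys=True)
```

Emitting one JSON line per job keeps the format cloud-agnostic, so the same collector can ingest records from AWS, other public clouds, or self-hosted pools.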
Some monitoring is available today for jobs that run in self-hosted pools, but we aim to make it easier for organisations to set that up.
Today, the PyTorch foundation uses Hud to visualise data from various sources. Data and signals from multiple sources shall be integrated back into Hud, where possible, to provide a consistent view across clouds from a single dashboard.
Note
Compared to the initial proposal, I have added security to the stream 2 work, referring to the security of runners and the build system more broadly, in light of previous attacks that were executed against the PyTorch CI system.