Stream 2 - Monitoring, Observability and Security #7

@afrittoli

Description

In the second workstream, we will focus on extending monitoring and observability to all runners, regardless of whether they run in AWS, a different public cloud, or a self-hosted pool.

Work on this stream can start in parallel with stream 1, but it should not be prioritised over it.

Monitoring and observability must be available at various levels:

  • Monitoring of services (autoscaler, caches, etc.)
  • Monitoring of long-lived runners
  • Autoscaler metrics
  • Job-level metrics and logs
  • Test-level metrics
  • Billing data

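As a minimal sketch of what exposing such metrics could look like, the snippet below renders a few hypothetical autoscaler gauges in the Prometheus text exposition format using only the standard library. The metric names, labels, and values here are illustrative assumptions, not actual PyTorch CI metrics; a real deployment would more likely use a Prometheus client library.

```python
# Illustrative sketch: render autoscaler-style metrics in the Prometheus
# text exposition format. Metric names and labels are hypothetical.

def render_metrics(metrics):
    """Render {(name, labels): value} into Prometheus text format.

    `labels` is a tuple of (key, value) pairs so it can be used as a dict key.
    """
    lines = []
    for (name, labels), value in sorted(metrics.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels))
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

# Hypothetical sample: pool sizes per cloud plus a queue-depth counter.
metrics = {
    ("runner_pool_size", (("cloud", "aws"),)): 12,
    ("runner_pool_size", (("cloud", "self-hosted"),)): 4,
    ("queued_jobs_total", ()): 37,
}
print(render_metrics(metrics))
```

A scrape endpoint serving this text would let the same collector cover runners in AWS, other public clouds, and self-hosted pools, which is the cross-cloud consistency this stream aims for.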
Some monitoring is available today for jobs that run in self-hosted pools, but we aim to make it easier for organisations to set that up.

Today, the PyTorch Foundation uses Hud to visualise data from various sources. Where possible, data and signals from multiple sources shall be integrated back into Hud to provide a consistent view across clouds from a single dashboard.

Note

Compared to the initial proposal, I have added security to the stream 2 work, referring to the security of runners and of the build system more broadly, in light of previous attacks that were executed against the PyTorch CI system.

Reference material
