Stream 2 - Monitoring, Observability and Security #7

@afrittoli

Description

In the second workstream, we will focus on extending monitoring and observability to all runners, regardless of whether they run in AWS, a different public cloud, or a self-hosted pool.

Work on this stream can start in parallel with stream 1, but it should not be prioritised over it.

Monitoring and observability must be available at various levels:

  • Monitoring of services (autoscaler, caches, etc.)
  • Monitoring of long-lived runners
  • Autoscaler metrics
  • Job-level metrics and logs
  • Test-level metrics
  • Billing data

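As a minimal sketch of what exposing such metrics could look like, the snippet below renders a few hypothetical autoscaler gauges in the Prometheus text exposition format using only the standard library. The metric names, labels, and values here are illustrative assumptions, not actual PyTorch CI metrics; a real deployment would more likely use a Prometheus client library.

```python
# Illustrative sketch: render autoscaler-style metrics in the Prometheus
# text exposition format. Metric names and labels are hypothetical.

def render_metrics(metrics):
    """Render {(name, labels): value} into Prometheus text format.

    `labels` is a tuple of (key, value) pairs so it can be used as a dict key.
    """
    lines = []
    for (name, labels), value in sorted(metrics.items()):
        label_str = ",".join(f'{k}="{v}"' for k, v in sorted(labels))
        if label_str:
            lines.append(f"{name}{{{label_str}}} {value}")
        else:
            lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"

# Hypothetical sample: pool sizes per cloud plus a queue-depth counter.
metrics = {
    ("runner_pool_size", (("cloud", "aws"),)): 12,
    ("runner_pool_size", (("cloud", "self-hosted"),)): 4,
    ("queued_jobs_total", ()): 37,
}
print(render_metrics(metrics))
```

A scrape endpoint serving this text would let the same collector cover runners in AWS, other public clouds, and self-hosted pools, which is the cross-cloud consistency this stream aims for.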
Some monitoring is available today for jobs that run in self-hosted pools, but we aim to make it easier for organisations to set that up.

Today, the PyTorch Foundation uses Hud to visualise data from various sources. Where possible, data and signals from multiple sources shall be integrated back into Hud to provide a consistent view across clouds from a single dashboard.

Note

Compared to the initial proposal, I have added security to the stream 2 work, referring to the security of runners and of the build system more broadly, in light of previous attacks that were executed against the PyTorch CI system.

Reference material
