
Proposal: package statistics collection at registration time #623

Description

@ericphanson

Background

In 2021 and 2023, @giordano and I presented package ecosystem statistics at JuliaCon (and in 2021 we wrote a blog post, https://julialang.org/blog/2021/08/general-survey/), collected with our package PackageAnalyzer. PackageAnalyzer performs a quick static analysis designed to run at ecosystem scale, and can provide details like:

julia> using PackageAnalyzer

julia> analyze("Flux")
PackageV1 Flux:
  * repo: https://github.com/FluxML/Flux.jl.git
  * uuid: 587475ba-b771-5e3f-ad9e-33799f191a9c
  * version: 0.16.5
  * is reachable: true
  * tree hash: d0751ca4c9762d9033534057274235dfef86aaf9
  * Julia code in `src`: 3420 lines
  * Julia code in `ext`: 419 lines (5.1% of `test` + `src` + `ext`)
  * Julia code in `test`: 4310 lines (52.9% of `test` + `src` + `ext`)
  * documentation in `docs`: 4374 lines (53.3% of `docs` + `src` + `ext`)
  * documentation in README & docstrings: 3791 lines (52.6% of README + `src`)
  * has license(s) in file: MIT
    * filename: LICENSE.md
    * OSI approved: true
  * has `docs/make.jl`: true
  * has `test/runtests.jl`: true
  * has continuous integration: true
    * GitHub Actions
    * Buildkite

This can be mildly useful on its own, but is most interesting when run over the whole registry, ideally at many points in time, so trends can emerge. For example, in March 2021 I added a license check to AutoMerge, and in our 2023 JuliaCon talk we plotted the results of that:

[Image: plot from our 2023 JuliaCon talk of license-check results across the registry over time]

(Sorry for the blurry screencap!) To obtain that plot, Mosè and I wrote some code to check out old versions of the General registry, then download and analyze the latest version of every package available at each point in time, saving the results. This is a heavily IO-bound procedure that ran on a big server for several days (with annoying crashes to deal with). The package analysis itself is quick (less than a second) once you have the code, but it takes a while (with a risk of network failures) to download each package. Thus we have structured this work as one-off analyses rather than running it on an ongoing basis, because the maintenance and compute burden is too high.

Motivation

I think it would be fun and useful to have package statistics data available automatically on an ongoing (say, monthly) basis, and to allow similar plots to be generated easily by anyone, without needing a big server or cloning every package. In addition, I think a static GitHub Pages frontend could make it easy to explore the data, letting anyone track ecosystem trends without re-running the heavy one-off analyses above.

Implementation

New repos

I would create 2 new repos:

  • julia-ecosystem-stats-staging
  • julia-ecosystem-stats

whose purposes are discussed below. They could live in the JuliaEcosystem org, like PackageAnalyzer.

Changes to PackageAnalyzer

I would add a PackageSummary object which summarizes a PackageAnalysisV1 object. Currently we compute fairly detailed information (e.g. per-file lines of code), of which the printed display above is a summary. A PackageSummaryV1 would have fields like lines_of_julia_src_code which aggregate over these, without the nested tables that PackageAnalysisV1 has. Additionally, contributor info stored in a PackageSummary would be counts only; no usernames, git emails, or anything like that.
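
To make this concrete, here is a minimal sketch of what the record type might look like, assuming Legolas.jl's @schema/@version macros (v0.5-style); the schema name and field set are illustrative, not final:

using Legolas: @schema, @version
using UUIDs

@schema "package-summary" PackageSummary

# Illustrative fields only; the real set would mirror the PackageAnalysisV1
# display shown above (lines of code, docs, CI flags, license info, ...).
@version PackageSummaryV1 begin
    uuid::UUID
    version::String
    tree_hash::String
    lines_of_julia_src_code::Int
    lines_of_julia_test_code::Int
    lines_of_docs::Int
    license_osi_approved::Bool
    has_ci::Bool
    contributor_count::Int  # counts only, no usernames or emails
end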

I would also add something to read/write a PackageSummary to disk compactly in plain text, maybe as a single-line TSV file, or we could do a binary encoding. This would include the schema version (i.e. the V1 from PackageSummaryV1). Here we are using PackageAnalyzer's existing schema system (Legolas.jl), which allows multiple schema versions, etc.
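
As a rough sketch of the single-line TSV idea (these helper names are hypothetical, not existing PackageAnalyzer API):

# Write one summary as a single tab-separated line, prefixed with the
# schema version so readers can dispatch on it.
function write_summary(io::IO, s)
    fields = (getfield(s, f) for f in propertynames(s))
    println(io, join(["PackageSummaryV1"; string.(collect(fields))], '\t'))
end

write_summary(path::AbstractString, s) = open(io -> write_summary(io, s), path, "w")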

Changes to AutoMerge per-PR checks

During AutoMerge, we already clone the package repository to check it for a license. This is a good opportunity to run an analysis, since we have the code downloaded already. I would update AutoMerge to optionally (configurable per-registry) compute a PackageSummary and push it to a configurable repo, which for General would be julia-ecosystem-stats-staging, in a file staging/YYYY/MM/<uuid>@<version>-<tree>.tsv (using the PackageSummary serialization from PackageAnalyzer). If AutoMerge runs several times on the PR, or on the same version in different PRs, it would overwrite this file. There should be no races, because the analysis should be identical if the tree hash is the same and we are keying off the tree hash. Here "staging" means we know this tree hash of the package may not be the one that ends up registered. Nevertheless, this is the time we already have "free" access to the code (it is downloaded) and are already running an analysis (license checking), so it is a relatively small modification to add a deeper analysis and save the results to a repo.
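
For illustration, the path construction could look something like this (a hypothetical helper, keyed on UUID, version, and tree hash exactly as in the scheme above):

using Dates

# Build the staging path for a given registration candidate. Keying on the
# tree hash means concurrent AutoMerge runs write identical content to the
# same file, so overwrites are harmless.
function staging_path(uuid, version, tree_hash; when=Dates.now(Dates.UTC))
    return joinpath("staging",
                    Dates.format(when, "yyyy"),
                    Dates.format(when, "mm"),
                    "$(uuid)@$(version)-$(tree_hash).tsv")
end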

Notes:

  • Later, we can inspect the registry itself to learn which tree hashes actually were registered, and then load just the relevant files.
  • We could configure this repo to delete old history and only keep the last few months of data. This is a staging area, not persistence.
  • We use a separate file per uuid-version-tree triplet to avoid git conflicts and races. Since our analysis should not vary for a given tree, we don't need to worry about "which" one gets committed if there are concurrent AutoMerge runs for some reason.
  • We could surface the analysis in the AutoMerge comment as well, perhaps in a details block if it would otherwise be noisy.

julia-ecosystem-stats

This repo would persist the data long-term as well as host a front-end to view it.

Once per month, a github action would:

  • clone julia-ecosystem-stats-staging and General, identify the uuid-version-tree triplets there that were actually registered, and introspect the git history of General to get registration dates.
  • build a pkg_dense_monthly/YYYY/MM.csv.gz compressed CSV for the month. This would have one row per package, analyzing the highest-semver version of that package as of YYYY/MM; if no versions were registered that month, the row would be backfilled from previous months (see the sketch after this list). This gives us a "dense" (no skipped packages) record that is easy to analyze. It would include columns for package version, registration date, etc. Besides compression, I would likely use a package-uuid-to-integer lookup JSON sidecar shared across all months (appended to each month as new UUIDs appear), so we can avoid incompressible UUIDs and keep the CSV as small as possible; a couple hundred bytes per package seems achievable.
    • This would be the actual data being persisted. The granularity is package-month, not package-version, so that packages which register tons of versions don't bloat the data and we have more predictable growth. We have 10k packages now, so if we go back 5 years, that's 600k total rows -- not too bad, spread over 60 CSVs.
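
A minimal sketch of the dense-selection logic (names and row shapes are illustrative):

# `registered` holds (uuid, version) pairs for versions registered up to the
# end of the month; `prev` is last month's dense table as uuid => row.
# Starting from `prev` implements the backfill: packages with no new
# registrations keep their previous row.
function dense_month(registered, prev::Dict)
    dense = copy(prev)
    for (uuid, version) in registered
        v = VersionNumber(version)
        if !haskey(dense, uuid) || v > dense[uuid].version
            dense[uuid] = (; version = v)  # real rows would carry the full summary
        end
    end
    return dense
end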

Notes:

  • At this point, we have ongoing (monthly) package-analysis stats. I would additionally backfill the history with a big one-off compute job (run locally).
  • Since GitHub cron jobs get turned off on inactive repos, we could have the cron job live in General and workflow-dispatch this repo once a month, instead of using a cron job in julia-ecosystem-stats.
  • The pkg_dense_monthly/YYYY/MM.csv.gz files are typically write-once, update-never, so we don't need a diff-friendly format (i.e. compression is OK). We could use a binary format if we wanted as well.

front-end

I'm not exactly sure what this would look like yet, but I am thinking to use GitHub Actions to deploy a static site to GitHub Pages. The deployment process would:

  • generate data aggregations (we could also compute these earlier and check them into the repo if desired)
  • bundle/compress/format the data in a way friendly to the frontend
  • use a lot of client-side JavaScript to load and plot the data in various ways, ideally flexibly enough to permit data exploration

The key bit is that modern GitHub Pages deployments do not need a branch; you can create a tar bundle any way you want and hand it over to GitHub Pages to deploy. So we can produce aggregations or generate plots without checking them into the repo itself (although ideally the plotting happens client-side rather than server-side).
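
As one example of the kind of aggregation the deploy step could emit (hypothetical column name, assuming CSV.jl, DataFrames.jl, and CodecZlib.jl):

using CSV, DataFrames, CodecZlib, Statistics

# Fraction of packages with an OSI-approved license in one monthly snapshot;
# emitting one such number per month would reproduce the license plot above.
function osi_fraction(month_csv_gz::AbstractString)
    bytes = transcode(GzipDecompressor, read(month_csv_gz))
    df = CSV.read(bytes, DataFrame)
    return mean(df.license_osi_approved)  # assumes a Bool column of this name
end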

Questions

  1. Do the AutoMerge changes seem acceptable?
  2. Does the implementation seem maintainable from a technical perspective?
  3. Other thoughts?
