@dsmedia
Collaborator

@dsmedia dsmedia commented Oct 27, 2025

Summary

Fixes broken and outdated source URLs in dataset metadata and documentation. These issues were identified during link checking as part of #724.

Changes

Dataset Metadata URLs (_data/datapackage_additions.toml)

  1. airports.csv: Updated Data.gov dataset ID

    • Before: https://catalog.data.gov/dataset/airports-5e97a
    • After: Description modified to include "While the exact generation source of this file is unknown, this data is consistent with files provided on a monthly frequency by the FAA's National Airspace System Resource."
    • Reason: Precise file used to generate the csv cannot be determined. Very close match with early-2000s FAA datasets but no complete match found. NASR airport data shown as authoritative source of up-to-date information.
  2. londonBoroughs.json: Updated London Datastore URL

    • Before: https://data.london.gov.uk/dataset/statistical-gis-boundary-files-london
    • After: https://data.london.gov.uk/dataset/statistical-gis-boundary-files-for-london-20od9/
    • Reason: URL structure changed on data.london.gov.uk
  3. population_engineers_hurricanes.csv: Removed outdated FactFinder source

    • Removed: https://factfinder.census.gov/faces/tableservices/jsf/pages/productview.xhtml?pid=ACS_07_3YR_S1901&prodType=table
    • Reason: Incorrect source for population and employment data (the data is from 2016, not 2007). Also, FactFinder is deprecated and the URL is no longer accessible.
    • Added: correct sources from census.gov / NOAA
  4. us-10m.json: Fixed LICENSE URL

    • Before: https://github.com/topojson/us-atlas/blob/master/LICENSE.md
    • After: https://github.com/topojson/us-atlas/blob/master/LICENSE
    • Reason: LICENSE file doesn't have .md extension
  5. world-110m.json: Fixed LICENSE URL

    • Before: https://github.com/topojson/world-atlas/blob/master/LICENSE.md
    • After: https://github.com/topojson/world-atlas/blob/master/LICENSE
    • Reason: LICENSE file doesn't have .md extension
  6. penguins.json: Updated Palmer Station LTER URL

    • Before: https://pal.lternet.edu/
    • After: https://pallter.marine.rutgers.edu/
    • Reason: Updated to current Palmer Station LTER site at Rutgers

Documentation URLs (CONTRIBUTING.md)

  1. Data Package Standard reference

    • Before: https://datapackage.org/standard/
    • After: https://datapackage.org/
    • Reason: /standard/ endpoint returns 404
  2. LICENSE link

    • Before: ./LICENSE
    • After: https://github.com/vega/vega-datasets/blob/main/README.md#license
    • Reason: Local LICENSE file doesn't exist; updated to GitHub README section

Verification

  • ✅ All 7 new/modified URLs manually verified as working
  • ✅ Regenerated datapackage.json and datapackage.md via npm run build
  • ✅ Code quality checks passed:
    • uvx taplo fmt --check --diff (TOML formatting)
    • uvx ruff check (Python linting)
    • uvx ruff format --check (Python formatting)
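
The manual URL verification listed above could be approximated with a small script. The following is a minimal sketch, not the tooling used in this PR (the commit message mentions lychee for link checking). The function names `extract_urls`, `http_status`, and `find_broken` are hypothetical, and the regular expression is a rough approximation of URL syntax. Some servers reject HEAD requests, so a reported failure would still need a manual look.

```python
import re
import urllib.error
import urllib.request
from typing import Callable, Dict, Iterable, List

# Rough URL matcher (an assumption, not a full RFC 3986 parser).
URL_PATTERN = re.compile(r"https?://[^\s\"'<>)\]]+")


def extract_urls(text: str) -> List[str]:
    """Return unique URLs found in text, in order of first appearance."""
    seen: Dict[str, None] = {}
    for match in URL_PATTERN.findall(text):
        # Drop trailing sentence punctuation that the pattern may capture.
        seen.setdefault(match.rstrip(".,;"), None)
    return list(seen)


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status of a HEAD request, or 0 on a network failure."""
    request = urllib.request.Request(
        url, method="HEAD", headers={"User-Agent": "link-check-sketch"}
    )
    try:
        with urllib.request.urlopen(request, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # e.g. 404 for a dead endpoint
    except OSError:
        return 0  # DNS failure, refused connection, timeout, and similar


def find_broken(
    urls: Iterable[str], status_of: Callable[[str], int] = http_status
) -> Dict[str, int]:
    """Map each URL whose status is outside 200-399 to that status."""
    broken: Dict[str, int] = {}
    for url in urls:
        code = status_of(url)
        if not 200 <= code < 400:
            broken[url] = code
    return broken
```

Passing the text of `_data/datapackage_additions.toml` to `extract_urls` and the result to `find_broken` would list candidate dead links. The `status_of` parameter exists so the check can be exercised without network access.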

Related

Links checked as part of #724

Updates several broken or outdated source URLs in datapackage_additions.toml:

- airports.csv: Update to aviation-facilities1 (old dataset ID removed)
- londonBoroughs.json: Update to current data.london.gov.uk URL
- population_engineers_hurricanes.csv: Remove outdated FactFinder source (deprecated)
- us-10m.json: Fix LICENSE URL (remove incorrect .md extension)
- world-110m.json: Fix LICENSE URL (remove incorrect .md extension)
- penguins.json: Update LTER URL to palmerpenguins R package site

Also fixes documentation URLs in CONTRIBUTING.md:
- Update datapackage.org/standard URL

All URLs verified accessible. Broken URLs discovered during link checking
with lychee. Regenerated datapackage.json and datapackage.md via npm build.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Update LICENSE reference to point to GitHub README.md#license section
instead of local ./LICENSE file which doesn't exist in the repository.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
Update Palmer Station source URL to the correct LTER site at Rutgers
(https://pallter.marine.rutgers.edu/) instead of the palmerpenguins
R package site.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <[email protected]>
@domoritz
Member

For removing URLs. Can we instead link to the internet archive?

…n_engineers_hurricanes.csv

Updated resource descriptions and sources in the datapackage_additions TOML file for data/population_engineers_hurricanes.csv. Corrects year and ACS dataset for population and employment data and confirms ratio denominator. Identifies likely source, a NOAA FAQ, of state-level hurricane count aggregation that specifies the methodology used (e.g. 'direct hits') as well as link to disaggregated data table.
@dsmedia
Collaborator Author

dsmedia commented Oct 27, 2025

> For removing URLs. Can we instead link to the internet archive?

should be good now. as it turns out the prior source was inaccurate anyway. replaced with correct sources. thanks @domoritz

@domoritz
Member

For all the updated URLs, are they for exactly the same data as before? Otherwise, I'd prefer keeping a non working but correct link.

@dsmedia
Collaborator Author

dsmedia commented Oct 28, 2025

> For all the updated URLs, are they for exactly the same data as before? Otherwise, I'd prefer keeping a non working but correct link.

Valid point. In this case, everything is confirmed identical (or were old links that had always been wrong) except for the airports dataset renaming on data.gov, which I'll need a little time to track down the details on. I'll also double check the penguins to be 100% sure.

Given the inevitability of link rot, maybe we can best address the tradeoff between provenance and accessibility by hammering out a policy for CONTRIBUTING.md? Would something like this satisfy the concern you raise? Happy to adjust it.

Policy for Updating URLs
To balance provenance with accessibility, the default policy for any dead or superseded link in this repo is to preserve and add. When the problematic URL appears as data source metadata in datapackage_additions.toml, first preserve the original [[resources.sources]] entry for historical accuracy, annotating the source title with the ISO 8601 date the change was observed (e.g. [Inaccessible as of 2025-10-27]). Second, add a new, separate [[resources.sources]] entry immediately below it directing users to a digital archive and/or a suitable swap-in replacement source, where applicable. There are two exceptions to the default preserve and add rule. URLs may be updated in place when 1) the change is a simple renaming (e.g. domain migration) and the data is identical or 2) the original URL was erroneous from the start.
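
To make the proposed rule concrete, here is a minimal sketch of "preserve and add" applied to a list of source entries. It assumes each source is a mapping with `title` and `path` keys, mirroring the [[resources.sources]] tables. The function name `preserve_and_add` and the example URLs are hypothetical, and this is an illustration of the proposal rather than code from the repository.

```python
from datetime import date
from typing import Dict, List, Optional

# Assumed shape of one [[resources.sources]] entry: {"title": ..., "path": ...}
Source = Dict[str, str]


def preserve_and_add(
    sources: List[Source],
    dead_path: str,
    observed: date,
    replacement: Optional[Source] = None,
) -> List[Source]:
    """Keep the dead source, annotate its title, and add a successor below it."""
    updated: List[Source] = []
    for source in sources:
        if source.get("path") != dead_path:
            updated.append(dict(source))
            continue
        # Preserve the original entry, marking when the breakage was observed.
        marker = f"[Inaccessible as of {observed.isoformat()}]"
        annotated = dict(source)
        annotated["title"] = f"{source.get('title', '').strip()} {marker}".strip()
        updated.append(annotated)
        # Add the archive or replacement entry immediately after the original.
        if replacement is not None:
            updated.append(dict(replacement))
    return updated
```

The input list is left unmodified, so the original metadata can still be compared against the result. The two in-place exceptions in the proposal (simple renames and originally erroneous URLs) are deliberately not handled here, since they require a human judgment about whether the data is identical.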

@dsmedia
Collaborator Author

dsmedia commented Oct 29, 2025

After a closer look, I should be able to replicate (or nearly replicate) airports.csv with archived data available from the National Transportation Atlas Database (NTAD) from the US Department of Transportation. Once I finalize, I will update the source accordingly and possibly add a generation script here, or in a future PR.

@dsmedia
Collaborator Author

dsmedia commented Oct 31, 2025

  • penguins source confirmed to be a new URL of the same site with same datasets available. we have two sources linked to this dataset, and the other is a github repo that also refers back to this URL.

  • airports.csv original provenance is unknown and is not clear from commits. It can be assumed to have originated with an official FAA dataset, however. The file matches early-2000s FAA data quite well, but at least 50 airport codes from airports.csv could not be matched against FAA datasets from any year. Up-to-date airport data shows significant differences, and replacing the original dataset with a current version could interfere with example gallery visualizations that join airports and flights data.

  • corrections to metadata for population_engineers_hurricanes.csv remove an error in the original source attribution.

@domoritz Happy to answer any further concerns. That said, I am comfortable these changes improve accuracy and do not mask the provenance or sources of any of the datasets.
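
The airport-code comparison described above can be sketched as a set difference between two CSV files. This is an illustrative sketch, not the analysis actually run for this PR. It assumes airports.csv keeps its codes in a column named `iata`, and the reference column name `LOCID` is a hypothetical placeholder for whatever identifier field a given FAA or NTAD extract uses.

```python
import csv
import io
from typing import List, Set, TextIO


def read_codes(handle: TextIO, column: str) -> Set[str]:
    """Collect the non-empty, upper-cased values of one CSV column."""
    codes: Set[str] = set()
    for row in csv.DictReader(handle):
        value = (row.get(column) or "").strip().upper()
        if value:
            codes.add(value)
    return codes


def unmatched_codes(local: Set[str], reference: Set[str]) -> List[str]:
    """Return codes present locally but absent from the reference data."""
    return sorted(local - reference)


# Tiny in-memory stand-ins for the real files (contents are made up).
local_csv = io.StringIO("iata,name\n00M,Thigpen\nJFK,John F Kennedy Intl\nZZZ,Unknown\n")
reference_csv = io.StringIO("LOCID,FAC_NAME\njfk,JOHN F KENNEDY INTL\n00M,THIGPEN\n")

missing = unmatched_codes(
    read_codes(local_csv, "iata"), read_codes(reference_csv, "LOCID")
)
```

With real files, the `io.StringIO` objects would be replaced by `open(path, newline="")` handles. Repeating the comparison against reference extracts from several years would show whether any single vintage accounts for every code.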

@domoritz
Member

URLs should never change but of course do. Yet, it's not our job to keep them up to date. If you want to update URLs, that's fine but I don't want a policy as it implies that we continue to update URLs.

@dsmedia dsmedia merged commit cb43688 into vega:main Nov 3, 2025
2 checks passed
@dsmedia dsmedia deleted the claude/fix-dataset-source-urls-011CUWvJQkDNwVpAv1r8MCXS branch November 3, 2025 02:55