This repository is the home of a suite of tools EDGI's Website Governance Project uses to monitor changes to government websites, both environment-related and otherwise. EDGI uses these tools to publish reports that are written about in major publications such as The Atlantic or Vice. Teams at other organizations use parts of this project for similar purposes or to provide comparisons between different versions of public web pages.
While there are a wide variety of commercial services for monitoring web pages, none of them worked well for EDGI at this project's scale: tracking thousands or tens of thousands of pages. These tools perform tasks like:
- Loading, storing, and analyzing historical snapshots of web pages.
- Providing an API for retrieving and updating data about those snapshots.
- Providing a website for visualizing and browsing changes between those snapshots.
- Organizing the workflow and processes of a team of human analysts who use the above tools to track and publicize information about meaningful changes to government websites.
ℹ️ The project is broken up into a variety of smaller tools in different repositories (see “project structure”). For a combined view of all issues and status, check the project board. This repository is for project-wide documentation and issues.
- Project Structure
- Get Involved
- Project Overview
- Code of Conduct
- Contributors & Sponsors
- License & Copyright
The technical tooling for Web Monitoring is broken up into several repositories, each named web-monitoring-{name}:
| Repo | Description | Tools Used |
|---|---|---|
| web-monitoring | (This Repo!) Project-wide documentation and issue tracking. | Markdown |
| web-monitoring-db | A database and API that stores metadata about the pages, versions, and changes we track, as well as human annotations about those changes. | Ruby, Rails, PostgreSQL |
| web-monitoring-ui | A web-based UI (built in React) that shows diffs between different versions of the pages we track. It’s built on the API provided by web-monitoring-db. | JavaScript, React |
| web-monitoring-processing | Python-based tools for importing data and for extracting and analyzing data in our database of monitored pages and changes. | Python |
| web-monitoring-diff | Algorithms for diffing web pages in a variety of ways and a web server for providing those diffs via an HTTP API. | Python, Tornado |
| web-monitoring-task-sheets | Analyzes changes stored in web-monitoring-db and generates filtered, prioritized spreadsheets that human analysts use to plan their work. | Python |
| web-monitoring-crawler | Captures copies of pages that EDGI monitors and stores them in web-monitoring-db and the Internet Archive. | Python, Docker |
| web-monitoring-ops | Server configuration and other deployment information for managing EDGI’s live instance of all these tools. | Kubernetes, Bash, AWS |
| wayback | A Python API to the Internet Archive’s Wayback Machine. It gives you tools to search for and load mementos (historical copies of web pages). | Python |
For more on how all these parts fit together, see ARCHITECTURE.md.
We’d love your help improving this project! If you are interested in getting involved…
- Please follow EDGI's Code of Conduct
- Join EDGI by filling out the volunteer form at http://envirodatagov.org/volunteer/. As a member, you can be more involved in our overall process or contribute to work beyond just the code.
This project is two-part! We rely both on open source code contributors (building this tool) and on volunteer analysts who use the tool to identify and characterize changes to government websites.
- Read through the Project Overview and especially the section on "meaningful changes" to get a better idea of the work.
- Fill out the volunteer form at http://envirodatagov.org/volunteer/.
- Be sure to check our contributor guidelines.
- Take a look through the repos listed in the Project Structure section and choose one that feels appropriate to your interests and skillset.
- Try to get a repo running on your machine (and if you have any challenges, please make issues about them!).
- Find an issue labeled `good-first-issue` and work to resolve it.
The purpose of the system is to enable analysts to quickly review monitored government websites and report on meaningful changes. To do that, the system (a.k.a. "Scanner") performs several major tasks:
- Interfaces with other archival services (like the Internet Archive) to save snapshots of web pages.
- Imports those snapshots and other metadata from archival sources.
- Determines which snapshots represent a change from a previous version of the page.
- Processes changes to automatically assign a priority or sift out meaningful changes for deeper analysis by humans.
- Volunteers and experts work together to further sift out meaningful changes and qualify them for journalists by writing reports.
- Journalists build narratives and amplify stories for the wider public.
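As a rough illustration of the change-detection step above, a new snapshot can be compared against the previous version of the same page by hashing a normalized copy of the content. This is a minimal sketch under simplified assumptions; the actual pipeline in web-monitoring-processing is considerably more sophisticated:

```python
import hashlib

def normalize(html: str) -> str:
    """Collapse whitespace so cosmetic reflows don't count as changes."""
    return " ".join(html.split())

def body_hash(html: str) -> str:
    """Stable fingerprint of a snapshot's normalized content."""
    return hashlib.sha256(normalize(html).encode("utf-8")).hexdigest()

def is_changed(previous_html: str, current_html: str) -> bool:
    """True if the new snapshot differs from the previous version."""
    return body_hash(previous_html) != body_hash(current_html)

old = "<p>Climate  data portal</p>"
new = "<p>Climate data portal</p>"  # only whitespace differs
print(is_changed(old, new))  # → False: whitespace-only edits are ignored
```

Storing only the hash alongside each version's metadata makes "did anything change?" a cheap lookup rather than a full-page comparison.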
The majority of changes to web pages are not relevant, and we want to avoid presenting those irrelevant changes to human analysts. Identifying irrelevant changes automatically is not easy, and we expect that analysts will always be involved in deciding whether a given change is "important." However, as we expand the number of web pages we monitor, we will need tools that reduce the number of pages analysts must review.
Some examples of meaningless changes:
- It's not unusual for a page to have a view counter at the bottom. In that case, the page changes, by definition, every time it is viewed.
- Many sites have "content sliders" or news feeds that update periodically. Such a change may be "meaningful," in that it's interesting to see news updates, but it's only interesting once, not (as is sometimes seen) 1,000 or 10,000 times.
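A first pass at suppressing this kind of noise could strip known-noisy fragments (view counters, auto-generated timestamps) before comparing versions. The patterns below are hypothetical examples, not the project's actual filtering logic:

```python
import re

# Hypothetical patterns for content that changes on every page load.
NOISY_PATTERNS = [
    re.compile(r"Views:\s*\d+"),      # view counters
    re.compile(r"Last updated:.*"),   # auto-generated timestamps
]

def strip_noise(text: str) -> str:
    """Remove fragments that change on every load before diffing."""
    for pattern in NOISY_PATTERNS:
        text = pattern.sub("", text)
    return text

old = "Welcome to the portal. Views: 1041"
new = "Welcome to the portal. Views: 1042"
print(strip_noise(old) == strip_noise(new))  # → True: only the counter changed
```

A pattern list like this has to be curated per site, which is part of why fully automating "is this change meaningful?" is hard.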
An example of a meaningful change:
- In February, we noticed a systematic replacement of the word "impact" with the word "effect" across one website. This change is very interesting because, while "impact" and "effect" have similar meanings, "impact" is the stronger word, so the replacement suggests a deliberate effort to weaken the language on existing sites. Part of our question is: what tools would we need for this kind of change to be flagged automatically and presented to an analyst as potentially interesting?
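One way such a pattern could be surfaced automatically is to count repeated word-level substitutions in a diff and flag pairs that recur. This is a hedged sketch of the idea, not a tool the project actually ships:

```python
from collections import Counter
from difflib import SequenceMatcher

def replaced_word_pairs(old_text: str, new_text: str) -> Counter:
    """Count (old word, new word) substitutions between two versions."""
    old_words, new_words = old_text.split(), new_text.split()
    pairs = Counter()
    matcher = SequenceMatcher(a=old_words, b=new_words)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only count one-for-one word swaps, not inserts or deletes.
        if op == "replace" and (i2 - i1) == (j2 - j1):
            for a, b in zip(old_words[i1:i2], new_words[j1:j2]):
                pairs[(a, b)] += 1
    return pairs

old = "climate impact on rivers and the impact on air quality"
new = "climate effect on rivers and the effect on air quality"
repeated = {pair: n for pair, n in replaced_word_pairs(old, new).items() if n > 1}
print(repeated)  # the recurring ("impact", "effect") swap stands out for review
```

A substitution that recurs across many pages of the same site is a much stronger signal than any single edit, which is why counting pairs (rather than inspecting individual diffs) is the interesting move here.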
The example-data folder contains examples of website changes to use for analysis.
This repository falls under EDGI's Code of Conduct.
This project wouldn’t exist without a lot of amazing people’s help. Thanks to the following for their work reviewing URLs, monitoring changes, writing reports, and a slew of other things!
| Contributions | Name |
|---|---|
| 🔢 | Chris Amoss |
| 🔢 📋 🤔 | Maya Anjur-Dietrich |
| 🔢 | Marcy Beck |
| 🔢 📋 🤔 | Andrew Bergman |
| 📖 | Kelsey Breseman |
| 🔢 | Madelaine Britt |
| 🔢 | Ed Byrne |
| 🔢 | Morgan Currie |
| 🔢 | Justin Derry |
| 🔢 📋 🤔 | Gretchen Gehrke |
| 🔢 | Jon Gobeil |
| 🔢 | Pamela Jao |
| 🔢 | Sara Johns |
| 🔢 | Abby Klionski |
| 🔢 | Katherine Kulik |
| 🔢 | Aaron Lamelin |
| 🔢 📋 🤔 | Rebecca Lave |
| 🔢 | Eric Nost |
| 📖 | Karna Patel |
| 🔢 | Lindsay Poirier |
| 🔢 📋 🤔 | Toly Rinberg |
| 🔢 | Justin Schell |
| 🔢 | Lauren Scott |
| 🤔 🔍 | Nick Shapiro |
| 🔢 | Miranda Sinnott-Armstrong |
| 🔢 | Julia Upfal |
| 🔢 | Tyler Wedrosky |
| 🔢 | Adam Wizon |
| 🔢 | Jacob Wylie |
(For a key to the contribution emoji or more info on this format, check out “All Contributors.”)
Finally, we want to give a huge thanks to partner organizations that have helped to support this project with their tools and services:
- The David and Lucile Packard Foundation
- Doris Duke Charitable Foundation
- Amazon Web Services
- Sentry.io
- PageFreezer
- Google Cloud Platform
- Google Summer of Code
- DataKind
- The Internet Archive
Copyright (C) 2017-2025 Environmental Data and Governance Initiative (EDGI)
See the LICENSE file for details.
Software code in other Web Monitoring repositories is generally licensed under the GPL v3 license, but make sure to check each repository’s README for specifics.