This repository is the home of a suite of tools EDGI's Website Governance Project uses to monitor changes to government websites, both environment-related and otherwise. EDGI uses these tools to publish reports that are written about in major publications such as The Atlantic or Vice. Teams at other organizations use parts of this project for similar purposes or to provide comparisons between different versions of public web pages.
While there are a wide variety of commercial services for monitoring web pages, none of them worked well for EDGI at this project's scale: tracking thousands or tens of thousands of pages. These tools perform tasks like:
- Loading, storing, and analyzing historical snapshots of web pages.
- Providing an API for retrieving and updating data about those snapshots.
- Providing a website for visualizing and browsing changes between those snapshots.
- Organizing the workflow and processes of a team of human analysts who use the above tools to track and publicize information about meaningful changes to government websites.
ℹ️ The project is broken up into a variety of smaller tools in different repositories (see “project structure”). For a combined view of all issues and status, check the project board. This repository is for project-wide documentation and issues.
- Project Structure
- Get Involved
- Project Overview
- Code of Conduct
- Contributors & Sponsors
- License & Copyright
The technical tooling for Web Monitoring is broken up into several repositories, each named web-monitoring-{name}:
| Repo | Description | Tools Used |
|---|---|---|
| web-monitoring | (This Repo!) Project-wide documentation and issue tracking. | Markdown |
| web-monitoring-db | A database and API that stores metadata about the pages, versions, and changes we track, as well as human annotations about those changes. | Ruby, Rails, PostgreSQL |
| web-monitoring-ui | A web-based UI (built in React) that shows diffs between different versions of the pages we track. It’s built on the API provided by web-monitoring-db. | JavaScript, React |
| web-monitoring-processing | Python-based tools for importing data and for extracting and analyzing data in our database of monitored pages and changes. | Python |
| web-monitoring-diff | Algorithms for diffing web pages in a variety of ways and a web server for providing those diffs via an HTTP API. | Python, Tornado |
| web-monitoring-task-sheets | Analyzes changes stored in web-monitoring-db and generates filtered, prioritized spreadsheets that human analysts use to plan their work. | Python |
| web-monitoring-crawler | Captures copies of pages that EDGI monitors and stores them in web-monitoring-db and the Internet Archive. | Python, Docker |
| web-monitoring-ops | Server configuration and other deployment information for managing EDGI’s live instance of all these tools. | Kubernetes, Bash, AWS |
| wayback | A Python API to the Internet Archive’s Wayback Machine. It gives you tools to search for and load mementos (historical copies of web pages). | Python |
For more on how all these parts fit together, see ARCHITECTURE.md.
We’d love your help improving this project! If you are interested in getting involved…
- Please follow EDGI's Code of Conduct
- Join EDGI by filling out the volunteer form at http://envirodatagov.org/volunteer/. As a member, you can be more involved in our overall process or contribute to work beyond just the code.
This project is two-part! We rely both on open source code contributors (building this tool) and on volunteer analysts who use the tool to identify and characterize changes to government websites.
- Read through the Project Overview and especially the section on "meaningful changes" to get a better idea of the work.
- Fill out the volunteer form at http://envirodatagov.org/volunteer/.
- Be sure to check our contributor guidelines.
- Take a look through the repos listed in the Project Structure section and choose one that feels appropriate to your interests and skillset.
- Try to get a repo running on your machine (and if you have any challenges, please make issues about them!).
- Find an issue labeled `good-first-issue` and work to resolve it.
The purpose of the system is to enable analysts to quickly review monitored government websites and report on meaningful changes. To do that, the system (a.k.a. "Scanner") performs several major tasks:
- Interfaces with other archival services (like the Internet Archive) to save snapshots of web pages.
- Imports those snapshots and other metadata from archival sources.
- Determines which snapshots represent a change from a previous version of the page.
- Processes changes to automatically assign a priority or sift out meaningful changes for deeper analysis by humans.
- Volunteers and experts work together to further sift out meaningful changes and qualify them for journalists by writing reports.
- Journalists build narratives and amplify stories for the wider public.
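As a rough illustration of the change-detection step above, a new snapshot can be compared against the previous version of the same page by hashing a normalized copy of the content. This is a minimal sketch under simplified assumptions; the actual pipeline in web-monitoring-processing is considerably more sophisticated:

```python
import hashlib

def normalize(html: str) -> str:
    """Collapse whitespace so cosmetic reflows don't count as changes."""
    return " ".join(html.split())

def body_hash(html: str) -> str:
    """Stable fingerprint of a snapshot's normalized content."""
    return hashlib.sha256(normalize(html).encode("utf-8")).hexdigest()

def is_changed(previous_html: str, current_html: str) -> bool:
    """True if the new snapshot differs from the previous version."""
    return body_hash(previous_html) != body_hash(current_html)

old = "<p>Climate  data portal</p>"
new = "<p>Climate data portal</p>"  # only whitespace differs
print(is_changed(old, new))  # → False: whitespace-only edits are ignored
```

Storing only the hash alongside each version's metadata makes "did anything change?" a cheap lookup rather than a full-page comparison.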
The majority of changes to web pages are not relevant, and we want to avoid presenting those irrelevant changes to human analysts. Identifying irrelevant changes automatically is not easy, and we expect that analysts will always be involved in deciding whether a given change is "important." However, as we expand the number of web pages we monitor, we will need tools that reduce the number of pages analysts must review.
Some examples of meaningless changes:
- It's not unusual for a page to have a view counter at the bottom. In that case, the page changes, by definition, every time it is viewed.
- Many sites have "content sliders" or news feeds that update periodically. Such a change may be "meaningful," in that it's interesting to see news updates, but it's only interesting once, not (as is sometimes seen) 1,000 or 10,000 times.
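A first pass at suppressing this kind of noise could strip known-noisy fragments (view counters, auto-generated timestamps) before comparing versions. The patterns below are hypothetical examples, not the project's actual filtering logic:

```python
import re

# Hypothetical patterns for content that changes on every page load.
NOISY_PATTERNS = [
    re.compile(r"Views:\s*\d+"),      # view counters
    re.compile(r"Last updated:.*"),   # auto-generated timestamps
]

def strip_noise(text: str) -> str:
    """Remove fragments that change on every load before diffing."""
    for pattern in NOISY_PATTERNS:
        text = pattern.sub("", text)
    return text

old = "Welcome to the portal. Views: 1041"
new = "Welcome to the portal. Views: 1042"
print(strip_noise(old) == strip_noise(new))  # → True: only the counter changed
```

A pattern list like this has to be curated per site, which is part of why fully automating "is this change meaningful?" is hard.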
An example of a meaningful change:
- In February, we noticed a systematic replacement of the word "impact" with the word "effect" across one website. This change is very interesting because, while "impact" and "effect" have similar meanings, "impact" is the stronger word, so the replacement suggests a deliberate effort to weaken the language on existing sites. Part of our question is: what tools would we need for this kind of change to be flagged automatically and presented to an analyst as potentially interesting?
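One way such a pattern could be surfaced automatically is to count repeated word-level substitutions in a diff and flag pairs that recur. This is a hedged sketch of the idea, not a tool the project actually ships:

```python
from collections import Counter
from difflib import SequenceMatcher

def replaced_word_pairs(old_text: str, new_text: str) -> Counter:
    """Count (old word, new word) substitutions between two versions."""
    old_words, new_words = old_text.split(), new_text.split()
    pairs = Counter()
    matcher = SequenceMatcher(a=old_words, b=new_words)
    for op, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only count one-for-one word swaps, not inserts or deletes.
        if op == "replace" and (i2 - i1) == (j2 - j1):
            for a, b in zip(old_words[i1:i2], new_words[j1:j2]):
                pairs[(a, b)] += 1
    return pairs

old = "climate impact on rivers and the impact on air quality"
new = "climate effect on rivers and the effect on air quality"
repeated = {pair: n for pair, n in replaced_word_pairs(old, new).items() if n > 1}
print(repeated)  # the recurring ("impact", "effect") swap stands out for review
```

A substitution that recurs across many pages of the same site is a much stronger signal than any single edit, which is why counting pairs (rather than inspecting individual diffs) is the interesting move here.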
The example-data folder contains examples of website changes to use for analysis.
This repository falls under EDGI's Code of Conduct.
This project wouldn’t exist without a lot of amazing people’s help. Thanks to the following for their work reviewing URLs, monitoring changes, writing reports, and a slew of other things!
| Contributions | Name |
|---|---|
| 🔢 | Chris Amoss |
| 🔢 📋 🤔 | Maya Anjur-Dietrich |
| 🔢 | Marcy Beck |
| 🔢 📋 🤔 | Andrew Bergman |
| 📖 | Kelsey Breseman |
| 🔢 | Madelaine Britt |
| 🔢 | Ed Byrne |
| 🔢 | Morgan Currie |
| 🔢 | Justin Derry |
| 🔢 📋 🤔 | Gretchen Gehrke |
| 🔢 | Jon Gobeil |
| 🔢 | Pamela Jao |
| 🔢 | Sara Johns |
| 🔢 | Abby Klionski |
| 🔢 | Katherine Kulik |
| 🔢 | Aaron Lamelin |
| 🔢 📋 🤔 | Rebecca Lave |
| 🔢 | Eric Nost |
| 📖 | Karna Patel |
| 🔢 | Lindsay Poirier |
| 🔢 📋 🤔 | Toly Rinberg |
| 🔢 | Justin Schell |
| 🔢 | Lauren Scott |
| 🤔 🔍 | Nick Shapiro |
| 🔢 | Miranda Sinnott-Armstrong |
| 🔢 | Julia Upfal |
| 🔢 | Tyler Wedrosky |
| 🔢 | Adam Wizon |
| 🔢 | Jacob Wylie |
(For a key to the contribution emoji or more info on this format, check out “All Contributors.”)
Finally, we want to give a huge thanks to partner organizations that have helped to support this project with their tools and services:
- The David and Lucile Packard Foundation
- Doris Duke Charitable Foundation
- Amazon Web Services
- Sentry.io
- PageFreezer
- Google Cloud Platform
- Google Summer of Code
- DataKind
- The Internet Archive
Copyright (C) 2017-2025 Environmental Data and Governance Initiative (EDGI)
See the LICENSE file for details.
Software code in other Web Monitoring repositories is generally licensed under the GPL v3 license, but make sure to check each repository’s README for specifics.