Commit 73b621d

Mark bad climate.nasa.gov redirects as 404
NASA went through a year-and-a-half-long transition of pages from `climate.nasa.gov` to `science.nasa.gov/climate-change` that concluded a couple of weeks ago. Unfortunately, when they finished, they started redirecting `climate.nasa.gov/*` to the new climate change home page instead of to the matching page on the new site, effectively turning a bunch of URLs into 404s. See also: edgi-govdata-archiving/web-monitoring-db#1306
1 parent d7a08a1 commit 73b621d

1 file changed: +9 -0 lines changed

analyst_sheets/analyze.py

Lines changed: 9 additions & 0 deletions
@@ -351,6 +351,15 @@ def get_version_status(version: dict) -> int:
     if redirects and redirects[-1].endswith('epa.gov/sites/production/files/signpost/cc.html'):
         return 404
 
+    # Special case for climate.nasa.gov getting moved with bad redirects for
+    # all the sub-pages (they all redirected to the new home page).
+    if (
+        redirects
+        and re.match(r'^https?://climate.nasa.gov/.+$', url, re.IGNORECASE)
+        and redirects[-1].endswith('://science.nasa.gov/climate-change/')
+    ):
+        return 404
+
     if version['title']:
         # Page titles are frequently formulated like "<title> | <site name>" or
         # "<title> | <site section> | <site name>" (order may also be reversed).
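
To make the effect of the new condition easier to see, here is a minimal standalone sketch of the same check. The helper name `is_bad_climate_nasa_redirect`, the module-level constants, and the example URLs are hypothetical and not part of the commit, which inlines the test in `get_version_status()`. The sketch also escapes the dots in the hostname, which the committed pattern does not.

```python
import re

# Hypothetical names for illustration; the commit inlines this logic.
CLIMATE_NASA_SUBPAGE = re.compile(r'^https?://climate\.nasa\.gov/.+$', re.IGNORECASE)
NEW_HOME_SUFFIX = '://science.nasa.gov/climate-change/'


def is_bad_climate_nasa_redirect(url: str, redirects: list) -> bool:
    """Return True when a climate.nasa.gov sub-page was redirected to the
    new climate change home page rather than to a matching page."""
    return bool(
        redirects                                    # the request was redirected at all
        and CLIMATE_NASA_SUBPAGE.match(url)          # original URL is a sub-page, not the root
        and redirects[-1].endswith(NEW_HOME_SUFFIX)  # final hop is the new home page
    )


# Illustrative URLs only (assumed paths, not taken from the commit).
subpage = 'https://climate.nasa.gov/evidence/'
home = 'https://science.nasa.gov/climate-change/'
print(is_bad_climate_nasa_redirect(subpage, [subpage, home]))
print(is_bad_climate_nasa_redirect('https://climate.nasa.gov/', [home]))
```

Because the pattern requires at least one character after the host's trailing slash, a redirect from the old root to the new home page is not flagged; only sub-pages that collapse onto the home page are treated as 404s.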
