Description
With the current architecture, I’m not sure this will be very easy, and I think it’s also relatively low priority because it’s intermittent.
Last month, we had a lot of requests to ehp.niehs.nih.gov blocked by Cloudflare (for more detail on these, see edgi-govdata-archiving/web-monitoring#189). It doesn’t appear that Browsertrix ever got past the Cloudflare challenge in this case. It is pretty easy to detect, though: a 403 status code and a `cf-mitigated: challenge` header. It might be nice to detect these bad captures after the fact (e.g. during import) and try crawling them one more time.
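For what it’s worth, the check itself is simple. Here’s a minimal sketch of the detection (not actual importer code; the method name and the shape of `record`, a hash with `status` and `headers` keys, are assumptions):

```ruby
# Sketch only: assumes the importer hands us a hash with the HTTP `status`
# and a `headers` hash for each capture.
def cloudflare_challenge?(record)
  headers = record['headers'] || {}
  # Header-name casing can vary by source, so compare case-insensitively.
  mitigated = headers.find { |name, _| name.to_s.downcase == 'cf-mitigated' }&.last
  record['status'].to_i == 403 && mitigated.to_s.downcase == 'challenge'
end

# A capture matching this during import could be skipped (or flagged) and its
# URL queued for one retry crawl instead of being stored as a good version.
```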
That said, the site where this was happening heavily no longer exists, and there are only a handful of similar examples I can find in the last 45 days. It has happened only a few times (most captures are good) for:
Query this in the Rails console with: `Version.where(capture_time: (Time.now - 45.days)..).where("headers->>'cf-mitigated' = 'challenge'").pluck(:uuid, :capture_time, :url)`
Anyway, the extreme rarity of this makes it feel pretty low priority. I suspect the problem is that EHP had put up a lot of notices about possible shutdowns, and it may have been getting crawled pretty heavily by lots of people for archival purposes, triggering stricter behavior from Cloudflare.