Description
With the current architecture, I’m not sure this will be very easy, and I think it’s also relatively low priority because it’s intermittent.
Last month, we had a lot of requests to ehp.niehs.nih.gov blocked by Cloudflare (for more detail on these, see edgi-govdata-archiving/web-monitoring#189). It doesn’t appear that Browsertrix ever got past the Cloudflare challenge in this case. It is pretty easy to detect, though: a 403 status code and a `cf-mitigated: challenge` header. It might be nice to detect these bad captures after the fact (e.g. during import) and try crawling them one more time.
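For what it’s worth, the check itself is simple. Here’s a minimal sketch of the detection (not actual importer code; the method name and the shape of `record`, a hash with `status` and `headers` keys, are assumptions):

```ruby
# Sketch only: assumes the importer hands us a hash with the HTTP `status`
# and a `headers` hash for each capture.
def cloudflare_challenge?(record)
  headers = record['headers'] || {}
  # Header-name casing can vary by source, so compare case-insensitively.
  mitigated = headers.find { |name, _| name.to_s.downcase == 'cf-mitigated' }&.last
  record['status'].to_i == 403 && mitigated.to_s.downcase == 'challenge'
end

# A capture matching this during import could be skipped (or flagged) and its
# URL queued for one retry crawl instead of being stored as a good version.
```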
That said, the site where this was happening heavily no longer exists, and there are only a handful of similar examples I can find in the last 45 days. It has happened only a few times (most captures are good) for:
Query this in the Rails console with: `Version.where(capture_time: (Time.now - 45.days)..).where("headers->>'cf-mitigated' = 'challenge'").pluck(:uuid, :capture_time, :url)`
Anyway, the extreme rarity of this makes it feel pretty low priority. I suspect the problem is that EHP had put up a lot of notices about possible shutdowns, and it may have been getting crawled pretty heavily by lots of people for archival purposes, triggering stricter behavior from Cloudflare.