Detect and retry bad captures due to Cloudflare challenges? #32

@Mr0grog

With the current architecture, I’m not sure this will be very easy, and I think it’s also relatively low priority because it’s intermittent.

Last month, we had a lot of requests to ehp.niehs.nih.gov blocked by Cloudflare (for more detail on these, see edgi-govdata-archiving/web-monitoring#189). It doesn't appear that Browsertrix ever got past the Cloudflare challenge in this case. These bad captures are pretty easy to detect, though: the response has a 403 status code and a `cf-mitigated: challenge` header. It might be nice to detect them after the fact (e.g. during import) and try crawling them one more time.
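For reference, the check itself would be simple. Here's a minimal sketch in Ruby (the method name and call sites are hypothetical, not existing import code):

```ruby
# Hypothetical helper: decide whether a capture is actually a Cloudflare
# challenge page rather than real content, using the status code and
# response headers we already record for each version.
def cloudflare_challenge?(status, headers)
  # Header names are case-insensitive, so normalize before comparing.
  normalized = headers.transform_keys { |key| key.to_s.downcase }
  status == 403 && normalized['cf-mitigated'] == 'challenge'
end

cloudflare_challenge?(403, { 'CF-Mitigated' => 'challenge' })  # => true
cloudflare_challenge?(200, {})                                 # => false
```

A version flagged this way could then be queued for one more crawl attempt instead of being imported as-is.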

That said, the site where this was happening most heavily no longer exists, and I can only find a handful of similar examples from the last 45 days. It has happened only a few times for other URLs (most captures are good).

You can find the affected versions in the Rails console with: `Version.where(capture_time: (Time.now - 45.days)..).where("headers->>'cf-mitigated' = 'challenge'").pluck(:uuid, :capture_time, :url)`
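Spelled out with comments (this assumes the `headers` column is JSON/JSONB, which is what the `->>` operator implies):

```ruby
# Versions captured in the last 45 days whose recorded response headers
# include Cloudflare's "cf-mitigated: challenge" marker.
challenged = Version
  .where(capture_time: (Time.now - 45.days)..)       # endless range: from 45 days ago onward
  .where("headers->>'cf-mitigated' = 'challenge'")   # JSON text lookup on the headers column

# Returns an array of [uuid, capture_time, url] triples for review.
challenged.pluck(:uuid, :capture_time, :url)
```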

Anyway, the extreme rarity of this makes it feel pretty low priority. I suspect the problem was that EHP had put up a lot of notices about possible shutdowns, so it may have been getting crawled pretty heavily by lots of people for archival purposes, triggering stricter behavior from Cloudflare.
