Auto-update page URLs that have redirected for a while? #1290

@Mr0grog

We occasionally have issues where a page redirects for a while, and then the old URL eventually goes offline (403, 404, or the server is gone so there's a network error). In most cases, we just want to update the page’s canonical URL to the redirect target when this happens, and it would be nice not to need someone to do it manually.

Some ideas:

  1. Update the URL of pages that have consistently redirected to the same location for some period of time (2 weeks? a month?).

  2. Wait for a page to go offline (for >1 crawl?) and, if it previously redirected (repeatedly to the same URL for a while?), update its canonical URL. In this case, we can also update the old PageUrl record to have a to_time that indicates the old URL is no longer valid.
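
To make the two ideas concrete, here is a rough sketch of the checks they imply. Everything in it is an assumption for illustration: the `Observation` shape, the function names, the two-week default, and the two-crawl offline threshold are hypothetical and not taken from the existing crawl or DB code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List, Optional


@dataclass
class Observation:
    """One crawl result for a page's canonical URL (hypothetical shape)."""
    time: datetime
    status: Optional[int]           # final status code; None means a network error
    redirect_target: Optional[str]  # final URL if the request redirected, else None


def is_offline(obs: Observation) -> bool:
    # Treat network errors, 403, and 404 as "offline", matching the cases above.
    return obs.status is None or obs.status in (403, 404)


def stable_redirect_target(
    history: List[Observation],
    min_duration: timedelta = timedelta(weeks=2),
) -> Optional[str]:
    """Idea 1: return the redirect target if the most recent online
    observations all redirected to the same URL for at least `min_duration`.

    Simplification: offline observations are ignored entirely, even if they
    are interleaved with the redirects.
    """
    online = [o for o in sorted(history, key=lambda o: o.time) if not is_offline(o)]
    if not online or online[-1].redirect_target is None:
        return None
    target = online[-1].redirect_target
    run_start = online[-1].time
    # Walk backwards to find where the current run of identical targets began.
    for obs in reversed(online):
        if obs.redirect_target != target:
            break
        run_start = obs.time
    if online[-1].time - run_start >= min_duration:
        return target
    return None


def target_after_going_offline(
    history: List[Observation],
    min_offline_crawls: int = 2,
    min_duration: timedelta = timedelta(weeks=2),
) -> Optional[str]:
    """Idea 2: only suggest a new canonical URL once the page has been
    offline for the last `min_offline_crawls` crawls and had a stable
    redirect before that."""
    ordered = sorted(history, key=lambda o: o.time)
    offline_count = 0
    for obs in reversed(ordered):
        if not is_offline(obs):
            break
        offline_count += 1
    if offline_count < min_offline_crawls:
        return None
    return stable_redirect_target(ordered, min_duration)
```

Idea 1 would act as soon as `stable_redirect_target` returns a URL, while idea 2 additionally waits for the trailing offline crawls, which is where the missing-capture problem comes from.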

(2) is a really nice feature, but it means we’ll have one crawl record where we don’t get any data (because that crawl will have just seen it as offline). This gets complex, but maybe a nice way to combine these is:

  • For pages that have redirected consistently for a while, add the redirect target as a new alternative URL.
  • The crawl can add both the old canonical URL and the redirect target as seeds, so if the old canonical URL fails, we still get the good capture from the redirect target. Since the DB knows it's a valid alternative, both captures will get associated with the page.
  • Then we need some fancy stuff either in the crawl import or in the DB to see this situation and:
    • Switch the canonical URL.
    • Mark the old canonical URL as having a to_time so we have a record of it no longer being valid.
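
The combined flow in the bullets above could look roughly like the following in-memory sketch. It is only an illustration under assumptions: `Page`, `add_alternative`, `crawl_seeds`, and `reconcile_after_crawl` are hypothetical names, and the `PageUrl` class here is a simplified stand-in that borrows only the `to_time` idea from the real record.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class PageUrl:
    """Simplified stand-in for a PageUrl record (hypothetical fields)."""
    url: str
    from_time: Optional[datetime] = None
    to_time: Optional[datetime] = None  # set once the URL is no longer valid


@dataclass
class Page:
    canonical_url: str
    urls: List[PageUrl] = field(default_factory=list)

    def find_url(self, url: str) -> Optional[PageUrl]:
        return next((u for u in self.urls if u.url == url), None)


def add_alternative(page: Page, target: str, seen_at: datetime) -> None:
    # Step 1: record a consistently seen redirect target as an alternative URL.
    if page.find_url(target) is None:
        page.urls.append(PageUrl(target, from_time=seen_at))


def crawl_seeds(page: Page) -> List[str]:
    # Step 2: seed the canonical URL first, then every still-valid alternative.
    seeds = [page.canonical_url]
    seeds += [
        u.url for u in page.urls
        if u.to_time is None and u.url != page.canonical_url
    ]
    return seeds


def reconcile_after_crawl(
    page: Page, results: Dict[str, bool], crawled_at: datetime
) -> bool:
    """Step 3: `results` maps each seed URL to True if it was captured
    successfully. If the canonical URL failed but an alternative worked,
    switch the canonical URL and close out the old one. Returns True if
    the page was changed."""
    old = page.canonical_url
    if results.get(old, False):
        return False
    replacement = next(
        (url for url in crawl_seeds(page)[1:] if results.get(url, False)), None
    )
    if replacement is None:
        return False
    page.canonical_url = replacement
    old_record = page.find_url(old)
    if old_record is not None:
        old_record.to_time = crawled_at
    return True
```

This version switches after a single failed crawl of the canonical URL; requiring several consecutive failures first would be a small extension and is probably safer against temporary outages.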

That’s pretty complex, though! Not sure if worthwhile.
