We occasionally have issues where a page redirects for a while, and then the old URL eventually goes offline (403, 404, or a network error because the server is gone). In most cases, when this happens we just want to update the page’s canonical URL to the redirect target, and it would be nice not to need someone to do that manually.
Some ideas:
1. Update the URL of pages that have consistently redirected to the same location for some period of time (2 weeks? a month?).
2. Wait for a page to go offline (for >1 crawl?) and, if it previously redirected (repeatedly to same URL for a while?), update its canonical URL. In this case, we can also update the old `PageUrl` record to have a `to_time` that indicates the old URL is no longer valid.
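Both ideas depend on deciding whether a page has "consistently redirected to the same location" for long enough. Here is a minimal Python sketch of that check. The `Capture` shape, the `stable_redirect_target` name, and the two-week default are all assumptions for illustration, not the real schema or import code.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional, Sequence


@dataclass
class Capture:
    """One crawl result for a page (hypothetical shape, not the real schema)."""
    capture_time: datetime
    requested_url: str
    # URL the response finally came from; None if the request failed
    # (403, 404, network error, ...).
    final_url: Optional[str]


def stable_redirect_target(
    captures: Sequence[Capture],
    min_duration: timedelta = timedelta(weeks=2),  # assumed threshold
) -> Optional[str]:
    """Return the redirect target if the most recent successful captures all
    redirected to the same URL for at least `min_duration`, else None."""
    successful = sorted(
        (c for c in captures if c.final_url is not None),
        key=lambda c: c.capture_time,
    )
    if not successful:
        return None

    latest = successful[-1]
    target = latest.final_url
    if target == latest.requested_url:
        return None  # the latest successful capture was not a redirect

    # Walk backwards while captures keep landing on the same target.
    first_seen = latest.capture_time
    for capture in reversed(successful):
        if capture.final_url != target:
            break
        first_seen = capture.capture_time

    if latest.capture_time - first_seen >= min_duration:
        return target
    return None
```

Failed captures are ignored, so the same check works for idea 2: once the old URL goes offline, the function still reports where it had been redirecting before.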
(2) is a really nice feature, but it means we’ll have one crawl record where we don’t get any data (because that crawl will have just seen it as offline). This gets complex, but maybe a nice way to combine these is:
- For pages that have redirected consistently for a while, add the redirect target as a new alternative URL.
- The crawl can add both the old canonical URL and the redirect target as seeds, so if the old canonical URL fails, we still get the good capture from the redirect target. Since the DB knows it’s a valid alternative, both captures will get associated with the page.
- Then we need some fancy stuff either in the crawl import or in the DB to see this situation and:
  - Switch the canonical URL.
  - Mark the old canonical URL as having a `to_time` so we have a record of it no longer being valid.
That’s pretty complex, though! Not sure if worthwhile.
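To get a feel for how much logic the combined approach involves, here is a rough in-memory sketch in Python. The `Page` and `PageUrl` classes only mimic the records mentioned above, and every function name is hypothetical. A real version would live in the crawl import or the DB, and would probably want to see more than one failed crawl before switching.

```python
from dataclasses import dataclass, field
from datetime import datetime
from typing import List, Optional, Set


@dataclass
class PageUrl:
    """Stand-in for a PageUrl record (hypothetical, simplified)."""
    url: str
    to_time: Optional[datetime] = None  # set once the URL is no longer valid


@dataclass
class Page:
    """Stand-in for a page with a canonical URL and known alternative URLs."""
    canonical_url: str
    urls: List[PageUrl] = field(default_factory=list)

    def find_url(self, url: str) -> Optional[PageUrl]:
        return next((u for u in self.urls if u.url == url), None)


def add_alternative_url(page: Page, target: str) -> None:
    """Step 1: record a stable redirect target as an alternative URL."""
    if page.find_url(target) is None:
        page.urls.append(PageUrl(url=target))


def crawl_seeds(page: Page) -> List[str]:
    """Step 2: seed the crawl with every URL that is still considered valid."""
    return [u.url for u in page.urls if u.to_time is None]


def handle_crawl_results(
    page: Page, failed: Set[str], succeeded: Set[str], crawl_time: datetime
) -> None:
    """Step 3: if the canonical URL failed but a valid alternative succeeded,
    switch the canonical URL and close out the old one with a to_time."""
    if page.canonical_url not in failed:
        return
    replacement = next(
        (
            u for u in page.urls
            if u.to_time is None
            and u.url in succeeded
            and u.url != page.canonical_url
        ),
        None,
    )
    if replacement is None:
        return  # nothing good to switch to; leave the page alone
    old = page.find_url(page.canonical_url)
    if old is not None:
        old.to_time = crawl_time
    page.canonical_url = replacement.url
```

Because the redirect target is already a seed, the crawl that first sees the old URL fail still produces a usable capture for the page, which is the gap in idea (2) on its own.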