Backfill URL normalization and canonicalization #494

carlgieringer · 2023-08-04T16:18:40Z

#492 added URL normalization and the requesting of canonical URLs. We should backfill these procedures to existing URLs:

Update existing URLs to set url to normalizeUrl(url).
Decide whether normalizeUrl should remove a trailing slash (and if so, re-normalize the URLs).
Update existing URLs to reflect canonical_url_confirmations. Request the confirmation if none is present.
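The first backfill step above could be planned as a dry run before anything is written. This is a minimal sketch, not the project's script: the `UrlRow` shape, `planUrlBackfill`, and `exampleNormalize` are hypothetical names for illustration, and `exampleNormalize` is only a stand-in for the real `normalizeUrl`, whose exact behavior is not shown in this issue.

```typescript
// Hypothetical row shape; the real table and column names may differ.
interface UrlRow {
  id: number;
  url: string;
}

interface UrlUpdate {
  id: number;
  oldUrl: string;
  newUrl: string;
}

// Returns only the rows whose stored URL would change under `normalize`,
// so the backfill can be reviewed as a dry run before writing anything.
function planUrlBackfill(
  rows: UrlRow[],
  normalize: (url: string) => string
): UrlUpdate[] {
  const updates: UrlUpdate[] = [];
  for (const row of rows) {
    const newUrl = normalize(row.url);
    if (newUrl !== row.url) {
      updates.push({ id: row.id, oldUrl: row.url, newUrl });
    }
  }
  return updates;
}

// Stand-in normalizer (an assumption, not the project's normalizeUrl):
// the WHATWG URL parser lowercases the scheme and host and drops default ports.
const exampleNormalize = (url: string): string => new URL(url).href;

const plan = planUrlBackfill(
  [
    { id: 1, url: "HTTP://Example.com:80/a" },
    { id: 2, url: "https://example.com/b" },
  ],
  exampleNormalize
);
console.log(plan);
```

Keeping `oldUrl` in the plan makes it possible to log or archive the previous value before each update, which helps if the normalizer later turns out to be wrong.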

See also #496.


Also:

- Fix a bug that broke extracting the quotation from a text fragment link because it was normalized first
- Fix a bug that broke extracting the quotation from a link that had a document fragment
- Log an error rather than throwing when the UI constructs a text fragment link using a URL that already has a text fragment, since some persisted URLs have not been normalized (#494)
- Fix a bug that overwrote MediaExcerpt and UrlLocator entities (without their customizations) because we hadn't configured a MediaExcerpt basis type for our Justification normalization schema.

---------

Signed-off-by: Carl Gieringer <[email protected]>

* Write script backfilling URL normalization (#494)
* Add draft JustificationBasisCompound migration script and re-add basic compound-based support to DAOs.

---------

Signed-off-by: Carl Gieringer <[email protected]>

carlgieringer · 2023-09-03T19:28:26Z

When I backfilled URL-normalization, I did so with a version of normalizeUrl that always appended a slash to the path if it was missing. This normalized index.html to index.html/ which is not what we want. I had missed this caveat from https://en.wikipedia.org/wiki/URI_normalization#Normalization_process:

However, there is no way to know if a URI path component represents a directory or not. RFC 3986 notes that if the former URI redirects to the latter URI, then that is an indication that they are equivalent.

We should re-run URL normalization without this mistake. Before doing so, we should probably store both the original URL and the normalized URL, so that if normalization loses information we can still recover from bugs like this in the future.
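To illustrate the mistake and the proposed safeguard, here is a minimal sketch. It assumes nothing about the real `normalizeUrl`: `normalizeAppendingSlash` only reconstructs the behavior described above, `normalizePreservingPath` is a stand-in built on the WHATWG `URL` parser, and the `StoredUrl` shape is a hypothetical way to keep the original next to the derived value.

```typescript
// Hypothetical record shape: keep the submitted URL next to the derived one.
interface StoredUrl {
  url: string; // exactly as submitted; never rewritten
  normalizedUrl: string; // derived from `url`; safe to recompute after a fix
}

// Reconstruction of the mistake described above, for contrast only:
// a slash is appended to every path that lacks one.
function normalizeAppendingSlash(raw: string): string {
  const parsed = new URL(raw);
  if (!parsed.pathname.endsWith("/")) {
    parsed.pathname += "/";
  }
  return parsed.href;
}

// Stand-in normalizer that leaves non-empty paths alone. The URL parser
// still turns an empty http(s) path into "/", which RFC 3986 treats as
// equivalent, but it does not guess whether "index.html" is a directory.
function normalizePreservingPath(raw: string): string {
  return new URL(raw).href;
}

function toStoredUrl(raw: string): StoredUrl {
  return { url: raw, normalizedUrl: normalizePreservingPath(raw) };
}

console.log(normalizeAppendingSlash("https://example.com/index.html"));
// https://example.com/index.html/
console.log(toStoredUrl("https://Example.com/index.html"));
```

Because `url` is never overwritten, a faulty normalizer can be corrected by recomputing `normalizedUrl` from it rather than by trying to undo the damage.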

Fixed in #567

carlgieringer added the "data integrity" label (Ensure our data is valid and consistent) Aug 4, 2023

carlgieringer self-assigned this Aug 4, 2023

This was referenced Aug 4, 2023

Normalize URLs and confirm canonical URLs #492

Merged

Infer source description from popular ones associated with the URL #495

Merged

carlgieringer added this to Add appearances Aug 12, 2023

carlgieringer moved this to Todo in Add appearances Aug 12, 2023

carlgieringer mentioned this issue Aug 27, 2023

Add migration to remove writquotes #546

Merged

1 task

carlgieringer added this to the P0 milestone Sep 5, 2023
