Backfill URL normalization and canonicalization #494

Open
1 of 3 tasks
carlgieringer opened this issue Aug 4, 2023 · 1 comment

carlgieringer commented Aug 4, 2023

#492 added URL normalization and the requesting of canonical URLs. We should backfill these procedures to existing URLs:

  • Update existing URLs to set url to normalizeUrl(url).
  • Decide whether normalizeUrl should remove a trailing slash (and, if so, re-normalize existing URLs).
  • Update existing URLs to reflect canonical_url_confirmations. Request the confirmation if none is present.
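
The first task above could be sketched roughly as follows. This is only an illustration: the `UrlRow` shape, the `planUrlBackfill` helper, and the stand-in `normalizeUrl` (which just re-serializes through the standard `URL` class) are assumptions made for the example, not the project's actual schema or implementation.

```typescript
// Hypothetical row shape; the real table's columns are not shown in this issue.
interface UrlRow {
  id: number;
  url: string;
}

// One planned change: which row to update and what it changes from/to.
interface UrlUpdate {
  id: number;
  oldUrl: string;
  newUrl: string;
}

// Stand-in for the project's normalizeUrl (an assumption). Re-serializing
// through the WHATWG URL class lowercases the scheme and host and drops
// default ports such as :80 for http.
function normalizeUrl(url: string): string {
  return new URL(url).toString();
}

// Compute the updates a backfill would apply, without touching storage.
// Rows that are already normalized produce no update; rows that cannot be
// parsed are skipped so they can be reviewed by hand.
function planUrlBackfill(
  rows: UrlRow[],
  normalize: (url: string) => string = normalizeUrl
): UrlUpdate[] {
  const updates: UrlUpdate[] = [];
  for (const row of rows) {
    let newUrl: string;
    try {
      newUrl = normalize(row.url);
    } catch {
      continue; // unparseable URL: leave the row untouched
    }
    if (newUrl !== row.url) {
      updates.push({ id: row.id, oldUrl: row.url, newUrl });
    }
  }
  return updates;
}
```

Planning the updates separately from applying them keeps the old value alongside the new one, which makes a dry run or a later rollback easier.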

See also #496.

@carlgieringer carlgieringer added the data integrity Ensure our data is valid and consistent label Aug 4, 2023
@carlgieringer carlgieringer self-assigned this Aug 4, 2023
carlgieringer added a commit that referenced this issue Aug 4, 2023
)

Also:

- Fix a bug that broke extracting the quotation from a text fragment link because it was normalized first
- Fix a bug that broke extracting the quotation from a link that had a document fragment
- Log an error rather than throwing when the UI constructs a text fragment link using a URL that already has a text fragment, since some persisted URLs have not been normalized (#494)
- Fix a bug that overwrote MediaExcerpt and UrlLocator entities (without their customizations) because we hadn't configured a MediaExcerpt basis type for our Justification normalization schema.

---------

Signed-off-by: Carl Gieringer <[email protected]>
carlgieringer added a commit that referenced this issue Aug 27, 2023
* Write script backfilling URL normalization (#494)
* Add draft JustificationBasisCompound migration script and re-add basic compound-based support to DAOs.

---------

Signed-off-by: Carl Gieringer <[email protected]>

carlgieringer commented Sep 3, 2023

When I backfilled URL normalization, I did so with a version of normalizeUrl that always appended a slash to the path if one was missing. This normalized index.html to index.html/, which is not what we want. I had missed this caveat from https://en.wikipedia.org/wiki/URI_normalization#Normalization_process:

However, there is no way to know if a URI path component represents a directory or not. RFC 3986 notes that if the former URI redirects to the latter URI, then that is an indication that they are equivalent.

We should re-run URL normalization without this mistake. Before that, we should probably store both the original URL and the normalized URL, so that if normalization loses information we can still recover from bugs like this in the future.
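
A minimal sketch of both ideas, under stated assumptions: `StoredUrl`, `normalizeUrlKeepingPath`, and `toStoredUrl` are hypothetical names invented for this example, and the normalization shown is just the standard `URL` class's serialization, not necessarily what normalizeUrl does after the fix.

```typescript
// Hypothetical record that keeps the URL as submitted next to the derived,
// normalized form. The normalized value can then be recomputed at any time.
interface StoredUrl {
  url: string;
  normalizedUrl: string;
}

// Normalize without ever appending a slash to a non-empty path.
// The WHATWG URL serializer lowercases the scheme and host and turns an
// empty http(s) path into "/", but leaves "/index.html" and "/docs" as-is,
// since nothing in the URL says whether the last segment is a directory.
function normalizeUrlKeepingPath(url: string): string {
  return new URL(url).toString();
}

// Build the record from the original input; the original is never overwritten.
function toStoredUrl(url: string): StoredUrl {
  return { url, normalizedUrl: normalizeUrlKeepingPath(url) };
}
```

With a record like this, a faulty normalization only corrupts `normalizedUrl`; re-running a corrected function over `url` restores it.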

Fixed in #567

@carlgieringer carlgieringer added this to the P0 milestone Sep 5, 2023
Projects
Status: Todo