As a data custodian, I want to load URLs / file paths without unnecessary / additional slashes #158
Comments
The implication of this ticket would be that harvest must run validate on all products prior to ingestion. Given the hours that validate takes for some bundles, you may get some pushback. The other implication would be that harvest starts to implement a subset of validate. If the two disagree, which is right? It is the classic problem: a person with one clock knows what time it is, while a person with two clocks is never sure. Also, how much does harvest implement before it becomes the new validate?
@jordanpadams consider @al-niessner's comment above: "A good sailor always travels with one clock or three, never two."
@nutjob4life @al-niessner I updated the story title to be more specific to this use case. We will not be running validate, but we want to load "cleaner" file paths / URLs in the future to avoid potential processing issues downstream, similar to what occurred with Deep Archive.
Thanks @jordanpadams!
Do we scan the entire document or just the paths we butcher? Harvest tries to convert, if told to, paths from local file locations to HTTP server locations. It is simpler to correct the butchering than to scan the whole document, but the document could still contain Linux-valid and schema-valid paths with multiple slashes (if the schema allows them). The ones returned by curl look like the butchered variety. That conversion is done using String.* operations, so nothing is checked along the way and the result could be schema-invalid.
@al-niessner Sorry for the lack of clarity here. @nutjob4life is referencing data in the Registry now, but the point of this enhancement is to prevent it from happening in the future upon ingest through Harvest. Between this part of the config and this part of the config, a double slash is being injected when we form the URL. We just want to make sure that doesn't happen.
Yes, merging those two parts of the config file is the butchering I was referring to.
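To make the merge problem concrete, here is a minimal sketch of joining two configured values so that exactly one slash separates them. It is written in Python for brevity and is not Harvest's actual code; the function name `join_url` and the `example.com` values are hypothetical.

```python
def join_url(base_url: str, relative_path: str) -> str:
    """Join two configured parts with exactly one slash between them.

    Hypothetical helper for illustration; not taken from Harvest.
    """
    # Drop trailing slashes on the left part and leading slashes on the
    # right part, then add the single separator back explicitly.
    return base_url.rstrip("/") + "/" + relative_path.lstrip("/")


if __name__ == "__main__":
    # Naive concatenation of the same two values would give ".../data//bundle/..."
    print(join_url("https://example.com/data/", "/bundle/file.xml"))
```

This only guards the seam between the two parts. Repeated slashes that already exist inside either part are left alone, which matches the narrower "fix what we butcher" option discussed above.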
Checked for duplicates
No - I haven't checked
User Persona(s)
Data Engineer
Motivation
See NASA-PDS/operations#476 for context; the issue is that somehow some file paths with double-slashes in them got into the Registry. For example, see
Those double-slashes cause the Deep Archive to also output double-slashes, which later fail validation.
These should not go into the Registry in the first place.
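One possible guard at ingest time, sketched in Python under the assumption that each file reference is either an absolute URL or a plain file path. The names `has_extra_slashes` and `collapse_slashes` are hypothetical and are not part of Harvest or the Registry.

```python
import re
from urllib.parse import urlsplit, urlunsplit


def has_extra_slashes(ref: str) -> bool:
    """Return True if the path component contains two or more slashes in a row."""
    return "//" in urlsplit(ref).path


def collapse_slashes(ref: str) -> str:
    """Collapse runs of slashes in the path, leaving 'scheme://host' untouched."""
    parts = urlsplit(ref)
    clean_path = re.sub(r"/{2,}", "/", parts.path)
    # SplitResult is a named tuple, so _replace returns a modified copy.
    return urlunsplit(parts._replace(path=clean_path))


if __name__ == "__main__":
    ref = "https://example.com/data//bundle/file.xml"  # hypothetical URL
    if has_extra_slashes(ref):
        print(collapse_slashes(ref))
```

Only the path component is inspected, so the `//` after the scheme and any slashes in a query string are preserved. A reference that itself begins with `//` is parsed by `urlsplit` as a network location rather than a path, so that case would need separate handling.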
Additional Details
No response
Acceptance Criteria
Given
When I perform
Then I expect
Engineering Details
No response