As a data custodian, I want to load URLs / file paths without unnecessary / additional slashes #158

Closed
nutjob4life opened this issue Mar 26, 2024 · 7 comments · Fixed by NASA-PDS/registry-common#47

@nutjob4life
Member

Checked for duplicates

No - I haven't checked

🧑‍🔬 User Persona(s)

Data Engineer

💪 Motivation

See NASA-PDS/operations#476 for context; the issue is that somehow some file paths with double-slashes in them got into the Registry. For example, see

curl --silent 'https://pds.nasa.gov/api/search/1.0//products/urn:nasa:pds:cassini_uvis_solarocc_beckerjarmak2023::1.0/members/latest' \
    | json_pp | egrep '//data'

Those double-slashes cause the Deep Archive to also output double-slashes, which later fail validation.

These should not go into the Registry in the first place.
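
The same check that the `curl ... | egrep '//data'` pipeline does by eye can be sketched in Python. This is only an illustration, not code from the Registry or the API: the function name `find_double_slashes` and the sample field names are hypothetical, and it assumes the response has already been parsed into nested dicts and lists.

```python
import re
from typing import Any, Iterator

# "//" that is not the "://" following a URL scheme.
_DOUBLE_SLASH = re.compile(r"(?<!:)//")


def find_double_slashes(node: Any) -> Iterator[str]:
    """Yield every string value in a parsed JSON document that contains '//'
    somewhere other than directly after the scheme colon."""
    if isinstance(node, str):
        if _DOUBLE_SLASH.search(node):
            yield node
    elif isinstance(node, dict):
        for value in node.values():
            yield from find_double_slashes(value)
    elif isinstance(node, list):
        for value in node:
            yield from find_double_slashes(value)


# Hypothetical sample document; the keys are made up for the example.
sample = {
    "data": [
        {"file_ref": "https://example.org/archive//data/x.xml"},
        {"file_ref": "https://example.org/archive/data/y.xml"},
    ]
}
bad = list(find_double_slashes(sample))
```

Here `bad` would hold only the first `file_ref`, since the second one has no slashes doubled outside the scheme separator.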

📖 Additional Details

No response

Acceptance Criteria

Given
When I perform
Then I expect

⚙️ Engineering Details

No response

@al-niessner
Contributor

@nutjob4life

The implication of this ticket would be that harvest must run validate on all products prior to ingestion. Given the hours that validate takes for some bundles, you may get some pushback.

The other implication would be that harvest starts to implement a subset of validate. If the two disagree, which one is right? It is the classic problem: a person with one clock knows what time it is, while a person with two clocks is never sure. Also, how much can harvest implement before it becomes the new validate?

@nutjob4life
Member Author

@jordanpadams consider @al-niessner's comment above ↑

"A good sailor always travels with one clock or three, never two."
-- A good sailor, possibly

@jordanpadams changed the title from "As a data custodian, I don't want to put bad data into the PDS Registry" to "As a data custodian, I want to load URLs / file paths without unnecessary / additional slashes" on Apr 18, 2024
@jordanpadams
Member

@nutjob4life @al-niessner updated the story title to be more specific to this use case. We will not be running validate, but we want to load "cleaner" file paths / URLs in the future to avoid potential processing issues downstream, similar to what occurred with Deep Archive.

@nutjob4life
Member Author

Thanks @jordanpadams! 🙏

@al-niessner
Contributor

@nutjob4life @jordanpadams

Do we scan the entire document or just the paths we butcher? If told to, harvest tries to convert paths from local file locations to HTTP server locations. It is simpler to correct the butchering than the whole document, but that could still leave paths with multiple slashes that are valid on Linux and valid against the schema (if the schema allows them). The ones returned by curl look like the butchered variety; that conversion is done using String.*, so nothing checks the result along the way and it could be schema invalid.
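
For the narrower option (cleaning only the converted paths, not the whole document), a rough sketch of what "correcting the butchering" could look like is below. It is written in Python purely for illustration and is not Harvest code; `collapse_slashes` is a hypothetical name. The assumption is that repeated slashes should be collapsed in the path only, leaving the `://` after the scheme and any query string alone.

```python
import re
from urllib.parse import urlsplit, urlunsplit

_REPEATED_SLASHES = re.compile(r"/{2,}")


def collapse_slashes(ref: str) -> str:
    """Collapse runs of '/' in the path of a URL or in a plain file path."""
    if "://" in ref:
        # Treat as a URL: only touch the path component so the scheme
        # separator, host, and query string stay exactly as they were.
        parts = urlsplit(ref)
        clean_path = _REPEATED_SLASHES.sub("/", parts.path)
        return urlunsplit(parts._replace(path=clean_path))
    # No scheme: treat the whole string as a file path.
    return _REPEATED_SLASHES.sub("/", ref)


url = collapse_slashes("https://example.org/archive//data/x.xml")
path = collapse_slashes("/archive//data///x.xml")
```

Applied only to the strings harvest itself builds, this would leave paths that came straight from the label untouched, which keeps it from turning into a second validate.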

@jordanpadams
Member

jordanpadams commented Apr 18, 2024

@al-niessner Sorry for the lack of clarity here. @nutjob4life is referencing data in the Registry now, but the point of this enhancement is to prevent it from happening in the future upon ingest through Harvest.

Between this part of the config and this part of the config a double-slash is being injected in here when we are forming the URL. We just want to make sure that doesn't happen.
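
A minimal sketch of the kind of join that avoids this, assuming the two config parts are a base URL prefix and a relative file path (both hypothetical here, as is the name `join_url`). Harvest is not written in Python; this only shows the idea of normalizing the boundary between the two pieces instead of concatenating them blindly.

```python
def join_url(base: str, path: str) -> str:
    """Join a base URL and a relative path with exactly one '/' between them."""
    # Strip slashes on both sides of the seam, then add a single one back.
    return base.rstrip("/") + "/" + path.lstrip("/")


# Whether or not either side carries its own slash, the result is the same.
a = join_url("https://example.org/archive/", "/data/x.xml")
b = join_url("https://example.org/archive", "data/x.xml")
```

Both `a` and `b` come out as `https://example.org/archive/data/x.xml`, so the result no longer depends on whether the config values end or start with a slash.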

@al-niessner
Contributor

> @al-niessner Sorry for the lack of clarity here. @nutjob4life is referencing data in the Registry now, but the point of this enhancement is to prevent it from happening in the future upon ingest through Harvest.
>
> Between this part of the config and this part of the config a double-slash is being injected in here when we are forming the URL. We just want to make sure that doesn't happen.

Yes, merging those two parts of the config file is the butchering I was referring to.
