Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

2734 preserve orig filename #6774

Merged
merged 13 commits into from
Apr 2, 2020
Merged

2734 preserve orig filename #6774

merged 13 commits into from
Apr 2, 2020

Conversation

sekmiller
Copy link
Contributor

What this PR does / why we need it:
This PR provides a method for storing the original file name for ingested tabular files. It also adds the original file name to the json printer for data files. For files that were uploaded previously the son printer will infer the original file name from the file metadata and a conversion of the extension based on the original file format field on the data table.

Which issue(s) this PR closes:

Closes #2734

Special notes for your reviewer:
None

Suggestions on how to test this:
Verify that ingested tabular files json export includes the original file name even if the file was ingested previously and thus does not have the original file name saved to the dataTable table.

Does this PR introduce a user interface change?:
No

Is there a release notes update needed for this change?:
No

Additional documentation:
There is a sql script that adds the new field to the dataTable table.

@coveralls
Copy link

coveralls commented Mar 26, 2020

Coverage Status

Coverage increased (+0.02%) to 19.656% when pulling 0856d47 on 2734-preserve-orig-filename into eb2fa18 on develop.

@landreev landreev self-requested a review March 26, 2020 20:51
@landreev landreev self-assigned this Mar 26, 2020
if (StringUtil.nonEmpty(extensionToRemove)) {
String newFileName = filename.replace(extensionToRemove, originalExtension);
return newFileName;
}
Copy link
Contributor

@landreev landreev Mar 27, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sekmiller Am I reading this correctly - that if the current filename has no extension ("foo"), this code will return "foo"? - shouldn't it still return "foo.<ORIG_EXT>"?
(although this shouldn't really happen - since ingested files should always have the ".tab" extension)
(also, it looks like we never checked for this "no extension" case before. To be really OCD though, if you call .replace(".tab", ".dta") on something exotic like "datafile.tabular.tab" will return "datafile.dtaular.dta". See my really old code in StoredOriginalFile.java - I use the regex version of replace there: .replaceAll(".tab$", ".dta"), to make sure I'm only replacing the end part.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good points. I will take another pass at this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@landreev I put in your suggestion on using replaceAll and relying on the fact that a tabular file should always end in ".tab ". It occurs to me that any editing of the file label to anything other than foo.tab will break this (including your "no extension" case). How worried should we be about this, and any suggestions to code against it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Answered my own question while out for a run: so the replaceAll("tab$" which I put in should fix 100% of the cases. Since we know it won't I can modify the previous "look for an extension to replace" code by adding a $ to the replace string so that inadvertent internal instances of the extension are not replaced, and if there's no extension found just append the extension derived from the file type. All of this points out that even after all this we cannot be certain that we're returning anything like the actual original file name.

Copy link
Contributor

@landreev landreev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to JsonPrinter, I believe we also want to use this new method in UningestFileCommand.java and in StoredOriginalFile.java (where the filename is set when the saved original is downloaded via the access API; with inputStreamIO.setFileName(...))

@djbrooke djbrooke assigned sekmiller and unassigned landreev Mar 27, 2020
@sekmiller sekmiller removed their assignment Mar 27, 2020
@landreev landreev self-assigned this Mar 27, 2020
simplifies the logic, removes unused code
Copy link
Contributor

@landreev landreev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made another quick change, to further simplify StoredOriginalFile. Looks great otherwise.

@kcondon kcondon self-assigned this Apr 2, 2020
@kcondon kcondon merged commit d631b05 into develop Apr 2, 2020
@kcondon kcondon deleted the 2734-preserve-orig-filename branch April 2, 2020 13:44
@djbrooke djbrooke added this to the Dataverse 5 milestone Apr 2, 2020
@matthew-a-dunlap
Copy link
Contributor

Was just talking about this with @donsizemore , we're excited to see this as part of Dataverse 5!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add originalFileName field to json
6 participants