Add support for directly ingesting tab-delimited files #4044
Comments
I just closed #1912, which is also about supporting ingest of TSV files, and there's some good detail in there that we should review when we work on this. |
@oscardssmith Looking for a brief outline of the features/changes to provide QA with a test plan, as well as documentation, most likely somewhere in User Guide > Tabular Data File Ingest. http://guides.dataverse.org/en/latest/user/tabulardataingest/index.html |
@mheppler is there an easy way to view my modifications to the docs? |
QA details for @kcondon |
Conflicts:
src/main/java/edu/harvard/iq/dataverse/DataFile.java (just imports)
src/main/java/edu/harvard/iq/dataverse/util/FileUtil.java (switch vs. if/else)
Ok, merge conflicts resolved in c616686. @oscardssmith see the commit message for what was in conflict. While I have this branch loaded up I tested https://github.com/IQSS/open-source-at-harvard/blob/master/data/2017-07-31.tsv and it wasn't ingested. I'm not sure why. |
@oscardssmith , I assume ingesting a csv file, downloading as .tab, renaming to .tsv, then ingesting again should work? So far, no luck. I'm going to try with a simpler file I created myself. I must be doing something wrong because I can't get a simple .tsv file to ingest. |
The extension should be .tsv, but it might work for both, and arguably it should. |
It may not be working because the downloaded file we provide is missing the header, which we need in order to ingest. |
@oscardssmith and I have been talking about this issue this morning. It's not ready for QA. I did end up botching the merge conflict resolution on Friday, so I just created a branch called "4044-tsv-ingest2" so we can try again if we want to put that branch through QA instead. I did notice a small bug or at least shortcoming. I was able to ingest |
Adding |
Because I botched merging develop into the last pull request, I asked @oscardssmith to do it in a new one: pull request #4854. Thanks! He also fixed the issue above. |
TSV ingest generally works but it is dropping the extension on original file download. |
Fixed |
OK, note that a file uploaded as .tab becomes .tsv when downloaded as the original file. This is because we do not store the extension but recreate it by examining the mime type. Though an issue, it preexists this PR and might be worked on when we look at preserving/storing original filenames, as requested by Odum. |
@donsizemore, @akio-sone, and @evelynPM have been discussing this at #2734.
Now that #3767 has been merged, we can build on top of the new code there to add direct support for tab-delimited files.
The Apache CSVParser that we are using now will be happy to parse using a separator character other than comma. So if we can identify CSV and tab-delimited files accurately before they are passed to the ingest plugin, this same plugin can be used for both CSV and TAB, with very minor modifications. The plugin will just need to check the mime type passed to it, and choose the delimiter character based on it.
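To illustrate the idea (this is just a sketch, not the actual ingest plugin code; the mime type constants and method name below are placeholders), the delimiter choice could look something like this with Apache Commons CSV:

```java
import java.io.BufferedReader;
import java.io.IOException;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;

// Sketch only: pick the delimiter from the mime type the ingest plugin is
// handed, then open the same Apache Commons CSV parser for both formats.
// The mime type constants are illustrative stand-ins, not necessarily the
// exact strings Dataverse uses internally.
public class DelimiterSelectionSketch {

    static final String MIME_CSV = "text/csv";
    static final String MIME_TSV = "text/tab-separated-values";

    static CSVParser openParser(BufferedReader reader, String mimeType) throws IOException {
        // Comma by default; switch to tab when the file was identified as TSV.
        char delimiter = MIME_TSV.equals(mimeType) ? '\t' : ',';
        CSVFormat format = CSVFormat.EXCEL.withHeader().withDelimiter(delimiter);
        return new CSVParser(reader, format);
    }
}
```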
So the main amount of work for this issue will need to be done outside of the plugin, in the code that determines the file type when a file is uploaded. The specific class files are FileUtil.java and IngestableDataChecker.java. As of now, we don't really have good, reliable tests for either format there. With more complex formats, like Stata, SPSS, etc., it's enough to read a very short sequence of bytes and check it for specific, clearly defined signatures. With CSV and TAB, the only reliable test is probably to read more data, like the first 10 lines at least, and just try to parse them, first with the comma as the delimiter, and if that fails - with tab (?). Maybe we should run this test on every file that's otherwise identified as plain text?
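Roughly, that detection step could work like the following sketch (again, not existing code in FileUtil.java or IngestableDataChecker.java, just one way to express the "read the first ~10 lines and try both delimiters" idea):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

// Rough sketch of the proposed detection step, not existing Dataverse code.
// Reads the first few lines of a plain-text file and checks whether they
// parse consistently with a comma or a tab delimiter.
public class DelimitedTypeSniffer {

    static String detectDelimitedType(BufferedReader reader) throws IOException {
        // Collect up to 10 lines as the sample to test.
        StringBuilder sample = new StringBuilder();
        String line;
        int count = 0;
        while (count < 10 && (line = reader.readLine()) != null) {
            sample.append(line).append('\n');
            count++;
        }
        if (parsesConsistently(sample.toString(), ',')) {
            return "text/csv";
        }
        if (parsesConsistently(sample.toString(), '\t')) {
            return "text/tab-separated-values";
        }
        return "text/plain"; // leave it as plain text, no tabular ingest
    }

    // A sample "parses consistently" if every line yields the same number of
    // columns and there is more than one column (a single column usually just
    // means the delimiter never appears in the file).
    static boolean parsesConsistently(String sample, char delimiter) {
        try (CSVParser parser = new CSVParser(new StringReader(sample),
                CSVFormat.EXCEL.withDelimiter(delimiter))) {
            int columns = -1;
            for (CSVRecord record : parser) {
                if (columns == -1) {
                    columns = record.size();
                } else if (record.size() != columns) {
                    return false;
                }
            }
            return columns > 1;
        } catch (IOException | IllegalStateException e) {
            // Treat any parse failure as "not this delimiter".
            return false;
        }
    }
}
```

The `columns > 1` check is also what would guard against the single-column false positive described below.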
If we implement this correctly, it will resolve another issue identified in #3767 - we appear to trust the user-supplied ".csv" extension blindly. So when people upload tab-delimited files with .csv extensions, we try to run the CSV parser on them. And if it just so happens that the file doesn't contain any commas, the parser doesn't fail outright - the file gets processed as a CSV file with a single column.