Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for directly ingesting tab-delimited files #4044

Closed
landreev opened this issue Aug 4, 2017 · 16 comments
Closed

Add support for directly ingesting tab-delimited files #4044

landreev opened this issue Aug 4, 2017 · 16 comments
Assignees

Comments

@landreev
Copy link
Contributor

landreev commented Aug 4, 2017

Now that #3767 has been merged, we can build on top of the new code there to add direct support for tab-delimited files.

The Apache CSVParser that we are using now will be happy to parse using a separator character other than comma. So if we can identify CSV and tab-delimited files accurately before they are passed to the ingest plugin, this same plugin can be used for both CSV and TAB, with very minor modifications. The plugin will just need to check the mime type passed to it, and choose the delimiter character based on it.

So the main amount of work for this issue will need to be done outside of the plugin, in the code that determines the file type, when a file is uploaded. The specific class files are FileUtil.java and IngestableDataChecker.java. As of now, we don't really have good, reliable tests for either there. With more complex formats, like Stata, SPSS, etc. it's enough to read a very short sequence of bytes, and check it for specific, clearly defined signatures. With CSV and TAB, the only reliable test is probably to read more data, like the first 10 lines at least, and just try to parse them, first with the comma as the delimiter, and if that fails - with tab (?). Maybe we should run this test on every file that's otherwise identified as plain text?

If we implement this correctly, this will resolve another issue identified in #3767 - we appear to trust the user-supplied extension ".csv", blindly. So when people upload tab-delimited files with .csv extensions, we try to run the CSV parser on them. And if, for example, if it just so happens that the file either doesn't have any commas, and the parser doesn't fail outright - it gets processed as a CSV file with the single column.

@pdurbin
Copy link
Member

pdurbin commented Jun 1, 2018

I just closed #1912 which is also about supporting ingest of TSV files and there's some good detail in there that we should review when we work on this.

@mheppler
Copy link
Contributor

mheppler commented Jul 6, 2018

@oscardssmith Looking for a brief outline of the features/changes to provide QA with a test plan, as well as documentation, most likely somewhere in User Guide > Tabular Data File Ingest.

http://guides.dataverse.org/en/latest/user/tabulardataingest/index.html

@oscardssmith
Copy link
Contributor

@mheppler is there an easy way to view my modifications to the docs?

@oscardssmith
Copy link
Contributor

QA details for @kcondon

  • check /user/tabulardataingest/csv tsv.rst docs
  • test ingesting tab files
  • maybe test a csv file just for fun.

@kcondon kcondon self-assigned this Jul 11, 2018
@pdurbin
Copy link
Member

pdurbin commented Jul 13, 2018

Pull request #4606 was just merged and now pull request #4817 has merge conflicts that should be resolved before this issue is QA'ed.

@pdurbin pdurbin self-assigned this Jul 13, 2018
pdurbin added a commit that referenced this issue Jul 13, 2018
Conflicts:
src/main/java/edu/harvard/iq/dataverse/DataFile.java (just imports)
src/main/java/edu/harvard/iq/dataverse/util/FileUtil.java (switch vs if/else)
@pdurbin
Copy link
Member

pdurbin commented Jul 13, 2018

Ok, merge conflicts resolved in c616686. @oscardssmith see the commit message for what was in conflict.

While I have this branch loaded up I tested https://github.com/IQSS/open-source-at-harvard/blob/master/data/2017-07-31.tsv and it wasn't ingested. I'm not sure why.

@kcondon
Copy link
Contributor

kcondon commented Jul 13, 2018

@oscardssmith , I assume ingesting a csv file, downloading as .tab, renaming to .tsv, then ingesting again should work? So far, no luck. I'm going to try with a simpler file I created myself.

I must be doing something wrong because I can't get a simple .tsv file to ingest.
Also, the first line containing variables is missing from tab download for ingested csv but works in v4.9.1
Also, looked at the docs but what should be the file extension, .tsv or .tab or does it matter?

@oscardssmith
Copy link
Contributor

the extension should be tsv, but it might, and arguably should work for both.

@oscardssmith
Copy link
Contributor

it may not be working because the downloaded one we provide is missing the header, which we need to ingest.

@pdurbin pdurbin removed their assignment Jul 14, 2018
@pdurbin pdurbin assigned oscardssmith and pdurbin and unassigned kcondon Jul 16, 2018
@pdurbin
Copy link
Member

pdurbin commented Jul 16, 2018

@oscardssmith and I have been taking about this issue this morning. It's not ready for QA. I did end up botching the merge conflict resolution on Friday so I just created a branch called "4044-tsv-ingest2" so we can try again if we want to put that branch through QA instead.

I did notice a small bug or at least shortcoming. I was able to ingest scripts/issues/2595/numconnacquired.tsv but it doesn't show up as "tabulardata" in the facet:

screen shot 2018-07-16 at 10 58 11 am

@pdurbin
Copy link
Member

pdurbin commented Jul 16, 2018

Adding text/tsv=tabulardata below text/tab-separated-values=tabulardata in src/main/java/MimeTypeFacets.properties and uploading a new file fixes the facet bug above but I'm not sure if we should have both forms or replace the former with the latter.

@pdurbin
Copy link
Member

pdurbin commented Jul 17, 2018

Because I botched merging develop into the last pull request, I asked @oscardssmith to do it in a new one: pull request #4854. Thanks! He also fixed the issue above.

@kcondon kcondon self-assigned this Jul 17, 2018
@kcondon
Copy link
Contributor

kcondon commented Jul 17, 2018

TSV ingest generally works but it is dropping the extension on original file download.

@oscardssmith
Copy link
Contributor

Fixed

@kcondon
Copy link
Contributor

kcondon commented Jul 18, 2018

OK note that downloading a file that was uploaded as .tab becomes .tsv on download as original file. This is because we do not store the extension but recreate it by examining the mime type. Though an issue, it preexists this PR and might be worked on when we look at preserving original filenames/ store original filenames requested by Odum.

@pdurbin
Copy link
Member

pdurbin commented Jul 18, 2018

This is because we do not store the extension but recreate it by examining the mime type.

@donsizemore @akio-sone and @evelynPM have been discussing this at #2734

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants