Add support for directly ingesting tab-delimited files #4044
Comments
I just closed #1912, which is also about supporting ingest of TSV files, and there's some good detail in there that we should review when we work on this. |
@oscardssmith Looking for a brief outline of the features/changes to provide QA with a test plan, as well as documentation, most likely somewhere in User Guide > Tabular Data File Ingest. http://guides.dataverse.org/en/latest/user/tabulardataingest/index.html |
@mheppler is there an easy way to view my modifications to the docs? |
QA details for @kcondon |
Conflicts:
src/main/java/edu/harvard/iq/dataverse/DataFile.java (just imports)
src/main/java/edu/harvard/iq/dataverse/util/FileUtil.java (switch vs. if/else)
Ok, merge conflicts resolved in c616686. @oscardssmith see the commit message for what was in conflict. While I have this branch loaded up I tested https://github.com/IQSS/open-source-at-harvard/blob/master/data/2017-07-31.tsv and it wasn't ingested. I'm not sure why. |
@oscardssmith , I assume ingesting a csv file, downloading as .tab, renaming to .tsv, then ingesting again should work? So far, no luck. I'm going to try with a simpler file I created myself. I must be doing something wrong because I can't get a simple .tsv file to ingest. |
The extension should be .tsv, but it might work for both, and arguably it should. |
It may not be working because the downloaded file we provide is missing the header, which we need in order to ingest. |
@oscardssmith and I have been talking about this issue this morning. It's not ready for QA. I did end up botching the merge conflict resolution on Friday, so I just created a branch called "4044-tsv-ingest2" so we can try again if we want to put that branch through QA instead. I did notice a small bug or at least shortcoming. I was able to ingest |
Adding |
Because I botched merging develop into the last pull request, I asked @oscardssmith to do it in a new one: pull request #4854. Thanks! He also fixed the issue above. |
TSV ingest generally works but it is dropping the extension on original file download. |
Fixed |
OK, note that a file uploaded as .tab becomes .tsv when downloaded as the original file. This is because we do not store the extension but recreate it by examining the mime type. Though an issue, it preexists this PR and might be worked on when we look at preserving/storing original filenames, as requested by Odum. |
@donsizemore, @akio-sone, and @evelynPM have been discussing this at #2734.
Now that #3767 has been merged, we can build on top of the new code there to add direct support for tab-delimited files.
The Apache CSVParser that we are using now will be happy to parse using a separator character other than comma. So if we can identify CSV and tab-delimited files accurately before they are passed to the ingest plugin, this same plugin can be used for both CSV and TAB, with very minor modifications. The plugin will just need to check the mime type passed to it, and choose the delimiter character based on it.
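To illustrate the idea (this is just a sketch, not the actual ingest plugin code; the mime type constants and method name below are placeholders), the delimiter choice could look something like this with Apache Commons CSV:

```java
import java.io.BufferedReader;
import java.io.IOException;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;

// Sketch only: pick the delimiter from the mime type the ingest plugin is
// handed, then open the same Apache Commons CSV parser for both formats.
// The mime type constants are illustrative stand-ins, not necessarily the
// exact strings Dataverse uses internally.
public class DelimiterSelectionSketch {

    static final String MIME_CSV = "text/csv";
    static final String MIME_TSV = "text/tab-separated-values";

    static CSVParser openParser(BufferedReader reader, String mimeType) throws IOException {
        // Comma by default; switch to tab when the file was identified as TSV.
        char delimiter = MIME_TSV.equals(mimeType) ? '\t' : ',';
        CSVFormat format = CSVFormat.EXCEL.withHeader().withDelimiter(delimiter);
        return new CSVParser(reader, format);
    }
}
```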
So the main amount of work for this issue will need to be done outside of the plugin, in the code that determines the file type when a file is uploaded. The specific class files are FileUtil.java and IngestableDataChecker.java. As of now, we don't really have good, reliable tests for either format there. With more complex formats, like Stata, SPSS, etc., it's enough to read a very short sequence of bytes and check it for specific, clearly defined signatures. With CSV and TAB, the only reliable test is probably to read more data, like the first 10 lines at least, and just try to parse them, first with the comma as the delimiter, and if that fails - with tab (?). Maybe we should run this test on every file that's otherwise identified as plain text?
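Roughly, that detection step could work like the following sketch (again, not existing code in FileUtil.java or IngestableDataChecker.java, just one way to express the "read the first ~10 lines and try both delimiters" idea):

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;

import org.apache.commons.csv.CSVFormat;
import org.apache.commons.csv.CSVParser;
import org.apache.commons.csv.CSVRecord;

// Rough sketch of the proposed detection step, not existing Dataverse code.
// Reads the first few lines of a plain-text file and checks whether they
// parse consistently with a comma or a tab delimiter.
public class DelimitedTypeSniffer {

    static String detectDelimitedType(BufferedReader reader) throws IOException {
        // Collect up to 10 lines as the sample to test.
        StringBuilder sample = new StringBuilder();
        String line;
        int count = 0;
        while (count < 10 && (line = reader.readLine()) != null) {
            sample.append(line).append('\n');
            count++;
        }
        if (parsesConsistently(sample.toString(), ',')) {
            return "text/csv";
        }
        if (parsesConsistently(sample.toString(), '\t')) {
            return "text/tab-separated-values";
        }
        return "text/plain"; // leave it as plain text, no tabular ingest
    }

    // A sample "parses consistently" if every line yields the same number of
    // columns and there is more than one column (a single column usually just
    // means the delimiter never appears in the file).
    static boolean parsesConsistently(String sample, char delimiter) {
        try (CSVParser parser = new CSVParser(new StringReader(sample),
                CSVFormat.EXCEL.withDelimiter(delimiter))) {
            int columns = -1;
            for (CSVRecord record : parser) {
                if (columns == -1) {
                    columns = record.size();
                } else if (record.size() != columns) {
                    return false;
                }
            }
            return columns > 1;
        } catch (IOException | IllegalStateException e) {
            // Treat any parse failure as "not this delimiter".
            return false;
        }
    }
}
```

The `columns > 1` check is also what would guard against the single-column false positive described below.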
If we implement this correctly, it will resolve another issue identified in #3767 - we appear to trust the user-supplied ".csv" extension blindly. So when people upload tab-delimited files with .csv extensions, we try to run the CSV parser on them. And if it just so happens that the file doesn't contain any commas, the parser doesn't fail outright - the file gets processed as a CSV file with a single column.