Fix 8.5% of unknown files; Re-ingest 10,000+ Excel files (.xlsx) #3288

raprasad · 2016-08-16T20:45:25Z

Needed for #2202 to work properly

Fixes

~~Update contenttypes in the db~~ (done)
Attempt to re-ingest each file

According to a dataverse snapshot at end of July:

There are ~~19,362~~ 13,696 files with extension ".xlsx"
Of these files, ~~15,187~~ 10,460 files, are contenttype: "application/octet-stream"
- In other words, these ~~15,187~~ 10,460 files do not seem to be identified as Excel
- They are classified as unknown
These account for a full ~~10.6%~~ 8.5% of Dataverses's 122k+ "unknown files"

# Count all "unknown" files
SELECT count(distinct(datafile.id)) FROM "datafile" INNER JOIN "dvobject" ON ("datafile"."id" = "dvobject"."id") WHERE "datafile"."contenttype" = 'application/octet-stream';

# Count all .xlsx files 
SELECT count(distinct(datafile.id)) FROM "datafile" INNER JOIN "dvobject" ON ("datafile"."id" = "dvobject"."id") WHERE ("datafile"."id" IN (SELECT fm."datafile_id" FROM "filemetadata" fm WHERE fm."label" LIKE '%.xlsx'));

# Count .xlsx files labeled as unknown
SELECT count(distinct(datafile.id)) FROM "datafile" INNER JOIN "dvobject" ON ("datafile"."id" = "dvobject"."id")
WHERE ("datafile"."id" IN (SELECT fm."datafile_id" FROM "filemetadata"fm WHERE fm."label" LIKE '%.xlsx')
AND "datafile"."contenttype" = 'application/octet-stream')

Numbers updated on: 9/6/16. All numbers above from a late July prod. snapshot.

Prod numbers, 9/6 below:

  "data": {
    "number_unique_extensions": 1454,
    "file_extension_counts": [
      {
        "extension": "",
        "count": 36657,
        "total_count": 124357,
        "percent_string": "29.477%"
      },
      {
        "extension": ".xz",
        "count": 12423,
        "total_count": 124357,
        "percent_string": "9.990%"
      },
      {
        "extension": ".xlsx",
        "count": 10471,
        "total_count": 124357,
        "percent_string": "8.420%"
      },
(etc, etc)

djbrooke · 2016-08-17T14:04:16Z

This has value beyond analytics.

@sbarbosadataverse - I know we have discussed this but I couldn't find another Github issue. Is there one, or should we just track the work here?

pdurbin · 2016-08-17T14:07:13Z

This issue strikes me as a bit of a duplicate of #585 and #2264.

djbrooke · 2016-08-23T13:55:49Z

@raprasad - is this a code change that needs to be tied to release or is there some script/process that needs to be run/rerun to re-ingest these files? Thanks!

raprasad · 2016-08-23T14:06:20Z

@djbrooke: I think @landreev can re-run ingest code. Can go over more details post stand-up.

sekmiller · 2016-09-01T21:40:44Z

@raprasad What is the query that you are using to get these results?

sekmiller · 2016-09-02T14:18:48Z

Before digging into the ingest code I created some simple .xlsx files. When I ingest them they are converted into .tab files with a content type of "text/tab-separated-values" @landreev is there a trick to get them to be ingested at xlsx files?

@raprasad is there anything else you can tell about the files in question? Could they have been in 3.6 and migrated to 4.0?

sekmiller · 2016-09-02T14:21:43Z

More information - when I try to download one of my original .xlsx files, I am offered that format as the original format.

sekmiller · 2016-09-02T14:44:45Z

@raprasad are you joining datatable in your query? That holds the original file format for files that are ingested as data tables.

raprasad · 2016-09-06T14:44:42Z

after talking to @sekmiller, ?s to answer:

(A) When were these xlsx files uploaded? Pre/post 4.0?
- Answer: The content type of application/octet-stream was not applied after 7/27/2015
(B) How does the home page "File Type" facet capture files? e.g. How are excel file files classified in this facet?
- Answer: See next comment

raprasad · 2016-09-06T14:54:45Z

Answer to (B) above:

One the home page, the "File type" facet for properly identified Excel files is either:
1. tabulardata. User uploads an .xlsx file and ends up with a .tab as in @sekmiller's previous comment. May be problematic/confusing for user but that's a separate issue
2. document. An .xlsx file not "ingested" is considered a file type "document"
Regarding .xlsx files listed as file type document, this is how it happens:
1. This code determines that an .xlsx file is a type "document":
  - https://github.com/IQSS/dataverse/blob/develop/src/main/java/edu/harvard/iq/dataverse/DataFileServiceBean.java#L1078
2. This code gives it the facet file type as "document":
  - https://github.com/IQSS/dataverse/blob/develop/src/main/java/edu/harvard/iq/dataverse/DataFileServiceBean.java#L998
3. The file type "document" above is used for solr indexing

NOTE: This confirms that the original issue .xlsx files marked as content type "application/octet-stream" is legitimate

raprasad · 2016-09-09T16:22:05Z

update query to fix content type:

1 - Run subquery count

SELECT count(distinct(df.id))
FROM datafile df,  filemetadata fm
WHERE fm.datafile_id = df.id
AND fm.label LIKE '%.xlsx'
AND df.contenttype = 'application/octet-stream';

2 - Run update query

update datafile set contenttype='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet' where id in (SELECT distinct(df.id)
FROM datafile df,  filemetadata fm
WHERE fm.datafile_id = df.id
AND fm.label LIKE '%.xlsx'
AND df.contenttype = 'application/octet-stream');

Sanity

Rows in 1 and 2 should match. Ran test on copy of prod.
Altnerative: run update with actual list of ids

raprasad · 2016-09-09T17:36:48Z

almost done, after @kcondon runs index all, content types will be correct in db and solr

…instead of datafile object. #3288

djbrooke · 2016-09-14T21:04:43Z

The update of .xlsx files from unknown to known has been completed. This provides significant value in that it reduces the number of files that are undefined in our reports.

The re-ingest of files is an outstanding issue, one that would be better served by delaying until after we have file versioning in place (#2290). I'm going to move this issue out of 4.5.1 and mark it as not started. It will be reassigned to a future release and we can make the ingest changes then.

Some notes for when we pick this up in the future, gleaned from chat:

Agreed that:

We want to ingest unprocessed Excel files (and other tabular files in the future)
Ingestion will either:
- Create a new UNF if one doesn't exist
- Change the UNF if there is one. (e.g. One file was ingested but now two are.)
A new (minor) version should be created

Potential workflow:

Ingest file, Create new dataset version, Notify user

If dataset is published, then publish the new version
If dataset is unpublished, then leave new version unpublished

We should review to the notification text to make sure it explains that the UNF of the previous version can still be found (and referenced) in the dataset landing page. We should also make sure that thew new version tracks what was changed (e.g. UNF generated from excel file). My question about workflow 1 is that if we do a batch ingest for all excel files, will we know ahead of time that they are "ingestable"? Finally, a number of users had asked to turn off the "explore" (TwoRavens) option for theirs tabular files. Could we run this once there is the option to remove the "explore" button for a data file?

One other note. The current wait we store files is that the data table / unf are stored with the file, not the version. So we would have to delete the file, then re add it. (What will eventually also be file replace?) Does this present any issues? One is that we technically have always said that that is a major version change.

The point is we don't currently have a way to store a file in both an uningested state and an ingested state, except to treat as a new file (ie a new version of the file). And to maintain version integrity, we need to be able to store the file in both states.

pdurbin · 2018-07-12T01:33:51Z

The re-ingest of files is an outstanding issue

@djbrooke yes, especially now that #2301 about Stata has been addressed we should revisit how best to retry ingesting a whole variety of files that failed ingest with older versions of Dataverse. Is that what this issue represents? Should we create a fresh one instead? In 06ef690 I tested that a certain Brooke's file are now expected to ingest properly once we cut a release. 😄

pdurbin · 2018-07-17T22:15:40Z

Closing in favor of this new issue: File Reingest API #4865

raprasad added Feature: Metadata Priority: High Type: Bug a defect Component: Code Infrastructure formerly "Feature: Code Infrastructure" labels Aug 16, 2016

raprasad changed the title ~~Re-ingest 14,000+ Excel files (.xlsx) that are marked as "unknown"~~ Fix 10.6% of unkown files; Re-ingest 14,000+ Excel files (.xlsx) Aug 16, 2016

raprasad changed the title ~~Fix 10.6% of unkown files; Re-ingest 14,000+ Excel files (.xlsx)~~ Fix 10.6% of unknown files; Re-ingest 14,000+ Excel files (.xlsx) Aug 16, 2016

djbrooke assigned sbarbosadataverse Aug 17, 2016

djbrooke changed the title ~~Fix 10.6% of unknown files; Re-ingest 14,000+ Excel files (.xlsx)~~ Fix 10.6% of unknown files; Re-ingest 14,000+ Excel files (.xlsx) Aug 17, 2016

djbrooke unassigned sbarbosadataverse Aug 22, 2016

djbrooke added this to the 4.5.1 - IP Groups fixes milestone Sep 1, 2016

djbrooke added the ready label Sep 1, 2016

djbrooke assigned sekmiller Sep 1, 2016

djbrooke added in progress and removed ready labels Sep 2, 2016

raprasad changed the title ~~Fix 10.6% of unknown files; Re-ingest 14,000+ Excel files (.xlsx)~~ Fix 10.6% of unknown files; Re-ingest ~~14,000+~~ 10,000+ Excel files (.xlsx) Sep 6, 2016

raprasad changed the title ~~Fix 10.6% of unknown files; Re-ingest ~~14,000+~~ 10,000+ Excel files (.xlsx)~~ Fix 8.5% of unknown files; Re-ingest 10,000+ Excel files (.xlsx) Sep 6, 2016

raprasad added Feature: File Upload & Handling and removed Component: Code Infrastructure formerly "Feature: Code Infrastructure" labels Sep 6, 2016

mheppler mentioned this issue Sep 7, 2016

Dataset - Too Many "Unknown" Files, Friendly File MIME Type Display Names #2202

Closed

raprasad self-assigned this Sep 9, 2016

djbrooke removed Priority 2: Moderate labels Sep 9, 2016

raprasad added a commit that referenced this issue Sep 13, 2016

#3288 added method for single file

cb4132a

raprasad added a commit that referenced this issue Sep 13, 2016

#3288, bit of cleanup

c05120d

raprasad added a commit that referenced this issue Sep 13, 2016

Added additional startIngestSingleFile method that takes datafile id …

4f0212a

…instead of datafile object. #3288

raprasad added a commit that referenced this issue Sep 13, 2016

Working on #3288

843c757

djbrooke unassigned sekmiller Sep 14, 2016

djbrooke unassigned raprasad Sep 14, 2016

djbrooke added ready and removed in progress labels Sep 14, 2016

djbrooke removed this from the 4.5.1 - IP Groups fixes milestone Sep 15, 2016

djbrooke removed the ready label Sep 20, 2016

pdurbin mentioned this issue Mar 8, 2017

200+ files uploaded to Harvard Dataverse failed ingest during one month in 2017 #3675

Closed

pdurbin closed this as completed Jul 17, 2018

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix 8.5% of unknown files; Re-ingest 10,000+ Excel files (.xlsx) #3288

Fix 8.5% of unknown files; Re-ingest 10,000+ Excel files (.xlsx) #3288

raprasad commented Aug 16, 2016 •

edited

Loading

djbrooke commented Aug 17, 2016

pdurbin commented Aug 17, 2016

djbrooke commented Aug 23, 2016

raprasad commented Aug 23, 2016

sekmiller commented Sep 1, 2016

sekmiller commented Sep 2, 2016

sekmiller commented Sep 2, 2016

sekmiller commented Sep 2, 2016

raprasad commented Sep 6, 2016 •

edited

Loading

raprasad commented Sep 6, 2016

raprasad commented Sep 9, 2016 •

edited

Loading

raprasad commented Sep 9, 2016

djbrooke commented Sep 14, 2016

pdurbin commented Jul 12, 2018

pdurbin commented Jul 17, 2018

Fix 8.5% of unknown files; Re-ingest 10,000+ Excel files (.xlsx) #3288

Fix 8.5% of unknown files; Re-ingest 10,000+ Excel files (.xlsx) #3288

Comments

raprasad commented Aug 16, 2016 • edited Loading

Fixes

djbrooke commented Aug 17, 2016

pdurbin commented Aug 17, 2016

djbrooke commented Aug 23, 2016

raprasad commented Aug 23, 2016

sekmiller commented Sep 1, 2016

sekmiller commented Sep 2, 2016

sekmiller commented Sep 2, 2016

sekmiller commented Sep 2, 2016

raprasad commented Sep 6, 2016 • edited Loading

raprasad commented Sep 6, 2016

raprasad commented Sep 9, 2016 • edited Loading

1 - Run subquery count

2 - Run update query

Sanity

raprasad commented Sep 9, 2016

djbrooke commented Sep 14, 2016

pdurbin commented Jul 12, 2018

pdurbin commented Jul 17, 2018

raprasad commented Aug 16, 2016 •

edited

Loading

raprasad commented Sep 6, 2016 •

edited

Loading

raprasad commented Sep 9, 2016 •

edited

Loading