Fix 8.5% of unknown files; Re-ingest 10,000+ Excel files (.xlsx) #3288
This has value beyond analytics. @sbarbosadataverse - I know we have discussed this, but I couldn't find another GitHub issue. Is there one, or should we just track the work here?
@raprasad - is this a code change that needs to be tied to a release, or is there some script/process that needs to be run/rerun to re-ingest these files? Thanks!
@raprasad What is the query that you are using to get these results?
Before digging into the ingest code, I created some simple .xlsx files. When I ingest them, they are converted into .tab files with a content type of "text/tab-separated-values". @landreev is there a trick to get them to be ingested as .xlsx files? @raprasad is there anything else you can tell us about the files in question? Could they have been in 3.6 and migrated to 4.0?
More information - when I try to download one of my original .xlsx files, I am offered that format as the original format.
@raprasad are you joining datatable in your query? That holds the original file format for files that are ingested as data tables.
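For context, a minimal sketch of such a join, assuming the datatable table has a datafile_id foreign key and an originalfileformat column holding the pre-ingest content type (the column names are assumptions based on this thread, not checked against the schema):

-- Hypothetical: list ingested .xlsx files along with the original (pre-ingest)
-- format recorded on their data table.
SELECT df.id, df.contenttype, dt.originalfileformat
FROM datafile df
JOIN datatable dt ON dt.datafile_id = df.id
JOIN filemetadata fm ON fm.datafile_id = df.id
WHERE fm.label LIKE '%.xlsx';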
After talking to @sekmiller, questions to answer:
Answer to (B) above:
NOTE: This confirms the original issue.
Update query to fix content type:

1 - Run subquery count:

SELECT count(distinct(df.id))
FROM datafile df, filemetadata fm
WHERE fm.datafile_id = df.id
AND fm.label LIKE '%.xlsx'
AND df.contenttype = 'application/octet-stream';

2 - Run update query:

UPDATE datafile
SET contenttype = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet'
WHERE id IN (SELECT distinct(df.id)
    FROM datafile df, filemetadata fm
    WHERE fm.datafile_id = df.id
    AND fm.label LIKE '%.xlsx'
    AND df.contenttype = 'application/octet-stream');

3 - Sanity check
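A minimal sketch of that sanity check, assuming it simply re-runs the counts after the update (an illustration, not the exact query that was used):

-- No .xlsx files should remain with the generic octet-stream content type:
SELECT count(distinct(df.id))
FROM datafile df, filemetadata fm
WHERE fm.datafile_id = df.id
AND fm.label LIKE '%.xlsx'
AND df.contenttype = 'application/octet-stream';

-- And the count of .xlsx files now carrying the Excel MIME type should match
-- the count from step 1:
SELECT count(distinct(df.id))
FROM datafile df, filemetadata fm
WHERE fm.datafile_id = df.id
AND fm.label LIKE '%.xlsx'
AND df.contenttype = 'application/vnd.openxmlformats-officedocument.spreadsheetml.sheet';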
Almost done; after @kcondon runs "index all", content types will be correct in the db and Solr.
The update of .xlsx files from unknown to known has been completed. This provides significant value in that it reduces the number of files that are undefined in our reports. The re-ingest of files is an outstanding issue, one that would be better served by delaying until after we have file versioning in place (#2290). I'm going to move this issue out of 4.5.1 and mark it as not started. It will be reassigned to a future release and we can make the ingest changes then. Some notes for when we pick this up in the future, gleaned from chat: Agreed that:
Potential workflow:
We should review the notification text to make sure it explains that the UNF of the previous version can still be found (and referenced) on the dataset landing page. We should also make sure that the new version tracks what was changed (e.g., UNF generated from the Excel file). My question about workflow 1: if we do a batch ingest for all Excel files, will we know ahead of time that they are "ingestable"? Finally, a number of users had asked to turn off the "explore" (TwoRavens) option for their tabular files. Could we run this once there is the option to remove the "explore" button for a data file? One other note: the current way we store files is that the data table / UNF are stored with the file, not the version. So we would have to delete the file, then re-add it (what will eventually also be file replace?). Does this present any issues? One is that we have technically always said that that is a major version change. The point is we don't currently have a way to store a file in both an uningested state and an ingested state, except to treat it as a new file (i.e., a new version of the file). And to maintain version integrity, we need to be able to store the file in both states.
@djbrooke yes, especially now that #2301 about Stata has been addressed, we should revisit how best to retry ingesting the whole variety of files that failed ingest with older versions of Dataverse. Is that what this issue represents? Should we create a fresh one instead? In 06ef690 I tested that a certain file of Brooke's is now expected to ingest properly once we cut a release. 😄
Closing in favor of this new issue: File Reingest API #4865
Needed for #2202 to work properly
Fixes:
- Update contenttypes in the db (done)

According to a Dataverse snapshot at the end of July:
- 13,696 files with the extension ".xlsx" (updated from 19,362)
- 10,460 files have the contenttype "application/octet-stream" (updated from 15,187)
- 10,460 files do not seem to be identified as Excel (updated from 15,187)
- 8.5% of Dataverse's 122k+ "unknown files" (updated from 10.6%)

Numbers updated on 9/6/16. All numbers above are from a late-July prod snapshot.
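A rough sketch of how counts like these could be pulled from the database, based on the update queries shown earlier in this thread (an illustration, not necessarily the exact snapshot queries):

-- Total files whose label ends in .xlsx:
SELECT count(distinct(df.id))
FROM datafile df, filemetadata fm
WHERE fm.datafile_id = df.id
AND fm.label LIKE '%.xlsx';

-- Of those, how many still have the generic content type
-- (10,460 out of roughly 122k "unknown" files is about 8.5%):
SELECT count(distinct(df.id))
FROM datafile df, filemetadata fm
WHERE fm.datafile_id = df.id
AND fm.label LIKE '%.xlsx'
AND df.contenttype = 'application/octet-stream';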
Prod numbers, 9/6 below: