Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix 8.5% of unknown files; Re-ingest 10,000+ Excel files (.xlsx) #3288

Closed
1 of 2 tasks
raprasad opened this issue Aug 16, 2016 · 15 comments
Closed
1 of 2 tasks

Fix 8.5% of unknown files; Re-ingest 10,000+ Excel files (.xlsx) #3288

raprasad opened this issue Aug 16, 2016 · 15 comments

Comments

@raprasad
Copy link
Contributor

raprasad commented Aug 16, 2016

Needed for #2202 to work properly

Fixes

  • Update contenttypes in the db (done)
  • Attempt to re-ingest each file

According to a dataverse snapshot at end of July:

  • There are 19,362 13,696 files with extension ".xlsx"
  • Of these files, 15,187 10,460 files, are contenttype: "application/octet-stream"
    • In other words, these 15,187 10,460 files do not seem to be identified as Excel
    • They are classified as unknown
  • These account for a full 10.6% 8.5% of Dataverses's 122k+ "unknown files"
# Count all "unknown" files
SELECT count(distinct(datafile.id)) FROM "datafile" INNER JOIN "dvobject" ON ("datafile"."id" = "dvobject"."id") WHERE "datafile"."contenttype" = 'application/octet-stream';

# Count all .xlsx files 
SELECT count(distinct(datafile.id)) FROM "datafile" INNER JOIN "dvobject" ON ("datafile"."id" = "dvobject"."id") WHERE ("datafile"."id" IN (SELECT fm."datafile_id" FROM "filemetadata" fm WHERE fm."label" LIKE '%.xlsx'));

# Count .xlsx files labeled as unknown
SELECT count(distinct(datafile.id)) FROM "datafile" INNER JOIN "dvobject" ON ("datafile"."id" = "dvobject"."id")
WHERE ("datafile"."id" IN (SELECT fm."datafile_id" FROM "filemetadata"fm WHERE fm."label" LIKE '%.xlsx')
AND "datafile"."contenttype" = 'application/octet-stream')

Numbers updated on: 9/6/16. All numbers above from a late July prod. snapshot.


Prod numbers, 9/6 below:

  "data": {
    "number_unique_extensions": 1454,
    "file_extension_counts": [
      {
        "extension": "",
        "count": 36657,
        "total_count": 124357,
        "percent_string": "29.477%"
      },
      {
        "extension": ".xz",
        "count": 12423,
        "total_count": 124357,
        "percent_string": "9.990%"
      },
      {
        "extension": ".xlsx",
        "count": 10471,
        "total_count": 124357,
        "percent_string": "8.420%"
      },
(etc, etc)
@raprasad raprasad changed the title Re-ingest 14,000+ Excel files (.xlsx) that are marked as "unknown" Fix 10.6% of unkown files; Re-ingest 14,000+ Excel files (.xlsx) Aug 16, 2016
@raprasad raprasad changed the title Fix 10.6% of unkown files; Re-ingest 14,000+ Excel files (.xlsx) Fix 10.6% of unknown files; Re-ingest 14,000+ Excel files (.xlsx) Aug 16, 2016
@djbrooke
Copy link
Contributor

This has value beyond analytics.

@sbarbosadataverse - I know we have discussed this but I couldn't find another Github issue. Is there one, or should we just track the work here?

@pdurbin
Copy link
Member

pdurbin commented Aug 17, 2016

This issue strikes me as a bit of a duplicate of #585 and #2264.

@djbrooke djbrooke changed the title Fix 10.6% of unknown files; Re-ingest 14,000+ Excel files (.xlsx) Fix 10.6% of unknown files; Re-ingest 14,000+ Excel files (.xlsx) Aug 17, 2016
@djbrooke
Copy link
Contributor

@raprasad - is this a code change that needs to be tied to release or is there some script/process that needs to be run/rerun to re-ingest these files? Thanks!

@raprasad
Copy link
Contributor Author

@djbrooke: I think @landreev can re-run ingest code. Can go over more details post stand-up.

@sekmiller
Copy link
Contributor

@raprasad What is the query that you are using to get these results?

@sekmiller
Copy link
Contributor

Before digging into the ingest code I created some simple .xlsx files. When I ingest them they are converted into .tab files with a content type of "text/tab-separated-values" @landreev is there a trick to get them to be ingested at xlsx files?

@raprasad is there anything else you can tell about the files in question? Could they have been in 3.6 and migrated to 4.0?

@sekmiller
Copy link
Contributor

More information - when I try to download one of my original .xlsx files, I am offered that format as the original format.

@sekmiller
Copy link
Contributor

@raprasad are you joining datatable in your query? That holds the original file format for files that are ingested as data tables.

@djbrooke djbrooke added in progress and removed ready labels Sep 2, 2016
@raprasad
Copy link
Contributor Author

raprasad commented Sep 6, 2016

after talking to @sekmiller, ?s to answer:

  • (A) When were these xlsx files uploaded? Pre/post 4.0?
    • Answer: The content type of application/octet-stream was not applied after 7/27/2015
  • (B) How does the home page "File Type" facet capture files? e.g. How are excel file files classified in this facet?
    • Answer: See next comment

@raprasad
Copy link
Contributor Author

raprasad commented Sep 6, 2016

Answer to (B) above:

  1. One the home page, the "File type" facet for properly identified Excel files is either:
    1. tabulardata. User uploads an .xlsx file and ends up with a .tab as in @sekmiller's previous comment. May be problematic/confusing for user but that's a separate issue
    2. document. An .xlsx file not "ingested" is considered a file type "document"
  2. Regarding .xlsx files listed as file type document, this is how it happens:
    1. This code determines that an .xlsx file is a type "document":
    2. This code gives it the facet file type as "document":
    3. The file type "document" above is used for solr indexing

NOTE: This confirms that the original issue .xlsx files marked as content type "application/octet-stream" is legitimate

@raprasad raprasad changed the title Fix 10.6% of unknown files; Re-ingest 14,000+ Excel files (.xlsx) Fix 10.6% of unknown files; Re-ingest ~~14,000+~~ 10,000+ Excel files (.xlsx) Sep 6, 2016
@raprasad raprasad changed the title Fix 10.6% of unknown files; Re-ingest ~~14,000+~~ 10,000+ Excel files (.xlsx) Fix 8.5% of unknown files; Re-ingest 10,000+ Excel files (.xlsx) Sep 6, 2016
@raprasad raprasad added Feature: File Upload & Handling and removed Component: Code Infrastructure formerly "Feature: Code Infrastructure" labels Sep 6, 2016
@raprasad
Copy link
Contributor Author

raprasad commented Sep 9, 2016

update query to fix content type:

1 - Run subquery count

SELECT count(distinct(df.id))
FROM datafile df,  filemetadata fm
WHERE fm.datafile_id = df.id
AND fm.label LIKE '%.xlsx'
AND df.contenttype = 'application/octet-stream';

2 - Run update query

update datafile set contenttype='application/vnd.openxmlformats-officedocument.spreadsheetml.sheet' where id in (SELECT distinct(df.id)
FROM datafile df,  filemetadata fm
WHERE fm.datafile_id = df.id
AND fm.label LIKE '%.xlsx'
AND df.contenttype = 'application/octet-stream');

Sanity

  • Rows in 1 and 2 should match. Ran test on copy of prod.
  • Altnerative: run update with actual list of ids

@raprasad
Copy link
Contributor Author

raprasad commented Sep 9, 2016

almost done, after @kcondon runs index all, content types will be correct in db and solr

@raprasad raprasad self-assigned this Sep 9, 2016
raprasad added a commit that referenced this issue Sep 13, 2016
raprasad added a commit that referenced this issue Sep 13, 2016
raprasad added a commit that referenced this issue Sep 13, 2016
raprasad added a commit that referenced this issue Sep 13, 2016
@djbrooke
Copy link
Contributor

The update of .xlsx files from unknown to known has been completed. This provides significant value in that it reduces the number of files that are undefined in our reports.

The re-ingest of files is an outstanding issue, one that would be better served by delaying until after we have file versioning in place (#2290). I'm going to move this issue out of 4.5.1 and mark it as not started. It will be reassigned to a future release and we can make the ingest changes then.


Some notes for when we pick this up in the future, gleaned from chat:

Agreed that:

  1. We want to ingest unprocessed Excel files (and other tabular files in the future)
  2. Ingestion will either:
    • Create a new UNF if one doesn't exist
    • Change the UNF if there is one. (e.g. One file was ingested but now two are.)
  3. A new (minor) version should be created

Potential workflow:

  1. Ingest file, Create new dataset version, Notify user
  • If dataset is published, then publish the new version
  • If dataset is unpublished, then leave new version unpublished

We should review to the notification text to make sure it explains that the UNF of the previous version can still be found (and referenced) in the dataset landing page. We should also make sure that thew new version tracks what was changed (e.g. UNF generated from excel file). My question about workflow 1 is that if we do a batch ingest for all excel files, will we know ahead of time that they are "ingestable"? Finally, a number of users had asked to turn off the "explore" (TwoRavens) option for theirs tabular files. Could we run this once there is the option to remove the "explore" button for a data file?

One other note. The current wait we store files is that the data table / unf are stored with the file, not the version. So we would have to delete the file, then re add it. (What will eventually also be file replace?) Does this present any issues? One is that we technically have always said that that is a major version change.

The point is we don't currently have a way to store a file in both an uningested state and an ingested state, except to treat as a new file (ie a new version of the file). And to maintain version integrity, we need to be able to store the file in both states.

@pdurbin
Copy link
Member

pdurbin commented Jul 12, 2018

The re-ingest of files is an outstanding issue

@djbrooke yes, especially now that #2301 about Stata has been addressed we should revisit how best to retry ingesting a whole variety of files that failed ingest with older versions of Dataverse. Is that what this issue represents? Should we create a fresh one instead? In 06ef690 I tested that a certain Brooke's file are now expected to ingest properly once we cut a release. 😄

@pdurbin
Copy link
Member

pdurbin commented Jul 17, 2018

Closing in favor of this new issue: File Reingest API #4865

@pdurbin pdurbin closed this as completed Jul 17, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants