
Spike: Investigate challenges with how Dataverse software handles shapefiles #8816

Open
jggautier opened this issue Jun 27, 2022 · 9 comments


@jggautier
Contributor

jggautier commented Jun 27, 2022

Challenges with how the Dataverse software handles shapefiles were mentioned in a GitHub issue at #6873 (comment). My questions then were more about computational reproducibility, but this has come up again because a depositor I'm trying to help has concerns about how this functionality is complicating their upload of a lot of data to the Harvard Dataverse Repository.

The depositor, who's using Dataverse APIs to upload files that are not on their computer (I think the files are on an AWS server), may or may not be able to detect and double-zip all shapefiles in order to prevent the Dataverse software from zipping the shapefiles when they're uploaded to the repository. I'll ask the depositor if they can do this.
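
If they can script it, the double-zip workaround could look roughly like the sketch below (Python, with made-up file names); the idea is that Dataverse unpacks one layer of zip on upload, so the inner zip arrives intact and its shapefile members are never extracted and re-zipped:

```python
import zipfile
from pathlib import Path

def double_zip(original_zip: Path) -> Path:
    """Wrap an existing .zip inside a second .zip so that Dataverse's
    one-layer unzip on upload leaves the inner zip untouched."""
    wrapped = original_zip.parent / (original_zip.stem + "_wrapped.zip")
    with zipfile.ZipFile(wrapped, "w", zipfile.ZIP_DEFLATED) as zf:
        zf.write(original_zip, arcname=original_zip.name)
    return wrapped

# e.g. double_zip(Path("boston.zip")) -> boston_wrapped.zip
```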

But:

  • When uploading files using the Dataverse APIs, is it possible to tell the Dataverse software not to put shapefiles in a .zip file?
  • As @qqmyers mentioned during a meeting this morning, can the repository's ingest settings be temporarily changed so that the Dataverse software doesn't zip the shapefiles in this depositor's uploads?

For more context, the email conversation is in IQSS's support email system at https://help.hmdc.harvard.edu/Ticket/Display.html?id=322323 and the data is from Redistricting Data Hub.

More broadly, I think more research should be done on the value of the Dataverse software's handling of shapefiles, including the questions and discussion in the GitHub issue comment at #6873 (comment).

The issue at #7352 might also be related.

Having ways for depositors to learn about this behavior before they start uploading would be helpful. The behavior is fully documented only in the Developer Guides (https://guides.dataverse.org/en/6.2/developers/geospatial.html); the User Guides reference it but don't explain it, and the UI doesn't mention it at all.

@jggautier jggautier changed the title Spike: Investigate challenges with how Dataverse software handles shapefiiles Spike: Investigate challenges with how Dataverse software handles shapefiles Jun 27, 2022
@mreekie mreekie added the spike label Jun 27, 2022
@mreekie

mreekie commented Jun 27, 2022

Next steps:

  • Get more details. There seem to be two parts here: the customer's immediate question and the longer-term solution.
  • The problem itself has been around a while. Can we get a temporary solution for this customer and then get this problem reprioritized?

@qqmyers
Member

qqmyers commented Jun 27, 2022

Looking at the code, I think the unzip and the unzip/rezip for shapefiles isn't considered ingest, so setting the ingest size limit won't help (I could be wrong). That said, because this involves unzipping, it could be disabled if the depositor's dataset used a store that has direct upload enabled. (Direct upload doesn't unzip, because pulling the files to Dataverse to unzip after direct-uploading to the S3 bucket essentially defeats the purpose of putting the file directly in S3 to start with.) Direct upload, when enabled, can be done via the UI or via the direct-upload API, which can be used directly or through DVUploader. pyDataverse doesn't yet use the direct-upload API, so it would not handle this case at present.
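
A rough sketch of that direct-upload flow in Python, following the Developer Guide (single-part case only; the server URL, PID, and token below are placeholders, and the dataset's store must have direct upload enabled):

```python
import json
import requests

SERVER = "https://demo.dataverse.org"  # placeholder installation
PID = "doi:10.70122/FK2/EXAMPLE"       # placeholder dataset PID
API_TOKEN = "xxxxxxxx"                 # placeholder API token

def direct_upload(path, size, md5):
    # Step 1: ask Dataverse for a presigned upload URL for a file of this size.
    r = requests.get(
        f"{SERVER}/api/datasets/:persistentId/uploadurls",
        params={"persistentId": PID, "size": size},
        headers={"X-Dataverse-key": API_TOKEN},
    )
    r.raise_for_status()
    info = r.json()["data"]
    # Step 2: PUT the bytes straight into the S3 bucket; they never pass
    # through the Dataverse server, so no unzip/rezip can happen.
    with open(path, "rb") as f:
        requests.put(
            info["url"], data=f, headers={"x-amz-tagging": "dv-state=temp"}
        ).raise_for_status()
    # Step 3: register the already-stored file with the dataset.
    json_data = {
        "storageIdentifier": info["storageIdentifier"],
        "fileName": path.rsplit("/", 1)[-1],
        "mimeType": "application/zip",
        "md5Hash": md5,
    }
    requests.post(
        f"{SERVER}/api/datasets/:persistentId/add",
        params={"persistentId": PID},
        headers={"X-Dataverse-key": API_TOKEN},
        data={"jsonData": json.dumps(json_data)},
    ).raise_for_status()
```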

@jggautier
Contributor Author

Thanks as always! The shapefiles that the depositor is uploading are not zipped (and the depositor is trying to prevent the Dataverse software from zipping them), so I think this particular case involves only zipping. Does what you wrote also mean that direct upload will prevent the Dataverse software from zipping the files?

@mreekie

mreekie commented Jun 27, 2022

Touched base with Julian on the issue in person. Here is where we are at and the next steps:

What we know:

  • The customer got unexpected behavior when uploading files: they did not expect the files to be zipped.
  • We think the customer is dealing with at least 100+ datasets, so telling them to go ahead and we'll figure this out later is not the best next step.
  • Leonid appears to have looked at this question previously.
  • Danny appears to have looked into this at some point and may have decided that the behavior is OK.
  • If the behavior is OK, it's not documented.
  • The customer is a dev working on behalf of a group that has a lot of experience with geospatial data.
  • The system is working as designed. The current implementation was done in collaboration with World Map, and World Map requested, or at least OK'd, the way it was implemented.
  • We are stuck this week, with so many people out on vacation.

We don't know:

  • Is the system working correctly? If so, we have a documentation issue to solve.
  • Does Dataverse need to process the affected files in a special way? If so, we have a code issue or need a workaround to implement.

Next steps:

  • We have established an RT ticket to handle the specifics of helping this customer.
  • We will let them know what we know, and where we are at resource-wise this week, via the ticket.
  • We will retrace the steps taken the last time this question was asked to see if we can find the right person to answer the question about how these files need to be handled. Is there an industry standard we're following?
  • Next week we will schedule a meeting with the right experts to discuss this, and the outcome of that will drive the ultimate solution.

@qqmyers
Member

qqmyers commented Jun 27, 2022

Perhaps I'm missing something but all I see in the Dataverse code is a check for application/zipped-shapefile and code to unzip/rezip. Are we sure it is Dataverse zipping and not the upload util? For example, I see https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/U3YXNX which has unzipped shapefile elements. (FWIW I can't see the RT ticket so I don't know any details in it.)

@jggautier
Contributor Author

jggautier commented Jun 27, 2022

Ugh, when I download the shapefiles in https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/U3YXNX and re-upload them to a new dataset on Demo Dataverse and on Harvard's repo through the UI (dragging and dropping the files into the "Upload with HTTP via your browser" panel), they aren't being zipped as I've experienced before. I also tested with another set of shapefiles I've used before, and Demo and Harvard's repo aren't zipping them either. So now I'm confused. This zipping of shapefiles is also what I described in the GitHub comment I mentioned.

@qqmyers, by "the upload util" do you mean DVUploader? As far as I know the depositor is using only the Dataverse Native API for uploading files, including shapefiles that aren't zipped. And the depositor has shared screenshots of the shapefiles being zipped after upload.
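
For reference, I believe the depositor's uploads go through the Native API's add-file endpoint, roughly like the sketch below (server, PID, and token are placeholders); this is the path where files pass through the Dataverse server and its zip handling applies:

```python
import json
import requests

def native_upload(server, pid, token, path):
    # POST the file through the Dataverse server itself; any zip received
    # this way is subject to the server's unzip and shapefile handling.
    with open(path, "rb") as f:
        r = requests.post(
            f"{server}/api/datasets/:persistentId/add",
            params={"persistentId": pid},
            headers={"X-Dataverse-key": token},
            files={"file": (path.rsplit("/", 1)[-1], f)},
            data={"jsonData": json.dumps({"description": "shapefile upload"})},
        )
    r.raise_for_status()
    return r.json()
```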

@qqmyers
Member

qqmyers commented Jun 27, 2022

DVUploader won't do any file manipulation. I'm just guessing that there may be something else involved that is creating the zip (which Dataverse would then unzip/rezip).

@jggautier
Contributor Author

jggautier commented Jun 27, 2022

I got some clarification from the depositor about what's happening with their shapefiles. It's a bit different from what I've been describing in this issue so far. They are uploading multiple zip files, and some of those zip files contain sets of shapefiles. For a fake example:

boston.zip contains:

  1. shapefile_set_1.dbf
  2. shapefile_set_1.prj
  3. shapefile_set_1.shp
  4. shapefile_set_1.shx
  5. shapefile_set_1.cst
  6. readme.txt

Upon upload, the Dataverse software unzips boston.zip and then rezips only the first four files (the four file types mentioned in the Developer Guide). shapefile_set_1.cst and readme.txt are not included in the zip file that the Dataverse software creates.

So after this unzipping and partial re-zipping, the file table shows:

  1. boston.zip (which would contain the first four files)
  2. shapefile_set_1.cst
  3. readme.txt

The depositor expects all six files (or however many files are in the actual zip files that the depositor needs to upload) to end up in the same zip file. In my made-up example, that would include the shapefile_set_1.cst and readme.txt files.
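
To make the split concrete, here's a rough sketch that predicts, for a given zip, which members Dataverse would pull into the re-created shapefile zip and which would be left loose; the extension list is just the four types from the Developer Guide, and the actual server-side logic may differ:

```python
import zipfile

# The four shapefile component extensions from the Developer Guide; assumed
# here to be the full grouping rule, which may be a simplification.
SHAPEFILE_PARTS = {".shp", ".shx", ".dbf", ".prj"}

def split_members(zip_path):
    """Return (members Dataverse would rezip, members left as loose files)."""
    rezipped, loose = [], []
    with zipfile.ZipFile(zip_path) as zf:
        names = zf.namelist()
    # A "shapefile set" is keyed by the base name of a .shp member.
    stems = {n.rsplit(".", 1)[0] for n in names if n.lower().endswith(".shp")}
    for n in names:
        stem, _, ext = n.rpartition(".")
        if stem in stems and "." + ext.lower() in SHAPEFILE_PARTS:
            rezipped.append(n)
        else:
            loose.append(n)
    return rezipped, loose

# split_members("boston.zip") ->
#   (['shapefile_set_1.dbf', 'shapefile_set_1.prj',
#     'shapefile_set_1.shp', 'shapefile_set_1.shx'],
#    ['shapefile_set_1.cst', 'readme.txt'])
```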

I don't know much about direct upload, but from the Developer guides it sounds like something a Dataverse installation admin would have to enable, right? Maybe this is a workaround that someone else on the team at IQSS could help with?

The depositor let me know that they have no hard deadline for the upload, and they'll continue working on the data that doesn't involve shapefiles, but they would like to get all of the data uploaded as soon as possible. I let them know that we're short-handed this week and maybe next week, and that we will continue updating them as we learn more.

@pdurbin
Member

pdurbin commented Oct 1, 2022

@pdurbin pdurbin removed the spike label Oct 7, 2023