Files being converted automatically from .csv to .tab #6385
Comments
@tainguyenbui thanks for the detailed writeup. Since this is a core part of the application, we've been resistant to change it (see #2199 (comment)) but new use cases are always helpful for revisiting functionality. You should be able to get a .csv file out by specifying "original" as part of the API call, but we had not considered the locking delay implications. @landreev @scolapasta (and whoever else is interested) let's catch up about this sometime over the next few days. Community comments welcome here as well!
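As a rough illustration of the "original" option mentioned above, the sketch below builds a download URL that asks for the file as it was uploaded. It assumes the Data Access API's `format=original` query parameter is available on the installation; the server URL and file id are placeholders, and `original_file_url` is a hypothetical helper name, not part of Dataverse.

```python
from urllib.parse import urlencode


def original_file_url(server_url: str, file_id: int) -> str:
    """Build a Data Access API URL that requests the originally uploaded file.

    Assumption: the installation serves /api/access/datafile/{id} and
    honours format=original for ingested tabular files.
    """
    query = urlencode({"format": "original"})
    return f"{server_url.rstrip('/')}/api/access/datafile/{file_id}?{query}"


# Placeholder values for illustration only.
print(original_file_url("https://demo.dataverse.org", 42))
```

Any HTTP client can then fetch that URL, passing the API token if the file is restricted.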
@djbrooke, the workflow we are following is a bit different from just getting the .csv file. The application retrieves the information about a dataset, which includes file ids, file names, etc. We have also created a tool that reads .csv files and displays information about the .csv. The user is able to modify that .csv, and then we upload the new version of the .csv file and publish the new dataset version. Since the .tab and .csv files have different content-types, and the ingestion to .tab does not work in all scenarios, we end up with errors that are difficult to handle, such as errors due to the different content-types. For that reason, we thought that dealing only with .csv could help us simplify the process and reduce trouble.
@tainguyenbui and @MYF95 thanks for talking about this earlier. For the issues with the dataset locking, it's good to hear that you're using a queue to minimize the impacts of the delays. For the replace issue, @landreev just mentioned that we have "forceReplace" that should allow you to upload a csv even though it expects a tab for content-type. Editing the example from the docs (http://guides.dataverse.org/en/latest/api/native-api.html#replacing-files) it would be like: curl -H “X-Dataverse-key:$API_TOKEN” -X POST -F ‘[email protected]’ Hopefully this does the trick and this is just a documentation issue where we can do better. Let me know!
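Since the curl example above did not survive intact, here is a minimal sketch of how the pieces of a replace request with "forceReplace" might be assembled. It assumes the native API's /api/files/{id}/replace endpoint, the X-Dataverse-key header, and a jsonData form field as described in the linked guide; `build_replace_request` is a hypothetical helper, the server, file id, and token are placeholders, and nothing is actually sent.

```python
import json


def build_replace_request(server_url: str, file_id: int, api_token: str,
                          description: str = "") -> dict:
    """Assemble the parts of a file-replace request with forceReplace enabled.

    Nothing is sent here; the returned dict can be handed to any HTTP client
    that supports multipart/form-data uploads.
    """
    json_data = {
        "description": description,
        # forceReplace lets a .csv replace a file whose stored
        # content type is tab-separated values.
        "forceReplace": True,
    }
    return {
        "method": "POST",
        "url": f"{server_url.rstrip('/')}/api/files/{file_id}/replace",
        "headers": {"X-Dataverse-key": api_token},
        # Sent as the "jsonData" form field, next to the "file" field.
        "form": {"jsonData": json.dumps(json_data)},
    }


# Placeholder values for illustration only.
request = build_replace_request("https://demo.dataverse.org", 42,
                                "API_TOKEN_HERE", "Updated CSV")
print(request["url"])
print(request["form"]["jsonData"])
```

The new .csv itself would travel as the "file" part of the same multipart request.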
@djbrooke I've tried the Additionally, the two properties below when retrieving dataset information let us know that despite the file being a
original response:
@tainguyenbui you might be interested in the discussion at #4000 which talks about original files and how CSV is just as good of an archival format as TSV. The other day @djbrooke and I talked about this issue and how we might want to clarify and separate two different goals of ingest (both in terms of code/behavior and docs, I'd say):
@pdurbin, thanks for the links. I had already looked at issue #4000. Unfortunately, our files are produced by an existing application, and they are the files that the platform users are going to be using for now, which is why we wouldn't want to break what the users already have in place. It sounds great that there are some discussions on how to enhance the existing data ingestion and data navigation.
Hi @tainguyenbui, good to hear that forceReplace is working for you. I'm going to close this issue for now since we won't make any changes to the ingest process. @pdurbin thanks also for summarizing the info from our conversation the other day!
Hi Dataverse Team!
Description
As part of data ingestion, when uploading a file or replacing an existing file in a dataset, Dataverse currently converts lightweight .csv files into .tab.
Additionally, the conversion to tabular format locks the dataset until the tabular file is created, so the dataset cannot be published automatically.
The problem we have
The platform we are developing, which uses Dataverse as a data repository, does not need .csv files to be converted. In fact, it is currently being developed to cater for .csv files only, and it is likely to stay that way in the future.
Due to the change of extension, when we attempt to replace what we initially uploaded with a newer version, we experience issues due to a content-type mismatch, one file being a comma-separated values file and the other a tab-separated file.
Additionally, in our current implementation, we would like to replace the file and, if successful with a 200 OK response, then publish the dataset automatically with the new version of the file. This is currently not possible because the dataset is locked whilst the file is being converted to .tab.
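One way to work within the current behaviour is to wait for the lock to clear before publishing. The sketch below is only an outline under assumptions: `replace_file`, `fetch_locks`, and `publish` are hypothetical caller-supplied functions (for instance, wrappers around the replace endpoint, GET /api/datasets/{id}/locks, and the publish action), and the timeout values are arbitrary.

```python
import time
from typing import Callable, List


def wait_until_unlocked(fetch_locks: Callable[[], List[dict]],
                        timeout_s: float = 60.0,
                        poll_interval_s: float = 2.0,
                        sleep: Callable[[float], None] = time.sleep) -> bool:
    """Poll until the dataset reports no locks or the timeout elapses.

    fetch_locks is a caller-supplied (hypothetical) function returning the
    dataset's current locks as a list. Returns True when none remain.
    """
    waited = 0.0
    while True:
        if not fetch_locks():
            return True
        if waited >= timeout_s:
            return False
        sleep(poll_interval_s)
        waited += poll_interval_s


def replace_then_publish(replace_file: Callable[[], int],
                         fetch_locks: Callable[[], List[dict]],
                         publish: Callable[[], None],
                         **wait_kwargs) -> bool:
    """Replace the file, wait for the ingest lock to clear, then publish."""
    if replace_file() != 200:
        return False  # replace failed: do not publish
    if not wait_until_unlocked(fetch_locks, **wait_kwargs):
        return False  # still locked after the timeout
    publish()
    return True
```

Passing the three operations in as functions keeps the polling logic independent of any particular HTTP client and makes it easy to run from a queue worker.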
Desired behaviour
We would love to have the ability to prevent those .csv files from being converted into different formats. This would also avoid locking the dataset while .csv is converted into .tab, so we would be able to publish the dataset straight away.
Possible solutions
ingestion as it is already being done for large files.
replace endpoint
We are open to discussions, since we are very interested in keeping the original .csv files untouched in the dataset.
Thanks a lot in advance for your help
Regards,
Tai