replace download option "RData" with RDS #6678

Open
reikoch opened this issue Feb 25, 2020 · 14 comments

@reikoch

reikoch commented Feb 25, 2020

In https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/QPHMKX each of the 4 data files provides the option to download the individual file in RData format.
This is nice for R users, but in this example the downloaded data get inserted into R as an object called "x"; loading several files will repeatedly overwrite object x.

As an alternative to the RData format, I would suggest using R's RDS format https://stat.ethz.ch/R-manual/R-devel/library/base/html/readRDS.html which

  • saves exactly one object per file
  • lets the object be assigned to any name when it is read back, fully under the programmer's control.
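
A minimal sketch of the difference, using temporary files and made-up data frames. The file paths and the names `first` and `second` are illustrative only, and it assumes each RData file stores its table under the name `x`, as described above:

```r
# Two illustrative tables, each saved under the object name "x",
# mimicking what the RData download is described as doing.
rdata_1 <- tempfile(fileext = ".RData")
rdata_2 <- tempfile(fileext = ".RData")

x <- data.frame(id = 1:2)
save(x, file = rdata_1)
x <- data.frame(id = 3:5)
save(x, file = rdata_2)
rm(x)

# load() restores objects under their saved names, so the second
# call replaces the "x" created by the first.
load(rdata_1)
rows_after_first <- nrow(x)
load(rdata_2)
rows_after_second <- nrow(x)

# With RDS the caller picks the name when reading.
rds_1 <- tempfile(fileext = ".rds")
rds_2 <- tempfile(fileext = ".rds")
saveRDS(data.frame(id = 1:2), rds_1)
saveRDS(data.frame(id = 3:5), rds_2)

first  <- readRDS(rds_1)
second <- readRDS(rds_2)
```

After the two `load()` calls only the second table is reachable as `x`, while `first` and `second` coexist.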
@wibeasley

I think an rds is preferable to an RData file, for the reasons that @kuriwaki and I have discussed in the r client repo. Please let us know if you'd like to discuss it more.

@pdurbin
Member

pdurbin commented Feb 25, 2020

@wibeasley thanks for jumping in.

@reikoch thanks for opening this issue. I'm not a very good R developer and I'm ignorant about these formats but my first thought is... are you sure you want to replace the ability to download RData format with RDS format? I'm concerned about scripts that may rely on the older format (I assume it's older) for reproducibility. I would think adding RDS support would be safer, more backward compatible. So we'd offer both formats, I'm saying.

@kuriwaki
Member

Backward compatibility is probably necessary.
I would still echo @reikoch's point that the way the RData format is currently implemented is error-prone. Because these download options only apply to ingestible data (i.e., a single table, not a bundle or environment) to begin with, there is no real reason to prefer .rda over .rds in this setting.

By the way, for this particular dataverse file, I think downloading it as the original .csv file and reading it in as a csv file is preferable to transforming it into RData/RDS.

@reikoch
Author

reikoch commented Feb 26, 2020

Generally, I think it is bad to use a mechanism like the RData format, where you cannot choose the target's name when loading. load('xy.RData') can silently overwrite existing objects in the R session, whereas myvar <- readRDS('xy.rds') lets me decide under which name the content of xy.rds is brought in. There is a nice essay on this topic at https://yihui.org/en/2017/12/save-vs-saverds/.
If you feel backward compatibility is needed for a while, what about deprecating RData downloads first? There is no problem with RData for uploads.

True, R can read pretty much any file format, but rds and RData are type safe (dates are noted as such, etc.); csv is not. In addition, plain csv does not specify the encoding of the data. http://frictionlessdata.io/ might be a way out, as data packages store this metadata; xlsx does so too.

As a consumer I love type safe data formats in specified encoding!
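
To illustrate what I mean by type safety, here is a small sketch with an invented two-row table (the column names `visit` and `temp` are made up for the example and are not from the dataset):

```r
# Illustrative data with a Date column.
df <- data.frame(
  visit = as.Date(c("2020-02-25", "2020-02-26")),
  temp  = c(36.6, 37.1)
)

csv_path <- tempfile(fileext = ".csv")
rds_path <- tempfile(fileext = ".rds")
write.csv(df, csv_path, row.names = FALSE)
saveRDS(df, rds_path)

# Read both back with default settings.
from_csv <- read.csv(csv_path)
from_rds <- readRDS(rds_path)

# The CSV round trip does not restore the Date class by default;
# the RDS round trip does.
csv_is_date <- inherits(from_csv$visit, "Date")
rds_is_date <- inherits(from_rds$visit, "Date")
```

With default arguments, `read.csv()` brings the date column back as text (or a factor in older R versions), so the reader has to know to convert it.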

@kuriwaki
Member

rds and RData are type safe (dates are noted as such etc), csv is not.

  • I agree; my point was that for the particular dataset you linked to, the authors originally uploaded their data as csv. So there's no loss of information in downloading it as csv.

@reikoch
Author

reikoch commented Feb 27, 2020

OK, that means the RData file is derived from the csv file, making some assumptions about encoding. Looking at the variable VSORRESU in CSC305ABC_VS, it seems that the csv file was encoded in Latin1, which the derivation did not pick up; see the unit for temperature measurements.

Maybe just provide the original file plus a quick analysis of the encoding and csv dialect for data uploaded as csv?
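
A rough sketch of the kind of encoding check I have in mind. The bytes below are a hand-made stand-in for a Latin1-encoded "°C" and are not copied from the actual file:

```r
# Bytes for "°C" as Latin-1 would store them: 0xB0 (degree sign), 0x43 ("C").
latin1_bytes <- as.raw(c(0xB0, 0x43))
raw_text <- rawToChar(latin1_bytes)

# A lone 0xB0 byte is not valid UTF-8, so a UTF-8 check flags the problem.
is_utf8 <- validUTF8(raw_text)

# Declaring the source encoding explicitly recovers the intended text.
fixed <- iconv(raw_text, from = "latin1", to = "UTF-8")
```

A failed `validUTF8()` check does not prove the file is Latin1, it only shows that assuming UTF-8 is wrong, so the encoding would still have to be confirmed against the data.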

@cmbz

cmbz commented Aug 20, 2024

To focus on the most important features and bugs, we are closing issues created before 2020 (version 5.0) that are not new feature requests with the label 'Type: Feature'.

If you created this issue and you feel the team should revisit this decision, please reopen the issue and leave a comment.

cmbz closed this as completed Aug 20, 2024
@wibeasley

I feel the benefits of an rds file (over an RData file) are as relevant as ever.

However, I'd rather have a parquet file than an rds file (see #9897) because it has the benefits of an rds file and is also language-agnostic.

@kuriwaki, you're more in touch with how people use R w/ Dataverse. Is there a community that would substantially benefit from both rds & parquet files? Or would the parquet files satisfy their needs adequately?

@cmbz

cmbz commented Aug 20, 2024

Reopening as per @wibeasley's request.

cmbz reopened this Aug 20, 2024
@wibeasley

@cmbz, I'm not sure it needs to be reopened.

If @kuriwaki and others agree that all the benefits of an rds file are provided by a parquet file (for the R + Dataverse users), I think we'd rather the rds effort be conserved and redirected towards a parquet option (which would benefit other languages too, like Python).

Like rds, parquet files are compressed and strongly typed. There should also be packages to handle the hard work, so the Dataverse software doesn't need to get entangled with the problems of RData files and the messiness of Rserve described by @landreev.
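
For R users, the reading side could look roughly like this. It is only a sketch: it assumes the `arrow` package is installed (it is not part of base R), and the table and column names are invented:

```r
# Assumes the 'arrow' package is available; skipped otherwise.
if (requireNamespace("arrow", quietly = TRUE)) {
  df <- data.frame(
    visit = as.Date(c("2024-08-20", "2024-08-23")),
    temp  = c(36.6, 37.1)
  )
  pq_path <- tempfile(fileext = ".parquet")
  arrow::write_parquet(df, pq_path)

  # As with readRDS(), the caller chooses the object name,
  # and the Date column keeps its type.
  visits <- arrow::read_parquet(pq_path)
  visit_class <- class(visits$visit)
}
```

The same file could then be opened from Python or other languages with their own parquet readers.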

@cmbz

cmbz commented Aug 20, 2024

Got it, thanks @wibeasley. Please chime in here @kuriwaki and let me know if you're okay with closing again.

@kuriwaki
Member

kuriwaki commented Aug 23, 2024

My views now are closer to #7249, where I suggest getting rid of the RData/RDS exports of ingested files altogether.

I think having the rdata export format (again, for ingested files, which I think is the question here) is not necessary for R users, and might only confuse beginners of R. R users should read ingested files as plain-text files, not as a custom R format.
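
As a sketch of what reading an ingested file as plain text might look like: the tab-separated layout, the column names, and the UTF-8 encoding here are assumptions for illustration, not a description of what Dataverse actually serves.

```r
# Stand-in for a downloaded plain-text file (tab-separated, made-up columns).
tab_path <- tempfile(fileext = ".tab")
writeLines(
  c("visit\ttemp", "2024-08-20\t36.6", "2024-08-23\t37.1"),
  tab_path
)

# Stating column classes and encoding up front keeps the types explicit
# without relying on an R-specific file format.
visits <- read.delim(
  tab_path,
  colClasses = c("Date", "numeric"),
  fileEncoding = "UTF-8"
)
```

The reader chooses the object name and the column types in one call, which covers the main things the rds export was meant to provide.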

@kuriwaki
Member

edit: added a missing negation, typo
