
Create a .qsv file format that is an implementation of W3C's CSV on the Web #1982

Open
jqnatividad opened this issue Jul 18, 2024 · 12 comments
Labels
CKAN (interoperability with CKAN Data Management System) · datapusher+ (for Datapusher+) · DCAT3 · enhancement (New feature or request. Once marked with this label, it's in the backlog.) · qsv pro (requires backend/cloud services)

Comments

@jqnatividad
Collaborator

jqnatividad commented Jul 18, 2024

Currently, qsv creates, consumes and validates CSV files hewing closely to the RFC4180 specification as interpreted by the csv crate.

However, it doesn't let us save additional metadata about the CSV file (dialect, delimiter used, comments, DOI, URL, etc.), nor about the data the file contains (summary statistics, data dictionary, creator, last updated, hash of the data, etc.).

The request is to create a .qsv file format that implements W3C's CSV on the Web specification (following the guidance at https://csvw.org) and stores schemata/metadata/data in the .qsv file: not just the schema info, but summary and frequency statistics as well; a container for DCAT 3/CKAN package/resource metadata; etc.
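
For illustration only, a minimal CSVW metadata descriptor for a table could look like the following (the file name, columns and extra properties are hypothetical, not a committed .qsv layout):

```json
{
  "@context": "http://www.w3.org/ns/csvw",
  "url": "fruits.csv",
  "dc:title": "Fruit prices (sample)",
  "dialect": { "delimiter": ",", "header": true },
  "tableSchema": {
    "columns": [
      { "name": "fruit", "titles": "fruit", "datatype": "string" },
      { "name": "price", "titles": "price", "datatype": "decimal" }
    ],
    "primaryKey": "fruit"
  }
}
```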

Doing so will unlock additional capabilities in qsv, qsv pro, Datapusher+ and CKAN.

It will also let us "clean up" and consolidate the "metadata" files that qsv creates - the stats cache files, the index file, etc. - and package the CSV and its associated metadata into one container as a signed zip file.

It will also make "harvesting" and federation with CKAN easier and more robust as all the needed data/metadata is in one container.
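
As a rough sketch of what "one container" could look like in practice (assuming a plain ZIP layout and hypothetical entry names, nothing final), reading such an archive from Rust with the zip crate would be straightforward:

```rust
use std::fs::File;
use zip::ZipArchive;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Hypothetical .qsv container: a ZIP holding the CSV plus its sidecar metadata.
    let file = File::open("fruits.qsv")?;
    let mut archive = ZipArchive::new(file)?;

    // List every entry; a real implementation would look for specific,
    // well-known names (the data file, the stats cache, the CSVW descriptor, ...).
    for i in 0..archive.len() {
        let entry = archive.by_index(i)?;
        println!("{} ({} bytes)", entry.name(), entry.size());
    }
    Ok(())
}
```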

@jqnatividad added the enhancement, qsv pro and datapusher+ labels Jul 18, 2024
@jqnatividad
Collaborator Author

@jqnatividad changed the title from "Create a .qsv file that is an implementation of W3C's CSV on the Web" to "Create a .qsv file format that is an implementation of W3C's CSV on the Web" Jul 19, 2024
@jqnatividad added the CKAN label Jul 19, 2024
@rzmk
Collaborator

rzmk commented Aug 2, 2024

Experimenting with this:

[screenshot]

Sample .qsv file in this ZIP: fruits.qsv.zip (can't share .qsv on GitHub).

@jqnatividad
Collaborator Author

For comparison, note that several popular file formats (e.g. .docx, .xlsx, .epub and .jar) are actually compressed ZIP "packages".

@rzmk
Collaborator

rzmk commented Aug 18, 2024

It may be nice if the .qsv file is verified to be valid, or if there's a flag that can be quickly checked to see whether it is, along with whether an index is available.
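
One way to expose that cheaply (purely a sketch - these field names are made up, not a committed format) would be a tiny manifest entry inside the archive that tools can read without touching the data:

```json
{
  "validated": true,
  "validation_schema": "fruits.csv.schema.json",
  "index_available": true,
  "data_sha256": "<hash of fruits.csv>"
}
```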

@jqnatividad
Collaborator Author

jqnatividad commented Aug 18, 2024

Right @rzmk! The .qsv file, once implemented, is guaranteed to ALWAYS be valid, as the associated metadata/cache files will always be consistent with the core DATA stored in the archive. We can further ensure security by zipsigning the file so it cannot be tampered with.

Further, we can assign a Digital Object Identifier (DOI) to each qsv file so we can track/trace its provenance, and possibly, downstream use.
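
As one possible shape of the signing step (a sketch only - it signs the finished archive's bytes with Ed25519 via the ed25519-dalek crate rather than any particular zip-signing scheme, and key management is left out):

```rust
use ed25519_dalek::{Signature, Signer, SigningKey, Verifier, VerifyingKey};
use rand::rngs::OsRng;

fn main() -> std::io::Result<()> {
    // Sign the finished .qsv archive so consumers can detect tampering.
    let archive_bytes = std::fs::read("fruits.qsv")?;

    let signing_key = SigningKey::generate(&mut OsRng);
    let signature: Signature = signing_key.sign(&archive_bytes);

    // The verifying key and signature would ship alongside (or inside) the archive.
    let verifying_key: VerifyingKey = signing_key.verifying_key();
    assert!(verifying_key.verify(&archive_bytes, &signature).is_ok());
    Ok(())
}
```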

@jqnatividad
Collaborator Author

If done properly, even with all the extra metadata in the .qsv package, a .qsv file will be even smaller than the raw CSV!
This is because CSV files tend to have very high compression ratios - typically 80-90% - and all that extra metadata (stats, frequency tables, etc.) is tiny, just a few KBs, even for multi-gigabyte CSV files.

@jqnatividad
Collaborator Author

jqnatividad commented Aug 30, 2024

The .qsv file will contain the cache file (#2097).
It will also have all the metadata describing the dataset using DCAT 3 (particularly, the DCAT-US v3 spec for the first implementation).
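
As a rough illustration (the property choices are a guess at a minimal DCAT 3 shape, not the DCAT-US v3 mapping itself), the dataset-level metadata could be carried as a small JSON-LD document:

```json
{
  "@context": {
    "dcat": "http://www.w3.org/ns/dcat#",
    "dct": "http://purl.org/dc/terms/"
  },
  "@type": "dcat:Dataset",
  "dct:title": "Fruit prices",
  "dct:description": "Sample dataset used while prototyping the .qsv container.",
  "dct:modified": "2024-08-30",
  "dcat:distribution": {
    "@type": "dcat:Distribution",
    "dcat:downloadURL": "fruits.csv",
    "dcat:mediaType": "text/csv"
  }
}
```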

@jqnatividad
Collaborator Author

jqnatividad commented Aug 31, 2024

Related to #1705.
The profile command will create the .qsv file.

@Orcomp

Orcomp commented Oct 9, 2024

Worth experimenting with different compression algorithms. We have found Zstandard to work very well with CSV files.
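
For reference, a quick way to try Zstandard on a CSV from Rust is the zstd crate's one-shot helpers (level 3 is the crate's default and just a starting point):

```rust
fn main() -> std::io::Result<()> {
    // Compress a CSV with Zstandard and report the ratio.
    let raw = std::fs::read("fruits.csv")?;
    let compressed = zstd::encode_all(raw.as_slice(), 3)?;
    println!(
        "{} -> {} bytes ({:.1}% smaller)",
        raw.len(),
        compressed.len(),
        100.0 * (1.0 - compressed.len() as f64 / raw.len() as f64)
    );

    // Round-trip to confirm the compression is lossless.
    let restored = zstd::decode_all(compressed.as_slice())?;
    assert_eq!(raw, restored);
    Ok(())
}
```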

@jqnatividad
Collaborator Author

Worth experimenting with different compression algorithms. We have found Zstandard to work very well with csv files.

Thanks @Orcomp, do you have any benchmarks/metrics you can share for Zstandard and the other compression algorithms you considered?

@Orcomp

Orcomp commented Oct 10, 2024

You can check out https://morotti.github.io/lzbench-web

(From my personal experience, zstd has a good balance between compression ratio and compress/decompress speeds. I looked into this 2-3 years ago, so things might have changed a bit since.)

@jqnatividad
Collaborator Author

Instead of just signing the .qsv file using conventional techniques, "explore using two emerging standards: the W3C Verifiable Credentials Data Model 2.0 and Decentralized Identifiers (DIDs) v1.0 that leverage NIST's FIPS 186-5 but also align well with the DCAT RDF model, making both human and machine readable."

See DOI-DO/dcat-us#132
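
To make that concrete, a hand-wavy sketch of a credential asserting "this archive with this hash was published by this issuer" under the VC 2.0 data model (the DID, DOI and subject properties are placeholders):

```json
{
  "@context": ["https://www.w3.org/ns/credentials/v2"],
  "type": ["VerifiableCredential"],
  "issuer": "did:example:data-publisher",
  "validFrom": "2024-10-16T00:00:00Z",
  "credentialSubject": {
    "id": "https://doi.org/10.5555/example-qsv-dataset",
    "sha256": "<hash of the .qsv archive>"
  }
}
```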

@jqnatividad pinned this issue Oct 16, 2024