sample: add random sampling of remote CSVs without downloading the entire CSV first using http range requests #2140

Open
jqnatividad opened this issue Sep 13, 2024 · 0 comments
Labels
enhancement (New feature or request. Once marked with this label, it's in the backlog.), performance

jqnatividad (Collaborator) commented Sep 13, 2024

When sampling a remote CSV, qsv first has to download the file to a temporary file before it can start sampling.

Even with the new --max-size option, we're limited to sampling only the downloaded portion.

For servers that support HTTP range requests (most modern servers do) and report the HTTP Content-Length, do the sampling with range requests instead.
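As a rough illustration of the capability check, here is a minimal Python sketch (qsv itself is written in Rust, so this is not qsv code). It assumes the server advertises `Accept-Ranges: bytes` and a numeric `Content-Length` on a HEAD response. The names `supports_range_sampling` and `probe` are hypothetical.

```python
import urllib.request


def supports_range_sampling(headers):
    """Return the file size in bytes if range-based sampling looks possible, else None.

    `headers` is any mapping of HTTP response header names to values.
    """
    # HTTP header names are case-insensitive, so normalize them first.
    normalized = {str(k).lower(): str(v).strip() for k, v in headers.items()}
    if normalized.get("accept-ranges", "").lower() != "bytes":
        return None
    length = normalized.get("content-length", "")
    if not length.isdigit() or int(length) == 0:
        return None
    return int(length)


def probe(url, timeout=30):
    """Send a HEAD request and apply the check above (hypothetical helper)."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return supports_range_sampling(dict(response.headers.items()))
```

Some servers honor range requests without advertising `Accept-Ranges`, so a stricter probe could request `Range: bytes=0-0` and check for a `206 Partial Content` status. When the check fails, falling back to the current download-to-tempfile path seems like the safe default.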

This should allow qsv to sample very large CSV files quickly, since we don't need to download to a temporary file first.
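One possible way to do this, sketched in Python for illustration only: pick a random byte offset, fetch a small window around it, discard the partial line, and keep the next complete line. The function names, the `window` size, and the retry limit are all assumptions, not anything qsv defines. `fetch_range` is injected so the logic can be tried against an in-memory buffer.

```python
import random
import urllib.request


def http_fetch_range(url, start, end, timeout=30):
    """Fetch bytes start..end (inclusive) using an HTTP Range header."""
    request = urllib.request.Request(url, headers={"Range": f"bytes={start}-{end}"})
    with urllib.request.urlopen(request, timeout=timeout) as response:
        if response.status != 206:
            raise RuntimeError("server did not honor the Range header")
        return response.read()


def sample_lines(fetch_range, total_size, data_start, k, window=64 * 1024, seed=None):
    """Return up to k distinct raw lines chosen via random byte offsets.

    fetch_range(start, end) must return the bytes in that inclusive range.
    data_start is the offset of the first byte after the header line.
    """
    rng = random.Random(seed)
    seen = {}
    attempts = 0
    while len(seen) < k and attempts < k * 20:  # bounded retries
        attempts += 1
        offset = rng.randrange(data_start, total_size)
        # Start one byte early so a newline there tells us `offset` begins a line.
        start = offset - 1
        end = min(start + window, total_size) - 1
        chunk = fetch_range(start, end)
        first_newline = chunk.find(b"\n")
        if first_newline == -1:
            continue  # no line boundary in this window
        line_start = first_newline + 1
        line_end = chunk.find(b"\n", line_start)
        if line_end == -1:
            if end < total_size - 1:
                continue  # line is longer than the window
            line_end = len(chunk)  # final line without a trailing newline
        if line_end == line_start:
            continue  # empty line or end of file
        # Key by absolute offset so the same row is never returned twice.
        seen[start + line_start] = chunk[line_start:line_end].rstrip(b"\r")
    return list(seen.values())
```

Two caveats with this sketch: a row is picked with probability proportional to the byte length of the row before it, so the sample is only approximately uniform when row lengths vary, and quoted fields containing embedded newlines would be split incorrectly. For a remote file, `fetch_range` could be `lambda s, e: http_fetch_range(url, s, e)`.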

When implementing this, make sure to download the first N rows (default: 1000?) so we can get the header and infer the schema.
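Fetching the first N rows could look roughly like the Python sketch below, which requests successive ranges from the start of the file until the header plus N complete lines have arrived. The name `fetch_head_rows` and the `chunk_size` default are hypothetical, and the sketch assumes UTF-8 data with no embedded newlines in the leading rows.

```python
import csv
import io


def fetch_head_rows(fetch_range, total_size, n_rows=1000, chunk_size=256 * 1024):
    """Fetch ranges from byte 0 until the header and n_rows data lines are available.

    fetch_range(start, end) must return the bytes in that inclusive range.
    Returns (header, rows, data_start), where data_start is the byte offset
    of the first data row.
    """
    buffer = b""
    position = 0
    wanted_lines = n_rows + 1  # header line plus n_rows data lines
    while position < total_size and buffer.count(b"\n") < wanted_lines:
        end = min(position + chunk_size, total_size) - 1
        buffer += fetch_range(position, end)
        position = end + 1

    lines = buffer.split(b"\n")
    if position < total_size:
        lines.pop()  # the last piece is a partial (or empty) line; drop it
    text = b"\n".join(lines[:wanted_lines]).decode("utf-8")
    # Parse with the csv module so quoted commas are handled; skip blank lines.
    parsed = [row for row in csv.reader(io.StringIO(text, newline="")) if row]
    if not parsed:
        return [], [], 0
    data_start = buffer.find(b"\n") + 1
    return parsed[0], parsed[1:], data_start
```

The returned rows can then feed whatever schema inference is used, and `data_start` marks where random sampling of the data rows can begin.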

@jqnatividad added the enhancement (New feature or request. Once marked with this label, it's in the backlog.) label Sep 13, 2024