Allow downloading just some columns of a dataset #4114
Comments
In the general case you can't always reduce the quantity of data to download, since you can't parse CSV or JSON data without downloading the whole files, right? ^^ However, we could explore this case by case, I guess. |
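To illustrate the point about row-oriented formats, here is a minimal sketch using only the Python standard library. The helper `read_columns` and the sample data are made up for illustration and are not part of any library: the parser still has to consume every row in full, and the unwanted fields are only dropped afterwards, so memory use shrinks but the amount of data read does not.

```python
import csv
import io


def read_columns(csv_file, wanted):
    """Yield dicts holding only the `wanted` columns of each CSV row."""
    reader = csv.DictReader(csv_file)
    for row in reader:
        # The whole row has already been read and parsed at this point;
        # we can only discard the unwanted fields after the fact.
        yield {name: row[name] for name in wanted}


# Tiny in-memory stand-in for a downloaded CSV file (made-up data).
sample = io.StringIO("image_path,label,width\na.png,cat,640\nb.png,dog,320\n")
labels = list(read_columns(sample, ["label"]))
print(labels)
```

Columnar formats such as Parquet store each column separately, which is why column selection can save bandwidth there.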
Actually for csv pandas has |
Bumping the visibility of this :) Is there a recommended way of doing this? |
Passing |
I tried using the |
It doesn't work for the dataset with |
It only works for |
Hi @lhoestq, I have a 250GB audio dataset on the Hugging Face Hub in Parquet format. I only want to load the text column, but it is taking a lot of time. It seems to be downloading the audio as well, even in streaming mode. |
bump on this |
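Something like this worked for me:
from datasets import load_dataset

# Stream the dataset and request only the `text_normalized` column
ds = load_dataset(
    "parler-tts/libritts_r_filtered",
    "clean",
    streaming=True,
    columns=['text_normalized']
) |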
Is your feature request related to a problem? Please describe.
Some people are interested in doing label analysis of a CV dataset without downloading all the images. Downloading the whole dataset does not always make sense for this kind of use case.
Describe the solution you'd like
Be able to just download some columns of a dataset, such as doing
Although this might make things a bit complicated in terms of local caching of datasets.
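One way to picture the caching concern is that a partial download must not be mistaken for the full dataset in the cache. The sketch below is purely hypothetical: `cache_key` is an illustrative helper, not how the `datasets` library actually names its cache directories. It derives a distinct key per column selection, so that a labels-only download and a full download end up in separate cache entries.

```python
import hashlib
import json


def cache_key(dataset_name, config_name=None, columns=None):
    """Hypothetical cache key that depends on the selected columns."""
    # Sort the columns so ["a", "b"] and ["b", "a"] share one cache entry.
    selected = sorted(columns) if columns is not None else None
    payload = json.dumps(
        {"dataset": dataset_name, "config": config_name, "columns": selected},
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# "some/dataset" is a placeholder name, not a real repository.
full = cache_key("some/dataset")
labels_only = cache_key("some/dataset", columns=["label"])
print(full != labels_only)
```

A scheme like this avoids serving incomplete data from the cache, at the cost of possibly storing the same column more than once across entries.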