Allow downloading just some columns of a dataset #4114

osanseviero · 2022-04-06T16:38:46Z

Is your feature request related to a problem? Please describe.
Some people are interested in doing label analysis of a CV dataset without downloading all the images. Downloading the whole dataset does not always make sense for this kind of use case.

Describe the solution you'd like
Be able to download just some columns of a dataset, for example:

load_dataset("huggan/wikiart", columns=["artist", "genre"])

Although this might make things a bit complicated in terms of local caching of datasets.

lhoestq · 2022-04-06T16:47:29Z

In the general case you can’t always reduce the quantity of data to download, since you can’t parse CSV or JSON data without downloading the whole files, right? ^^ However, we could explore this case by case, I guess.

osanseviero · 2022-04-07T07:56:26Z

Actually, for CSV, pandas has usecols, which allows loading a subset of columns in a more efficient way afaik, but yes, you're right, this might be more complex than I thought.
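For readers unfamiliar with usecols, here is a minimal sketch of what it does. The CSV content and column names below are made up for illustration. Note that usecols only limits which columns are parsed and kept in memory; the file itself still has to be read in full, which is the limitation raised above.

```python
import io

import pandas as pd

# Hypothetical CSV standing in for a dataset's metadata file.
csv_text = (
    "artist,genre,image_path\n"
    "monet,landscape,a.jpg\n"
    "degas,portrait,b.jpg\n"
)

# usecols keeps only the listed columns while parsing.
# The parser still scans every line of the file.
labels = pd.read_csv(io.StringIO(csv_text), usecols=["artist", "genre"])

print(list(labels.columns))
```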

lukasugar · 2024-02-20T16:51:04Z

Bumping the visibility of this :) Is there a recommended way of doing this?

lhoestq · 2024-02-21T11:29:33Z

Passing columns=[...] to load_dataset() in streaming mode does work if the dataset is in Parquet format, but for other formats it's either not possible or not implemented.

oza75 · 2024-04-07T13:50:57Z

I tried using columns=['bambara'] on the dataset oza75/bambara-tts, which is in Parquet, but it does not work. This feature is really useful because sometimes you don't want to download the whole dataset but just a few columns.

Ravi2712 · 2024-05-16T14:16:36Z

It doesn't work for the dataset with parquet format. Are we missing something?

lhoestq · 2024-05-17T09:41:08Z

It only works for streaming=True. When not streaming, it does download the full files locally before reading the data.

kdcyberdude · 2024-07-06T01:42:18Z

Hi @lhoestq, I have an audio dataset of 250GB on the Hugging Face Hub in Parquet format. I only wanted to load the text column. It is taking a lot of time. It seems like it is downloading the audio as well, even in streaming mode.

trojblue · 2024-10-18T20:23:13Z

bump on this

xenova · 2025-02-03T10:07:27Z

Something like this worked for me:

from datasets import load_dataset

# Stream the dataset and request only the text column.
ds = load_dataset(
    "parler-tts/libritts_r_filtered",
    "clean",
    streaming=True,
    columns=['text_normalized']
)

osanseviero added the enhancement label on Apr 6, 2022
