Revisit huggingface cache policy #1564

Open · wants to merge 20 commits into main

Conversation

elronbandel
Member

Changes:

  1. Make loading of external HuggingFace datasets cached by default (via the HuggingFace cache mechanism).
  2. Make re-loading of unitxt datasets not cached by default (unless the user explicitly asks for it).
  3. Make processing of external HuggingFace datasets run without streaming, unless streaming is requested explicitly (a usage sketch follows below).
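
A minimal usage sketch of the new defaults (not part of the PR diff), assuming the unitxt.settings.disable_hf_datasets_cache flag that appears in the timing snippet later in this thread; the card name is borrowed from that snippet, and treating False as the flag's default is an assumption based on item 1:

import unitxt
from unitxt.api import load_dataset

# Assumed default after this PR: external HF datasets are cached by HF's own
# mechanism, so the flag below would only be needed to opt out.
unitxt.settings.disable_hf_datasets_cache = False

# Re-loading the unitxt dataset itself is not cached unless explicitly requested,
# and the external HF dataset is processed without streaming by default.
ds = load_dataset("card=cards.rag.documents.clap_nq.en")
print(next(iter(ds["train"])))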

@@ -823,6 +823,7 @@ class LoadFromHFSpace(LoadHF):
use_token: Optional[bool] = None
token_env: Optional[str] = None
requirements_list: List[str] = ["huggingface_hub"]
streaming = True

Collaborator

Why? Once the files are downloaded to the local file system, what benefit do you get here?

Member Author

You don't; it just doesn't work otherwise. We should base this loader on the dictionary loader.
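
A rough sketch (not from the PR) of what basing LoadFromHFSpace on a dictionary loader could look like, assuming unitxt's LoadFromDictionary loader with a data field mapping split names to lists of instances; the split names and instance fields here are purely illustrative:

from unitxt.loaders import LoadFromDictionary

# Hypothetical flow: files are first downloaded from the HF Space and parsed
# into plain Python dicts, then handed to the dictionary loader instead of
# going through the HF datasets streaming machinery.
downloaded = {
    "train": [{"question": "q1", "answer": "a1"}],
    "test": [{"question": "q2", "answer": "a2"}],
}
loader = LoadFromDictionary(data=downloaded)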

@dafnapension
Collaborator

in:

dataset[split] = dataset[split].to_iterable_dataset()

Once the dataset is loaded from HF (as a Dataset, not an IterableDataset), why change it to an IterableDataset? I got the impression that each step over each instance of an IterableDataset incurs significant overhead (probably to preserve a state or something). Does unitxt need the dataset to be iterable?
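
A small timing sketch (not from the PR) comparing iteration over a map-style Dataset with the same data wrapped via to_iterable_dataset(); the data is synthetic and the numbers will vary by machine:

from time import time

from datasets import Dataset

# Build a small in-memory, map-style dataset (contents are illustrative).
ds = Dataset.from_dict({"text": [f"instance {i}" for i in range(100_000)]})

t0 = time()
_ = [row["text"] for row in ds]  # map-style iteration over the Arrow-backed table
t1 = time()

it_ds = ds.to_iterable_dataset()  # same data exposed as an IterableDataset
_ = [row["text"] for row in it_ds]  # streaming-style iteration, one row at a time
t2 = time()

print(f"map-style Dataset: {t1 - t0:.2f}s, IterableDataset: {t2 - t1:.2f}s")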

@assaftibm
Member

assaftibm commented Jan 29, 2025

You can use this code to compare the loading times of HF (datasets) and UT (unitxt):

import os.path
import shutil
from time import time

from datasets import load_dataset as hf_load_dataset
from unitxt.api import load_dataset
import unitxt

path = "hf_cache"
# path = "/home/dafna/.cache/huggingface"

# Time loading directly via HF datasets, starting from a cold cache.
if os.path.exists(path):
    shutil.rmtree(path)
t0 = time()
ds = hf_load_dataset("PrimeQA/clapnq_passages", cache_dir=path)
print(f"hf ds: {ds}")
t1 = time()
instances = list(ds["train"])
t2 = time()
print(f"num of instances in ds = {len(instances)}")
print(f"from start to after load_dataset: {t1-t0}")
print(f"from after load_dataset to after list of ds = {t2-t1}")
print()
if os.path.exists(path):
    shutil.rmtree(path)

# Disable the HF datasets cache so that the unitxt load also starts cold.
unitxt.settings.disable_hf_datasets_cache = True
t0 = time()
ds = load_dataset('card=cards.rag.documents.clap_nq.en')
t1 = time()
instances = list(ds["train"])
t2 = time()
print(f"num of instances in ds = {len(instances)}")
print(f"from start to after load_dataset: {t1-t0}")
print(f"from after load_dataset to after list of ds = {t2-t1}")

@elronbandel
Member Author

elronbandel commented Jan 29, 2025

@assaftibm this PR is not yet the solution to the problem you pointed out. We broke it up into two PRs, since these are global changes that we want to introduce and test gradually.
