Revisit huggingface cache policy #1564

Open · wants to merge 20 commits into main

Conversation

elronbandel
Member

Changes:

  1. Make loading of external HuggingFace datasets cached by default (via the HuggingFace cache mechanism).
  2. Make re-loading of unitxt datasets not cached by default (unless the user explicitly asks for it).
  3. Make processing of external HuggingFace datasets run without streaming, unless streaming is requested explicitly (a usage sketch follows below).
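
A minimal usage sketch of the new defaults (not part of the PR diff), assuming the unitxt.settings.disable_hf_datasets_cache flag that appears in the timing snippet later in this thread; the card name is borrowed from that snippet, and treating False as the flag's default is an assumption based on item 1:

import unitxt
from unitxt.api import load_dataset

# Assumed default after this PR: external HF datasets are cached by HF's own
# mechanism, so the flag below would only be needed to opt out.
unitxt.settings.disable_hf_datasets_cache = False

# Re-loading the unitxt dataset itself is not cached unless explicitly requested,
# and the external HF dataset is processed without streaming by default.
ds = load_dataset("card=cards.rag.documents.clap_nq.en")
print(next(iter(ds["train"])))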

@@ -823,6 +823,7 @@ class LoadFromHFSpace(LoadHF):
use_token: Optional[bool] = None
token_env: Optional[str] = None
requirements_list: List[str] = ["huggingface_hub"]
streaming = True

Collaborator

Why? Once the files are downloaded to the local file system, what benefit do you get here?

Member Author

You don't; it just doesn't work otherwise. We should base this loader on the dictionary loader.
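
A rough sketch (not from the PR) of what basing LoadFromHFSpace on a dictionary loader could look like, assuming unitxt's LoadFromDictionary loader with a data field mapping split names to lists of instances; the split names and instance fields here are purely illustrative:

from unitxt.loaders import LoadFromDictionary

# Hypothetical flow: files are first downloaded from the HF Space and parsed
# into plain Python dicts, then handed to the dictionary loader instead of
# going through the HF datasets streaming machinery.
downloaded = {
    "train": [{"question": "q1", "answer": "a1"}],
    "test": [{"question": "q2", "answer": "a2"}],
}
loader = LoadFromDictionary(data=downloaded)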

@dafnapension
Collaborator

in:

dataset[split] = dataset[split].to_iterable_dataset()

Once the dataset is loaded from HF (as a Dataset, not an IterableDataset), why change it to an IterableDataset? I got the impression that each step over each instance of an IterableDataset incurs significant overhead (probably to preserve a state or something). Does unitxt need the dataset to be iterable?
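
A small timing sketch (not from the PR) comparing iteration over a map-style Dataset with the same data wrapped via to_iterable_dataset(); the data is synthetic and the numbers will vary by machine:

from time import time

from datasets import Dataset

# Build a small in-memory, map-style dataset (contents are illustrative).
ds = Dataset.from_dict({"text": [f"instance {i}" for i in range(100_000)]})

t0 = time()
_ = [row["text"] for row in ds]  # map-style iteration over the Arrow-backed table
t1 = time()

it_ds = ds.to_iterable_dataset()  # same data exposed as an IterableDataset
_ = [row["text"] for row in it_ds]  # streaming-style iteration, one row at a time
t2 = time()

print(f"map-style Dataset: {t1 - t0:.2f}s, IterableDataset: {t2 - t1:.2f}s")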

@assaftibm
Member

assaftibm commented Jan 29, 2025

You can use this code to compare the loading times of HF (datasets) and UT (unitxt):

import os.path
import shutil
from time import time

from datasets import load_dataset as hf_load_dataset
from unitxt.api import load_dataset
import unitxt

path = "hf_cache"
# path = "/home/dafna/.cache/huggingface"

# Time loading directly via HF datasets, starting from a cold cache.
if os.path.exists(path):
    shutil.rmtree(path)
t0 = time()
ds = hf_load_dataset("PrimeQA/clapnq_passages", cache_dir=path)
print(f"hf ds: {ds}")
t1 = time()
instances = list(ds["train"])
t2 = time()
print(f"num of instances in ds = {len(instances)}")
print(f"from start to after load_dataset: {t1-t0}")
print(f"from after load_dataset to after list of ds = {t2-t1}")
print()
if os.path.exists(path):
    shutil.rmtree(path)

# Disable the HF datasets cache so that the unitxt load also starts cold.
unitxt.settings.disable_hf_datasets_cache = True
t0 = time()
ds = load_dataset('card=cards.rag.documents.clap_nq.en')
t1 = time()
instances = list(ds["train"])
t2 = time()
print(f"num of instances in ds = {len(instances)}")
print(f"from start to after load_dataset: {t1-t0}")
print(f"from after load_dataset to after list of ds = {t2-t1}")

@elronbandel
Member Author

elronbandel commented Jan 29, 2025

@assaftibm this PR is not yet the solution to the problem you pointed out. We broke it up into two PRs, since these are global changes that we want to introduce and test gradually.
