Revisit huggingface cache policy #1564
base: main
Conversation
@@ -823,6 +823,7 @@ class LoadFromHFSpace(LoadHF):
     use_token: Optional[bool] = None
     token_env: Optional[str] = None
     requirements_list: List[str] = ["huggingface_hub"]
+    streaming = True
Why? Once the files are downloaded to the local file system, what benefit do you get here?
You don't. It just doesn't work otherwise. We should base this loader on the dictionary loader.
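For context, a minimal sketch (plain HF datasets, not the Unitxt loader; the local file path is hypothetical) of what the streaming flag changes:

# Minimal sketch of the streaming flag's effect in plain HF datasets.
# The local file path below is hypothetical, for illustration only.
from datasets import load_dataset

data_files = {"train": "downloaded_space_files/train.json"}

# streaming=False materializes the data as an Arrow-backed Dataset
# (one-time conversion, cached on disk).
ds = load_dataset("json", data_files=data_files, streaming=False)

# streaming=True returns an IterableDataset that reads the files
# lazily, record by record, with no up-front Arrow conversion.
ids = load_dataset("json", data_files=data_files, streaming=True)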
in: Line 293 in c863ee7
Once the dataset is loaded from HF (as a Dataset, not an IterableDataset), why change it to an IterableDataset? I got the impression that each step over an IterableDataset incurs considerable per-instance overhead (probably to preserve state or something). Does Unitxt need the dataset as an iterable?
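A minimal sketch of how one could measure that per-instance cost on the same data (assumes datasets >= 2.11 for to_iterable_dataset; illustrative, not part of this PR):

# Compare iteration cost: materialized Dataset vs. IterableDataset
# view of the same rows. Timings are machine-dependent.
from time import time
from datasets import load_dataset

ds = load_dataset("PrimeQA/clapnq_passages", split="train")

t0 = time()
_ = list(ds)  # iterate the Arrow-backed Dataset directly
t1 = time()

_ = list(ds.to_iterable_dataset())  # same rows via a lazy IterableDataset
t2 = time()

print(f"Dataset iteration:         {t1 - t0:.2f}s")
print(f"IterableDataset iteration: {t2 - t1:.2f}s")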
You can use this code to compare the loading time of HF and UT:

import os.path
import shutil
from time import time

from datasets import load_dataset as hf_load_dataset

import unitxt
from unitxt.api import load_dataset

path = "hf_cache"
# path = "/home/dafna/.cache/huggingface"

# Time loading directly with HF datasets, starting from a cold cache.
if os.path.exists(path):
    shutil.rmtree(path)

t0 = time()
ds = hf_load_dataset("PrimeQA/clapnq_passages", cache_dir=path)
print(f"hf ds: {ds}")
t1 = time()
instances = list(ds["train"])
t2 = time()
print(f"num of instances in ds = {len(instances)}")
print(f"from start to after load_dataset: {t1 - t0}")
print(f"from after load_dataset to after list of ds = {t2 - t1}")
print()

# Time loading through Unitxt, again from a cold cache.
if os.path.exists(path):
    shutil.rmtree(path)

# disable the cache for hf datasets
unitxt.settings.disable_hf_datasets_cache = True

t0 = time()
ds = load_dataset("card=cards.rag.documents.clap_nq.en")
t1 = time()
instances = list(ds["train"])
t2 = time()
print(f"num of instances in ds = {len(instances)}")
print(f"from start to after load_dataset: {t1 - t0}")
print(f"from after load_dataset to after list of ds = {t2 - t1}")
@assaftibm this PR is not yet the solution to the problem you pointed out. We broke it up into two PRs, since these are global changes that we want to introduce and test gradually.
Changes: