Add concurrent loading of shards to datasets.load_from_disk #6464
Conversation
If we use multithreading, no need to ask for
Also you can ignore the
Have you been able to run the benchmark on a fresh node? The speedup doesn't seem that big in your first report.
I got some fresh nodes; with 32 threads, loading the dataset takes around 315 seconds (without any preloading). Sequentially, it used to take around 1865 seconds.
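For context, a minimal sketch of what threaded shard loading can look like, assuming stream-format arrow shards and pyarrow's memory mapping; the paths, helper names, and the 32-thread default are hypothetical, not the PR's actual wiring:

```python
# Sketch only: load arrow shards with a thread pool instead of a
# sequential loop. Helper names and thread count are hypothetical.
from concurrent.futures import ThreadPoolExecutor

import pyarrow as pa
import pyarrow.ipc


def mmap_shard(path: str) -> pa.Table:
    # Memory-map one arrow shard and read it as a table. On Lustre-like
    # file systems the mmap setup itself is the slow, parallelizable part.
    source = pa.memory_map(path)
    return pa.ipc.open_stream(source).read_all()


def load_shards(paths: list[str], num_threads: int = 32) -> list[pa.Table]:
    # Threads (not processes) are enough here: the work is I/O-bound, so
    # the GIL is not the bottleneck, and tables need no pickling across
    # process boundaries.
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        return list(pool.map(mmap_shard, paths))
```

With 1231 shards of ~2 GB each, this kind of fan-out is where a sequential-to-threaded reduction like the one reported above would come from.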
(force-pushed from 0f39880 to 2d31e43)
I switched to multithreading.
That's a nice speedup! I just have one comment:
src/datasets/arrow_reader.py (outdated)
```python
for f_dict in files:
    pa_table: Table = self._get_table_from_filename(f_dict, in_memory=in_memory)
    pa_tables.append(pa_table)
```
I think you can remove this, and also remove the `pa_tables = []` a few lines above this.
oh! Thanks I missed that :)
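For reference, a sketch of what the removed loop reduces to, assuming a `ThreadPoolExecutor`-based replacement; the helper name `read_tables_concurrently` and the `reader` argument are hypothetical, while `_get_table_from_filename` and `in_memory` come from the snippet above:

```python
# Hypothetical standalone equivalent of the deleted sequential loop.
from concurrent.futures import ThreadPoolExecutor
from functools import partial


def read_tables_concurrently(reader, files, in_memory=False):
    # One task per shard entry; pool.map returns results in the same
    # order as `files`, so no manual pa_tables accumulator is needed.
    with ThreadPoolExecutor() as pool:
        fn = partial(reader._get_table_from_filename, in_memory=in_memory)
        return list(pool.map(fn, files))
```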
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
Thanks for the update! Btw you should tell Jack Morris that you added this :) see https://x.com/jxmnop/status/1749812573984461145?s=20
The CI fail is unrelated to this PR - I'm trying to fix it on
Thank you! I'll let him know :)
great work guys! letting you know here too
[Benchmark result tables (PyArrow 8.0.0): benchmark_array_xd.json, benchmark_getitem_100B.json, benchmark_indices_mapping.json, benchmark_iterating.json, benchmark_map_filter.json]
In some file systems (like Lustre), memory mapping arrow files takes time. This can be accelerated by performing the mmap in parallel on processes or threads. The change touches the `IN_MEMORY_MAX_SIZE` config and the call path from `BaseReader.read` to `DatasetBuilder.as_dataset`; note that `DatasetBuilder.as_dataset` is used in many places beside `load_dataset`.

Tests on a Lustre file system (on a shared partial node):
Loading 1231 shards of ~2 GB each.
The files were pre-loaded in another process before the script ran (couldn't get a fresh node).
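Assuming the concurrency lives inside the library after this change, a run like the ones below is just a timed `load_from_disk` call; a minimal sketch (the dataset path is hypothetical):

```python
# Hypothetical timing harness for the runs below; only wall time is
# measured, and the dataset path is made up.
import time

from datasets import load_from_disk

start = time.perf_counter()
ds = load_from_disk("/lustre/scratch/dataset_with_1231_shards")  # hypothetical path
elapsed = time.perf_counter() - start
print(ds)
print(f"load_from_disk took {elapsed:.1f}s")
```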
[Timing results for Run 1, Run 2, and Run 3]
fix #2252