
Restrict parallelization for Flash loader #355

Closed
rettigl opened this issue Mar 7, 2024 · 2 comments

rettigl (Member) commented Mar 7, 2024

The Flash (and future FEL) loader currently converts all files to Parquet in parallel, simultaneously. This can consume huge amounts of memory and leads to very inefficient conversion on systems with few cores and little memory (such as the GitHub Actions workers).

The solution would be to do the conversion in chunks sized by the number of available cores and/or the num_cores config parameter we already define. Maybe that parameter should then be moved from [binning] to [core].
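A minimal sketch of such a bounded scheme, assuming a hypothetical `convert_to_parquet` helper rather than the actual loader API; joblib keeps at most `n_jobs` conversions in flight, so memory use scales with the core count instead of the file count:

```python
import os

from joblib import Parallel, delayed


def convert_to_parquet(path: str) -> str:
    """Placeholder for the per-file HDF5 -> Parquet conversion."""
    return path + ".parquet"


def convert_all(paths: list[str], num_cores: int | None = None) -> list[str]:
    # Cap concurrency at num_cores (or the machine's core count), so only
    # a bounded number of files is converted at any one time.
    n_jobs = max(1, min(len(paths), num_cores or os.cpu_count() or 1))
    return Parallel(n_jobs=n_jobs)(delayed(convert_to_parquet)(p) for p in paths)
```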

rettigl (Member, Author) commented Jul 18, 2024

Closed in #329

rettigl closed this as completed Jul 18, 2024
zain-sohail (Member) commented Jul 29, 2024

With the current scheme, one core is used per dataframe. But since conversion now takes only a few seconds per file, the overheads are far too large, so running multiple files per core is likely optimal.

I tested with 49 files:

| Branch     | 49 workers | 10 workers |
| ---------- | ---------- | ---------- |
| main       | ~200 s     | ~100 s     |
| refactored | ~150 s     | ~30 s      |

With 49 workers, both branches are about as slow as loading serially, or possibly worse. Using just 10 workers instead gives a clear speed-up, so the n_workers parameter should be scaled adaptively. A simple method could be one worker per 3-5 files; see the sketch below.
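A minimal sketch of such a heuristic (the names are illustrative, not actual config parameters):

```python
import os


def adaptive_n_workers(num_files: int, files_per_worker: int = 4,
                       max_workers: int | None = None) -> int:
    # Roughly one worker per 3-5 files, capped at the available core count.
    if max_workers is None:
        max_workers = os.cpu_count() or 1
    wanted = -(-num_files // files_per_worker)  # ceiling division
    return max(1, min(max_workers, wanted))
```

For the 49-file case this gives 13 workers at 4 files per worker, or 10 at 5 files per worker, in line with the 10-worker timing above.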

Refactored code:
[screenshot: timings with the refactored code]

Instead of joblib, I also tried dask.distributed's LocalCluster, which provides a more linear speed-up with the number of cores used.

Here I am reading 49 files. Serially it takes about 196 s; dask.delayed without a client takes 100 s (effectively single-core then).

[screenshot: two different runs using dask.distributed LocalCluster]
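For reference, a minimal sketch of the dask.distributed variant; `convert_file` and the file paths are hypothetical placeholders, not the actual loader API:

```python
import dask
from dask.distributed import Client, LocalCluster


@dask.delayed
def convert_file(path: str) -> str:
    """Placeholder for the per-file conversion work."""
    return path + ".parquet"


if __name__ == "__main__":
    paths = [f"raw/run_{i:03d}.h5" for i in range(49)]  # hypothetical inputs
    with LocalCluster(n_workers=10) as cluster, Client(cluster) as client:
        # Build the task graph lazily, then let the distributed scheduler
        # spread the tasks over the cluster's workers.
        outputs = dask.compute(*(convert_file(p) for p in paths))
```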

In the current code, only one other place uses joblib, so maybe we can switch to dask.distributed altogether.

zain-sohail reopened this Jul 29, 2024
OpenCOMPES locked and limited conversation to collaborators Jan 14, 2025
zain-sohail converted this issue into discussion #550 Jan 14, 2025

This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →
