Memory (leak) aggregation after multiple runs with .compute() #2464
Comments
- "Do you observe the memory leak only with …"
- "I recommend trying this same workflow not with dask.dataframe, but with …"
- "@TomAugspurger @mrocklin Thank you for suggestions. I tried replicating the code that used pure … The fix for the …"
- "Just tried re-running this with pandas v 0.24, and the memory leak is gone! Many thanks for help."
When `dask.compute()` or `client.compute()` is executed, memory does not get released. I have dask running in a celery worker, which keeps executing tasks, and memory consumption keeps increasing as well. Dask does not seem able to release some of the memory it used up.

I tried using `pympler`, `tracemalloc` and `memory_profiler`, but nothing points to any data objects. The memory consumption is visible in monitoring tools such as `htop`.

To reproduce the memory leak, here is an example I compiled:
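The original reproduction script is not shown above, so the following is only a minimal sketch of the workflow it describes, assuming a `dask.dataframe` CSV read; the column names, row count and the `groupby` computation are illustrative, not taken from the report:

```python
import numpy as np
import pandas as pd
import dask.dataframe as dd
from dask.distributed import Client, LocalCluster


def make_input(path="input.csv", n_rows=1_000_000):
    # Hypothetical input: two numeric columns written to a CSV file.
    pd.DataFrame({
        "x": np.random.randint(0, 100, n_rows),
        "y": np.random.random(n_rows),
    }).to_csv(path, index=False)


def dask_task(path="input.csv"):
    # Read with dask.dataframe and trigger a .compute(); the real task may differ.
    df = dd.read_csv(path)
    return df.groupby("x").y.mean().compute()


def clean_dask(client, cluster):
    # Close the workers, the client and the local scheduler to try to free memory.
    client.close()
    cluster.close()
```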
To break this down:

- `make_input()` creates an input file for you;
- `dask_task()` is run several times (5-10); with each iteration memory use increments;
- a `clean_dask()` call closes the workers, the client and the local scheduler, and tries cleaning up the memory.

Here are my observations for memory resources:
| Scenario | Memory after execution | Memory after `clean_dask()` |
|---|---|---|
| Dask client and cluster set up | 91MB | – |
| `dask_task()` executed 1 time | 122MB | 122MB |
| `dask_task()` executed 5 times | 198MB | 198MB |
| Re-set, `dask_task()` executed 20 times | 237MB | 237MB |
| Re-set, `dask_task()` executed 50 times | 332MB | 233MB |
| Re-set, `dask_task()` executed 100 times | 492MB | 492MB |
| Re-set, `dask_task()` executed 200 times | 497MB | 497MB |
| Re-set, `dask_task()` executed 500 times | 610MB | 574MB |
| Re-set, `dask_task()` executed 1,000 times | 624MB | 570MB |
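For a scriptable version of these measurements, a loop like the sketch below could drive the functions from the sketch above. `psutil` is not mentioned in the report (`htop` was used there); it appears here only as a convenient way to read the process RSS:

```python
import psutil
from dask.distributed import Client, LocalCluster


def rss_mb():
    # Resident set size of the current process, in MB.
    # Note: with the default multi-process LocalCluster this measures only
    # the client process, not the worker processes.
    return psutil.Process().memory_info().rss / 1024 ** 2


if __name__ == "__main__":
    make_input()                      # from the sketch above
    cluster = LocalCluster()
    client = Client(cluster)
    print(f"client/cluster set up: {rss_mb():.0f}MB")

    for i in range(5):                # repeat 5-10 times
        dask_task()
        print(f"after run {i + 1}: {rss_mb():.0f}MB")

    clean_dask(client, cluster)
    print(f"after clean_dask(): {rss_mb():.0f}MB")
```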
Closing the client/cluster only managed to clear the memory on some occasions, but not in all cases. The trend is that once a certain level is reached, memory no longer increases as steeply.
Desired behaviour
Once a dask execution is completed, I would like the memory it used to be released so that other processes can consume it. Dask runs tasks inside a Celery worker, inside a Docker container. Restarting workers/containers is obviously an undesirable way of managing memory consumption.
I do not require a LocalCluster setup, nor the Client. However, setting up a client and cluster gives the option to explicitly run `.close()` on them, which frees up some more memory compared to running dask alone.

I would much appreciate your assistance or suggestions. Thank you.