
out-of-memory error on hpc during capture output #24

Open
m-bossart opened this issue Apr 12, 2022 · 3 comments

@m-bossart
Contributor

No description provided.

@m-bossart
Contributor Author

Training occasionally runs out of memory on the HPC while writing the output Arrow data files. The function that captures the output is here:

function _capture_output(output_dict, output_directory, id)

For the cases that fail, the first DataFrame is written to file and the process is killed on the second iteration of the loop.
I tried setting df = nothing and calling GC.gc() to free the memory held by the prior iteration, but still got failures. @jd-lara, any suggestions? Does it make sense to write the data to the output files directly from the training callback?
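For reference, a minimal sketch of the loop pattern described above, assuming one Arrow file per entry of output_dict; the file naming and the exact body shown here are illustrative, not the repository's actual implementation:

```julia
using Arrow, DataFrames

# Illustrative sketch only; the real _capture_output in this repo may differ.
function _capture_output(output_dict, output_directory, id)
    for (key, df) in output_dict
        # Write one Arrow file per DataFrame in the dictionary.
        Arrow.write(joinpath(output_directory, "$(id)_$(key).arrow"), df)
        # Attempted mitigation from the comment above: drop the reference
        # and force a collection before the next iteration's write.
        df = nothing
        GC.gc()
    end
    return
end
```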

@jd-lara
Collaborator

jd-lara commented Apr 12, 2022

What's the size of the DataFrame you are trying to write? I wonder if the Arrow.write call itself is what's causing the problem.

@m-bossart
Contributor Author

Not that big: ~380 parameters x 300 rows. It has worked successfully for much larger cases, and most of the time it works; roughly 10-20% of runs hit this failure in the most recent version.
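For scale, assuming Float64 columns, a 300 x 380 table holds roughly 300 * 380 * 8 bytes ≈ 0.87 MiB of column data, which can be checked directly (illustrative snippet, not from the repository):

```julia
using DataFrames

# Roughly the size mentioned above: 300 rows x 380 Float64 columns.
df = DataFrame(rand(300, 380), :auto)

# Total bytes reachable from df, reported in MiB.
println(Base.summarysize(df) / 1024^2, " MiB")
```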
