Error with multiple runs in parallel and solver_link_type="memory"
#14
Comments
Some further results: I tracked the CPU utilization using Slurm's profiling. This is what it looks like before I cancel the job: [CPU utilization screenshot]
I was not able to reproduce the issue on my local machine with 16 CPUs, even when I set the workers variable to 64 and the workers in the list comprehension to 12000 (I'm using the latest GAMSPy, 1.3.1). The socket connection attempt has a timeout of 30 seconds, so if GAMSPy cannot establish the socket connection within 30 seconds, it should fail automatically. Hence, I doubt that the hanging is caused by the socket connection creation. I will try to reproduce it on other machines in the coming days.
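The reasoning above (a bounded connect timeout should turn a hang into a fast failure) can be illustrated with a minimal sketch. This is not the actual GAMSPy solver-link code; the host, port, and function name are hypothetical:

```python
import socket


def try_connect(host: str, port: int, timeout: float = 30.0) -> bool:
    """Attempt a TCP connection to the solver endpoint (hypothetical).

    With a timeout set, the attempt fails within `timeout` seconds
    instead of blocking indefinitely -- which is why a hang would not
    be expected from the connection step alone.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refusals, unreachable hosts, and timeouts
        return False
```

With this pattern, a worker whose connection cannot be established returns quickly with `False` rather than stalling the whole pool.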
It seems to be non-deterministic. I went back and tested it again with both 16 and 64 CPUs, so maybe try running the same example a couple of times. I cannot exclude the possibility that it is related to the configuration of our HPC cluster, so I will look into that. However, these issues did not occur when I parallelized a simpler function (which also writes files).
I was able to reproduce it with a network license. No matter how many times I tried with a local license, I couldn't reproduce it. So my hunch is that it's a licensing issue. I will investigate further in the coming days. Thanks for the experiments.
I wanted to parallelize GAMSPy runs as I described in the forum, where it was recommended to run on a UNIX system due to issues with Windows Defender. On a UNIX system, however, I encountered a different issue: not all tasks were completed, and most often the execution froze completely - despite the data parallelism, which should not lead to deadlocks!
So I suspected some type of memory issue and found that it was due to passing the problem to the solver via memory using `solver_link_type="memory"`. At least on my small example problem, the issues do not occur when using the default, which reads from disk. To reproduce, execute the example and use the bash command
`ls -l *.gdx | wc -l`
to count the number of GDX files created. In my case, the "memory" version fails to create the last GDX, showing a result of 63. The "disk" version successfully creates all GDX files and thus shows 64. Note that I did not investigate this issue on a Windows system again. I ran the example on an AMD EPYC 7702 with 64 cores.
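The reproduction pattern can be sketched as follows. The GAMSPy model-building and solve code is omitted here; `solve_one` is a hypothetical stand-in that only writes a placeholder `.gdx` file so the counting check is demonstrable (a thread pool is used for brevity, whereas the actual reproduction uses process-based workers):

```python
import glob
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def solve_one(args):
    """Stand-in for a single GAMSPy solve. In the real reproduction this
    would build a Container and solve the model, which writes a GDX;
    here we only create a placeholder file with the same extension."""
    workdir, i = args
    path = os.path.join(workdir, f"run_{i}.gdx")
    with open(path, "w") as f:
        f.write("placeholder\n")
    return path


def run_parallel(workers: int = 64) -> int:
    """Launch `workers` tasks in parallel and return how many GDX files
    actually appeared. The issue described above is reproduced when the
    returned count is less than `workers` (e.g. 63 instead of 64)."""
    workdir = tempfile.mkdtemp()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(solve_one, [(workdir, i) for i in range(workers)]))
    # Equivalent to the shell check: ls -l *.gdx | wc -l
    return len(glob.glob(os.path.join(workdir, "*.gdx")))
```

Counting the output files rather than inspecting return values is a simple way to catch silently dropped tasks, since a frozen or killed worker leaves no GDX behind.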