Error with multiple runs in parallel and solve_link_type="memory" #14

Open
grecht opened this issue Dec 2, 2024 · 4 comments
Labels
bug Something isn't working


grecht commented Dec 2, 2024

I wanted to parallelize GAMSPy runs as described in the forum, where I was advised to do this on a UNIX system because of issues with Windows Defender. On a UNIX system, however, I ran into a different problem: not all tasks completed, and most of the time execution froze entirely, even though the runs are purely data-parallel and should not be able to deadlock.
I suspected some kind of memory issue and found that it was related to passing the problem to the solver in memory via solve_link_type="memory". At least for my small example problem, the issue does not occur with the default, which reads from disk.

To reproduce, run the script below and use the bash command ls -l *.gdx | wc -l (see the snippet after the script) to count the number of GDX files created. In my case, the "memory" version fails to create the last GDX file, giving a count of 63, while the "disk" version creates all of them and shows 64.

Note that I did not investigate this issue on a Windows system again. I ran the example on an AMD EPYC 7702 with 64 cores.

import concurrent.futures
import multiprocessing as mp
import numpy as np
import gamspy as gp


def f(i, rands, options):
    # Each worker builds and solves an independent LP in its own Container.
    ct = gp.Container()

    # Use j here to avoid shadowing the task index i, which names the output file.
    S = ct.addSet("R", records=[f"r{j}" for j in range(len(rands))])
    p = ct.addParameter("p", [S], records=rands)
    x = ct.addVariable("x", domain=S)

    eq = ct.addEquation("eq", domain=S)
    eq[S] = x[S] >= p[S]

    obj = gp.Sum(S, x[S])

    m = ct.addModel("m", gp.Problem.LP, [eq], gp.Sense.MIN, obj)
    m.solve(options=options)

    # Write the container to a GDX file so completed runs can be counted.
    ct.write(f"ct_{i}.gdx")
    ct.close()


def main():
    workers = 64

    size = int(1e2)

    # One vector of random right-hand sides per worker, seeded for reproducibility.
    rng = np.random.default_rng(seed=0)
    rands = [10 * rng.random(size=size) for _ in range(workers)]

    options = gp.Options(
        # Comment out the following line to make it work.
        solve_link_type="memory",
        threads=1
    )

    executor = concurrent.futures.ProcessPoolExecutor(
        max_workers=workers,
        mp_context=mp.get_context("spawn"),
    )

    for i, rand in enumerate(rands):
        executor.submit(f, i, rand, options)

    # Blocks until all submitted runs have finished.
    executor.shutdown()

    print("Done.")


if __name__ == '__main__':
    main()
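
To check how many runs completed, count the GDX files written by the workers (assuming the script is run from the current working directory, so the files land next to it):

# 64 means every run wrote its container; fewer indicates hung or failed workers.
ls -l *.gdx | wc -l

As a side note, the futures returned by executor.submit() are discarded above, so any exception raised inside f would pass silently; collecting the futures and calling result() on each would surface such errors.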
grecht (Author) commented Dec 4, 2024

Some further results: solve_link_type may not be the problem after all, and what I observed could be some other effect. Running 365 problems (in the example, replace workers in the rands list comprehension with 365, as in the snippet below), I run into the same issue even with solve_link_type="disk": it freezes with 361 problems completed. Maybe it is due to the communication with GAMS via sockets?
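
For reference, the change described above would look roughly like this, keeping workers = 64 as the pool size:

# 365 independent problems, still only 64 worker processes in the pool.
rands = [10 * rng.random(size=size) for _ in range(365)]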

I tracked CPU utilization using Slurm's profiling. This is what it looks like before I cancel the job:

[Figure: CPU utilization of the job before cancellation]

0x17 added the bug ("Something isn't working") label on Dec 4, 2024
mabdullahsoyturk (Contributor) commented
I was not able to reproduce the issue on my local machine with 16 CPUs, even when I set the workers variable to 64 and the count in the list comprehension to 12000 (I'm using the latest GAMSPy, 1.3.1). The socket connection attempt has a timeout of 30 seconds, so if GAMSPy cannot establish the socket connection within 30 seconds, it fails automatically. Hence, I doubt the hang is caused by the socket connection setup. I will try to reproduce it on other machines in the coming days.

grecht (Author) commented Dec 8, 2024

It seems to be non-deterministic. I went back and tested it again with both 16 and 64 CPUs, len(rands) == 365, IPOPT, and solve_link_type="disk" (though I do not know whether this option is actually applied; the documentation says that, depending on the solver's capabilities, it may be switched to the other value).
Sometimes it terminated, sometimes it did not, freezing with anywhere from 1 to 6 problems remaining. And there is even a third outcome: sometimes it terminated but wrote only 363 GDX files. That is odd, since the problems should be exactly the same across runs because the random number generator is seeded.

So maybe try running the same example a couple of times. I cannot rule out that it is related to the configuration of our HPC cluster, so I will look into that. However, these issues did not occur when parallelizing a simpler function that also writes files, along the lines of the sketch below.
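
For illustration, a control case of the kind described could look like the following sketch; the function g is hypothetical (not the exact function used), but it exercises the same ProcessPoolExecutor setup and file writing without any GAMSPy calls:

import concurrent.futures
import multiprocessing as mp

import numpy as np


def g(i, rands):
    # Hypothetical control worker: no GAMSPy, just a file write per task,
    # to separate process-pool or filesystem issues from GAMSPy itself.
    np.savetxt(f"out_{i}.txt", rands)


def main():
    rng = np.random.default_rng(seed=0)
    rands = [10 * rng.random(size=100) for _ in range(365)]

    with concurrent.futures.ProcessPoolExecutor(
        max_workers=64,
        mp_context=mp.get_context("spawn"),
    ) as executor:
        for i, rand in enumerate(rands):
            executor.submit(g, i, rand)


if __name__ == "__main__":
    main()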

mabdullahsoyturk (Contributor) commented
I was able to reproduce it with a network license. No matter how many times I tried with a local license, I could not reproduce it. So my hunch is that it is a licensing issue. I will investigate further in the coming days. Thanks for the experiments.
