Runtime failures on larger process counts #66
The following program fails reliably with 1024 processes on our cluster. Could someone please look at it? Apparently something in the collectives is broken (GPI2 v1.5.0), or the problem is in my AllGatherValueImpl function.
Update on this issue: the reason seems to be the setting of PCI_WR_ORDERING. If set to per_mkey(0), all is fine. If set to force_relax(1), races will happen.
I run into all kinds of trouble when I try to run GPI2 1.5 on somewhat larger process counts. Up to 128 processes all is fine, but starting with 256 processes the program stops with all kinds of non-reproducible errors. I tried this on two InfiniBand clusters.
Has something changed from 1.3 to 1.5, so that a gaspi_proc_init with GASPI_TOPOLOGY_STATIC isn't advisable anymore for those process counts? And if I use GASPI_TOPOLOGY_NONE and connect only neighbors by hand, can I then use the GASPI collectives nevertheless?
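
To make "connect only neighbors by hand" concrete, here is a minimal sketch of how the neighbor ranks could be computed, assuming a simple ring layout. The helper name ring_neighbors is hypothetical and not part of GPI-2. With GASPI_TOPOLOGY_NONE, each returned rank would then presumably be passed to the library's connect call (gaspi_connect, as far as I know); that part is left out here because it needs the library and a running job. The sketch says nothing about whether the collectives work on such a partial topology.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper (not part of GPI-2): writes the left and right ring
 * neighbors of `rank` into `out` and returns how many distinct neighbors
 * were written (0, 1 or 2). Assumes ranks are numbered 0..nprocs-1. */
static size_t ring_neighbors(unsigned rank, unsigned nprocs, unsigned out[2])
{
    assert(nprocs > 0 && rank < nprocs);

    if (nprocs == 1) {
        return 0;               /* a single process has no neighbors */
    }

    unsigned left  = (rank + nprocs - 1) % nprocs;
    unsigned right = (rank + 1) % nprocs;

    out[0] = left;
    if (left == right) {
        return 1;               /* two processes: both sides are the same rank */
    }

    out[1] = right;
    return 2;
}
```

For a 1024-process run this gives every rank exactly two connections instead of 1023, which is the point of skipping the static all-to-all setup.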