prov/psm3: illegal instruction #8933
Comments
At build time, psm3 decides which instruction set to compile with. The minimum is AVX (v1). However, we also test for AVX2 and compile with it if it is available on the build machine, as it provides a noticeable performance improvement. From the description, this looks like it was compiled with AVX2 support, which means you will need a gen3 (or Haswell) processor to run this binary. However, the Intel(R) Core(TM) i7-2720QM should support AVX (v1). While we do not claim to validate or support psm3 on your processor, you may be able to run it by recompiling psm3/libfabric yourself on this system.
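To make the build-time choice described above concrete, here is a minimal sketch of how a binary can report which AVX level it was compiled for. It relies on the `__AVX__` and `__AVX2__` macros that GCC and Clang predefine when `-mavx`/`-mavx2` (or a `-march` value implying them) is in effect. The helper name `compiled_isa` is hypothetical and is not part of psm3 or libfabric; how psm3's own build scripts select the flags is not shown here.

```c
#include <stdio.h>

/* Hypothetical helper (not a psm3 function): names the widest AVX level
 * the compiler was allowed to use for this translation unit. */
static const char *compiled_isa(void)
{
#if defined(__AVX2__)
    return "avx2";      /* built with -mavx2 or a -march that implies it */
#elif defined(__AVX__)
    return "avx";       /* built with -mavx only */
#else
    return "baseline";  /* no AVX flags were active */
#endif
}

/* Prints the compiled-in level, e.g. for a startup log line. */
static void report_compiled_isa(FILE *out)
{
    fprintf(out, "compiled for: %s\n", compiled_isa());
}
```

Calling `report_compiled_isa(stderr)` from a small test program built with the same flags as the package would show whether a distro build was made with AVX2 enabled.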
Could a dynamic check for AVX be added? The libfabric package is compiled with AVX enabled (or PSM3 would not be part of the package), but the runtime CPU does not have AVX support.
We are currently working out the best way to detect current cpu support at runtime, which will support multiple compilers/CPUs. We are currently testing
We have an upcoming release due in mid-September and plan to deliver this fix with that release.
Took a bit to come out, but we have pushed the patch with our latest release: #9389
IIUC, this patch detects at runtime whether the CPU is compatible with the march selected at build time and fails cleanly if it is not. However, it means that distros will have to build for the most compatible arch, meaning SSE4.2, to avoid compatibility issues.
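For readers unfamiliar with the idea, a runtime check of this kind could look roughly like the sketch below. It is not the code from #9389. It assumes GCC or Clang on x86, where the `__builtin_cpu_init` and `__builtin_cpu_supports` builtins are available; the helper names `cpu_has_isa` and `check_isa_or_fail` and the error message are hypothetical.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: 1 if the running CPU supports `isa`, 0 if it does
 * not, -1 if the name is not one this sketch knows about.
 * __builtin_cpu_supports needs a string literal, hence the explicit cases. */
static int cpu_has_isa(const char *isa)
{
    __builtin_cpu_init();  /* initialise the compiler's CPU feature data */
    if (strcmp(isa, "avx2") == 0)
        return __builtin_cpu_supports("avx2") ? 1 : 0;
    if (strcmp(isa, "avx") == 0)
        return __builtin_cpu_supports("avx") ? 1 : 0;
    if (strcmp(isa, "sse4.2") == 0)
        return __builtin_cpu_supports("sse4.2") ? 1 : 0;
    return -1;
}

/* Hypothetical init guard: returns 0 if the CPU can run code built for
 * `required_isa`, otherwise reports the problem and returns -1 so the
 * caller can disable the provider instead of crashing with SIGILL later. */
static int check_isa_or_fail(const char *required_isa)
{
    if (cpu_has_isa(required_isa) != 1) {
        fprintf(stderr, "CPU does not support %s; refusing to initialise\n",
                required_isa);
        return -1;
    }
    return 0;
}
```

A guard like `check_isa_or_fail("avx2")` only helps if it runs before any code that the compiler was allowed to emit AVX2 instructions for, so in practice the check itself would likely need to live in a file built with more conservative flags.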
We require AVX/AVX2 to compile because most environments show a significant performance improvement when PSM3 is compiled with it. We also do not technically support or validate on older CPUs (AVX was added ~10 years ago). If you have a need for PSM3 to run on an older platform, feel free to email me ([email protected]) and we can discuss more.
Describe the bug
The ILL signal arrives here:
psm3_context_set_affinity (ep=0x7fb3bc20c5c0, nic_cpuset=...) at prov/psm3/psm3/psm_context.c:730
727     if (cpu_count > nic_count) {
728         andcpuset = cpuset;
729     } else {
730         CPU_AND(&andcpuset, &cpuset, &nic_cpuset);
731     }
The call stack leading down to it is
#0 0x00007fb3ba5f718e in psm3_context_set_affinity (ep=0x7fb3bc20c5c0, nic_cpuset=...) at prov/psm3/psm3/psm_context.c:730
#1 0x00007fb3ba5ea72e in psm3_hfp_sockets_context_open (unit=<optimized out>, port=<optimized out>, addr_index=<optimized out>, open_timeout=<optimized out>, ep=<optimized out>, job_key=<optimized out>, retryCnt=<optimized out>) at prov/psm3/psm3/hal_sockets/sockets_hal_inline_i.h:122
#2 0x00007fb3ba604067 in psm3_context_open (timeout_ns=<optimized out>, network_pkey=<optimized out>, job_key=0x7fb3c0d99678 "
|\003ij\177", addr_index=, port=,unit_param=, ep=) at prov/psm3/psm3/psm_context.c:558
#3 psm3_ep_open_device (unique_job_key=, opts=, ep=) at prov/psm3/psm3/psm_ep.c:1544
#4 psm3_ep_open_internal (unique_job_key=unique_job_key@entry=0x7fb3bc1576d8 "$};\257\307=$\n=\256UD\310M\360D", devid_enabled=devid_enabled@entry=0x7fb3c0d99954, opts_i=opts_i@entry=0x7fb3c0d99900,
mq=mq@entry=0x7fb3bc15e920, epo=epo@entry=0x7fb3c0d998d8, epido=epido@entry=0x7fb3c0d99960) at prov/psm3/psm3/psm_ep.c:983
#5 0x00007fb3ba606bf0 in psm3_ep_open (unique_job_key=0x7fb3bc1576d8 "$};\257\307=$\n=\256UD\310M\360D", opts_i=, epo=0x7fb3bc15e7a0, epido=0x7fb3bc15e7a8) at prov/psm3/psm3/psm_ep.c:1202
#6 0x00007fb3ba66f5a0 in psmx3_trx_ctxt_alloc.isra.0 (domain=domain@entry=0x7fb3bc157760, src_addr=src_addr@entry=0x7fb3bc14afd0, usage_flags=usage_flags@entry=3,
uuid=0x7fb3bc1576d8 "$};\257\307=$\n=\256UD\310M\360D", uuid@entry=0x0, sep_ctxt_idx=) at prov/psm3/src/psmx3_trx_ctxt.c:317
#7 0x00007fb3ba5d3440 in psmx3_sep_open (domain=0x7fb3bc157760, info=0x7fb3bc14c8a0, sep=, context=0x0) at prov/psm3/src/psmx3_ep.c:1036
#8 0x00007fb3c0362f63 in fi_scalable_ep (context=0x0, sep=0x7fb3c0d99d20, info=0x7fb3bc14c8a0, domain=) at /usr/include/rdma/fi_endpoint.h:196
#9 mca_btl_ofi_init_device (info=) at /usr/src/debug/openmpi-4.1.4-8.fc38.x86_64/opal/mca/btl/ofi/btl_ofi_component.c:542
#10 mca_btl_ofi_component_init (num_btl_modules=0x7fb3c0d9a104, enable_progress_threads=, enable_mpi_threads=)
at /usr/src/debug/openmpi-4.1.4-8.fc38.x86_64/opal/mca/btl/ofi/btl_ofi_component.c:400
#11 0x00007fb3c6ad1966 in mca_btl_base_select (enable_progress_threads=true, enable_mpi_threads=false) at mca/btl/base/btl_base_select.c:110
#12 0x00007fb3c036d2fe in mca_bml_r2_component_init (priority=0x7fb3c0d9a184, enable_progress_threads=, enable_mpi_threads=)
at /usr/src/debug/openmpi-4.1.4-8.fc38.x86_64/ompi/mca/bml/r2/bml_r2_component.c:86
#13 0x00007fb3c6bcb059 in mca_bml_base_init (enable_progress_threads=true, enable_mpi_threads=false) at mca/bml/base/bml_base_init.c:74
#14 0x00007fb3c6c093bb in ompi_mpi_init (argc=, argv=, requested=2, provided=0x7fb3c0d9a320, reinit_ok=) at runtime/ompi_mpi_init.c:613
#15 0x00007fb3c6ba8631 in PMPI_Init_thread (argc=0x0, argv=0x0, required=2, provided=0x7fb3c0d9a320) at mpi/c/profile/pinit_thread.c:69
Is this simply a case of a CPU that is too old?
To Reproduce
call ompi_mpi_init
Expected behavior
No ILL
Output
N/A
Environment:
OS (if not Linux), provider, endpoint type, etc.
Additional context
I use boost.mpi which uses openmpi-4.1.4-8 which uses libfabric-1.17.0-3. The system is Linux Fedora 38.
The processor I have is an old Intel(R) Core(TM) i7-2720QM (Sandy Bridge microarchitecture).