prov/psm3: illegal instruction #8933

Closed
finjulhich opened this issue May 13, 2023 · 7 comments

@finjulhich

Describe the bug
The ILL signal arrives here:
psm3_context_set_affinity (ep=0x7fb3bc20c5c0, nic_cpuset=...) at prov/psm3/psm3/psm_context.c:730
727 if (cpu_count > nic_count) {
728     andcpuset = cpuset;
729 } else {
730     CPU_AND(&andcpuset, &cpuset, &nic_cpuset);
731 }
The call stack leading down to it is:
#0 0x00007fb3ba5f718e in psm3_context_set_affinity (ep=0x7fb3bc20c5c0, nic_cpuset=...) at prov/psm3/psm3/psm_context.c:730
#1 0x00007fb3ba5ea72e in psm3_hfp_sockets_context_open (unit=<optimized out>, port=<optimized out>, addr_index=<optimized out>, open_timeout=<optimized out>, ep=<optimized out>, job_key=<optimized out>, retryCnt=<optimized out>) at prov/psm3/psm3/hal_sockets/sockets_hal_inline_i.h:122
#2 0x00007fb3ba604067 in psm3_context_open (timeout_ns=<optimized out>, network_pkey=<optimized out>, job_key=0x7fb3c0d99678 "|\003ij\177", addr_index=, port=,
unit_param=, ep=) at prov/psm3/psm3/psm_context.c:558
#3 psm3_ep_open_device (unique_job_key=, opts=, ep=) at prov/psm3/psm3/psm_ep.c:1544
#4 psm3_ep_open_internal (unique_job_key=unique_job_key@entry=0x7fb3bc1576d8 "$};\257\307=$\n=\256UD\310M\360D", devid_enabled=devid_enabled@entry=0x7fb3c0d99954, opts_i=opts_i@entry=0x7fb3c0d99900,
mq=mq@entry=0x7fb3bc15e920, epo=epo@entry=0x7fb3c0d998d8, epido=epido@entry=0x7fb3c0d99960) at prov/psm3/psm3/psm_ep.c:983
#5 0x00007fb3ba606bf0 in psm3_ep_open (unique_job_key=0x7fb3bc1576d8 "$};\257\307=$\n=\256UD\310M\360D", opts_i=, epo=0x7fb3bc15e7a0, epido=0x7fb3bc15e7a8) at prov/psm3/psm3/psm_ep.c:1202
#6 0x00007fb3ba66f5a0 in psmx3_trx_ctxt_alloc.isra.0 (domain=domain@entry=0x7fb3bc157760, src_addr=src_addr@entry=0x7fb3bc14afd0, usage_flags=usage_flags@entry=3,
uuid=0x7fb3bc1576d8 "$};\257\307=$\n=\256UD\310M\360D", uuid@entry=0x0, sep_ctxt_idx=) at prov/psm3/src/psmx3_trx_ctxt.c:317
#7 0x00007fb3ba5d3440 in psmx3_sep_open (domain=0x7fb3bc157760, info=0x7fb3bc14c8a0, sep=, context=0x0) at prov/psm3/src/psmx3_ep.c:1036
#8 0x00007fb3c0362f63 in fi_scalable_ep (context=0x0, sep=0x7fb3c0d99d20, info=0x7fb3bc14c8a0, domain=) at /usr/include/rdma/fi_endpoint.h:196
#9 mca_btl_ofi_init_device (info=) at /usr/src/debug/openmpi-4.1.4-8.fc38.x86_64/opal/mca/btl/ofi/btl_ofi_component.c:542
#10 mca_btl_ofi_component_init (num_btl_modules=0x7fb3c0d9a104, enable_progress_threads=, enable_mpi_threads=)
at /usr/src/debug/openmpi-4.1.4-8.fc38.x86_64/opal/mca/btl/ofi/btl_ofi_component.c:400
#11 0x00007fb3c6ad1966 in mca_btl_base_select (enable_progress_threads=true, enable_mpi_threads=false) at mca/btl/base/btl_base_select.c:110
#12 0x00007fb3c036d2fe in mca_bml_r2_component_init (priority=0x7fb3c0d9a184, enable_progress_threads=, enable_mpi_threads=)
at /usr/src/debug/openmpi-4.1.4-8.fc38.x86_64/ompi/mca/bml/r2/bml_r2_component.c:86
#13 0x00007fb3c6bcb059 in mca_bml_base_init (enable_progress_threads=true, enable_mpi_threads=false) at mca/bml/base/bml_base_init.c:74
#14 0x00007fb3c6c093bb in ompi_mpi_init (argc=, argv=, requested=2, provided=0x7fb3c0d9a320, reinit_ok=) at runtime/ompi_mpi_init.c:613
#15 0x00007fb3c6ba8631 in PMPI_Init_thread (argc=0x0, argv=0x0, required=2, provided=0x7fb3c0d9a320) at mpi/c/profile/pinit_thread.c:69

Is this simply a case of the CPU being too old?

To Reproduce
call ompi_mpi_init

Expected behavior
No ILL

Output
N/A

Environment:
OS (if not Linux), provider, endpoint type, etc.

Additional context
I use boost.mpi, which uses openmpi-4.1.4-8, which in turn uses libfabric-1.17.0-3. The system is Fedora Linux 38.
The processor is an old Intel(R) Core(TM) i7-2720QM (Sandy Bridge microarchitecture).

@finjulhich finjulhich added the bug label May 13, 2023
@acgoldma
Contributor

At build time, psm3 decides which instruction set to compile with. The minimum requirement is AVX (v1). However, we also test for AVX2 and compile with it if it is available on the build machine, as it provides a noticeable performance improvement. From the description, this looks like it was compiled with AVX2 support, which means you will need a Haswell-generation (or newer) processor to run this binary.

However, the Intel(R) Core(TM) i7-2720QM should support AVX (v1). While we do not claim to validate or support psm3 on your processor, you may be able to run it by recompiling psm3/libfabric yourself on this system.
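
One way to see which instruction set a given build targets is to look at the compiler's predefined macros. The sketch below is not PSM3's build logic; it only assumes the GCC/Clang-style macros `__AVX2__`, `__AVX__` and `__SSE4_2__`, and `build_isa_name` and `print_build_isa` are hypothetical helper names.

```c
#include <stdio.h>

/* Hypothetical helper: reports the widest x86 vector ISA the compiler
 * was allowed to use for this translation unit, based on the macros
 * that GCC and Clang predefine for -mavx2, -mavx and -msse4.2. */
const char *build_isa_name(void)
{
#if defined(__AVX2__)
    return "avx2";
#elif defined(__AVX__)
    return "avx";
#elif defined(__SSE4_2__)
    return "sse4.2";
#else
    return "baseline";
#endif
}

/* Prints the result so it can be pasted into a build log or bug report. */
void print_build_isa(void)
{
    printf("compiled for: %s\n", build_isa_name());
}
```

Compiling the same file with and without `-mavx2` changes the reported string. A package built with AVX2 enabled may therefore contain instructions that a Sandy Bridge CPU, which has AVX but not AVX2, cannot execute.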

@nmorey
Contributor

nmorey commented Jul 28, 2023

Could a dynamic check for AVX be added?
I had a similar issue reported for SUSE Leap: https://bugzilla.suse.com/show_bug.cgi?id=1213538

The libfabric package is compiled with AVX enabled (or PSM3 would not be part of the package), but the runtime CPU does not have AVX support.
It would be pretty handy if PSM3 could detect whether AVX is supported at runtime (or not) before the decision is taken to use it as a provider.

@acgoldma
Contributor

It would be pretty handy if PSM3 could detect whether AVX is supported at runtime

We are currently working out the best way to detect CPU support at runtime in a way that works across multiple compilers and CPUs. We are testing __builtin_cpu_supports("avx"), which seems to cover all cases, but we need to be sure.
It appears to be supported by the common Linux compilers (gcc/clang and icc/icx).
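
A minimal sketch of such a runtime query, assuming a GCC-compatible compiler on x86. The wrapper names `cpu_has_avx`, `cpu_has_avx2` and `report_cpu_features` are hypothetical and are not taken from PSM3.

```c
#include <stdio.h>

/* __builtin_cpu_supports() only accepts a string literal, so each
 * feature gets its own small wrapper. Both return 1 if the running
 * CPU reports the feature and 0 otherwise. */
int cpu_has_avx(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init();   /* harmless if already initialized */
    return __builtin_cpu_supports("avx") ? 1 : 0;
#else
    return 0;               /* not an x86 target */
#endif
}

int cpu_has_avx2(void)
{
#if defined(__x86_64__) || defined(__i386__)
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? 1 : 0;
#else
    return 0;
#endif
}

/* Prints what the running CPU reports, e.g. for a bug report. */
void report_cpu_features(void)
{
    printf("avx=%d avx2=%d\n", cpu_has_avx(), cpu_has_avx2());
}
```

On the i7-2720QM from this report, one would expect `cpu_has_avx()` to return 1 and `cpu_has_avx2()` to return 0, since Sandy Bridge has AVX but not AVX2.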

@acgoldma acgoldma self-assigned this Jul 28, 2023
@acgoldma
Contributor

We have an upcoming release due in mid-September and plan to deliver this fix with that release.

@acgoldma
Contributor

acgoldma commented Oct 2, 2023

Took a bit to come out, but we have pushed the patch with our latest release: #9389

@nmorey
Contributor

nmorey commented Oct 2, 2023

IIUC, this patch detects at runtime whether the CPU is compatible with the -march selected at build time, and fails cleanly if it is not.
That is definitely better than the current state.
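
The behaviour described here could look roughly like the sketch below. It is not the code from the patch: `cpu_matches_build_isa` and `provider_init_sketch` are hypothetical names, and the check assumes GCC/Clang predefined macros and builtins.

```c
#include <stdio.h>

/* Returns 1 if the running CPU supports the vector ISA this file was
 * compiled for, 0 otherwise. The build-time ISA is read from the
 * compiler's predefined macros; the CPU is queried at runtime. */
int cpu_matches_build_isa(void)
{
#if defined(__AVX2__)
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx2") ? 1 : 0;
#elif defined(__AVX__)
    __builtin_cpu_init();
    return __builtin_cpu_supports("avx") ? 1 : 0;
#else
    return 1;   /* no AVX requirement was baked into this build */
#endif
}

/* Hypothetical provider entry point: returns -1 with a message
 * instead of letting the process die later with SIGILL. */
int provider_init_sketch(void)
{
    if (!cpu_matches_build_isa()) {
        fprintf(stderr,
                "provider disabled: CPU lacks the vector ISA this build requires\n");
        return -1;
    }
    return 0;
}
```

One caveat with this approach: the check itself should ideally live in a translation unit compiled without the AVX flags, so that the compiler cannot emit the very instructions being tested for before the check has run.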

However, it means that distros will have to build for the most compatible arch, meaning SSE4.2, to avoid compatibility issues.
I had a very quick glance at the code and didn't see anything _mm256-related, only _mm_crc32_u64, which requires just SSE4.2.
So why is it needed to test/enable AVX or AVX2?
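
For reference, a use of `_mm_crc32_u64` that depends only on SSE4.2 might look like the sketch below. This is not PSM3's code: `crc32c`, `crc32c_sw` and `crc32c_hw` are hypothetical names, and the hardware path is only compiled when the build enables SSE4.2 on x86-64 (for example with `-msse4.2`). Otherwise a portable bitwise CRC32C is used.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__SSE4_2__) && defined(__x86_64__)
#include <nmmintrin.h>   /* SSE4.2 intrinsics, including _mm_crc32_u64 */

/* Hardware path: consumes 8 bytes per crc32 instruction, then the tail. */
static uint32_t crc32c_hw(uint32_t crc, const unsigned char *p, size_t n)
{
    uint64_t c = crc;
    while (n >= 8) {
        uint64_t v;
        memcpy(&v, p, 8);            /* avoids unaligned-access UB */
        c = _mm_crc32_u64(c, v);
        p += 8;
        n -= 8;
    }
    crc = (uint32_t)c;
    while (n--)
        crc = _mm_crc32_u8(crc, *p++);
    return crc;
}
#define CRC32C_UPDATE crc32c_hw
#else
/* Portable path: bitwise CRC32C (reflected polynomial 0x82F63B78). */
static uint32_t crc32c_sw(uint32_t crc, const unsigned char *p, size_t n)
{
    while (n--) {
        crc ^= *p++;
        for (int k = 0; k < 8; k++)
            crc = (crc >> 1) ^ (0x82F63B78u & (0u - (crc & 1u)));
    }
    return crc;
}
#define CRC32C_UPDATE crc32c_sw
#endif

/* CRC32C of a buffer with the usual initial value and final inversion. */
uint32_t crc32c(const void *buf, size_t n)
{
    uint32_t crc = CRC32C_UPDATE(0xFFFFFFFFu, (const unsigned char *)buf, n);
    return crc ^ 0xFFFFFFFFu;
}
```

Both paths produce the standard CRC32C check value: `crc32c("123456789", 9)` returns `0xE3069283`.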

@acgoldma
Contributor

acgoldma commented Oct 2, 2023

So why is it needed to test/enable AVX or AVX2?

We require AVX/AVX2 at compile time because a significant performance improvement is observed in most environments when PSM3 is compiled with it. We also do not technically support or validate on older CPUs (AVX was added ~10 years ago).

If you have a need for PSM3 to run on an older platform, feel free to email me ([email protected]) and we can discuss more.
