Regressions in MKL 2025.0 #83

Open
h-vetinari opened this issue Nov 9, 2024 · 15 comments

@h-vetinari
Member

h-vetinari commented Nov 9, 2024

In addition to the question of whether mkl now really requires __glibc >=2.28 on linux, I tested MKL 2025.0 against the test suite from netlib lapack, and it seems there are some substantial test failures.

  • the simplest upgrade runs into constraints with libhwloc, see here (--> not the fault of this feedstock per se, but cannot test)

  • testing MKL 2025.0 against LAPACK 3.9.0 together with the switch to flang yields 75/95 failures (logs):

    21% tests passed, 75 tests failed out of 95
    
    Total Test time (real) =  14.87 sec
    
    The following tests FAILED:
        1 - LAPACK-xlintsts_stest_in (Failed)
        2 - LAPACK-xlintstrfs_stest_rfp_in (Failed)
        3 - LAPACK-xeigtsts_nep_in (Failed)
        4 - LAPACK-xeigtsts_sep_in (Failed)
        5 - LAPACK-xeigtsts_se2_in (Failed)
        6 - LAPACK-xeigtsts_sec_in (Failed)
        7 - LAPACK-xeigtsts_sed_in (Failed)
        8 - LAPACK-xeigtsts_sgg_in (Failed)
        9 - LAPACK-xeigtsts_sgd_in (Failed)
       10 - LAPACK-xeigtsts_ssb_in (Failed)
       11 - LAPACK-xeigtsts_ssg_in (Failed)
       15 - LAPACK-xeigtsts_sgbak_in (Failed)
       16 - LAPACK-xeigtsts_sbb_in (Failed)
       17 - LAPACK-xeigtsts_glm_in (Failed)
       18 - LAPACK-xeigtsts_gqr_in (Failed)
       19 - LAPACK-xeigtsts_gsv_in (Failed)
       20 - LAPACK-xeigtsts_csd_in (Failed)
       21 - LAPACK-xeigtsts_lse_in (Failed)
       22 - LAPACK-xlintstd_dtest_in (Failed)
       23 - LAPACK-xlintstrfd_dtest_rfp_in (Failed)
       24 - LAPACK-xeigtstd_nep_in (Failed)
       25 - LAPACK-xeigtstd_sep_in (Failed)
       26 - LAPACK-xeigtstd_se2_in (Failed)
       27 - LAPACK-xeigtstd_dec_in (Failed)
       28 - LAPACK-xeigtstd_ded_in (Failed)
       29 - LAPACK-xeigtstd_dgg_in (Failed)
       30 - LAPACK-xeigtstd_dgd_in (Failed)
       31 - LAPACK-xeigtstd_dsb_in (Failed)
       32 - LAPACK-xeigtstd_dsg_in (Failed)
       36 - LAPACK-xeigtstd_dgbak_in (Failed)
       37 - LAPACK-xeigtstd_dbb_in (Failed)
       38 - LAPACK-xeigtstd_glm_in (Failed)
       39 - LAPACK-xeigtstd_gqr_in (Failed)
       40 - LAPACK-xeigtstd_gsv_in (Failed)
       41 - LAPACK-xeigtstd_csd_in (Failed)
       42 - LAPACK-xeigtstd_lse_in (Failed)
       43 - LAPACK-xeigtstc_nep_in (Failed)
       44 - LAPACK-xeigtstc_cec_in (Failed)
       45 - LAPACK-xeigtstc_cgg_in (Failed)
       46 - LAPACK-xeigtstc_cgd_in (Failed)
       50 - LAPACK-xeigtstc_cgbak_in (Failed)
       51 - LAPACK-xeigtstc_cbb_in (Failed)
       52 - LAPACK-xeigtstc_glm_in (Failed)
       53 - LAPACK-xeigtstc_gqr_in (Failed)
       54 - LAPACK-xeigtstc_gsv_in (Failed)
       55 - LAPACK-xeigtstc_csd_in (Failed)
       56 - LAPACK-xeigtstc_lse_in (Failed)
       57 - LAPACK-xlintstz_ztest_in (Failed)
       58 - LAPACK-xlintstrfz_ztest_rfp_in (Failed)
       59 - LAPACK-xeigtstz_nep_in (Failed)
       60 - LAPACK-xeigtstz_sep_in (Failed)
       61 - LAPACK-xeigtstz_se2_in (Failed)
       62 - LAPACK-xeigtstz_zec_in (Failed)
       63 - LAPACK-xeigtstz_zed_in (Failed)
       64 - LAPACK-xeigtstz_zgg_in (Failed)
       65 - LAPACK-xeigtstz_zgd_in (Failed)
       66 - LAPACK-xeigtstz_zsb_in (Failed)
       67 - LAPACK-xeigtstz_zsg_in (Failed)
       71 - LAPACK-xeigtstz_zgbak_in (Failed)
       72 - LAPACK-xeigtstz_zbb_in (Failed)
       73 - LAPACK-xeigtstz_glm_in (Failed)
       74 - LAPACK-xeigtstz_gqr_in (Failed)
       75 - LAPACK-xeigtstz_gsv_in (Failed)
       76 - LAPACK-xeigtstz_csd_in (Failed)
       77 - LAPACK-xeigtstz_lse_in (Failed)
       78 - LAPACK-xlintstds_dstest_in (Failed)
       79 - LAPACK-xlintstzc_zctest_in (Failed)
       83 - BLAS-xblat3s (Failed)
       86 - BLAS-xblat3d (Failed)
       88 - BLAS-xblat3c (Failed)
       91 - BLAS-xblat3z (Failed)
       92 - example_DGESV_rowmajor (Exit code 0xc06d007e)
       93 - example_DGESV_colmajor (Exit code 0xc06d007e)
       94 - example_DGELS_rowmajor (Exit code 0xc06d007e)
       95 - example_DGELS_colmajor (Exit code 0xc06d007e)
    
  • testing MKL 2025.0 against LAPACK 3.11.0 (together with the switch to flang) also yields 75/95 failures (logs):

    21% tests passed, 75 tests failed out of 95
    
    Total Test time (real) =  15.80 sec
    
    The following tests FAILED:
        2 - LAPACK-xlintsts_stest_in (Failed)
        3 - LAPACK-xlintstrfs_stest_rfp_in (Failed)
        4 - LAPACK-xeigtsts_nep_in (Failed)
        5 - LAPACK-xeigtsts_sep_in (Failed)
        6 - LAPACK-xeigtsts_se2_in (Failed)
        7 - LAPACK-xeigtsts_sec_in (Failed)
        8 - LAPACK-xeigtsts_sed_in (Failed)
        9 - LAPACK-xeigtsts_sgg_in (Failed)
       10 - LAPACK-xeigtsts_sgd_in (Failed)
       11 - LAPACK-xeigtsts_ssb_in (Failed)
       12 - LAPACK-xeigtsts_ssg_in (Failed)
       16 - LAPACK-xeigtsts_sgbak_in (Failed)
       17 - LAPACK-xeigtsts_sbb_in (Failed)
       18 - LAPACK-xeigtsts_glm_in (Failed)
       19 - LAPACK-xeigtsts_gqr_in (Failed)
       20 - LAPACK-xeigtsts_gsv_in (Failed)
       21 - LAPACK-xeigtsts_csd_in (Failed)
       22 - LAPACK-xeigtsts_lse_in (Failed)
       23 - LAPACK-xlintstd_dtest_in (Failed)
       24 - LAPACK-xlintstrfd_dtest_rfp_in (Failed)
       25 - LAPACK-xeigtstd_nep_in (Failed)
       26 - LAPACK-xeigtstd_sep_in (Failed)
       27 - LAPACK-xeigtstd_se2_in (Failed)
       28 - LAPACK-xeigtstd_dec_in (Failed)
       29 - LAPACK-xeigtstd_ded_in (Failed)
       30 - LAPACK-xeigtstd_dgg_in (Failed)
       31 - LAPACK-xeigtstd_dgd_in (Failed)
       32 - LAPACK-xeigtstd_dsb_in (Failed)
       33 - LAPACK-xeigtstd_dsg_in (Failed)
       37 - LAPACK-xeigtstd_dgbak_in (Failed)
       38 - LAPACK-xeigtstd_dbb_in (Failed)
       39 - LAPACK-xeigtstd_glm_in (Failed)
       40 - LAPACK-xeigtstd_gqr_in (Failed)
       41 - LAPACK-xeigtstd_gsv_in (Failed)
       42 - LAPACK-xeigtstd_csd_in (Failed)
       43 - LAPACK-xeigtstd_lse_in (Failed)
       44 - LAPACK-xeigtstc_nep_in (Failed)
       45 - LAPACK-xeigtstc_cec_in (Failed)
       46 - LAPACK-xeigtstc_cgg_in (Failed)
       47 - LAPACK-xeigtstc_cgd_in (Failed)
       51 - LAPACK-xeigtstc_cgbak_in (Failed)
       52 - LAPACK-xeigtstc_cbb_in (Failed)
       53 - LAPACK-xeigtstc_glm_in (Failed)
       54 - LAPACK-xeigtstc_gqr_in (Failed)
       55 - LAPACK-xeigtstc_gsv_in (Failed)
       56 - LAPACK-xeigtstc_csd_in (Failed)
       57 - LAPACK-xeigtstc_lse_in (Failed)
       58 - LAPACK-xlintstz_ztest_in (Failed)
       59 - LAPACK-xlintstrfz_ztest_rfp_in (Failed)
       60 - LAPACK-xeigtstz_nep_in (Failed)
       61 - LAPACK-xeigtstz_sep_in (Failed)
       62 - LAPACK-xeigtstz_se2_in (Failed)
       63 - LAPACK-xeigtstz_zec_in (Failed)
       64 - LAPACK-xeigtstz_zed_in (Failed)
       65 - LAPACK-xeigtstz_zgg_in (Failed)
       66 - LAPACK-xeigtstz_zgd_in (Failed)
       67 - LAPACK-xeigtstz_zsb_in (Failed)
       68 - LAPACK-xeigtstz_zsg_in (Failed)
       72 - LAPACK-xeigtstz_zgbak_in (Failed)
       73 - LAPACK-xeigtstz_zbb_in (Failed)
       74 - LAPACK-xeigtstz_glm_in (Failed)
       75 - LAPACK-xeigtstz_gqr_in (Failed)
       76 - LAPACK-xeigtstz_gsv_in (Failed)
       77 - LAPACK-xeigtstz_csd_in (Failed)
       78 - LAPACK-xeigtstz_lse_in (Failed)
       79 - LAPACK-xlintstds_dstest_in (Failed)
       80 - LAPACK-xlintstzc_zctest_in (Failed)
       83 - BLAS-xblat3s (Failed)
       86 - BLAS-xblat3d (Failed)
       88 - BLAS-xblat3c (Failed)
       91 - BLAS-xblat3z (Failed)
       92 - example_DGESV_rowmajor (Exit code 0xc06d007e)
       93 - example_DGESV_colmajor (Exit code 0xc06d007e)
       94 - example_DGELS_rowmajor (Exit code 0xc06d007e)
       95 - example_DGELS_colmajor (Exit code 0xc06d007e)
    
    			-->   LAPACK TESTING SUMMARY  <--
    		Processing LAPACK Testing output found in the TESTING directory
    SUMMARY             	nb test run 	numerical error   	other error  
    ================   	===========	=================	================  
    REAL             	0		0	(0.000%)	0	(0.000%)	
    DOUBLE PRECISION	0		0	(0.000%)	0	(0.000%)	
    COMPLEX          	0		0	(0.000%)	41	(0.000%)	
    COMPLEX16         	0		0	(0.000%)	41	(0.000%)	
    
    --> ALL PRECISIONS	0		0	(0.000%)	82	(0.000%)	
    

The reason I'm almost certain that this is unrelated to the switch to flang is that MKL 2024.2 + flang only has the following failures (logs):

97% tests passed, 3 tests failed out of 95

Total Test time (real) =  35.58 sec

The following tests FAILED:
	  1 - LAPACK-xlintsts_stest_in (Failed)
	 22 - LAPACK-xlintstd_dtest_in (Failed)
	 57 - LAPACK-xlintstz_ztest_in (Failed)
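The comparison between the two MKL versions can be made mechanical by diffing the sets of failing test names. The following is a minimal sketch, assuming the ctest output has been saved as text; the helper name `failed_tests` and the shortened log excerpts are illustrative and not part of the feedstock. Tests are matched by name rather than by number, because the numbering shifts between LAPACK 3.9.0 and 3.11.0 in the logs above.

```python
import re

# Matches ctest failure lines such as
#   "  22 - LAPACK-xlintstd_dtest_in (Failed)"
#   "  92 - example_DGESV_rowmajor (Exit code 0xc06d007e)"
FAILED_LINE = re.compile(
    r"^\s*\d+\s+-\s+(\S+)\s+\((?:Failed|Exit code [^)]*)\)\s*$"
)


def failed_tests(ctest_output: str) -> set[str]:
    """Return the names of the tests that ctest listed as failed."""
    names = set()
    for line in ctest_output.splitlines():
        match = FAILED_LINE.match(line)
        if match:
            names.add(match.group(1))
    return names


# Short excerpts copied from the logs above (not the full lists).
mkl_2024_2 = """
      1 - LAPACK-xlintsts_stest_in (Failed)
     22 - LAPACK-xlintstd_dtest_in (Failed)
     57 - LAPACK-xlintstz_ztest_in (Failed)
"""
mkl_2025_0 = """
      1 - LAPACK-xlintsts_stest_in (Failed)
      3 - LAPACK-xeigtsts_nep_in (Failed)
     83 - BLAS-xblat3s (Failed)
     92 - example_DGESV_rowmajor (Exit code 0xc06d007e)
"""

# Failures that appear with 2025.0 but not with 2024.2.
new_failures = failed_tests(mkl_2025_0) - failed_tests(mkl_2024_2)
print(sorted(new_failures))
```

Run against the full logs, the difference would list only the tests that regressed with the newer MKL, independent of the compiler change.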

The errors roughly look as follows

Intel oneMKL ERROR: Parameter 1 was incorrect on entry to ZGEMM .
Intel oneMKL ERROR: Parameter 2 was incorrect on entry to ZGEMM .
Intel oneMKL ERROR: Parameter 3 was incorrect on entry to ZGEMM .

Perhaps this is related to some linkage issue? Was something changed w.r.t. the compiler setup for MKL 2025.0 that could have affected the symbol names?
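To see which routines and argument positions are affected across a long log, the error lines can be tallied. This is a rough sketch that assumes the messages keep exactly the format quoted above; `tally_mkl_errors` is a hypothetical helper, and the sample text is just the three lines from this comment.

```python
import re
from collections import Counter

# Pattern for lines like:
#   "Intel oneMKL ERROR: Parameter 1 was incorrect on entry to ZGEMM ."
ERROR_LINE = re.compile(
    r"Intel oneMKL ERROR: Parameter (\d+) was incorrect on entry to (\w+)"
)


def tally_mkl_errors(log_text: str) -> Counter:
    """Count (routine, parameter index) pairs reported by oneMKL."""
    return Counter(
        (routine, int(param))
        for param, routine in ERROR_LINE.findall(log_text)
    )


sample_log = """\
Intel oneMKL ERROR: Parameter 1 was incorrect on entry to ZGEMM .
Intel oneMKL ERROR: Parameter 2 was incorrect on entry to ZGEMM .
Intel oneMKL ERROR: Parameter 3 was incorrect on entry to ZGEMM .
"""

errors = tally_mkl_errors(sample_log)
for (routine, param), count in sorted(errors.items()):
    print(f"{routine}: parameter {param} rejected {count} time(s)")
```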

CC @ZzEeKkAa @Alexsandruss @oleksandr-pavlyk @isuruf

@oleksandr-pavlyk
Contributor

Only SYCL components of MKL need 2.28 as it is needed by DPC++ runtime.

Defer to @mkrainiuk for the remaining issues.

@h-vetinari
Member Author

> the simplest upgrade runs into constraints with libhwloc, see here (--> not the fault of this feedstock per se, but cannot test)

That constraint was fixed, and the same 75/95 failures now also appear completely without any change to the compilers (logs).

@mkrainiuk, please advise what's going on here or how we can fix it.

@mkrainiuk

Looks like oneMKL might have some API changes, adding @sknepper for confirmation.
Another potential problem might be that the compilation and linking against oneMKL are not correct (e.g. the test was built with -DMKL_ILP64 flag but it used LP64 oneMKL interface library). Could someone help me to get the exact build logs with compilation and link lines? Unfortunately I can't find this information in the log of failed step from conda-forge/blas-feedstock#128 ...

@h-vetinari
Member Author

Thanks for the response!

> could someone help me to get the exact build logs with compilation and link lines? Unfortunately I can't find this information in the log of failed step from conda-forge/blas-feedstock#128 ...

In the blas metapackage we only build the tests from https://github.com/Reference-LAPACK/lapack/ and run them against the various blas implementations. The MKL packages themselves aren't built in conda-forge, they're only repackaged, so I cannot offer logs on that. Presumably they should be available somewhere Intel-internally?

> e.g. the test was built with -DMKL_ILP64 flag but it used LP64 oneMKL interface library

Not sure if my info there is incorrect or out of date, but didn't MKL use to build both ILP64 & LP64 symbols into the same library?

@sknepper

> That constraint was fixed, and the same 75/95 failures now also appear completely without any change to the compilers (logs).

In these logs, it looks like Linux was successful while Windows had failures. Am I understanding the logs correctly, @h-vetinari ?

As Maria said, these "Parameter x was incorrect on entry to" errors often relate to incorrect configuration of the LP64/ILP64 interfaces.

Selected domains provide API extensions with the _64 suffix (for example, SGEMM_64) for supporting large data arrays in the LP64 library, which enables the mixing of data types in one application.
Are you using the LP64 or ILP64 interface library?
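Why an LP64/ILP64 mismatch tends to surface as "Parameter x was incorrect" can be illustrated without MKL at all. In the LP64 interface, integer arguments are 32-bit, while in ILP64 they are 64-bit, and Fortran-style interfaces pass them by reference. The sketch below only simulates the memory reads with Python's `struct` module; the values 3 and 4 and the assumption that another integer sits directly next to the argument in memory are hypothetical, chosen for illustration.

```python
import struct

# Two adjacent 32-bit integers, as an LP64 caller might happen to store
# them (hypothetical values; little-endian layout as on x86-64).
lp64_memory = struct.pack("<ii", 3, 4)

# A routine expecting ILP64 reads 8 bytes through the pointer instead of 4,
# so the neighbouring value becomes the high half of the number it sees.
(seen_by_ilp64,) = struct.unpack_from("<q", lp64_memory, 0)
print(seen_by_ilp64)  # 3 + 4 * 2**32, a nonsensical dimension

# The opposite mismatch: an ILP64 caller stores a 64-bit 3 and an LP64
# routine reads only the low 4 bytes. For small values on little-endian
# machines this happens to look correct, which can hide the problem.
ilp64_memory = struct.pack("<q", 3)
(seen_by_lp64,) = struct.unpack_from("<i", ilp64_memory, 0)
print(seen_by_lp64)
```

A dimension like the first value would fail the argument checks at the top of a BLAS routine, which is consistent with the kind of message shown earlier in the thread, though it does not prove that this is the cause here.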

@h-vetinari
Member Author

> In these logs, it looks like Linux was successful while Windows had failures. Am I understanding the logs correctly, @h-vetinari ?

Yes, the linux issue has been resolved in #84, all the remaining problems are on windows.

> Are you using the LP64 or ILP64 interface library?

So far we haven't been actively distinguishing (that I know of) which integer model we use for MKL (though we do for OpenBLAS for example). So the answer is probably whatever Reference-LAPACK (3.9 resp. 3.11) does by default on windows.

How would I be able to set this correctly? Just define -DMKL_LP64=1 resp. -DMKL_ILP64=1? Has the default for this changed in MKL 2025.0 somehow?

@ZzEeKkAa
Contributor

Maybe not a direct answer, but there is a tool from Intel to figure out the proper linker arguments:
https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html

@h-vetinari
Member Author

Thanks. This suggests to link mkl_blas95_lp64.lib mkl_lapack95_lp64.lib mkl_intel_lp64_dll.lib mkl_tbb_thread_dll.lib mkl_core_dll.lib

So far, we've only needed to point to mkl_rt.2.dll, which is what we've been using as the backend behind the reference-LAPACK interface (which is what we use consistently to compile against, allowing users to choose resp. exchange the actual BLAS implementation in their environments).

Is that not sufficient anymore, presumably?

@sknepper

One other thought I had - there are some known issues on AMD Windows, which will be fixed in an upcoming patch release (oneMKL 2025.0.1). Was this run on an AMD or Intel system?

@h-vetinari
Member Author

I think azure pipelines has various CI agents in their pool, but most are intel AFAIK (Skylake X or so). OTOH, the fact that it's reproducible exactly across 4+ runs also means that it's either independent of the CPU architecture, or that it's happening on all of the agents that we happened to draw.

@napetrov
Contributor

> One other thought I had - there are some known issues on AMD Windows, which will be fixed in an upcoming patch release (oneMKL 2025.0.1). Was this run on an AMD or Intel system?

In general, based on experience with those pipelines, you can expect around a 90/10 Intel/AMD ratio.

@h-vetinari
Member Author

Any updates here? @mkrainiuk @ZzEeKkAa @sknepper @napetrov @oleksandr-pavlyk

@vmalia

vmalia commented Dec 13, 2024

> Thanks. This suggests to link mkl_blas95_lp64.lib mkl_lapack95_lp64.lib mkl_intel_lp64_dll.lib mkl_tbb_thread_dll.lib mkl_core_dll.lib
>
> So far, we've only needed to point to mkl_rt.2.dll, which is what we've been using as the backend behind the reference-LAPACK interface (which is what we use consistently to compile against, allowing users to choose resp. exchange the actual BLAS implementation in their environments).
>
> Is that not sufficient anymore, presumably?

@h-vetinari
While I figure out how to access azure-devops and logs, you may try to check if the link-type was correctly selected in the link line advisor:
Image
This link type enables mkl_rt usage. In this case, you link with mkl_rt.lib on Windows which then resolves to mkl_rt.x.dll at runtime.

To select the required interface, there are some environment variables that can be defined. Check out this page: https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-windows/2025-0/using-the-single-dynamic-library.html

From above:

SDL enables you to select the interface and threading library for Intel® oneAPI Math Kernel Library (oneMKL) at run time. By default, linking with SDL provides:

  • Intel LP64 interface on systems based on the Intel® 64 architecture
  • Intel threading

By default, mkl_rt linking enables LP64 interfaces and Intel OpenMP threading. -DMKL_ILP64 should NOT be used in the compile-line if mkl_rt default behavior is expected.
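As a sketch of how the run-time selection could be tried from a test harness: to my understanding, the linked Intel page documents the environment variables `MKL_INTERFACE_LAYER` and `MKL_THREADING_LAYER` for this purpose, but the exact names and accepted values should be checked against that page. The helper `sdl_environment` is hypothetical, and the child process below is only a placeholder that echoes the variables; in practice it would be a test binary linked against mkl_rt.

```python
import os
import subprocess
import sys


def sdl_environment(interface: str = "LP64", threading: str = "SEQUENTIAL") -> dict:
    """Copy the current environment and pin the mkl_rt layers.

    Variable names and values are taken from my reading of Intel's SDL
    documentation (an assumption to verify, not something stated in
    this thread).
    """
    env = dict(os.environ)
    env["MKL_INTERFACE_LAYER"] = interface
    env["MKL_THREADING_LAYER"] = threading
    return env


env = sdl_environment()

# Placeholder child process: it only prints the variables it inherited.
# Replace the command with the real test executable to try this for real.
echo_script = (
    "import os; "
    "print(os.environ['MKL_INTERFACE_LAYER'], os.environ['MKL_THREADING_LAYER'])"
)
result = subprocess.run(
    [sys.executable, "-c", echo_script],
    env=env,
    capture_output=True,
    text=True,
    check=True,
)
print(result.stdout.strip())
```

Setting the variables per child process, rather than globally, keeps the experiment from leaking into other steps of a CI job.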

I am new to conda-forge testing - can you please provide more details about how Netlib lapack is used here? How does netlib-lapack work with MKL here?

@h-vetinari
Member Author

Hi! :)

> While I figure out how to access azure-devops and logs, [...]

Everything is public, you just need to click on the link at the end of a PR

Image

and then

Image

Do note that azure will delete the logs of PRs after a month, so if you see something like "cannot be found", we'll just have to rerun things.

> you may try to check if the link-type was correctly selected in the link line advisor:

As I pointed out, we don't use the link advisor. We want to link to the actual library (or libraries) that contain the symbols, behind whatever amount of indirection or symlinks happens. It's fine by us if the SOVERSION of that DLL changes occasionally, that's not the issue.

Of course I don't mind changing the link setup if necessary, but that's why I was asking what changed.

> I am new to conda-forge testing - can you please provide more details about how Netlib lapack is used here? How does netlib-lapack work with MKL here?

The blas setup in conda-forge is somewhat unusual. By default, every project (needing BLAS/LAPACK) will compile against the netlib API & ABI, but because we've set up the other BLAS flavours to conform to that same ABI, we can switch out the BLAS implementation based on user choice upon installation, without having to recompile the artefact that got built (or without having to build everything multiple times).

To validate that this process works, the blas-feedstock will build the test suite from netlib, link it against whatever BLAS flavour (in this case MKL), and then run the tests to see that everything is working (more details). This is the part that started failing since MKL 2025.0

@h-vetinari
Member Author

Gentle ping @vmalia @mkrainiuk @ZzEeKkAa, and happy new year! 🥳
