Regressions in MKL 2025.0 #83

h-vetinari · 2024-11-09T05:56:15Z

In addition to the question of whether mkl now really requires __glibc >=2.28 on linux, I tested MKL 2025.0 against the test suite from netlib lapack, and there seem to be some substantial test failures.

the simplest upgrade runs into constraints with libhwloc, see here (--> not the fault of this feedstock per se, but cannot test)

testing MKL 2025.0 against LAPACK 3.9.0 together with the switch to flang yields 75/95 failures (logs):

21% tests passed, 75 tests failed out of 95

Total Test time (real) =  14.87 sec

The following tests FAILED:
    1 - LAPACK-xlintsts_stest_in (Failed)
    2 - LAPACK-xlintstrfs_stest_rfp_in (Failed)
    3 - LAPACK-xeigtsts_nep_in (Failed)
    4 - LAPACK-xeigtsts_sep_in (Failed)
    5 - LAPACK-xeigtsts_se2_in (Failed)
    6 - LAPACK-xeigtsts_sec_in (Failed)
    7 - LAPACK-xeigtsts_sed_in (Failed)
    8 - LAPACK-xeigtsts_sgg_in (Failed)
    9 - LAPACK-xeigtsts_sgd_in (Failed)
   10 - LAPACK-xeigtsts_ssb_in (Failed)
   11 - LAPACK-xeigtsts_ssg_in (Failed)
   15 - LAPACK-xeigtsts_sgbak_in (Failed)
   16 - LAPACK-xeigtsts_sbb_in (Failed)
   17 - LAPACK-xeigtsts_glm_in (Failed)
   18 - LAPACK-xeigtsts_gqr_in (Failed)
   19 - LAPACK-xeigtsts_gsv_in (Failed)
   20 - LAPACK-xeigtsts_csd_in (Failed)
   21 - LAPACK-xeigtsts_lse_in (Failed)
   22 - LAPACK-xlintstd_dtest_in (Failed)
   23 - LAPACK-xlintstrfd_dtest_rfp_in (Failed)
   24 - LAPACK-xeigtstd_nep_in (Failed)
   25 - LAPACK-xeigtstd_sep_in (Failed)
   26 - LAPACK-xeigtstd_se2_in (Failed)
   27 - LAPACK-xeigtstd_dec_in (Failed)
   28 - LAPACK-xeigtstd_ded_in (Failed)
   29 - LAPACK-xeigtstd_dgg_in (Failed)
   30 - LAPACK-xeigtstd_dgd_in (Failed)
   31 - LAPACK-xeigtstd_dsb_in (Failed)
   32 - LAPACK-xeigtstd_dsg_in (Failed)
   36 - LAPACK-xeigtstd_dgbak_in (Failed)
   37 - LAPACK-xeigtstd_dbb_in (Failed)
   38 - LAPACK-xeigtstd_glm_in (Failed)
   39 - LAPACK-xeigtstd_gqr_in (Failed)
   40 - LAPACK-xeigtstd_gsv_in (Failed)
   41 - LAPACK-xeigtstd_csd_in (Failed)
   42 - LAPACK-xeigtstd_lse_in (Failed)
   43 - LAPACK-xeigtstc_nep_in (Failed)
   44 - LAPACK-xeigtstc_cec_in (Failed)
   45 - LAPACK-xeigtstc_cgg_in (Failed)
   46 - LAPACK-xeigtstc_cgd_in (Failed)
   50 - LAPACK-xeigtstc_cgbak_in (Failed)
   51 - LAPACK-xeigtstc_cbb_in (Failed)
   52 - LAPACK-xeigtstc_glm_in (Failed)
   53 - LAPACK-xeigtstc_gqr_in (Failed)
   54 - LAPACK-xeigtstc_gsv_in (Failed)
   55 - LAPACK-xeigtstc_csd_in (Failed)
   56 - LAPACK-xeigtstc_lse_in (Failed)
   57 - LAPACK-xlintstz_ztest_in (Failed)
   58 - LAPACK-xlintstrfz_ztest_rfp_in (Failed)
   59 - LAPACK-xeigtstz_nep_in (Failed)
   60 - LAPACK-xeigtstz_sep_in (Failed)
   61 - LAPACK-xeigtstz_se2_in (Failed)
   62 - LAPACK-xeigtstz_zec_in (Failed)
   63 - LAPACK-xeigtstz_zed_in (Failed)
   64 - LAPACK-xeigtstz_zgg_in (Failed)
   65 - LAPACK-xeigtstz_zgd_in (Failed)
   66 - LAPACK-xeigtstz_zsb_in (Failed)
   67 - LAPACK-xeigtstz_zsg_in (Failed)
   71 - LAPACK-xeigtstz_zgbak_in (Failed)
   72 - LAPACK-xeigtstz_zbb_in (Failed)
   73 - LAPACK-xeigtstz_glm_in (Failed)
   74 - LAPACK-xeigtstz_gqr_in (Failed)
   75 - LAPACK-xeigtstz_gsv_in (Failed)
   76 - LAPACK-xeigtstz_csd_in (Failed)
   77 - LAPACK-xeigtstz_lse_in (Failed)
   78 - LAPACK-xlintstds_dstest_in (Failed)
   79 - LAPACK-xlintstzc_zctest_in (Failed)
   83 - BLAS-xblat3s (Failed)
   86 - BLAS-xblat3d (Failed)
   88 - BLAS-xblat3c (Failed)
   91 - BLAS-xblat3z (Failed)
   92 - example_DGESV_rowmajor (Exit code 0xc06d007e)
   93 - example_DGESV_colmajor (Exit code 0xc06d007e)
   94 - example_DGELS_rowmajor (Exit code 0xc06d007e)
   95 - example_DGELS_colmajor (Exit code 0xc06d007e)

testing MKL 2025.0 against LAPACK 3.11.0 (together with the switch to flang) also yields 75/95 failures (logs):

21% tests passed, 75 tests failed out of 95

Total Test time (real) =  15.80 sec

The following tests FAILED:
    2 - LAPACK-xlintsts_stest_in (Failed)
    3 - LAPACK-xlintstrfs_stest_rfp_in (Failed)
    4 - LAPACK-xeigtsts_nep_in (Failed)
    5 - LAPACK-xeigtsts_sep_in (Failed)
    6 - LAPACK-xeigtsts_se2_in (Failed)
    7 - LAPACK-xeigtsts_sec_in (Failed)
    8 - LAPACK-xeigtsts_sed_in (Failed)
    9 - LAPACK-xeigtsts_sgg_in (Failed)
   10 - LAPACK-xeigtsts_sgd_in (Failed)
   11 - LAPACK-xeigtsts_ssb_in (Failed)
   12 - LAPACK-xeigtsts_ssg_in (Failed)
   16 - LAPACK-xeigtsts_sgbak_in (Failed)
   17 - LAPACK-xeigtsts_sbb_in (Failed)
   18 - LAPACK-xeigtsts_glm_in (Failed)
   19 - LAPACK-xeigtsts_gqr_in (Failed)
   20 - LAPACK-xeigtsts_gsv_in (Failed)
   21 - LAPACK-xeigtsts_csd_in (Failed)
   22 - LAPACK-xeigtsts_lse_in (Failed)
   23 - LAPACK-xlintstd_dtest_in (Failed)
   24 - LAPACK-xlintstrfd_dtest_rfp_in (Failed)
   25 - LAPACK-xeigtstd_nep_in (Failed)
   26 - LAPACK-xeigtstd_sep_in (Failed)
   27 - LAPACK-xeigtstd_se2_in (Failed)
   28 - LAPACK-xeigtstd_dec_in (Failed)
   29 - LAPACK-xeigtstd_ded_in (Failed)
   30 - LAPACK-xeigtstd_dgg_in (Failed)
   31 - LAPACK-xeigtstd_dgd_in (Failed)
   32 - LAPACK-xeigtstd_dsb_in (Failed)
   33 - LAPACK-xeigtstd_dsg_in (Failed)
   37 - LAPACK-xeigtstd_dgbak_in (Failed)
   38 - LAPACK-xeigtstd_dbb_in (Failed)
   39 - LAPACK-xeigtstd_glm_in (Failed)
   40 - LAPACK-xeigtstd_gqr_in (Failed)
   41 - LAPACK-xeigtstd_gsv_in (Failed)
   42 - LAPACK-xeigtstd_csd_in (Failed)
   43 - LAPACK-xeigtstd_lse_in (Failed)
   44 - LAPACK-xeigtstc_nep_in (Failed)
   45 - LAPACK-xeigtstc_cec_in (Failed)
   46 - LAPACK-xeigtstc_cgg_in (Failed)
   47 - LAPACK-xeigtstc_cgd_in (Failed)
   51 - LAPACK-xeigtstc_cgbak_in (Failed)
   52 - LAPACK-xeigtstc_cbb_in (Failed)
   53 - LAPACK-xeigtstc_glm_in (Failed)
   54 - LAPACK-xeigtstc_gqr_in (Failed)
   55 - LAPACK-xeigtstc_gsv_in (Failed)
   56 - LAPACK-xeigtstc_csd_in (Failed)
   57 - LAPACK-xeigtstc_lse_in (Failed)
   58 - LAPACK-xlintstz_ztest_in (Failed)
   59 - LAPACK-xlintstrfz_ztest_rfp_in (Failed)
   60 - LAPACK-xeigtstz_nep_in (Failed)
   61 - LAPACK-xeigtstz_sep_in (Failed)
   62 - LAPACK-xeigtstz_se2_in (Failed)
   63 - LAPACK-xeigtstz_zec_in (Failed)
   64 - LAPACK-xeigtstz_zed_in (Failed)
   65 - LAPACK-xeigtstz_zgg_in (Failed)
   66 - LAPACK-xeigtstz_zgd_in (Failed)
   67 - LAPACK-xeigtstz_zsb_in (Failed)
   68 - LAPACK-xeigtstz_zsg_in (Failed)
   72 - LAPACK-xeigtstz_zgbak_in (Failed)
   73 - LAPACK-xeigtstz_zbb_in (Failed)
   74 - LAPACK-xeigtstz_glm_in (Failed)
   75 - LAPACK-xeigtstz_gqr_in (Failed)
   76 - LAPACK-xeigtstz_gsv_in (Failed)
   77 - LAPACK-xeigtstz_csd_in (Failed)
   78 - LAPACK-xeigtstz_lse_in (Failed)
   79 - LAPACK-xlintstds_dstest_in (Failed)
   80 - LAPACK-xlintstzc_zctest_in (Failed)
   83 - BLAS-xblat3s (Failed)
   86 - BLAS-xblat3d (Failed)
   88 - BLAS-xblat3c (Failed)
   91 - BLAS-xblat3z (Failed)
   92 - example_DGESV_rowmajor (Exit code 0xc06d007e)
   93 - example_DGESV_colmajor (Exit code 0xc06d007e)
   94 - example_DGELS_rowmajor (Exit code 0xc06d007e)
   95 - example_DGELS_colmajor (Exit code 0xc06d007e)

			-->   LAPACK TESTING SUMMARY  <--
		Processing LAPACK Testing output found in the TESTING directory
SUMMARY             nb test run    numerical error    other error
================    ===========    ===============    ============
REAL                0              0  (0.000%)        0  (0.000%)
DOUBLE PRECISION    0              0  (0.000%)        0  (0.000%)
COMPLEX             0              0  (0.000%)        41 (0.000%)
COMPLEX16           0              0  (0.000%)        41 (0.000%)

--> ALL PRECISIONS  0              0  (0.000%)        82 (0.000%)

The reason I'm almost certain that it's unrelated to the switch to flang is that MKL 2024.2 + flang only has the following failures (logs):

97% tests passed, 3 tests failed out of 95

Total Test time (real) =  35.58 sec

The following tests FAILED:
	  1 - LAPACK-xlintsts_stest_in (Failed)
	 22 - LAPACK-xlintstd_dtest_in (Failed)
	 57 - LAPACK-xlintstz_ztest_in (Failed)
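
Since the test indices shift between LAPACK versions (LAPACK-xlintsts_stest_in is test 1 with 3.9.0 but test 2 with 3.11.0), comparing runs by test name is more reliable than comparing by number. A minimal Python sketch, assuming the ctest summary format shown in the logs above; `failed_tests` is a hypothetical helper for illustration, not part of any feedstock tooling:

```python
import re

# Matches ctest summary lines such as
#   "   22 - LAPACK-xlintstd_dtest_in (Failed)"
_FAILED_LINE = re.compile(r"^\s*(\d+)\s+-\s+(\S+)\s+\((.+)\)\s*$")


def failed_tests(ctest_output: str) -> dict:
    """Map test name -> failure reason for each entry of a ctest FAILED list."""
    failures = {}
    for line in ctest_output.splitlines():
        match = _FAILED_LINE.match(line)
        if match:
            _index, name, reason = match.groups()
            failures[name] = reason
    return failures


# Short excerpts copied from the failure lists quoted in this issue.
mkl_2024_2 = """
    1 - LAPACK-xlintsts_stest_in (Failed)
   22 - LAPACK-xlintstd_dtest_in (Failed)
   57 - LAPACK-xlintstz_ztest_in (Failed)
"""
mkl_2025_0 = """
    1 - LAPACK-xlintsts_stest_in (Failed)
   22 - LAPACK-xlintstd_dtest_in (Failed)
   83 - BLAS-xblat3s (Failed)
   92 - example_DGESV_rowmajor (Exit code 0xc06d007e)
"""

old = failed_tests(mkl_2024_2)
new = failed_tests(mkl_2025_0)

# Tests that fail with the newer MKL but did not fail with the older one.
new_only = sorted(set(new) - set(old))
print(new_only)
```

Feeding the full ctest output of both runs into `failed_tests` would give the complete set of newly failing tests.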

The errors roughly look as follows

Intel oneMKL ERROR: Parameter 1 was incorrect on entry to ZGEMM .
Intel oneMKL ERROR: Parameter 2 was incorrect on entry to ZGEMM .
Intel oneMKL ERROR: Parameter 3 was incorrect on entry to ZGEMM .

Perhaps this is related to some linkage issue? Was something changed w.r.t. the compiler setup for MKL 2025.0 that could have affected the symbol names?
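
One way to check the symbol-name question directly would be to probe a library for the usual spellings of a Fortran routine. A rough Python sketch using `ctypes`; the helper name `exported_variants` and the list of candidate spellings are assumptions for illustration (common Fortran name-mangling conventions, not an exhaustive list):

```python
import ctypes
import sys


def exported_variants(lib, routine: str) -> list:
    """Return the spellings of a Fortran routine name that `lib` exports.

    `lib` is anything that resolves symbols via attribute access,
    for example a ctypes.CDLL instance.
    """
    base = routine.lower()
    candidates = [base, base + "_", base.upper(), base.upper() + "_"]
    # ctypes raises AttributeError for a missing symbol, so hasattr()
    # works as an "is this name exported?" check.
    return [name for name in candidates if hasattr(lib, name)]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage: python check_symbols.py <path to the BLAS DLL>
    library = ctypes.CDLL(sys.argv[1])
    for routine in ("zgemm", "dgesv"):
        print(routine, exported_variants(library, routine))
```

Running this against the libraries shipped with MKL 2024.2 and MKL 2025.0 and comparing the output would show whether the exported spellings differ between the two.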

CC @ZzEeKkAa @Alexsandruss @oleksandr-pavlyk @isuruf

oleksandr-pavlyk · 2024-11-09T13:44:39Z

Only SYCL components of MKL need 2.28 as it is needed by DPC++ runtime.

Defer to @mkrainiuk for the remaining issues.

h-vetinari · 2024-11-12T02:53:53Z

the simplest upgrade runs into constraints with libhwloc, see here (--> not the fault of this feedstock per se, but cannot test)

That constraint was fixed, and the same 75/95 failures now also appear completely without any change to the compilers (logs).

@mkrainiuk, please advise what's going on here or how we can fix it.

mkrainiuk · 2024-11-12T23:22:10Z

Looks like oneMKL might have some API changes, adding @sknepper for confirmation.
Another potential problem might be that the compilation and linking against oneMKL are not correct (e.g. the test was built with -DMKL_ILP64 flag but it used LP64 oneMKL interface library). Could someone help me to get the exact build logs with compilation and link lines? Unfortunately I can't find this information in the log of failed step from conda-forge/blas-feedstock#128 ...

h-vetinari · 2024-11-12T23:40:05Z

Thanks for the response!

could someone help me to get the exact build logs with compilation and link lines? Unfortunately I can't find this information in the log of failed step from conda-forge/blas-feedstock#128 ...

In the blas metapackage we only build the tests from https://github.com/Reference-LAPACK/lapack/ and run them against the various blas implementations. The MKL packages themselves aren't built in conda-forge, they're only repackaged, so I cannot offer logs on that. Presumably they should be available somewhere Intel-internally?

e.g. the test was built with -DMKL_ILP64 flag but it used LP64 oneMKL interface library

Not sure if my info there is incorrect or out of date, but didn't MKL use to build both ILP64 & LP64 symbols into the same library?

sknepper · 2024-11-13T01:57:31Z

That constraint was fixed, and the same 75/95 failures now also appear completely without any change to the compilers (logs).

In these logs, it looks like Linux was successful while Windows had failures. Am I understanding the logs correctly, @h-vetinari ?

As Maria said, these "Parameter x was incorrect on entry to" errors often relate to incorrect configuration of the LP64/ILP64 interfaces.

Selected domains provide API extensions with the _64 suffix (for example, SGEMM_64) for supporting large data arrays in the LP64 library, which enables the mixing of data types in one application.
Are you using the LP64 or ILP64 interface library?
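
To illustrate the mechanism behind such a mismatch: the Fortran BLAS/LAPACK interfaces pass integers by reference, so a library built for 64-bit integers (ILP64) reads eight bytes where an LP64 caller stored only four. A minimal Python sketch using `struct`; the assumption that `m` and `n` sit next to each other in memory is made purely for illustration:

```python
import struct

# An LP64 caller stores dimensions as 32-bit integers. Assume, for
# illustration only, that m and n are adjacent in the caller's memory.
m, n = 3, 4
caller_memory = struct.pack("<ii", m, n)  # two little-endian int32 values

# An ILP64 callee dereferences the pointer to m as a 64-bit integer and
# therefore also consumes the four bytes that belong to n.
(m_seen_by_ilp64,) = struct.unpack_from("<q", caller_memory, 0)

# An LP64 callee reads only the four bytes it was meant to read.
(m_seen_by_lp64,) = struct.unpack_from("<i", caller_memory, 0)

print(m_seen_by_lp64)   # 3
print(m_seen_by_ilp64)  # 17179869187, i.e. 3 + (4 << 32)
```

A value like that would fail the argument checks on entry to a routine. Whether this is what happens in the failing runs is not established here; the sketch only shows how a width mismatch can corrupt an argument.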

h-vetinari · 2024-11-13T03:00:17Z

In these logs, it looks like Linux was successful while Windows had failures. Am I understanding the logs correctly, @h-vetinari ?

Yes, the linux issue has been resolved in #84, all the remaining problems are on windows.

Are you using the LP64 or ILP64 interface library?

So far we haven't been actively distinguishing (that I know of) which integer model we use for MKL (though we do for OpenBLAS for example). So the answer is probably whatever Reference-LAPACK (3.9 resp. 3.11) does by default on windows.

How would I be able to set this correctly? Just define -DMKL_LP64=1 resp. -DMKL_ILP64=1? Has the default for this changed in MKL 2025.0 somehow?

ZzEeKkAa · 2024-11-13T03:41:59Z

Maybe not a direct answer, but there is a tool from Intel to figure out the proper linker arguments:
https://www.intel.com/content/www/us/en/developer/tools/oneapi/onemkl-link-line-advisor.html

h-vetinari · 2024-11-13T04:18:45Z

Thanks. This suggests to link mkl_blas95_lp64.lib mkl_lapack95_lp64.lib mkl_intel_lp64_dll.lib mkl_tbb_thread_dll.lib mkl_core_dll.lib

So far, we've only needed to point to mkl_rt.2.dll, which is what we've been using as the backend behind the reference-LAPACK interface (which is what we use consistently to compile against, allowing users to choose resp. exchange the actual BLAS implementation in their environments).

Is that not sufficient anymore, presumably?

sknepper · 2024-11-13T05:50:24Z

One other thought I had - there are some known issues on AMD Windows, which will be fixed in an upcoming patch release (oneMKL 2025.0.1). Was this run on an AMD or Intel system?

h-vetinari · 2024-11-13T06:13:44Z

I think azure pipelines has various CI agents in their pool, but most are intel AFAIK (Skylake X or so). OTOH, the fact that it's reproducible exactly across 4+ runs also means that it's either independent of the CPU architecture, or that it's happening on all of the agents that we happened to draw.

napetrov · 2024-11-13T16:23:04Z

One other thought I had - there are some known issues on AMD Windows, which will be fixed in an upcoming patch release (oneMKL 2025.0.1). Was this run on an AMD or Intel system?

In general, based on experience with those pipelines, you can expect a ratio of around 90/10 Intel/AMD.

h-vetinari · 2024-12-08T02:15:08Z

Any updates here? @mkrainiuk @ZzEeKkAa @sknepper @napetrov @oleksandr-pavlyk

vmalia · 2024-12-13T01:16:15Z

Thanks. This suggests to link mkl_blas95_lp64.lib mkl_lapack95_lp64.lib mkl_intel_lp64_dll.lib mkl_tbb_thread_dll.lib mkl_core_dll.lib

So far, we've only needed to point to mkl_rt.2.dll, which is what we've been using as the backend behind the reference-LAPACK interface (which is what we use consistently to compile against, allowing users to choose resp. exchange the actual BLAS implementation in their environments).

Is that not sufficient anymore, presumably?

@h-vetinari
While I figure out how to access azure-devops and logs, you may try to check if the link-type was correctly selected in the link line advisor:

This link type enables mkl_rt usage. In this case, you link with mkl_rt.lib on Windows which then resolves to mkl_rt.x.dll at runtime.

To select the required interface, there are some environment variables that can be defined. Check out this page: https://www.intel.com/content/www/us/en/docs/onemkl/developer-guide-windows/2025-0/using-the-single-dynamic-library.html

From above:

SDL enables you to select the interface and threading library for Intel® oneAPI Math Kernel Library (oneMKL) at run time. By default, linking with SDL provides:

Intel LP64 interface on systems based on the Intel® 64 architecture

Intel threading

By default, mkl_rt linking enables LP64 interfaces and Intel OpenMP threading. -DMKL_ILP64 should NOT be used in the compile-line if mkl_rt default behavior is expected.
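
As a rough sketch of how the layers could be selected at run time when launching a test executable from Python: `MKL_INTERFACE_LAYER` and `MKL_THREADING_LAYER` are the environment variables described on the SDL page linked above, while the helper name `mkl_rt_env`, the restricted set of accepted values, and the executable name in the usage comment are assumptions for illustration:

```python
import os

# Values accepted by this sketch; the SDL documentation lists the full set
# of supported values for each platform.
_INTERFACES = {"LP64", "ILP64"}
_THREADING = {"INTEL", "SEQUENTIAL", "TBB"}


def mkl_rt_env(interface="LP64", threading="INTEL", base=None):
    """Return a copy of the environment with the mkl_rt layer variables set."""
    if interface not in _INTERFACES:
        raise ValueError(f"unsupported interface layer: {interface!r}")
    if threading not in _THREADING:
        raise ValueError(f"unsupported threading layer: {threading!r}")
    env = dict(os.environ if base is None else base)
    env["MKL_INTERFACE_LAYER"] = interface
    env["MKL_THREADING_LAYER"] = threading
    return env


# Hypothetical usage with subprocess (the executable name is a placeholder):
#   subprocess.run(["some_lapack_test.exe"], env=mkl_rt_env("LP64", "SEQUENTIAL"))
```

Setting the variables explicitly for the test run makes it possible to check whether the failures depend on the selected interface layer.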

I am new to conda-forge testing - can you please provide more details about how Netlib lapack is used here? How does netlib-lapack work with MKL here?

h-vetinari · 2024-12-13T02:18:00Z

Hi! :)

While I figure out how to access azure-devops and logs, [...]

Everything is public, you just need to click on the link at the end of a PR

and then

Do note that azure will delete the logs of PRs after a month, so if you see something like "cannot be found", we'll just have to rerun things.

you may try to check if the link-type was correctly selected in the link line advisor:

As I pointed out, we don't use the link advisor. We want to link to the actual library (or libraries) that contain the symbols, behind whatever amount of indirection or symlinks happens. It's fine by us if the SOVERSION of that DLL changes occasionally, that's not the issue.

Of course I don't mind changing the link setup if necessary, but that's why I was asking what changed.

I am new to conda-forge testing - can you please provide more details about how Netlib lapack is used here? How does netlib-lapack work with MKL here?

The blas setup in conda-forge is somewhat unusual. By default, every project (needing BLAS/LAPACK) will compile against the netlib API & ABI, but because we've set up the other BLAS flavours to conform to that same ABI, we can switch out the BLAS implementation based on user choice upon installation, without having to recompile the artefact that got built (or without having to build everything multiple times).

To validate that this process works, the blas-feedstock will build the test suite from netlib, link it against whatever BLAS flavour (in this case MKL), and then run the tests to see that everything is working (more details). This is the part that has been failing since MKL 2025.0.

h-vetinari · 2025-01-08T12:43:50Z

Gentle ping @vmalia @mkrainiuk @ZzEeKkAa, and happy new year! 🥳

h-vetinari mentioned this issue Dec 10, 2024

TEST: 2.2.x + blas variants conda-forge/numpy-feedstock#341

Draft
