Failures with AVX512 in numpy test suite when using blis on windows (through conda-forge) #514
Actually, the intermittency comes from the SIMD capabilities of the Azure agents - in particular AVX512 - which differ basically randomly from run to run. The passing runs have:
[…]
The failing runs have:
[…]
with the difference being […]
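As a side note, one way to confirm which of these features a given agent actually exposes is a tiny probe built on the GCC/Clang `__builtin_cpu_supports` builtin (MSVC would need something like `__cpuidex` instead). A minimal sketch, not part of the original report:

```c
/* Minimal CPU-feature probe (GCC/Clang builtins only) to show which SIMD
 * features the current agent exposes. Illustrative sketch. */
#include <stdio.h>

int main(void)
{
    __builtin_cpu_init();
    printf("avx2:     %d\n", __builtin_cpu_supports("avx2"));
    printf("avx512f:  %d\n", __builtin_cpu_supports("avx512f"));
    printf("avx512dq: %d\n", __builtin_cpu_supports("avx512dq"));
    printf("avx512vl: %d\n", __builtin_cpu_supports("avx512vl"));
    return 0;
}
```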
@h-vetinari it would be really helpful if you could come up with a minimal reproducer in C/C++/Fortran, e.g. a single […]
Right, we don't want to place an undue burden on you, but is it maybe possible to easily instrument which BLAS calls (m, n, k, transa, transb, strides, etc.) the tests are making, and to determine which one fails?
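One possible way to get this kind of instrumentation, sketched here under the assumption of a Linux build with a CBLAS-style interface (LD_PRELOAD does not apply to the Windows CI discussed in this thread, the CBLAS enum parameters are passed as plain ints for brevity, and the file/library names are placeholders):

```c
/* Hypothetical LD_PRELOAD shim that logs every cblas_dgemm call before
 * forwarding it to the real BLAS.  Build with something like
 *   cc -shared -fPIC -o libdgemmlog.so dgemmlog.c -ldl
 * and run the tests with LD_PRELOAD=./libdgemmlog.so. */
#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

typedef void (*dgemm_fn)(int order, int transa, int transb,
                         int m, int n, int k, double alpha,
                         const double* a, int lda,
                         const double* b, int ldb,
                         double beta, double* c, int ldc);

void cblas_dgemm(int order, int transa, int transb,
                 int m, int n, int k, double alpha,
                 const double* a, int lda,
                 const double* b, int ldb,
                 double beta, double* c, int ldc)
{
    static dgemm_fn real = NULL;
    if (!real)
        real = (dgemm_fn)dlsym(RTLD_NEXT, "cblas_dgemm");

    /* Log the calling parameters the tests actually use. */
    fprintf(stderr,
            "cblas_dgemm order=%d ta=%d tb=%d m=%d n=%d k=%d "
            "alpha=%g lda=%d ldb=%d beta=%g ldc=%d\n",
            order, transa, transb, m, n, k, alpha, lda, ldb, beta, ldc);

    real(order, transa, transb, m, n, k, alpha, a, lda, b, ldb, beta, c, ldc);
}
```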
Hey Devin, thanks for the quick response. Coming up with a C reproducer will not be trivial - essentially each of the failures above would need a separate reproducer (though I hope, of course, that several of them share the same underlying issue). I wanted to wait until I got a response before opening more issues, but this problem only gets more severe once I get into the ~440 failures of the scipy test suite. Essentially, numpy & scipy provide a large external corpus of tests, and I think it could be interesting to consider these as a regression suite (it would even be possible to come up with a conda-based CI job with some effort). While I have to profess ignorance about Intel SDE, it might make sense to investigate how to install conda within it? (It's basically calling an installer once, and then the rest of the steps from the OP should be trivial, given an internet connection.) I can dig into how numpy interfaces with BLAS for each of the affected failures, but it's a bit tricky to unravel. Maybe I can get some help from the numpy developers on this, but it'll be a slow process.
For example, just chasing down the first failure (TestMatmul.test_dot_equivalent[args4]) leads all over the place in the code base (with lots of templating etc. in between), and it's hard to work out exactly which path gets taken in order to construct a C reproducer (ultimately for each of the failures...). Not sure if @seberg @charris @eric-wieser have some cycles to spare, but I doubt it.
Printing out […]
One thing that dot and matmul do differently is that […]. So it could be that […]. But this is just a random guess; beyond that, I think you have to track down the exact calling parameters.
Ah no, that's trivial. Will do that.
OK, I didn't know that, sorry. Also, thanks a lot for chiming in @seberg!
The […]
Some extra information about these matrices:
[…]
See if you can get this test program to run in the AVX512 Azure environment:

```c
#include <blis.h>
#include <assert.h>
#include <math.h>

int main(int argc, char** argv)
{
    double M[5][3] = {{ 0,  1,  2},
                      { 3,  4,  5},
                      { 6,  7,  8},
                      { 9, 10, 11},
                      {12, 13, 14}};
    double C[5][5];
    int m = 5, n = 5, k = 3;
    double one = 1.0, zero = 0.0;

    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            C[i][j] = NAN;

    /* C := 1.0 * M * M^T + 0.0 * C  (row-major storage: row strides 3 and 5,
     * column stride 1) */
    bli_dgemm(BLIS_NO_TRANSPOSE, BLIS_TRANSPOSE, m, n, k,
              &one,  &M[0][0], 3, 1,
                     &M[0][0], 3, 1,
              &zero, &C[0][0], 5, 1);

    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
        {
            double ref = 0;
            for (int p = 0; p < k; p++)
                ref += M[i][p]*M[j][p];
            assert(fabs(ref - C[i][j]) < 1e-14);
        }
    return 0;
}
```
Doing this in conda-forge/blis-feedstock#23 - it seems more/different headers are necessary?
[…]
You'll need to set the include path to the "combined" blis.h header, which is available after compiling at […]
We're using the combined headers AFAICT (here's the build script). The headers end up in […]
Ah, I somehow missed "undefined reference". You are linking the BLIS library, right? You might also need libm.
My bad, I had indeed forgotten the linker options (because I had to do this by hand, not with cmake etc.). Currently trying to retrigger the build until it runs on an agent with AVX512 (which I have no influence over). |
Ha! Got one, and indeed it failed:
[…]
OK! Can you do one more thing and modify the program to print out the entire C matrix? Meanwhile I'll look over the code.
So I've added the following to your reproducer:

```c
/* note: also requires #include <stdio.h> */
// print matrix C
for (int i = 0; i < m; i++)
{
    for (int j = 0; j < n; j++)
    {
        printf("%f ", C[i][j]);
    }
    printf("\n");
}
fflush(stdout);
```

which yields the following (when correct):

[…]

but with AVX512 (on Windows only; it works under Linux), it gives:

[…]
Ignore me if I am completely off target ... but, unrelated to BLIS, here is what happened to us: the customer was using (for development) a Win10 Pro machine with an i5-2500K ... and he was using a JSON lib of his choice, with an AVX512 core, and for that lib that CPU was "too old" (almost 10 years old, AFAIK); of course, that lib was "failing silently" on Windows only. The JSON lib maker did a quick (and simple) fix for "older" CPUs and everything worked - after a few days of general puzzlement and time-wasting.
Yeah, it's a massive pain to debug. I had seen inexplicably flaky failures while testing numpy/scipy with blis for over a year, but only recently realised that it has to do with presence/absence of AVX512 (in the randomly assigned CI agents)... Hoping we can make some inroads now that the source has finally been identified. |
I'm trying to reproduce locally under SDE. Do you know if this is actually a Windows-only bug or if it also happens under Linux (and maybe clang vs gcc as well)?
Cool, thanks.
It's Windows-only within the conda-forge Azure CI, with the following caveats:
[…]
I can't reproduce on Mac or Linux, and it's going to be a pain to try Windows (I would have to set up the development env. in a VM from scratch). Can you try this on your end: in the file frame/include/level0/bli_copys_mxn.h, add a print statement to the loop in bli_ddcopys_mxn:

```c
for ( dim_t jj = 0; jj < n; ++jj )
for ( dim_t ii = 0; ii < m; ++ii )
{
    bli_ddcopys( *(x + ii*rs_x + jj*cs_x),
                 *(y + ii*rs_y + jj*cs_y) );
    printf("%lld %lld %lld %lld %lld %lld %p %p %f %f\n", ii, jj, rs_x, cs_x, rs_y, cs_y, x, y, *(x + ii*rs_x + jj*cs_x), *(y + ii*rs_y + jj*cs_y));
}
```

The problem either has to be a compiler bug with this part of the code or a problem with compiling the inline assembly kernel.
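For orientation, bli_ddcopys_mxn is conceptually just a strided m-by-n copy, so each line printed by the instrumentation above corresponds to one element being written back to C. A standalone sketch (not the BLIS source) of the same loop:

```c
/* Standalone illustration of what the instrumented loop does: copy an m x n
 * block from x to y, where rs_x/cs_x and rs_y/cs_y are row/column strides. */
#include <stddef.h>

static void copy_mxn(ptrdiff_t m, ptrdiff_t n,
                     const double* x, ptrdiff_t rs_x, ptrdiff_t cs_x,
                     double*       y, ptrdiff_t rs_y, ptrdiff_t cs_y)
{
    for (ptrdiff_t jj = 0; jj < n; ++jj)
        for (ptrdiff_t ii = 0; ii < m; ++ii)
            y[ii*rs_y + jj*cs_y] = x[ii*rs_x + jj*cs_x];
}
```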
And to be clear, you are configuring blis with either the […]
As I said, Linux works, but Windows is the crucial case ATM.

The change comes out as follows (just adding the print statement) - let me know if I got something wrong.

```diff
diff --git a/frame/include/level0/bli_copys_mxn.h b/frame/include/level0/bli_copys_mxn.h
index a8ead1c3..4e4e22b6 100644
--- a/frame/include/level0/bli_copys_mxn.h
+++ b/frame/include/level0/bli_copys_mxn.h
@@ -204,8 +204,11 @@ BLIS_INLINE void bli_ddcopys_mxn( const dim_t m, const dim_t n, double* restri
 {
 	for ( dim_t jj = 0; jj < n; ++jj )
 	for ( dim_t ii = 0; ii < m; ++ii )
+	{
 	bli_ddcopys( *(x + ii*rs_x + jj*cs_x),
 	             *(y + ii*rs_y + jj*cs_y) );
+	printf("%lld %lld %lld %lld %lld %lld %p %p %f %f\n", ii, jj, rs_x, cs_x, rs_y, cs_y, x, y, *(x + ii*rs_x + jj*cs_x), *(y + ii*rs_y + jj*cs_y));
+	}
 }
 
 BLIS_INLINE void bli_cdcopys_mxn( const dim_t m, const dim_t n, scomplex* restrict x, const inc_t rs_x, const inc_t cs_x,
```

It should be the default configuration (whatever that may be); the build invocation is here.
FYI, if you're up for generating assembly, the steps would be:
[…]
This is better than working from the .o files because it preserves any comments generated by clang.
I'm not sure I followed the instructions correctly (also adapted to conda-forge infra), but here's my attempt: conda-forge/blis-feedstock@625bcce
@h-vetinari surely there are more compiler flags for the compile lines? Did you run with […]
No, but happy to take pointers on how these lines should look (and I've added […]
After adding […]
Thanks, that's also what I wanted to do, see outcome.
Here's the output of a failing run (raw log) - hope this helps. 🙃

Contents of frame/3/gemm/bli_gemm_ker_var2.c: […]
Contents of kernels/skx/3/bli_dgemm_skx_asm_16x14.c: […]
Contents of bli_gemm_ker_var2.s: […]
Contents of bli_dgemm_skx_asm_16x14.s: […]
The problem seems to be here:
[…]
The beta variable is loaded in […]
I guess the "hacky" solution is to zero out […]
If you could also generate the assembly for […]
Indeed the haswell kernel is saving […]
Try using `-march=haswell` for kernels. Fixes #514.
@h-vetinari please try the […]
Not sure if related, but this reminds me of a recent numpy/scipy bug caused by Microsoft shipping a broken ucrtbase runtime that ended up corrupting various registers (and it took them half a year to fix).
@h-vetinari If I'm reading the logs right, it seems like it is fixed?
Yes, switching to the haswell kernel made the reproducer run through on AVX512 as well 🥳 Thanks a lot for your help on this, much appreciated! I'll discuss with Isuru building a new version for conda-forge with this patch, and see if this helps with the other failures I've been seeing (e.g. #517, resp. other scipy errors I haven't raised a ticket for yet).
Also note, […]
Great, I'll also add a comment to the kernel makefile so this doesn't pop back up. FYI, it's not using the haswell kernel on AVX512 (that would lead to much lower performance), just using […]
Use `-march=haswell` for kernels. Fixes #514.
Great, thanks, and thanks for the explanation.
Not that it's a priority, but shouldn't it be possible to eventually fix this natively (not knowing ATM whether it's the compiler, the runtime, blis, or something else that's at fault)?
I consider this solved "once and for all", since it seems that the cause is a difference in calling convention between AVX512 and non-AVX512 code (or a compiler bug). I can't find any Microsoft documentation to support this, but then again I didn't think I would.
Just to report the results of this (took a while, sorry) - the errors in the numpy test suite are now gone: conda-forge/numpy-feedstock#237, which is awesome! Unfortunately, on the scipy side (conda-forge/scipy-feedstock#172), blis+avx512 now hangs indefinitely. I'm going to open another issue for that. Edit: #526. At least #517 did not reappear, which I'm going to close.
Conda-forge is the community-driven channel for the packaging & environment tool conda, which is used heavily in many scientific & corporate environments, especially (but not only) in terms of Python. conda-forge implements a blas meta-package that allows selecting between different blas/lapack implementations. Currently available are: netlib, mkl, openblas & blis (where blis uses the netlib binaries for lapack).

For the last couple of releases, I've been running the numpy (& scipy) test suites against all those blas variants to root out any potential errors. Blis on windows (at least as packaged through conda-forge) was always a problem child, but has gotten a bit better recently. Still, I keep getting some ~~flaky~~ test failures both for the numpy & scipy test suites ~~- flaky as in: they appear in one run, and then disappear again, with the only difference being a CI restart -~~ due to the absence/presence of AVX512. Those errors look pretty scary numerically - e.g. all NaNs (where non-NaN would be expected), or a matrix that's completely zero instead of the identity - more details below.
For some example CI runs see here and here (coming from conda-forge/numpy-feedstock#227, but also occurring in conda-forge/numpy-feedstock#237).
Reproducer (on a windows system with AVX512): […]

Short log of failures:
- Failure of TestMatmul.test_dot_equivalent[args4]
- Failure of TestSolve.test_sq_cases
- Failure of TestSolve.test_generalized_sq_cases
- Failure of TestInv.test_sq_cases
- Failure of TestInv.test_generalized_sq_cases
- Failure of TestPinv.test_generalized_sq_cases
- Failure of TestPinv.test_generalized_nonsq_cases
- Failure of TestDet.test_sq_cases
- Failure of TestDet.test_generalized_sq_cases
- Failure of TestMatrixPower.test_power_is_minus_one[dt13]
- Failure of TestCholesky::test_basic_property