
Failures with AVX512 in numpy test suite when using blis on windows (through conda-forge) #514

Closed
h-vetinari opened this issue Jun 29, 2021 · 93 comments · Fixed by #522

@h-vetinari
Contributor

h-vetinari commented Jun 29, 2021

Conda-forge is the community-driven channel for the packaging & environment tool conda, which is used heavily in many scientific & corporate environments, especially (but not only) for Python.

conda-forge implements a blas meta-package that allows selecting between different blas/lapack implementations. Currently available are: netlib, mkl, openblas & blis (where blis uses the netlib binaries for lapack).

For the last couple of releases, I've been running the numpy (& scipy) test suites against all those blas variants to root out any potential errors. Blis on windows (at least as packaged through conda-forge) was always a problem child, but has gotten a bit better recently. Still, I keep getting some flaky test failures both for the numpy & scipy test suites - flaky as in: they appear in one run, and then disappear again, with the only difference being a CI restart. As it turned out (see below), this comes down to the absence/presence of AVX512 on the CI agent.

Those errors look pretty scary numerically - e.g. all NaNs (where non-NaN would be expected), or a matrix that's completely zero instead of the identity - more details below.

For some example CI runs see here and here (coming from conda-forge/numpy-feedstock#227, but also occurring in conda-forge/numpy-feedstock#237).

Reproducer (on a windows system with AVX512):

# install miniconda, e.g. from https://github.com/conda-forge/miniforge
conda config --add channels conda-forge
# set up environment
conda create -n test_env numpy "blas=*=blis" pytest hypothesis setuptools
# activate environment
conda activate test_env
# confirm that instruction sets '... AVX512F* AVX512CD* AVX512_SKX* ...' are detected
python -c "import numpy; numpy._pytesttester._show_numpy_info()"
# run (relevant subset of) the test suite
pytest --pyargs numpy.linalg.tests.test_linalg -v
Short log of failures
=========================== short test summary info ===========================
FAILED core/tests/test_multiarray.py::TestMatmul::test_dot_equivalent[args4]
FAILED linalg/tests/test_linalg.py::TestSolve::test_sq_cases - AssertionError...
FAILED linalg/tests/test_linalg.py::TestSolve::test_generalized_sq_cases - As...
FAILED linalg/tests/test_linalg.py::TestInv::test_sq_cases - AssertionError: ...
FAILED linalg/tests/test_linalg.py::TestInv::test_generalized_sq_cases - Asse...
FAILED linalg/tests/test_linalg.py::TestPinv::test_generalized_sq_cases - Ass...
FAILED linalg/tests/test_linalg.py::TestPinv::test_generalized_nonsq_cases - ...
FAILED linalg/tests/test_linalg.py::TestDet::test_sq_cases - AssertionError: ...
FAILED linalg/tests/test_linalg.py::TestDet::test_generalized_sq_cases - Asse...
FAILED linalg/tests/test_linalg.py::TestMatrixPower::test_power_is_minus_one[dt13]
FAILED linalg/tests/test_linalg.py::TestCholesky::test_basic_property - Asser...
= 11 failed, 13581 passed, 714 skipped, 1 deselected, 20 xfailed, 1 xpassed, 229 warnings in 447.85s (0:07:27) =
Failure of TestMatmul.test_dot_equivalent[args4]
____________________ TestMatmul.test_dot_equivalent[args4] ____________________

self = <numpy.core.tests.test_multiarray.TestMatmul object at 0x000002A1BE3DFFD0>
args = (array([[ 0.,  1.,  2.],
       [ 3.,  4.,  5.],
       [ 6.,  7.,  8.],
       [ 9., 10., 11.],
       [12., 13., 14.]]), array([[ 0.,  3.,  6.,  9., 12.],
       [ 1.,  4.,  7., 10., 13.],
       [ 2.,  5.,  8., 11., 14.]]))

    @pytest.mark.parametrize('args', (
            [...]
        ))
    def test_dot_equivalent(self, args):
        r1 = np.matmul(*args)
        r2 = np.dot(*args)
>       assert_equal(r1, r2)
E       AssertionError:
E       Arrays are not equal
E
E       x and y nan location mismatch:
E        x: array([[nan, nan, nan, nan, nan],
E              [nan, nan, nan, nan, nan],
E              [nan, nan, nan, nan, nan],...
E        y: array([[  5.,  14.,  23.,  32.,  41.],
E              [ 14.,  50.,  86., 122., 158.],
E              [ 23.,  86., 149., 212., 275.],...
Failure of TestSolve.test_sq_cases
___________________________ TestSolve.test_sq_cases ___________________________

actual = array([2.+1.j, 1.+2.j], dtype=complex64)
desired = array([0.+0.j, 0.+0.j], dtype=complex64), decimal = 6, err_msg = ''
verbose = True
Failure of TestSolve.test_generalized_sq_cases
_____________________ TestSolve.test_generalized_sq_cases _____________________

actual = array([[ 2. +1.j,  1. +2.j],
       [14. +7.j,  7.+14.j],
       [12. +6.j,  6.+12.j]], dtype=complex64)
desired = array([[0.+0.j, 0.+0.j],
       [0.+0.j, 0.+0.j],
       [0.+0.j, 0.+0.j]], dtype=complex64)
decimal = 6, err_msg = '', verbose = True
Failure of TestInv.test_sq_cases
____________________________ TestInv.test_sq_cases ____________________________

actual = array([[0.+0.j, 0.+0.j],
       [0.+0.j, 0.+0.j]], dtype=complex64)
desired = array([[1., 0.],
       [0., 1.]]), decimal = 6, err_msg = ''
verbose = True
Failure of TestInv.test_generalized_sq_cases
______________________ TestInv.test_generalized_sq_cases ______________________

actual = array([[[0.+0.j, 0.+0.j],
        [0.+0.j, 0.+0.j]],

       [[0.+0.j, 0.+0.j],
        [0.+0.j, 0.+0.j]],

       [[0.+0.j, 0.+0.j],
        [0.+0.j, 0.+0.j]]], dtype=complex64)
desired = array([[[1.+0.j, 0.+0.j],
        [0.+0.j, 1.+0.j]],

       [[1.+0.j, 0.+0.j],
        [0.+0.j, 1.+0.j]],

       [[1.+0.j, 0.+0.j],
        [0.+0.j, 1.+0.j]]], dtype=complex64)
decimal = 6, err_msg = '', verbose = True
Failure of TestPinv.test_generalized_sq_cases
_____________________ TestPinv.test_generalized_sq_cases ______________________

[...]

self = <numpy.linalg.tests.test_linalg.TestPinv object at 0x000002A1C2382E50>
a = array([[[0.19151945, 0.62210877, 0.43772774, 0.78535858, 0.77997581,
         0.27259261, 0.27646426, 0.80187218],
   ...],
        [2.3715724 , 2.9762444 , 2.87640529, 2.37589241, 0.85575288,
         1.87475012, 1.43428139, 0.58702554]]])
b = array([[0.38231745, 0.05387369, 0.45164841, 0.98200474, 0.1239427 ,
        0.1193809 , 0.73852306, 0.58730363],
     ...543],
       [2.29390471, 0.32324211, 2.70989045, 5.89202845, 0.7436562 ,
        0.71628539, 4.43113834, 3.5238218 ]])
tags = frozenset({'generalized', 'square', 'strided'})

    def do(self, a, b, tags):
        a_ginv = linalg.pinv(a)
        # `a @ a_ginv == I` does not hold if a is singular
        dot = dot_generalized
>       assert_almost_equal(dot(dot(a, a_ginv), a), a, single_decimal=5, double_decimal=11)

[...]

actual = array([[[nan, nan, nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan, nan, nan],
        [nan, nan,..., nan, nan, nan],
        [nan, nan, nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan, nan, nan]]])
desired = array([[[0.19151945, 0.62210877, 0.43772774, 0.78535858, 0.77997581,
         0.27259261, 0.27646426, 0.80187218],
   ...],
        [2.3715724 , 2.9762444 , 2.87640529, 2.37589241, 0.85575288,
         1.87475012, 1.43428139, 0.58702554]]])
decimal = 11, err_msg = '', verbose = True
Failure of TestPinv.test_generalized_nonsq_cases
____________________ TestPinv.test_generalized_nonsq_cases ____________________

[...]

self = <numpy.linalg.tests.test_linalg.TestPinv object at 0x000002A1C2395E80>
a = array([[[0.22921857, 0.89996519, 0.41675354, 0.53585166, 0.00620852,
         0.30064171, 0.43689317, 0.612149  , 0.91...8, 2.32854217, 2.34743398,
         2.28481174, 2.74320934, 1.97586835, 1.70510274, 0.60526708,
         2.09488913]]])
b = array([[0.95219541, 0.88996329, 0.99356736, 0.81870351, 0.54512217,
        0.45125405, 0.89055719, 0.97326479],
     ...354],
       [5.71317246, 5.33977972, 5.96140418, 4.91222106, 3.270733  ,
        2.70752433, 5.34334313, 5.83958875]])
tags = frozenset({'generalized', 'nonsquare', 'strided'})

    def do(self, a, b, tags):
        a_ginv = linalg.pinv(a)
        # `a @ a_ginv == I` does not hold if a is singular
        dot = dot_generalized
>       assert_almost_equal(dot(dot(a, a_ginv), a), a, single_decimal=5, double_decimal=11)

[...]

actual = array([[[nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan, nan, nan, nan,..., nan, nan, nan, nan, nan, nan, nan, nan, nan, nan],
        [nan, nan, nan, nan, nan, nan, nan, nan, nan, nan, nan]]])
desired = array([[[0.22921857, 0.89996519, 0.41675354, 0.53585166, 0.00620852,
         0.30064171, 0.43689317, 0.612149  , 0.91...8, 2.32854217, 2.34743398,
         2.28481174, 2.74320934, 1.97586835, 1.70510274, 0.60526708,
         2.09488913]]])
decimal = 11, err_msg = '', verbose = True
Failure of TestDet.test_sq_cases
____________________________ TestDet.test_sq_cases ____________________________

actual = (6-17j), desired = (3.9968028886505635e-15-4.000000000000001j)
decimal = 6, err_msg = '', verbose = True
Failure of TestDet.test_generalized_sq_cases
______________________ TestDet.test_generalized_sq_cases ______________________

actual = array([ 6. -17.j, 24. -68.j, 54.-153.j], dtype=complex64)
desired = array([3.99680289e-15 -4.j, 1.59872116e-14-16.j, 2.13162821e-14-36.j])
decimal = 6, err_msg = '', verbose = True
Failure of TestMatrixPower.test_power_is_minus_one[dt13]
________________ TestMatrixPower.test_power_is_minus_one[dt13] ________________

actual = array([[0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j],
       [0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j],
       [0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j],
       [0.+0.j, 0.+0.j, 0.+0.j, 0.+0.j]], dtype=complex64)
desired = array([[1., 0., 0., 0.],
       [0., 1., 0., 0.],
       [0., 0., 1., 0.],
       [0., 0., 0., 1.]])
decimal = 6, err_msg = '', verbose = True
Failure of TestCholesky::test_basic_property
______________________ TestCholesky.test_basic_property _______________________

self = <numpy.linalg.tests.test_linalg.TestCholesky object at 0x000002A1C0E0A940>

    def test_basic_property(self):
        # Check A = L L^H
        shapes = [(1, 1), (2, 2), (3, 3), (50, 50), (3, 10, 10)]
        dtypes = (np.float32, np.float64, np.complex64, np.complex128)

        for shape, dtype in itertools.product(shapes, dtypes):
            np.random.seed(1)
            a = np.random.randn(*shape)
            if np.issubdtype(dtype, np.complexfloating):
                a = a + 1j*np.random.randn(*shape)

            t = list(range(len(shape)))
            t[-2:] = -1, -2

            a = np.matmul(a.transpose(t).conj(), a)
            a = np.asarray(a, dtype=dtype)

            c = np.linalg.cholesky(a)

            b = np.matmul(c, c.transpose(t).conj())
>           assert_allclose(b, a,
                            err_msg=f'{shape} {dtype}\n{a}\n{c}',
                            atol=500 * a.shape[0] * np.finfo(dtype).eps)
E           AssertionError:
E           Not equal to tolerance rtol=1e-07, atol=0.000119209
E           (2, 2) <class 'numpy.complex64'>
E           [[ 6.7107615-6.181477e-17j -3.746924 -9.348988e-01j]
E            [-3.746924 +9.348988e-01j  7.402024 -6.601837e-17j]]
E           [[2.5905137+0.j 0.       +0.j]
E            [0.       -0.j 2.7206662+0.j]]
E           Mismatched elements: 2 / 4 (50%)
E           Max absolute difference: 3.8617969
E           Max relative difference: 1.
E            x: array([[6.710761+0.j, 0.      +0.j],
E                  [0.      +0.j, 7.402024+0.j]], dtype=complex64)
E            y: array([[ 6.710762-6.181477e-17j, -3.746924-9.348988e-01j],
E                  [-3.746924+9.348988e-01j,  7.402024-6.601837e-17j]],
E                 dtype=complex64)
@h-vetinari
Contributor Author

Actually, the intermittency comes from the SIMD capabilities that the Azure agents have - and in particular AVX512 - these differ basically randomly from run to run.

The passing runs have:

>python -c "import numpy; numpy._pytesttester._show_numpy_info()" 
NumPy version 1.21.0
NumPy relaxed strides checking option: True
NumPy CPU features:  SSE SSE2 SSE3 SSSE3* SSE41* POPCNT* SSE42* AVX* F16C* FMA3* AVX2* AVX512F? AVX512CD? AVX512_SKX? AVX512_CLX? AVX512_CNL?

The failing runs have:

>python -c "import numpy; numpy._pytesttester._show_numpy_info()" 
NumPy version 1.21.0
NumPy relaxed strides checking option: True
NumPy CPU features:  SSE SSE2 SSE3 SSSE3* SSE41* POPCNT* SSE42* AVX* F16C* FMA3* AVX2* AVX512F* AVX512CD* AVX512_SKX* AVX512_CLX? AVX512_CNL?

with the difference being AVX512F? AVX512CD? AVX512_SKX? (i.e. absence of these instruction sets) for the passing ones, and AVX512F* AVX512CD* AVX512_SKX* (i.e. presence of these instruction sets) for the failing ones.

h-vetinari changed the title from "Intermittent failures in numpy test suite when using blis on windows (through conda-forge)" to "Failures with AVX512 in numpy test suite when using blis on windows (through conda-forge)" on Jun 30, 2021
@devinamatthews
Member

@h-vetinari it would be really helpful if you could come up with a minimal reproducer in C/C++/Fortran, e.g. a single gemm call that displays anomalous behavior. For AVX-512 I'll have to run it under Intel SDE so I'm not confident about running tests through Python.

@devinamatthews
Member

Right, we don't want to place an undue burden on you, but is it maybe possible to easily instrument which BLAS calls (m,n,k,transa,transb,strides,etc.) the tests are making and to determine which one fails?

@h-vetinari
Contributor Author

@h-vetinari it would be really helpful if you could come up with a minimal reproducer in C/C++/Fortran, e.g. a single gemm call that displays anomalous behavior. For AVX-512 I'll have to run it under Intel SDE so I'm not confident about running tests through Python.

Right, we don't want to place an undue burden on you, but is it maybe possible to easily instrument which BLAS calls (m,n,k,transa,transb,strides,etc.) the tests are making and to determine which one fails?

Hey Devin, thanks for the quick response. Coming up with a C reproducer will not be trivial - essentially each of the failures above would need a separate reproducer (though I hope of course that several of them share the same underlying issue). I wanted to wait until I get a response before opening more issues, but this problem only gets more severe once I get into the ~440 failures of the scipy test suite.

Essentially, numpy & scipy provide a large external corpus of tests, and I think it could be interesting to consider these as regression suite (it would even be possible to come up with a conda-based CI job with some effort). While I have to profess ignorance about Intel SDE, it might make sense to investigate how to install conda within that? (it's basically calling an installer once, and then the rest of the steps from the OP should be trivial, given an internet connection).

I can dig into how numpy interfaces with BLAS for each of the affected failures, but it's a bit tricky to unravel. Maybe I can get some help from the numpy developers on this, but it'll be a slow process.

@h-vetinari
Contributor Author

h-vetinari commented Jun 30, 2021

For example, just chasing down the first failure:

        r1 = np.matmul(*args)
        r2 = np.dot(*args)
>       assert_equal(r1, r2)
E       AssertionError:
E       Arrays are not equal

leads all over the place in the code base (with lots of templating etc. in between)
https://github.com/numpy/numpy/blob/v1.21.0/numpy/core/multiarray.py#L736
https://github.com/numpy/numpy/blob/v1.21.0/numpy/core/multiarray.py#L75
https://github.com/numpy/numpy/blob/v1.21.0/numpy/core/overrides.py#L219
https://github.com/numpy/numpy/blob/v1.21.0/numpy/core/overrides.py#L141
...
https://github.com/numpy/numpy/blob/v1.21.0/numpy/core/src/multiarray/arraytypes.c.src#L3536
...
https://github.com/numpy/numpy/blob/v1.21.0/numpy/core/src/umath/matmul.c.src#L398
etc.

and it's hard to work out which path gets taken exactly in order to construct a C-reproducer (ultimately for each of the failures...). Not sure if @seberg @charris @eric-wieser have some cycles to spare, but I doubt it.

@devinamatthews
Member

Printing out *args would probably be enough to construct something, but of course you would need to run it in Azure to reproduce probably. SDE is a single-binary emulator, so I highly doubt running a full python instance inside it would work or be performant enough.

@seberg

seberg commented Jun 30, 2021

One thing that dot and matmul do differently is that dot NULLs out the output array before the call, I think (someone investigating something unrelated noted that recently).

So it could be that blis is not ignoring the input array here when the scaling factor beta(?) is 0? (If my very brief look is right, and this is about NaN being returned often, or just random values.) In that case the cblas contract seems to say that zeroing the work-array is not necessary.

But this is just a random guess; beyond that, I think you have to track down the exact calling parameters.
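
(For reference, a minimal, purely illustrative CBLAS-level sketch of that contract - not taken from the numpy or blis code paths - where the NaN-filled output buffer must be ignored because beta is 0:)

#include <cblas.h>
#include <math.h>
#include <stdio.h>

int main(void)
{
    /* With beta == 0 the BLAS contract says C need not be set on input,
       so the NaNs below must not leak into the result. */
    double A[4] = { 1, 2, 3, 4 };
    double B[4] = { 1, 0, 0, 1 };   /* identity, so the result should equal A */
    double C[4] = { NAN, NAN, NAN, NAN };

    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                2, 2, 2, 1.0, A, 2, B, 2, 0.0, C, 2);

    for (int i = 0; i < 4; i++)
        printf("%f%c", C[i], (i % 2) ? '\n' : ' ');
    return 0;
}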

@h-vetinari
Contributor Author

h-vetinari commented Jun 30, 2021

Printing out *args would probably be enough to construct something, but of course you would need to run it in Azure to reproduce probably.

Ah no, that's trivial. Will do that.

SDE is a single-binary emulator, so I highly doubt running a full python instance inside it would work or be performant enough.

OK, I didn't know that, sorry.

Also thanks a lot for chiming in @seberg!

@h-vetinari
Contributor Author

h-vetinari commented Jun 30, 2021

The *args from the first failure in TestMatmul.test_dot_equivalent[args4] is set as follows (5th element in the parametrization here - pytest assigns ids starting from 0 of course):

m1 = np.arange(15.).reshape(5, 3)
# array([[ 0.,  1.,  2.],
#        [ 3.,  4.,  5.],
#        [ 6.,  7.,  8.],
#        [ 9., 10., 11.],
#        [12., 13., 14.]])
args = (m1, m1.T)

Some extra information about these matrices:

m1.dtype
# dtype('float64')
m1.flags
#   C_CONTIGUOUS : True
#   F_CONTIGUOUS : False
#   OWNDATA : False
#   WRITEABLE : True
#   ALIGNED : True
#   WRITEBACKIFCOPY : False
#   UPDATEIFCOPY : False
m1.T.flags
#   C_CONTIGUOUS : False
#   F_CONTIGUOUS : True
#   OWNDATA : False
#   WRITEABLE : True
#   ALIGNED : True
#   WRITEBACKIFCOPY : False
#   UPDATEIFCOPY : False

@devinamatthews
Member

See if you can get this test program to run in the AVX512 Azure environment:

#include <blis.h>

#include <assert.h>
#include <math.h>

int main(int argc, char** argv)
{
    double M[5][3] = {{ 0, 1, 2},
                      { 3, 4, 5},
                      { 6, 7, 8},
                      { 9,10,11},
                      {12,13,14}};
    double C[5][5];
    int m = 5, n = 5, k = 3;
    double one = 1.0, zero = 0.0;

    for (int i = 0;i < m;i++)
    for (int j = 0;j < n;j++)
        C[i][j] = NAN;

    bli_dgemm(BLIS_NO_TRANSPOSE, BLIS_TRANSPOSE, m, n, k,
               &one, &M[0][0], 3, 1,
                     &M[0][0], 3, 1,
              &zero, &C[0][0], 5, 1);

    for (int i = 0;i < m;i++)
    for (int j = 0;j < n;j++)
    {
        double ref = 0;
        for (int p = 0;p < k;p++)
            ref += M[i][p]*M[j][p];
        assert(fabs(ref - C[i][j]) < 1e-14);
    }

    return 0;
}

h-vetinari pushed a commit to h-vetinari/blis-feedstock that referenced this issue Jul 1, 2021
@h-vetinari
Contributor Author

See if you can get this test program to run in the AVX512 Azure environment:

Doing this in conda-forge/blis-feedstock#23, seems more/different headers are necessary?

+ /home/[...]/bin/x86_64-conda-linux-gnu-cc -I/home/[...]/include debug_blis.c
/home/[...]/bin/ld: /tmp/ccyxRQDz.o: in function `bli_round':
debug_blis.c:(.text+0x2c9): undefined reference to `round'
/home/[...]/bin/ld: /tmp/ccyxRQDz.o: in function `bli_thread_broadcast':
debug_blis.c:(.text+0xe956): undefined reference to `bli_thrcomm_bcast'
/home/[...]/bin/ld: /tmp/ccyxRQDz.o: in function `bli_thread_barrier':
debug_blis.c:(.text+0xe97e): undefined reference to `bli_thrcomm_barrier'
/home/[...]/bin/ld: /tmp/ccyxRQDz.o: in function `bli_thread_range_jrir_sl':
debug_blis.c:(.text+0xea1d): undefined reference to `bli_thread_range_sub'
/home/[...]/bin/ld: /tmp/ccyxRQDz.o: in function `bli_membrk_init_mutex':
debug_blis.c:(.text+0x1071f): undefined reference to `bli_pthread_mutex_init'
/home/[...]/bin/ld: /tmp/ccyxRQDz.o: in function `bli_membrk_finalize_mutex':
debug_blis.c:(.text+0x10740): undefined reference to `bli_pthread_mutex_destroy'
/home/[...]/bin/ld: /tmp/ccyxRQDz.o: in function `bli_membrk_lock':
debug_blis.c:(.text+0x10821): undefined reference to `bli_pthread_mutex_lock'
/home/[...]/bin/ld: /tmp/ccyxRQDz.o: in function `bli_membrk_unlock':
debug_blis.c:(.text+0x10842): undefined reference to `bli_pthread_mutex_unlock'
/home/[...]/bin/ld: /tmp/ccyxRQDz.o: in function `bli_apool_lock':
debug_blis.c:(.text+0x10b97): undefined reference to `bli_pthread_mutex_lock'
/home/[...]/bin/ld: /tmp/ccyxRQDz.o: in function `bli_apool_unlock':
debug_blis.c:(.text+0x10bba): undefined reference to `bli_pthread_mutex_unlock'
/home/[...]/bin/ld: /tmp/ccyxRQDz.o: in function `bli_gemmsup_ref_var1n2m_opt_cases':
debug_blis.c:(.text+0x1144d): undefined reference to `bli_abort'
/home/[...]/bin/ld: debug_blis.c:(.text+0x119c8): undefined reference to `bli_obj_imag_is_zero'
/home/[...]/bin/ld: /tmp/ccyxRQDz.o: in function `main':
debug_blis.c:(.text+0x11c6a): undefined reference to `bli_dgemm'
collect2: error: ld returned 1 exit status

@devinamatthews
Member

You'll need to set the include path to the "combined" blis.h header, which is available after compiling at <build_dir>/include/<arch>/blis.h or <install_prefix>/include/blis.h after installation.

@h-vetinari
Contributor Author

We're using the combined headers AFAICT (here's the build script).

The headers end up in $PREFIX/include/blis/blis.h, but there's not much else there (except for cblas.h).

@devinamatthews
Member

Ah, I somehow missed "undefined reference". You are linking the BLIS library, right? You might also need libm.
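
(For illustration, a hypothetical compile-and-link line for the test program under the conda layout mentioned above - the exact include/library paths depend on the environment:)

cc -I$PREFIX/include/blis debug_blis.c -L$PREFIX/lib -lblis -lm -o debug_blis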

@h-vetinari
Contributor Author

My bad, I had indeed forgotten the linker options (because I had to do this by hand, not with cmake etc.). Currently trying to retrigger the build until it runs on an agent with AVX512 (which I have no influence over).

@h-vetinari
Contributor Author

Ha! Got one, and indeed it failed:

NumPy CPU features:  SSE SSE2 SSE3 SSSE3* SSE41* POPCNT* SSE42* AVX* F16C* FMA3* AVX2* AVX512F* AVX512CD* AVX512_SKX* AVX512_CLX? AVX512_CNL?
"clang-cl -I%PREFIX%\Library\include debug_blis.c blis.lib /link /LIBPATH:%PREFIX%\Library\lib"
debug_blis.exe
Assertion failed: fabs(ref - C[i][j]) < 1e-14, file debug_blis.c, line 32

@devinamatthews
Member

OK! Can you do one more thing and modify the program to print out the entire C matrix? Meanwhile I'll look over the code.

@h-vetinari
Contributor Author

h-vetinari commented Jul 3, 2021

So I've added

    // print matrix C
    for (int i = 0;i < m;i++)
    {
        for (int j = 0;j < n;j++)
        {
            printf("%f     ", C[i][j]);
        }
        printf("\n");
    }
    fflush(stdout);

to your reproducer, which yields the following (when correct)

5.000000     14.000000     23.000000     32.000000     41.000000     
14.000000     50.000000     86.000000     122.000000     158.000000     
23.000000     86.000000     149.000000     212.000000     275.000000     
32.000000     122.000000     212.000000     302.000000     392.000000     
41.000000     158.000000     275.000000     392.000000     509.000000  

but with AVX512 (on windows only; works under linux), it gives:

nan     nan     nan     nan     nan     
nan     nan     nan     nan     nan     
nan     nan     nan     nan     nan     
nan     nan     nan     nan     nan     
nan     nan     nan     nan     nan 

@DBJDBJ

DBJDBJ commented Jul 5, 2021

Ignore me if I am off the target completely ... but

Unrelated to BLIS

What happened to us: the customer was using (for development) a Win10 PRO machine with an i5-2500K ... And he was using a JSON lib of his choice, with an AVX512 code path, and for that lib that CPU was "too old" (almost 10 years old AFAIK); and of course, that lib was "failing silently" on Windows only.

The JSON lib maker did a quick (and simple) fix for "older" CPUs and all worketh - after a few days of general puzzlement and time-wasting.

@h-vetinari
Contributor Author

Yeah, it's a massive pain to debug. I had seen inexplicably flaky failures while testing numpy/scipy with blis for over a year, but only recently realised that it has to do with presence/absence of AVX512 (in the randomly assigned CI agents)...

Hoping we can make some inroads now that the source has finally been identified.

@devinamatthews
Member

I'm trying to reproduce locally under SDE. Do you know if this is actually a Windows-only bug or if it also happens under Linux (and maybe clang vs gcc as well)?

@h-vetinari
Contributor Author

I'm trying to reproduce locally under SDE.

Cool, thanks

Do you know if this is actually a Windows-only bug or if it also happens under Linux (and maybe clang vs gcc as well)?

It's Windows-only within the conda-forge Azure CI, with the following caveats:

  • Linux uses GCC, passes on agents both with & without AVX512
  • Windows uses clang + MSVC, passes on agents without AVX512, fails with AVX512
  • OSX uses clang and passes (though AFAICT OSX does not get any AVX512 agents)

@devinamatthews
Member

I can't reproduce on Mac or Linux and it's going to be a pain to try Windows (I would have to set up the development env. in a VM from scratch). Can you try this on your end:

In the file frame/include/level0/bli_copys_mxn.h, replace lines 205-208 with the following:

		for ( dim_t jj = 0; jj < n; ++jj )
		for ( dim_t ii = 0; ii < m; ++ii )
		{
			bli_ddcopys( *(x + ii*rs_x + jj*cs_x),
			             *(y + ii*rs_y + jj*cs_y) );
			printf("%lld %lld %lld %lld %lld %lld %p %p %f %f\n", ii, jj, rs_x, cs_x, rs_y, cs_y, x, y, *(x + ii*rs_x + jj*cs_x), *(y + ii*rs_y + jj*cs_y));
		}

The problem either has to be a compiler bug with this part of the code or a problem with compiling the inline assembly kernel.

@devinamatthews
Member

And to be clear, you are configuring blis with either the intel64 or x86_64 configuration, and not skx, right?
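
(For reference, the "configuration" here is the target name passed to BLIS's ./configure script; an illustrative sketch, with all other options omitted:)

./configure auto     # picks the subconfiguration matching the build machine
./configure x86_64   # configuration family: builds all x86_64 kernels, selected at runtime
./configure skx      # builds only the Skylake-X (AVX-512) kernels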

@h-vetinari
Contributor Author

I can't reproduce on Mac or Linux and it's going to be a pain to try Windows (I would have to set up the development env. in a VM from scratch).

As I said, Linux works, but windows is the crucial case ATM.

Can you try this on your end:

This comes out as follows (just adding the print statement) - let me know if I got something wrong.

diff
diff --git a/frame/include/level0/bli_copys_mxn.h b/frame/include/level0/bli_copys_mxn.h
index a8ead1c3..4e4e22b6 100644
--- a/frame/include/level0/bli_copys_mxn.h
+++ b/frame/include/level0/bli_copys_mxn.h
@@ -204,8 +204,11 @@ BLIS_INLINE void bli_ddcopys_mxn( const dim_t m, const dim_t n, double*   restri
        {
                for ( dim_t jj = 0; jj < n; ++jj )
                for ( dim_t ii = 0; ii < m; ++ii )
+               {
                bli_ddcopys( *(x + ii*rs_x + jj*cs_x),
                             *(y + ii*rs_y + jj*cs_y) );
+               printf("%lld %lld %lld %lld %lld %lld %p %p %f %f\n", ii, jj, rs_x, cs_x, rs_y, cs_y, x, y, *(x + ii*rs_x + jj*cs_x), *(y + ii*rs_y + jj*cs_y));
+               }
        }
 }
 BLIS_INLINE void bli_cdcopys_mxn( const dim_t m, const dim_t n, scomplex* restrict x, const inc_t rs_x, const inc_t cs_x,

And to be clear, you are configuring blis with either the intel64 or x86_64 configuration, and not skx, right?

Should be the default configuration (whatever that may be); the build invocation is here.

@devinamatthews
Member

devinamatthews commented Jul 6, 2021

FYI if you're up for generating assembly the steps would be:

  1. Build BLIS as usual with make V=1.
  2. Save the complete compile lines for frame/3/gemm/bli_gemm_ker_var2.c and kernels/skx/3/bli_dgemm_skx_asm_16x14.c.
  3. Replace -c with -S and -o <whatever>.o with -o <whatever>.s, then run these new commands after make finishes (an illustrative sketch follows the list).
  4. Save the generated .s files (or I guess you could cat them if that's easier).
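
(Illustrative sketch of step 3, assuming make V=1 prints a compile line roughly like the first one below; the real flags should be copied verbatim from the log:)

# compile line as printed by make V=1 (flags abbreviated here)
clang <flags from the log> -c kernels/skx/3/bli_dgemm_skx_asm_16x14.c -o obj/<arch>/bli_dgemm_skx_asm_16x14.o
# same invocation re-run by hand, with -c/-o replaced so it emits assembly
clang <flags from the log> -S kernels/skx/3/bli_dgemm_skx_asm_16x14.c -o bli_dgemm_skx_asm_16x14.s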

@devinamatthews
Member

This is better than working from the .o files because it preserves any comments generated by clang.

@h-vetinari
Contributor Author

FYI if you're up for generating assembly the steps would be:

I'm not sure I followed the instructions correctly (also adapted to conda-forge infra), but here's my attempt: conda-forge/blis-feedstock@625bcce

@devinamatthews
Member

@h-vetinari surely there are more compiler flags for the compile lines? Did you run with make V=1 and copy-paste the compiler line it used for those files?

@h-vetinari
Contributor Author

Did you run with make V=1 and copy-paste the compiler line it used for those files?

No, but happy to take pointers on how these lines should look (I've added V=1 now, which had not been necessary so far). I'm doing everything in CI (and battling with differences between GCC and MSVC options, not to mention the conda-forge setup), so it's generally quite a painful process.

@devinamatthews
Member

After adding V=1 you'll see in the build log the full compiler commands used by make. Just copy-paste the two lines for the files named above and modify as indicated. This way you get the exact same flags as used by BLIS in case any of them affect code generation.

@h-vetinari
Contributor Author

Thanks, that's also what I wanted to do, see outcome

@h-vetinari
Contributor Author

h-vetinari commented Jul 7, 2021

Here's the output of a failing run (raw log)

Hope this helps. 🙃

Contents of frame/3/gemm/bli_gemm_ker_var2.c
/*

   BLIS
   An object-based framework for developing high-performance BLAS-like
   libraries.

   Copyright (C) 2014, The University of Texas at Austin
   Copyright (C) 2018 - 2019, Advanced Micro Devices, Inc.

   Redistribution and use in source and binary forms, with or without
   modification, are permitted provided that the following conditions are
   met:
    - Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
    - Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.
    - Neither the name(s) of the copyright holder(s) nor the names of its
      contributors may be used to endorse or promote products derived
      from this software without specific prior written permission.

   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
   "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
   HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
   SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
   LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
   DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
   THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

*/

#include "blis.h"

#define FUNCPTR_T gemm_fp

typedef void (*FUNCPTR_T)
     (
       pack_t  schema_a,
       pack_t  schema_b,
       dim_t   m,
       dim_t   n,
       dim_t   k,
       void*   alpha,
       void*   a, inc_t cs_a, inc_t is_a,
                  dim_t pd_a, inc_t ps_a,
       void*   b, inc_t rs_b, inc_t is_b,
                  dim_t pd_b, inc_t ps_b,
       void*   beta,
       void*   c, inc_t rs_c, inc_t cs_c,
       cntx_t* cntx,
       rntm_t* rntm,
       thrinfo_t* thread
     );

static FUNCPTR_T GENARRAY(ftypes,gemm_ker_var2);


void bli_gemm_ker_var2
     (
       obj_t*  a,
       obj_t*  b,
       obj_t*  c,
       cntx_t* cntx,
       rntm_t* rntm,
       cntl_t* cntl,
       thrinfo_t* thread
     )
{
#ifdef BLIS_ENABLE_GEMM_MD
	// By now, A and B have been packed and cast to the execution precision.
	// In most cases, such as when storage precision of C differs from the
	// execution precision, we utilize the mixed datatype code path. However,
	// a few cases still fall within this kernel, such as mixed domain with
	// equal precision (ccr, crc, rcc), hence those expressions being disabled
	// in the conditional below.
	if ( //( bli_obj_domain( c ) != bli_obj_domain( a ) ) ||
	     //( bli_obj_domain( c ) != bli_obj_domain( b ) ) ||
	     ( bli_obj_dt( c ) != bli_obj_exec_dt( c ) ) )
	{
		bli_gemm_ker_var2_md( a, b, c, cntx, rntm, cntl, thread );
		return;
	}
#endif

	num_t     dt_exec   = bli_obj_exec_dt( c );

	pack_t    schema_a  = bli_obj_pack_schema( a );
	pack_t    schema_b  = bli_obj_pack_schema( b );

	dim_t     m         = bli_obj_length( c );
	dim_t     n         = bli_obj_width( c );
	dim_t     k         = bli_obj_width( a );

	void*     buf_a     = bli_obj_buffer_at_off( a );
	inc_t     cs_a      = bli_obj_col_stride( a );
	inc_t     is_a      = bli_obj_imag_stride( a );
	dim_t     pd_a      = bli_obj_panel_dim( a );
	inc_t     ps_a      = bli_obj_panel_stride( a );

	void*     buf_b     = bli_obj_buffer_at_off( b );
	inc_t     rs_b      = bli_obj_row_stride( b );
	inc_t     is_b      = bli_obj_imag_stride( b );
	dim_t     pd_b      = bli_obj_panel_dim( b );
	inc_t     ps_b      = bli_obj_panel_stride( b );

	void*     buf_c     = bli_obj_buffer_at_off( c );
	inc_t     rs_c      = bli_obj_row_stride( c );
	inc_t     cs_c      = bli_obj_col_stride( c );

	obj_t     scalar_a;
	obj_t     scalar_b;

	void*     buf_alpha;
	void*     buf_beta;

	FUNCPTR_T f;

	// Detach and multiply the scalars attached to A and B.
	bli_obj_scalar_detach( a, &scalar_a );
	bli_obj_scalar_detach( b, &scalar_b );
	bli_mulsc( &scalar_a, &scalar_b );

	// Grab the addresses of the internal scalar buffers for the scalar
	// merged above and the scalar attached to C.
	buf_alpha = bli_obj_internal_scalar_buffer( &scalar_b );
	buf_beta  = bli_obj_internal_scalar_buffer( c );

	// If 1m is being employed on a column- or row-stored matrix with a
	// real-valued beta, we can use the real domain macro-kernel, which
	// eliminates a little overhead associated with the 1m virtual
	// micro-kernel.
#if 1
	if ( bli_cntx_method( cntx ) == BLIS_1M )
	{
		bli_gemm_ind_recast_1m_params
		(
		  &dt_exec,
		  schema_a,
		  c,
		  &m, &n, &k,
		  &pd_a, &ps_a,
		  &pd_b, &ps_b,
		  &rs_c, &cs_c
		);
	}
#endif

#ifdef BLIS_ENABLE_GEMM_MD
	// Tweak parameters in select mixed domain cases (rcc, crc, ccr).
	bli_gemm_md_ker_var2_recast
	(
	  &dt_exec,
	  bli_obj_dt( a ),
	  bli_obj_dt( b ),
	  bli_obj_dt( c ),
	  &m, &n, &k,
	  &pd_a, &ps_a,
	  &pd_b, &ps_b,
	  c,
	  &rs_c, &cs_c
	);
#endif

	// Index into the type combination array to extract the correct
	// function pointer.
	f = ftypes[dt_exec];

	// Invoke the function.
	f( schema_a,
	   schema_b,
	   m,
	   n,
	   k,
	   buf_alpha,
	   buf_a, cs_a, is_a,
	          pd_a, ps_a,
	   buf_b, rs_b, is_b,
	          pd_b, ps_b,
	   buf_beta,
	   buf_c, rs_c, cs_c,
	   cntx,
	   rntm,
	   thread );
}


#undef  GENTFUNC
#define GENTFUNC( ctype, ch, varname ) \
\
void PASTEMAC(ch,varname) \
     ( \
       pack_t  schema_a, \
       pack_t  schema_b, \
       dim_t   m, \
       dim_t   n, \
       dim_t   k, \
       void*   alpha, \
       void*   a, inc_t cs_a, inc_t is_a, \
                  dim_t pd_a, inc_t ps_a, \
       void*   b, inc_t rs_b, inc_t is_b, \
                  dim_t pd_b, inc_t ps_b, \
       void*   beta, \
       void*   c, inc_t rs_c, inc_t cs_c, \
       cntx_t* cntx, \
       rntm_t* rntm, \
       thrinfo_t* thread  \
     ) \
{ \
	const num_t     dt         = PASTEMAC(ch,type); \
\
	/* Alias some constants to simpler names. */ \
	const dim_t     MR         = pd_a; \
	const dim_t     NR         = pd_b; \
	/*const dim_t     PACKMR     = cs_a;*/ \
	/*const dim_t     PACKNR     = rs_b;*/ \
\
	/* Query the context for the micro-kernel address and cast it to its
	   function pointer type. */ \
	PASTECH(ch,gemm_ukr_ft) \
	                gemm_ukr   = bli_cntx_get_l3_vir_ukr_dt( dt, BLIS_GEMM_UKR, cntx ); \
\
	/* Temporary C buffer for edge cases. Note that the strides of this
	   temporary buffer are set so that they match the storage of the
	   original C matrix. For example, if C is column-stored, ct will be
	   column-stored as well. */ \
	ctype           ct[ BLIS_STACK_BUF_MAX_SIZE \
	                    / sizeof( ctype ) ] \
	                    __attribute__((aligned(BLIS_STACK_BUF_ALIGN_SIZE))); \
	const bool      col_pref    = bli_cntx_l3_vir_ukr_prefers_cols_dt( dt, BLIS_GEMM_UKR, cntx ); \
	const inc_t     rs_ct       = ( col_pref ? 1 : NR ); \
	const inc_t     cs_ct       = ( col_pref ? MR : 1 ); \
\
	ctype* restrict zero       = PASTEMAC(ch,0); \
	ctype* restrict a_cast     = a; \
	ctype* restrict b_cast     = b; \
	ctype* restrict c_cast     = c; \
	ctype* restrict alpha_cast = alpha; \
	ctype* restrict beta_cast  = beta; \
	ctype* restrict b1; \
	ctype* restrict c1; \
\
	dim_t           m_iter, m_left; \
	dim_t           n_iter, n_left; \
	dim_t           i, j; \
	dim_t           m_cur; \
	dim_t           n_cur; \
	inc_t           rstep_a; \
	inc_t           cstep_b; \
	inc_t           rstep_c, cstep_c; \
	auxinfo_t       aux; \
\
	/*
	   Assumptions/assertions:
	     rs_a == 1
	     cs_a == PACKMR
	     pd_a == MR
	     ps_a == stride to next micro-panel of A
	     rs_b == PACKNR
	     cs_b == 1
	     pd_b == NR
	     ps_b == stride to next micro-panel of B
	     rs_c == (no assumptions)
	     cs_c == (no assumptions)
	*/ \
\
	/* If any dimension is zero, return immediately. */ \
	if ( bli_zero_dim3( m, n, k ) ) return; \
\
	/* Clear the temporary C buffer in case it has any infs or NaNs. */ \
	PASTEMAC(ch,set0s_mxn)( MR, NR, \
	                        ct, rs_ct, cs_ct ); \
\
	/* Compute number of primary and leftover components of the m and n
	   dimensions. */ \
	n_iter = n / NR; \
	n_left = n % NR; \
\
	m_iter = m / MR; \
	m_left = m % MR; \
\
	if ( n_left ) ++n_iter; \
	if ( m_left ) ++m_iter; \
\
	/* Determine some increments used to step through A, B, and C. */ \
	rstep_a = ps_a; \
\
	cstep_b = ps_b; \
\
	rstep_c = rs_c * MR; \
	cstep_c = cs_c * NR; \
\
	/* Save the pack schemas of A and B to the auxinfo_t object. */ \
	bli_auxinfo_set_schema_a( schema_a, &aux ); \
	bli_auxinfo_set_schema_b( schema_b, &aux ); \
\
	/* Save the imaginary stride of A and B to the auxinfo_t object. */ \
	bli_auxinfo_set_is_a( is_a, &aux ); \
	bli_auxinfo_set_is_b( is_b, &aux ); \
\
	/* The 'thread' argument points to the thrinfo_t node for the 2nd (jr)
	   loop around the microkernel. Here we query the thrinfo_t node for the
	   1st (ir) loop around the microkernel. */ \
	thrinfo_t* caucus = bli_thrinfo_sub_node( thread ); \
\
	/* Query the number of threads and thread ids for each loop. */ \
	dim_t jr_nt  = bli_thread_n_way( thread ); \
	dim_t jr_tid = bli_thread_work_id( thread ); \
	dim_t ir_nt  = bli_thread_n_way( caucus ); \
	dim_t ir_tid = bli_thread_work_id( caucus ); \
\
	dim_t jr_start, jr_end; \
	dim_t ir_start, ir_end; \
	dim_t jr_inc,   ir_inc; \
\
	/* Determine the thread range and increment for the 2nd and 1st loops.
	   NOTE: The definition of bli_thread_range_jrir() will depend on whether
	   slab or round-robin partitioning was requested at configure-time. */ \
	bli_thread_range_jrir( thread, n_iter, 1, FALSE, &jr_start, &jr_end, &jr_inc ); \
	bli_thread_range_jrir( caucus, m_iter, 1, FALSE, &ir_start, &ir_end, &ir_inc ); \
\
	/* Loop over the n dimension (NR columns at a time). */ \
	for ( j = jr_start; j < jr_end; j += jr_inc ) \
	{ \
		ctype* restrict a1; \
		ctype* restrict c11; \
		ctype* restrict b2; \
\
		b1 = b_cast + j * cstep_b; \
		c1 = c_cast + j * cstep_c; \
\
		n_cur = ( bli_is_not_edge_f( j, n_iter, n_left ) ? NR : n_left ); \
\
		/* Initialize our next panel of B to be the current panel of B. */ \
		b2 = b1; \
\
		/* Loop over the m dimension (MR rows at a time). */ \
		for ( i = ir_start; i < ir_end; i += ir_inc ) \
		{ \
			ctype* restrict a2; \
\
			a1  = a_cast + i * rstep_a; \
			c11 = c1     + i * rstep_c; \
\
			m_cur = ( bli_is_not_edge_f( i, m_iter, m_left ) ? MR : m_left ); \
\
			/* Compute the addresses of the next panels of A and B. */ \
			a2 = bli_gemm_get_next_a_upanel( a1, rstep_a, ir_inc ); \
			if ( bli_is_last_iter( i, ir_end, ir_tid, ir_nt ) ) \
			{ \
				a2 = a_cast; \
				b2 = bli_gemm_get_next_b_upanel( b1, cstep_b, jr_inc ); \
				if ( bli_is_last_iter( j, jr_end, jr_tid, jr_nt ) ) \
					b2 = b_cast; \
			} \
\
			/* Save addresses of next panels of A and B to the auxinfo_t
			   object. */ \
			bli_auxinfo_set_next_a( a2, &aux ); \
			bli_auxinfo_set_next_b( b2, &aux ); \
\
			/* Handle interior and edge cases separately. */ \
			if ( m_cur == MR && n_cur == NR ) \
			{ \
				/* Invoke the gemm micro-kernel. */ \
				gemm_ukr \
				( \
				  k, \
				  alpha_cast, \
				  a1, \
				  b1, \
				  beta_cast, \
				  c11, rs_c, cs_c, \
				  &aux, \
				  cntx  \
				); \
			} \
			else \
			{ \
				/* Invoke the gemm micro-kernel. */ \
				gemm_ukr \
				( \
				  k, \
				  alpha_cast, \
				  a1, \
				  b1, \
				  zero, \
				  ct, rs_ct, cs_ct, \
				  &aux, \
				  cntx  \
				); \
\
				/* Scale the bottom edge of C and add the result from above. */ \
				PASTEMAC(ch,xpbys_mxn)( m_cur, n_cur, \
				                        ct,  rs_ct, cs_ct, \
				                        beta_cast, \
				                        c11, rs_c,  cs_c ); \
			} \
		} \
	} \
\
/*
PASTEMAC(ch,fprintm)( stdout, "gemm_ker_var2: b1", k, NR, b1, NR, 1, "%4.1f", "" ); \
PASTEMAC(ch,fprintm)( stdout, "gemm_ker_var2: a1", MR, k, a1, 1, MR, "%4.1f", "" ); \
PASTEMAC(ch,fprintm)( stdout, "gemm_ker_var2: c after", m_cur, n_cur, c11, rs_c, cs_c, "%4.1f", "" ); \
*/ \
}

INSERT_GENTFUNC_BASIC0( gemm_ker_var2 )
Contents of kernels/skx/3/bli_dgemm_skx_asm_16x14.c
/*

   BLIS
   An object-based framework for developing high-performance BLAS-like
   libraries.

   Copyright (C) 2014, The University of Texas at Austin

   Redistribution and use in source and binary forms, with or without
   modification, are permitted provided that the following conditions are
   met:
    - Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.
    - Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.
    - Neither the name(s) of the copyright holder(s) nor the names of its
      contributors may be used to endorse or promote products derived
      from this software without specific prior written permission.

   THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
   AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
   LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
   A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE UNIVERSITY
   OF TEXAS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL,
   EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO,
   PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR
   PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY
   OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
   (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
   OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

*/

#include "blis.h"
#include "bli_x86_asm_macros.h"

#define A_L1_PREFETCH_DIST 4 // in units of k iterations
#define B_L1_PREFETCH_DIST 4 // e.g. 4 k iterations ~= 56 cycles
#define TAIL_NITER 5 // in units of 4x unrolled k iterations
                     // e.g. 5 -> 4*5 k iterations ~= 280 cycles

#define PREFETCH_A_L1(n, k) \
    PREFETCH(0, MEM(RAX, A_L1_PREFETCH_DIST*16*8 + (2*n+k)*64))
#define PREFETCH_B_L1(n, k) \
    PREFETCH(0, MEM(RBX, B_L1_PREFETCH_DIST*14*8 + (2*n+k)*56))

#define LOOP_ALIGN ALIGN32

#define UPDATE_C(R1,R2) \
\
    VMULPD(ZMM(R1), ZMM(R1), ZMM(0)) \
    VMULPD(ZMM(R2), ZMM(R2), ZMM(0)) \
    VFMADD231PD(ZMM(R1), ZMM(1), MEM(RCX)) \
    VFMADD231PD(ZMM(R2), ZMM(1), MEM(RCX,64)) \
    VMOVUPD(MEM(RCX), ZMM(R1)) \
    VMOVUPD(MEM(RCX,64), ZMM(R2)) \
    LEA(RCX, MEM(RCX,RBX,1))

#define UPDATE_C_BZ(R1,R2) \
\
    VMULPD(ZMM(R1), ZMM(R1), ZMM(0)) \
    VMULPD(ZMM(R2), ZMM(R2), ZMM(0)) \
    VMOVUPD(MEM(RCX), ZMM(R1)) \
    VMOVUPD(MEM(RCX,64), ZMM(R2)) \
    LEA(RCX, MEM(RCX,RBX,1))

#define UPDATE_C_COL_SCATTERED(R1,R2) \
\
    KXNORW(K(1), K(0), K(0)) \
    KXNORW(K(2), K(0), K(0)) \
    KXNORW(K(3), K(0), K(0)) \
    KXNORW(K(4), K(0), K(0)) \
    VGATHERQPD(ZMM(0) MASK_K(1), MEM(RCX,ZMM(2),1)) \
    VFMADD231PD(ZMM(R1), ZMM(0), ZMM(1)) \
    VGATHERQPD(ZMM(0) MASK_K(2), MEM(RCX,ZMM(3),1)) \
    VFMADD231PD(ZMM(R2), ZMM(0), ZMM(1)) \
    VSCATTERQPD(MEM(RCX,ZMM(2),1) MASK_K(3), ZMM(R1)) \
    VSCATTERQPD(MEM(RCX,ZMM(3),1) MASK_K(4), ZMM(R2)) \
    LEA(RCX, MEM(RCX,RBX,1))

#define UPDATE_C_BZ_COL_SCATTERED(R1,R2) \
\
    KXNORW(K(1), K(0), K(0)) \
    KXNORW(K(2), K(0), K(0)) \
    VSCATTERQPD(MEM(RCX,ZMM(2),1) MASK_K(1), ZMM(R1)) \
    VSCATTERQPD(MEM(RCX,ZMM(3),1) MASK_K(2), ZMM(R2)) \
    LEA(RCX, MEM(RCX,RBX,1))

#define SUBITER(n) \
\
    PREFETCH_A_L1(n, 0) \
    \
    VBROADCASTSD(ZMM(2), MEM(RBX,(14*n+ 0)*8)) \
    VBROADCASTSD(ZMM(3), MEM(RBX,(14*n+ 1)*8)) \
    VFMADD231PD(ZMM( 4), ZMM(0), ZMM(2)) \
    VFMADD231PD(ZMM( 5), ZMM(1), ZMM(2)) \
    VFMADD231PD(ZMM( 6), ZMM(0), ZMM(3)) \
    VFMADD231PD(ZMM( 7), ZMM(1), ZMM(3)) \
    \
    VBROADCASTSD(ZMM(2), MEM(RBX,(14*n+ 2)*8)) \
    VBROADCASTSD(ZMM(3), MEM(RBX,(14*n+ 3)*8)) \
    VFMADD231PD(ZMM( 8), ZMM(0), ZMM(2)) \
    VFMADD231PD(ZMM( 9), ZMM(1), ZMM(2)) \
    VFMADD231PD(ZMM(10), ZMM(0), ZMM(3)) \
    VFMADD231PD(ZMM(11), ZMM(1), ZMM(3)) \
    \
    PREFETCH_B_L1(n, 0) \
    \
    VBROADCASTSD(ZMM(2), MEM(RBX,(14*n+ 4)*8)) \
    VBROADCASTSD(ZMM(3), MEM(RBX,(14*n+ 5)*8)) \
    VFMADD231PD(ZMM(12), ZMM(0), ZMM(2)) \
    VFMADD231PD(ZMM(13), ZMM(1), ZMM(2)) \
    VFMADD231PD(ZMM(14), ZMM(0), ZMM(3)) \
    VFMADD231PD(ZMM(15), ZMM(1), ZMM(3)) \
    \
    VBROADCASTSD(ZMM(2), MEM(RBX,(14*n+ 6)*8)) \
    VBROADCASTSD(ZMM(3), MEM(RBX,(14*n+ 7)*8)) \
    VFMADD231PD(ZMM(16), ZMM(0), ZMM(2)) \
    VFMADD231PD(ZMM(17), ZMM(1), ZMM(2)) \
    VFMADD231PD(ZMM(18), ZMM(0), ZMM(3)) \
    VFMADD231PD(ZMM(19), ZMM(1), ZMM(3)) \
    \
    PREFETCH_A_L1(n, 1) \
    \
    VBROADCASTSD(ZMM(2), MEM(RBX,(14*n+ 8)*8)) \
    VBROADCASTSD(ZMM(3), MEM(RBX,(14*n+ 9)*8)) \
    VFMADD231PD(ZMM(20), ZMM(0), ZMM(2)) \
    VFMADD231PD(ZMM(21), ZMM(1), ZMM(2)) \
    VFMADD231PD(ZMM(22), ZMM(0), ZMM(3)) \
    VFMADD231PD(ZMM(23), ZMM(1), ZMM(3)) \
    \
    VBROADCASTSD(ZMM(2), MEM(RBX,(14*n+10)*8)) \
    VBROADCASTSD(ZMM(3), MEM(RBX,(14*n+11)*8)) \
    VFMADD231PD(ZMM(24), ZMM(0), ZMM(2)) \
    VFMADD231PD(ZMM(25), ZMM(1), ZMM(2)) \
    VFMADD231PD(ZMM(26), ZMM(0), ZMM(3)) \
    VFMADD231PD(ZMM(27), ZMM(1), ZMM(3)) \
    \
    PREFETCH_B_L1(n, 1) \
    \
    VBROADCASTSD(ZMM(2), MEM(RBX,(14*n+12)*8)) \
    VBROADCASTSD(ZMM(3), MEM(RBX,(14*n+13)*8)) \
    VFMADD231PD(ZMM(28), ZMM(0), ZMM(2)) \
    VFMADD231PD(ZMM(29), ZMM(1), ZMM(2)) \
    VFMADD231PD(ZMM(30), ZMM(0), ZMM(3)) \
    VFMADD231PD(ZMM(31), ZMM(1), ZMM(3)) \
    \
    VMOVAPD(ZMM(0), MEM(RAX,(16*n+0)*8)) \
    VMOVAPD(ZMM(1), MEM(RAX,(16*n+8)*8))

//This is an array used for the scatter/gather instructions.
static int64_t offsets[16] __attribute__((aligned(64))) =
    { 0, 1, 2, 3, 4, 5, 6, 7, 8, 9,10,11,12,13,14,15};

void bli_dgemm_skx_asm_16x14(
                              dim_t            k_,
                              double* restrict alpha,
                              double* restrict a,
                              double* restrict b,
                              double* restrict beta,
                              double* restrict c, inc_t rs_c_, inc_t cs_c_,
                              auxinfo_t*       data,
                              cntx_t* restrict cntx
                            )
{
    (void)data;
    (void)cntx;

    const int64_t* offsetPtr = &offsets[0];
    const int64_t k = k_;
    const int64_t rs_c = rs_c_*8;
    const int64_t cs_c = cs_c_*8;

    BEGIN_ASM()

    VXORPD(YMM( 4), YMM( 4), YMM( 4)) //clear out registers
    VXORPD(YMM( 5), YMM( 5), YMM( 5))
    VXORPD(YMM( 6), YMM( 6), YMM( 6))
    VXORPD(YMM( 7), YMM( 7), YMM( 7))
    VXORPD(YMM( 8), YMM( 8), YMM( 8))
    VXORPD(YMM( 9), YMM( 9), YMM( 9))
    VXORPD(YMM(10), YMM(10), YMM(10))
    VXORPD(YMM(11), YMM(11), YMM(11))
    VXORPD(YMM(12), YMM(12), YMM(12))
    VXORPD(YMM(13), YMM(13), YMM(13))
    VXORPD(YMM(14), YMM(14), YMM(14))
    VXORPD(YMM(15), YMM(15), YMM(15))
    VXORPD(YMM(16), YMM(16), YMM(16))
    VXORPD(YMM(17), YMM(17), YMM(17))
    VXORPD(YMM(18), YMM(18), YMM(18))
    VXORPD(YMM(19), YMM(19), YMM(19))
    VXORPD(YMM(20), YMM(20), YMM(20))
    VXORPD(YMM(21), YMM(21), YMM(21))
    VXORPD(YMM(22), YMM(22), YMM(22))
    VXORPD(YMM(23), YMM(23), YMM(23))
    VXORPD(YMM(24), YMM(24), YMM(24))
    VXORPD(YMM(25), YMM(25), YMM(25))
    VXORPD(YMM(26), YMM(26), YMM(26))
    VXORPD(YMM(27), YMM(27), YMM(27))
    VXORPD(YMM(28), YMM(28), YMM(28))
    VXORPD(YMM(29), YMM(29), YMM(29))
    VXORPD(YMM(30), YMM(30), YMM(30))
    VXORPD(YMM(31), YMM(31), YMM(31))

    MOV(RSI, VAR(k)) //loop index
    MOV(RAX, VAR(a)) //load address of a
    MOV(RBX, VAR(b)) //load address of b
    MOV(RCX, VAR(c)) //load address of c

    LEA(RDX, MEM(RSI,RSI,2))
    LEA(RDX, MEM(,RDX,4))
    LEA(RDX, MEM(RDX,RSI,2)) // 14*k
    LEA(RDX, MEM(RBX,RDX,8,-128)) // b_next
    LEA(R9, MEM(RCX,63)) // c for prefetching

    VMOVAPD(ZMM(0), MEM(RAX, 0*8)) //pre-load a
    VMOVAPD(ZMM(1), MEM(RAX, 8*8)) //pre-load a
    LEA(RAX, MEM(RAX,16*8)) //adjust a for pre-load

    MOV(R12, VAR(rs_c))
    MOV(R10, VAR(cs_c))

    MOV(RDI, RSI)
    AND(RSI, IMM(3))
    SAR(RDI, IMM(2))

    SUB(RDI, IMM(14+TAIL_NITER))
    JLE(K_LE_80)

        LOOP_ALIGN
        LABEL(LOOP1)

            SUBITER(0)
            PREFETCH(1, MEM(RDX))
            SUBITER(1)
            SUB(RDI, IMM(1))
            SUBITER(2)
            PREFETCH(1, MEM(RDX,64))
            SUBITER(3)

            LEA(RAX, MEM(RAX,4*16*8))
            LEA(RBX, MEM(RBX,4*14*8))
            LEA(RDX, MEM(RDX,16*8))

        JNZ(LOOP1)

    LABEL(K_LE_80)

    ADD(RDI, IMM(14))
    JLE(K_LE_24)

        LOOP_ALIGN
        LABEL(LOOP2)

            PREFETCH(0, MEM(R9))
            SUBITER(0)
            PREFETCH(1, MEM(RDX))
            SUBITER(1)
            PREFETCH(0, MEM(R9,64))
            SUB(RDI, IMM(1))
            SUBITER(2)
            PREFETCH(1, MEM(RDX,64))
            SUBITER(3)

            LEA(RAX, MEM(RAX,4*16*8))
            LEA(RBX, MEM(RBX,4*14*8))
            LEA(RDX, MEM(RDX,16*8))
            LEA(R9, MEM(R9,R10,1))

        JNZ(LOOP2)

    LABEL(K_LE_24)

    ADD(RDI, IMM(0+TAIL_NITER))
    JLE(TAIL)

        LOOP_ALIGN
        LABEL(LOOP3)

            SUBITER(0)
            PREFETCH(1, MEM(RDX))
            SUBITER(1)
            SUB(RDI, IMM(1))
            SUBITER(2)
            PREFETCH(1, MEM(RDX,64))
            SUBITER(3)

            LEA(RAX, MEM(RAX,4*16*8))
            LEA(RBX, MEM(RBX,4*14*8))
            LEA(RDX, MEM(RDX,16*8))

        JNZ(LOOP3)

    LABEL(TAIL)

    TEST(RSI, RSI)
    JZ(POSTACCUM)

        LOOP_ALIGN
        LABEL(TAIL_LOOP)

            SUB(RSI, IMM(1))
            SUBITER(0)

            LEA(RAX, MEM(RAX,16*8))
            LEA(RBX, MEM(RBX,14*8))

        JNZ(TAIL_LOOP)

    LABEL(POSTACCUM)

    MOV(RAX, VAR(alpha))
    MOV(RBX, VAR(beta))
    VBROADCASTSD(ZMM(0), MEM(RAX))
    VBROADCASTSD(ZMM(1), MEM(RBX))

    VXORPD(YMM(2), YMM(2), YMM(2))

    MOV(RAX, R12)
    MOV(RBX, R10)

    // Check if C is column stride.
    CMP(RAX, IMM(8))
    JNE(SCATTEREDUPDATE)

        VCOMISD(XMM(1), XMM(2))
        JE(COLSTORBZ)

            UPDATE_C( 4, 5)
            UPDATE_C( 6, 7)
            UPDATE_C( 8, 9)
            UPDATE_C(10,11)
            UPDATE_C(12,13)
            UPDATE_C(14,15)
            UPDATE_C(16,17)
            UPDATE_C(18,19)
            UPDATE_C(20,21)
            UPDATE_C(22,23)
            UPDATE_C(24,25)
            UPDATE_C(26,27)
            UPDATE_C(28,29)
            UPDATE_C(30,31)

        JMP(END)
        LABEL(COLSTORBZ)

            UPDATE_C_BZ( 4, 5)
            UPDATE_C_BZ( 6, 7)
            UPDATE_C_BZ( 8, 9)
            UPDATE_C_BZ(10,11)
            UPDATE_C_BZ(12,13)
            UPDATE_C_BZ(14,15)
            UPDATE_C_BZ(16,17)
            UPDATE_C_BZ(18,19)
            UPDATE_C_BZ(20,21)
            UPDATE_C_BZ(22,23)
            UPDATE_C_BZ(24,25)
            UPDATE_C_BZ(26,27)
            UPDATE_C_BZ(28,29)
            UPDATE_C_BZ(30,31)

    JMP(END)
    LABEL(SCATTEREDUPDATE)

        VMULPD(ZMM( 4), ZMM( 4), ZMM(0))
        VMULPD(ZMM( 5), ZMM( 5), ZMM(0))
        VMULPD(ZMM( 6), ZMM( 6), ZMM(0))
        VMULPD(ZMM( 7), ZMM( 7), ZMM(0))
        VMULPD(ZMM( 8), ZMM( 8), ZMM(0))
        VMULPD(ZMM( 9), ZMM( 9), ZMM(0))
        VMULPD(ZMM(10), ZMM(10), ZMM(0))
        VMULPD(ZMM(11), ZMM(11), ZMM(0))
        VMULPD(ZMM(12), ZMM(12), ZMM(0))
        VMULPD(ZMM(13), ZMM(13), ZMM(0))
        VMULPD(ZMM(14), ZMM(14), ZMM(0))
        VMULPD(ZMM(15), ZMM(15), ZMM(0))
        VMULPD(ZMM(16), ZMM(16), ZMM(0))
        VMULPD(ZMM(17), ZMM(17), ZMM(0))
        VMULPD(ZMM(18), ZMM(18), ZMM(0))
        VMULPD(ZMM(19), ZMM(19), ZMM(0))
        VMULPD(ZMM(20), ZMM(20), ZMM(0))
        VMULPD(ZMM(21), ZMM(21), ZMM(0))
        VMULPD(ZMM(22), ZMM(22), ZMM(0))
        VMULPD(ZMM(23), ZMM(23), ZMM(0))
        VMULPD(ZMM(24), ZMM(24), ZMM(0))
        VMULPD(ZMM(25), ZMM(25), ZMM(0))
        VMULPD(ZMM(26), ZMM(26), ZMM(0))
        VMULPD(ZMM(27), ZMM(27), ZMM(0))
        VMULPD(ZMM(28), ZMM(28), ZMM(0))
        VMULPD(ZMM(29), ZMM(29), ZMM(0))
        VMULPD(ZMM(30), ZMM(30), ZMM(0))
        VMULPD(ZMM(31), ZMM(31), ZMM(0))

        VCOMISD(XMM(1), XMM(2))

        MOV(RDI, VAR(offsetPtr))
        VPBROADCASTQ(ZMM(0), RAX)
        VPMULLQ(ZMM(2), ZMM(0), MEM(RDI))
        VPMULLQ(ZMM(3), ZMM(0), MEM(RDI,64))

        JE(SCATTERBZ)

            UPDATE_C_COL_SCATTERED( 4, 5)
            UPDATE_C_COL_SCATTERED( 6, 7)
            UPDATE_C_COL_SCATTERED( 8, 9)
            UPDATE_C_COL_SCATTERED(10,11)
            UPDATE_C_COL_SCATTERED(12,13)
            UPDATE_C_COL_SCATTERED(14,15)
            UPDATE_C_COL_SCATTERED(16,17)
            UPDATE_C_COL_SCATTERED(18,19)
            UPDATE_C_COL_SCATTERED(20,21)
            UPDATE_C_COL_SCATTERED(22,23)
            UPDATE_C_COL_SCATTERED(24,25)
            UPDATE_C_COL_SCATTERED(26,27)
            UPDATE_C_COL_SCATTERED(28,29)
            UPDATE_C_COL_SCATTERED(30,31)

        JMP(END)
        LABEL(SCATTERBZ)

            UPDATE_C_BZ_COL_SCATTERED( 4, 5)
            UPDATE_C_BZ_COL_SCATTERED( 6, 7)
            UPDATE_C_BZ_COL_SCATTERED( 8, 9)
            UPDATE_C_BZ_COL_SCATTERED(10,11)
            UPDATE_C_BZ_COL_SCATTERED(12,13)
            UPDATE_C_BZ_COL_SCATTERED(14,15)
            UPDATE_C_BZ_COL_SCATTERED(16,17)
            UPDATE_C_BZ_COL_SCATTERED(18,19)
            UPDATE_C_BZ_COL_SCATTERED(20,21)
            UPDATE_C_BZ_COL_SCATTERED(22,23)
            UPDATE_C_BZ_COL_SCATTERED(24,25)
            UPDATE_C_BZ_COL_SCATTERED(26,27)
            UPDATE_C_BZ_COL_SCATTERED(28,29)
            UPDATE_C_BZ_COL_SCATTERED(30,31)

    LABEL(END)

    VZEROUPPER()

    END_ASM
    (
        : // output operands
        : // input operands
          [k]         "m" (k),
          [a]         "m" (a),
          [b]         "m" (b),
          [alpha]     "m" (alpha),
          [beta]      "m" (beta),
          [c]         "m" (c),
          [rs_c]      "m" (rs_c),
          [cs_c]      "m" (cs_c),
          [offsetPtr] "m" (offsetPtr)
        : // register clobber list
          "rax", "rbx", "rcx", "rdx", "rdi", "rsi", "r8", "r9", "r10", "r11", "r12",
          "r13", "r14", "r15", "zmm0", "zmm1", "zmm2", "zmm3", "zmm4", "zmm5",
          "zmm6", "zmm7", "zmm8", "zmm9", "zmm10", "zmm11", "zmm12", "zmm13",
          "zmm14", "zmm15", "zmm16", "zmm17", "zmm18", "zmm19", "zmm20", "zmm21",
          "zmm22", "zmm23", "zmm24", "zmm25", "zmm26", "zmm27", "zmm28", "zmm29",
          "zmm30", "zmm31", "memory"
    )
}
Contents of bli_gemm_ker_var2.s
	.text
	.def	 @feat.00;
	.scl	3;
	.type	0;
	.endef
	.globl	@feat.00
.set @feat.00, 0
	.file	"bli_gemm_ker_var2.c"
	.def	 bli_gemm_ker_var2;
	.scl	2;
	.type	32;
	.endef
	.globl	bli_gemm_ker_var2               # -- Begin function bli_gemm_ker_var2
	.p2align	4, 0x90
bli_gemm_ker_var2:                      # @bli_gemm_ker_var2
.seh_proc bli_gemm_ker_var2
# %bb.0:
	pushq	%r15
	.seh_pushreg %r15
	pushq	%r14
	.seh_pushreg %r14
	pushq	%r13
	.seh_pushreg %r13
	pushq	%r12
	.seh_pushreg %r12
	pushq	%rsi
	.seh_pushreg %rsi
	pushq	%rdi
	.seh_pushreg %rdi
	pushq	%rbp
	.seh_pushreg %rbp
	pushq	%rbx
	.seh_pushreg %rbx
	subq	$824, %rsp                      # imm = 0x338
	.seh_stackalloc 824
	.seh_endprologue
	movq	%r9, %rsi
	movq	%r8, %rbx
	movq	%rdx, %rbp
	movq	%rcx, %rdi
	movl	48(%r8), %r12d
	movl	%r12d, %r13d
	andl	$7, %r13d
	shrl	$13, %r12d
	movl	%r12d, %eax
	andl	$7, %eax
	cmpl	%eax, %r13d
	jne	.LBB0_35
# %bb.1:
	movq	944(%rsp), %rax
	movq	%rax, 328(%rsp)                 # 8-byte Spill
	movq	928(%rsp), %rax
	movq	%rax, 336(%rsp)                 # 8-byte Spill
	movl	48(%rdi), %eax
	movl	%eax, 268(%rsp)                 # 4-byte Spill
	movl	$8323072, %eax                  # imm = 0x7F0000
	andl	48(%rbp), %eax
	movl	%eax, 188(%rsp)                 # 4-byte Spill
	movq	24(%rbx), %rax
	movq	%rax, 224(%rsp)                 # 8-byte Spill
	movq	80(%rdi), %rcx
	movq	16(%rdi), %rax
	movq	72(%rdi), %rdx
	imulq	8(%rdi), %rdx
	movq	%rcx, 232(%rsp)                 # 8-byte Spill
	imulq	%rcx, %rax
	addq	%rax, %rdx
	imulq	56(%rdi), %rdx
	movq	32(%rbx), %rax
	movq	%rax, 216(%rsp)                 # 8-byte Spill
	addq	64(%rdi), %rdx
	movq	%rdx, 248(%rsp)                 # 8-byte Spill
	movq	72(%rbp), %rcx
	movq	80(%rbp), %rax
	imulq	16(%rbp), %rax
	movq	8(%rbp), %rdx
	movq	%rcx, 320(%rsp)                 # 8-byte Spill
	imulq	%rcx, %rdx
	addq	%rax, %rdx
	imulq	56(%rbp), %rdx
	movq	32(%rdi), %rax
	movq	%rax, 192(%rsp)                 # 8-byte Spill
	addq	64(%rbp), %rdx
	movq	%rdx, 312(%rsp)                 # 8-byte Spill
	movq	80(%rbx), %rcx
	movq	16(%rbx), %rax
	movq	%rcx, 240(%rsp)                 # 8-byte Spill
	imulq	%rcx, %rax
	movq	72(%rbx), %rcx
	movq	8(%rbx), %rdx
	movq	%rcx, 256(%rsp)                 # 8-byte Spill
	imulq	%rcx, %rdx
	addq	%rax, %rdx
	imulq	56(%rbx), %rdx
	movq	88(%rdi), %rax
	movq	%rax, 304(%rsp)                 # 8-byte Spill
	addq	64(%rbx), %rdx
	movq	%rdx, 296(%rsp)                 # 8-byte Spill
	movq	136(%rdi), %rax
	movq	%rax, 272(%rsp)                 # 8-byte Spill
	movq	128(%rdi), %rax
	movq	%rax, 208(%rsp)                 # 8-byte Spill
	movq	88(%rbp), %rax
	movq	%rax, 288(%rsp)                 # 8-byte Spill
	movq	136(%rbp), %rax
	movq	%rax, 280(%rsp)                 # 8-byte Spill
	movq	128(%rbp), %rax
	movq	%rax, 200(%rsp)                 # 8-byte Spill
	leaq	664(%rsp), %r14
	movq	%rdi, %rcx
	movq	%r14, %rdx
	callq	bli_obj_scalar_detach
	leaq	504(%rsp), %r15
	movq	%rbp, %rcx
	movq	%r15, %rdx
	callq	bli_obj_scalar_detach
	movq	%r14, %rcx
	movq	%r15, %rdx
	callq	bli_mulsc
	cmpl	$5, 5064(%rsi)
	jne	.LBB0_10
# %bb.2:
	movq	%rsi, %r14
	leaq	344(%rsp), %rsi
	movq	%rbx, %rcx
	movq	%rsi, %rdx
	callq	bli_obj_scalar_detach
	movq	%rsi, %rcx
	callq	bli_obj_imag_is_zero
	testb	%al, %al
	je	.LBB0_3
# %bb.4:
	movq	256(%rsp), %rcx                 # 8-byte Reload
	movq	%rcx, %rax
	negq	%rax
	cmovlq	%rcx, %rax
	movq	240(%rsp), %rdx                 # 8-byte Reload
	movq	%rdx, %rcx
	negq	%rcx
	cmovlq	%rdx, %rcx
	cmpq	$1, %rcx
	je	.LBB0_6
# %bb.5:
	cmpq	$1, %rax
	je	.LBB0_6
.LBB0_3:
	movl	%r13d, %r12d
.LBB0_9:
	movq	%r14, %rsi
	movl	%r12d, %r13d
.LBB0_10:
	movl	268(%rsp), %r12d                # 4-byte Reload
	andl	$8323072, %r12d                 # imm = 0x7F0000
	leaq	600(%rsp), %r14
	leaq	96(%rbx), %r15
	movl	48(%rbx), %edx
	movl	48(%rbp), %ecx
	movl	48(%rdi), %eax
	andl	$5, %edx
	cmpl	$1, %edx
	je	.LBB0_17
# %bb.11:
	testl	%edx, %edx
	movq	192(%rsp), %rbx                 # 8-byte Reload
	jne	.LBB0_12
# %bb.13:
	andl	$5, %ecx
	cmpl	$1, %ecx
	movl	188(%rsp), %edx                 # 4-byte Reload
	movq	224(%rsp), %r8                  # 8-byte Reload
	movq	248(%rsp), %r10                 # 8-byte Reload
	movq	216(%rsp), %r9                  # 8-byte Reload
	movq	240(%rsp), %r11                 # 8-byte Reload
	jne	.LBB0_14
# %bb.15:
	andl	$5, %eax
	cmpl	$1, %eax
	movq	232(%rsp), %rcx                 # 8-byte Reload
	jne	.LBB0_34
# %bb.16:
	addq	%rbx, %rbx
	movq	208(%rsp), %rax                 # 8-byte Reload
	addq	%rax, %rax
	movq	%rax, 208(%rsp)                 # 8-byte Spill
	movq	200(%rsp), %rax                 # 8-byte Reload
	addq	%rax, %rax
	movq	%rax, 200(%rsp)                 # 8-byte Spill
	jmp	.LBB0_34
.LBB0_35:
	movq	%rdi, %rcx
	movq	%rbp, %rdx
	movq	%rbx, %r8
	movq	%rsi, %r9
	addq	$824, %rsp                      # imm = 0x338
	popq	%rbx
	popq	%rbp
	popq	%rdi
	popq	%rsi
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
	jmp	bli_gemm_ker_var2_md            # TAILCALL
.LBB0_17:
	andl	$5, %eax
	andl	$5, %ecx
	cmpl	$1, %ecx
	jne	.LBB0_25
# %bb.18:
	testl	%eax, %eax
	jne	.LBB0_25
# %bb.19:
	movq	%rsi, %rdi
	leaq	344(%rsp), %rsi
	movq	%rbx, %rcx
	movq	%rsi, %rdx
	callq	bli_obj_scalar_detach
	movq	%rsi, %rcx
	callq	bli_obj_imag_is_zero
	movq	240(%rsp), %r11                 # 8-byte Reload
	movq	%r11, %rcx
	negq	%rcx
	cmovlq	%r11, %rcx
	cmpq	$1, %rcx
	jne	.LBB0_23
# %bb.20:
	testb	%al, %al
	je	.LBB0_23
# %bb.21:
	movl	48(%rbx), %eax
	movl	%eax, %ecx
	shrl	$29, %ecx
	xorl	%eax, %ecx
	testb	$2, %cl
	jne	.LBB0_23
# %bb.22:
	andl	$-2, %r13d
	movq	216(%rsp), %r9                  # 8-byte Reload
	addq	%r9, %r9
	movq	280(%rsp), %rax                 # 8-byte Reload
	addq	%rax, %rax
	movq	%rax, 280(%rsp)                 # 8-byte Spill
	movq	200(%rsp), %rax                 # 8-byte Reload
	addq	%rax, %rax
	movq	%rax, 200(%rsp)                 # 8-byte Spill
	movq	256(%rsp), %rax                 # 8-byte Reload
	addq	%rax, %rax
	movq	%rax, 256(%rsp)                 # 8-byte Spill
	movq	%rdi, %rsi
	movl	188(%rsp), %edx                 # 4-byte Reload
	movq	224(%rsp), %r8                  # 8-byte Reload
	movq	232(%rsp), %rcx                 # 8-byte Reload
	movq	248(%rsp), %r10                 # 8-byte Reload
	movq	192(%rsp), %rbx                 # 8-byte Reload
	jmp	.LBB0_34
.LBB0_12:
	movl	188(%rsp), %edx                 # 4-byte Reload
	movq	224(%rsp), %r8                  # 8-byte Reload
	movq	232(%rsp), %rcx                 # 8-byte Reload
	movq	248(%rsp), %r10                 # 8-byte Reload
	movq	216(%rsp), %r9                  # 8-byte Reload
	jmp	.LBB0_33
.LBB0_25:
	testl	%ecx, %ecx
	jne	.LBB0_32
# %bb.26:
	cmpl	$1, %eax
	jne	.LBB0_32
# %bb.27:
	movq	%rsi, %rdi
	leaq	344(%rsp), %rsi
	movq	%rbx, %rcx
	movq	%rsi, %rdx
	callq	bli_obj_scalar_detach
	movq	%rsi, %rcx
	callq	bli_obj_imag_is_zero
	movq	256(%rsp), %rdx                 # 8-byte Reload
	movq	%rdx, %rcx
	negq	%rcx
	cmovlq	%rdx, %rcx
	cmpq	$1, %rcx
	jne	.LBB0_31
# %bb.28:
	testb	%al, %al
	je	.LBB0_31
# %bb.29:
	movl	48(%rbx), %eax
	movl	%eax, %ecx
	shrl	$29, %ecx
	xorl	%eax, %ecx
	testb	$2, %cl
	jne	.LBB0_31
# %bb.30:
	andl	$-2, %r13d
	movq	224(%rsp), %r8                  # 8-byte Reload
	addq	%r8, %r8
	movq	272(%rsp), %rax                 # 8-byte Reload
	addq	%rax, %rax
	movq	%rax, 272(%rsp)                 # 8-byte Spill
	movq	208(%rsp), %rax                 # 8-byte Reload
	addq	%rax, %rax
	movq	%rax, 208(%rsp)                 # 8-byte Spill
	movq	240(%rsp), %r11                 # 8-byte Reload
	addq	%r11, %r11
	movq	%rdi, %rsi
	movl	188(%rsp), %edx                 # 4-byte Reload
	jmp	.LBB0_24
.LBB0_23:
	movq	208(%rsp), %rcx                 # 8-byte Reload
	movq	%rcx, %rax
	shrq	$63, %rax
	addq	%rcx, %rax
	sarq	%rax
	movq	%rax, 208(%rsp)                 # 8-byte Spill
	movq	%rdi, %rsi
	movl	188(%rsp), %edx                 # 4-byte Reload
	movq	224(%rsp), %r8                  # 8-byte Reload
.LBB0_24:
	movq	232(%rsp), %rcx                 # 8-byte Reload
	movq	248(%rsp), %r10                 # 8-byte Reload
	movq	216(%rsp), %r9                  # 8-byte Reload
	movq	192(%rsp), %rbx                 # 8-byte Reload
	jmp	.LBB0_34
.LBB0_6:
	andl	$6, %r12d
	movl	268(%rsp), %eax                 # 4-byte Reload
	andl	$3932160, %eax                  # imm = 0x3C0000
	cmpl	$2097152, %eax                  # imm = 0x200000
	jne	.LBB0_8
# %bb.7:
	movq	224(%rsp), %rax                 # 8-byte Reload
	addq	%rax, %rax
	movq	%rax, 224(%rsp)                 # 8-byte Spill
	movq	192(%rsp), %rax                 # 8-byte Reload
	addq	%rax, %rax
	movq	%rax, 192(%rsp)                 # 8-byte Spill
	movq	272(%rsp), %rax                 # 8-byte Reload
	addq	%rax, %rax
	movq	%rax, 272(%rsp)                 # 8-byte Spill
	movq	208(%rsp), %rax                 # 8-byte Reload
	addq	%rax, %rax
	movq	%rax, 208(%rsp)                 # 8-byte Spill
	movq	200(%rsp), %rax                 # 8-byte Reload
	addq	%rax, %rax
	movq	%rax, 200(%rsp)                 # 8-byte Spill
	addq	%rdx, %rdx
	movq	%rdx, 240(%rsp)                 # 8-byte Spill
	jmp	.LBB0_9
.LBB0_14:
	movq	232(%rsp), %rcx                 # 8-byte Reload
	jmp	.LBB0_34
.LBB0_31:
	movq	200(%rsp), %rcx                 # 8-byte Reload
	movq	%rcx, %rax
	shrq	$63, %rax
	addq	%rcx, %rax
	sarq	%rax
	movq	%rax, 200(%rsp)                 # 8-byte Spill
	movq	%rdi, %rsi
.LBB0_32:
	movl	188(%rsp), %edx                 # 4-byte Reload
	movq	224(%rsp), %r8                  # 8-byte Reload
	movq	232(%rsp), %rcx                 # 8-byte Reload
	movq	248(%rsp), %r10                 # 8-byte Reload
	movq	216(%rsp), %r9                  # 8-byte Reload
	movq	192(%rsp), %rbx                 # 8-byte Reload
.LBB0_33:
	movq	240(%rsp), %r11                 # 8-byte Reload
.LBB0_34:
	movl	%r13d, %eax
	leaq	ftypes(%rip), %rbp
	movq	328(%rsp), %rdi                 # 8-byte Reload
	movq	%rdi, 176(%rsp)
	movq	336(%rsp), %rdi                 # 8-byte Reload
	movq	%rdi, 168(%rsp)
	movq	%rsi, 160(%rsp)
	movq	%r11, 152(%rsp)
	movq	256(%rsp), %rdi                 # 8-byte Reload
	movq	%rdi, 144(%rsp)
	movq	296(%rsp), %rdi                 # 8-byte Reload
	movq	%rdi, 136(%rsp)
	movq	%r15, 128(%rsp)
	movq	200(%rsp), %rdi                 # 8-byte Reload
	movq	%rdi, 120(%rsp)
	movq	280(%rsp), %rdi                 # 8-byte Reload
	movq	%rdi, 112(%rsp)
	movq	288(%rsp), %rdi                 # 8-byte Reload
	movq	%rdi, 104(%rsp)
	movq	320(%rsp), %rdi                 # 8-byte Reload
	movq	%rdi, 96(%rsp)
	movq	312(%rsp), %rdi                 # 8-byte Reload
	movq	%rdi, 88(%rsp)
	movq	208(%rsp), %rdi                 # 8-byte Reload
	movq	%rdi, 80(%rsp)
	movq	272(%rsp), %rdi                 # 8-byte Reload
	movq	%rdi, 72(%rsp)
	movq	304(%rsp), %rdi                 # 8-byte Reload
	movq	%rdi, 64(%rsp)
	movq	%rcx, 56(%rsp)
	movq	%r10, 48(%rsp)
	movq	%r14, 40(%rsp)
	movq	%rbx, 32(%rsp)
	movl	%r12d, %ecx
	callq	*(%rbp,%rax,8)
	nop
	addq	$824, %rsp                      # imm = 0x338
	popq	%rbx
	popq	%rbp
	popq	%rdi
	popq	%rsi
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
	retq
.LBB0_8:
	movq	216(%rsp), %rax                 # 8-byte Reload
	addq	%rax, %rax
	movq	%rax, 216(%rsp)                 # 8-byte Spill
	movq	192(%rsp), %rax                 # 8-byte Reload
	addq	%rax, %rax
	movq	%rax, 192(%rsp)                 # 8-byte Spill
	movq	208(%rsp), %rax                 # 8-byte Reload
	addq	%rax, %rax
	movq	%rax, 208(%rsp)                 # 8-byte Spill
	movq	280(%rsp), %rax                 # 8-byte Reload
	addq	%rax, %rax
	movq	%rax, 280(%rsp)                 # 8-byte Spill
	movq	200(%rsp), %rax                 # 8-byte Reload
	addq	%rax, %rax
	movq	%rax, 200(%rsp)                 # 8-byte Spill
	movq	256(%rsp), %rax                 # 8-byte Reload
	addq	%rax, %rax
	movq	%rax, 256(%rsp)                 # 8-byte Spill
	jmp	.LBB0_9
	.seh_endproc
                                        # -- End function
	.def	 bli_sgemm_ker_var2;
	.scl	2;
	.type	32;
	.endef
	.globl	bli_sgemm_ker_var2              # -- Begin function bli_sgemm_ker_var2
	.p2align	4, 0x90
bli_sgemm_ker_var2:                     # @bli_sgemm_ker_var2
.seh_proc bli_sgemm_ker_var2
# %bb.0:
	pushq	%rbp
	.seh_pushreg %rbp
	pushq	%r15
	.seh_pushreg %r15
	pushq	%r14
	.seh_pushreg %r14
	pushq	%r13
	.seh_pushreg %r13
	pushq	%r12
	.seh_pushreg %r12
	pushq	%rsi
	.seh_pushreg %rsi
	pushq	%rdi
	.seh_pushreg %rdi
	pushq	%rbx
	.seh_pushreg %rbx
	movl	$4728, %eax                     # imm = 0x1278
	callq	__chkstk
	subq	%rax, %rsp
	.seh_stackalloc 4728
	leaq	128(%rsp), %rbp
	.seh_setframe %rbp, 128
	movaps	%xmm6, 4576(%rbp)               # 16-byte Spill
	.seh_savexmm %xmm6, 4704
	.seh_endprologue
	andq	$-64, %rsp
	movq	4832(%rbp), %rax
	cmpb	$0, 1072(%rax)
	movq	4784(%rbp), %r14
	movl	$1, %eax
	movq	%r14, %rbx
	cmoveq	%rax, %rbx
	movq	%rbx, 96(%rsp)                  # 8-byte Spill
	cmoveq	4744(%rbp), %rax
	movq	%rax, 144(%rsp)                 # 8-byte Spill
	testq	%r8, %r8
	je	.LBB1_78
# %bb.1:
	testq	%r9, %r9
	je	.LBB1_78
# %bb.2:
	cmpq	$0, 4704(%rbp)
	je	.LBB1_78
# %bb.3:
	movq	%r9, 160(%rsp)                  # 8-byte Spill
	movq	%r8, 152(%rsp)                  # 8-byte Spill
	movl	%ecx, 104(%rsp)                 # 4-byte Spill
	movl	%edx, 112(%rsp)                 # 4-byte Spill
	movq	4832(%rbp), %rax
	movq	752(%rax), %rax
	movq	%rax, 248(%rsp)                 # 8-byte Spill
	movq	BLIS_ZERO+64(%rip), %rax
	movq	%rax, 376(%rsp)                 # 8-byte Spill
	testq	%r14, %r14
	jle	.LBB1_24
# %bb.4:
	movq	4744(%rbp), %rcx
	leaq	-8(%rcx), %rax
	movq	%rax, 120(%rsp)                 # 8-byte Spill
	movq	%rax, %rdx
	shrq	$3, %rdx
	addq	$1, %rdx
	cmpq	$7, %rcx
	seta	%al
	movq	96(%rsp), %r11                  # 8-byte Reload
	cmpq	$1, %r11
	sete	%bl
	andb	%al, %bl
	movb	%bl, 88(%rsp)                   # 1-byte Spill
	movq	%rcx, %rax
	andq	$-8, %rax
	movq	%rax, 176(%rsp)                 # 8-byte Spill
	movl	%edx, %eax
	andl	$3, %eax
	movq	%rax, 136(%rsp)                 # 8-byte Spill
	movl	%ecx, %r12d
	andl	$3, %r12d
	movq	144(%rsp), %rax                 # 8-byte Reload
	leaq	(,%rax,4), %r15
	movq	%r11, %r10
	shlq	$7, %r10
	andq	$-4, %rdx
	negq	%rdx
	movq	%rdx, 168(%rsp)                 # 8-byte Spill
	movq	%r11, %rsi
	shlq	$5, %rsi
	leaq	(,%r11,4), %r13
	movq	%r12, 128(%rsp)                 # 8-byte Spill
	negq	%r12
	shlq	$4, %r11
	leaq	576(%rsp), %r9
	xorps	%xmm0, %xmm0
	xorl	%r8d, %r8d
	jmp	.LBB1_5
	.p2align	4, 0x90
.LBB1_23:                               #   in Loop: Header=BB1_5 Depth=1
	addq	$1, %r8
	addq	%r15, %r9
	movq	4784(%rbp), %r14
	cmpq	%r14, %r8
	je	.LBB1_24
.LBB1_5:                                # =>This Loop Header: Depth=1
                                        #     Child Loop BB1_11 Depth 2
                                        #     Child Loop BB1_14 Depth 2
                                        #     Child Loop BB1_18 Depth 2
                                        #     Child Loop BB1_22 Depth 2
	cmpq	$0, 4744(%rbp)
	jle	.LBB1_23
# %bb.6:                                #   in Loop: Header=BB1_5 Depth=1
	cmpb	$0, 88(%rsp)                    # 1-byte Folded Reload
	je	.LBB1_7
# %bb.8:                                #   in Loop: Header=BB1_5 Depth=1
	cmpq	$24, 120(%rsp)                  # 8-byte Folded Reload
	jae	.LBB1_10
# %bb.9:                                #   in Loop: Header=BB1_5 Depth=1
	xorl	%edi, %edi
	jmp	.LBB1_12
	.p2align	4, 0x90
.LBB1_7:                                #   in Loop: Header=BB1_5 Depth=1
	xorl	%r14d, %r14d
	jmp	.LBB1_16
.LBB1_10:                               #   in Loop: Header=BB1_5 Depth=1
	movq	168(%rsp), %rax                 # 8-byte Reload
	movq	%r9, %rdx
	xorl	%edi, %edi
	.p2align	4, 0x90
.LBB1_11:                               #   Parent Loop BB1_5 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
	movups	%xmm0, (%rdx)
	movups	%xmm0, 16(%rdx)
	leaq	(%rdx,%rsi), %rcx
	movups	%xmm0, (%rdx,%rsi)
	movups	%xmm0, 16(%rdx,%rsi)
	leaq	(%rcx,%rsi), %rbx
	movups	%xmm0, (%rsi,%rcx)
	movups	%xmm0, 16(%rsi,%rcx)
	movups	%xmm0, (%rsi,%rbx)
	movups	%xmm0, 16(%rsi,%rbx)
	addq	$32, %rdi
	addq	%r10, %rdx
	addq	$4, %rax
	jne	.LBB1_11
.LBB1_12:                               #   in Loop: Header=BB1_5 Depth=1
	cmpq	$0, 136(%rsp)                   # 8-byte Folded Reload
	je	.LBB1_15
# %bb.13:                               #   in Loop: Header=BB1_5 Depth=1
	imulq	%r13, %rdi
	movq	136(%rsp), %rax                 # 8-byte Reload
	.p2align	4, 0x90
.LBB1_14:                               #   Parent Loop BB1_5 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
	movups	%xmm0, (%r9,%rdi)
	movups	%xmm0, 16(%r9,%rdi)
	addq	%rsi, %rdi
	addq	$-1, %rax
	jne	.LBB1_14
.LBB1_15:                               #   in Loop: Header=BB1_5 Depth=1
	movq	176(%rsp), %rax                 # 8-byte Reload
	movq	%rax, %r14
	cmpq	4744(%rbp), %rax
	je	.LBB1_23
.LBB1_16:                               #   in Loop: Header=BB1_5 Depth=1
	movq	%r14, %rax
	notq	%rax
	addq	4744(%rbp), %rax
	cmpq	$0, 128(%rsp)                   # 8-byte Folded Reload
	je	.LBB1_20
# %bb.17:                               #   in Loop: Header=BB1_5 Depth=1
	movq	96(%rsp), %rcx                  # 8-byte Reload
	imulq	%r14, %rcx
	leaq	(%r9,%rcx,4), %rcx
	xorl	%edx, %edx
	.p2align	4, 0x90
.LBB1_18:                               #   Parent Loop BB1_5 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
	movl	$0, (%rcx)
	addq	$-1, %rdx
	addq	%r13, %rcx
	cmpq	%rdx, %r12
	jne	.LBB1_18
# %bb.19:                               #   in Loop: Header=BB1_5 Depth=1
	subq	%rdx, %r14
.LBB1_20:                               #   in Loop: Header=BB1_5 Depth=1
	cmpq	$3, %rax
	jb	.LBB1_23
# %bb.21:                               #   in Loop: Header=BB1_5 Depth=1
	movq	4744(%rbp), %rax
	subq	%r14, %rax
	leaq	3(%r14), %rdx
	imulq	%r13, %rdx
	leaq	2(%r14), %rbx
	imulq	%r13, %rbx
	leaq	1(%r14), %rcx
	imulq	%r13, %rcx
	imulq	%r13, %r14
	movq	%r9, %rdi
	.p2align	4, 0x90
.LBB1_22:                               #   Parent Loop BB1_5 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
	movl	$0, (%rdi,%r14)
	movl	$0, (%rdi,%rcx)
	movl	$0, (%rdi,%rbx)
	movl	$0, (%rdi,%rdx)
	addq	%r11, %rdi
	addq	$-4, %rax
	jne	.LBB1_22
	jmp	.LBB1_23
.LBB1_24:
	movq	160(%rsp), %rax                 # 8-byte Reload
	movq	%rax, %rcx
	orq	%r14, %rcx
	shrq	$32, %rcx
	je	.LBB1_25
# %bb.26:
	cqto
	idivq	%r14
	movq	%rdx, 184(%rsp)                 # 8-byte Spill
	movq	%rax, %rdi
	jmp	.LBB1_27
.LBB1_25:
                                        # kill: def $eax killed $eax killed $rax
	xorl	%edx, %edx
	divl	%r14d
                                        # kill: def $edx killed $edx def $rdx
	movq	%rdx, 184(%rsp)                 # 8-byte Spill
	movl	%eax, %edi
.LBB1_27:
	movq	4744(%rbp), %r15
	movq	152(%rsp), %rax                 # 8-byte Reload
	movq	4848(%rbp), %r9
	movq	4776(%rbp), %rcx
	movq	4736(%rbp), %rbx
	movq	%rax, %rdx
	orq	%r15, %rdx
	shrq	$32, %rdx
	movl	104(%rsp), %esi                 # 4-byte Reload
	je	.LBB1_28
# %bb.29:
	cqto
	idivq	%r15
	movq	%rdx, %r13
	movq	%rax, %r12
	jmp	.LBB1_30
.LBB1_28:
                                        # kill: def $eax killed $eax killed $rax
	xorl	%edx, %edx
	divl	%r15d
	movl	%edx, %r13d
	movl	%eax, %r12d
.LBB1_30:
	cmpq	$1, 184(%rsp)                   # 8-byte Folded Reload
	movq	%rdi, %rdx
	sbbq	$-1, %rdx
	cmpq	$1, %r13
	sbbq	$-1, %r12
	movl	%esi, 520(%rsp)
	movl	112(%rsp), %eax                 # 4-byte Reload
	movl	%eax, 524(%rsp)
	movq	%rbx, 544(%rsp)
	movq	%rcx, 552(%rsp)
	movq	48(%r9), %rdi
	leaq	336(%rsp), %rax
	movq	%rax, 40(%rsp)
	leaq	512(%rsp), %rax
	movq	%rax, 32(%rsp)
	movl	$1, %r8d
	movq	%r9, %rcx
	movq	%rdx, 240(%rsp)                 # 8-byte Spill
	xorl	%r9d, %r9d
	callq	bli_thread_range_sub
	leaq	328(%rsp), %rax
	movq	%rax, 40(%rsp)
	leaq	504(%rsp), %rax
	movq	%rax, 32(%rsp)
	movl	$1, %r8d
	movq	%rdi, %rcx
	movq	%r12, %rdx
	xorl	%r9d, %r9d
	callq	bli_thread_range_sub
	movq	512(%rsp), %r9
	movq	336(%rsp), %r8
	cmpq	%r8, %r9
	jge	.LBB1_78
# %bb.31:
	movq	4824(%rbp), %rcx
	movq	4816(%rbp), %rax
	movq	4808(%rbp), %r11
	movq	%rax, %r10
	imulq	%r15, %r10
	movq	%rcx, %rdi
	movq	%rcx, %rbx
	imulq	%r14, %rdi
	addq	$-1, 240(%rsp)                  # 8-byte Folded Spill
	addq	$-1, %r12
	movq	%rax, %rcx
	xorq	$1, %rcx
	movq	96(%rsp), %rsi                  # 8-byte Reload
	movq	%rsi, %rdx
	xorq	$1, %rdx
	orq	%rcx, %rdx
	setne	160(%rsp)                       # 1-byte Folded Spill
	movq	%rsi, %rcx
	shlq	$5, %rcx
	movq	%rcx, 320(%rsp)                 # 8-byte Spill
	movq	%rsi, %rcx
	shlq	$6, %rcx
	movq	%rcx, 312(%rsp)                 # 8-byte Spill
	movq	%r9, %rcx
	imulq	%rbx, %rcx
	imulq	%r14, %rcx
	leaq	(%r11,%rcx,4), %rdx
	addq	$16, %rdx
	movq	%rdx, 192(%rsp)                 # 8-byte Spill
	leaq	(%r11,%rcx,4), %rbx
	movq	%rax, %rcx
	shlq	$4, %rcx
	movq	%rcx, 288(%rsp)                 # 8-byte Spill
	movq	%rsi, %rcx
	shlq	$4, %rcx
	movq	%rcx, 488(%rsp)                 # 8-byte Spill
	xorps	%xmm6, %xmm6
	movq	328(%rsp), %r11
	movq	144(%rsp), %rdx                 # 8-byte Reload
	leaq	(,%rdx,4), %rdx
	movq	%rdx, 176(%rsp)                 # 8-byte Spill
	movq	%rdi, 352(%rsp)                 # 8-byte Spill
	leaq	(,%rdi,4), %rdx
	movq	%rdx, 360(%rsp)                 # 8-byte Spill
	leaq	(,%r15,4), %rdx
	movq	%rdx, 344(%rsp)                 # 8-byte Spill
	movq	%r10, 384(%rsp)                 # 8-byte Spill
	leaq	(,%r10,4), %rdx
	movq	%rdx, 264(%rsp)                 # 8-byte Spill
	movq	4824(%rbp), %rcx
	leaq	(,%rcx,4), %rdx
	movq	%rdx, 168(%rsp)                 # 8-byte Spill
	movq	192(%rsp), %rdi                 # 8-byte Reload
	movq	%rbx, %r10
	leaq	(,%rax,8), %rdx
	movq	%rdx, 304(%rsp)                 # 8-byte Spill
	leaq	(,%rax,4), %rdx
	movq	%rdx, 224(%rsp)                 # 8-byte Spill
	leaq	(,%rsi,4), %rdx
	movq	%rdx, 216(%rsp)                 # 8-byte Spill
	leaq	(,%rsi,8), %rdx
	movq	%rdx, 296(%rsp)                 # 8-byte Spill
	movq	%r13, 400(%rsp)                 # 8-byte Spill
	movq	%r12, 392(%rsp)                 # 8-byte Spill
	jmp	.LBB1_32
	.p2align	4, 0x90
.LBB1_77:                               #   in Loop: Header=BB1_32 Depth=1
	addq	$1, %r9
	movq	192(%rsp), %rdi                 # 8-byte Reload
	movq	360(%rsp), %rdx                 # 8-byte Reload
	addq	%rdx, %rdi
	movq	368(%rsp), %r10                 # 8-byte Reload
	addq	%rdx, %r10
	cmpq	%r8, %r9
	jge	.LBB1_78
.LBB1_32:                               # =>This Loop Header: Depth=1
                                        #     Child Loop BB1_34 Depth 2
                                        #       Child Loop BB1_40 Depth 3
                                        #         Child Loop BB1_66 Depth 4
                                        #         Child Loop BB1_74 Depth 4
                                        #       Child Loop BB1_45 Depth 3
                                        #         Child Loop BB1_51 Depth 4
                                        #         Child Loop BB1_57 Depth 4
                                        #         Child Loop BB1_61 Depth 4
	cmpq	%r9, 240(%rsp)                  # 8-byte Folded Reload
	movq	184(%rsp), %rdx                 # 8-byte Reload
	movq	%rdx, %rcx
	cmovneq	%r14, %rcx
	testq	%rdx, %rdx
	cmoveq	%r14, %rcx
	movq	504(%rsp), %rbx
	cmpq	%r11, %rbx
	movq	%rdi, 192(%rsp)                 # 8-byte Spill
	movq	%r10, 368(%rsp)                 # 8-byte Spill
	jge	.LBB1_77
# %bb.33:                               #   in Loop: Header=BB1_32 Depth=1
	movq	%r9, %rsi
	movq	4792(%rbp), %rdx
	imulq	%rdx, %rsi
	movq	352(%rsp), %rax                 # 8-byte Reload
	imulq	%r9, %rax
	movq	4760(%rbp), %rdx
	leaq	(%rdx,%rsi,4), %rsi
	movq	4808(%rbp), %rdx
	leaq	(%rdx,%rax,4), %rdx
	movq	%rdx, 416(%rsp)                 # 8-byte Spill
	movq	4792(%rbp), %rax
	leaq	(%rsi,%rax,4), %rdx
	movq	%rsi, %rax
	movq	%rdx, 408(%rsp)                 # 8-byte Spill
	movq	344(%rsp), %rdx                 # 8-byte Reload
	imulq	%rbx, %rdx
	addq	$32, %rdx
	imulq	4816(%rbp), %rdx
	leaq	(%rdi,%rdx), %rsi
	movq	%rsi, 272(%rsp)                 # 8-byte Spill
	movq	264(%rsp), %rsi                 # 8-byte Reload
	imulq	%rbx, %rsi
	addq	%rsi, %rdi
	movq	%rdi, 200(%rsp)                 # 8-byte Spill
	addq	%r10, %rsi
	movq	%rsi, 208(%rsp)                 # 8-byte Spill
	addq	%r10, %rdx
	movq	%rdx, 280(%rsp)                 # 8-byte Spill
	movq	%rax, 256(%rsp)                 # 8-byte Spill
	movq	%rax, %rdx
	movq	%r9, 424(%rsp)                  # 8-byte Spill
	movq	%rcx, 120(%rsp)                 # 8-byte Spill
	jmp	.LBB1_34
	.p2align	4, 0x90
.LBB1_36:                               #   in Loop: Header=BB1_34 Depth=2
	movq	4832(%rbp), %rax
	movq	%rax, 72(%rsp)
	leaq	520(%rsp), %rax
	movq	%rax, 64(%rsp)
	movq	4824(%rbp), %rax
	movq	%rax, 56(%rsp)
	movq	4816(%rbp), %rax
	movq	%rax, 48(%rsp)
	movq	232(%rsp), %rax                 # 8-byte Reload
	movq	%rax, 40(%rsp)
	movq	4800(%rbp), %rax
	movq	%rax, 32(%rsp)
	movq	4704(%rbp), %rcx
	movq	4712(%rbp), %rdx
	movq	256(%rsp), %r9                  # 8-byte Reload
	callq	*248(%rsp)                      # 8-byte Folded Reload
	movq	120(%rsp), %rcx                 # 8-byte Reload
.LBB1_76:                               #   in Loop: Header=BB1_34 Depth=2
	movq	448(%rsp), %rbx                 # 8-byte Reload
	addq	$1, %rbx
	movq	328(%rsp), %r11
	movq	336(%rsp), %r8
	movq	264(%rsp), %rdx                 # 8-byte Reload
	addq	%rdx, 272(%rsp)                 # 8-byte Folded Spill
	addq	%rdx, 200(%rsp)                 # 8-byte Folded Spill
	addq	%rdx, 208(%rsp)                 # 8-byte Folded Spill
	addq	%rdx, 280(%rsp)                 # 8-byte Folded Spill
	cmpq	%r11, %rbx
	movq	4784(%rbp), %r14
	movq	4744(%rbp), %r15
	movq	400(%rsp), %r13                 # 8-byte Reload
	movq	392(%rsp), %r12                 # 8-byte Reload
	movq	424(%rsp), %r9                  # 8-byte Reload
	movq	440(%rsp), %rdx                 # 8-byte Reload
	jge	.LBB1_77
.LBB1_34:                               #   Parent Loop BB1_32 Depth=1
                                        # =>  This Loop Header: Depth=2
                                        #       Child Loop BB1_40 Depth 3
                                        #         Child Loop BB1_66 Depth 4
                                        #         Child Loop BB1_74 Depth 4
                                        #       Child Loop BB1_45 Depth 3
                                        #         Child Loop BB1_51 Depth 4
                                        #         Child Loop BB1_57 Depth 4
                                        #         Child Loop BB1_61 Depth 4
	movq	%rdx, %rax
	movq	%rbx, %rsi
	movq	4752(%rbp), %r10
	imulq	%r10, %rsi
	movq	%r12, %rdx
	movq	%r14, %r12
	movq	384(%rsp), %r14                 # 8-byte Reload
	imulq	%rbx, %r14
	cmpq	%rbx, %rdx
	movq	%r13, %rdi
	cmovneq	%r15, %rdi
	testq	%r13, %r13
	cmoveq	%r15, %rdi
	addq	$-1, %r11
	addq	$-1, %r8
	cmpq	%r9, %r8
	movq	408(%rsp), %rdx                 # 8-byte Reload
	cmoveq	4760(%rbp), %rdx
	cmpq	%rbx, %r11
	cmovneq	%rax, %rdx
	movq	4720(%rbp), %r9
	leaq	(%r9,%rsi,4), %r8
	leaq	(%r8,%r10,4), %rax
	cmoveq	%r9, %rax
	movq	%rax, 528(%rsp)
	movq	416(%rsp), %rax                 # 8-byte Reload
	leaq	(%rax,%r14,4), %rax
	movq	%rax, 232(%rsp)                 # 8-byte Spill
	movq	%rdx, 440(%rsp)                 # 8-byte Spill
	movq	%rdx, 536(%rsp)
	cmpq	%r12, %rcx
	movq	%rbx, 448(%rsp)                 # 8-byte Spill
	jne	.LBB1_37
# %bb.35:                               #   in Loop: Header=BB1_34 Depth=2
	cmpq	%r15, %rdi
	je	.LBB1_36
.LBB1_37:                               #   in Loop: Header=BB1_34 Depth=2
	movq	%rdi, 88(%rsp)                  # 8-byte Spill
	movq	4832(%rbp), %rax
	movq	%rax, 72(%rsp)
	leaq	520(%rsp), %rax
	movq	%rax, 64(%rsp)
	movq	144(%rsp), %rax                 # 8-byte Reload
	movq	%rax, 56(%rsp)
	movq	96(%rsp), %rax                  # 8-byte Reload
	movq	%rax, 48(%rsp)
	leaq	576(%rsp), %rax
	movq	%rax, 40(%rsp)
	movq	376(%rsp), %rax                 # 8-byte Reload
	movq	%rax, 32(%rsp)
	movq	4704(%rbp), %rcx
	movq	4712(%rbp), %rdx
	movq	256(%rsp), %r9                  # 8-byte Reload
	callq	*248(%rsp)                      # 8-byte Folded Reload
	movq	4800(%rbp), %rax
	movss	(%rax), %xmm0                   # xmm0 = mem[0],zero,zero,zero
	ucomiss	%xmm6, %xmm0
	jne	.LBB1_38
	jp	.LBB1_38
# %bb.43:                               #   in Loop: Header=BB1_34 Depth=2
	movq	120(%rsp), %rcx                 # 8-byte Reload
	testq	%rcx, %rcx
	movq	288(%rsp), %r9                  # 8-byte Reload
	jle	.LBB1_76
# %bb.44:                               #   in Loop: Header=BB1_34 Depth=2
	movq	88(%rsp), %rax                  # 8-byte Reload
	leaq	-8(%rax), %rcx
	shrq	$3, %rcx
	movq	%rcx, 480(%rsp)                 # 8-byte Spill
	addq	$1, %rcx
	movq	%rax, %rdx
	andq	$-8, %rdx
	movq	%rdx, 464(%rsp)                 # 8-byte Spill
	movl	%eax, %edx
	andl	$3, %edx
	movq	%rcx, %rax
	movq	%rcx, 472(%rsp)                 # 8-byte Spill
	andq	$-2, %rcx
	negq	%rcx
	movq	%rcx, 432(%rsp)                 # 8-byte Spill
	movq	%rdx, 496(%rsp)                 # 8-byte Spill
	negq	%rdx
	movq	%rdx, 456(%rsp)                 # 8-byte Spill
	leaq	576(%rsp), %rax
	movq	%rax, 104(%rsp)                 # 8-byte Spill
	movq	208(%rsp), %rax                 # 8-byte Reload
	movq	%rax, 112(%rsp)                 # 8-byte Spill
	movq	200(%rsp), %rdx                 # 8-byte Reload
	movq	280(%rsp), %r10                 # 8-byte Reload
	leaq	592(%rsp), %rax
	movq	%rax, 152(%rsp)                 # 8-byte Spill
	xorl	%eax, %eax
	movq	%rax, 128(%rsp)                 # 8-byte Spill
	jmp	.LBB1_45
	.p2align	4, 0x90
.LBB1_62:                               #   in Loop: Header=BB1_45 Depth=3
	movq	%rsi, %r9
	movq	128(%rsp), %rdi                 # 8-byte Reload
	addq	$1, %rdi
	movq	176(%rsp), %rax                 # 8-byte Reload
	addq	%rax, 152(%rsp)                 # 8-byte Folded Spill
	movq	168(%rsp), %rcx                 # 8-byte Reload
	addq	%rcx, %r10
	movq	136(%rsp), %rdx                 # 8-byte Reload
	addq	%rcx, %rdx
	addq	%rcx, 112(%rsp)                 # 8-byte Folded Spill
	addq	%rax, 104(%rsp)                 # 8-byte Folded Spill
	movq	120(%rsp), %rcx                 # 8-byte Reload
	movq	%rdi, %rax
	movq	%rdi, 128(%rsp)                 # 8-byte Spill
	cmpq	%rcx, %rdi
	je	.LBB1_76
.LBB1_45:                               #   Parent Loop BB1_32 Depth=1
                                        #     Parent Loop BB1_34 Depth=2
                                        # =>    This Loop Header: Depth=3
                                        #         Child Loop BB1_51 Depth 4
                                        #         Child Loop BB1_57 Depth 4
                                        #         Child Loop BB1_61 Depth 4
	movq	%rdx, 136(%rsp)                 # 8-byte Spill
	cmpq	$0, 88(%rsp)                    # 8-byte Folded Reload
	movq	%r9, %rsi
	jle	.LBB1_62
# %bb.46:                               #   in Loop: Header=BB1_45 Depth=3
	movq	88(%rsp), %r11                  # 8-byte Reload
	cmpq	$8, %r11
	setb	%al
	orb	160(%rsp), %al                  # 1-byte Folded Reload
	je	.LBB1_48
# %bb.47:                               #   in Loop: Header=BB1_45 Depth=3
	xorl	%r8d, %r8d
	movq	%r9, %rsi
	jmp	.LBB1_55
	.p2align	4, 0x90
.LBB1_48:                               #   in Loop: Header=BB1_45 Depth=3
	cmpq	$0, 480(%rsp)                   # 8-byte Folded Reload
	je	.LBB1_49
# %bb.50:                               #   in Loop: Header=BB1_45 Depth=3
	xorl	%eax, %eax
	movq	432(%rsp), %rdi                 # 8-byte Reload
	movq	152(%rsp), %rsi                 # 8-byte Reload
	xorl	%ebx, %ebx
	movq	320(%rsp), %rdx                 # 8-byte Reload
	movq	312(%rsp), %r9                  # 8-byte Reload
	movq	304(%rsp), %r8                  # 8-byte Reload
	movq	136(%rsp), %rcx                 # 8-byte Reload
	.p2align	4, 0x90
.LBB1_51:                               #   Parent Loop BB1_32 Depth=1
                                        #     Parent Loop BB1_34 Depth=2
                                        #       Parent Loop BB1_45 Depth=3
                                        # =>      This Inner Loop Header: Depth=4
	movups	-16(%rsi), %xmm0
	movups	(%rsi), %xmm1
	movups	%xmm0, -16(%rcx,%rax,8)
	movups	%xmm1, (%rcx,%rax,8)
	movups	-16(%rsi,%rdx), %xmm0
	movups	(%rsi,%rdx), %xmm1
	movups	%xmm0, (%r10,%rax,8)
	movups	%xmm1, 16(%r10,%rax,8)
	addq	$16, %rbx
	addq	%r9, %rsi
	addq	%r8, %rax
	addq	$2, %rdi
	jne	.LBB1_51
# %bb.52:                               #   in Loop: Header=BB1_45 Depth=3
	testb	$1, 472(%rsp)                   # 1-byte Folded Reload
	je	.LBB1_54
.LBB1_53:                               #   in Loop: Header=BB1_45 Depth=3
	movq	128(%rsp), %rdx                 # 8-byte Reload
	movq	%rdx, %rax
	imulq	144(%rsp), %rax                 # 8-byte Folded Reload
	imulq	4824(%rbp), %rdx
	leaq	(%rsp,%rax,4), %rax
	addq	$576, %rax                      # imm = 0x240
	movq	232(%rsp), %rcx                 # 8-byte Reload
	leaq	(%rcx,%rdx,4), %rdx
	movq	%rbx, %rdi
	imulq	96(%rsp), %rdi                  # 8-byte Folded Reload
	movups	(%rax,%rdi,4), %xmm0
	movups	16(%rax,%rdi,4), %xmm1
	imulq	4816(%rbp), %rbx
	movups	%xmm0, (%rdx,%rbx,4)
	movups	%xmm1, 16(%rdx,%rbx,4)
.LBB1_54:                               #   in Loop: Header=BB1_45 Depth=3
	movq	464(%rsp), %rax                 # 8-byte Reload
	movq	%rax, %r8
	cmpq	%rax, %r11
	movq	288(%rsp), %rsi                 # 8-byte Reload
	je	.LBB1_62
.LBB1_55:                               #   in Loop: Header=BB1_45 Depth=3
	movq	%rsi, %r14
	movq	%r8, %r9
	notq	%r9
	addq	88(%rsp), %r9                   # 8-byte Folded Reload
	cmpq	$0, 496(%rsp)                   # 8-byte Folded Reload
	je	.LBB1_59
# %bb.56:                               #   in Loop: Header=BB1_45 Depth=3
	movq	224(%rsp), %rcx                 # 8-byte Reload
	movq	%rcx, %rax
	imulq	%r8, %rax
	addq	112(%rsp), %rax                 # 8-byte Folded Reload
	movq	96(%rsp), %rdx                  # 8-byte Reload
	imulq	%r8, %rdx
	movq	104(%rsp), %rbx                 # 8-byte Reload
	leaq	(%rbx,%rdx,4), %rdx
	xorl	%edi, %edi
	movq	216(%rsp), %rsi                 # 8-byte Reload
	movq	456(%rsp), %rbx                 # 8-byte Reload
	.p2align	4, 0x90
.LBB1_57:                               #   Parent Loop BB1_32 Depth=1
                                        #     Parent Loop BB1_34 Depth=2
                                        #       Parent Loop BB1_45 Depth=3
                                        # =>      This Inner Loop Header: Depth=4
	movss	(%rdx), %xmm0                   # xmm0 = mem[0],zero,zero,zero
	movss	%xmm0, (%rax)
	addq	$-1, %rdi
	addq	%rcx, %rax
	addq	%rsi, %rdx
	cmpq	%rdi, %rbx
	jne	.LBB1_57
# %bb.58:                               #   in Loop: Header=BB1_45 Depth=3
	subq	%rdi, %r8
.LBB1_59:                               #   in Loop: Header=BB1_45 Depth=3
	cmpq	$3, %r9
	movq	488(%rsp), %rcx                 # 8-byte Reload
	movq	%r14, %rsi
	jb	.LBB1_62
# %bb.60:                               #   in Loop: Header=BB1_45 Depth=3
	movq	88(%rsp), %rdi                  # 8-byte Reload
	subq	%r8, %rdi
	leaq	3(%r8), %rbx
	movq	224(%rsp), %r9                  # 8-byte Reload
	movq	%r9, %rax
	imulq	%rbx, %rax
	movq	216(%rsp), %rdx                 # 8-byte Reload
	imulq	%rdx, %rbx
	leaq	2(%r8), %r14
	movq	%r9, %r12
	imulq	%r14, %r12
	imulq	%rdx, %r14
	leaq	1(%r8), %r11
	movq	%r9, %r13
	imulq	%r11, %r13
	imulq	%rdx, %r11
	imulq	%r8, %r9
	imulq	%rdx, %r8
	movq	104(%rsp), %r15                 # 8-byte Reload
	movq	112(%rsp), %rdx                 # 8-byte Reload
	.p2align	4, 0x90
.LBB1_61:                               #   Parent Loop BB1_32 Depth=1
                                        #     Parent Loop BB1_34 Depth=2
                                        #       Parent Loop BB1_45 Depth=3
                                        # =>      This Inner Loop Header: Depth=4
	movss	(%r15,%r8), %xmm0               # xmm0 = mem[0],zero,zero,zero
	movss	%xmm0, (%rdx,%r9)
	movss	(%r15,%r11), %xmm0              # xmm0 = mem[0],zero,zero,zero
	movss	%xmm0, (%rdx,%r13)
	movss	(%r15,%r14), %xmm0              # xmm0 = mem[0],zero,zero,zero
	movss	%xmm0, (%rdx,%r12)
	movss	(%r15,%rbx), %xmm0              # xmm0 = mem[0],zero,zero,zero
	movss	%xmm0, (%rdx,%rax)
	addq	%rsi, %rdx
	addq	%rcx, %r15
	addq	$-4, %rdi
	jne	.LBB1_61
	jmp	.LBB1_62
.LBB1_49:                               #   in Loop: Header=BB1_45 Depth=3
	xorl	%ebx, %ebx
	testb	$1, 472(%rsp)                   # 1-byte Folded Reload
	jne	.LBB1_53
	jmp	.LBB1_54
.LBB1_38:                               #   in Loop: Header=BB1_34 Depth=2
	movq	120(%rsp), %rcx                 # 8-byte Reload
	testq	%rcx, %rcx
	jle	.LBB1_76
# %bb.39:                               #   in Loop: Header=BB1_34 Depth=2
	movq	88(%rsp), %rax                  # 8-byte Reload
	leaq	-8(%rax), %rcx
	shrq	$3, %rcx
	movq	%rcx, 136(%rsp)                 # 8-byte Spill
	addq	$1, %rcx
	andq	$-8, %rax
	movq	%rax, 112(%rsp)                 # 8-byte Spill
	movaps	%xmm0, %xmm1
	shufps	$0, %xmm0, %xmm1                # xmm1 = xmm1[0,0],xmm0[0,0]
	movq	%rcx, %rax
	movq	%rcx, 128(%rsp)                 # 8-byte Spill
	andq	$-2, %rcx
	negq	%rcx
	movq	%rcx, 104(%rsp)                 # 8-byte Spill
	movq	208(%rsp), %r14                 # 8-byte Reload
	movq	200(%rsp), %r12                 # 8-byte Reload
	movq	272(%rsp), %rdx                 # 8-byte Reload
	leaq	576(%rsp), %r13
	xorl	%r9d, %r9d
	jmp	.LBB1_40
	.p2align	4, 0x90
.LBB1_75:                               #   in Loop: Header=BB1_40 Depth=3
	addq	$1, %r9
	addq	176(%rsp), %r13                 # 8-byte Folded Reload
	movq	168(%rsp), %rax                 # 8-byte Reload
	addq	%rax, %rdx
	addq	%rax, %r12
	addq	%rax, %r14
	movq	120(%rsp), %rcx                 # 8-byte Reload
	cmpq	%rcx, %r9
	je	.LBB1_76
.LBB1_40:                               #   Parent Loop BB1_32 Depth=1
                                        #     Parent Loop BB1_34 Depth=2
                                        # =>    This Loop Header: Depth=3
                                        #         Child Loop BB1_66 Depth 4
                                        #         Child Loop BB1_74 Depth 4
	cmpq	$0, 88(%rsp)                    # 8-byte Folded Reload
	movq	320(%rsp), %rsi                 # 8-byte Reload
	movq	312(%rsp), %rdi                 # 8-byte Reload
	movq	304(%rsp), %r15                 # 8-byte Reload
	movq	296(%rsp), %r11                 # 8-byte Reload
	jle	.LBB1_75
# %bb.41:                               #   in Loop: Header=BB1_40 Depth=3
	cmpq	$8, 88(%rsp)                    # 8-byte Folded Reload
	setb	%al
	movq	%r9, %rcx
	imulq	144(%rsp), %rcx                 # 8-byte Folded Reload
	movq	%r9, %rbx
	imulq	4824(%rbp), %rbx
	leaq	(%rsp,%rcx,4), %r10
	addq	$576, %r10                      # imm = 0x240
	movq	232(%rsp), %rcx                 # 8-byte Reload
	leaq	(%rcx,%rbx,4), %rcx
	orb	160(%rsp), %al                  # 1-byte Folded Reload
	je	.LBB1_63
# %bb.42:                               #   in Loop: Header=BB1_40 Depth=3
	xorl	%r8d, %r8d
	jmp	.LBB1_70
	.p2align	4, 0x90
.LBB1_63:                               #   in Loop: Header=BB1_40 Depth=3
	movq	%r10, %rax
	cmpq	$0, 136(%rsp)                   # 8-byte Folded Reload
	je	.LBB1_64
# %bb.65:                               #   in Loop: Header=BB1_40 Depth=3
	xorl	%r10d, %r10d
	movq	104(%rsp), %r11                 # 8-byte Reload
	movq	%r13, %r8
	xorl	%ebx, %ebx
	.p2align	4, 0x90
.LBB1_66:                               #   Parent Loop BB1_32 Depth=1
                                        #     Parent Loop BB1_34 Depth=2
                                        #       Parent Loop BB1_40 Depth=3
                                        # =>      This Inner Loop Header: Depth=4
	movups	(%r8), %xmm2
	movups	16(%r8), %xmm3
	movups	-16(%r12,%r10,8), %xmm4
	movups	(%r12,%r10,8), %xmm5
	mulps	%xmm1, %xmm4
	addps	%xmm2, %xmm4
	mulps	%xmm1, %xmm5
	addps	%xmm3, %xmm5
	movups	%xmm4, -16(%r12,%r10,8)
	movups	%xmm5, (%r12,%r10,8)
	movups	(%r8,%rsi), %xmm2
	movups	16(%r8,%rsi), %xmm3
	movups	-16(%rdx,%r10,8), %xmm4
	movups	(%rdx,%r10,8), %xmm5
	mulps	%xmm1, %xmm4
	addps	%xmm2, %xmm4
	mulps	%xmm1, %xmm5
	addps	%xmm3, %xmm5
	movups	%xmm4, -16(%rdx,%r10,8)
	movups	%xmm5, (%rdx,%r10,8)
	addq	$16, %rbx
	addq	%rdi, %r8
	addq	%r15, %r10
	addq	$2, %r11
	jne	.LBB1_66
	jmp	.LBB1_67
.LBB1_64:                               #   in Loop: Header=BB1_40 Depth=3
	xorl	%ebx, %ebx
.LBB1_67:                               #   in Loop: Header=BB1_40 Depth=3
	testb	$1, 128(%rsp)                   # 1-byte Folded Reload
	movq	%rax, %r10
	je	.LBB1_69
# %bb.68:                               #   in Loop: Header=BB1_40 Depth=3
	movq	%rbx, %rax
	imulq	96(%rsp), %rax                  # 8-byte Folded Reload
	movups	(%r10,%rax,4), %xmm2
	movups	16(%r10,%rax,4), %xmm3
	imulq	4816(%rbp), %rbx
	movups	(%rcx,%rbx,4), %xmm4
	movups	16(%rcx,%rbx,4), %xmm5
	mulps	%xmm1, %xmm4
	addps	%xmm2, %xmm4
	mulps	%xmm1, %xmm5
	addps	%xmm3, %xmm5
	movups	%xmm4, (%rcx,%rbx,4)
	movups	%xmm5, 16(%rcx,%rbx,4)
.LBB1_69:                               #   in Loop: Header=BB1_40 Depth=3
	movq	112(%rsp), %rax                 # 8-byte Reload
	movq	%rax, %r8
	cmpq	%rax, 88(%rsp)                  # 8-byte Folded Reload
	movq	296(%rsp), %r11                 # 8-byte Reload
	je	.LBB1_75
.LBB1_70:                               #   in Loop: Header=BB1_40 Depth=3
	movq	%r8, %rax
	orq	$1, %rax
	testb	$1, 88(%rsp)                    # 1-byte Folded Reload
	je	.LBB1_72
# %bb.71:                               #   in Loop: Header=BB1_40 Depth=3
	movq	%rcx, %rsi
	movq	%r8, %rcx
	imulq	96(%rsp), %rcx                  # 8-byte Folded Reload
	imulq	4816(%rbp), %r8
	movss	(%rsi,%r8,4), %xmm2             # xmm2 = mem[0],zero,zero,zero
	mulss	%xmm0, %xmm2
	addss	(%r10,%rcx,4), %xmm2
	movss	%xmm2, (%rsi,%r8,4)
	movq	%rax, %r8
.LBB1_72:                               #   in Loop: Header=BB1_40 Depth=3
	cmpq	%rax, 88(%rsp)                  # 8-byte Folded Reload
	je	.LBB1_75
# %bb.73:                               #   in Loop: Header=BB1_40 Depth=3
	movq	88(%rsp), %rdi                  # 8-byte Reload
	subq	%r8, %rdi
	leaq	1(%r8), %r10
	movq	224(%rsp), %rbx                 # 8-byte Reload
	movq	%rbx, %rcx
	imulq	%r10, %rcx
	movq	216(%rsp), %rax                 # 8-byte Reload
	imulq	%rax, %r10
	imulq	%r8, %rbx
	imulq	%rax, %r8
	movq	%r13, %rsi
	movq	%r14, %rax
	.p2align	4, 0x90
.LBB1_74:                               #   Parent Loop BB1_32 Depth=1
                                        #     Parent Loop BB1_34 Depth=2
                                        #       Parent Loop BB1_40 Depth=3
                                        # =>      This Inner Loop Header: Depth=4
	movss	(%rax,%rbx), %xmm2              # xmm2 = mem[0],zero,zero,zero
	mulss	%xmm0, %xmm2
	addss	(%rsi,%r8), %xmm2
	movss	%xmm2, (%rax,%rbx)
	movss	(%rax,%rcx), %xmm2              # xmm2 = mem[0],zero,zero,zero
	mulss	%xmm0, %xmm2
	addss	(%rsi,%r10), %xmm2
	movss	%xmm2, (%rax,%rcx)
	addq	%r15, %rax
	addq	%r11, %rsi
	addq	$-2, %rdi
	jne	.LBB1_74
	jmp	.LBB1_75
.LBB1_78:
	movaps	4576(%rbp), %xmm6               # 16-byte Reload
	leaq	4600(%rbp), %rsp
	popq	%rbx
	popq	%rdi
	popq	%rsi
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
	popq	%rbp
	retq
	.seh_endproc
                                        # -- End function
	.def	 bli_dgemm_ker_var2;
	.scl	2;
	.type	32;
	.endef
	.globl	bli_dgemm_ker_var2              # -- Begin function bli_dgemm_ker_var2
	.p2align	4, 0x90
bli_dgemm_ker_var2:                     # @bli_dgemm_ker_var2
.seh_proc bli_dgemm_ker_var2
# %bb.0:
	pushq	%rbp
	.seh_pushreg %rbp
	pushq	%r15
	.seh_pushreg %r15
	pushq	%r14
	.seh_pushreg %r14
	pushq	%r13
	.seh_pushreg %r13
	pushq	%r12
	.seh_pushreg %r12
	pushq	%rsi
	.seh_pushreg %rsi
	pushq	%rdi
	.seh_pushreg %rdi
	pushq	%rbx
	.seh_pushreg %rbx
	movl	$4728, %eax                     # imm = 0x1278
	callq	__chkstk
	subq	%rax, %rsp
	.seh_stackalloc 4728
	leaq	128(%rsp), %rbp
	.seh_setframe %rbp, 128
	movapd	%xmm6, 4576(%rbp)               # 16-byte Spill
	.seh_savexmm %xmm6, 4704
	.seh_endprologue
	andq	$-64, %rsp
	movq	4832(%rbp), %rax
	cmpb	$0, 1074(%rax)
	movq	4784(%rbp), %rdi
	movl	$1, %eax
	movq	%rdi, %rbx
	cmoveq	%rax, %rbx
	movq	%rbx, 104(%rsp)                 # 8-byte Spill
	cmoveq	4744(%rbp), %rax
	movq	%rax, 136(%rsp)                 # 8-byte Spill
	testq	%r8, %r8
	je	.LBB2_78
# %bb.1:
	testq	%r9, %r9
	je	.LBB2_78
# %bb.2:
	cmpq	$0, 4704(%rbp)
	je	.LBB2_78
# %bb.3:
	movq	%r9, 144(%rsp)                  # 8-byte Spill
	movq	%r8, 152(%rsp)                  # 8-byte Spill
	movl	%ecx, 168(%rsp)                 # 4-byte Spill
	movl	%edx, 160(%rsp)                 # 4-byte Spill
	movq	4832(%rbp), %rax
	movq	768(%rax), %rax
	movq	%rax, 240(%rsp)                 # 8-byte Spill
	movq	BLIS_ZERO+64(%rip), %rax
	movq	%rax, 232(%rsp)                 # 8-byte Spill
	testq	%rdi, %rdi
	jle	.LBB2_24
# %bb.4:
	movq	4744(%rbp), %rdi
	leaq	-4(%rdi), %rax
	movq	%rax, 120(%rsp)                 # 8-byte Spill
	movq	%rax, %rbx
	shrq	$2, %rbx
	addq	$1, %rbx
	cmpq	$3, %rdi
	seta	%al
	movq	104(%rsp), %rcx                 # 8-byte Reload
	cmpq	$1, %rcx
	sete	%dl
	andb	%al, %dl
	movb	%dl, 88(%rsp)                   # 1-byte Spill
	movq	%rdi, %r8
	andq	$-4, %r8
	movl	%ebx, %eax
	andl	$3, %eax
	movq	%rax, 96(%rsp)                  # 8-byte Spill
	movl	%edi, %r12d
	andl	$3, %r12d
	movq	136(%rsp), %rax                 # 8-byte Reload
	leaq	(,%rax,8), %r14
	movq	%rcx, %r15
	shlq	$7, %r15
	andq	$-4, %rbx
	negq	%rbx
	movq	%rbx, 112(%rsp)                 # 8-byte Spill
	movq	%rcx, %rsi
	shlq	$5, %rsi
	leaq	(,%rcx,8), %r13
	movq	%r12, 128(%rsp)                 # 8-byte Spill
	negq	%r12
	leaq	576(%rsp), %r11
	xorl	%r10d, %r10d
	xorpd	%xmm0, %xmm0
	jmp	.LBB2_5
	.p2align	4, 0x90
.LBB2_23:                               #   in Loop: Header=BB2_5 Depth=1
	addq	$1, %r10
	addq	%r14, %r11
	movq	4784(%rbp), %rdi
	cmpq	%rdi, %r10
	je	.LBB2_24
.LBB2_5:                                # =>This Loop Header: Depth=1
                                        #     Child Loop BB2_11 Depth 2
                                        #     Child Loop BB2_14 Depth 2
                                        #     Child Loop BB2_18 Depth 2
                                        #     Child Loop BB2_22 Depth 2
	cmpq	$0, 4744(%rbp)
	jle	.LBB2_23
# %bb.6:                                #   in Loop: Header=BB2_5 Depth=1
	cmpb	$0, 88(%rsp)                    # 1-byte Folded Reload
	je	.LBB2_7
# %bb.8:                                #   in Loop: Header=BB2_5 Depth=1
	cmpq	$12, 120(%rsp)                  # 8-byte Folded Reload
	jae	.LBB2_10
# %bb.9:                                #   in Loop: Header=BB2_5 Depth=1
	xorl	%edi, %edi
	jmp	.LBB2_12
	.p2align	4, 0x90
.LBB2_7:                                #   in Loop: Header=BB2_5 Depth=1
	xorl	%r9d, %r9d
	jmp	.LBB2_16
.LBB2_10:                               #   in Loop: Header=BB2_5 Depth=1
	movq	112(%rsp), %rax                 # 8-byte Reload
	movq	%r11, %rdx
	xorl	%edi, %edi
	.p2align	4, 0x90
.LBB2_11:                               #   Parent Loop BB2_5 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
	movupd	%xmm0, (%rdx)
	movupd	%xmm0, 16(%rdx)
	leaq	(%rdx,%rsi), %rcx
	movupd	%xmm0, (%rdx,%rsi)
	movupd	%xmm0, 16(%rdx,%rsi)
	leaq	(%rcx,%rsi), %rbx
	movupd	%xmm0, (%rsi,%rcx)
	movupd	%xmm0, 16(%rsi,%rcx)
	movupd	%xmm0, (%rsi,%rbx)
	movupd	%xmm0, 16(%rsi,%rbx)
	addq	$16, %rdi
	addq	%r15, %rdx
	addq	$4, %rax
	jne	.LBB2_11
.LBB2_12:                               #   in Loop: Header=BB2_5 Depth=1
	cmpq	$0, 96(%rsp)                    # 8-byte Folded Reload
	je	.LBB2_15
# %bb.13:                               #   in Loop: Header=BB2_5 Depth=1
	imulq	%r13, %rdi
	movq	96(%rsp), %rax                  # 8-byte Reload
	.p2align	4, 0x90
.LBB2_14:                               #   Parent Loop BB2_5 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
	movupd	%xmm0, (%r11,%rdi)
	movupd	%xmm0, 16(%r11,%rdi)
	addq	%rsi, %rdi
	addq	$-1, %rax
	jne	.LBB2_14
.LBB2_15:                               #   in Loop: Header=BB2_5 Depth=1
	movq	%r8, %r9
	cmpq	4744(%rbp), %r8
	je	.LBB2_23
.LBB2_16:                               #   in Loop: Header=BB2_5 Depth=1
	movq	%r9, %rax
	notq	%rax
	addq	4744(%rbp), %rax
	cmpq	$0, 128(%rsp)                   # 8-byte Folded Reload
	je	.LBB2_20
# %bb.17:                               #   in Loop: Header=BB2_5 Depth=1
	movq	104(%rsp), %rcx                 # 8-byte Reload
	imulq	%r9, %rcx
	leaq	(%r11,%rcx,8), %rcx
	xorl	%edx, %edx
	.p2align	4, 0x90
.LBB2_18:                               #   Parent Loop BB2_5 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
	movq	$0, (%rcx)
	addq	$-1, %rdx
	addq	%r13, %rcx
	cmpq	%rdx, %r12
	jne	.LBB2_18
# %bb.19:                               #   in Loop: Header=BB2_5 Depth=1
	subq	%rdx, %r9
.LBB2_20:                               #   in Loop: Header=BB2_5 Depth=1
	cmpq	$3, %rax
	jb	.LBB2_23
# %bb.21:                               #   in Loop: Header=BB2_5 Depth=1
	movq	4744(%rbp), %rax
	subq	%r9, %rax
	leaq	3(%r9), %rdx
	imulq	%r13, %rdx
	leaq	2(%r9), %rcx
	imulq	%r13, %rcx
	leaq	1(%r9), %rbx
	imulq	%r13, %rbx
	imulq	%r13, %r9
	movq	%r11, %rdi
	.p2align	4, 0x90
.LBB2_22:                               #   Parent Loop BB2_5 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
	movq	$0, (%rdi,%r9)
	movq	$0, (%rdi,%rbx)
	movq	$0, (%rdi,%rcx)
	movq	$0, (%rdi,%rdx)
	addq	%rsi, %rdi
	addq	$-4, %rax
	jne	.LBB2_22
	jmp	.LBB2_23
.LBB2_24:
	movq	144(%rsp), %rax                 # 8-byte Reload
	movq	%rax, %rcx
	orq	%rdi, %rcx
	shrq	$32, %rcx
	je	.LBB2_25
# %bb.26:
	cqto
	idivq	%rdi
	movq	%rdx, 176(%rsp)                 # 8-byte Spill
	movq	%rax, %rdi
	jmp	.LBB2_27
.LBB2_25:
                                        # kill: def $eax killed $eax killed $rax
	xorl	%edx, %edx
	divl	%edi
                                        # kill: def $edx killed $edx def $rdx
	movq	%rdx, 176(%rsp)                 # 8-byte Spill
	movl	%eax, %edi
.LBB2_27:
	movq	4744(%rbp), %r14
	movq	152(%rsp), %rax                 # 8-byte Reload
	movq	4848(%rbp), %r9
	movq	4776(%rbp), %r8
	movq	4736(%rbp), %r10
	movq	%rax, %rcx
	orq	%r14, %rcx
	shrq	$32, %rcx
	movl	160(%rsp), %esi                 # 4-byte Reload
	movl	168(%rsp), %ecx                 # 4-byte Reload
	je	.LBB2_28
# %bb.29:
	cqto
	idivq	%r14
	movq	%rdx, %r15
	movq	%rax, %rbx
	jmp	.LBB2_30
.LBB2_28:
                                        # kill: def $eax killed $eax killed $rax
	xorl	%edx, %edx
	divl	%r14d
	movl	%edx, %r15d
	movl	%eax, %ebx
.LBB2_30:
	cmpq	$1, 176(%rsp)                   # 8-byte Folded Reload
	sbbq	$-1, %rdi
	cmpq	$1, %r15
	sbbq	$-1, %rbx
	movl	%ecx, 520(%rsp)
	movl	%esi, 524(%rsp)
	movq	%r10, 544(%rsp)
	movq	%r8, 552(%rsp)
	movq	48(%r9), %rsi
	leaq	336(%rsp), %rax
	movq	%rax, 40(%rsp)
	leaq	512(%rsp), %rax
	movq	%rax, 32(%rsp)
	movl	$1, %r8d
	movq	%r9, %rcx
	movq	%rdi, 216(%rsp)                 # 8-byte Spill
	movq	%rdi, %rdx
	xorl	%r9d, %r9d
	callq	bli_thread_range_sub
	leaq	328(%rsp), %rax
	movq	%rax, 40(%rsp)
	leaq	504(%rsp), %rax
	movq	%rax, 32(%rsp)
	movl	$1, %r8d
	movq	%rsi, %rcx
	movq	%rbx, %rdx
	xorl	%r9d, %r9d
	callq	bli_thread_range_sub
	movq	512(%rsp), %r8
	movq	336(%rsp), %r10
	cmpq	%r10, %r8
	movq	4784(%rbp), %r11
	jge	.LBB2_78
# %bb.31:
	movq	4824(%rbp), %rcx
	movq	4816(%rbp), %r12
	movq	4808(%rbp), %r13
	addq	$8, 232(%rsp)                   # 8-byte Folded Spill
	movq	%rbx, %rax
	movq	%r12, %rsi
	imulq	%r14, %rsi
	movq	%rcx, %rdi
	movq	%rcx, %r9
	imulq	%r11, %rdi
	addq	$-1, 216(%rsp)                  # 8-byte Folded Spill
	addq	$-1, %rax
	movq	%r12, %rcx
	xorq	$1, %rcx
	movq	104(%rsp), %rbx                 # 8-byte Reload
	movq	%rbx, %rdx
	xorq	$1, %rdx
	orq	%rcx, %rdx
	setne	87(%rsp)                        # 1-byte Folded Spill
	movq	%rbx, %rcx
	shlq	$5, %rcx
	movq	%rcx, 144(%rsp)                 # 8-byte Spill
	movq	%rbx, %rcx
	shlq	$6, %rcx
	movq	%rcx, 320(%rsp)                 # 8-byte Spill
	movq	%r8, %rcx
	imulq	%r9, %rcx
	imulq	%r11, %rcx
	leaq	16(,%rcx,8), %r9
	addq	%r13, %r9
	movq	%r12, %rdx
	shlq	$4, %rdx
	movq	%rdx, 312(%rsp)                 # 8-byte Spill
	leaq	(,%rcx,8), %rdx
	addq	%r13, %rdx
	movq	%rdx, 224(%rsp)                 # 8-byte Spill
	movq	%rbx, %rdx
	shlq	$4, %rdx
	movq	%rdx, 304(%rsp)                 # 8-byte Spill
	addq	%rcx, %rcx
	addq	$4, %rcx
	movq	%rcx, 296(%rsp)                 # 8-byte Spill
	movq	%r12, %rcx
	shlq	$5, %rcx
	movq	%rcx, 488(%rsp)                 # 8-byte Spill
	xorpd	%xmm6, %xmm6
	movq	328(%rsp), %r13
	movq	136(%rsp), %rdx                 # 8-byte Reload
	leaq	(,%rdx,8), %rdx
	movq	%rdx, 168(%rsp)                 # 8-byte Spill
	leaq	(,%rdi,8), %rdx
	movq	%rdx, 368(%rsp)                 # 8-byte Spill
	leaq	(,%r14,8), %rdx
	movq	%rdx, 344(%rsp)                 # 8-byte Spill
	movq	%rsi, 384(%rsp)                 # 8-byte Spill
	leaq	(,%rsi,8), %rdx
	movq	%rax, %rsi
	movq	%rdx, 256(%rsp)                 # 8-byte Spill
	movq	4824(%rbp), %rax
	leaq	(,%rax,8), %rdx
	movq	%rdx, 152(%rsp)                 # 8-byte Spill
	leaq	(,%r12,8), %rdx
	movq	%rdx, 200(%rsp)                 # 8-byte Spill
	leaq	(,%rbx,8), %rdx
	movq	%rdx, 192(%rsp)                 # 8-byte Spill
	movq	%rdi, 352(%rsp)                 # 8-byte Spill
	leaq	(%rdi,%rdi), %rdx
	movq	%rdx, 360(%rsp)                 # 8-byte Spill
	movq	%r15, 400(%rsp)                 # 8-byte Spill
	movq	%rsi, 392(%rsp)                 # 8-byte Spill
	jmp	.LBB2_32
	.p2align	4, 0x90
.LBB2_77:                               #   in Loop: Header=BB2_32 Depth=1
	addq	$1, %r8
	movq	376(%rsp), %r9                  # 8-byte Reload
	movq	368(%rsp), %rdx                 # 8-byte Reload
	addq	%rdx, %r9
	addq	%rdx, 224(%rsp)                 # 8-byte Folded Spill
	movq	360(%rsp), %rdx                 # 8-byte Reload
	addq	%rdx, 296(%rsp)                 # 8-byte Folded Spill
	cmpq	%r10, %r8
	jge	.LBB2_78
.LBB2_32:                               # =>This Loop Header: Depth=1
                                        #     Child Loop BB2_34 Depth 2
                                        #       Child Loop BB2_40 Depth 3
                                        #         Child Loop BB2_66 Depth 4
                                        #         Child Loop BB2_74 Depth 4
                                        #       Child Loop BB2_45 Depth 3
                                        #         Child Loop BB2_51 Depth 4
                                        #         Child Loop BB2_57 Depth 4
                                        #         Child Loop BB2_61 Depth 4
	cmpq	%r8, 216(%rsp)                  # 8-byte Folded Reload
	movq	176(%rsp), %rdx                 # 8-byte Reload
	movq	%rdx, %r12
	cmovneq	%r11, %r12
	testq	%rdx, %rdx
	cmoveq	%r11, %r12
	movq	504(%rsp), %rcx
	cmpq	%r13, %rcx
	movq	%r9, 376(%rsp)                  # 8-byte Spill
	jge	.LBB2_77
# %bb.33:                               #   in Loop: Header=BB2_32 Depth=1
	movq	%r8, %rax
	movq	4792(%rbp), %rdi
	imulq	%rdi, %rax
	movq	352(%rsp), %rbx                 # 8-byte Reload
	imulq	%r8, %rbx
	movq	4760(%rbp), %rdx
	leaq	(%rdx,%rax,8), %rax
	movq	4808(%rbp), %rdx
	leaq	(%rdx,%rbx,8), %rbx
	movq	%rbx, 416(%rsp)                 # 8-byte Spill
	leaq	(%rax,%rdi,8), %rbx
	movq	%rbx, 408(%rsp)                 # 8-byte Spill
	movq	344(%rsp), %rdi                 # 8-byte Reload
	imulq	%rcx, %rdi
	addq	$32, %rdi
	imulq	4816(%rbp), %rdi
	leaq	(%r9,%rdi), %rbx
	movq	%rbx, 272(%rsp)                 # 8-byte Spill
	movq	256(%rsp), %rdx                 # 8-byte Reload
	imulq	%rcx, %rdx
	leaq	(%r9,%rdx), %rbx
	movq	%rbx, 264(%rsp)                 # 8-byte Spill
	movq	224(%rsp), %rbx                 # 8-byte Reload
	addq	%rdx, %rbx
	movq	%rbx, 184(%rsp)                 # 8-byte Spill
	movq	4808(%rbp), %rbx
	addq	%rbx, %rdi
	movq	%rdi, 288(%rsp)                 # 8-byte Spill
	addq	%rbx, %rdx
	movq	%rdx, 280(%rsp)                 # 8-byte Spill
	movq	%rax, 248(%rsp)                 # 8-byte Spill
	movq	%rax, %rdx
	movq	%r8, 424(%rsp)                  # 8-byte Spill
	movq	%r12, 160(%rsp)                 # 8-byte Spill
	jmp	.LBB2_34
	.p2align	4, 0x90
.LBB2_36:                               #   in Loop: Header=BB2_34 Depth=2
	movq	4832(%rbp), %rax
	movq	%rax, 72(%rsp)
	leaq	520(%rsp), %rax
	movq	%rax, 64(%rsp)
	movq	4824(%rbp), %rax
	movq	%rax, 56(%rsp)
	movq	4816(%rbp), %rax
	movq	%rax, 48(%rsp)
	movq	208(%rsp), %rax                 # 8-byte Reload
	movq	%rax, 40(%rsp)
	movq	4800(%rbp), %rax
	movq	%rax, 32(%rsp)
	movq	4704(%rbp), %rcx
	movq	4712(%rbp), %rdx
	movq	248(%rsp), %r9                  # 8-byte Reload
	callq	*240(%rsp)                      # 8-byte Folded Reload
.LBB2_76:                               #   in Loop: Header=BB2_34 Depth=2
	movq	448(%rsp), %rcx                 # 8-byte Reload
	addq	$1, %rcx
	movq	328(%rsp), %r13
	movq	336(%rsp), %r10
	movq	256(%rsp), %rdx                 # 8-byte Reload
	addq	%rdx, 272(%rsp)                 # 8-byte Folded Spill
	addq	%rdx, 264(%rsp)                 # 8-byte Folded Spill
	addq	%rdx, 184(%rsp)                 # 8-byte Folded Spill
	addq	%rdx, 288(%rsp)                 # 8-byte Folded Spill
	addq	%rdx, 280(%rsp)                 # 8-byte Folded Spill
	cmpq	%r13, %rcx
	movq	4784(%rbp), %r11
	movq	4744(%rbp), %r14
	movq	400(%rsp), %r15                 # 8-byte Reload
	movq	392(%rsp), %rsi                 # 8-byte Reload
	movq	424(%rsp), %r8                  # 8-byte Reload
	movq	160(%rsp), %r12                 # 8-byte Reload
	movq	440(%rsp), %rdx                 # 8-byte Reload
	jge	.LBB2_77
.LBB2_34:                               #   Parent Loop BB2_32 Depth=1
                                        # =>  This Loop Header: Depth=2
                                        #       Child Loop BB2_40 Depth 3
                                        #         Child Loop BB2_66 Depth 4
                                        #         Child Loop BB2_74 Depth 4
                                        #       Child Loop BB2_45 Depth 3
                                        #         Child Loop BB2_51 Depth 4
                                        #         Child Loop BB2_57 Depth 4
                                        #         Child Loop BB2_61 Depth 4
	movq	%rdx, %rax
	movq	%rcx, %rbx
	movq	4752(%rbp), %r9
	imulq	%r9, %rbx
	movq	384(%rsp), %rdi                 # 8-byte Reload
	imulq	%rcx, %rdi
	cmpq	%rcx, %rsi
	movq	%r15, %rsi
	cmovneq	%r14, %rsi
	testq	%r15, %r15
	cmoveq	%r14, %rsi
	addq	$-1, %r13
	addq	$-1, %r10
	cmpq	%r8, %r10
	movq	408(%rsp), %rdx                 # 8-byte Reload
	cmoveq	4760(%rbp), %rdx
	cmpq	%rcx, %r13
	cmovneq	%rax, %rdx
	movq	4720(%rbp), %r10
	leaq	(%r10,%rbx,8), %r8
	leaq	(%r8,%r9,8), %rax
	cmoveq	%r10, %rax
	movq	%rax, 528(%rsp)
	movq	416(%rsp), %rax                 # 8-byte Reload
	leaq	(%rax,%rdi,8), %rax
	movq	%rax, 208(%rsp)                 # 8-byte Spill
	movq	%rdx, 536(%rsp)
	cmpq	%r11, %r12
	movq	%rcx, 448(%rsp)                 # 8-byte Spill
	movq	%rdx, 440(%rsp)                 # 8-byte Spill
	jne	.LBB2_37
# %bb.35:                               #   in Loop: Header=BB2_34 Depth=2
	cmpq	%r14, %rsi
	je	.LBB2_36
.LBB2_37:                               #   in Loop: Header=BB2_34 Depth=2
	movq	4832(%rbp), %rax
	movq	%rax, 72(%rsp)
	leaq	520(%rsp), %rax
	movq	%rax, 64(%rsp)
	movq	136(%rsp), %rax                 # 8-byte Reload
	movq	%rax, 56(%rsp)
	movq	104(%rsp), %rax                 # 8-byte Reload
	movq	%rax, 48(%rsp)
	leaq	576(%rsp), %rax
	movq	%rax, 40(%rsp)
	movq	232(%rsp), %rax                 # 8-byte Reload
	movq	%rax, 32(%rsp)
	movq	4704(%rbp), %rcx
	movq	4712(%rbp), %rdx
	movq	248(%rsp), %r9                  # 8-byte Reload
	callq	*240(%rsp)                      # 8-byte Folded Reload
	movq	4800(%rbp), %rax
	movsd	(%rax), %xmm0                   # xmm0 = mem[0],zero
	ucomisd	%xmm6, %xmm0
	movq	%rsi, %rax
	movq	%rsi, 88(%rsp)                  # 8-byte Spill
	jne	.LBB2_38
	jp	.LBB2_38
# %bb.43:                               #   in Loop: Header=BB2_34 Depth=2
	testq	%r12, %r12
	jle	.LBB2_76
# %bb.44:                               #   in Loop: Header=BB2_34 Depth=2
	leaq	-4(%rsi), %rax
	shrq	$2, %rax
	movq	%rax, 480(%rsp)                 # 8-byte Spill
	leaq	1(%rax), %rcx
	movq	%rsi, %rax
	andq	$-4, %rax
	movq	%rax, 464(%rsp)                 # 8-byte Spill
	movl	%esi, %edx
	andl	$3, %edx
	movq	%rcx, %rax
	movq	%rcx, 472(%rsp)                 # 8-byte Spill
	andq	$-2, %rcx
	negq	%rcx
	movq	%rcx, 432(%rsp)                 # 8-byte Spill
	movq	%rdx, 496(%rsp)                 # 8-byte Spill
	negq	%rdx
	movq	%rdx, 456(%rsp)                 # 8-byte Spill
	movq	184(%rsp), %rdx                 # 8-byte Reload
	movq	280(%rsp), %rcx                 # 8-byte Reload
	movq	288(%rsp), %rbx                 # 8-byte Reload
	leaq	576(%rsp), %rax
	movq	%rax, 96(%rsp)                  # 8-byte Spill
	xorl	%eax, %eax
	movq	%rax, 112(%rsp)                 # 8-byte Spill
	jmp	.LBB2_45
	.p2align	4, 0x90
.LBB2_62:                               #   in Loop: Header=BB2_45 Depth=3
	movq	112(%rsp), %rdi                 # 8-byte Reload
	addq	$1, %rdi
	movq	96(%rsp), %rax                  # 8-byte Reload
	addq	168(%rsp), %rax                 # 8-byte Folded Reload
	movq	%rax, 96(%rsp)                  # 8-byte Spill
	movq	152(%rsp), %rax                 # 8-byte Reload
	movq	120(%rsp), %rbx                 # 8-byte Reload
	addq	%rax, %rbx
	movq	128(%rsp), %rcx                 # 8-byte Reload
	addq	%rax, %rcx
	addq	%rax, %rdx
	movq	%rdi, %rax
	movq	%rdi, 112(%rsp)                 # 8-byte Spill
	cmpq	160(%rsp), %rdi                 # 8-byte Folded Reload
	movq	88(%rsp), %rsi                  # 8-byte Reload
	je	.LBB2_76
.LBB2_45:                               #   Parent Loop BB2_32 Depth=1
                                        #     Parent Loop BB2_34 Depth=2
                                        # =>    This Loop Header: Depth=3
                                        #         Child Loop BB2_51 Depth 4
                                        #         Child Loop BB2_57 Depth 4
                                        #         Child Loop BB2_61 Depth 4
	movq	%rbx, 120(%rsp)                 # 8-byte Spill
	movq	%rcx, 128(%rsp)                 # 8-byte Spill
	testq	%rsi, %rsi
	jle	.LBB2_62
# %bb.46:                               #   in Loop: Header=BB2_45 Depth=3
	movq	88(%rsp), %r9                   # 8-byte Reload
	cmpq	$4, %r9
	setb	%al
	orb	87(%rsp), %al                   # 1-byte Folded Reload
	je	.LBB2_48
# %bb.47:                               #   in Loop: Header=BB2_45 Depth=3
	xorl	%r8d, %r8d
	jmp	.LBB2_55
	.p2align	4, 0x90
.LBB2_48:                               #   in Loop: Header=BB2_45 Depth=3
	movq	%rdx, %r11
	cmpq	$0, 480(%rsp)                   # 8-byte Folded Reload
	je	.LBB2_49
# %bb.50:                               #   in Loop: Header=BB2_45 Depth=3
	movq	296(%rsp), %rax                 # 8-byte Reload
	movq	432(%rsp), %rbx                 # 8-byte Reload
	movq	96(%rsp), %rdi                  # 8-byte Reload
	xorl	%edx, %edx
	movq	144(%rsp), %r14                 # 8-byte Reload
	movq	320(%rsp), %r10                 # 8-byte Reload
	movq	312(%rsp), %r8                  # 8-byte Reload
	movq	128(%rsp), %rsi                 # 8-byte Reload
	movq	120(%rsp), %rcx                 # 8-byte Reload
	.p2align	4, 0x90
.LBB2_51:                               #   Parent Loop BB2_32 Depth=1
                                        #     Parent Loop BB2_34 Depth=2
                                        #       Parent Loop BB2_45 Depth=3
                                        # =>      This Inner Loop Header: Depth=4
	movups	(%rdi), %xmm0
	movups	16(%rdi), %xmm1
	movups	%xmm0, -16(%rsi,%rax,4)
	movups	%xmm1, (%rsi,%rax,4)
	movupd	(%rdi,%r14), %xmm0
	movupd	16(%rdi,%r14), %xmm1
	movupd	%xmm0, -16(%rcx,%rax,4)
	movupd	%xmm1, (%rcx,%rax,4)
	addq	$8, %rdx
	addq	%r10, %rdi
	addq	%r8, %rax
	addq	$2, %rbx
	jne	.LBB2_51
# %bb.52:                               #   in Loop: Header=BB2_45 Depth=3
	testb	$1, 472(%rsp)                   # 1-byte Folded Reload
	je	.LBB2_54
.LBB2_53:                               #   in Loop: Header=BB2_45 Depth=3
	movq	112(%rsp), %rbx                 # 8-byte Reload
	movq	%rbx, %rax
	imulq	136(%rsp), %rax                 # 8-byte Folded Reload
	imulq	4824(%rbp), %rbx
	leaq	(%rsp,%rax,8), %rax
	addq	$576, %rax                      # imm = 0x240
	movq	208(%rsp), %rcx                 # 8-byte Reload
	leaq	(%rcx,%rbx,8), %rbx
	movq	%rdx, %rdi
	imulq	104(%rsp), %rdi                 # 8-byte Folded Reload
	movupd	(%rax,%rdi,8), %xmm0
	movupd	16(%rax,%rdi,8), %xmm1
	imulq	4816(%rbp), %rdx
	movupd	%xmm0, (%rbx,%rdx,8)
	movupd	%xmm1, 16(%rbx,%rdx,8)
.LBB2_54:                               #   in Loop: Header=BB2_45 Depth=3
	movq	464(%rsp), %rax                 # 8-byte Reload
	movq	%rax, %r8
	cmpq	%rax, %r9
	movq	%r11, %rdx
	je	.LBB2_62
.LBB2_55:                               #   in Loop: Header=BB2_45 Depth=3
	movq	%r8, %r9
	notq	%r9
	addq	88(%rsp), %r9                   # 8-byte Folded Reload
	cmpq	$0, 496(%rsp)                   # 8-byte Folded Reload
	je	.LBB2_59
# %bb.56:                               #   in Loop: Header=BB2_45 Depth=3
	movq	200(%rsp), %rcx                 # 8-byte Reload
	movq	%rcx, %rax
	imulq	%r8, %rax
	movq	%rdx, %r10
	addq	%rdx, %rax
	movq	104(%rsp), %rbx                 # 8-byte Reload
	imulq	%r8, %rbx
	movq	96(%rsp), %rdx                  # 8-byte Reload
	leaq	(%rdx,%rbx,8), %rbx
	xorl	%edi, %edi
	movq	192(%rsp), %rsi                 # 8-byte Reload
	movq	456(%rsp), %rdx                 # 8-byte Reload
	.p2align	4, 0x90
.LBB2_57:                               #   Parent Loop BB2_32 Depth=1
                                        #     Parent Loop BB2_34 Depth=2
                                        #       Parent Loop BB2_45 Depth=3
                                        # =>      This Inner Loop Header: Depth=4
	movsd	(%rbx), %xmm0                   # xmm0 = mem[0],zero
	movsd	%xmm0, (%rax)
	addq	$-1, %rdi
	addq	%rcx, %rax
	addq	%rsi, %rbx
	cmpq	%rdi, %rdx
	jne	.LBB2_57
# %bb.58:                               #   in Loop: Header=BB2_45 Depth=3
	subq	%rdi, %r8
	movq	%r10, %rdx
.LBB2_59:                               #   in Loop: Header=BB2_45 Depth=3
	cmpq	$3, %r9
	movq	144(%rsp), %rcx                 # 8-byte Reload
	movq	488(%rsp), %r9                  # 8-byte Reload
	jb	.LBB2_62
# %bb.60:                               #   in Loop: Header=BB2_45 Depth=3
	movq	88(%rsp), %rsi                  # 8-byte Reload
	subq	%r8, %rsi
	leaq	3(%r8), %rdi
	movq	200(%rsp), %r10                 # 8-byte Reload
	movq	%r10, %rbx
	imulq	%rdi, %rbx
	movq	192(%rsp), %r11                 # 8-byte Reload
	imulq	%r11, %rdi
	leaq	2(%r8), %rax
	movq	%r10, %r12
	imulq	%rax, %r12
	imulq	%r11, %rax
	leaq	1(%r8), %r13
	movq	%r10, %r14
	imulq	%r13, %r14
	imulq	%r11, %r13
	imulq	%r8, %r10
	imulq	%r11, %r8
	movq	96(%rsp), %r11                  # 8-byte Reload
	movq	%rdx, %r15
	.p2align	4, 0x90
.LBB2_61:                               #   Parent Loop BB2_32 Depth=1
                                        #     Parent Loop BB2_34 Depth=2
                                        #       Parent Loop BB2_45 Depth=3
                                        # =>      This Inner Loop Header: Depth=4
	movsd	(%r11,%r8), %xmm0               # xmm0 = mem[0],zero
	movsd	%xmm0, (%r15,%r10)
	movsd	(%r11,%r13), %xmm0              # xmm0 = mem[0],zero
	movsd	%xmm0, (%r15,%r14)
	movsd	(%r11,%rax), %xmm0              # xmm0 = mem[0],zero
	movsd	%xmm0, (%r15,%r12)
	movsd	(%r11,%rdi), %xmm0              # xmm0 = mem[0],zero
	movsd	%xmm0, (%r15,%rbx)
	addq	%r9, %r15
	addq	%rcx, %r11
	addq	$-4, %rsi
	jne	.LBB2_61
	jmp	.LBB2_62
.LBB2_49:                               #   in Loop: Header=BB2_45 Depth=3
	xorl	%edx, %edx
	testb	$1, 472(%rsp)                   # 1-byte Folded Reload
	jne	.LBB2_53
	jmp	.LBB2_54
.LBB2_38:                               #   in Loop: Header=BB2_34 Depth=2
	testq	%r12, %r12
	jle	.LBB2_76
# %bb.39:                               #   in Loop: Header=BB2_34 Depth=2
	leaq	-4(%rsi), %rax
	shrq	$2, %rax
	movq	%rax, 96(%rsp)                  # 8-byte Spill
	leaq	1(%rax), %rcx
	movq	%rsi, %rax
	andq	$-4, %rax
	movq	%rax, 120(%rsp)                 # 8-byte Spill
	movapd	%xmm0, %xmm1
	unpcklpd	%xmm0, %xmm1                    # xmm1 = xmm1[0],xmm0[0]
	movq	%rcx, %rax
	movq	%rcx, 128(%rsp)                 # 8-byte Spill
	andq	$-2, %rcx
	negq	%rcx
	movq	%rcx, 112(%rsp)                 # 8-byte Spill
	movq	184(%rsp), %r12                 # 8-byte Reload
	movq	264(%rsp), %r13                 # 8-byte Reload
	movq	272(%rsp), %rdx                 # 8-byte Reload
	leaq	576(%rsp), %r14
	xorl	%r9d, %r9d
	jmp	.LBB2_40
	.p2align	4, 0x90
.LBB2_75:                               #   in Loop: Header=BB2_40 Depth=3
	addq	$1, %r9
	addq	168(%rsp), %r14                 # 8-byte Folded Reload
	movq	152(%rsp), %rax                 # 8-byte Reload
	addq	%rax, %rdx
	addq	%rax, %r13
	addq	%rax, %r12
	cmpq	160(%rsp), %r9                  # 8-byte Folded Reload
	movq	88(%rsp), %rsi                  # 8-byte Reload
	je	.LBB2_76
.LBB2_40:                               #   Parent Loop BB2_32 Depth=1
                                        #     Parent Loop BB2_34 Depth=2
                                        # =>    This Loop Header: Depth=3
                                        #         Child Loop BB2_66 Depth 4
                                        #         Child Loop BB2_74 Depth 4
	testq	%rsi, %rsi
	movq	144(%rsp), %rdi                 # 8-byte Reload
	movq	320(%rsp), %rax                 # 8-byte Reload
	movq	312(%rsp), %r15                 # 8-byte Reload
	movq	304(%rsp), %r11                 # 8-byte Reload
	jle	.LBB2_75
# %bb.41:                               #   in Loop: Header=BB2_40 Depth=3
	cmpq	$4, 88(%rsp)                    # 8-byte Folded Reload
	setb	%r8b
	movq	%r9, %rcx
	imulq	136(%rsp), %rcx                 # 8-byte Folded Reload
	movq	%r9, %rbx
	imulq	4824(%rbp), %rbx
	leaq	(%rsp,%rcx,8), %rsi
	addq	$576, %rsi                      # imm = 0x240
	movq	208(%rsp), %rcx                 # 8-byte Reload
	leaq	(%rcx,%rbx,8), %rbx
	orb	87(%rsp), %r8b                  # 1-byte Folded Reload
	je	.LBB2_63
# %bb.42:                               #   in Loop: Header=BB2_40 Depth=3
	xorl	%r8d, %r8d
	jmp	.LBB2_70
	.p2align	4, 0x90
.LBB2_63:                               #   in Loop: Header=BB2_40 Depth=3
	movq	%rsi, %rcx
	cmpq	$0, 96(%rsp)                    # 8-byte Folded Reload
	je	.LBB2_64
# %bb.65:                               #   in Loop: Header=BB2_40 Depth=3
	xorl	%esi, %esi
	movq	112(%rsp), %r11                 # 8-byte Reload
	movq	%r14, %r8
	xorl	%r10d, %r10d
	.p2align	4, 0x90
.LBB2_66:                               #   Parent Loop BB2_32 Depth=1
                                        #     Parent Loop BB2_34 Depth=2
                                        #       Parent Loop BB2_40 Depth=3
                                        # =>      This Inner Loop Header: Depth=4
	movupd	(%r8), %xmm2
	movupd	16(%r8), %xmm3
	movupd	-16(%r13,%rsi,4), %xmm4
	movupd	(%r13,%rsi,4), %xmm5
	mulpd	%xmm1, %xmm4
	addpd	%xmm2, %xmm4
	mulpd	%xmm1, %xmm5
	addpd	%xmm3, %xmm5
	movupd	%xmm4, -16(%r13,%rsi,4)
	movupd	%xmm5, (%r13,%rsi,4)
	movupd	(%r8,%rdi), %xmm2
	movupd	16(%r8,%rdi), %xmm3
	movupd	-16(%rdx,%rsi,4), %xmm4
	movupd	(%rdx,%rsi,4), %xmm5
	mulpd	%xmm1, %xmm4
	addpd	%xmm2, %xmm4
	mulpd	%xmm1, %xmm5
	addpd	%xmm3, %xmm5
	movupd	%xmm4, -16(%rdx,%rsi,4)
	movupd	%xmm5, (%rdx,%rsi,4)
	addq	$8, %r10
	addq	%rax, %r8
	addq	%r15, %rsi
	addq	$2, %r11
	jne	.LBB2_66
	jmp	.LBB2_67
.LBB2_64:                               #   in Loop: Header=BB2_40 Depth=3
	xorl	%r10d, %r10d
.LBB2_67:                               #   in Loop: Header=BB2_40 Depth=3
	testb	$1, 128(%rsp)                   # 1-byte Folded Reload
	movq	%rcx, %rsi
	je	.LBB2_69
# %bb.68:                               #   in Loop: Header=BB2_40 Depth=3
	movq	%r10, %rax
	imulq	104(%rsp), %rax                 # 8-byte Folded Reload
	movupd	(%rsi,%rax,8), %xmm2
	movupd	16(%rsi,%rax,8), %xmm3
	imulq	4816(%rbp), %r10
	movupd	(%rbx,%r10,8), %xmm4
	movupd	16(%rbx,%r10,8), %xmm5
	mulpd	%xmm1, %xmm4
	addpd	%xmm2, %xmm4
	mulpd	%xmm1, %xmm5
	addpd	%xmm3, %xmm5
	movupd	%xmm4, (%rbx,%r10,8)
	movupd	%xmm5, 16(%rbx,%r10,8)
.LBB2_69:                               #   in Loop: Header=BB2_40 Depth=3
	movq	120(%rsp), %rax                 # 8-byte Reload
	movq	%rax, %r8
	cmpq	%rax, 88(%rsp)                  # 8-byte Folded Reload
	movq	304(%rsp), %r11                 # 8-byte Reload
	je	.LBB2_75
.LBB2_70:                               #   in Loop: Header=BB2_40 Depth=3
	movq	%r8, %rax
	orq	$1, %rax
	testb	$1, 88(%rsp)                    # 1-byte Folded Reload
	je	.LBB2_72
# %bb.71:                               #   in Loop: Header=BB2_40 Depth=3
	movq	%r8, %rcx
	imulq	104(%rsp), %rcx                 # 8-byte Folded Reload
	imulq	4816(%rbp), %r8
	movsd	(%rbx,%r8,8), %xmm2             # xmm2 = mem[0],zero
	mulsd	%xmm0, %xmm2
	addsd	(%rsi,%rcx,8), %xmm2
	movsd	%xmm2, (%rbx,%r8,8)
	movq	%rax, %r8
.LBB2_72:                               #   in Loop: Header=BB2_40 Depth=3
	cmpq	%rax, 88(%rsp)                  # 8-byte Folded Reload
	je	.LBB2_75
# %bb.73:                               #   in Loop: Header=BB2_40 Depth=3
	movq	88(%rsp), %rbx                  # 8-byte Reload
	subq	%r8, %rbx
	leaq	1(%r8), %r10
	movq	200(%rsp), %rsi                 # 8-byte Reload
	movq	%rsi, %rcx
	imulq	%r10, %rcx
	movq	192(%rsp), %rax                 # 8-byte Reload
	imulq	%rax, %r10
	imulq	%r8, %rsi
	imulq	%rax, %r8
	movq	%r14, %rdi
	movq	%r12, %rax
	.p2align	4, 0x90
.LBB2_74:                               #   Parent Loop BB2_32 Depth=1
                                        #     Parent Loop BB2_34 Depth=2
                                        #       Parent Loop BB2_40 Depth=3
                                        # =>      This Inner Loop Header: Depth=4
	movsd	(%rax,%rsi), %xmm2              # xmm2 = mem[0],zero
	mulsd	%xmm0, %xmm2
	addsd	(%rdi,%r8), %xmm2
	movsd	%xmm2, (%rax,%rsi)
	movsd	(%rax,%rcx), %xmm2              # xmm2 = mem[0],zero
	mulsd	%xmm0, %xmm2
	addsd	(%rdi,%r10), %xmm2
	movsd	%xmm2, (%rax,%rcx)
	addq	%r15, %rax
	addq	%r11, %rdi
	addq	$-2, %rbx
	jne	.LBB2_74
	jmp	.LBB2_75
.LBB2_78:
	movaps	4576(%rbp), %xmm6               # 16-byte Reload
	leaq	4600(%rbp), %rsp
	popq	%rbx
	popq	%rdi
	popq	%rsi
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
	popq	%rbp
	retq
	.seh_endproc
                                        # -- End function
	.def	 bli_cgemm_ker_var2;
	.scl	2;
	.type	32;
	.endef
	.globl	bli_cgemm_ker_var2              # -- Begin function bli_cgemm_ker_var2
	.p2align	4, 0x90
bli_cgemm_ker_var2:                     # @bli_cgemm_ker_var2
.seh_proc bli_cgemm_ker_var2
# %bb.0:
	pushq	%rbp
	.seh_pushreg %rbp
	pushq	%r15
	.seh_pushreg %r15
	pushq	%r14
	.seh_pushreg %r14
	pushq	%r13
	.seh_pushreg %r13
	pushq	%r12
	.seh_pushreg %r12
	pushq	%rsi
	.seh_pushreg %rsi
	pushq	%rdi
	.seh_pushreg %rdi
	pushq	%rbx
	.seh_pushreg %rbx
	movl	$4600, %eax                     # imm = 0x11F8
	callq	__chkstk
	subq	%rax, %rsp
	.seh_stackalloc 4600
	leaq	128(%rsp), %rbp
	.seh_setframe %rbp, 128
	movaps	%xmm6, 4448(%rbp)               # 16-byte Spill
	.seh_savexmm %xmm6, 4576
	.seh_endprologue
	andq	$-64, %rsp
	movq	4704(%rbp), %rbx
	xorl	%eax, %eax
	cmpl	$6, 5064(%rbx)
	sete	%al
	cmpb	$0, 1072(%rbx,%rax)
	movq	4656(%rbp), %rdi
	movl	$1, %eax
	movq	%rdi, %rbx
	cmoveq	%rax, %rbx
	cmoveq	4616(%rbp), %rax
	movq	%rax, 144(%rsp)                 # 8-byte Spill
	testq	%r8, %r8
	je	.LBB3_47
# %bb.1:
	testq	%r9, %r9
	je	.LBB3_47
# %bb.2:
	cmpq	$0, 4576(%rbp)
	je	.LBB3_47
# %bb.3:
	movl	%ecx, 128(%rsp)                 # 4-byte Spill
	movl	%edx, 120(%rsp)                 # 4-byte Spill
	movq	%rbx, 136(%rsp)                 # 8-byte Spill
	movq	4704(%rbp), %rax
	movq	760(%rax), %rax
	movq	%rax, 192(%rsp)                 # 8-byte Spill
	movq	BLIS_ZERO+64(%rip), %rax
	movq	%rax, 184(%rsp)                 # 8-byte Spill
	testq	%rdi, %rdi
	jle	.LBB3_12
# %bb.4:
	movq	4616(%rbp), %r13
	leaq	-1(%r13), %rcx
	movl	%r13d, %r15d
	andl	$3, %r15d
	andq	$-4, %r13
	movq	144(%rsp), %rax                 # 8-byte Reload
	leaq	(,%rax,8), %r14
	movq	136(%rsp), %rax                 # 8-byte Reload
	movq	%rax, %rsi
	shlq	$5, %rsi
	leaq	(,%rax,8), %rax
	leaq	448(%rsp), %rbx
	xorl	%r12d, %r12d
	movq	136(%rsp), %r11                 # 8-byte Reload
	jmp	.LBB3_5
	.p2align	4, 0x90
.LBB3_11:                               #   in Loop: Header=BB3_5 Depth=1
	addq	$1, %r12
	addq	%r14, %rbx
	cmpq	4656(%rbp), %r12
	je	.LBB3_12
.LBB3_5:                                # =>This Loop Header: Depth=1
                                        #     Child Loop BB3_15 Depth 2
                                        #     Child Loop BB3_10 Depth 2
	cmpq	$0, 4616(%rbp)
	jle	.LBB3_11
# %bb.6:                                #   in Loop: Header=BB3_5 Depth=1
	cmpq	$3, %rcx
	jae	.LBB3_14
# %bb.7:                                #   in Loop: Header=BB3_5 Depth=1
	xorl	%edx, %edx
	jmp	.LBB3_8
	.p2align	4, 0x90
.LBB3_14:                               #   in Loop: Header=BB3_5 Depth=1
	movq	%rbx, %rdi
	xorl	%edx, %edx
	.p2align	4, 0x90
.LBB3_15:                               #   Parent Loop BB3_5 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
	movq	$0, (%rdi)
	leaq	(%rdi,%rax), %r10
	movq	$0, (%rdi,%r11,8)
	movq	$0, (%r10,%r11,8)
	addq	%rax, %r10
	movq	$0, (%r10,%r11,8)
	addq	$4, %rdx
	addq	%rsi, %rdi
	cmpq	%rdx, %r13
	jne	.LBB3_15
.LBB3_8:                                #   in Loop: Header=BB3_5 Depth=1
	testq	%r15, %r15
	je	.LBB3_11
# %bb.9:                                #   in Loop: Header=BB3_5 Depth=1
	imulq	%rax, %rdx
	movq	%r15, %rdi
	.p2align	4, 0x90
.LBB3_10:                               #   Parent Loop BB3_5 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
	movq	$0, (%rbx,%rdx)
	addq	%rax, %rdx
	addq	$-1, %rdi
	jne	.LBB3_10
	jmp	.LBB3_11
.LBB3_12:
	movq	%r9, %rax
	movq	4656(%rbp), %rbx
	orq	%rbx, %rax
	shrq	$32, %rax
	je	.LBB3_13
# %bb.16:
	movq	%r9, %rax
	cqto
	idivq	%rbx
	movq	%rdx, 152(%rsp)                 # 8-byte Spill
	movq	%rax, %rsi
	jmp	.LBB3_17
.LBB3_13:
	movl	%r9d, %eax
	xorl	%edx, %edx
	divl	%ebx
                                        # kill: def $edx killed $edx def $rdx
	movq	%rdx, 152(%rsp)                 # 8-byte Spill
	movl	%eax, %esi
.LBB3_17:
	movq	4616(%rbp), %rcx
	movq	4720(%rbp), %r9
	movq	4648(%rbp), %r10
	movq	4608(%rbp), %rdi
	movq	%r8, %rax
	orq	%rcx, %rax
	shrq	$32, %rax
	je	.LBB3_18
# %bb.19:
	movq	%r8, %rax
	cqto
	idivq	%rcx
	movq	%rdx, %rcx
	movq	%rax, %rbx
	jmp	.LBB3_20
.LBB3_18:
	movl	%r8d, %eax
	xorl	%edx, %edx
	divl	%ecx
	movl	%edx, %ecx
	movl	%eax, %ebx
.LBB3_20:
	cmpq	$1, 152(%rsp)                   # 8-byte Folded Reload
	movq	%rsi, %rdx
	sbbq	$-1, %rdx
	movq	%rcx, 328(%rsp)                 # 8-byte Spill
	cmpq	$1, %rcx
	sbbq	$-1, %rbx
	movl	128(%rsp), %eax                 # 4-byte Reload
	movl	%eax, 392(%rsp)
	movl	120(%rsp), %eax                 # 4-byte Reload
	movl	%eax, 396(%rsp)
	movq	%rdi, 416(%rsp)
	movq	%r10, 424(%rsp)
	movq	48(%r9), %rsi
	leaq	280(%rsp), %rax
	movq	%rax, 40(%rsp)
	leaq	384(%rsp), %rax
	movq	%rax, 32(%rsp)
	movl	$1, %r8d
	movq	%r9, %rcx
	movq	%rdx, 168(%rsp)                 # 8-byte Spill
	xorl	%r9d, %r9d
	callq	bli_thread_range_sub
	leaq	272(%rsp), %rax
	movq	%rax, 40(%rsp)
	leaq	376(%rsp), %rax
	movq	%rax, 32(%rsp)
	movl	$1, %r8d
	movq	%rsi, %rcx
	movq	%rbx, 200(%rsp)                 # 8-byte Spill
	movq	%rbx, %rdx
	xorl	%r9d, %r9d
	callq	bli_thread_range_sub
	movq	384(%rsp), %rax
	movq	280(%rsp), %rcx
	movq	%rcx, 120(%rsp)                 # 8-byte Spill
	cmpq	%rcx, %rax
	movq	4616(%rbp), %r8
	jge	.LBB3_47
# %bb.21:
	movq	4696(%rbp), %r12
	movq	4688(%rbp), %r13
	movq	4680(%rbp), %r10
	addq	$16, 184(%rsp)                  # 8-byte Folded Spill
	movq	%r13, %rdi
	imulq	%r8, %rdi
	movq	%r12, %rdx
	movq	4656(%rbp), %rbx
	imulq	%rbx, %rdx
	addq	$-1, 168(%rsp)                  # 8-byte Folded Spill
	addq	$-1, 200(%rsp)                  # 8-byte Folded Spill
	movq	%rax, %rcx
	imulq	%r12, %rcx
	imulq	%rbx, %rcx
	movq	136(%rsp), %rbx                 # 8-byte Reload
	movq	%rbx, %r15
	shlq	$4, %r15
	leaq	(%r10,%rcx,8), %rsi
	movq	%rsi, 176(%rsp)                 # 8-byte Spill
	shlq	$3, %rcx
	movq	%rcx, 296(%rsp)                 # 8-byte Spill
	movq	%r13, %r14
	shlq	$4, %r14
	xorps	%xmm6, %xmm6
	movq	272(%rsp), %r9
	movq	%rdi, 320(%rsp)                 # 8-byte Spill
	leaq	(,%rdi,8), %rsi
	movq	%rsi, 216(%rsp)                 # 8-byte Spill
	leaq	(,%r12,8), %rsi
	movq	%rdx, 304(%rsp)                 # 8-byte Spill
	leaq	(,%rdx,8), %rdx
	movq	%rdx, 312(%rsp)                 # 8-byte Spill
	leaq	(,%r13,8), %r13
	movq	144(%rsp), %rdx                 # 8-byte Reload
	leaq	(,%rdx,8), %rdx
	movq	%rdx, 264(%rsp)                 # 8-byte Spill
	leaq	(,%rbx,8), %r12
	leaq	(,%r8,8), %rdx
	movq	%rdx, 288(%rsp)                 # 8-byte Spill
	movq	%r10, 256(%rsp)                 # 8-byte Spill
	movq	%rsi, %r10
	movq	%rsi, 160(%rsp)                 # 8-byte Spill
	jmp	.LBB3_22
	.p2align	4, 0x90
.LBB3_46:                               #   in Loop: Header=BB3_22 Depth=1
	addq	$1, %rax
	movq	312(%rsp), %rdx                 # 8-byte Reload
	addq	%rdx, 256(%rsp)                 # 8-byte Folded Spill
	addq	%rdx, 176(%rsp)                 # 8-byte Folded Spill
	cmpq	120(%rsp), %rax                 # 8-byte Folded Reload
	jge	.LBB3_47
.LBB3_22:                               # =>This Loop Header: Depth=1
                                        #     Child Loop BB3_24 Depth 2
                                        #       Child Loop BB3_41 Depth 3
                                        #         Child Loop BB3_43 Depth 4
                                        #       Child Loop BB3_31 Depth 3
                                        #         Child Loop BB3_38 Depth 4
	cmpq	%rax, 168(%rsp)                 # 8-byte Folded Reload
	movq	152(%rsp), %rdx                 # 8-byte Reload
	movq	%rdx, %rsi
	movq	4656(%rbp), %rcx
	cmovneq	%rcx, %rsi
	testq	%rdx, %rdx
	cmoveq	%rcx, %rsi
	movq	376(%rsp), %rdi
	cmpq	%r9, %rdi
	jge	.LBB3_46
# %bb.23:                               #   in Loop: Header=BB3_22 Depth=1
	movq	%rsi, 128(%rsp)                 # 8-byte Spill
	movq	%rax, %rdx
	movq	4664(%rbp), %r11
	imulq	%r11, %rdx
	movq	304(%rsp), %rbx                 # 8-byte Reload
	imulq	%rax, %rbx
	movq	4632(%rbp), %rsi
	leaq	(%rsi,%rdx,8), %rcx
	movq	4680(%rbp), %rdx
	leaq	(%rdx,%rbx,8), %rdx
	movq	%rdx, 344(%rsp)                 # 8-byte Spill
	movq	%rdi, %rbx
	leaq	(%rcx,%r11,8), %rdx
	movq	%rdx, 336(%rsp)                 # 8-byte Spill
	movq	216(%rsp), %rdi                 # 8-byte Reload
	imulq	%rbx, %rdi
	movq	296(%rsp), %rdx                 # 8-byte Reload
	addq	%rdi, %rdx
	movq	%rdx, 240(%rsp)                 # 8-byte Spill
	movq	288(%rsp), %rsi                 # 8-byte Reload
	imulq	%rbx, %rsi
	addq	$8, %rsi
	imulq	4688(%rbp), %rsi
	movq	176(%rsp), %rdx                 # 8-byte Reload
	addq	%rdx, %rsi
	movq	%rsi, 224(%rsp)                 # 8-byte Spill
	addq	%rdx, %rdi
	movq	%rdi, 232(%rsp)                 # 8-byte Spill
	movq	%rcx, 208(%rsp)                 # 8-byte Spill
	movq	%rcx, %rdx
	movq	%rax, 352(%rsp)                 # 8-byte Spill
	jmp	.LBB3_24
	.p2align	4, 0x90
.LBB3_26:                               #   in Loop: Header=BB3_24 Depth=2
	movq	4704(%rbp), %rax
	movq	%rax, 72(%rsp)
	leaq	392(%rsp), %rax
	movq	%rax, 64(%rsp)
	movq	4696(%rbp), %rax
	movq	%rax, 56(%rsp)
	movq	4688(%rbp), %rax
	movq	%rax, 48(%rsp)
	movq	248(%rsp), %rax                 # 8-byte Reload
	movq	%rax, 40(%rsp)
	movq	4672(%rbp), %rax
	movq	%rax, 32(%rsp)
	movq	4576(%rbp), %rcx
	movq	4584(%rbp), %rdx
	movq	208(%rsp), %r9                  # 8-byte Reload
	callq	*192(%rsp)                      # 8-byte Folded Reload
.LBB3_45:                               #   in Loop: Header=BB3_24 Depth=2
	movq	368(%rsp), %rbx                 # 8-byte Reload
	addq	$1, %rbx
	movq	272(%rsp), %r9
	movq	280(%rsp), %rax
	movq	%rax, 120(%rsp)                 # 8-byte Spill
	movq	216(%rsp), %rax                 # 8-byte Reload
	addq	%rax, 240(%rsp)                 # 8-byte Folded Spill
	addq	%rax, 224(%rsp)                 # 8-byte Folded Spill
	addq	%rax, 232(%rsp)                 # 8-byte Folded Spill
	cmpq	%r9, %rbx
	movq	4616(%rbp), %r8
	movq	352(%rsp), %rax                 # 8-byte Reload
	movq	360(%rsp), %rdx                 # 8-byte Reload
	jge	.LBB3_46
.LBB3_24:                               #   Parent Loop BB3_22 Depth=1
                                        # =>  This Loop Header: Depth=2
                                        #       Child Loop BB3_41 Depth 3
                                        #         Child Loop BB3_43 Depth 4
                                        #       Child Loop BB3_31 Depth 3
                                        #         Child Loop BB3_38 Depth 4
	movq	%rax, %rcx
	movq	%rbx, %r11
	movq	4624(%rbp), %rdi
	imulq	%rdi, %r11
	movq	320(%rsp), %rdi                 # 8-byte Reload
	imulq	%rbx, %rdi
	cmpq	%rbx, 200(%rsp)                 # 8-byte Folded Reload
	movq	328(%rsp), %rsi                 # 8-byte Reload
	movq	%r8, %r10
	movq	120(%rsp), %rax                 # 8-byte Reload
	movq	%rsi, %r8
	cmovneq	%r10, %r8
	testq	%rsi, %rsi
	cmoveq	%r10, %r8
	addq	$-1, %r9
	addq	$-1, %rax
	cmpq	%rcx, %rax
	movq	336(%rsp), %rsi                 # 8-byte Reload
	cmoveq	4632(%rbp), %rsi
	movq	%rbx, 368(%rsp)                 # 8-byte Spill
	cmpq	%rbx, %r9
	movq	%r8, %rbx
	cmovneq	%rdx, %rsi
	movq	4592(%rbp), %rcx
	leaq	(%rcx,%r11,8), %r8
	movq	4624(%rbp), %rax
	leaq	(%r8,%rax,8), %rax
	cmoveq	%rcx, %rax
	movq	%rax, 400(%rsp)
	movq	344(%rsp), %rax                 # 8-byte Reload
	leaq	(%rax,%rdi,8), %rax
	movq	%rax, 248(%rsp)                 # 8-byte Spill
	movq	%rsi, 408(%rsp)
	movq	128(%rsp), %rdi                 # 8-byte Reload
	cmpq	4656(%rbp), %rdi
	movq	%rsi, 360(%rsp)                 # 8-byte Spill
	jne	.LBB3_27
# %bb.25:                               #   in Loop: Header=BB3_24 Depth=2
	cmpq	%r10, %rbx
	je	.LBB3_26
.LBB3_27:                               #   in Loop: Header=BB3_24 Depth=2
	movq	4704(%rbp), %rax
	movq	%rax, 72(%rsp)
	leaq	392(%rsp), %rax
	movq	%rax, 64(%rsp)
	movq	144(%rsp), %rax                 # 8-byte Reload
	movq	%rax, 56(%rsp)
	movq	136(%rsp), %rax                 # 8-byte Reload
	movq	%rax, 48(%rsp)
	leaq	448(%rsp), %rax
	movq	%rax, 40(%rsp)
	movq	184(%rsp), %rax                 # 8-byte Reload
	movq	%rax, 32(%rsp)
	movq	4576(%rbp), %rcx
	movq	4584(%rbp), %rdx
	movq	208(%rsp), %r9                  # 8-byte Reload
	callq	*192(%rsp)                      # 8-byte Folded Reload
	movq	4672(%rbp), %rax
	movss	(%rax), %xmm0                   # xmm0 = mem[0],zero,zero,zero
	ucomiss	%xmm6, %xmm0
	movq	%rbx, 120(%rsp)                 # 8-byte Spill
	jne	.LBB3_39
	jp	.LBB3_39
# %bb.28:                               #   in Loop: Header=BB3_24 Depth=2
	movss	4(%rax), %xmm1                  # xmm1 = mem[0],zero,zero,zero
	ucomiss	%xmm6, %xmm1
	jne	.LBB3_39
	jp	.LBB3_39
# %bb.29:                               #   in Loop: Header=BB3_24 Depth=2
	testq	%rdi, %rdi
	movq	160(%rsp), %r10                 # 8-byte Reload
	jle	.LBB3_45
# %bb.30:                               #   in Loop: Header=BB3_24 Depth=2
	movq	%rbx, %rax
	andq	$-2, %rax
	movq	232(%rsp), %rcx                 # 8-byte Reload
	movq	224(%rsp), %rdx                 # 8-byte Reload
	leaq	452(%rsp), %r8
	xorl	%r9d, %r9d
	jmp	.LBB3_31
	.p2align	4, 0x90
.LBB3_36:                               #   in Loop: Header=BB3_31 Depth=3
	addq	$1, %r9
	addq	264(%rsp), %r8                  # 8-byte Folded Reload
	addq	%r10, %rdx
	addq	%r10, %rcx
	cmpq	%rdi, %r9
	je	.LBB3_45
.LBB3_31:                               #   Parent Loop BB3_22 Depth=1
                                        #     Parent Loop BB3_24 Depth=2
                                        # =>    This Loop Header: Depth=3
                                        #         Child Loop BB3_38 Depth 4
	testq	%rbx, %rbx
	jle	.LBB3_36
# %bb.32:                               #   in Loop: Header=BB3_31 Depth=3
	cmpq	$1, %rbx
	jne	.LBB3_37
# %bb.33:                               #   in Loop: Header=BB3_31 Depth=3
	xorl	%esi, %esi
	jmp	.LBB3_34
	.p2align	4, 0x90
.LBB3_37:                               #   in Loop: Header=BB3_31 Depth=3
	xorl	%ebx, %ebx
	movq	%r8, %rdi
	xorl	%esi, %esi
	.p2align	4, 0x90
.LBB3_38:                               #   Parent Loop BB3_22 Depth=1
                                        #     Parent Loop BB3_24 Depth=2
                                        #       Parent Loop BB3_31 Depth=3
                                        # =>      This Inner Loop Header: Depth=4
	movss	-4(%rdi), %xmm0                 # xmm0 = mem[0],zero,zero,zero
	movss	%xmm0, (%rcx,%rbx)
	movss	(%rdi), %xmm0                   # xmm0 = mem[0],zero,zero,zero
	movss	%xmm0, 4(%rcx,%rbx)
	movss	-4(%rdi,%r12), %xmm0            # xmm0 = mem[0],zero,zero,zero
	movss	%xmm0, (%rdx,%rbx)
	movss	(%rdi,%r12), %xmm0              # xmm0 = mem[0],zero,zero,zero
	movss	%xmm0, 4(%rdx,%rbx)
	addq	$2, %rsi
	addq	%r15, %rdi
	addq	%r14, %rbx
	cmpq	%rsi, %rax
	jne	.LBB3_38
.LBB3_34:                               #   in Loop: Header=BB3_31 Depth=3
	movq	120(%rsp), %rbx                 # 8-byte Reload
	testb	$1, %bl
	movq	128(%rsp), %rdi                 # 8-byte Reload
	je	.LBB3_36
# %bb.35:                               #   in Loop: Header=BB3_31 Depth=3
	movq	%r9, %rbx
	imulq	144(%rsp), %rbx                 # 8-byte Folded Reload
	movq	%r9, %rdi
	imulq	4696(%rbp), %rdi
	leaq	(%rsp,%rbx,8), %r11
	addq	$448, %r11                      # imm = 0x1C0
	movq	248(%rsp), %rbx                 # 8-byte Reload
	leaq	(%rbx,%rdi,8), %r10
	movq	%rsi, %rdi
	imulq	136(%rsp), %rdi                 # 8-byte Folded Reload
	movss	(%r11,%rdi,8), %xmm0            # xmm0 = mem[0],zero,zero,zero
	imulq	4688(%rbp), %rsi
	movss	%xmm0, (%r10,%rsi,8)
	movss	4(%r11,%rdi,8), %xmm0           # xmm0 = mem[0],zero,zero,zero
	movq	120(%rsp), %rbx                 # 8-byte Reload
	movq	128(%rsp), %rdi                 # 8-byte Reload
	movss	%xmm0, 4(%r10,%rsi,8)
	movq	160(%rsp), %r10                 # 8-byte Reload
	jmp	.LBB3_36
	.p2align	4, 0x90
.LBB3_39:                               #   in Loop: Header=BB3_24 Depth=2
	testq	%rdi, %rdi
	movq	160(%rsp), %r10                 # 8-byte Reload
	jle	.LBB3_45
# %bb.40:                               #   in Loop: Header=BB3_24 Depth=2
	leaq	452(%rsp), %rdi
	movq	240(%rsp), %rcx                 # 8-byte Reload
	xorl	%edx, %edx
	jmp	.LBB3_41
	.p2align	4, 0x90
.LBB3_44:                               #   in Loop: Header=BB3_41 Depth=3
	addq	$1, %rdx
	addq	%r10, %rcx
	addq	264(%rsp), %rdi                 # 8-byte Folded Reload
	cmpq	128(%rsp), %rdx                 # 8-byte Folded Reload
	movq	120(%rsp), %rbx                 # 8-byte Reload
	je	.LBB3_45
.LBB3_41:                               #   Parent Loop BB3_22 Depth=1
                                        #     Parent Loop BB3_24 Depth=2
                                        # =>    This Loop Header: Depth=3
                                        #         Child Loop BB3_43 Depth 4
	testq	%rbx, %rbx
	jle	.LBB3_44
# %bb.42:                               #   in Loop: Header=BB3_41 Depth=3
	movq	4672(%rbp), %rax
	movss	4(%rax), %xmm1                  # xmm1 = mem[0],zero,zero,zero
	movq	%rdi, %rax
	movq	256(%rsp), %rsi                 # 8-byte Reload
	movq	120(%rsp), %rbx                 # 8-byte Reload
	.p2align	4, 0x90
.LBB3_43:                               #   Parent Loop BB3_22 Depth=1
                                        #     Parent Loop BB3_24 Depth=2
                                        #       Parent Loop BB3_41 Depth=3
                                        # =>      This Inner Loop Header: Depth=4
	movss	(%rsi,%rcx), %xmm2              # xmm2 = mem[0],zero,zero,zero
	movss	4(%rsi,%rcx), %xmm3             # xmm3 = mem[0],zero,zero,zero
	movaps	%xmm0, %xmm4
	mulss	%xmm2, %xmm4
	addss	-4(%rax), %xmm4
	movaps	%xmm1, %xmm5
	mulss	%xmm3, %xmm5
	mulss	%xmm1, %xmm2
	addss	(%rax), %xmm2
	subss	%xmm5, %xmm4
	mulss	%xmm0, %xmm3
	addss	%xmm2, %xmm3
	movss	%xmm4, (%rsi,%rcx)
	movss	%xmm3, 4(%rsi,%rcx)
	addq	%r13, %rsi
	addq	%r12, %rax
	addq	$-1, %rbx
	jne	.LBB3_43
	jmp	.LBB3_44
.LBB3_47:
	movaps	4448(%rbp), %xmm6               # 16-byte Reload
	leaq	4472(%rbp), %rsp
	popq	%rbx
	popq	%rdi
	popq	%rsi
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
	popq	%rbp
	retq
	.seh_endproc
                                        # -- End function
	.def	 bli_zgemm_ker_var2;
	.scl	2;
	.type	32;
	.endef
	.globl	bli_zgemm_ker_var2              # -- Begin function bli_zgemm_ker_var2
	.p2align	4, 0x90
bli_zgemm_ker_var2:                     # @bli_zgemm_ker_var2
.seh_proc bli_zgemm_ker_var2
# %bb.0:
	pushq	%rbp
	.seh_pushreg %rbp
	pushq	%r15
	.seh_pushreg %r15
	pushq	%r14
	.seh_pushreg %r14
	pushq	%r13
	.seh_pushreg %r13
	pushq	%r12
	.seh_pushreg %r12
	pushq	%rsi
	.seh_pushreg %rsi
	pushq	%rdi
	.seh_pushreg %rdi
	pushq	%rbx
	.seh_pushreg %rbx
	movl	$4600, %eax                     # imm = 0x11F8
	callq	__chkstk
	subq	%rax, %rsp
	.seh_stackalloc 4600
	leaq	128(%rsp), %rbp
	.seh_setframe %rbp, 128
	movapd	%xmm6, 4448(%rbp)               # 16-byte Spill
	.seh_savexmm %xmm6, 4576
	.seh_endprologue
	andq	$-64, %rsp
	movq	4704(%rbp), %rbx
	xorl	%eax, %eax
	cmpl	$6, 5064(%rbx)
	sete	%al
	cmpb	$0, 1074(%rax,%rbx)
	movq	4656(%rbp), %rdi
	movl	$1, %eax
	movq	%rdi, %rbx
	cmoveq	%rax, %rbx
	movq	%rbx, 128(%rsp)                 # 8-byte Spill
	cmoveq	4616(%rbp), %rax
	movq	%rax, 152(%rsp)                 # 8-byte Spill
	testq	%r8, %r8
	je	.LBB4_47
# %bb.1:
	testq	%r9, %r9
	je	.LBB4_47
# %bb.2:
	cmpq	$0, 4576(%rbp)
	je	.LBB4_47
# %bb.3:
	movq	%r8, 136(%rsp)                  # 8-byte Spill
	movl	%ecx, 144(%rsp)                 # 4-byte Spill
	movl	%edx, 120(%rsp)                 # 4-byte Spill
	movq	4704(%rbp), %rax
	movq	776(%rax), %rax
	movq	%rax, 192(%rsp)                 # 8-byte Spill
	movq	BLIS_ZERO+64(%rip), %rax
	movq	%rax, 184(%rsp)                 # 8-byte Spill
	testq	%rdi, %rdi
	jle	.LBB4_12
# %bb.4:
	movq	4616(%rbp), %rbx
	leaq	-1(%rbx), %r8
	movl	%ebx, %r14d
	andl	$3, %r14d
	andq	$-4, %rbx
	movq	128(%rsp), %rdx                 # 8-byte Reload
	movq	%rdx, %rdi
	shlq	$5, %rdi
	movq	152(%rsp), %r15                 # 8-byte Reload
	shlq	$4, %r15
	movq	%rdx, %rax
	shlq	$6, %rax
	shlq	$4, %rdx
	leaq	(%rdx,%rdx,2), %rsi
	leaq	448(%rsp), %r11
	xorl	%r13d, %r13d
	xorpd	%xmm0, %xmm0
	jmp	.LBB4_5
	.p2align	4, 0x90
.LBB4_11:                               #   in Loop: Header=BB4_5 Depth=1
	addq	$1, %r13
	addq	%r15, %r11
	cmpq	4656(%rbp), %r13
	je	.LBB4_12
.LBB4_5:                                # =>This Loop Header: Depth=1
                                        #     Child Loop BB4_15 Depth 2
                                        #     Child Loop BB4_10 Depth 2
	cmpq	$0, 4616(%rbp)
	jle	.LBB4_11
# %bb.6:                                #   in Loop: Header=BB4_5 Depth=1
	cmpq	$3, %r8
	jae	.LBB4_14
# %bb.7:                                #   in Loop: Header=BB4_5 Depth=1
	xorl	%r10d, %r10d
	jmp	.LBB4_8
	.p2align	4, 0x90
.LBB4_14:                               #   in Loop: Header=BB4_5 Depth=1
	movq	%r11, %r12
	xorl	%r10d, %r10d
	.p2align	4, 0x90
.LBB4_15:                               #   Parent Loop BB4_5 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
	movapd	%xmm0, (%r12)
	movapd	%xmm0, (%r12,%rdx)
	movapd	%xmm0, (%r12,%rdi)
	addq	$4, %r10
	movapd	%xmm0, (%r12,%rsi)
	addq	%rax, %r12
	cmpq	%r10, %rbx
	jne	.LBB4_15
.LBB4_8:                                #   in Loop: Header=BB4_5 Depth=1
	testq	%r14, %r14
	je	.LBB4_11
# %bb.9:                                #   in Loop: Header=BB4_5 Depth=1
	imulq	128(%rsp), %r10                 # 8-byte Folded Reload
	shlq	$4, %r10
	addq	%r11, %r10
	movq	%r14, %rcx
	.p2align	4, 0x90
.LBB4_10:                               #   Parent Loop BB4_5 Depth=1
                                        # =>  This Inner Loop Header: Depth=2
	movapd	%xmm0, (%r10)
	addq	%rdx, %r10
	addq	$-1, %rcx
	jne	.LBB4_10
	jmp	.LBB4_11
.LBB4_12:
	movq	%r9, %rax
	movq	4656(%rbp), %rcx
	orq	%rcx, %rax
	shrq	$32, %rax
	je	.LBB4_13
# %bb.16:
	movq	%r9, %rax
	cqto
	idivq	%rcx
	movq	%rdx, 160(%rsp)                 # 8-byte Spill
	movq	%rax, %rsi
	jmp	.LBB4_17
.LBB4_13:
	movl	%r9d, %eax
	xorl	%edx, %edx
	divl	%ecx
                                        # kill: def $edx killed $edx def $rdx
	movq	%rdx, 160(%rsp)                 # 8-byte Spill
	movl	%eax, %esi
.LBB4_17:
	movq	4616(%rbp), %rcx
	movq	136(%rsp), %rax                 # 8-byte Reload
	movq	4720(%rbp), %r9
	movq	4648(%rbp), %r8
	movq	4608(%rbp), %rdi
	movq	%rax, %rdx
	orq	%rcx, %rdx
	shrq	$32, %rdx
	je	.LBB4_18
# %bb.19:
	cqto
	idivq	%rcx
	movq	%rdx, %rcx
	movq	%rax, %rbx
	jmp	.LBB4_20
.LBB4_18:
                                        # kill: def $eax killed $eax killed $rax
	xorl	%edx, %edx
	divl	%ecx
	movl	%edx, %ecx
	movl	%eax, %ebx
.LBB4_20:
	cmpq	$1, 160(%rsp)                   # 8-byte Folded Reload
	movq	%rsi, %rdx
	sbbq	$-1, %rdx
	movq	%rcx, 336(%rsp)                 # 8-byte Spill
	cmpq	$1, %rcx
	sbbq	$-1, %rbx
	movl	144(%rsp), %eax                 # 4-byte Reload
	movl	%eax, 392(%rsp)
	movl	120(%rsp), %eax                 # 4-byte Reload
	movl	%eax, 396(%rsp)
	movq	%rdi, 416(%rsp)
	movq	%r8, 424(%rsp)
	movq	48(%r9), %rsi
	leaq	280(%rsp), %rax
	movq	%rax, 40(%rsp)
	leaq	384(%rsp), %rax
	movq	%rax, 32(%rsp)
	movl	$1, %r8d
	movq	%r9, %rcx
	movq	%rdx, 176(%rsp)                 # 8-byte Spill
	xorl	%r9d, %r9d
	callq	bli_thread_range_sub
	leaq	272(%rsp), %rax
	movq	%rax, 40(%rsp)
	leaq	376(%rsp), %rax
	movq	%rax, 32(%rsp)
	movl	$1, %r8d
	movq	%rsi, %rcx
	movq	%rbx, 200(%rsp)                 # 8-byte Spill
	movq	%rbx, %rdx
	xorl	%r9d, %r9d
	callq	bli_thread_range_sub
	movq	384(%rsp), %r10
	movq	280(%rsp), %rax
	cmpq	%rax, %r10
	movq	4656(%rbp), %r11
	movq	4616(%rbp), %rdi
	jge	.LBB4_47
# %bb.21:
	movq	4696(%rbp), %r8
	movq	4688(%rbp), %r15
	addq	$24, 184(%rsp)                  # 8-byte Folded Spill
	movq	%r15, %rdx
	imulq	%rdi, %rdx
	movq	%r8, %rcx
	imulq	%r11, %rcx
	addq	$-1, 176(%rsp)                  # 8-byte Folded Spill
	addq	$-1, 200(%rsp)                  # 8-byte Folded Spill
	movq	%r10, %rbx
	imulq	%r8, %rbx
	imulq	%r11, %rbx
	shlq	$4, %rbx
	movq	%rdx, 328(%rsp)                 # 8-byte Spill
	shlq	$4, %rdx
	movq	%rdx, 216(%rsp)                 # 8-byte Spill
	shlq	$4, %r8
	movq	%r8, 168(%rsp)                  # 8-byte Spill
	movq	%rcx, 304(%rsp)                 # 8-byte Spill
	shlq	$4, %rcx
	movq	%rcx, 312(%rsp)                 # 8-byte Spill
	movq	%r15, %r14
	shlq	$4, %r14
	movq	152(%rsp), %rcx                 # 8-byte Reload
	shlq	$4, %rcx
	movq	%rcx, 136(%rsp)                 # 8-byte Spill
	movq	128(%rsp), %r12                 # 8-byte Reload
	movq	%r12, %r13
	shlq	$4, %r13
	shlq	$5, %r12
	movq	%rdi, %rcx
	shlq	$4, %rcx
	movq	%rcx, 288(%rsp)                 # 8-byte Spill
	shlq	$5, %r15
	xorpd	%xmm6, %xmm6
	movq	4680(%rbp), %rdx
	movq	272(%rsp), %rcx
	movq	%rbx, 296(%rsp)                 # 8-byte Spill
	addq	%rdx, %rbx
	movq	%rdx, 264(%rsp)                 # 8-byte Spill
	jmp	.LBB4_22
	.p2align	4, 0x90
.LBB4_46:                               #   in Loop: Header=BB4_22 Depth=1
	addq	$1, %r10
	movq	312(%rsp), %rdx                 # 8-byte Reload
	addq	%rdx, 264(%rsp)                 # 8-byte Folded Spill
	movq	320(%rsp), %rbx                 # 8-byte Reload
	addq	%rdx, %rbx
	cmpq	%rax, %r10
	jge	.LBB4_47
.LBB4_22:                               # =>This Loop Header: Depth=1
                                        #     Child Loop BB4_24 Depth 2
                                        #       Child Loop BB4_41 Depth 3
                                        #         Child Loop BB4_43 Depth 4
                                        #       Child Loop BB4_31 Depth 3
                                        #         Child Loop BB4_38 Depth 4
	cmpq	%r10, 176(%rsp)                 # 8-byte Folded Reload
	movq	160(%rsp), %rsi                 # 8-byte Reload
	movq	%rsi, %rdx
	cmovneq	%r11, %rdx
	testq	%rsi, %rsi
	cmoveq	%r11, %rdx
	movq	376(%rsp), %r8
	cmpq	%rcx, %r8
	movq	%rbx, 320(%rsp)                 # 8-byte Spill
	jge	.LBB4_46
# %bb.23:                               #   in Loop: Header=BB4_22 Depth=1
	movq	%rdx, 144(%rsp)                 # 8-byte Spill
	movq	%r10, %r9
	movq	4664(%rbp), %rdx
	imulq	%rdx, %r9
	shlq	$4, %r9
	addq	4632(%rbp), %r9
	movq	304(%rsp), %rsi                 # 8-byte Reload
	imulq	%r10, %rsi
	shlq	$4, %rsi
	addq	4680(%rbp), %rsi
	movq	%rsi, 352(%rsp)                 # 8-byte Spill
	shlq	$4, %rdx
	addq	%r9, %rdx
	movq	%rdx, 344(%rsp)                 # 8-byte Spill
	movq	216(%rsp), %rsi                 # 8-byte Reload
	imulq	%r8, %rsi
	movq	296(%rsp), %rdx                 # 8-byte Reload
	addq	%rsi, %rdx
	movq	%rdx, 240(%rsp)                 # 8-byte Spill
	movq	288(%rsp), %rdx                 # 8-byte Reload
	imulq	%r8, %rdx
	addq	$16, %rdx
	imulq	4688(%rbp), %rdx
	addq	%rbx, %rdx
	movq	%rdx, 224(%rsp)                 # 8-byte Spill
	addq	%rbx, %rsi
	movq	%rsi, 232(%rsp)                 # 8-byte Spill
	movq	%r9, 208(%rsp)                  # 8-byte Spill
	movq	%r9, 248(%rsp)                  # 8-byte Spill
	movq	%r10, 360(%rsp)                 # 8-byte Spill
	jmp	.LBB4_24
	.p2align	4, 0x90
.LBB4_26:                               #   in Loop: Header=BB4_24 Depth=2
	movq	4704(%rbp), %rax
	movq	%rax, 72(%rsp)
	leaq	392(%rsp), %rax
	movq	%rax, 64(%rsp)
	movq	4696(%rbp), %rax
	movq	%rax, 56(%rsp)
	movq	4688(%rbp), %rax
	movq	%rax, 48(%rsp)
	movq	256(%rsp), %rax                 # 8-byte Reload
	movq	%rax, 40(%rsp)
	movq	4672(%rbp), %rax
	movq	%rax, 32(%rsp)
	movq	4576(%rbp), %rcx
	movq	4584(%rbp), %rdx
	movq	%r11, %r8
	movq	208(%rsp), %r9                  # 8-byte Reload
	callq	*192(%rsp)                      # 8-byte Folded Reload
	movq	4656(%rbp), %r11
.LBB4_45:                               #   in Loop: Header=BB4_24 Depth=2
	movq	368(%rsp), %r8                  # 8-byte Reload
	addq	$1, %r8
	movq	272(%rsp), %rcx
	movq	280(%rsp), %rax
	movq	216(%rsp), %rdx                 # 8-byte Reload
	addq	%rdx, 240(%rsp)                 # 8-byte Folded Spill
	addq	%rdx, 224(%rsp)                 # 8-byte Folded Spill
	addq	%rdx, 232(%rsp)                 # 8-byte Folded Spill
	cmpq	%rcx, %r8
	movq	4616(%rbp), %rdi
	movq	360(%rsp), %r10                 # 8-byte Reload
	jge	.LBB4_46
.LBB4_24:                               #   Parent Loop BB4_22 Depth=1
                                        # =>  This Loop Header: Depth=2
                                        #       Child Loop BB4_41 Depth 3
                                        #         Child Loop BB4_43 Depth 4
                                        #       Child Loop BB4_31 Depth 3
                                        #         Child Loop BB4_38 Depth 4
	movq	%rdi, %r9
	movq	%r8, %r11
	movq	4624(%rbp), %rdi
	imulq	%rdi, %r11
	shlq	$4, %r11
	movq	4592(%rbp), %rdx
	addq	%rdx, %r11
	movq	%rcx, %rdx
	movq	328(%rsp), %rcx                 # 8-byte Reload
	imulq	%r8, %rcx
	shlq	$4, %rcx
	addq	352(%rsp), %rcx                 # 8-byte Folded Reload
	movq	%rcx, 256(%rsp)                 # 8-byte Spill
	cmpq	%r8, 200(%rsp)                  # 8-byte Folded Reload
	movq	336(%rsp), %rbx                 # 8-byte Reload
	movq	%r10, %rcx
	movq	%rbx, %r10
	cmovneq	%r9, %r10
	movq	248(%rsp), %rsi                 # 8-byte Reload
	testq	%rbx, %rbx
	cmoveq	%r9, %r10
	movq	%rdi, %rbx
	shlq	$4, %rbx
	addq	%r11, %rbx
	addq	$-1, %rdx
	addq	$-1, %rax
	cmpq	%rcx, %rax
	movq	344(%rsp), %rax                 # 8-byte Reload
	cmoveq	4632(%rbp), %rax
	movq	%r8, 368(%rsp)                  # 8-byte Spill
	cmpq	%r8, %rdx
	cmovneq	%rsi, %rax
	cmoveq	4592(%rbp), %rbx
	movq	%r10, %rdi
	movq	%rbx, 400(%rsp)
	movq	%rax, 408(%rsp)
	movq	144(%rsp), %rsi                 # 8-byte Reload
	cmpq	4656(%rbp), %rsi
	movq	%rax, 248(%rsp)                 # 8-byte Spill
	jne	.LBB4_27
# %bb.25:                               #   in Loop: Header=BB4_24 Depth=2
	cmpq	%r9, %rdi
	je	.LBB4_26
.LBB4_27:                               #   in Loop: Header=BB4_24 Depth=2
	movq	4704(%rbp), %rax
	movq	%rax, 72(%rsp)
	leaq	392(%rsp), %rax
	movq	%rax, 64(%rsp)
	movq	152(%rsp), %rax                 # 8-byte Reload
	movq	%rax, 56(%rsp)
	movq	128(%rsp), %rax                 # 8-byte Reload
	movq	%rax, 48(%rsp)
	leaq	448(%rsp), %rax
	movq	%rax, 40(%rsp)
	movq	184(%rsp), %rax                 # 8-byte Reload
	movq	%rax, 32(%rsp)
	movq	4576(%rbp), %rcx
	movq	4584(%rbp), %rdx
	movq	%r11, %r8
	movq	208(%rsp), %r9                  # 8-byte Reload
	callq	*192(%rsp)                      # 8-byte Folded Reload
	movq	4672(%rbp), %rax
	movsd	(%rax), %xmm0                   # xmm0 = mem[0],zero
	ucomisd	%xmm6, %xmm0
	movq	%rdi, 120(%rsp)                 # 8-byte Spill
	jne	.LBB4_39
	jp	.LBB4_39
# %bb.28:                               #   in Loop: Header=BB4_24 Depth=2
	movsd	8(%rax), %xmm1                  # xmm1 = mem[0],zero
	ucomisd	%xmm6, %xmm1
	jne	.LBB4_39
	jp	.LBB4_39
# %bb.29:                               #   in Loop: Header=BB4_24 Depth=2
	testq	%rsi, %rsi
	movq	4656(%rbp), %r11
	movq	168(%rsp), %r10                 # 8-byte Reload
	jle	.LBB4_45
# %bb.30:                               #   in Loop: Header=BB4_24 Depth=2
	movq	%rdi, %rax
	andq	$-2, %rax
	movq	232(%rsp), %rcx                 # 8-byte Reload
	movq	224(%rsp), %rdx                 # 8-byte Reload
	leaq	448(%rsp), %r8
	xorl	%r9d, %r9d
	jmp	.LBB4_31
	.p2align	4, 0x90
.LBB4_36:                               #   in Loop: Header=BB4_31 Depth=3
	addq	$1, %r9
	addq	136(%rsp), %r8                  # 8-byte Folded Reload
	addq	%r10, %rdx
	addq	%r10, %rcx
	cmpq	144(%rsp), %r9                  # 8-byte Folded Reload
	je	.LBB4_45
.LBB4_31:                               #   Parent Loop BB4_22 Depth=1
                                        #     Parent Loop BB4_24 Depth=2
                                        # =>    This Loop Header: Depth=3
                                        #         Child Loop BB4_38 Depth 4
	testq	%rdi, %rdi
	jle	.LBB4_36
# %bb.32:                               #   in Loop: Header=BB4_31 Depth=3
	cmpq	$1, %rdi
	jne	.LBB4_37
# %bb.33:                               #   in Loop: Header=BB4_31 Depth=3
	xorl	%ebx, %ebx
	jmp	.LBB4_34
	.p2align	4, 0x90
.LBB4_37:                               #   in Loop: Header=BB4_31 Depth=3
	xorl	%edi, %edi
	movq	%r8, %rsi
	xorl	%ebx, %ebx
	.p2align	4, 0x90
.LBB4_38:                               #   Parent Loop BB4_22 Depth=1
                                        #     Parent Loop BB4_24 Depth=2
                                        #       Parent Loop BB4_31 Depth=3
                                        # =>      This Inner Loop Header: Depth=4
	movaps	(%rsi), %xmm0
	movups	%xmm0, (%rcx,%rdi)
	movapd	(%rsi,%r13), %xmm0
	movupd	%xmm0, (%rdx,%rdi)
	addq	$2, %rbx
	addq	%r12, %rsi
	addq	%r15, %rdi
	cmpq	%rbx, %rax
	jne	.LBB4_38
.LBB4_34:                               #   in Loop: Header=BB4_31 Depth=3
	movq	120(%rsp), %rdi                 # 8-byte Reload
	testb	$1, %dil
	je	.LBB4_36
# %bb.35:                               #   in Loop: Header=BB4_31 Depth=3
	movq	%r9, %rdi
	imulq	152(%rsp), %rdi                 # 8-byte Folded Reload
	movq	%r9, %rsi
	imulq	4696(%rbp), %rsi
	shlq	$4, %rdi
	shlq	$4, %rsi
	addq	256(%rsp), %rsi                 # 8-byte Folded Reload
	leaq	(%rsp,%rdi), %r10
	addq	$448, %r10                      # imm = 0x1C0
	movq	%rbx, %rdi
	imulq	128(%rsp), %rdi                 # 8-byte Folded Reload
	shlq	$4, %rdi
	imulq	4688(%rbp), %rbx
	shlq	$4, %rbx
	movapd	(%r10,%rdi), %xmm0
	movq	120(%rsp), %rdi                 # 8-byte Reload
	movq	168(%rsp), %r10                 # 8-byte Reload
	movupd	%xmm0, (%rsi,%rbx)
	jmp	.LBB4_36
	.p2align	4, 0x90
.LBB4_39:                               #   in Loop: Header=BB4_24 Depth=2
	testq	%rsi, %rsi
	movq	4656(%rbp), %r11
	movq	168(%rsp), %r8                  # 8-byte Reload
	jle	.LBB4_45
# %bb.40:                               #   in Loop: Header=BB4_24 Depth=2
	leaq	448(%rsp), %rbx
	movq	240(%rsp), %rcx                 # 8-byte Reload
	xorl	%edx, %edx
	jmp	.LBB4_41
	.p2align	4, 0x90
.LBB4_44:                               #   in Loop: Header=BB4_41 Depth=3
	addq	$1, %rdx
	addq	%r8, %rcx
	addq	136(%rsp), %rbx                 # 8-byte Folded Reload
	cmpq	144(%rsp), %rdx                 # 8-byte Folded Reload
	movq	120(%rsp), %rdi                 # 8-byte Reload
	je	.LBB4_45
.LBB4_41:                               #   Parent Loop BB4_22 Depth=1
                                        #     Parent Loop BB4_24 Depth=2
                                        # =>    This Loop Header: Depth=3
                                        #         Child Loop BB4_43 Depth 4
	testq	%rdi, %rdi
	jle	.LBB4_44
# %bb.42:                               #   in Loop: Header=BB4_41 Depth=3
	movq	4672(%rbp), %rax
	movsd	8(%rax), %xmm1                  # xmm1 = mem[0],zero
	movapd	%xmm0, %xmm2
	unpcklpd	%xmm1, %xmm2                    # xmm2 = xmm2[0],xmm1[0]
	unpcklpd	%xmm0, %xmm1                    # xmm1 = xmm1[0],xmm0[0]
	movq	%rbx, %rax
	movq	264(%rsp), %rdi                 # 8-byte Reload
	movq	120(%rsp), %rsi                 # 8-byte Reload
	.p2align	4, 0x90
.LBB4_43:                               #   Parent Loop BB4_22 Depth=1
                                        #     Parent Loop BB4_24 Depth=2
                                        #       Parent Loop BB4_41 Depth=3
                                        # =>      This Inner Loop Header: Depth=4
	movsd	(%rdi,%rcx), %xmm3              # xmm3 = mem[0],zero
	movsd	8(%rdi,%rcx), %xmm4             # xmm4 = mem[0],zero
	unpcklpd	%xmm3, %xmm3                    # xmm3 = xmm3[0,0]
	mulpd	%xmm2, %xmm3
	addpd	(%rax), %xmm3
	unpcklpd	%xmm4, %xmm4                    # xmm4 = xmm4[0,0]
	mulpd	%xmm1, %xmm4
	movapd	%xmm3, %xmm5
	subpd	%xmm4, %xmm5
	addpd	%xmm3, %xmm4
	movsd	%xmm5, %xmm4                    # xmm4 = xmm5[0],xmm4[1]
	movupd	%xmm4, (%rdi,%rcx)
	addq	%r14, %rdi
	addq	%r13, %rax
	addq	$-1, %rsi
	jne	.LBB4_43
	jmp	.LBB4_44
.LBB4_47:
	movaps	4448(%rbp), %xmm6               # 16-byte Reload
	leaq	4472(%rbp), %rsp
	popq	%rbx
	popq	%rdi
	popq	%rsi
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
	popq	%rbp
	retq
	.seh_endproc
                                        # -- End function
	.section	.rdata,"dr"
	.p2align	4                               # @ftypes
ftypes:
	.quad	bli_sgemm_ker_var2
	.quad	bli_cgemm_ker_var2
	.quad	bli_dgemm_ker_var2
	.quad	bli_zgemm_ker_var2

	.section	.drectve,"yn"
	.ascii	" /DEFAULTLIB:msvcrt.lib"
	.addrsig
	.addrsig_sym bli_sgemm_ker_var2
	.addrsig_sym bli_dgemm_ker_var2
	.addrsig_sym bli_cgemm_ker_var2
	.addrsig_sym bli_zgemm_ker_var2
	.addrsig_sym BLIS_ZERO
	.globl	_fltused
Contents of bli_dgemm_skx_asm_16x14.s
	.text
	.def	 @feat.00;
	.scl	3;
	.type	0;
	.endef
	.globl	@feat.00
.set @feat.00, 0
	.file	"bli_dgemm_skx_asm_16x14.c"
	.def	 bli_dgemm_skx_asm_16x14;
	.scl	2;
	.type	32;
	.endef
	.globl	bli_dgemm_skx_asm_16x14         # -- Begin function bli_dgemm_skx_asm_16x14
	.p2align	4, 0x90
bli_dgemm_skx_asm_16x14:                # @bli_dgemm_skx_asm_16x14
.seh_proc bli_dgemm_skx_asm_16x14
# %bb.0:
	pushq	%r15
	.seh_pushreg %r15
	pushq	%r14
	.seh_pushreg %r14
	pushq	%r13
	.seh_pushreg %r13
	pushq	%r12
	.seh_pushreg %r12
	pushq	%rsi
	.seh_pushreg %rsi
	pushq	%rdi
	.seh_pushreg %rdi
	pushq	%rbx
	.seh_pushreg %rbx
	subq	$56, %rsp
	.seh_stackalloc 56
	.seh_endprologue
	movq	176(%rsp), %r10
	movq	168(%rsp), %rax
	movq	%r9, 48(%rsp)
	movq	%r8, 40(%rsp)
	movq	%rdx, 32(%rsp)
	leaq	offsets(%rip), %rdx
	movq	%rdx, 24(%rsp)
	movq	%rcx, 16(%rsp)
	shlq	$3, %rax
	movq	%rax, 8(%rsp)
	shlq	$3, %r10
	movq	%r10, (%rsp)
	#APP
	vxorpd	%ymm4, %ymm4, %ymm4
	vxorpd	%ymm5, %ymm5, %ymm5
	vxorpd	%ymm6, %ymm6, %ymm6
	vxorpd	%ymm7, %ymm7, %ymm7
	vxorpd	%ymm8, %ymm8, %ymm8
	vxorpd	%ymm9, %ymm9, %ymm9
	vxorpd	%ymm10, %ymm10, %ymm10
	vxorpd	%ymm11, %ymm11, %ymm11
	vxorpd	%ymm12, %ymm12, %ymm12
	vxorpd	%ymm13, %ymm13, %ymm13
	vxorpd	%ymm14, %ymm14, %ymm14
	vxorpd	%ymm15, %ymm15, %ymm15
	vxorpd	%ymm16, %ymm16, %ymm16
	vxorpd	%ymm17, %ymm17, %ymm17
	vxorpd	%ymm18, %ymm18, %ymm18
	vxorpd	%ymm19, %ymm19, %ymm19
	vxorpd	%ymm20, %ymm20, %ymm20
	vxorpd	%ymm21, %ymm21, %ymm21
	vxorpd	%ymm22, %ymm22, %ymm22
	vxorpd	%ymm23, %ymm23, %ymm23
	vxorpd	%ymm24, %ymm24, %ymm24
	vxorpd	%ymm25, %ymm25, %ymm25
	vxorpd	%ymm26, %ymm26, %ymm26
	vxorpd	%ymm27, %ymm27, %ymm27
	vxorpd	%ymm28, %ymm28, %ymm28
	vxorpd	%ymm29, %ymm29, %ymm29
	vxorpd	%ymm30, %ymm30, %ymm30
	vxorpd	%ymm31, %ymm31, %ymm31
	movq	16(%rsp), %rsi
	movq	40(%rsp), %rax
	movq	48(%rsp), %rbx
	movq	160(%rsp), %rcx
	leaq	(%rsi,%rsi,2), %rdx
	leaq	(,%rdx,4), %rdx
	leaq	(%rdx,%rsi,2), %rdx
	leaq	-128(%rbx,%rdx,8), %rdx
	leaq	63(%rcx), %r9
	vmovapd	(%rax), %zmm0
	vmovapd	64(%rax), %zmm1
	leaq	128(%rax), %rax
	movq	8(%rsp), %r12
	movq	(%rsp), %r10
	movq	%rsi, %rdi
	andq	$3, %rsi
	sarq	$2, %rdi
	subq	$19, %rdi
	jle	.LK_LE_800
	.p2align	5, 0x90
.LLOOP10:
	prefetcht0	512(%rax)
	vbroadcastsd	(%rbx), %zmm2
	vbroadcastsd	8(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm4     # zmm4 = (zmm0 * zmm2) + zmm4
	vfmadd231pd	%zmm2, %zmm1, %zmm5     # zmm5 = (zmm1 * zmm2) + zmm5
	vfmadd231pd	%zmm3, %zmm0, %zmm6     # zmm6 = (zmm0 * zmm3) + zmm6
	vfmadd231pd	%zmm3, %zmm1, %zmm7     # zmm7 = (zmm1 * zmm3) + zmm7
	vbroadcastsd	16(%rbx), %zmm2
	vbroadcastsd	24(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm8     # zmm8 = (zmm0 * zmm2) + zmm8
	vfmadd231pd	%zmm2, %zmm1, %zmm9     # zmm9 = (zmm1 * zmm2) + zmm9
	vfmadd231pd	%zmm3, %zmm0, %zmm10    # zmm10 = (zmm0 * zmm3) + zmm10
	vfmadd231pd	%zmm3, %zmm1, %zmm11    # zmm11 = (zmm1 * zmm3) + zmm11
	prefetcht0	448(%rbx)
	vbroadcastsd	32(%rbx), %zmm2
	vbroadcastsd	40(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm12    # zmm12 = (zmm0 * zmm2) + zmm12
	vfmadd231pd	%zmm2, %zmm1, %zmm13    # zmm13 = (zmm1 * zmm2) + zmm13
	vfmadd231pd	%zmm3, %zmm0, %zmm14    # zmm14 = (zmm0 * zmm3) + zmm14
	vfmadd231pd	%zmm3, %zmm1, %zmm15    # zmm15 = (zmm1 * zmm3) + zmm15
	vbroadcastsd	48(%rbx), %zmm2
	vbroadcastsd	56(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm16    # zmm16 = (zmm0 * zmm2) + zmm16
	vfmadd231pd	%zmm2, %zmm1, %zmm17    # zmm17 = (zmm1 * zmm2) + zmm17
	vfmadd231pd	%zmm3, %zmm0, %zmm18    # zmm18 = (zmm0 * zmm3) + zmm18
	vfmadd231pd	%zmm3, %zmm1, %zmm19    # zmm19 = (zmm1 * zmm3) + zmm19
	prefetcht0	576(%rax)
	vbroadcastsd	64(%rbx), %zmm2
	vbroadcastsd	72(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm20    # zmm20 = (zmm0 * zmm2) + zmm20
	vfmadd231pd	%zmm2, %zmm1, %zmm21    # zmm21 = (zmm1 * zmm2) + zmm21
	vfmadd231pd	%zmm3, %zmm0, %zmm22    # zmm22 = (zmm0 * zmm3) + zmm22
	vfmadd231pd	%zmm3, %zmm1, %zmm23    # zmm23 = (zmm1 * zmm3) + zmm23
	vbroadcastsd	80(%rbx), %zmm2
	vbroadcastsd	88(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm24    # zmm24 = (zmm0 * zmm2) + zmm24
	vfmadd231pd	%zmm2, %zmm1, %zmm25    # zmm25 = (zmm1 * zmm2) + zmm25
	vfmadd231pd	%zmm3, %zmm0, %zmm26    # zmm26 = (zmm0 * zmm3) + zmm26
	vfmadd231pd	%zmm3, %zmm1, %zmm27    # zmm27 = (zmm1 * zmm3) + zmm27
	prefetcht0	504(%rbx)
	vbroadcastsd	96(%rbx), %zmm2
	vbroadcastsd	104(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm28    # zmm28 = (zmm0 * zmm2) + zmm28
	vfmadd231pd	%zmm2, %zmm1, %zmm29    # zmm29 = (zmm1 * zmm2) + zmm29
	vfmadd231pd	%zmm3, %zmm0, %zmm30    # zmm30 = (zmm0 * zmm3) + zmm30
	vfmadd231pd	%zmm3, %zmm1, %zmm31    # zmm31 = (zmm1 * zmm3) + zmm31
	vmovapd	(%rax), %zmm0
	vmovapd	64(%rax), %zmm1
	prefetcht1	(%rdx)
	prefetcht0	640(%rax)
	vbroadcastsd	112(%rbx), %zmm2
	vbroadcastsd	120(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm4     # zmm4 = (zmm0 * zmm2) + zmm4
	vfmadd231pd	%zmm2, %zmm1, %zmm5     # zmm5 = (zmm1 * zmm2) + zmm5
	vfmadd231pd	%zmm3, %zmm0, %zmm6     # zmm6 = (zmm0 * zmm3) + zmm6
	vfmadd231pd	%zmm3, %zmm1, %zmm7     # zmm7 = (zmm1 * zmm3) + zmm7
	vbroadcastsd	128(%rbx), %zmm2
	vbroadcastsd	136(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm8     # zmm8 = (zmm0 * zmm2) + zmm8
	vfmadd231pd	%zmm2, %zmm1, %zmm9     # zmm9 = (zmm1 * zmm2) + zmm9
	vfmadd231pd	%zmm3, %zmm0, %zmm10    # zmm10 = (zmm0 * zmm3) + zmm10
	vfmadd231pd	%zmm3, %zmm1, %zmm11    # zmm11 = (zmm1 * zmm3) + zmm11
	prefetcht0	560(%rbx)
	vbroadcastsd	144(%rbx), %zmm2
	vbroadcastsd	152(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm12    # zmm12 = (zmm0 * zmm2) + zmm12
	vfmadd231pd	%zmm2, %zmm1, %zmm13    # zmm13 = (zmm1 * zmm2) + zmm13
	vfmadd231pd	%zmm3, %zmm0, %zmm14    # zmm14 = (zmm0 * zmm3) + zmm14
	vfmadd231pd	%zmm3, %zmm1, %zmm15    # zmm15 = (zmm1 * zmm3) + zmm15
	vbroadcastsd	160(%rbx), %zmm2
	vbroadcastsd	168(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm16    # zmm16 = (zmm0 * zmm2) + zmm16
	vfmadd231pd	%zmm2, %zmm1, %zmm17    # zmm17 = (zmm1 * zmm2) + zmm17
	vfmadd231pd	%zmm3, %zmm0, %zmm18    # zmm18 = (zmm0 * zmm3) + zmm18
	vfmadd231pd	%zmm3, %zmm1, %zmm19    # zmm19 = (zmm1 * zmm3) + zmm19
	prefetcht0	704(%rax)
	vbroadcastsd	176(%rbx), %zmm2
	vbroadcastsd	184(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm20    # zmm20 = (zmm0 * zmm2) + zmm20
	vfmadd231pd	%zmm2, %zmm1, %zmm21    # zmm21 = (zmm1 * zmm2) + zmm21
	vfmadd231pd	%zmm3, %zmm0, %zmm22    # zmm22 = (zmm0 * zmm3) + zmm22
	vfmadd231pd	%zmm3, %zmm1, %zmm23    # zmm23 = (zmm1 * zmm3) + zmm23
	vbroadcastsd	192(%rbx), %zmm2
	vbroadcastsd	200(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm24    # zmm24 = (zmm0 * zmm2) + zmm24
	vfmadd231pd	%zmm2, %zmm1, %zmm25    # zmm25 = (zmm1 * zmm2) + zmm25
	vfmadd231pd	%zmm3, %zmm0, %zmm26    # zmm26 = (zmm0 * zmm3) + zmm26
	vfmadd231pd	%zmm3, %zmm1, %zmm27    # zmm27 = (zmm1 * zmm3) + zmm27
	prefetcht0	616(%rbx)
	vbroadcastsd	208(%rbx), %zmm2
	vbroadcastsd	216(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm28    # zmm28 = (zmm0 * zmm2) + zmm28
	vfmadd231pd	%zmm2, %zmm1, %zmm29    # zmm29 = (zmm1 * zmm2) + zmm29
	vfmadd231pd	%zmm3, %zmm0, %zmm30    # zmm30 = (zmm0 * zmm3) + zmm30
	vfmadd231pd	%zmm3, %zmm1, %zmm31    # zmm31 = (zmm1 * zmm3) + zmm31
	vmovapd	128(%rax), %zmm0
	vmovapd	192(%rax), %zmm1
	subq	$1, %rdi
	prefetcht0	768(%rax)
	vbroadcastsd	224(%rbx), %zmm2
	vbroadcastsd	232(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm4     # zmm4 = (zmm0 * zmm2) + zmm4
	vfmadd231pd	%zmm2, %zmm1, %zmm5     # zmm5 = (zmm1 * zmm2) + zmm5
	vfmadd231pd	%zmm3, %zmm0, %zmm6     # zmm6 = (zmm0 * zmm3) + zmm6
	vfmadd231pd	%zmm3, %zmm1, %zmm7     # zmm7 = (zmm1 * zmm3) + zmm7
	vbroadcastsd	240(%rbx), %zmm2
	vbroadcastsd	248(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm8     # zmm8 = (zmm0 * zmm2) + zmm8
	vfmadd231pd	%zmm2, %zmm1, %zmm9     # zmm9 = (zmm1 * zmm2) + zmm9
	vfmadd231pd	%zmm3, %zmm0, %zmm10    # zmm10 = (zmm0 * zmm3) + zmm10
	vfmadd231pd	%zmm3, %zmm1, %zmm11    # zmm11 = (zmm1 * zmm3) + zmm11
	prefetcht0	672(%rbx)
	vbroadcastsd	256(%rbx), %zmm2
	vbroadcastsd	264(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm12    # zmm12 = (zmm0 * zmm2) + zmm12
	vfmadd231pd	%zmm2, %zmm1, %zmm13    # zmm13 = (zmm1 * zmm2) + zmm13
	vfmadd231pd	%zmm3, %zmm0, %zmm14    # zmm14 = (zmm0 * zmm3) + zmm14
	vfmadd231pd	%zmm3, %zmm1, %zmm15    # zmm15 = (zmm1 * zmm3) + zmm15
	vbroadcastsd	272(%rbx), %zmm2
	vbroadcastsd	280(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm16    # zmm16 = (zmm0 * zmm2) + zmm16
	vfmadd231pd	%zmm2, %zmm1, %zmm17    # zmm17 = (zmm1 * zmm2) + zmm17
	vfmadd231pd	%zmm3, %zmm0, %zmm18    # zmm18 = (zmm0 * zmm3) + zmm18
	vfmadd231pd	%zmm3, %zmm1, %zmm19    # zmm19 = (zmm1 * zmm3) + zmm19
	prefetcht0	832(%rax)
	vbroadcastsd	288(%rbx), %zmm2
	vbroadcastsd	296(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm20    # zmm20 = (zmm0 * zmm2) + zmm20
	vfmadd231pd	%zmm2, %zmm1, %zmm21    # zmm21 = (zmm1 * zmm2) + zmm21
	vfmadd231pd	%zmm3, %zmm0, %zmm22    # zmm22 = (zmm0 * zmm3) + zmm22
	vfmadd231pd	%zmm3, %zmm1, %zmm23    # zmm23 = (zmm1 * zmm3) + zmm23
	vbroadcastsd	304(%rbx), %zmm2
	vbroadcastsd	312(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm24    # zmm24 = (zmm0 * zmm2) + zmm24
	vfmadd231pd	%zmm2, %zmm1, %zmm25    # zmm25 = (zmm1 * zmm2) + zmm25
	vfmadd231pd	%zmm3, %zmm0, %zmm26    # zmm26 = (zmm0 * zmm3) + zmm26
	vfmadd231pd	%zmm3, %zmm1, %zmm27    # zmm27 = (zmm1 * zmm3) + zmm27
	prefetcht0	728(%rbx)
	vbroadcastsd	320(%rbx), %zmm2
	vbroadcastsd	328(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm28    # zmm28 = (zmm0 * zmm2) + zmm28
	vfmadd231pd	%zmm2, %zmm1, %zmm29    # zmm29 = (zmm1 * zmm2) + zmm29
	vfmadd231pd	%zmm3, %zmm0, %zmm30    # zmm30 = (zmm0 * zmm3) + zmm30
	vfmadd231pd	%zmm3, %zmm1, %zmm31    # zmm31 = (zmm1 * zmm3) + zmm31
	vmovapd	256(%rax), %zmm0
	vmovapd	320(%rax), %zmm1
	prefetcht1	64(%rdx)
	prefetcht0	896(%rax)
	vbroadcastsd	336(%rbx), %zmm2
	vbroadcastsd	344(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm4     # zmm4 = (zmm0 * zmm2) + zmm4
	vfmadd231pd	%zmm2, %zmm1, %zmm5     # zmm5 = (zmm1 * zmm2) + zmm5
	vfmadd231pd	%zmm3, %zmm0, %zmm6     # zmm6 = (zmm0 * zmm3) + zmm6
	vfmadd231pd	%zmm3, %zmm1, %zmm7     # zmm7 = (zmm1 * zmm3) + zmm7
	vbroadcastsd	352(%rbx), %zmm2
	vbroadcastsd	360(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm8     # zmm8 = (zmm0 * zmm2) + zmm8
	vfmadd231pd	%zmm2, %zmm1, %zmm9     # zmm9 = (zmm1 * zmm2) + zmm9
	vfmadd231pd	%zmm3, %zmm0, %zmm10    # zmm10 = (zmm0 * zmm3) + zmm10
	vfmadd231pd	%zmm3, %zmm1, %zmm11    # zmm11 = (zmm1 * zmm3) + zmm11
	prefetcht0	784(%rbx)
	vbroadcastsd	368(%rbx), %zmm2
	vbroadcastsd	376(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm12    # zmm12 = (zmm0 * zmm2) + zmm12
	vfmadd231pd	%zmm2, %zmm1, %zmm13    # zmm13 = (zmm1 * zmm2) + zmm13
	vfmadd231pd	%zmm3, %zmm0, %zmm14    # zmm14 = (zmm0 * zmm3) + zmm14
	vfmadd231pd	%zmm3, %zmm1, %zmm15    # zmm15 = (zmm1 * zmm3) + zmm15
	vbroadcastsd	384(%rbx), %zmm2
	vbroadcastsd	392(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm16    # zmm16 = (zmm0 * zmm2) + zmm16
	vfmadd231pd	%zmm2, %zmm1, %zmm17    # zmm17 = (zmm1 * zmm2) + zmm17
	vfmadd231pd	%zmm3, %zmm0, %zmm18    # zmm18 = (zmm0 * zmm3) + zmm18
	vfmadd231pd	%zmm3, %zmm1, %zmm19    # zmm19 = (zmm1 * zmm3) + zmm19
	prefetcht0	960(%rax)
	vbroadcastsd	400(%rbx), %zmm2
	vbroadcastsd	408(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm20    # zmm20 = (zmm0 * zmm2) + zmm20
	vfmadd231pd	%zmm2, %zmm1, %zmm21    # zmm21 = (zmm1 * zmm2) + zmm21
	vfmadd231pd	%zmm3, %zmm0, %zmm22    # zmm22 = (zmm0 * zmm3) + zmm22
	vfmadd231pd	%zmm3, %zmm1, %zmm23    # zmm23 = (zmm1 * zmm3) + zmm23
	vbroadcastsd	416(%rbx), %zmm2
	vbroadcastsd	424(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm24    # zmm24 = (zmm0 * zmm2) + zmm24
	vfmadd231pd	%zmm2, %zmm1, %zmm25    # zmm25 = (zmm1 * zmm2) + zmm25
	vfmadd231pd	%zmm3, %zmm0, %zmm26    # zmm26 = (zmm0 * zmm3) + zmm26
	vfmadd231pd	%zmm3, %zmm1, %zmm27    # zmm27 = (zmm1 * zmm3) + zmm27
	prefetcht0	840(%rbx)
	vbroadcastsd	432(%rbx), %zmm2
	vbroadcastsd	440(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm28    # zmm28 = (zmm0 * zmm2) + zmm28
	vfmadd231pd	%zmm2, %zmm1, %zmm29    # zmm29 = (zmm1 * zmm2) + zmm29
	vfmadd231pd	%zmm3, %zmm0, %zmm30    # zmm30 = (zmm0 * zmm3) + zmm30
	vfmadd231pd	%zmm3, %zmm1, %zmm31    # zmm31 = (zmm1 * zmm3) + zmm31
	vmovapd	384(%rax), %zmm0
	vmovapd	448(%rax), %zmm1
	leaq	512(%rax), %rax
	leaq	448(%rbx), %rbx
	leaq	128(%rdx), %rdx
	jne	.LLOOP10
.LK_LE_800:
	addq	$14, %rdi
	jle	.LK_LE_240
	.p2align	5, 0x90
.LLOOP20:
	prefetcht0	(%r9)
	prefetcht0	512(%rax)
	vbroadcastsd	(%rbx), %zmm2
	vbroadcastsd	8(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm4     # zmm4 = (zmm0 * zmm2) + zmm4
	vfmadd231pd	%zmm2, %zmm1, %zmm5     # zmm5 = (zmm1 * zmm2) + zmm5
	vfmadd231pd	%zmm3, %zmm0, %zmm6     # zmm6 = (zmm0 * zmm3) + zmm6
	vfmadd231pd	%zmm3, %zmm1, %zmm7     # zmm7 = (zmm1 * zmm3) + zmm7
	vbroadcastsd	16(%rbx), %zmm2
	vbroadcastsd	24(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm8     # zmm8 = (zmm0 * zmm2) + zmm8
	vfmadd231pd	%zmm2, %zmm1, %zmm9     # zmm9 = (zmm1 * zmm2) + zmm9
	vfmadd231pd	%zmm3, %zmm0, %zmm10    # zmm10 = (zmm0 * zmm3) + zmm10
	vfmadd231pd	%zmm3, %zmm1, %zmm11    # zmm11 = (zmm1 * zmm3) + zmm11
	prefetcht0	448(%rbx)
	vbroadcastsd	32(%rbx), %zmm2
	vbroadcastsd	40(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm12    # zmm12 = (zmm0 * zmm2) + zmm12
	vfmadd231pd	%zmm2, %zmm1, %zmm13    # zmm13 = (zmm1 * zmm2) + zmm13
	vfmadd231pd	%zmm3, %zmm0, %zmm14    # zmm14 = (zmm0 * zmm3) + zmm14
	vfmadd231pd	%zmm3, %zmm1, %zmm15    # zmm15 = (zmm1 * zmm3) + zmm15
	vbroadcastsd	48(%rbx), %zmm2
	vbroadcastsd	56(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm16    # zmm16 = (zmm0 * zmm2) + zmm16
	vfmadd231pd	%zmm2, %zmm1, %zmm17    # zmm17 = (zmm1 * zmm2) + zmm17
	vfmadd231pd	%zmm3, %zmm0, %zmm18    # zmm18 = (zmm0 * zmm3) + zmm18
	vfmadd231pd	%zmm3, %zmm1, %zmm19    # zmm19 = (zmm1 * zmm3) + zmm19
	prefetcht0	576(%rax)
	vbroadcastsd	64(%rbx), %zmm2
	vbroadcastsd	72(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm20    # zmm20 = (zmm0 * zmm2) + zmm20
	vfmadd231pd	%zmm2, %zmm1, %zmm21    # zmm21 = (zmm1 * zmm2) + zmm21
	vfmadd231pd	%zmm3, %zmm0, %zmm22    # zmm22 = (zmm0 * zmm3) + zmm22
	vfmadd231pd	%zmm3, %zmm1, %zmm23    # zmm23 = (zmm1 * zmm3) + zmm23
	vbroadcastsd	80(%rbx), %zmm2
	vbroadcastsd	88(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm24    # zmm24 = (zmm0 * zmm2) + zmm24
	vfmadd231pd	%zmm2, %zmm1, %zmm25    # zmm25 = (zmm1 * zmm2) + zmm25
	vfmadd231pd	%zmm3, %zmm0, %zmm26    # zmm26 = (zmm0 * zmm3) + zmm26
	vfmadd231pd	%zmm3, %zmm1, %zmm27    # zmm27 = (zmm1 * zmm3) + zmm27
	prefetcht0	504(%rbx)
	vbroadcastsd	96(%rbx), %zmm2
	vbroadcastsd	104(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm28    # zmm28 = (zmm0 * zmm2) + zmm28
	vfmadd231pd	%zmm2, %zmm1, %zmm29    # zmm29 = (zmm1 * zmm2) + zmm29
	vfmadd231pd	%zmm3, %zmm0, %zmm30    # zmm30 = (zmm0 * zmm3) + zmm30
	vfmadd231pd	%zmm3, %zmm1, %zmm31    # zmm31 = (zmm1 * zmm3) + zmm31
	vmovapd	(%rax), %zmm0
	vmovapd	64(%rax), %zmm1
	prefetcht1	(%rdx)
	prefetcht0	640(%rax)
	vbroadcastsd	112(%rbx), %zmm2
	vbroadcastsd	120(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm4     # zmm4 = (zmm0 * zmm2) + zmm4
	vfmadd231pd	%zmm2, %zmm1, %zmm5     # zmm5 = (zmm1 * zmm2) + zmm5
	vfmadd231pd	%zmm3, %zmm0, %zmm6     # zmm6 = (zmm0 * zmm3) + zmm6
	vfmadd231pd	%zmm3, %zmm1, %zmm7     # zmm7 = (zmm1 * zmm3) + zmm7
	vbroadcastsd	128(%rbx), %zmm2
	vbroadcastsd	136(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm8     # zmm8 = (zmm0 * zmm2) + zmm8
	vfmadd231pd	%zmm2, %zmm1, %zmm9     # zmm9 = (zmm1 * zmm2) + zmm9
	vfmadd231pd	%zmm3, %zmm0, %zmm10    # zmm10 = (zmm0 * zmm3) + zmm10
	vfmadd231pd	%zmm3, %zmm1, %zmm11    # zmm11 = (zmm1 * zmm3) + zmm11
	prefetcht0	560(%rbx)
	vbroadcastsd	144(%rbx), %zmm2
	vbroadcastsd	152(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm12    # zmm12 = (zmm0 * zmm2) + zmm12
	vfmadd231pd	%zmm2, %zmm1, %zmm13    # zmm13 = (zmm1 * zmm2) + zmm13
	vfmadd231pd	%zmm3, %zmm0, %zmm14    # zmm14 = (zmm0 * zmm3) + zmm14
	vfmadd231pd	%zmm3, %zmm1, %zmm15    # zmm15 = (zmm1 * zmm3) + zmm15
	vbroadcastsd	160(%rbx), %zmm2
	vbroadcastsd	168(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm16    # zmm16 = (zmm0 * zmm2) + zmm16
	vfmadd231pd	%zmm2, %zmm1, %zmm17    # zmm17 = (zmm1 * zmm2) + zmm17
	vfmadd231pd	%zmm3, %zmm0, %zmm18    # zmm18 = (zmm0 * zmm3) + zmm18
	vfmadd231pd	%zmm3, %zmm1, %zmm19    # zmm19 = (zmm1 * zmm3) + zmm19
	prefetcht0	704(%rax)
	vbroadcastsd	176(%rbx), %zmm2
	vbroadcastsd	184(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm20    # zmm20 = (zmm0 * zmm2) + zmm20
	vfmadd231pd	%zmm2, %zmm1, %zmm21    # zmm21 = (zmm1 * zmm2) + zmm21
	vfmadd231pd	%zmm3, %zmm0, %zmm22    # zmm22 = (zmm0 * zmm3) + zmm22
	vfmadd231pd	%zmm3, %zmm1, %zmm23    # zmm23 = (zmm1 * zmm3) + zmm23
	vbroadcastsd	192(%rbx), %zmm2
	vbroadcastsd	200(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm24    # zmm24 = (zmm0 * zmm2) + zmm24
	vfmadd231pd	%zmm2, %zmm1, %zmm25    # zmm25 = (zmm1 * zmm2) + zmm25
	vfmadd231pd	%zmm3, %zmm0, %zmm26    # zmm26 = (zmm0 * zmm3) + zmm26
	vfmadd231pd	%zmm3, %zmm1, %zmm27    # zmm27 = (zmm1 * zmm3) + zmm27
	prefetcht0	616(%rbx)
	vbroadcastsd	208(%rbx), %zmm2
	vbroadcastsd	216(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm28    # zmm28 = (zmm0 * zmm2) + zmm28
	vfmadd231pd	%zmm2, %zmm1, %zmm29    # zmm29 = (zmm1 * zmm2) + zmm29
	vfmadd231pd	%zmm3, %zmm0, %zmm30    # zmm30 = (zmm0 * zmm3) + zmm30
	vfmadd231pd	%zmm3, %zmm1, %zmm31    # zmm31 = (zmm1 * zmm3) + zmm31
	vmovapd	128(%rax), %zmm0
	vmovapd	192(%rax), %zmm1
	prefetcht0	64(%r9)
	subq	$1, %rdi
	prefetcht0	768(%rax)
	vbroadcastsd	224(%rbx), %zmm2
	vbroadcastsd	232(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm4     # zmm4 = (zmm0 * zmm2) + zmm4
	vfmadd231pd	%zmm2, %zmm1, %zmm5     # zmm5 = (zmm1 * zmm2) + zmm5
	vfmadd231pd	%zmm3, %zmm0, %zmm6     # zmm6 = (zmm0 * zmm3) + zmm6
	vfmadd231pd	%zmm3, %zmm1, %zmm7     # zmm7 = (zmm1 * zmm3) + zmm7
	vbroadcastsd	240(%rbx), %zmm2
	vbroadcastsd	248(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm8     # zmm8 = (zmm0 * zmm2) + zmm8
	vfmadd231pd	%zmm2, %zmm1, %zmm9     # zmm9 = (zmm1 * zmm2) + zmm9
	vfmadd231pd	%zmm3, %zmm0, %zmm10    # zmm10 = (zmm0 * zmm3) + zmm10
	vfmadd231pd	%zmm3, %zmm1, %zmm11    # zmm11 = (zmm1 * zmm3) + zmm11
	prefetcht0	672(%rbx)
	vbroadcastsd	256(%rbx), %zmm2
	vbroadcastsd	264(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm12    # zmm12 = (zmm0 * zmm2) + zmm12
	vfmadd231pd	%zmm2, %zmm1, %zmm13    # zmm13 = (zmm1 * zmm2) + zmm13
	vfmadd231pd	%zmm3, %zmm0, %zmm14    # zmm14 = (zmm0 * zmm3) + zmm14
	vfmadd231pd	%zmm3, %zmm1, %zmm15    # zmm15 = (zmm1 * zmm3) + zmm15
	vbroadcastsd	272(%rbx), %zmm2
	vbroadcastsd	280(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm16    # zmm16 = (zmm0 * zmm2) + zmm16
	vfmadd231pd	%zmm2, %zmm1, %zmm17    # zmm17 = (zmm1 * zmm2) + zmm17
	vfmadd231pd	%zmm3, %zmm0, %zmm18    # zmm18 = (zmm0 * zmm3) + zmm18
	vfmadd231pd	%zmm3, %zmm1, %zmm19    # zmm19 = (zmm1 * zmm3) + zmm19
	prefetcht0	832(%rax)
	vbroadcastsd	288(%rbx), %zmm2
	vbroadcastsd	296(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm20    # zmm20 = (zmm0 * zmm2) + zmm20
	vfmadd231pd	%zmm2, %zmm1, %zmm21    # zmm21 = (zmm1 * zmm2) + zmm21
	vfmadd231pd	%zmm3, %zmm0, %zmm22    # zmm22 = (zmm0 * zmm3) + zmm22
	vfmadd231pd	%zmm3, %zmm1, %zmm23    # zmm23 = (zmm1 * zmm3) + zmm23
	vbroadcastsd	304(%rbx), %zmm2
	vbroadcastsd	312(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm24    # zmm24 = (zmm0 * zmm2) + zmm24
	vfmadd231pd	%zmm2, %zmm1, %zmm25    # zmm25 = (zmm1 * zmm2) + zmm25
	vfmadd231pd	%zmm3, %zmm0, %zmm26    # zmm26 = (zmm0 * zmm3) + zmm26
	vfmadd231pd	%zmm3, %zmm1, %zmm27    # zmm27 = (zmm1 * zmm3) + zmm27
	prefetcht0	728(%rbx)
	vbroadcastsd	320(%rbx), %zmm2
	vbroadcastsd	328(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm28    # zmm28 = (zmm0 * zmm2) + zmm28
	vfmadd231pd	%zmm2, %zmm1, %zmm29    # zmm29 = (zmm1 * zmm2) + zmm29
	vfmadd231pd	%zmm3, %zmm0, %zmm30    # zmm30 = (zmm0 * zmm3) + zmm30
	vfmadd231pd	%zmm3, %zmm1, %zmm31    # zmm31 = (zmm1 * zmm3) + zmm31
	vmovapd	256(%rax), %zmm0
	vmovapd	320(%rax), %zmm1
	prefetcht1	64(%rdx)
	prefetcht0	896(%rax)
	vbroadcastsd	336(%rbx), %zmm2
	vbroadcastsd	344(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm4     # zmm4 = (zmm0 * zmm2) + zmm4
	vfmadd231pd	%zmm2, %zmm1, %zmm5     # zmm5 = (zmm1 * zmm2) + zmm5
	vfmadd231pd	%zmm3, %zmm0, %zmm6     # zmm6 = (zmm0 * zmm3) + zmm6
	vfmadd231pd	%zmm3, %zmm1, %zmm7     # zmm7 = (zmm1 * zmm3) + zmm7
	vbroadcastsd	352(%rbx), %zmm2
	vbroadcastsd	360(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm8     # zmm8 = (zmm0 * zmm2) + zmm8
	vfmadd231pd	%zmm2, %zmm1, %zmm9     # zmm9 = (zmm1 * zmm2) + zmm9
	vfmadd231pd	%zmm3, %zmm0, %zmm10    # zmm10 = (zmm0 * zmm3) + zmm10
	vfmadd231pd	%zmm3, %zmm1, %zmm11    # zmm11 = (zmm1 * zmm3) + zmm11
	prefetcht0	784(%rbx)
	vbroadcastsd	368(%rbx), %zmm2
	vbroadcastsd	376(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm12    # zmm12 = (zmm0 * zmm2) + zmm12
	vfmadd231pd	%zmm2, %zmm1, %zmm13    # zmm13 = (zmm1 * zmm2) + zmm13
	vfmadd231pd	%zmm3, %zmm0, %zmm14    # zmm14 = (zmm0 * zmm3) + zmm14
	vfmadd231pd	%zmm3, %zmm1, %zmm15    # zmm15 = (zmm1 * zmm3) + zmm15
	vbroadcastsd	384(%rbx), %zmm2
	vbroadcastsd	392(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm16    # zmm16 = (zmm0 * zmm2) + zmm16
	vfmadd231pd	%zmm2, %zmm1, %zmm17    # zmm17 = (zmm1 * zmm2) + zmm17
	vfmadd231pd	%zmm3, %zmm0, %zmm18    # zmm18 = (zmm0 * zmm3) + zmm18
	vfmadd231pd	%zmm3, %zmm1, %zmm19    # zmm19 = (zmm1 * zmm3) + zmm19
	prefetcht0	960(%rax)
	vbroadcastsd	400(%rbx), %zmm2
	vbroadcastsd	408(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm20    # zmm20 = (zmm0 * zmm2) + zmm20
	vfmadd231pd	%zmm2, %zmm1, %zmm21    # zmm21 = (zmm1 * zmm2) + zmm21
	vfmadd231pd	%zmm3, %zmm0, %zmm22    # zmm22 = (zmm0 * zmm3) + zmm22
	vfmadd231pd	%zmm3, %zmm1, %zmm23    # zmm23 = (zmm1 * zmm3) + zmm23
	vbroadcastsd	416(%rbx), %zmm2
	vbroadcastsd	424(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm24    # zmm24 = (zmm0 * zmm2) + zmm24
	vfmadd231pd	%zmm2, %zmm1, %zmm25    # zmm25 = (zmm1 * zmm2) + zmm25
	vfmadd231pd	%zmm3, %zmm0, %zmm26    # zmm26 = (zmm0 * zmm3) + zmm26
	vfmadd231pd	%zmm3, %zmm1, %zmm27    # zmm27 = (zmm1 * zmm3) + zmm27
	prefetcht0	840(%rbx)
	vbroadcastsd	432(%rbx), %zmm2
	vbroadcastsd	440(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm28    # zmm28 = (zmm0 * zmm2) + zmm28
	vfmadd231pd	%zmm2, %zmm1, %zmm29    # zmm29 = (zmm1 * zmm2) + zmm29
	vfmadd231pd	%zmm3, %zmm0, %zmm30    # zmm30 = (zmm0 * zmm3) + zmm30
	vfmadd231pd	%zmm3, %zmm1, %zmm31    # zmm31 = (zmm1 * zmm3) + zmm31
	vmovapd	384(%rax), %zmm0
	vmovapd	448(%rax), %zmm1
	leaq	512(%rax), %rax
	leaq	448(%rbx), %rbx
	leaq	128(%rdx), %rdx
	leaq	(%r9,%r10), %r9
	jne	.LLOOP20
.LK_LE_240:
	addq	$5, %rdi
	jle	.LTAIL0
	.p2align	5, 0x90
.LLOOP30:
	prefetcht0	512(%rax)
	vbroadcastsd	(%rbx), %zmm2
	vbroadcastsd	8(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm4     # zmm4 = (zmm0 * zmm2) + zmm4
	vfmadd231pd	%zmm2, %zmm1, %zmm5     # zmm5 = (zmm1 * zmm2) + zmm5
	vfmadd231pd	%zmm3, %zmm0, %zmm6     # zmm6 = (zmm0 * zmm3) + zmm6
	vfmadd231pd	%zmm3, %zmm1, %zmm7     # zmm7 = (zmm1 * zmm3) + zmm7
	vbroadcastsd	16(%rbx), %zmm2
	vbroadcastsd	24(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm8     # zmm8 = (zmm0 * zmm2) + zmm8
	vfmadd231pd	%zmm2, %zmm1, %zmm9     # zmm9 = (zmm1 * zmm2) + zmm9
	vfmadd231pd	%zmm3, %zmm0, %zmm10    # zmm10 = (zmm0 * zmm3) + zmm10
	vfmadd231pd	%zmm3, %zmm1, %zmm11    # zmm11 = (zmm1 * zmm3) + zmm11
	prefetcht0	448(%rbx)
	vbroadcastsd	32(%rbx), %zmm2
	vbroadcastsd	40(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm12    # zmm12 = (zmm0 * zmm2) + zmm12
	vfmadd231pd	%zmm2, %zmm1, %zmm13    # zmm13 = (zmm1 * zmm2) + zmm13
	vfmadd231pd	%zmm3, %zmm0, %zmm14    # zmm14 = (zmm0 * zmm3) + zmm14
	vfmadd231pd	%zmm3, %zmm1, %zmm15    # zmm15 = (zmm1 * zmm3) + zmm15
	vbroadcastsd	48(%rbx), %zmm2
	vbroadcastsd	56(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm16    # zmm16 = (zmm0 * zmm2) + zmm16
	vfmadd231pd	%zmm2, %zmm1, %zmm17    # zmm17 = (zmm1 * zmm2) + zmm17
	vfmadd231pd	%zmm3, %zmm0, %zmm18    # zmm18 = (zmm0 * zmm3) + zmm18
	vfmadd231pd	%zmm3, %zmm1, %zmm19    # zmm19 = (zmm1 * zmm3) + zmm19
	prefetcht0	576(%rax)
	vbroadcastsd	64(%rbx), %zmm2
	vbroadcastsd	72(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm20    # zmm20 = (zmm0 * zmm2) + zmm20
	vfmadd231pd	%zmm2, %zmm1, %zmm21    # zmm21 = (zmm1 * zmm2) + zmm21
	vfmadd231pd	%zmm3, %zmm0, %zmm22    # zmm22 = (zmm0 * zmm3) + zmm22
	vfmadd231pd	%zmm3, %zmm1, %zmm23    # zmm23 = (zmm1 * zmm3) + zmm23
	vbroadcastsd	80(%rbx), %zmm2
	vbroadcastsd	88(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm24    # zmm24 = (zmm0 * zmm2) + zmm24
	vfmadd231pd	%zmm2, %zmm1, %zmm25    # zmm25 = (zmm1 * zmm2) + zmm25
	vfmadd231pd	%zmm3, %zmm0, %zmm26    # zmm26 = (zmm0 * zmm3) + zmm26
	vfmadd231pd	%zmm3, %zmm1, %zmm27    # zmm27 = (zmm1 * zmm3) + zmm27
	prefetcht0	504(%rbx)
	vbroadcastsd	96(%rbx), %zmm2
	vbroadcastsd	104(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm28    # zmm28 = (zmm0 * zmm2) + zmm28
	vfmadd231pd	%zmm2, %zmm1, %zmm29    # zmm29 = (zmm1 * zmm2) + zmm29
	vfmadd231pd	%zmm3, %zmm0, %zmm30    # zmm30 = (zmm0 * zmm3) + zmm30
	vfmadd231pd	%zmm3, %zmm1, %zmm31    # zmm31 = (zmm1 * zmm3) + zmm31
	vmovapd	(%rax), %zmm0
	vmovapd	64(%rax), %zmm1
	prefetcht1	(%rdx)
	prefetcht0	640(%rax)
	vbroadcastsd	112(%rbx), %zmm2
	vbroadcastsd	120(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm4     # zmm4 = (zmm0 * zmm2) + zmm4
	vfmadd231pd	%zmm2, %zmm1, %zmm5     # zmm5 = (zmm1 * zmm2) + zmm5
	vfmadd231pd	%zmm3, %zmm0, %zmm6     # zmm6 = (zmm0 * zmm3) + zmm6
	vfmadd231pd	%zmm3, %zmm1, %zmm7     # zmm7 = (zmm1 * zmm3) + zmm7
	vbroadcastsd	128(%rbx), %zmm2
	vbroadcastsd	136(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm8     # zmm8 = (zmm0 * zmm2) + zmm8
	vfmadd231pd	%zmm2, %zmm1, %zmm9     # zmm9 = (zmm1 * zmm2) + zmm9
	vfmadd231pd	%zmm3, %zmm0, %zmm10    # zmm10 = (zmm0 * zmm3) + zmm10
	vfmadd231pd	%zmm3, %zmm1, %zmm11    # zmm11 = (zmm1 * zmm3) + zmm11
	prefetcht0	560(%rbx)
	vbroadcastsd	144(%rbx), %zmm2
	vbroadcastsd	152(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm12    # zmm12 = (zmm0 * zmm2) + zmm12
	vfmadd231pd	%zmm2, %zmm1, %zmm13    # zmm13 = (zmm1 * zmm2) + zmm13
	vfmadd231pd	%zmm3, %zmm0, %zmm14    # zmm14 = (zmm0 * zmm3) + zmm14
	vfmadd231pd	%zmm3, %zmm1, %zmm15    # zmm15 = (zmm1 * zmm3) + zmm15
	vbroadcastsd	160(%rbx), %zmm2
	vbroadcastsd	168(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm16    # zmm16 = (zmm0 * zmm2) + zmm16
	vfmadd231pd	%zmm2, %zmm1, %zmm17    # zmm17 = (zmm1 * zmm2) + zmm17
	vfmadd231pd	%zmm3, %zmm0, %zmm18    # zmm18 = (zmm0 * zmm3) + zmm18
	vfmadd231pd	%zmm3, %zmm1, %zmm19    # zmm19 = (zmm1 * zmm3) + zmm19
	prefetcht0	704(%rax)
	vbroadcastsd	176(%rbx), %zmm2
	vbroadcastsd	184(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm20    # zmm20 = (zmm0 * zmm2) + zmm20
	vfmadd231pd	%zmm2, %zmm1, %zmm21    # zmm21 = (zmm1 * zmm2) + zmm21
	vfmadd231pd	%zmm3, %zmm0, %zmm22    # zmm22 = (zmm0 * zmm3) + zmm22
	vfmadd231pd	%zmm3, %zmm1, %zmm23    # zmm23 = (zmm1 * zmm3) + zmm23
	vbroadcastsd	192(%rbx), %zmm2
	vbroadcastsd	200(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm24    # zmm24 = (zmm0 * zmm2) + zmm24
	vfmadd231pd	%zmm2, %zmm1, %zmm25    # zmm25 = (zmm1 * zmm2) + zmm25
	vfmadd231pd	%zmm3, %zmm0, %zmm26    # zmm26 = (zmm0 * zmm3) + zmm26
	vfmadd231pd	%zmm3, %zmm1, %zmm27    # zmm27 = (zmm1 * zmm3) + zmm27
	prefetcht0	616(%rbx)
	vbroadcastsd	208(%rbx), %zmm2
	vbroadcastsd	216(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm28    # zmm28 = (zmm0 * zmm2) + zmm28
	vfmadd231pd	%zmm2, %zmm1, %zmm29    # zmm29 = (zmm1 * zmm2) + zmm29
	vfmadd231pd	%zmm3, %zmm0, %zmm30    # zmm30 = (zmm0 * zmm3) + zmm30
	vfmadd231pd	%zmm3, %zmm1, %zmm31    # zmm31 = (zmm1 * zmm3) + zmm31
	vmovapd	128(%rax), %zmm0
	vmovapd	192(%rax), %zmm1
	subq	$1, %rdi
	prefetcht0	768(%rax)
	vbroadcastsd	224(%rbx), %zmm2
	vbroadcastsd	232(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm4     # zmm4 = (zmm0 * zmm2) + zmm4
	vfmadd231pd	%zmm2, %zmm1, %zmm5     # zmm5 = (zmm1 * zmm2) + zmm5
	vfmadd231pd	%zmm3, %zmm0, %zmm6     # zmm6 = (zmm0 * zmm3) + zmm6
	vfmadd231pd	%zmm3, %zmm1, %zmm7     # zmm7 = (zmm1 * zmm3) + zmm7
	vbroadcastsd	240(%rbx), %zmm2
	vbroadcastsd	248(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm8     # zmm8 = (zmm0 * zmm2) + zmm8
	vfmadd231pd	%zmm2, %zmm1, %zmm9     # zmm9 = (zmm1 * zmm2) + zmm9
	vfmadd231pd	%zmm3, %zmm0, %zmm10    # zmm10 = (zmm0 * zmm3) + zmm10
	vfmadd231pd	%zmm3, %zmm1, %zmm11    # zmm11 = (zmm1 * zmm3) + zmm11
	prefetcht0	672(%rbx)
	vbroadcastsd	256(%rbx), %zmm2
	vbroadcastsd	264(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm12    # zmm12 = (zmm0 * zmm2) + zmm12
	vfmadd231pd	%zmm2, %zmm1, %zmm13    # zmm13 = (zmm1 * zmm2) + zmm13
	vfmadd231pd	%zmm3, %zmm0, %zmm14    # zmm14 = (zmm0 * zmm3) + zmm14
	vfmadd231pd	%zmm3, %zmm1, %zmm15    # zmm15 = (zmm1 * zmm3) + zmm15
	vbroadcastsd	272(%rbx), %zmm2
	vbroadcastsd	280(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm16    # zmm16 = (zmm0 * zmm2) + zmm16
	vfmadd231pd	%zmm2, %zmm1, %zmm17    # zmm17 = (zmm1 * zmm2) + zmm17
	vfmadd231pd	%zmm3, %zmm0, %zmm18    # zmm18 = (zmm0 * zmm3) + zmm18
	vfmadd231pd	%zmm3, %zmm1, %zmm19    # zmm19 = (zmm1 * zmm3) + zmm19
	prefetcht0	832(%rax)
	vbroadcastsd	288(%rbx), %zmm2
	vbroadcastsd	296(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm20    # zmm20 = (zmm0 * zmm2) + zmm20
	vfmadd231pd	%zmm2, %zmm1, %zmm21    # zmm21 = (zmm1 * zmm2) + zmm21
	vfmadd231pd	%zmm3, %zmm0, %zmm22    # zmm22 = (zmm0 * zmm3) + zmm22
	vfmadd231pd	%zmm3, %zmm1, %zmm23    # zmm23 = (zmm1 * zmm3) + zmm23
	vbroadcastsd	304(%rbx), %zmm2
	vbroadcastsd	312(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm24    # zmm24 = (zmm0 * zmm2) + zmm24
	vfmadd231pd	%zmm2, %zmm1, %zmm25    # zmm25 = (zmm1 * zmm2) + zmm25
	vfmadd231pd	%zmm3, %zmm0, %zmm26    # zmm26 = (zmm0 * zmm3) + zmm26
	vfmadd231pd	%zmm3, %zmm1, %zmm27    # zmm27 = (zmm1 * zmm3) + zmm27
	prefetcht0	728(%rbx)
	vbroadcastsd	320(%rbx), %zmm2
	vbroadcastsd	328(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm28    # zmm28 = (zmm0 * zmm2) + zmm28
	vfmadd231pd	%zmm2, %zmm1, %zmm29    # zmm29 = (zmm1 * zmm2) + zmm29
	vfmadd231pd	%zmm3, %zmm0, %zmm30    # zmm30 = (zmm0 * zmm3) + zmm30
	vfmadd231pd	%zmm3, %zmm1, %zmm31    # zmm31 = (zmm1 * zmm3) + zmm31
	vmovapd	256(%rax), %zmm0
	vmovapd	320(%rax), %zmm1
	prefetcht1	64(%rdx)
	prefetcht0	896(%rax)
	vbroadcastsd	336(%rbx), %zmm2
	vbroadcastsd	344(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm4     # zmm4 = (zmm0 * zmm2) + zmm4
	vfmadd231pd	%zmm2, %zmm1, %zmm5     # zmm5 = (zmm1 * zmm2) + zmm5
	vfmadd231pd	%zmm3, %zmm0, %zmm6     # zmm6 = (zmm0 * zmm3) + zmm6
	vfmadd231pd	%zmm3, %zmm1, %zmm7     # zmm7 = (zmm1 * zmm3) + zmm7
	vbroadcastsd	352(%rbx), %zmm2
	vbroadcastsd	360(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm8     # zmm8 = (zmm0 * zmm2) + zmm8
	vfmadd231pd	%zmm2, %zmm1, %zmm9     # zmm9 = (zmm1 * zmm2) + zmm9
	vfmadd231pd	%zmm3, %zmm0, %zmm10    # zmm10 = (zmm0 * zmm3) + zmm10
	vfmadd231pd	%zmm3, %zmm1, %zmm11    # zmm11 = (zmm1 * zmm3) + zmm11
	prefetcht0	784(%rbx)
	vbroadcastsd	368(%rbx), %zmm2
	vbroadcastsd	376(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm12    # zmm12 = (zmm0 * zmm2) + zmm12
	vfmadd231pd	%zmm2, %zmm1, %zmm13    # zmm13 = (zmm1 * zmm2) + zmm13
	vfmadd231pd	%zmm3, %zmm0, %zmm14    # zmm14 = (zmm0 * zmm3) + zmm14
	vfmadd231pd	%zmm3, %zmm1, %zmm15    # zmm15 = (zmm1 * zmm3) + zmm15
	vbroadcastsd	384(%rbx), %zmm2
	vbroadcastsd	392(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm16    # zmm16 = (zmm0 * zmm2) + zmm16
	vfmadd231pd	%zmm2, %zmm1, %zmm17    # zmm17 = (zmm1 * zmm2) + zmm17
	vfmadd231pd	%zmm3, %zmm0, %zmm18    # zmm18 = (zmm0 * zmm3) + zmm18
	vfmadd231pd	%zmm3, %zmm1, %zmm19    # zmm19 = (zmm1 * zmm3) + zmm19
	prefetcht0	960(%rax)
	vbroadcastsd	400(%rbx), %zmm2
	vbroadcastsd	408(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm20    # zmm20 = (zmm0 * zmm2) + zmm20
	vfmadd231pd	%zmm2, %zmm1, %zmm21    # zmm21 = (zmm1 * zmm2) + zmm21
	vfmadd231pd	%zmm3, %zmm0, %zmm22    # zmm22 = (zmm0 * zmm3) + zmm22
	vfmadd231pd	%zmm3, %zmm1, %zmm23    # zmm23 = (zmm1 * zmm3) + zmm23
	vbroadcastsd	416(%rbx), %zmm2
	vbroadcastsd	424(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm24    # zmm24 = (zmm0 * zmm2) + zmm24
	vfmadd231pd	%zmm2, %zmm1, %zmm25    # zmm25 = (zmm1 * zmm2) + zmm25
	vfmadd231pd	%zmm3, %zmm0, %zmm26    # zmm26 = (zmm0 * zmm3) + zmm26
	vfmadd231pd	%zmm3, %zmm1, %zmm27    # zmm27 = (zmm1 * zmm3) + zmm27
	prefetcht0	840(%rbx)
	vbroadcastsd	432(%rbx), %zmm2
	vbroadcastsd	440(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm28    # zmm28 = (zmm0 * zmm2) + zmm28
	vfmadd231pd	%zmm2, %zmm1, %zmm29    # zmm29 = (zmm1 * zmm2) + zmm29
	vfmadd231pd	%zmm3, %zmm0, %zmm30    # zmm30 = (zmm0 * zmm3) + zmm30
	vfmadd231pd	%zmm3, %zmm1, %zmm31    # zmm31 = (zmm1 * zmm3) + zmm31
	vmovapd	384(%rax), %zmm0
	vmovapd	448(%rax), %zmm1
	leaq	512(%rax), %rax
	leaq	448(%rbx), %rbx
	leaq	128(%rdx), %rdx
	jne	.LLOOP30
.LTAIL0:
	testq	%rsi, %rsi
	je	.LPOSTACCUM0
	.p2align	5, 0x90
.LTAIL_LOOP0:
	subq	$1, %rsi
	prefetcht0	512(%rax)
	vbroadcastsd	(%rbx), %zmm2
	vbroadcastsd	8(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm4     # zmm4 = (zmm0 * zmm2) + zmm4
	vfmadd231pd	%zmm2, %zmm1, %zmm5     # zmm5 = (zmm1 * zmm2) + zmm5
	vfmadd231pd	%zmm3, %zmm0, %zmm6     # zmm6 = (zmm0 * zmm3) + zmm6
	vfmadd231pd	%zmm3, %zmm1, %zmm7     # zmm7 = (zmm1 * zmm3) + zmm7
	vbroadcastsd	16(%rbx), %zmm2
	vbroadcastsd	24(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm8     # zmm8 = (zmm0 * zmm2) + zmm8
	vfmadd231pd	%zmm2, %zmm1, %zmm9     # zmm9 = (zmm1 * zmm2) + zmm9
	vfmadd231pd	%zmm3, %zmm0, %zmm10    # zmm10 = (zmm0 * zmm3) + zmm10
	vfmadd231pd	%zmm3, %zmm1, %zmm11    # zmm11 = (zmm1 * zmm3) + zmm11
	prefetcht0	448(%rbx)
	vbroadcastsd	32(%rbx), %zmm2
	vbroadcastsd	40(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm12    # zmm12 = (zmm0 * zmm2) + zmm12
	vfmadd231pd	%zmm2, %zmm1, %zmm13    # zmm13 = (zmm1 * zmm2) + zmm13
	vfmadd231pd	%zmm3, %zmm0, %zmm14    # zmm14 = (zmm0 * zmm3) + zmm14
	vfmadd231pd	%zmm3, %zmm1, %zmm15    # zmm15 = (zmm1 * zmm3) + zmm15
	vbroadcastsd	48(%rbx), %zmm2
	vbroadcastsd	56(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm16    # zmm16 = (zmm0 * zmm2) + zmm16
	vfmadd231pd	%zmm2, %zmm1, %zmm17    # zmm17 = (zmm1 * zmm2) + zmm17
	vfmadd231pd	%zmm3, %zmm0, %zmm18    # zmm18 = (zmm0 * zmm3) + zmm18
	vfmadd231pd	%zmm3, %zmm1, %zmm19    # zmm19 = (zmm1 * zmm3) + zmm19
	prefetcht0	576(%rax)
	vbroadcastsd	64(%rbx), %zmm2
	vbroadcastsd	72(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm20    # zmm20 = (zmm0 * zmm2) + zmm20
	vfmadd231pd	%zmm2, %zmm1, %zmm21    # zmm21 = (zmm1 * zmm2) + zmm21
	vfmadd231pd	%zmm3, %zmm0, %zmm22    # zmm22 = (zmm0 * zmm3) + zmm22
	vfmadd231pd	%zmm3, %zmm1, %zmm23    # zmm23 = (zmm1 * zmm3) + zmm23
	vbroadcastsd	80(%rbx), %zmm2
	vbroadcastsd	88(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm24    # zmm24 = (zmm0 * zmm2) + zmm24
	vfmadd231pd	%zmm2, %zmm1, %zmm25    # zmm25 = (zmm1 * zmm2) + zmm25
	vfmadd231pd	%zmm3, %zmm0, %zmm26    # zmm26 = (zmm0 * zmm3) + zmm26
	vfmadd231pd	%zmm3, %zmm1, %zmm27    # zmm27 = (zmm1 * zmm3) + zmm27
	prefetcht0	504(%rbx)
	vbroadcastsd	96(%rbx), %zmm2
	vbroadcastsd	104(%rbx), %zmm3
	vfmadd231pd	%zmm2, %zmm0, %zmm28    # zmm28 = (zmm0 * zmm2) + zmm28
	vfmadd231pd	%zmm2, %zmm1, %zmm29    # zmm29 = (zmm1 * zmm2) + zmm29
	vfmadd231pd	%zmm3, %zmm0, %zmm30    # zmm30 = (zmm0 * zmm3) + zmm30
	vfmadd231pd	%zmm3, %zmm1, %zmm31    # zmm31 = (zmm1 * zmm3) + zmm31
	vmovapd	(%rax), %zmm0
	vmovapd	64(%rax), %zmm1
	leaq	128(%rax), %rax
	leaq	112(%rbx), %rbx
	jne	.LTAIL_LOOP0
.LPOSTACCUM0:
	movq	32(%rsp), %rax
	movq	152(%rsp), %rbx
	vbroadcastsd	(%rax), %zmm0
	vbroadcastsd	(%rbx), %zmm1
	vxorpd	%ymm2, %ymm2, %ymm2
	movq	%r12, %rax
	movq	%r10, %rbx
	cmpq	$8, %rax
	jne	.LSCATTEREDUPDATE0
	vcomisd	%xmm2, %xmm1
	je	.LCOLSTORBZ0
	vmulpd	%zmm0, %zmm4, %zmm4
	vmulpd	%zmm0, %zmm5, %zmm5
	vfmadd231pd	(%rcx), %zmm1, %zmm4    # zmm4 = (zmm1 * mem) + zmm4
	vfmadd231pd	64(%rcx), %zmm1, %zmm5  # zmm5 = (zmm1 * mem) + zmm5
	vmovupd	%zmm4, (%rcx)
	vmovupd	%zmm5, 64(%rcx)
	leaq	(%rcx,%rbx), %rcx
	vmulpd	%zmm0, %zmm6, %zmm6
	vmulpd	%zmm0, %zmm7, %zmm7
	vfmadd231pd	(%rcx), %zmm1, %zmm6    # zmm6 = (zmm1 * mem) + zmm6
	vfmadd231pd	64(%rcx), %zmm1, %zmm7  # zmm7 = (zmm1 * mem) + zmm7
	vmovupd	%zmm6, (%rcx)
	vmovupd	%zmm7, 64(%rcx)
	leaq	(%rcx,%rbx), %rcx
	vmulpd	%zmm0, %zmm8, %zmm8
	vmulpd	%zmm0, %zmm9, %zmm9
	vfmadd231pd	(%rcx), %zmm1, %zmm8    # zmm8 = (zmm1 * mem) + zmm8
	vfmadd231pd	64(%rcx), %zmm1, %zmm9  # zmm9 = (zmm1 * mem) + zmm9
	vmovupd	%zmm8, (%rcx)
	vmovupd	%zmm9, 64(%rcx)
	leaq	(%rcx,%rbx), %rcx
	vmulpd	%zmm0, %zmm10, %zmm10
	vmulpd	%zmm0, %zmm11, %zmm11
	vfmadd231pd	(%rcx), %zmm1, %zmm10   # zmm10 = (zmm1 * mem) + zmm10
	vfmadd231pd	64(%rcx), %zmm1, %zmm11 # zmm11 = (zmm1 * mem) + zmm11
	vmovupd	%zmm10, (%rcx)
	vmovupd	%zmm11, 64(%rcx)
	leaq	(%rcx,%rbx), %rcx
	vmulpd	%zmm0, %zmm12, %zmm12
	vmulpd	%zmm0, %zmm13, %zmm13
	vfmadd231pd	(%rcx), %zmm1, %zmm12   # zmm12 = (zmm1 * mem) + zmm12
	vfmadd231pd	64(%rcx), %zmm1, %zmm13 # zmm13 = (zmm1 * mem) + zmm13
	vmovupd	%zmm12, (%rcx)
	vmovupd	%zmm13, 64(%rcx)
	leaq	(%rcx,%rbx), %rcx
	vmulpd	%zmm0, %zmm14, %zmm14
	vmulpd	%zmm0, %zmm15, %zmm15
	vfmadd231pd	(%rcx), %zmm1, %zmm14   # zmm14 = (zmm1 * mem) + zmm14
	vfmadd231pd	64(%rcx), %zmm1, %zmm15 # zmm15 = (zmm1 * mem) + zmm15
	vmovupd	%zmm14, (%rcx)
	vmovupd	%zmm15, 64(%rcx)
	leaq	(%rcx,%rbx), %rcx
	vmulpd	%zmm0, %zmm16, %zmm16
	vmulpd	%zmm0, %zmm17, %zmm17
	vfmadd231pd	(%rcx), %zmm1, %zmm16   # zmm16 = (zmm1 * mem) + zmm16
	vfmadd231pd	64(%rcx), %zmm1, %zmm17 # zmm17 = (zmm1 * mem) + zmm17
	vmovupd	%zmm16, (%rcx)
	vmovupd	%zmm17, 64(%rcx)
	leaq	(%rcx,%rbx), %rcx
	vmulpd	%zmm0, %zmm18, %zmm18
	vmulpd	%zmm0, %zmm19, %zmm19
	vfmadd231pd	(%rcx), %zmm1, %zmm18   # zmm18 = (zmm1 * mem) + zmm18
	vfmadd231pd	64(%rcx), %zmm1, %zmm19 # zmm19 = (zmm1 * mem) + zmm19
	vmovupd	%zmm18, (%rcx)
	vmovupd	%zmm19, 64(%rcx)
	leaq	(%rcx,%rbx), %rcx
	vmulpd	%zmm0, %zmm20, %zmm20
	vmulpd	%zmm0, %zmm21, %zmm21
	vfmadd231pd	(%rcx), %zmm1, %zmm20   # zmm20 = (zmm1 * mem) + zmm20
	vfmadd231pd	64(%rcx), %zmm1, %zmm21 # zmm21 = (zmm1 * mem) + zmm21
	vmovupd	%zmm20, (%rcx)
	vmovupd	%zmm21, 64(%rcx)
	leaq	(%rcx,%rbx), %rcx
	vmulpd	%zmm0, %zmm22, %zmm22
	vmulpd	%zmm0, %zmm23, %zmm23
	vfmadd231pd	(%rcx), %zmm1, %zmm22   # zmm22 = (zmm1 * mem) + zmm22
	vfmadd231pd	64(%rcx), %zmm1, %zmm23 # zmm23 = (zmm1 * mem) + zmm23
	vmovupd	%zmm22, (%rcx)
	vmovupd	%zmm23, 64(%rcx)
	leaq	(%rcx,%rbx), %rcx
	vmulpd	%zmm0, %zmm24, %zmm24
	vmulpd	%zmm0, %zmm25, %zmm25
	vfmadd231pd	(%rcx), %zmm1, %zmm24   # zmm24 = (zmm1 * mem) + zmm24
	vfmadd231pd	64(%rcx), %zmm1, %zmm25 # zmm25 = (zmm1 * mem) + zmm25
	vmovupd	%zmm24, (%rcx)
	vmovupd	%zmm25, 64(%rcx)
	leaq	(%rcx,%rbx), %rcx
	vmulpd	%zmm0, %zmm26, %zmm26
	vmulpd	%zmm0, %zmm27, %zmm27
	vfmadd231pd	(%rcx), %zmm1, %zmm26   # zmm26 = (zmm1 * mem) + zmm26
	vfmadd231pd	64(%rcx), %zmm1, %zmm27 # zmm27 = (zmm1 * mem) + zmm27
	vmovupd	%zmm26, (%rcx)
	vmovupd	%zmm27, 64(%rcx)
	leaq	(%rcx,%rbx), %rcx
	vmulpd	%zmm0, %zmm28, %zmm28
	vmulpd	%zmm0, %zmm29, %zmm29
	vfmadd231pd	(%rcx), %zmm1, %zmm28   # zmm28 = (zmm1 * mem) + zmm28
	vfmadd231pd	64(%rcx), %zmm1, %zmm29 # zmm29 = (zmm1 * mem) + zmm29
	vmovupd	%zmm28, (%rcx)
	vmovupd	%zmm29, 64(%rcx)
	leaq	(%rcx,%rbx), %rcx
	vmulpd	%zmm0, %zmm30, %zmm30
	vmulpd	%zmm0, %zmm31, %zmm31
	vfmadd231pd	(%rcx), %zmm1, %zmm30   # zmm30 = (zmm1 * mem) + zmm30
	vfmadd231pd	64(%rcx), %zmm1, %zmm31 # zmm31 = (zmm1 * mem) + zmm31
	vmovupd	%zmm30, (%rcx)
	vmovupd	%zmm31, 64(%rcx)
	leaq	(%rcx,%rbx), %rcx
	jmp	.LEND0
.LCOLSTORBZ0:
	vmulpd	%zmm0, %zmm4, %zmm4
	vmulpd	%zmm0, %zmm5, %zmm5
	vmovupd	%zmm4, (%rcx)
	vmovupd	%zmm5, 64(%rcx)
	leaq	(%rcx,%rbx), %rcx
	vmulpd	%zmm0, %zmm6, %zmm6
	vmulpd	%zmm0, %zmm7, %zmm7
	vmovupd	%zmm6, (%rcx)
	vmovupd	%zmm7, 64(%rcx)
	leaq	(%rcx,%rbx), %rcx
	vmulpd	%zmm0, %zmm8, %zmm8
	vmulpd	%zmm0, %zmm9, %zmm9
	vmovupd	%zmm8, (%rcx)
	vmovupd	%zmm9, 64(%rcx)
	leaq	(%rcx,%rbx), %rcx
	vmulpd	%zmm0, %zmm10, %zmm10
	vmulpd	%zmm0, %zmm11, %zmm11
	vmovupd	%zmm10, (%rcx)
	vmovupd	%zmm11, 64(%rcx)
	leaq	(%rcx,%rbx), %rcx
	vmulpd	%zmm0, %zmm12, %zmm12
	vmulpd	%zmm0, %zmm13, %zmm13
	vmovupd	%zmm12, (%rcx)
	vmovupd	%zmm13, 64(%rcx)
	leaq	(%rcx,%rbx), %rcx
	vmulpd	%zmm0, %zmm14, %zmm14
	vmulpd	%zmm0, %zmm15, %zmm15
	vmovupd	%zmm14, (%rcx)
	vmovupd	%zmm15, 64(%rcx)
	leaq	(%rcx,%rbx), %rcx
	vmulpd	%zmm0, %zmm16, %zmm16
	vmulpd	%zmm0, %zmm17, %zmm17
	vmovupd	%zmm16, (%rcx)
	vmovupd	%zmm17, 64(%rcx)
	leaq	(%rcx,%rbx), %rcx
	vmulpd	%zmm0, %zmm18, %zmm18
	vmulpd	%zmm0, %zmm19, %zmm19
	vmovupd	%zmm18, (%rcx)
	vmovupd	%zmm19, 64(%rcx)
	leaq	(%rcx,%rbx), %rcx
	vmulpd	%zmm0, %zmm20, %zmm20
	vmulpd	%zmm0, %zmm21, %zmm21
	vmovupd	%zmm20, (%rcx)
	vmovupd	%zmm21, 64(%rcx)
	leaq	(%rcx,%rbx), %rcx
	vmulpd	%zmm0, %zmm22, %zmm22
	vmulpd	%zmm0, %zmm23, %zmm23
	vmovupd	%zmm22, (%rcx)
	vmovupd	%zmm23, 64(%rcx)
	leaq	(%rcx,%rbx), %rcx
	vmulpd	%zmm0, %zmm24, %zmm24
	vmulpd	%zmm0, %zmm25, %zmm25
	vmovupd	%zmm24, (%rcx)
	vmovupd	%zmm25, 64(%rcx)
	leaq	(%rcx,%rbx), %rcx
	vmulpd	%zmm0, %zmm26, %zmm26
	vmulpd	%zmm0, %zmm27, %zmm27
	vmovupd	%zmm26, (%rcx)
	vmovupd	%zmm27, 64(%rcx)
	leaq	(%rcx,%rbx), %rcx
	vmulpd	%zmm0, %zmm28, %zmm28
	vmulpd	%zmm0, %zmm29, %zmm29
	vmovupd	%zmm28, (%rcx)
	vmovupd	%zmm29, 64(%rcx)
	leaq	(%rcx,%rbx), %rcx
	vmulpd	%zmm0, %zmm30, %zmm30
	vmulpd	%zmm0, %zmm31, %zmm31
	vmovupd	%zmm30, (%rcx)
	vmovupd	%zmm31, 64(%rcx)
	leaq	(%rcx,%rbx), %rcx
	jmp	.LEND0
.LSCATTEREDUPDATE0:
	vmulpd	%zmm0, %zmm4, %zmm4
	vmulpd	%zmm0, %zmm5, %zmm5
	vmulpd	%zmm0, %zmm6, %zmm6
	vmulpd	%zmm0, %zmm7, %zmm7
	vmulpd	%zmm0, %zmm8, %zmm8
	vmulpd	%zmm0, %zmm9, %zmm9
	vmulpd	%zmm0, %zmm10, %zmm10
	vmulpd	%zmm0, %zmm11, %zmm11
	vmulpd	%zmm0, %zmm12, %zmm12
	vmulpd	%zmm0, %zmm13, %zmm13
	vmulpd	%zmm0, %zmm14, %zmm14
	vmulpd	%zmm0, %zmm15, %zmm15
	vmulpd	%zmm0, %zmm16, %zmm16
	vmulpd	%zmm0, %zmm17, %zmm17
	vmulpd	%zmm0, %zmm18, %zmm18
	vmulpd	%zmm0, %zmm19, %zmm19
	vmulpd	%zmm0, %zmm20, %zmm20
	vmulpd	%zmm0, %zmm21, %zmm21
	vmulpd	%zmm0, %zmm22, %zmm22
	vmulpd	%zmm0, %zmm23, %zmm23
	vmulpd	%zmm0, %zmm24, %zmm24
	vmulpd	%zmm0, %zmm25, %zmm25
	vmulpd	%zmm0, %zmm26, %zmm26
	vmulpd	%zmm0, %zmm27, %zmm27
	vmulpd	%zmm0, %zmm28, %zmm28
	vmulpd	%zmm0, %zmm29, %zmm29
	vmulpd	%zmm0, %zmm30, %zmm30
	vmulpd	%zmm0, %zmm31, %zmm31
	vcomisd	%xmm2, %xmm1
	movq	24(%rsp), %rdi
	vpbroadcastq	%rax, %zmm0
	vpmullq	(%rdi), %zmm0, %zmm2
	vpmullq	64(%rdi), %zmm0, %zmm3
	je	.LSCATTERBZ0
	kxnorw	%k0, %k0, %k1
	kxnorw	%k0, %k0, %k2
	kxnorw	%k0, %k0, %k3
	kxnorw	%k0, %k0, %k4
	vgatherqpd	(%rcx,%zmm2), %zmm0 {%k1}
	vfmadd231pd	%zmm1, %zmm0, %zmm4     # zmm4 = (zmm0 * zmm1) + zmm4
	vgatherqpd	(%rcx,%zmm3), %zmm0 {%k2}
	vfmadd231pd	%zmm1, %zmm0, %zmm5     # zmm5 = (zmm0 * zmm1) + zmm5
	vscatterqpd	%zmm4, (%rcx,%zmm2) {%k3}
	vscatterqpd	%zmm5, (%rcx,%zmm3) {%k4}
	leaq	(%rcx,%rbx), %rcx
	kxnorw	%k0, %k0, %k1
	kxnorw	%k0, %k0, %k2
	kxnorw	%k0, %k0, %k3
	kxnorw	%k0, %k0, %k4
	vgatherqpd	(%rcx,%zmm2), %zmm0 {%k1}
	vfmadd231pd	%zmm1, %zmm0, %zmm6     # zmm6 = (zmm0 * zmm1) + zmm6
	vgatherqpd	(%rcx,%zmm3), %zmm0 {%k2}
	vfmadd231pd	%zmm1, %zmm0, %zmm7     # zmm7 = (zmm0 * zmm1) + zmm7
	vscatterqpd	%zmm6, (%rcx,%zmm2) {%k3}
	vscatterqpd	%zmm7, (%rcx,%zmm3) {%k4}
	leaq	(%rcx,%rbx), %rcx
	kxnorw	%k0, %k0, %k1
	kxnorw	%k0, %k0, %k2
	kxnorw	%k0, %k0, %k3
	kxnorw	%k0, %k0, %k4
	vgatherqpd	(%rcx,%zmm2), %zmm0 {%k1}
	vfmadd231pd	%zmm1, %zmm0, %zmm8     # zmm8 = (zmm0 * zmm1) + zmm8
	vgatherqpd	(%rcx,%zmm3), %zmm0 {%k2}
	vfmadd231pd	%zmm1, %zmm0, %zmm9     # zmm9 = (zmm0 * zmm1) + zmm9
	vscatterqpd	%zmm8, (%rcx,%zmm2) {%k3}
	vscatterqpd	%zmm9, (%rcx,%zmm3) {%k4}
	leaq	(%rcx,%rbx), %rcx
	kxnorw	%k0, %k0, %k1
	kxnorw	%k0, %k0, %k2
	kxnorw	%k0, %k0, %k3
	kxnorw	%k0, %k0, %k4
	vgatherqpd	(%rcx,%zmm2), %zmm0 {%k1}
	vfmadd231pd	%zmm1, %zmm0, %zmm10    # zmm10 = (zmm0 * zmm1) + zmm10
	vgatherqpd	(%rcx,%zmm3), %zmm0 {%k2}
	vfmadd231pd	%zmm1, %zmm0, %zmm11    # zmm11 = (zmm0 * zmm1) + zmm11
	vscatterqpd	%zmm10, (%rcx,%zmm2) {%k3}
	vscatterqpd	%zmm11, (%rcx,%zmm3) {%k4}
	leaq	(%rcx,%rbx), %rcx
	kxnorw	%k0, %k0, %k1
	kxnorw	%k0, %k0, %k2
	kxnorw	%k0, %k0, %k3
	kxnorw	%k0, %k0, %k4
	vgatherqpd	(%rcx,%zmm2), %zmm0 {%k1}
	vfmadd231pd	%zmm1, %zmm0, %zmm12    # zmm12 = (zmm0 * zmm1) + zmm12
	vgatherqpd	(%rcx,%zmm3), %zmm0 {%k2}
	vfmadd231pd	%zmm1, %zmm0, %zmm13    # zmm13 = (zmm0 * zmm1) + zmm13
	vscatterqpd	%zmm12, (%rcx,%zmm2) {%k3}
	vscatterqpd	%zmm13, (%rcx,%zmm3) {%k4}
	leaq	(%rcx,%rbx), %rcx
	kxnorw	%k0, %k0, %k1
	kxnorw	%k0, %k0, %k2
	kxnorw	%k0, %k0, %k3
	kxnorw	%k0, %k0, %k4
	vgatherqpd	(%rcx,%zmm2), %zmm0 {%k1}
	vfmadd231pd	%zmm1, %zmm0, %zmm14    # zmm14 = (zmm0 * zmm1) + zmm14
	vgatherqpd	(%rcx,%zmm3), %zmm0 {%k2}
	vfmadd231pd	%zmm1, %zmm0, %zmm15    # zmm15 = (zmm0 * zmm1) + zmm15
	vscatterqpd	%zmm14, (%rcx,%zmm2) {%k3}
	vscatterqpd	%zmm15, (%rcx,%zmm3) {%k4}
	leaq	(%rcx,%rbx), %rcx
	kxnorw	%k0, %k0, %k1
	kxnorw	%k0, %k0, %k2
	kxnorw	%k0, %k0, %k3
	kxnorw	%k0, %k0, %k4
	vgatherqpd	(%rcx,%zmm2), %zmm0 {%k1}
	vfmadd231pd	%zmm1, %zmm0, %zmm16    # zmm16 = (zmm0 * zmm1) + zmm16
	vgatherqpd	(%rcx,%zmm3), %zmm0 {%k2}
	vfmadd231pd	%zmm1, %zmm0, %zmm17    # zmm17 = (zmm0 * zmm1) + zmm17
	vscatterqpd	%zmm16, (%rcx,%zmm2) {%k3}
	vscatterqpd	%zmm17, (%rcx,%zmm3) {%k4}
	leaq	(%rcx,%rbx), %rcx
	kxnorw	%k0, %k0, %k1
	kxnorw	%k0, %k0, %k2
	kxnorw	%k0, %k0, %k3
	kxnorw	%k0, %k0, %k4
	vgatherqpd	(%rcx,%zmm2), %zmm0 {%k1}
	vfmadd231pd	%zmm1, %zmm0, %zmm18    # zmm18 = (zmm0 * zmm1) + zmm18
	vgatherqpd	(%rcx,%zmm3), %zmm0 {%k2}
	vfmadd231pd	%zmm1, %zmm0, %zmm19    # zmm19 = (zmm0 * zmm1) + zmm19
	vscatterqpd	%zmm18, (%rcx,%zmm2) {%k3}
	vscatterqpd	%zmm19, (%rcx,%zmm3) {%k4}
	leaq	(%rcx,%rbx), %rcx
	kxnorw	%k0, %k0, %k1
	kxnorw	%k0, %k0, %k2
	kxnorw	%k0, %k0, %k3
	kxnorw	%k0, %k0, %k4
	vgatherqpd	(%rcx,%zmm2), %zmm0 {%k1}
	vfmadd231pd	%zmm1, %zmm0, %zmm20    # zmm20 = (zmm0 * zmm1) + zmm20
	vgatherqpd	(%rcx,%zmm3), %zmm0 {%k2}
	vfmadd231pd	%zmm1, %zmm0, %zmm21    # zmm21 = (zmm0 * zmm1) + zmm21
	vscatterqpd	%zmm20, (%rcx,%zmm2) {%k3}
	vscatterqpd	%zmm21, (%rcx,%zmm3) {%k4}
	leaq	(%rcx,%rbx), %rcx
	kxnorw	%k0, %k0, %k1
	kxnorw	%k0, %k0, %k2
	kxnorw	%k0, %k0, %k3
	kxnorw	%k0, %k0, %k4
	vgatherqpd	(%rcx,%zmm2), %zmm0 {%k1}
	vfmadd231pd	%zmm1, %zmm0, %zmm22    # zmm22 = (zmm0 * zmm1) + zmm22
	vgatherqpd	(%rcx,%zmm3), %zmm0 {%k2}
	vfmadd231pd	%zmm1, %zmm0, %zmm23    # zmm23 = (zmm0 * zmm1) + zmm23
	vscatterqpd	%zmm22, (%rcx,%zmm2) {%k3}
	vscatterqpd	%zmm23, (%rcx,%zmm3) {%k4}
	leaq	(%rcx,%rbx), %rcx
	kxnorw	%k0, %k0, %k1
	kxnorw	%k0, %k0, %k2
	kxnorw	%k0, %k0, %k3
	kxnorw	%k0, %k0, %k4
	vgatherqpd	(%rcx,%zmm2), %zmm0 {%k1}
	vfmadd231pd	%zmm1, %zmm0, %zmm24    # zmm24 = (zmm0 * zmm1) + zmm24
	vgatherqpd	(%rcx,%zmm3), %zmm0 {%k2}
	vfmadd231pd	%zmm1, %zmm0, %zmm25    # zmm25 = (zmm0 * zmm1) + zmm25
	vscatterqpd	%zmm24, (%rcx,%zmm2) {%k3}
	vscatterqpd	%zmm25, (%rcx,%zmm3) {%k4}
	leaq	(%rcx,%rbx), %rcx
	kxnorw	%k0, %k0, %k1
	kxnorw	%k0, %k0, %k2
	kxnorw	%k0, %k0, %k3
	kxnorw	%k0, %k0, %k4
	vgatherqpd	(%rcx,%zmm2), %zmm0 {%k1}
	vfmadd231pd	%zmm1, %zmm0, %zmm26    # zmm26 = (zmm0 * zmm1) + zmm26
	vgatherqpd	(%rcx,%zmm3), %zmm0 {%k2}
	vfmadd231pd	%zmm1, %zmm0, %zmm27    # zmm27 = (zmm0 * zmm1) + zmm27
	vscatterqpd	%zmm26, (%rcx,%zmm2) {%k3}
	vscatterqpd	%zmm27, (%rcx,%zmm3) {%k4}
	leaq	(%rcx,%rbx), %rcx
	kxnorw	%k0, %k0, %k1
	kxnorw	%k0, %k0, %k2
	kxnorw	%k0, %k0, %k3
	kxnorw	%k0, %k0, %k4
	vgatherqpd	(%rcx,%zmm2), %zmm0 {%k1}
	vfmadd231pd	%zmm1, %zmm0, %zmm28    # zmm28 = (zmm0 * zmm1) + zmm28
	vgatherqpd	(%rcx,%zmm3), %zmm0 {%k2}
	vfmadd231pd	%zmm1, %zmm0, %zmm29    # zmm29 = (zmm0 * zmm1) + zmm29
	vscatterqpd	%zmm28, (%rcx,%zmm2) {%k3}
	vscatterqpd	%zmm29, (%rcx,%zmm3) {%k4}
	leaq	(%rcx,%rbx), %rcx
	kxnorw	%k0, %k0, %k1
	kxnorw	%k0, %k0, %k2
	kxnorw	%k0, %k0, %k3
	kxnorw	%k0, %k0, %k4
	vgatherqpd	(%rcx,%zmm2), %zmm0 {%k1}
	vfmadd231pd	%zmm1, %zmm0, %zmm30    # zmm30 = (zmm0 * zmm1) + zmm30
	vgatherqpd	(%rcx,%zmm3), %zmm0 {%k2}
	vfmadd231pd	%zmm1, %zmm0, %zmm31    # zmm31 = (zmm0 * zmm1) + zmm31
	vscatterqpd	%zmm30, (%rcx,%zmm2) {%k3}
	vscatterqpd	%zmm31, (%rcx,%zmm3) {%k4}
	leaq	(%rcx,%rbx), %rcx
	jmp	.LEND0
.LSCATTERBZ0:
	kxnorw	%k0, %k0, %k1
	kxnorw	%k0, %k0, %k2
	vscatterqpd	%zmm4, (%rcx,%zmm2) {%k1}
	vscatterqpd	%zmm5, (%rcx,%zmm3) {%k2}
	leaq	(%rcx,%rbx), %rcx
	kxnorw	%k0, %k0, %k1
	kxnorw	%k0, %k0, %k2
	vscatterqpd	%zmm6, (%rcx,%zmm2) {%k1}
	vscatterqpd	%zmm7, (%rcx,%zmm3) {%k2}
	leaq	(%rcx,%rbx), %rcx
	kxnorw	%k0, %k0, %k1
	kxnorw	%k0, %k0, %k2
	vscatterqpd	%zmm8, (%rcx,%zmm2) {%k1}
	vscatterqpd	%zmm9, (%rcx,%zmm3) {%k2}
	leaq	(%rcx,%rbx), %rcx
	kxnorw	%k0, %k0, %k1
	kxnorw	%k0, %k0, %k2
	vscatterqpd	%zmm10, (%rcx,%zmm2) {%k1}
	vscatterqpd	%zmm11, (%rcx,%zmm3) {%k2}
	leaq	(%rcx,%rbx), %rcx
	kxnorw	%k0, %k0, %k1
	kxnorw	%k0, %k0, %k2
	vscatterqpd	%zmm12, (%rcx,%zmm2) {%k1}
	vscatterqpd	%zmm13, (%rcx,%zmm3) {%k2}
	leaq	(%rcx,%rbx), %rcx
	kxnorw	%k0, %k0, %k1
	kxnorw	%k0, %k0, %k2
	vscatterqpd	%zmm14, (%rcx,%zmm2) {%k1}
	vscatterqpd	%zmm15, (%rcx,%zmm3) {%k2}
	leaq	(%rcx,%rbx), %rcx
	kxnorw	%k0, %k0, %k1
	kxnorw	%k0, %k0, %k2
	vscatterqpd	%zmm16, (%rcx,%zmm2) {%k1}
	vscatterqpd	%zmm17, (%rcx,%zmm3) {%k2}
	leaq	(%rcx,%rbx), %rcx
	kxnorw	%k0, %k0, %k1
	kxnorw	%k0, %k0, %k2
	vscatterqpd	%zmm18, (%rcx,%zmm2) {%k1}
	vscatterqpd	%zmm19, (%rcx,%zmm3) {%k2}
	leaq	(%rcx,%rbx), %rcx
	kxnorw	%k0, %k0, %k1
	kxnorw	%k0, %k0, %k2
	vscatterqpd	%zmm20, (%rcx,%zmm2) {%k1}
	vscatterqpd	%zmm21, (%rcx,%zmm3) {%k2}
	leaq	(%rcx,%rbx), %rcx
	kxnorw	%k0, %k0, %k1
	kxnorw	%k0, %k0, %k2
	vscatterqpd	%zmm22, (%rcx,%zmm2) {%k1}
	vscatterqpd	%zmm23, (%rcx,%zmm3) {%k2}
	leaq	(%rcx,%rbx), %rcx
	kxnorw	%k0, %k0, %k1
	kxnorw	%k0, %k0, %k2
	vscatterqpd	%zmm24, (%rcx,%zmm2) {%k1}
	vscatterqpd	%zmm25, (%rcx,%zmm3) {%k2}
	leaq	(%rcx,%rbx), %rcx
	kxnorw	%k0, %k0, %k1
	kxnorw	%k0, %k0, %k2
	vscatterqpd	%zmm26, (%rcx,%zmm2) {%k1}
	vscatterqpd	%zmm27, (%rcx,%zmm3) {%k2}
	leaq	(%rcx,%rbx), %rcx
	kxnorw	%k0, %k0, %k1
	kxnorw	%k0, %k0, %k2
	vscatterqpd	%zmm28, (%rcx,%zmm2) {%k1}
	vscatterqpd	%zmm29, (%rcx,%zmm3) {%k2}
	leaq	(%rcx,%rbx), %rcx
	kxnorw	%k0, %k0, %k1
	kxnorw	%k0, %k0, %k2
	vscatterqpd	%zmm30, (%rcx,%zmm2) {%k1}
	vscatterqpd	%zmm31, (%rcx,%zmm3) {%k2}
	leaq	(%rcx,%rbx), %rcx
.LEND0:
	vzeroupper

	#NO_APP
	addq	$56, %rsp
	popq	%rbx
	popq	%rdi
	popq	%rsi
	popq	%r12
	popq	%r13
	popq	%r14
	popq	%r15
	retq
	.seh_endproc
                                        # -- End function
	.data
	.p2align	6                               # @offsets
offsets:
	.quad	0                               # 0x0
	.quad	1                               # 0x1
	.quad	2                               # 0x2
	.quad	3                               # 0x3
	.quad	4                               # 0x4
	.quad	5                               # 0x5
	.quad	6                               # 0x6
	.quad	7                               # 0x7
	.quad	8                               # 0x8
	.quad	9                               # 0x9
	.quad	10                              # 0xa
	.quad	11                              # 0xb
	.quad	12                              # 0xc
	.quad	13                              # 0xd
	.quad	14                              # 0xe
	.quad	15                              # 0xf

	.section	.drectve,"yn"
	.ascii	" /DEFAULTLIB:msvcrt.lib"
	.addrsig
	.addrsig_sym offsets

assembly.zip

@devinamatthews
Member

The problem seems to be here:

	callq	*240(%rsp)                      # 8-byte Folded Reload
	movq	4800(%rbp), %rax
	movsd	(%rax), %xmm0                   # xmm0 = mem[0],zero
	ucomisd	%xmm6, %xmm0

The beta variable is loaded into xmm0 and compared to xmm6, which was zeroed out before the call to the microkernel. Of course, in the meantime the microkernel has overwritten it. Perhaps the craziness of Windows calling conventions is to blame here? Could we force the compiler to re-zero xmm6 by changing the calling convention used for the microkernel or some other mechanism?
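
To make the failure mode concrete, here is a minimal C sketch (hypothetical code, not the actual BLIS call site) of the pattern the caller relies on: the Windows x64 ABI treats xmm6 through xmm15 as callee-saved, so the compiler is free to keep a zero constant in xmm6 across the indirect call to the microkernel, and the later beta comparison silently breaks if the kernel clobbers that register without restoring it.

/* Minimal sketch of the failing pattern (hypothetical, not the BLIS source).
 * Under the Windows x64 ABI, xmm6-xmm15 are callee-saved, so the compiler
 * may keep `zero` in xmm6 across the indirect call to the microkernel.
 * If the kernel's inline asm clobbers xmm6 without restoring it, the beta
 * comparison below (the ucomisd seen in the disassembly) is made against
 * garbage instead of 0.0. */
typedef void (*microkernel_fn)(void);

void scale_c_after_kernel(microkernel_fn ukr, const double *beta,
                          double *c, int n)
{
    double zero = 0.0;   /* register allocator may park this in xmm6 */

    ukr();               /* microkernel overwrites xmm6 internally */

    if (*beta == zero) { /* compare against the (now clobbered) register */
        for (int i = 0; i < n; i++)
            c[i] = 0.0;  /* wrong branch may be taken, corrupting C */
    }
}

This is only an illustration of the ABI expectation; the actual caller is the compiled framework code shown in the disassembly above.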

@devinamatthews
Member

I guess the "hacky" solution is to zero out xmm6 on exit from the microkernel...
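
For illustration, a minimal sketch of what that hack could look like (my assumption, not the fix that was eventually merged): re-zero xmm6 on exit from the inline-asm kernel so the caller's zeroed comparison operand survives the call. It only papers over this one register rather than honoring the full callee-saved set xmm6-xmm15.

/* Hypothetical sketch of the "hacky" workaround mentioned above (not the
 * merged fix): re-zero xmm6 before returning from the microkernel so the
 * caller's zero constant is intact again. GCC/Clang extended-asm syntax. */
static inline void rezero_xmm6(void)
{
    __asm__ __volatile__("vxorpd %%xmm6, %%xmm6, %%xmm6" ::: "xmm6");
}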

@devinamatthews
Member

If you could also generate the assembly for kernels/haswell/3/bli_gemm_haswell_asm_d6x8.c, then I could see whether the compiler is zeroing out the xmm registers for us in that case but not for AVX512. On non-AVX512 (haswell), xmm6[0] should contain C[1][0] on exit, so I'm not sure why it doesn't also fail.

@devinamatthews
Member

Indeed the haswell kernel is saving xmm6 through xmm15 on the stack, but the skx kernel is not.

devinamatthews added a commit that referenced this issue Jul 7, 2021
Try using `-march=haswell` for kernels. Fixes #514.
@devinamatthews
Member

@h-vetinari please try the windows-avx512 branch again (force-pushed, only ad10dc1 matters).

@h-vetinari
Contributor Author

Perhaps the craziness of Windows calling conventions is to blame here? Could we force the compiler to re-zero xmm6 by changing the calling convention used for the microkernel or some other mechanism?

Not sure if related, but this reminds me of a recent numpy/scipy bug caused by Microsoft shipping a broken ucrtbase runtime that ended up corrupting various registers (and took them half a year to fix).

@devinamatthews
Member

@h-vetinari If I'm reading the logs right, it seems like this is fixed?

@h-vetinari
Contributor Author

Yes, switching to the haswell kernel made the reproducer run through on AVX512 as well 🥳
(I just couldn't stick around for the job to finish because it got very late.)

Thanks a lot for your help on this, much appreciated!

I'll discuss with Isuru building a new version for conda-forge with this patch, and see if this helps with the other failures I've been seeing (e.g. #517, as well as other scipy errors I haven't opened a ticket for yet).

@h-vetinari
Contributor Author

Also note that check-blastests.sh no longer produces failures with that patch either.

@devinamatthews
Member

Great, I'll also add a comment to the kernel makefile so this doesn't pop back up. FYI, it's not using the haswell kernel on AVX512 (that would lead to much lower performance); it just uses -march=haswell to maintain the same calling convention. The AVX512 instructions are hard-coded in inline ASM.

devinamatthews added a commit that referenced this issue Jul 8, 2021
Use `-march=haswell` for kernels. Fixes #514.
@h-vetinari
Contributor Author

Great, thanks, and thanks for the explanation.

Great, I'll also add a comment to the kernel makefile so this doesn't pop back up.

Not that it's a priority, but shouldn't it be possible to eventually fix this natively (not knowing at the moment whether it's the compiler, the runtime, blis, or something else that's at fault)?

@devinamatthews
Member

I consider this solved "once and for all", since it seems that the cause is a difference in calling convention between AVX512 and non-AVX512 code (or a compiler bug). I can't find any Microsoft documentation to support this, but then again I didn't think I would.

@h-vetinari
Contributor Author

h-vetinari commented Aug 8, 2021

Just to report the results of this (it took a while, sorry): the errors in the numpy test suite are now gone (conda-forge/numpy-feedstock#237), which is awesome!

Unfortunately, on the scipy side (conda-forge/scipy-feedstock#172), blis+AVX512 now hangs indefinitely. I'm going to open another issue for that. Edit: #526

At least #517 did not reappear, so I'm going to close that one.
