{devel}[foss/2021b] PyTorch v1.11.0 w/ Python 3.9.6 + CUDA 11.4.1 #16385
Conversation
# several tests are known to be flaky, and fail in some contexts (like having multiple GPUs available),
# so we allow up to 10 (out of ~90k) tests to fail before treating the installation as faulty
max_failed_tests = 10
With the merge of easybuilders/easybuild-easyblocks#2794, I'm guessing this will need to be higher. But let's see how many tests actually fail first; it might not be all that many, since we still patched failing tests when the original 1.11.0 easyconfig was developed :)
To (hopefully) add to this: I tried to install `PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb`, which originally failed. So I added `max_failed_tests = 10` to the easyconfig file and tried to install it like this:
`eb --include-easyblocks-from-pr=2794 --cuda-compute-capabilities=7.5 PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb`
I got:
WARNING: 0 test failure, 463 test errors (out of 57757):
distributed/pipeline/sync/skip/test_gpipe (12 skipped, 1 warning, 1 error)
distributed/pipeline/sync/skip/test_leak (1 warning, 8 errors)
distributed/pipeline/sync/test_bugs (1 skipped, 1 warning, 3 errors)
distributed/pipeline/sync/test_inplace (2 xfailed, 1 warning, 1 error)
distributed/pipeline/sync/test_pipe (1 passed, 8 skipped, 1 warning, 47 errors)
distributed/pipeline/sync/test_transparency (1 warning, 1 error)
distributed/rpc/cuda/test_tensorpipe_agent (107 total tests, errors=1)
distributed/rpc/test_faulty_agent (28 total tests, errors=28)
distributed/rpc/test_tensorpipe_agent (424 total tests, errors=412)
distributed/test_store (19 total tests, errors=1)
I guess we need to do a bit more tuning here. :-)
Also, see the changes in #16339
Test report from installation.
The installation failed with:
Running test_xnnpack_integration ... [2022-10-12 02:18:04.597373]
Executing ['/sw-eb/software/Python/3.9.6-GCCcore-11.2.0/bin/python', 'test_xnnpack_integration.py', '-v'] ... [2022-10-12 02:18:04.597482]
/dev/shm/hpcsw/eb-kabv0vz7/tmpcfnhusx2/lib/python3.9/site-packages/torch/cuda/__init__.py:82: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /dev/shm/hpcsw/PyTorch/1.11.0/foss-2021b-CUDA-11.4.1/pytorch/c10/cuda/CUDAFunctions.cpp:112.)
return torch._C._cuda_getDeviceCount() > 0
test_conv1d_basic (__main__.TestXNNPACKConv1dTransformPass) ... /dev/shm/hpcsw/eb-kabv0vz7/tmpcfnhusx2/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py:424: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /dev/shm/hpcsw/PyTorch/1.11.0/foss-2021b-CUDA-11.4.1/pytorch/c10/core/TensorImpl.h:1460.)
return callable(*args, **kwargs)
ok
ok
test_conv1d_with_relu_fc (__main__.TestXNNPACKConv1dTransformPass) ... skipped 'test is slow; run with PYTORCH_TEST_WITH_SLOW to enable test'
test_conv2d (__main__.TestXNNPACKOps) ... ok
test_conv2d_transpose (__main__.TestXNNPACKOps) ... ok
test_linear (__main__.TestXNNPACKOps) ... ok
test_linear_1d_input (__main__.TestXNNPACKOps) ... ok
test_decomposed_linear (__main__.TestXNNPACKRewritePass) ... ok
test_linear (__main__.TestXNNPACKRewritePass) ... ok
test_combined_model (__main__.TestXNNPACKSerDes) ... ok
test_conv2d (__main__.TestXNNPACKSerDes) ... ok
test_conv2d_transpose (__main__.TestXNNPACKSerDes) ... ok
test_linear (__main__.TestXNNPACKSerDes) ... ok
----------------------------------------------------------------------
Ran 12 tests in 141.679s
OK (skipped=1)
distributed/pipeline/sync/skip/test_gpipe failed!
distributed/pipeline/sync/skip/test_leak failed!
distributed/pipeline/sync/test_bugs failed!
distributed/pipeline/sync/test_inplace failed!
distributed/pipeline/sync/test_pipe failed!
distributed/pipeline/sync/test_transparency failed!
distributed/rpc/cuda/test_tensorpipe_agent failed!
distributed/rpc/test_faulty_agent failed!
distributed/rpc/test_tensorpipe_agent failed!
distributed/test_store failed!
distributions/test_distributions failed!
== 2022-10-12 02:20:34,765 filetools.py:382 INFO Path /dev/shm/hpcsw/eb-kabv0vz7/tmpcfnhusx2 successfully removed.
== 2022-10-12 02:20:36,991 pytorch.py:344 WARNING 0 test failure, 24 test errors (out of 88784):
distributed/pipeline/sync/skip/test_gpipe (12 skipped, 1 warning, 1 error)
distributed/pipeline/sync/skip/test_leak (1 warning, 8 errors)
distributed/pipeline/sync/test_bugs (1 skipped, 1 warning, 3 errors)
distributed/pipeline/sync/test_inplace (2 xfailed, 1 warning, 1 error)
distributed/pipeline/sync/test_pipe (1 passed, 8 skipped, 1 warning, 47 errors)
distributed/pipeline/sync/test_transparency (1 warning, 1 error)
distributions/test_distributions (216 total tests, errors=3, skipped=5)
The PyTorch test suite is known to include some flaky tests, which may fail depending on the specifics of the system or the context in which they are run. For this PyTorch installation, EasyBuild allows up to 10 tests to fail. We recommend to double check that the failing tests listed above are known to be flaky, or do not affect your intended usage of PyTorch. In case of doubt, reach out to the EasyBuild community (via GitHub, Slack, or mailing list).
== 2022-10-12 02:20:37,273 build_log.py:169 ERROR EasyBuild crashed with an error (at easybuild/base/exceptions.py:124 in __init__): Too many failed tests (24), maximum allowed is 10 (at easybuild/easyblocks/pytorch.py:348 in test_step)
== 2022-10-12 02:20:37,275 build_log.py:265 INFO ... (took 1 hour 58 mins 12 secs)
== 2022-10-12 02:20:37,278 filetools.py:2014 INFO Removing lock /sw-eb/software/.locks/_sw-eb_software_PyTorch_1.11.0-foss-2021b-CUDA-11.4.1.lock...
== 2022-10-12 02:20:37,280 filetools.py:382 INFO Path /sw-eb/software/.locks/_sw-eb_software_PyTorch_1.11.0-foss-2021b-CUDA-11.4.1.lock successfully removed.
== 2022-10-12 02:20:37,281 filetools.py:2018 INFO Lock removed: /sw-eb/software/.locks/_sw-eb_software_PyTorch_1.11.0-foss-2021b-CUDA-11.4.1.lock
== 2022-10-12 02:20:37,281 easyblock.py:4089 WARNING build failed (first 300 chars): Too many failed tests (24), maximum allowed is 10
== 2022-10-12 02:20:37,283 easyblock.py:319 INFO Closing log for application name PyTorch version 1.11.0
That run was done before any changes were made. Unfortunately, the log file is too large to upload.
I'm confused: two days ago you had 400-something errors (with pretty much all tests in `distributed/rpc/test_tensorpipe_agent` failing), and yesterday you had 24. What was the difference between these two runs?
Sorry for causing confusion. Two days ago I was trying out the new EasyBlock with `PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb`, which is not the one in this PR. My mistake for posting it here, apologies.
Ah ok, clear then!
On a side note, I see the recent EasyBlock fails to properly count everything... Reported this in an issue and will fix it later.
Test report by @casparvl
Test report by @casparvl
On both nodes, I got:
These systems are Intel CPU + 4x Titan V.
@boegelbot please test @ generoso |
@casparvl: Request for testing this PR well received on login1. PR test command executed, test results coming soon (I hope)...
- notification for comment with ID 1277291633 processed
Message to humans: this is just bookkeeping information for me.
Test report by @boegelbot
]
tests = ['PyTorch-check-cpp-extension.py']

moduleclass = 'devel'
We have a moduleclass 'ai' now; that's probably where this should go.
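For illustration, the suggested change would just be the last line of the easyconfig (a minimal sketch, not taken from the PR diff):

```python
# use the dedicated module class for AI/ML software instead of the generic 'devel'
moduleclass = 'ai'
```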
Thanks for letting me know; that had slipped my mind. I fixed it.
@casparvl
Hm, I'm mainly concerned about
Can you find the section in the output file that reports the full error for those tests and paste it here? Or, if they are all very similar, just paste one as an example. Maybe we can verify whether it's a known issue. Again, I still don't think there's anything wrong with this PR as such, so if we can't figure it out in a reasonable time, I'd still propose to merge it: it works on Kenneth's system, two of mine, and Generoso, so this PR would probably work fine for many people and would thus be helpful to have merged.
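If it helps, something along these lines can pull the relevant section out of the (potentially huge) EasyBuild log; the log path is only a placeholder and will differ on your system:

```shell
# print each match for one failing test suite together with some trailing context,
# so the actual error/traceback becomes visible (log path is hypothetical)
grep -n -A 30 "distributed/rpc/test_tensorpipe_agent" \
    /tmp/eb-*/easybuild-PyTorch-1.11.0-*.log
```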
Update on this:
but the process is hanging here again, which I also observed inside the container:
I had a look at the patches from PR 16339, but none of them seem to be really applicable. I might be wrong here. Any suggestions? Given that it is working on other systems, I am somewhat inclined to say the issue is on my side. Where would I find the log file for that process?
System info:
EasyBuild:
I'm afraid I don't immediately know what's going on here. But one of the things you could try is to run this test individually, as I did during development.
That allows much quicker testing. Also, I've seen differences in the past between, e.g., running a test interactively on a node I just SSH-ed into and running it in a SLURM job. For me, a test that fails in my (SLURM) build job but succeeds when run interactively points to a build that is fine and a test that fails because of some environmental difference.
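As a rough sketch of what running a single test from the PyTorch test suite can look like (module name, paths and the chosen test are only examples, not taken from this PR):

```shell
# load the freshly built module (name is an example) and run one test file
module load PyTorch/1.11.0-foss-2021b-CUDA-11.4.1
cd /path/to/pytorch/test      # 'test' directory of the unpacked PyTorch sources
python run_test.py -i distributed/pipeline/sync/test_pipe --verbose
# or invoke the test file directly
python distributed/pipeline/sync/test_pipe.py -v
```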
Thanks for all your help, much appreciated!
The only thing I could not update was the way the PyTorch source code is downloaded, as my framework is not up for that right now; please feel free to update that. I don't want to block something which actually works, but we have an issue with the node I am doing the testing on. I will keep you posted on that.
@boegelbot please test @ generoso |
@casparvl: Request for testing this PR well received on login1. PR test command executed, test results coming soon (I hope)...
- notification for comment with ID 1282388299 processed
Message to humans: this is just bookkeeping information for me.
Test report by @boegelbot
Test report by @casparvl
I was able to build this with no issues using `eb --from-pr 16385`. Thanks for creating this for the 2021b toolchain.
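For reference, a fuller invocation along those lines might look like this (the extra options and the compute capability value are illustrative, adjust them to your site):

```shell
# build the easyconfig(s) from this PR, letting EasyBuild resolve missing dependencies;
# the CUDA compute capability is just an example value
eb --from-pr 16385 --robot --cuda-compute-capabilities=7.0
```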
@fizwit Pillow 8.3.2 is the version we're using as a dependency in various places in easyconfigs of the 2021b generation. Is there more information available on why the
Test report by @boegel edit:
Jumping back into this: if the tests are working at other sites, I am in favour of merging this PR, as I am not convinced that the setup we currently have is working as expected. So please don't wait for me to fix my failing tests when it works everywhere else! :-)
@fizwit Except for the missing checksum for the patch file, your modified
lgtm
Test report by @boegel
Test report by @boegel
I don't know about the first test, but the other three are likely fixed in my PyTorch 1.12 EC with:
Why add another 1.11 and not 1.12? Maybe some of the failures are already fixed in the 1.12 EC.
@Flamefire I guess we could upgrade this to PyTorch 1.12, but this PR has been sitting here for a while, so I figured I would get it merged... There's a bug in the PyTorch easyblock w.r.t. counting the tests here though, no?
more patches need to be added to fix/skip failing tests!
Test report by @boegel
FWIW: I'm testing the PyTorch 1.12 EC on foss-2021b to see what might need patching and what might be a real issue. Going to another toolchain always carries the possibility of hitting compatibility issues which may be resolved later; e.g., IIRC 1.10 doesn't work on 2022a as it is incompatible with Python 3.10. Maybe we are seeing something similar here: taking a closer look at the failures, they don't seem to be exactly the same as on 1.12.
easybuild/easyconfigs/p/PyTorch/PyTorch-1.11.0-foss-2021b-CUDA-11.4.1.eb
…g-extensions dependency (already included with Python)
Test report by @boegel
I'm currently testing with PyTorch 1.12 on 2021b and still have trouble. A test on an AVX2 x86 system is failing with a real issue: pytorch/pytorch#92246. The same test with the same inputs works on 2022a; it is only on 2021b where I get wrong results. I traced it into XNNPACK and am still investigating what the issue is. It is possible that the 2021b toolchain has a bug somewhere. Found it: GCC has a bug until 11.3 (i.e. "GCC 11.0 through 11.2"): https://stackoverflow.com/a/72837992/1930508
GCC bug fixed in:
I added PRs for PyTorch 1.12.1 on 2021b:
Maybe you can try those as a comparison. At least the CUDA version fails for me on some(?) nodes with
So check the logs for this if it already happens with PyTorch 1.11.0, which might hint that it isn't compatible with CUDA and/or cuBLAS 11.4.
Test report by @casparvl
@sassy-crick Since we now have an easyconfig merged for PyTorch 1.12.1 with
Closing this, superseded by a more recent
(created using `eb --new-pr`)