{devel}[foss/2021b] PyTorch v1.11.0 w/ Python 3.9.6 + CUDA 11.4.1 #16385
Conversation
# several tests are known to be flaky, and fail in some contexts (like having multiple GPUs available),
# so we allow up to 10 (out of ~90k) tests to fail before treating the installation as faulty
max_failed_tests = 10
With the merge of easybuilders/easybuild-easyblocks#2794, I'm guessing this will need to be higher. But let's see how many tests actually fail first; it might not be all that many, since we still patched failing tests when the original 1.11.0 easyconfig was developed :)
To (hopefully) add to this: I tried to install `PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb`, which originally failed. So I added `max_failed_tests = 10` to the easyconfig file and tried to install it like this:
`eb --include-easyblocks-from-pr=2794 --cuda-compute-capabilities=7.5 PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb`
I got:
WARNING: 0 test failure, 463 test errors (out of 57757):
distributed/pipeline/sync/skip/test_gpipe (12 skipped, 1 warning, 1 error)
distributed/pipeline/sync/skip/test_leak (1 warning, 8 errors)
distributed/pipeline/sync/test_bugs (1 skipped, 1 warning, 3 errors)
distributed/pipeline/sync/test_inplace (2 xfailed, 1 warning, 1 error)
distributed/pipeline/sync/test_pipe (1 passed, 8 skipped, 1 warning, 47 errors)
distributed/pipeline/sync/test_transparency (1 warning, 1 error)
distributed/rpc/cuda/test_tensorpipe_agent (107 total tests, errors=1)
distributed/rpc/test_faulty_agent (28 total tests, errors=28)
distributed/rpc/test_tensorpipe_agent (424 total tests, errors=412)
distributed/test_store (19 total tests, errors=1)
I guess we need to do a bit more tuning here. :-)
Also, see the changes in #16339
Test report from installation.
The installation failed with:
Running test_xnnpack_integration ... [2022-10-12 02:18:04.597373]
Executing ['/sw-eb/software/Python/3.9.6-GCCcore-11.2.0/bin/python', 'test_xnnpack_integration.py', '-v'] ... [2022-10-12 02:18:04.597482]
/dev/shm/hpcsw/eb-kabv0vz7/tmpcfnhusx2/lib/python3.9/site-packages/torch/cuda/__init__.py:82: UserWarning: CUDA initialization: CUDA unknown error - this may be due to an incorrectly set up environment, e.g. changing env variable CUDA_VISIBLE_DEVICES after program start. Setting the available devices to be zero. (Triggered internally at /dev/shm/hpcsw/PyTorch/1.11.0/foss-2021b-CUDA-11.4.1/pytorch/c10/cuda/CUDAFunctions.cpp:112.)
return torch._C._cuda_getDeviceCount() > 0
test_conv1d_basic (__main__.TestXNNPACKConv1dTransformPass) ... /dev/shm/hpcsw/eb-kabv0vz7/tmpcfnhusx2/lib/python3.9/site-packages/torch/testing/_internal/common_utils.py:424: UserWarning: Named tensors and all their associated APIs are an experimental feature and subject to change. Please do not use them for anything important until they are released as stable. (Triggered internally at /dev/shm/hpcsw/PyTorch/1.11.0/foss-2021b-CUDA-11.4.1/pytorch/c10/core/TensorImpl.h:1460.)
return callable(*args, **kwargs)
ok
ok
test_conv1d_with_relu_fc (__main__.TestXNNPACKConv1dTransformPass) ... skipped 'test is slow; run with PYTORCH_TEST_WITH_SLOW to enable test'
test_conv2d (__main__.TestXNNPACKOps) ... ok
test_conv2d_transpose (__main__.TestXNNPACKOps) ... ok
test_linear (__main__.TestXNNPACKOps) ... ok
test_linear_1d_input (__main__.TestXNNPACKOps) ... ok
test_decomposed_linear (__main__.TestXNNPACKRewritePass) ... ok
test_linear (__main__.TestXNNPACKRewritePass) ... ok
test_combined_model (__main__.TestXNNPACKSerDes) ... ok
test_conv2d (__main__.TestXNNPACKSerDes) ... ok
test_conv2d_transpose (__main__.TestXNNPACKSerDes) ... ok
test_linear (__main__.TestXNNPACKSerDes) ... ok
----------------------------------------------------------------------
Ran 12 tests in 141.679s
OK (skipped=1)
distributed/pipeline/sync/skip/test_gpipe failed!
distributed/pipeline/sync/skip/test_leak failed!
distributed/pipeline/sync/test_bugs failed!
distributed/pipeline/sync/test_inplace failed!
distributed/pipeline/sync/test_pipe failed!
distributed/pipeline/sync/test_transparency failed!
distributed/rpc/cuda/test_tensorpipe_agent failed!
distributed/rpc/test_faulty_agent failed!
distributed/rpc/test_tensorpipe_agent failed!
distributed/test_store failed!
distributions/test_distributions failed!
== 2022-10-12 02:20:34,765 filetools.py:382 INFO Path /dev/shm/hpcsw/eb-kabv0vz7/tmpcfnhusx2 successfully removed.
== 2022-10-12 02:20:36,991 pytorch.py:344 WARNING 0 test failure, 24 test errors (out of 88784):
distributed/pipeline/sync/skip/test_gpipe (12 skipped, 1 warning, 1 error)
distributed/pipeline/sync/skip/test_leak (1 warning, 8 errors)
distributed/pipeline/sync/test_bugs (1 skipped, 1 warning, 3 errors)
distributed/pipeline/sync/test_inplace (2 xfailed, 1 warning, 1 error)
distributed/pipeline/sync/test_pipe (1 passed, 8 skipped, 1 warning, 47 errors)
distributed/pipeline/sync/test_transparency (1 warning, 1 error)
distributions/test_distributions (216 total tests, errors=3, skipped=5)
The PyTorch test suite is known to include some flaky tests, which may fail depending on the specifics of the system or the context in which they are run. For this PyTorch installation, EasyBuild allows up to 10 tests to fail. We recommend to double check that the failing tests listed above are known to be flaky, or do not affect your intended usage of PyTorch. In case of doubt, reach out to the EasyBuild community (via GitHub, Slack, or mailing list).
== 2022-10-12 02:20:37,273 build_log.py:169 ERROR EasyBuild crashed with an error (at easybuild/base/exceptions.py:124 in __init__): Too many failed tests (24), maximum allowed is 10 (at easybuild/easyblocks/pytorch.py:348 in test_step)
== 2022-10-12 02:20:37,275 build_log.py:265 INFO ... (took 1 hour 58 mins 12 secs)
== 2022-10-12 02:20:37,278 filetools.py:2014 INFO Removing lock /sw-eb/software/.locks/_sw-eb_software_PyTorch_1.11.0-foss-2021b-CUDA-11.4.1.lock...
== 2022-10-12 02:20:37,280 filetools.py:382 INFO Path /sw-eb/software/.locks/_sw-eb_software_PyTorch_1.11.0-foss-2021b-CUDA-11.4.1.lock successfully removed.
== 2022-10-12 02:20:37,281 filetools.py:2018 INFO Lock removed: /sw-eb/software/.locks/_sw-eb_software_PyTorch_1.11.0-foss-2021b-CUDA-11.4.1.lock
== 2022-10-12 02:20:37,281 easyblock.py:4089 WARNING build failed (first 300 chars): Too many failed tests (24), maximum allowed is 10
== 2022-10-12 02:20:37,283 easyblock.py:319 INFO Closing log for application name PyTorch version 1.11.0
That run was done before any changes were made. Unfortunately, the log file is too large to upload.
I'm confused: two days ago you had 400-something errors (with pretty much all tests in `distributed/rpc/test_tensorpipe_agent` failing), and yesterday you had 24. What was the difference between these two runs?
Sorry for causing confusion. Two days ago I was trying out the new EasyBlock with `PyTorch-1.10.0-foss-2021a-CUDA-11.3.1.eb`, which is not the one in this PR. My mistake for posting it here, apologies.
Ah ok, clear then!
On a side note, I see the recent EasyBlock fails to properly count everything... Reported this in an issue and will fix it later.
Test report by @casparvl
Test report by @casparvl
On both nodes, I got:
These systems are Intel CPU + 4x Titan V.
@boegelbot please test @ generoso |
@casparvl: Request for testing this PR well received on login1. PR test command executed, test results coming soon (I hope)...
- notification for comment with ID 1277291633 processed
Message to humans: this is just bookkeeping information for me.
Test report by @boegelbot
]
tests = ['PyTorch-check-cpp-extension.py']

moduleclass = 'devel'
We have a moduleclass 'ai' now; that's probably where this should go.
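For illustration, the suggested change would just be the last line of the easyconfig (a minimal sketch, not taken from the PR diff):

```python
# use the dedicated module class for AI/ML software instead of the generic 'devel'
moduleclass = 'ai'
```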
Thanks for letting me know; that had slipped my mind. I fixed it.
@casparvl
Hm, I'm mainly concerned about
Can you find the section in the output file that reports the full error for those tests and paste it here? Or, if they are all very similar, just paste one as an example. Maybe we can verify whether it's a known issue. Again, I still don't think there's anything wrong with this PR as such, so if we can't figure it out in a reasonable time, I'd still propose to merge it: it works on Kenneth's system, two of mine, and Generoso, so this PR would probably work fine for many people and would thus be helpful to have merged.
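If it helps, something along these lines can pull the relevant section out of the (potentially huge) EasyBuild log; the log path is only a placeholder and will differ on your system:

```shell
# print each match for one failing test suite together with some trailing context,
# so the actual error/traceback becomes visible (log path is hypothetical)
grep -n -A 30 "distributed/rpc/test_tensorpipe_agent" \
    /tmp/eb-*/easybuild-PyTorch-1.11.0-*.log
```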
Update on this:
but the process is hanging here again, which I also observed inside the container:
I had a look at the patches from PR 16339, but none of them seem to be really applicable. I might be wrong here. Any suggestions? Given that it is working on other systems, I am somewhat inclined to say the issue is on my side. Where would I find the log file for that process?
System info:
EasyBuild:
I'm afraid I don't immediately know what's going on here. But one of the things you could try is to run this test individually, as I did during development.
That allows much quicker testing. Also, I've seen differences in the past between, e.g., running a test interactively on a node I just SSH-ed into and running it in a SLURM job. For me, a test that fails in my (SLURM) build job but succeeds when run interactively points to a build that is fine and a test that fails because of some environmental difference.
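As a rough sketch of what running a single test from the PyTorch test suite can look like (module name, paths and the chosen test are only examples, not taken from this PR):

```shell
# load the freshly built module (name is an example) and run one test file
module load PyTorch/1.11.0-foss-2021b-CUDA-11.4.1
cd /path/to/pytorch/test      # 'test' directory of the unpacked PyTorch sources
python run_test.py -i distributed/pipeline/sync/test_pipe --verbose
# or invoke the test file directly
python distributed/pipeline/sync/test_pipe.py -v
```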
Thanks for all your help, much appreciated!
The only thing I could not update was the way the PyTorch source code is downloaded, as my framework is not up for that right now; please feel free to update that. I don't want to block something which actually works, but we have an issue with the node I am doing the testing on. I will keep you posted on that.
@boegelbot please test @ generoso |
@casparvl: Request for testing this PR well received on login1. PR test command executed, test results coming soon (I hope)...
- notification for comment with ID 1282388299 processed
Message to humans: this is just bookkeeping information for me.
Test report by @boegelbot
Test report by @casparvl
I was able to build this with no issues using `eb --from-pr 16385`. Thanks for creating this for the 2021b toolchain.
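For reference, a fuller invocation along those lines might look like this (the extra options and the compute capability value are illustrative, adjust them to your site):

```shell
# build the easyconfig(s) from this PR, letting EasyBuild resolve missing dependencies;
# the CUDA compute capability is just an example value
eb --from-pr 16385 --robot --cuda-compute-capabilities=7.0
```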
@fizwit Pillow 8.3.2 is the version we're using as a dependency in various places in easyconfigs of the 2021b generation. Is there more information available on why the
Test report by @boegel edit:
Jumping back into this: if the tests are working at other sites, I am in favour of merging this PR, as I am not convinced that the setup we currently have is working as expected. So please don't wait for me to fix my failing tests when it works everywhere else! :-)
@fizwit Except for the missing checksum for the patch file, your modified
lgtm
Test report by @boegel
Test report by @boegel
I don't know about the first test, but the other three are likely fixed in my PyTorch 1.12 EC with:
Why add another 1.11 and not 1.12? Maybe some of the failures are already fixed in the 1.12 EC.
@Flamefire I guess we could upgrade this to PyTorch 1.12, but this PR has been sitting here for a while, so I figured I would get it merged... There's a bug in the PyTorch easyblock w.r.t. counting the tests here though, no?
more patches need to be added to fix/skip failing tests!
Test report by @boegel
FWIW: I'm testing the PyTorch 1.12 EC on foss-2021b to see what might need patching and what might be a real issue. Going to another toolchain always carries the possibility of hitting compatibility issues which may be resolved later; e.g., IIRC 1.10 doesn't work on 2022a as it is incompatible with Python 3.10. Maybe we are seeing something similar here: taking a closer look at the failures, they don't seem to be exactly the same as on 1.12.
easybuild/easyconfigs/p/PyTorch/PyTorch-1.11.0-foss-2021b-CUDA-11.4.1.eb
…g-extensions dependency (already included with Python)
Test report by @boegel
I'm currently testing with PyTorch 1.12 on 2021b and still have trouble. A test on an AVX2 x86 system is failing with a real issue: pytorch/pytorch#92246. The same test with the same inputs works on 2022a; it is only on 2021b where I get wrong results. I traced it into XNNPACK and am still investigating what the issue is. It is possible that the 2021b toolchain has a bug somewhere. Found it: GCC has a bug until 11.3 (i.e. "GCC 11.0 through 11.2"): https://stackoverflow.com/a/72837992/1930508
GCC bug fixed in:
I added PRs for PyTorch 1.12.1 on 2021b:
Maybe you can try those as a comparison. At least the CUDA version fails for me on some(?) nodes with
So check the logs for this if it already happens with PyTorch 1.11.0, which might hint that it isn't compatible with CUDA and/or cuBLAS 11.4.
Test report by @casparvl
@sassy-crick Since we now have an easyconfig merged for PyTorch 1.12.1 with
Closing this, superseded by a more recent
(created using `eb --new-pr`)