Add Topologies for 16-GPU gfx942 SuperNode #1417

Open
wants to merge 2 commits into
base: develop

BKitor
Contributor

@BKitor BKitor commented Nov 8, 2024

Support for GigaIO's 16x MI300x SuperNode.
Adds a rome_model and updates the search algorithm to use hardware-efficient rings.
Some of the targeted topologies are provided as .xml files in topo_expl.
There are some other adjustments to the a2a and 4p2h parsing methods. This is so that non-consecutively numbered subsets of the system can still find efficient topologies.

@wenkaidu
Collaborator

wenkaidu commented Nov 8, 2024

@BKitor can you elaborate why "ranks" are preferred over "dev" as GPU identifier?

@BKitor
Contributor Author

BKitor commented Nov 8, 2024

The dev values aren't guaranteed to be consecutive from 0 to ngpus - 1. The rank values are.
This was a problem when running with HIP_VISIBLE_DEVICES.
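
To make the distinction concrete, here is a minimal sketch of a dev-to-rank lookup. It is not taken from the PR: `buildDevToRank` and `devOfRank` are hypothetical names, and the sketch simply assumes that each rank reports one unique device id.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Hypothetical helper, not part of RCCL.
// devOfRank[r] is the device id reported for rank r. Device ids may be
// sparse (for example {2, 3, 6, 7} when only a subset of GPUs is visible),
// while ranks 0..n-1 are dense by construction.
std::unordered_map<int, int> buildDevToRank(const std::vector<int>& devOfRank) {
  std::unordered_map<int, int> devToRank;
  for (int rank = 0; rank < static_cast<int>(devOfRank.size()); ++rank) {
    // emplace() reports whether the key was new; a repeat means bad input.
    bool inserted = devToRank.emplace(devOfRank[rank], rank).second;
    assert(inserted && "duplicate device id");
    (void)inserted;  // silence unused-variable warnings when NDEBUG is set
  }
  return devToRank;
}
```

With such a map, code that indexes arrays of size ngpus can use the rank, and only the lookup itself has to deal with sparse device ids.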

@wenkaidu
Collaborator

wenkaidu commented Nov 8, 2024

Thanks! That's a good observation. Can you help apply the same dev2rank conversion to the other matching functions in rome_models.cc as well?

@BKitor
Contributor Author

BKitor commented Nov 11, 2024

@wenkaidu I've refactored the dev2rank mapping a bit, and extended it to the other matching functions.
I'm going off the assumption that there are 3 values a GPU can be referenced by: its 'rocm-smi' id, its rank, and its index in system->nodes[GPU]. If each GPU's 'system index' is guaranteed to be equal to its rank, this could probably be simplified further. If not, this implementation saves the devids during parseRomeSystem, and uses the devids to build the gpu_map before parseGraph[Light].
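
As a rough illustration of the three identifiers described above, the sketch below rewrites a ring written in device ids into ranks. It is only a sketch under stated assumptions: `GpuIds` and `ringDevToRank` are hypothetical names, and the actual gpu_map in rome_models.cc may be organized differently.

```cpp
#include <cassert>
#include <unordered_map>
#include <vector>

// Hypothetical record of the three ways a GPU can be referenced.
struct GpuIds {
  int devId;     // id reported by tooling such as rocm-smi; may be sparse
  int rank;      // communicator rank
  int sysIndex;  // position in the system's GPU node list
};

// Rewrites a ring expressed in device ids into ranks.
// Returns false if the ring mentions a device that is not in `gpus`,
// which is what happens when a model refers to a GPU that is not visible.
bool ringDevToRank(const std::vector<GpuIds>& gpus,
                   const std::vector<int>& ringByDev,
                   std::vector<int>* ringByRank) {
  assert(ringByRank != nullptr);
  std::unordered_map<int, int> devToRank;
  for (const GpuIds& g : gpus) devToRank[g.devId] = g.rank;

  ringByRank->clear();
  for (int dev : ringByDev) {
    auto it = devToRank.find(dev);
    if (it == devToRank.end()) return false;  // unknown device id
    ringByRank->push_back(it->second);
  }
  return true;
}
```

Keeping the device ids alongside the rank means the translation does not rely on the system index and the rank happening to coincide.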

@wenkaidu
Collaborator

@BKitor The patch looks good. However, some model matches are failing, for example models 82 and 83. Can you take a look at the issue?

- Add GigaIO topologies to tools/topo_expl for dev and testing
- Add GigaIO Columba 16 GPU romeModel and adjust topology
  matching algorithm in rome_models for 16 GPU system
- Fix bug which failed to match Rome Model when using subsets
  of system resources (i.e. ROCR_VISIBLE_DEVICES is set)
- Fixes for topo_expl
@wenkaidu
Collaborator

@BKitor I'm having trouble building topo_expl with your latest commit. I got this error:

hipify_rccl/graph/rome_models.cc:1:1: error: expected unqualified-id
    1 | / *
      | ^
In file included from hipify_rccl/graph/rome_models.cc:23:
In file included from hipify_rccl/include/core.h:37:
hipify_rccl/include/alloc.h:285:30: error: use of undeclared identifier '_SC_PAGESIZE'
  285 |   size_t page_size = sysconf(_SC_PAGESIZE);
      |                              ^
/usr/include/x86_64-linux-gnu/bits/confname.h:134:24: note: expanded from macro '_SC_PAGESIZE'
  134 | #define _SC_PAGESIZE                    _SC_PAGESIZE

@BKitor
Contributor Author

BKitor commented Nov 18, 2024

Commit 3b69c50 was busted, should be fixed with 5a0766c. I've been doing git commit --amend and force pushing, which might not be the most seamless way of sharing work across systems. You might need to git reset --hard HEAD^ && git pull to make sure everything is up to date.

@wenkaidu
Collaborator

Thanks! It works now.

@corey-derochie-amd
Collaborator

Unit test fails on Extended pipeline for "rhel8 && 16gfx90a" platform.

2b9c3f24d715:19030:19046 [5] [ INFO     ] SP 2-ranks AllReduce (custom-scalar Mode 0 ncclFloat32)

[ INFO     ] SP 2-ranks AllReduce (custom-scalar Mode 1 ncclFloat32)

[ INFO     ] SP 3-ranks AllReduce (custom-scalar Mode 0 ncclFloat32)

[ INFO     ] SP 3-ranks AllReduce (custom-scalar Mode 1 ncclFloat32)

[ INFO     ] SP 4-ranks AllReduce (custom-scalar Mode 0 ncclFloat32)

[ INFO     ] SP 4-ranks AllReduce (custom-scalar Mode 1 ncclFloat32)

[ INFO     ] SP 5-ranks AllReduce (custom-scalar Mode 0 ncclFloat32)

[ INFO     ] SP 5-ranks AllReduce (custom-scalar Mode 1 ncclFloat32)

[ INFO     ] SP 6-ranks AllReduce (custom-scalar Mode 0 ncclFloat32)

[ INFO     ] SP 6-ranks AllReduce (custom-scalar Mode 1 ncclFloat32)

[ INFO     ] SP 7-ranks AllReduce (custom-scalar Mode 0 ncclFloat32)

[ INFO     ] SP 7-ranks AllReduce (custom-scalar Mode 1 ncclFloat32)

[ INFO     ] SP 8-ranks AllReduce (custom-scalar Mode 0 ncclFloat32)

[ INFO     ] SP 8-ranks AllReduce (custom-scalar Mode 1 ncclFloat32)

[ INFO     ] SP 9-ranks AllReduce (custom-scalar Mode 0 ncclFloat32)

[ INFO     ] SP 9-ranks AllReduce (custom-scalar Mode 1 ncclFloat32)

[ INFO     ] SP 10-ranks AllReduce (custom-scalar Mode 0 ncclFloat32)

[ INFO     ] SP 10-ranks AllReduce (custom-scalar Mode 1 ncclFloat32)

[ INFO     ] SP 11-ranks AllReduce (custom-scalar Mode 0 ncclFloat32)

[ INFO     ] SP 11-ranks AllReduce (custom-scalar Mode 1 ncclFloat32)

[ INFO     ] SP 12-ranks AllReduce (custom-scalar Mode 0 ncclFloat32)

[ INFO     ] SP 12-ranks AllReduce (custom-scalar Mode 1 ncclFloat32)

[ INFO     ] SP 13-ranks AllReduce (custom-scalar Mode 0 ncclFloat32)

[ INFO     ] SP 13-ranks AllReduce (custom-scalar Mode 1 ncclFloat32)

[ INFO     ] SP 14-ranks AllReduce (custom-scalar Mode 0 ncclFloat32)

[ INFO     ] SP 14-ranks AllReduce (custom-scalar Mode 1 ncclFloat32)

[ INFO     ] SP 15-ranks AllReduce (custom-scalar Mode 0 ncclFloat32)

[ INFO     ] SP 15-ranks AllReduce (custom-scalar Mode 1 ncclFloat32)

[ ERROR    ] Child 0 reports failure

../../../test/common/TestBed.cpp:183: Failure

Expected equality of these values:

  response

    Which is: 1

  TEST_SUCCESS

    Which is: 0



hipify/src/graph/rings.cc:38 NCCL WARN Error : ring 0 does not loop back to start (8 != 5)

2b9c3f24d715:19030:19046 [5] NCCL INFO hipify/src/graph/connect.cc:727 -> 3

2b9c3f24d715:19030:19046 [5] NCCL INFO hipify/src/init.cc:1648 -> 3



2b9c3f24d715:19030:19045 [4] hipify/src/graph/rings.cc:38 NCCL WARN Error : ring 0 does not loop back to start (8 != 4)

2b9c3f24d715:19030:19045 [4] NCCL INFO hipify/src/graph/connect.cc:727 -> 3

2b9c3f24d715:19030:19045 [4] NCCL INFO hipify/src/init.cc:1648 -> 3

2b9c3f24d715:19030:19046 [5] NCCL INFO hipify/src/init.cc:2030 -> 3



2b9c3f24d715:19030:19050 [9] hipify/src/graph/rings.cc:38 NCCL WARN Error : ring 0 does not loop back to start (8 != 9)

2b9c3f24d715:19030:19046 [5] NCCL INFO hipify/src/group.cc:69 -> 3 [Async thread]

2b9c3f24d715:19030:19045 [4] NCCL INFO hipify/src/init.cc:2030 -> 3

2b9c3f24d715:19030:19050 [9] NCCL INFO hipify/src/graph/connect.cc:727 -> 3

2b9c3f24d715:19030:19045 [4] NCCL INFO hipify/src/group.cc:69 -> 3 [Async thread]

2b9c3f24d715:19030:19050 [9] NCCL INFO hipify/src/init.cc:1648 -> 3



2b9c3f24d715:19030:19055 [14] hipify/src/graph/rings.cc:38 NCCL WARN Error : ring 0 does not loop back to start (8 != 14)

2b9c3f24d715:19030:19055 [14] NCCL INFO hipify/src/graph/connect.cc:727 -> 3

2b9c3f24d715:19030:19055 [14] NCCL INFO hipify/src/init.cc:1648 -> 3

2b9c3f24d715:19030:19050 [9] NCCL INFO hipify/src/init.cc:2030 -> 3

2b9c3f24d715:19030:19050 [9] NCCL INFO hipify/src/group.cc:69 -> 3 [Async thread]

2b9c3f24d715:19030:19042 [1] NCCL INFO MSCCL: No external scheduler found, using internal implementation

2b9c3f24d715:19030:19055 [14] NCCL INFO hipify/src/init.cc:2030 -> 3

2b9c3f24d715:19030:19055 [14] NCCL INFO hipify/src/group.cc:69 -> 3 [Async thread]

2b9c3f24d715:19030:19042 [1] NCCL INFO Using MSCCL files from /var/jenkins_home/workspace/main_extended_rccl_PR-1417/2tLyLYsrp/rccl/build/release/msccl-algorithms



2b9c3f24d715:19030:19042 [1] hipify/src/graph/rings.cc:38 NCCL WARN Error : ring 0 does not loop back to start (8 != 1)

2b9c3f24d715:19030:19042 [1] NCCL INFO hipify/src/graph/connect.cc:727 -> 3

2b9c3f24d715:19030:19042 [1] NCCL INFO hipify/src/init.cc:1648 -> 3

2b9c3f24d715:19030:19042 [1] NCCL INFO hipify/src/init.cc:2030 -> 3

2b9c3f24d715:19030:19042 [1] NCCL INFO hipify/src/group.cc:69 -> 3 [Async thread]

2b9c3f24d715:19030:19030 [15] NCCL INFO hipify/src/group.cc:438 -> 3

2b9c3f24d715:19030:19030 [15] NCCL INFO hipify/src/group.cc:108 -> 3

[ ERROR    ] Child process 0 fails NCCL call ncclGroupEnd with code 3

[ ERROR    ] Child 0 failed on command [INIT_COMMS]:



2b9c3f24d715:19030:19030 [15] hipify/src/enqueue.cc:1609 NCCL WARN Error : no algorithm/protocol available

2b9c3f24d715:19030:19030 [15] NCCL INFO hipify/src/enqueue.cc:837 -> 3

2b9c3f24d715:19030:19030 [15] NCCL INFO hipify/src/enqueue.cc:1328 -> 3

2b9c3f24d715:19030:19030 [15] NCCL INFO hipify/src/group.cc:143 -> 3

2b9c3f24d715:19030:19030 [15] NCCL INFO hipify/src/group.cc:355 -> 3



[Process: 19030] Inside handler function signal: Segmentation fault (11)

/var/jenkins_home/workspace/main_extended_rccl_PR-1417/2tLyLYsrp/rccl/build/release/librccl.so.1(+0x107104) [0x7f970924d104]

/lib64/libc.so.6(+0x4ead0) [0x7f9706b49ad0]

/var/jenkins_home/workspace/main_extended_rccl_PR-1417/2tLyLYsrp/rccl/build/release/librccl.so.1(+0xa4c90) [0x7f97091eac90]

/var/jenkins_home/workspace/main_extended_rccl_PR-1417/2tLyLYsrp/rccl/build/release/librccl.so.1(+0xa3f2e) [0x7f97091e9f2e]

/var/jenkins_home/workspace/main_extended_rccl_PR-1417/2tLyLYsrp/rccl/build/release/librccl.so.1(+0xa39d1) [0x7f97091e99d1]

/var/jenkins_home/workspace/main_extended_rccl_PR-1417/2tLyLYsrp/rccl/build/release/librccl.so.1(+0xa36d6) [0x7f97091e96d6]

./rccl-UnitTests(_ZN15RcclUnitTesting12TestBedChild18ExecuteCollectivesEv+0x120b) [0x2c6f7b]

./rccl-UnitTests(_ZN15RcclUnitTesting12TestBedChild18StartExecutionLoopEv+0x1c6) [0x2c4366]

./rccl-UnitTests(_ZN15RcclUnitTesting7TestBed9InitCommsERKSt6vectorIS1_IiSaIiEESaIS3_EERKS3_S9_ib+0x7d4) [0x2b30d4]

./rccl-UnitTests(_ZN15RcclUnitTesting7TestBed9InitCommsERKSt6vectorIS1_IiSaIiEESaIS3_EEiiib+0x187) [0x2b4347]

./rccl-UnitTests(_ZN15RcclUnitTesting28AllReduce_PreMultScalar_Test8TestBodyEv+0x12e) [0x27429e]

./rccl-UnitTests(_ZN7testing8internal35HandleExceptionsInMethodIfSupportedINS_4TestEvEET0_PT_MS4_FS3_vEPKc+0x3a) [0x30007a]

./rccl-UnitTests() [0x2e9208]

./rccl-UnitTests(_ZN7testing8TestInfo3RunEv+0x26a) [0x2e955a]

./rccl-UnitTests() [0x2f1381]

./rccl-UnitTests(_ZN7testing8internal12UnitTestImpl11RunAllTestsEv+0xbe8) [0x2f2068]

./rccl-UnitTests(_ZN7testing8UnitTest3RunEv+0x79) [0x2f2689]

./rccl-UnitTests(main+0x41) [0x2a8ad1]

/lib64/libc.so.6(__libc_start_main+0xf3) [0x7f9706b35ca3]

./rccl-UnitTests(_start+0x2e) [0x26e28e]

[ INFO     ] SP 16-ranks AllReduce (custom-scalar Mode 0 ncclFloat32)

[ ERROR    ] Child 0 pipe closed unexpectedly

script returned exit code 1
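
The repeated warning "ring 0 does not loop back to start" in the log above suggests that following each rank's successor did not bring the walk back to the rank it started from. The sketch below shows one way such a check can be written. It is an illustration modeled on the message only, not the code in rings.cc; `ringLoopsBack` and the `next` array are hypothetical.

```cpp
#include <cassert>
#include <vector>

// Illustrative check: next[r] is the successor of rank r in the ring.
// Returns true only if, starting from `start`, following `next` visits
// every rank exactly once and then arrives back at `start`.
bool ringLoopsBack(const std::vector<int>& next, int start) {
  const int nranks = static_cast<int>(next.size());
  assert(start >= 0 && start < nranks);
  std::vector<bool> seen(nranks, false);
  int current = start;
  for (int step = 0; step < nranks; ++step) {
    // An out-of-range successor or a repeated rank means the ring is broken.
    if (current < 0 || current >= nranks || seen[current]) return false;
    seen[current] = true;
    current = next[current];
  }
  return current == start;
}
```

A ring model that maps two device ids to the same rank, or omits one, would fail this kind of check for every starting rank, which is consistent with several ranks reporting the warning at once.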

@wenkaidu
Collaborator

@BKitor Yes, I can confirm the issue can be reproduced with topo_expl -m 56.

@BKitor
Contributor Author

BKitor commented Nov 25, 2024

It was a one-line fix in Parse1H16P. The outputs now match what 'develop' generates, and the change shouldn't affect any of the other passing topologies.
@wenkaidu Could you please re-launch the extended CI?

4 participants