AZP/TEST: Add MAD tests #9735

Alexey-Rivkin · 2024-03-10T17:22:24Z

What

Add a ucx_perftest run between two nodes in MAD mode.

Why ?

How

High-level sequence of operations:
2. Build UCX with MAD on both nodes.
3. Run the server side in Docker.
4. Extract the server-side LID and GUID values and pass them as variables to the client-side tasks.
5. Run a client-side test using LID.
6. Restart the server-side.
7. Run a client-side test using GUID.
8. Stop the server-side.

Note the stage locking, to prevent parallel test execution.
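
Step 4 above hands the server-side values to later tasks. A minimal sketch of one way to do that in Azure Pipelines uses the `task.setvariable` logging command. The function name `publish_mad_vars` and the idea of passing the LID and GUID in as arguments are assumptions for illustration, not taken from this PR.

```shell
# Hypothetical helper: publish LID and GUID as Azure Pipelines output
# variables so that later client-side tasks can reference them.
publish_mad_vars() {
    local lid="${1:?usage: publish_mad_vars <lid> <guid>}"
    local guid="${2:?usage: publish_mad_vars <lid> <guid>}"

    # Azure Pipelines picks these lines up from the task's stdout.
    echo "##vso[task.setvariable variable=LID;isOutput=true]${lid}"
    echo "##vso[task.setvariable variable=GUID;isOutput=true]${guid}"
}
```

How the consuming task refers to these values depends on whether it runs in the same job, a different job, or a different stage.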

Alexey-Rivkin · 2024-03-11T17:29:14Z

Test build:
https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=77684

contrib/test_mad.sh

tvegas1 · 2024-03-11T17:41:20Z

buildlib/pr/mad_tests.yml

+        name: Set_Vars
+        inputs:
+          targetType: "inline"
+          script: ./contrib/test_mad.sh set_vars


For my understanding: why not source contrib/test_mad.sh and then call set_vars?

Sourcing for each step is repetitive. Why not directly execute the required functions?

I agree with Thomas: sourcing bash functions and then running them is best practice and what everyone is familiar with. Your implementation (running functions as a command-line argument of a bash script) is a little confusing and very unusual.
However, I understand that sourcing it in each task/step might look like repetitive work, but sometimes we can accept repetition in favor of readable code.
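
The two invocation styles discussed here are not mutually exclusive. The sketch below shows a script layout that supports both; the function bodies are placeholders, and the dispatch guard is an assumption about how such a script could be written, not the code in this PR.

```shell
# Placeholder functions standing in for the real ones.
set_vars() {
    echo "set_vars called"
}

run_mad_test() {
    echo "run_mad_test called with: $*"
}

# When the file is executed directly with arguments, run the first argument
# as a command with the remaining arguments. When the file is sourced, $0
# names the calling shell or script, so nothing is dispatched and the caller
# invokes the functions itself.
if [ "${BASH_SOURCE:-$0}" = "$0" ] && [ "$#" -gt 0 ]; then
    "$@"
fi
```

With this layout, `./contrib/test_mad.sh set_vars` and `source ./contrib/test_mad.sh` followed by `set_vars` behave the same. Note that `"$@"` runs whatever command is passed, not only functions defined in the file.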

tvegas1 · 2024-03-11T17:52:56Z

buildlib/tools/ucx_perftest.template

+StartLimitBurst=100
+StandardOutput=${PWD}/ucx_perftest.log
+StandardError=${PWD}/ucx_perftest.log
+ExecStart=${PWD}/install/bin/ucx_perftest -e -K ${HCA}


Should there be double quotes around the HCA bash variable and around PWD?

dpressle · 2024-04-07T09:25:31Z

@yosefe please review, thanks.

yosefe · 2024-04-07T12:13:49Z

buildlib/tools/test_mad.sh

+set -exE -o pipefail
+
+IMAGE="rdmz-harbor.rdmz.labs.mlnx/ucx/x86_64/rhel8.2/builder:mofed-5.0-1.0.0.0"
+cd "$BUILD_SOURCESDIRECTORY"


If this variable is undefined, the script will go to the home directory and do things there.
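
One way to address this, assuming the script should abort rather than continue in an unintended directory, is the `${VAR:?message}` expansion. The wrapper function name `enter_source_dir` is hypothetical.

```shell
# Abort with a clear message if the variable is unset or empty,
# instead of continuing in whatever directory `cd` ends up in.
enter_source_dir() {
    cd "${BUILD_SOURCESDIRECTORY:?BUILD_SOURCESDIRECTORY is not set}"
}
```

In a non-interactive shell the failed expansion terminates the script, so nothing after the `cd` runs.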

yosefe · 2024-04-07T12:15:20Z

buildlib/tools/test_mad.sh

+
+node_setup() {
+    funcname
+    sudo chmod 777 /dev/infiniband/umad*


This should be done by an admin, not by the test job.

And what if this folder gets recreated (e.g. after a reboot)? The test will fail, and we will need to chmod it manually.

So maybe the test itself should run as root to be able to access this device node.
In any case, the test should not make changes to the system.

Removed and added to the Confluence page (internal link) instead.

As umad files are created dynamically, permissions must be set before running the test.
Otherwise we end up with errors:
UCX ERROR mad_rpc_open_port(ca="mlx5_0" ca_port=1 mgmt_classes=IB_SA_CLASS) failed: Permission denied

The MAD files should be created by the openibd service. What do you mean by "created dynamically"?
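
If the test is not allowed to change permissions, it can still fail early with a clearer message than the `Permission denied` error above. The following is a sketch under that assumption; `check_umad_access` is a hypothetical helper, and the directory argument exists only so the check can be tried against a scratch directory.

```shell
# Hypothetical pre-check: verify access to the umad device nodes without
# modifying them.
check_umad_access() {
    local dir="${1:-/dev/infiniband}"
    local dev found=0

    for dev in "$dir"/umad*; do
        [ -e "$dev" ] || continue      # the glob did not match anything
        found=1
        if [ ! -r "$dev" ] || [ ! -w "$dev" ]; then
            echo "No read/write access to $dev; ask the node admin to fix permissions" >&2
            return 1
        fi
    done

    if [ "$found" -eq 0 ]; then
        echo "No umad devices found under $dir" >&2
        return 1
    fi
}
```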

yosefe · 2024-04-07T12:16:31Z

buildlib/tools/test_mad.sh

+build_ucx_in_docker() {
+    docker run --rm \
+        --name ucx_build_"$BUILD_BUILDID" \
+        -v "$PWD":"$PWD" -w "$PWD" \
+        -v /hpc/local:/hpc/local \
+        $IMAGE \
+        bash -c "source ./buildlib/tools/test_mad.sh && build_ucx"
+
+    sudo chown -R swx-azure-svc:ecryptfs "$PWD"
+}


can we do it by Azure "container:" field, like in other build jobs, instead of running docker manually?

IMO, we should separate business logic from flow. Having all the logic in this script makes it easy to run the tests without having to reverse engineer the AZP YAML.

Had to resort to the manual Docker run commands to run the service container in a detached mode.
To build UCX in Docker we can indeed use a separate job with a "container:" field.
As Daniel mentioned, this would affect the code's readability.

Usually our build jobs are in a separate task with a "container:" tag, so I think for consistency this one should do the same.

yosefe · 2024-04-07T12:18:31Z

buildlib/tools/test_mad.sh

+        -v /hpc/local:/hpc/local \
+        --gpus all --ulimit memlock=-1:-1 --device=/dev/infiniband/ \
+        $IMAGE \
+        bash -c "source ./buildlib/tools/test_mad.sh && \


Why is it necessary to source test_mad.sh?

Removed, thx.

yosefe · 2024-04-07T12:18:50Z

buildlib/tools/test_mad.sh

+    docker run --rm \
+        --detach \
+        --net=host \
+        -e HCA="$HCA" \


Why is this needed?

Removed, thx.

yosefe · 2024-04-07T12:20:38Z

buildlib/tools/test_mad.sh

+
+detect_hca() {
+    echo "Detect first active HCA port"
+    HCA="$(ibv_devinfo | awk '/hca_id:/ {hca=$2} /port:/ {port=$2} /PORT_ACTIVE/ {print hca ":" port; exit}')"


IMO it is better to return the value by printing it to stdout, so that the caller of this function does HCA=$(detect_hca).
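
A sketch of what this suggestion could look like, reusing the awk program from the diff above. The `found` flag and the non-zero exit status when no active port exists are additions for illustration.

```shell
# Print "<hca>:<port>" for the first active port, or fail if none is active.
detect_hca() {
    ibv_devinfo | awk '
        /hca_id:/     { hca = $2 }
        /port:/       { port = $2 }
        /PORT_ACTIVE/ { print hca ":" port; found = 1; exit }
        END           { exit !found }
    '
}
```

The caller then writes `HCA="$(detect_hca)"`. Any diagnostic message inside the function would have to go to stderr, since everything on stdout becomes part of the returned value.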

dpressle · 2024-04-08T07:49:54Z

buildlib/tools/test_mad.sh

+}
+
+run_mad_test() {
+    test_type="$1"


If you use arguments, you should validate them and make them local to the function, something like:
local -r test_type=${1:-}
local -r ib_add=${2:-}
[[ -z $test_type ]] && { exit 1; }
[[ -z $ib_add ]] && { exit 1; }

The ucx_perftest binary has built-in parameter validation.
Added local to limit the variable's scope.

tvegas1 · 2024-04-08T11:35:00Z

buildlib/pr/mad_tests.yml

+      - checkout: none
+      - bash: |
+          source ./buildlib/tools/test_mad.sh
+          run_mad_test lid $(LID)


We might as well pass either lid:$(LID) or guid:$(GUID) to simplify argument handling in the called function?

Done, thanks!
The simpler, the better.
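
A minimal sketch of handling such a single `lid:<value>` or `guid:<value>` argument with parameter expansion. The helper name `parse_ib_address` and its output format are hypothetical; how the value is finally passed to ucx_perftest is not shown here.

```shell
# Split "lid:<value>" or "guid:<value>" into its two parts and print them
# as "<kind> <value>". Returns non-zero on malformed input.
parse_ib_address() {
    local ib_address="${1:-}"
    local kind="${ib_address%%:*}"   # text before the first ':'
    local value="${ib_address#*:}"   # text after the first ':'

    case "$kind" in
        lid|guid) ;;
        *)
            echo "expected lid:<value> or guid:<value>, got '$ib_address'" >&2
            return 1
            ;;
    esac

    # Without a ':' both expansions return the whole string, so value == input.
    if [ -z "$value" ] || [ "$value" = "$ib_address" ]; then
        echo "missing value in '$ib_address'" >&2
        return 1
    fi

    printf '%s %s\n' "$kind" "$value"
}
```

For example, `parse_ib_address "lid:5"` prints `lid 5`, while `parse_ib_address "lid"` fails with a message on stderr.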

tvegas1 · 2024-04-08T11:44:08Z

Why do we have "Error response from daemon: No such container: ucx_perftest_79028" in server restart? https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=79028&view=logs&j=551595a6-c33f-5841-440b-a6858ba5ebd6&t=67dc2fe8-6bfb-54a0-4bcb-22f44968bb72&l=24

The server side exits after a test completes. The 'docker stop' command is only there to confirm that it has been stopped.

@tvegas1 please confirm this is expected behavior

Yes, it's expected behaviour: after a successful test, the server and client both exit successfully.
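
Because the script runs under `set -e`, a `docker stop` on a container that has already exited would abort it unless the failure is tolerated. A sketch of such a tolerant stop follows; `stop_server` is a hypothetical helper name, not the function used in this PR.

```shell
# Hypothetical helper: make sure the server container is gone.
# A missing container is the expected case after a successful test,
# so a failing `docker stop` must not abort a script running under `set -e`.
stop_server() {
    local name="${1:?usage: stop_server <container-name>}"
    docker stop "$name" >/dev/null 2>&1 || true
}
```

The trade-off is that this also hides other `docker stop` failures, such as an unreachable daemon, and it silences the "No such container" message in the log.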

yosefe · 2024-04-14T08:52:41Z

buildlib/tools/test_mad.sh

+funcname() {
+    set +x
+    echo "==== Running: ${FUNCNAME[1]} ===="
+    set -x
+}


seems redundant, it's used only once

yosefe · 2024-04-14T08:52:59Z

buildlib/tools/test_mad.sh

+}
+
+run_mad_test() {
+    local ib_add="$1"


what is "ib_add"? maybe "ib_address"?

Yes, changed

yosefe · 2024-04-14T14:15:20Z

buildlib/pr/main.yml

@@ -255,9 +255,9 @@ stages:
              long_test: $(long_test)
              test_static: $(test_static)

-  - stage: MadTests
+  - stage: ucx_perftestOverMAD_RTE


ucx_perftest_mad_rte

yosefe · 2024-04-14T14:15:45Z

buildlib/tools/test_mad.sh

@@ -10,7 +10,7 @@ fi
 cd "$BUILD_SOURCESDIRECTORY"

 build_ucx() {
-    funcname
+    echo "==== Running: ${FUNCNAME[1]} ===="


IMO can remove this

yosefe · 2024-04-14T14:30:25Z

@Alexey-Rivkin can you pls squash?

Alexey-Rivkin added the WIP-DNM Work in progress / Do not review label Mar 10, 2024

Alexey-Rivkin force-pushed the topic/MAD_tests branch 21 times, most recently from 0927e85 to 28a34a4 Compare March 11, 2024 17:15

Alexey-Rivkin force-pushed the topic/MAD_tests branch from 28a34a4 to 4832a6b Compare March 11, 2024 17:29

Alexey-Rivkin marked this pull request as ready for review March 11, 2024 17:30

Alexey-Rivkin removed the WIP-DNM Work in progress / Do not review label Mar 11, 2024

Alexey-Rivkin requested review from dpressle and tvegas1 March 11, 2024 17:35

Alexey-Rivkin force-pushed the topic/MAD_tests branch from 4832a6b to 2fa0151 Compare March 11, 2024 17:55

tvegas1 reviewed Mar 11, 2024

yosefe reviewed Apr 7, 2024

Alexey-Rivkin force-pushed the topic/MAD_tests branch 11 times, most recently from 5f2a2f5 to 917b2c8 Compare April 8, 2024 07:31

dpressle reviewed Apr 8, 2024

tvegas1 reviewed Apr 8, 2024

Alexey-Rivkin force-pushed the topic/MAD_tests branch 2 times, most recently from 715cc5e to 2321138 Compare April 8, 2024 21:46

Alexey-Rivkin marked this pull request as draft April 11, 2024 06:16

Alexey-Rivkin force-pushed the topic/MAD_tests branch from 65b4bcf to 10242eb Compare April 11, 2024 08:38

Alexey-Rivkin marked this pull request as ready for review April 11, 2024 12:14

yosefe reviewed Apr 14, 2024

yosefe approved these changes Apr 14, 2024

AZP/TEST: ucx_perftest over MAD RTE

5c7e141

Alexey-Rivkin force-pushed the topic/MAD_tests branch from ccd10a6 to 5c7e141 Compare April 14, 2024 16:08

yosefe enabled auto-merge April 14, 2024 16:36

yosefe merged commit 41c1b27 into openucx:master Apr 15, 2024
144 checks passed

Alexey-Rivkin deleted the topic/MAD_tests branch April 15, 2024 14:25

dpressle commented Mar 12, 2024 (edited)
