Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

AZP/TEST: Add MAD tests #9735

Merged
merged 1 commit into from
Apr 15, 2024
Merged

Conversation

Alexey-Rivkin
Copy link
Collaborator

@Alexey-Rivkin Alexey-Rivkin commented Mar 10, 2024

What

Add ucx_perftest between two nodes with MAD mode.

Why ?

HPCINFRA-1704 (Internal link).

How

High-level sequence of operations:
2. Build UCX with MAD on both nodes.
3. Run the server side in Docker.
4. Extract server-side target values for LID, GUID and pass them as vars to the client-side tasks.
5. Run a client-side test using LID.
6. Restart the server-side.
7. Run a client-side test using GUID.
8. Stop the server-side.

Note the stage locking, to prevent parallel test execution.

@Alexey-Rivkin Alexey-Rivkin added the WIP-DNM Work in progress / Do not review label Mar 10, 2024
@Alexey-Rivkin Alexey-Rivkin force-pushed the topic/MAD_tests branch 21 times, most recently from 0927e85 to 28a34a4 Compare March 11, 2024 17:15
@Alexey-Rivkin
Copy link
Collaborator Author

@Alexey-Rivkin Alexey-Rivkin marked this pull request as ready for review March 11, 2024 17:30
@Alexey-Rivkin Alexey-Rivkin removed the WIP-DNM Work in progress / Do not review label Mar 11, 2024
contrib/test_mad.sh Outdated Show resolved Hide resolved
contrib/test_mad.sh Outdated Show resolved Hide resolved
contrib/test_mad.sh Outdated Show resolved Hide resolved
contrib/test_mad.sh Outdated Show resolved Hide resolved
contrib/test_mad.sh Outdated Show resolved Hide resolved
contrib/test_mad.sh Outdated Show resolved Hide resolved
name: Set_Vars
inputs:
targetType: "inline"
script: ./contrib/test_mad.sh set_vars
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for my understanding: why not sourcing contrib/test_mad.sh then calling set_vars?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sourcing for each step is repetitive. Why not directly execute the required functions?

Copy link

@dpressle dpressle Mar 12, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Thomas, sourcing bash functions and then run them is best practice and what everyone is familier with, your implementation is a little bit confusing and very unusual (running functions as a cmdline argument of a bash script).
However, i understand that if we need to source it each task/step it might look like repetitive work, but some time we can have repetitive in favor of readable code

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

StartLimitBurst=100
StandardOutput=${PWD}/ucx_perftest.log
StandardError=${PWD}/ucx_perftest.log
ExecStart=${PWD}/install/bin/ucx_perftest -e -K ${HCA}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

double quotes around HCA bash variable and around PWD?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@dpressle
Copy link

dpressle commented Apr 7, 2024

@yosefe please review, thanks.

set -exE -o pipefail

IMAGE="rdmz-harbor.rdmz.labs.mlnx/ucx/x86_64/rhel8.2/builder:mofed-5.0-1.0.0.0"
cd "$BUILD_SOURCESDIRECTORY"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if this var is undefined the script will go to home dir and do things there


node_setup() {
funcname
sudo chmod 777 /dev/infiniband/umad*
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be done by admin, not by test job

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And what if this folder gets recreated (reboot???)? test will fail, and we will need to manually chmod it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so maybe the test itself should run as root to be able to access this device node
in any case the test should not perform changes on the system

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed and added to the Confluence page (internal link) instead.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As umad files are created dynamically, permissions must be set before running the test.
Otherwise we end up with errors:
UCX ERROR mad_rpc_open_port(ca="mlx5_0" ca_port=1 mgmt_classes=IB_SA_CLASS) failed: Permission denied

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the mad files should be created by openibd service, what do you mean by they are created dynamically?

Comment on lines 25 to 34
build_ucx_in_docker() {
docker run --rm \
--name ucx_build_"$BUILD_BUILDID" \
-v "$PWD":"$PWD" -w "$PWD" \
-v /hpc/local:/hpc/local \
$IMAGE \
bash -c "source ./buildlib/tools/test_mad.sh && build_ucx"

sudo chown -R swx-azure-svc:ecryptfs "$PWD"
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we do it by Azure "container:" field, like in other build jobs, instead of running docker manually?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO, we should separate business logic from flow, having all the logic in this script helps by allowing easy way of running the tests without having to reverse engineer the AZP yaml.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had to resort to the manual Docker run commands to run the service container in a detached mode.
To build UCX in Docker we can indeed use a separate job with a "container:" field.
As Daniel mentioned, this would affect the code's readability.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

usually our build jobs are in a separate task with "container:" tag. so i think for consistency this one should do it as well.

-v /hpc/local:/hpc/local \
--gpus all --ulimit memlock=-1:-1 --device=/dev/infiniband/ \
$IMAGE \
bash -c "source ./buildlib/tools/test_mad.sh && \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why need to source test_mad.sh?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed, thx.

docker run --rm \
--detach \
--net=host \
-e HCA="$HCA" \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why needed?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed, thx.


detect_hca() {
echo "Detect first active HCA port"
HCA="$(ibv_devinfo | awk '/hca_id:/ {hca=$2} /port:/ {port=$2} /PORT_ACTIVE/ {print hca ":" port; exit}')"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO better return the value as printing to stdout and caller of this function will do HCA=$(detect_hca)

buildlib/tools/test_mad.sh Outdated Show resolved Hide resolved
buildlib/tools/test_mad.sh Outdated Show resolved Hide resolved
buildlib/tools/test_mad.sh Outdated Show resolved Hide resolved
@Alexey-Rivkin Alexey-Rivkin force-pushed the topic/MAD_tests branch 11 times, most recently from 5f2a2f5 to 917b2c8 Compare April 8, 2024 07:31
}

run_mad_test() {
test_type="$1"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you use arguments you should validate them and make them local to function, something like:
local -r test_type=${1:-}
local -r ib_add=${2:-}
[[ -z $test_type ]] && { exit 1; }
[[ -z $ib_add ]] && { exit 1; }

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The ucx_perftest binary has built-in params validation.
Added local to limit var's scope.

- checkout: none
- bash: |
source ./buildlib/tools/test_mad.sh
run_mad_test lid $(LID)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we might as well either pass lid:$(LID) or guid:$(GUID) to simplify argument handling in the called function?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, thanks!
The simpler, the better.

@tvegas1
Copy link
Contributor

tvegas1 commented Apr 8, 2024

Why do we have "Error response from daemon: No such container: ucx_perftest_79028" in server restart? https://dev.azure.com/ucfconsort/ucx/_build/results?buildId=79028&view=logs&j=551595a6-c33f-5841-440b-a6858ba5ebd6&t=67dc2fe8-6bfb-54a0-4bcb-22f44968bb72&l=24

Server-side exit following a test. The 'docker stop' command is to confirm the killing.

@tvegas1 please confirm this is expected behavior

Yes it's expected behaviour, after a successful test, server and client both exit successfully.

@Alexey-Rivkin Alexey-Rivkin force-pushed the topic/MAD_tests branch 2 times, most recently from 715cc5e to 2321138 Compare April 8, 2024 21:46
@Alexey-Rivkin Alexey-Rivkin marked this pull request as draft April 11, 2024 06:16
@Alexey-Rivkin Alexey-Rivkin marked this pull request as ready for review April 11, 2024 12:14
Comment on lines 79 to 76
funcname() {
set +x
echo "==== Running: ${FUNCNAME[1]} ===="
set -x
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems redundant, it's used only once

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

}

run_mad_test() {
local ib_add="$1"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is "ib_add"? maybe "ib_address"?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, changed

buildlib/tools/test_mad.sh Show resolved Hide resolved
buildlib/pr/main.yml Outdated Show resolved Hide resolved
@@ -255,9 +255,9 @@ stages:
long_test: $(long_test)
test_static: $(test_static)

- stage: MadTests
- stage: ucx_perftestOverMAD_RTE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ucx_perftest_mad_rte

@@ -10,7 +10,7 @@ fi
cd "$BUILD_SOURCESDIRECTORY"

build_ucx() {
funcname
echo "==== Running: ${FUNCNAME[1]} ===="
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMO can remove this

@yosefe
Copy link
Contributor

yosefe commented Apr 14, 2024

@Alexey-Rivkin can you pls squash?

@yosefe yosefe enabled auto-merge April 14, 2024 16:36
@yosefe yosefe merged commit 41c1b27 into openucx:master Apr 15, 2024
144 checks passed
@Alexey-Rivkin Alexey-Rivkin deleted the topic/MAD_tests branch April 15, 2024 14:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants