AZP/TEST: Add MAD tests #9735
0927e85 to 28a34a4
28a34a4 to 4832a6b
4832a6b to 2fa0151
buildlib/pr/mad_tests.yml
Outdated
name: Set_Vars
inputs:
  targetType: "inline"
  script: ./contrib/test_mad.sh set_vars
For my understanding: why not source contrib/test_mad.sh and then call set_vars?
Sourcing for each step is repetitive. Why not directly execute the required functions?
I agree with Thomas: sourcing bash functions and then running them is best practice and what everyone is familiar with. Your implementation is a little confusing and very unusual (running functions as command-line arguments of a bash script).
However, I understand that sourcing the script in each task/step might look like repetitive work, but sometimes we can accept repetition in favor of readable code.
Done
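The two styles discussed in this thread can be compared side by side. This is a minimal sketch, assuming a hypothetical library file containing a single `set_vars` function (the real test_mad.sh is not reproduced here). The `BASH_SOURCE` guard is one common way to let a file support both styles; it is an illustration, not something taken from this PR.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for a function library such as test_mad.sh.
lib="$(mktemp)"
cat > "$lib" <<'EOF'
set_vars() {
    echo "set_vars called"
}

# Dispatch "$@" only when the file is executed directly, not when sourced.
if [[ "${BASH_SOURCE[0]}" == "$0" ]]; then
    "$@"
fi
EOF

# Style A (reviewers' preference): source the library, then call the function.
source "$lib"
out_sourced="$(set_vars)"

# Style B (original approach): pass the function name as a script argument.
out_dispatch="$(bash "$lib" set_vars)"

echo "sourced:  $out_sourced"
echo "dispatch: $out_dispatch"
rm -f "$lib"
```

Both styles end up running the same function; the difference is only in how visible that is to someone reading the pipeline step.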
buildlib/tools/ucx_perftest.template
Outdated
StartLimitBurst=100
StandardOutput=${PWD}/ucx_perftest.log
StandardError=${PWD}/ucx_perftest.log
ExecStart=${PWD}/install/bin/ucx_perftest -e -K ${HCA}
Should there be double quotes around the HCA bash variable and around PWD?
Done
@yosefe please review, thanks.
buildlib/tools/test_mad.sh
Outdated
set -exE -o pipefail

IMAGE="rdmz-harbor.rdmz.labs.mlnx/ucx/x86_64/rhel8.2/builder:mofed-5.0-1.0.0.0"
cd "$BUILD_SOURCESDIRECTORY"
If this variable is undefined, the script will go to the home dir and do things there.
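One way to address this concern is to refuse to continue when the variable is missing, so the script never operates in an unintended directory. This is a minimal sketch; the function name `enter_source_dir` is hypothetical and not part of the PR.

```shell
#!/usr/bin/env bash
# Hypothetical helper: fail fast if BUILD_SOURCESDIRECTORY is unset or empty,
# and only change directory when a value is present.
enter_source_dir() {
    local dir="${BUILD_SOURCESDIRECTORY:-}"
    if [[ -z "$dir" ]]; then
        echo "BUILD_SOURCESDIRECTORY is not set" >&2
        return 1
    fi
    cd "$dir"
}
```

A shorter alternative is the standard `${BUILD_SOURCESDIRECTORY:?}` parameter expansion, which aborts a non-interactive script with an error when the variable is unset or empty.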
buildlib/tools/test_mad.sh
Outdated
node_setup() {
    funcname
    sudo chmod 777 /dev/infiniband/umad*
This should be done by an admin, not by the test job.
And what if this folder gets recreated (after a reboot?)? The test will fail, and we will need to chmod it manually.
So maybe the test itself should run as root to be able to access this device node.
In any case, the test should not make changes to the system.
Removed and added to the Confluence page (internal link) instead.
As umad files are created dynamically, permissions must be set before running the test.
Otherwise we end up with errors:
UCX ERROR mad_rpc_open_port(ca="mlx5_0" ca_port=1 mgmt_classes=IB_SA_CLASS) failed: Permission denied
The MAD files should be created by the openibd service. What do you mean by "they are created dynamically"?
buildlib/tools/test_mad.sh
Outdated
build_ucx_in_docker() {
    docker run --rm \
        --name ucx_build_"$BUILD_BUILDID" \
        -v "$PWD":"$PWD" -w "$PWD" \
        -v /hpc/local:/hpc/local \
        $IMAGE \
        bash -c "source ./buildlib/tools/test_mad.sh && build_ucx"

    sudo chown -R swx-azure-svc:ecryptfs "$PWD"
}
Can we do it with the Azure "container:" field, like in other build jobs, instead of running docker manually?
IMO, we should separate business logic from flow. Having all the logic in this script makes it easy to run the tests without having to reverse-engineer the AZP YAML.
I had to resort to manual docker run commands to run the service container in detached mode.
To build UCX in Docker we can indeed use a separate job with a "container:" field.
As Daniel mentioned, this would affect the code's readability.
Usually our build jobs are in a separate task with a "container:" tag, so I think this one should do the same for consistency.
buildlib/tools/test_mad.sh
Outdated
    -v /hpc/local:/hpc/local \
    --gpus all --ulimit memlock=-1:-1 --device=/dev/infiniband/ \
    $IMAGE \
    bash -c "source ./buildlib/tools/test_mad.sh && \
Why is it necessary to source test_mad.sh?
Removed, thx.
buildlib/tools/test_mad.sh
Outdated
docker run --rm \
    --detach \
    --net=host \
    -e HCA="$HCA" \
Why is this needed?
Removed, thx.
buildlib/tools/test_mad.sh
Outdated
detect_hca() {
    echo "Detect first active HCA port"
    HCA="$(ibv_devinfo | awk '/hca_id:/ {hca=$2} /port:/ {port=$2} /PORT_ACTIVE/ {print hca ":" port; exit}')"
IMO it is better to return the value by printing it to stdout; the caller of this function can then do HCA=$(detect_hca).
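The suggestion can be sketched as follows. The awk program is the one from the snippet under review. The split into a separate `parse_first_active_port` function and the sample input are assumptions added here so the parsing can be exercised without InfiniBand hardware; the sample is illustrative, not captured from a real machine.

```shell
#!/usr/bin/env bash
# Read ibv_devinfo-style text on stdin and print the first active "hca:port".
parse_first_active_port() {
    awk '/hca_id:/ {hca=$2} /port:/ {port=$2} /PORT_ACTIVE/ {print hca ":" port; exit}'
}

# Print the result instead of assigning a global; the caller captures it.
detect_hca() {
    ibv_devinfo | parse_first_active_port
}

# Illustrative input only (hypothetical device name and port).
sample='hca_id: mlx5_0
        port:   1
                state:          PORT_ACTIVE (4)'

HCA="$(printf '%s\n' "$sample" | parse_first_active_port)"
echo "$HCA"
```

On a real node the caller would write `HCA="$(detect_hca)"`. Note that with this style any log message inside the function has to go to stderr, otherwise it ends up in the captured value.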
5f2a2f5 to 917b2c8
buildlib/tools/test_mad.sh
Outdated
}

run_mad_test() {
    test_type="$1"
If you use arguments, you should validate them and make them local to the function, something like:
local -r test_type=${1:-}
local -r ib_add=${2:-}
[[ -z $test_type ]] && { exit 1; }
[[ -z $ib_add ]] && { exit 1; }
The ucx_perftest binary has built-in parameter validation.
Added local to limit the variable's scope.
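For reference, the reviewer's validation pattern can be written out as a complete function. This is only a sketch: the usage message and the echo placeholder are assumptions, and the real ucx_perftest invocation is deliberately left out. It returns a non-zero status rather than calling `exit`, so it is safe to use from a sourced script.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: validate arguments before doing any work.
run_mad_test() {
    local -r test_type="${1:-}"
    local -r ib_address="${2:-}"
    if [[ -z "$test_type" || -z "$ib_address" ]]; then
        echo "usage: run_mad_test <lid|guid> <address>" >&2
        return 1
    fi
    # Placeholder for the real ucx_perftest invocation.
    echo "would run ${test_type} test against ${ib_address}"
}
```

Even when the called binary validates its own parameters, a check like this reports the mistake in terms of the wrapper's arguments rather than the binary's.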
buildlib/pr/mad_tests.yml
Outdated
- checkout: none
- bash: |
    source ./buildlib/tools/test_mad.sh
    run_mad_test lid $(LID)
We might as well pass either lid:$(LID) or guid:$(GUID) to simplify argument handling in the called function?
Done, thanks!
The simpler, the better.
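A single `type:value` argument can be split with plain parameter expansion. This is a sketch under assumptions: `parse_target` is a hypothetical name and the accepted types are taken from the two forms mentioned in this thread. In the pipeline YAML, `$(LID)` and `$(GUID)` are Azure Pipelines macros that are expanded before the script runs, so the function only ever sees the final string.

```shell
#!/usr/bin/env bash
# Split one "kind:value" argument such as "lid:5" into its two parts.
parse_target() {
    local -r target="${1:-}"
    local -r kind="${target%%:*}"   # text before the first colon
    local -r value="${target#*:}"   # text after the first colon
    case "$kind" in
        lid|guid) ;;
        *) echo "unknown target type: '$kind'" >&2; return 1 ;;
    esac
    if [[ "$target" != *:* || -z "$value" ]]; then
        echo "expected <lid|guid>:<value>, got '$target'" >&2
        return 1
    fi
    echo "$kind $value"
}
```

A caller could then use `read -r kind value < <(parse_target "$1")` to obtain both parts.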
Yes, it's expected behaviour: after a successful test, the server and client both exit successfully.
715cc5e to 2321138
65b4bcf to 10242eb
buildlib/tools/test_mad.sh
Outdated
funcname() {
    set +x
    echo "==== Running: ${FUNCNAME[1]} ===="
    set -x
}
seems redundant, it's used only once
Done
buildlib/tools/test_mad.sh
Outdated
}

run_mad_test() {
    local ib_add="$1"
what is "ib_add"? maybe "ib_address"?
Yes, changed
buildlib/pr/main.yml
Outdated
@@ -255,9 +255,9 @@ stages:
long_test: $(long_test)
test_static: $(test_static)

- stage: MadTests
- stage: ucx_perftestOverMAD_RTE
ucx_perftest_mad_rte
buildlib/tools/test_mad.sh
Outdated
@@ -10,7 +10,7 @@ fi
cd "$BUILD_SOURCESDIRECTORY"

build_ucx() {
    funcname
    echo "==== Running: ${FUNCNAME[1]} ===="
IMO can remove this
@Alexey-Rivkin can you pls squash?
ccd10a6 to 5c7e141
What
Add a ucx_perftest run between two nodes in MAD mode.
Why
HPCINFRA-1704 (Internal link).
How
High-level sequence of operations:
2. Build UCX with MAD on both nodes.
3. Run the server side in Docker.
4. Extract server-side target values for LID, GUID and pass them as vars to the client-side tasks.
5. Run a client-side test using LID.
6. Restart the server side.
7. Run a client-side test using GUID.
8. Stop the server side.
Note the stage locking, which prevents parallel test execution.
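Step 4 above passes values from one pipeline step to later ones. In Azure Pipelines this is commonly done with the `task.setvariable` logging command, which a script emits on stdout. The sketch below assumes that mechanism; whether this PR uses exactly this form is not shown here, and the helper name and sample values are hypothetical.

```shell
#!/usr/bin/env bash
# Emit an Azure Pipelines logging command that sets a variable for later steps.
set_pipeline_var() {
    local -r name="$1"
    local -r value="$2"
    echo "##vso[task.setvariable variable=${name}]${value}"
}

# Illustrative values; in the real job they would come from the server node.
set_pipeline_var LID 5
set_pipeline_var GUID 0x0002c903000e0b72
```

Later steps in the same job can then refer to the values as `$(LID)` and `$(GUID)`. Passing them to a different job additionally requires `isOutput=true` and an output-variable reference in the consuming job.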