DAOS-16265 test: Fix erasurecode/rebuild_fio.py out of space (#15020) #15340

Merged 1 commit on Oct 21, 2024

phender
Contributor

@phender phender commented Oct 17, 2024

Prevent accumulating large server log files caused by temporarily enabling the DEBUG log mask while creating or destroying pools.

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: EcodFioRebuild EcodOnlineMultFail
Skip-func-hw-test-large-md-on-ssd: false

Before requesting gatekeeper:

  • Two review approvals and any prior change requests have been resolved.
  • Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
  • Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
  • Commit messages follow the guidelines outlined here.
  • Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

  • You are the appropriate gatekeeper to be landing the patch.
  • The PR has 2 reviews by people familiar with the code, including appropriate owners.
  • Githooks were used. If not, request that user install them and check copyright dates.
  • Checkpatch issues are resolved. Pay particular attention to ones that will show up on future PRs.
  • All builds have passed. Check non-required builds for any new compiler warnings.
  • Sufficient testing is done. Check feature pragmas and test tags and that tests skipped for the ticket are run and now pass with the changes.
  • If applicable, the PR has addressed any potential version compatibility issues.
  • Check the target branch. If it is master branch, should the PR go to a feature branch? If it is a release branch, does it have merge approval in the JIRA ticket?
  • Extra checks if forced landing is requested:
    • Review comments are sufficiently resolved, particularly by prior reviewers that requested changes.
    • No new NLT or valgrind warnings. Check the classic view.
    • Quick-build or Quick-functional is not used.
  • Fix the commit message upon landing. Check the standard here. Edit it to create a single commit. If necessary, ask submitter for a new summary.

Prevent accumulating large server log files caused by temporarily
enabling the DEBUG log mask while creating or destroying pools.

Skip-unit-tests: true
Skip-fault-injection-test: true
Test-tag: EcodFioRebuild EcodOnlineMultFail
Skip-func-hw-test-large-md-on-ssd: false

Signed-off-by: Phil Henderson <[email protected]>
@phender phender requested review from a team as code owners October 17, 2024 23:17
@phender phender added labels Oct 17, 2024: clean-cherry-pick (Cherry-pick from another branch that did not require additional edits), forced-landing (The PR has known failures or has intentionally reduced testing, but should still be landed.)
@phender phender requested a review from daltonbohning October 17, 2024 23:18

Ticket title is '[12-24]-./erasurecode/rebuild_fio.py:EcodFioRebuild.test_ec_online_rebuild_fio tests fail due to daos_server startup problem.'
Status is 'Awaiting backport'
Labels: 'ci_master_weekly,md_on_ssd,scrubbed_2.8,weekly_test,request_for_2.6.2'
https://daosio.atlassian.net/browse/DAOS-16265

@daosbuild1
Collaborator

Test stage Functional Hardware Large completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15340/1/execution/node/960/log

@daosbuild1
Collaborator

Test stage Functional Hardware Large MD on SSD completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15340/1/execution/node/976/log

@phender
Contributor Author

phender commented Oct 18, 2024

Failures in https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15340/1/testReport/:

2024-10-18 02:03:39,059 test             L0773 INFO | ----------------------------------------------------------------------------------------------------
2024-10-18 02:03:39,059 test             L0776 DEBUG| Common test directory (/var/tmp/daos_testing) contents (check > 90%):
2024-10-18 02:03:39,060 run_utils        L0470 DEBUG| Running on wolf-[304,318-325] with a 120 second timeout: df -h /var/tmp/daos_testing
2024-10-18 02:03:39,285 run_utils        L0336 DEBUG|   wolf-320 (rc=0):
2024-10-18 02:03:39,285 run_utils        L0341 DEBUG|     Filesystem      Size  Used Avail Use% Mounted on
2024-10-18 02:03:39,286 run_utils        L0341 DEBUG|     /dev/sda7        28G  6.9G   20G  27% /var/tmp
2024-10-18 02:03:39,286 run_utils        L0336 DEBUG|   wolf-304 (rc=0):
2024-10-18 02:03:39,286 run_utils        L0341 DEBUG|     Filesystem      Size  Used Avail Use% Mounted on
2024-10-18 02:03:39,286 run_utils        L0341 DEBUG|     /dev/sda7        28G  3.8G   23G  15% /var/tmp
2024-10-18 02:03:39,286 run_utils        L0336 DEBUG|   wolf-318 (rc=0):
2024-10-18 02:03:39,286 run_utils        L0341 DEBUG|     Filesystem      Size  Used Avail Use% Mounted on
2024-10-18 02:03:39,286 run_utils        L0341 DEBUG|     /dev/sda7        28G  8.4G   18G  33% /var/tmp
2024-10-18 02:03:39,286 run_utils        L0336 DEBUG|   wolf-319 (rc=0):
2024-10-18 02:03:39,286 run_utils        L0341 DEBUG|     Filesystem      Size  Used Avail Use% Mounted on
2024-10-18 02:03:39,286 run_utils        L0341 DEBUG|     /dev/sda7        28G   18G  8.0G  70% /var/tmp
2024-10-18 02:03:39,286 run_utils        L0336 DEBUG|   wolf-322 (rc=0):
2024-10-18 02:03:39,286 run_utils        L0341 DEBUG|     Filesystem      Size  Used Avail Use% Mounted on
2024-10-18 02:03:39,286 run_utils        L0341 DEBUG|     /dev/sda7        28G   83M   26G   1% /var/tmp
2024-10-18 02:03:39,286 run_utils        L0336 DEBUG|   wolf-323 (rc=0):
2024-10-18 02:03:39,286 run_utils        L0341 DEBUG|     Filesystem      Size  Used Avail Use% Mounted on
2024-10-18 02:03:39,286 run_utils        L0341 DEBUG|     /dev/sda7        28G   26G  4.0K 100% /var/tmp
2024-10-18 02:03:39,286 run_utils        L0336 DEBUG|   wolf-321 (rc=0):
2024-10-18 02:03:39,286 run_utils        L0341 DEBUG|     Filesystem      Size  Used Avail Use% Mounted on
2024-10-18 02:03:39,286 run_utils        L0341 DEBUG|     /dev/sda7        28G   14M   26G   1% /var/tmp
2024-10-18 02:03:39,286 run_utils        L0336 DEBUG|   wolf-324 (rc=0):
2024-10-18 02:03:39,286 run_utils        L0341 DEBUG|     Filesystem      Size  Used Avail Use% Mounted on
2024-10-18 02:03:39,286 run_utils        L0341 DEBUG|     /dev/sda7        28G  7.0M   26G   1% /var/tmp
2024-10-18 02:03:39,286 run_utils        L0336 DEBUG|   wolf-325 (rc=0):
2024-10-18 02:03:39,286 run_utils        L0341 DEBUG|     Filesystem      Size  Used Avail Use% Mounted on
2024-10-18 02:03:39,286 run_utils        L0341 DEBUG|     /dev/sda7        28G  6.9M   26G   1% /var/tmp
2024-10-18 02:03:39,287 run_utils        L0470 DEBUG| Running on wolf-323 with a 120 second timeout: du -sh /var/tmp/daos_testing/*
2024-10-18 02:03:39,472 run_utils        L0336 DEBUG|   wolf-323 (rc=0):
2024-10-18 02:03:39,472 run_utils        L0341 DEBUG|     4.0K	/var/tmp/daos_testing/cart_logs
2024-10-18 02:03:39,472 run_utils        L0341 DEBUG|     8.0K	/var/tmp/daos_testing/configs
2024-10-18 02:03:39,472 run_utils        L0341 DEBUG|     32K	/var/tmp/daos_testing/daosCA
2024-10-18 02:03:39,472 run_utils        L0341 DEBUG|     4.0K	/var/tmp/daos_testing/daos_configs
2024-10-18 02:03:39,472 run_utils        L0341 DEBUG|     4.0K	/var/tmp/daos_testing/daos_dumps
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     4.0K	/var/tmp/daos_testing/daos_logs
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     4.0K	/var/tmp/daos_testing/stacktraces
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     164K	/var/tmp/daos_testing/test_ec_multiple_rank_failure_daos_control.log
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     76K	/var/tmp/daos_testing/test_ec_multiple_rank_failure_daos_server_0.log.106425
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     8.6M	/var/tmp/daos_testing/test_ec_multiple_rank_failure_daos_server_0.log.112877
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     516K	/var/tmp/daos_testing/test_ec_multiple_rank_failure_daos_server_0.log.119559
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     30M	/var/tmp/daos_testing/test_ec_multiple_rank_failure_daos_server_0.log.126201
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     72K	/var/tmp/daos_testing/test_ec_multiple_rank_failure_daos_server_0.log.92443
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     1.9M	/var/tmp/daos_testing/test_ec_multiple_rank_failure_daos_server_0.log.99194
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     1.3G	/var/tmp/daos_testing/test_ec_multiple_rank_failure_daos_server_1.log.106239
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     1.4G	/var/tmp/daos_testing/test_ec_multiple_rank_failure_daos_server_1.log.113065
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     1.5G	/var/tmp/daos_testing/test_ec_multiple_rank_failure_daos_server_1.log.119745
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     1.9G	/var/tmp/daos_testing/test_ec_multiple_rank_failure_daos_server_1.log.126388
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     76K	/var/tmp/daos_testing/test_ec_multiple_rank_failure_daos_server_1.log.92630
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     2.0M	/var/tmp/daos_testing/test_ec_multiple_rank_failure_daos_server_1.log.99007
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     220K	/var/tmp/daos_testing/test_ec_multiple_rank_failure_daos_server_helper.log
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     156K	/var/tmp/daos_testing/test_ec_multiple_targets_on_diff_ranks_daos_control.log
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     40K	/var/tmp/daos_testing/test_ec_multiple_targets_on_diff_ranks_daos_server_0.log.171884
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     48K	/var/tmp/daos_testing/test_ec_multiple_targets_on_diff_ranks_daos_server_0.log.178366
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     40K	/var/tmp/daos_testing/test_ec_multiple_targets_on_diff_ranks_daos_server_0.log.184997
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     52K	/var/tmp/daos_testing/test_ec_multiple_targets_on_diff_ranks_daos_server_0.log.191388
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     52K	/var/tmp/daos_testing/test_ec_multiple_targets_on_diff_ranks_daos_server_0.log.197899
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     72K	/var/tmp/daos_testing/test_ec_multiple_targets_on_diff_ranks_daos_server_0.log.204583
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     738M	/var/tmp/daos_testing/test_ec_multiple_targets_on_diff_ranks_daos_server_1.log.172072
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     704M	/var/tmp/daos_testing/test_ec_multiple_targets_on_diff_ranks_daos_server_1.log.178557
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     694M	/var/tmp/daos_testing/test_ec_multiple_targets_on_diff_ranks_daos_server_1.log.184806
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     984M	/var/tmp/daos_testing/test_ec_multiple_targets_on_diff_ranks_daos_server_1.log.191576
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     1.1G	/var/tmp/daos_testing/test_ec_multiple_targets_on_diff_ranks_daos_server_1.log.198088
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     1.3G	/var/tmp/daos_testing/test_ec_multiple_targets_on_diff_ranks_daos_server_1.log.204395
2024-10-18 02:03:39,473 run_utils        L0341 DEBUG|     220K	/var/tmp/daos_testing/test_ec_multiple_targets_on_diff_ranks_daos_server_helper.log
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     152K	/var/tmp/daos_testing/test_ec_multiple_targets_on_same_rank_daos_control.log
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     40K	/var/tmp/daos_testing/test_ec_multiple_targets_on_same_rank_daos_server_0.log.132854
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     40K	/var/tmp/daos_testing/test_ec_multiple_targets_on_same_rank_daos_server_0.log.139462
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     40K	/var/tmp/daos_testing/test_ec_multiple_targets_on_same_rank_daos_server_0.log.146035
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     68K	/var/tmp/daos_testing/test_ec_multiple_targets_on_same_rank_daos_server_0.log.152260
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     48K	/var/tmp/daos_testing/test_ec_multiple_targets_on_same_rank_daos_server_0.log.158768
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     52K	/var/tmp/daos_testing/test_ec_multiple_targets_on_same_rank_daos_server_0.log.165270
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     368M	/var/tmp/daos_testing/test_ec_multiple_targets_on_same_rank_daos_server_1.log.133041
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     353M	/var/tmp/daos_testing/test_ec_multiple_targets_on_same_rank_daos_server_1.log.139275
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     346M	/var/tmp/daos_testing/test_ec_multiple_targets_on_same_rank_daos_server_1.log.145844
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     656M	/var/tmp/daos_testing/test_ec_multiple_targets_on_same_rank_daos_server_1.log.152449
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     733M	/var/tmp/daos_testing/test_ec_multiple_targets_on_same_rank_daos_server_1.log.158955
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     1.3G	/var/tmp/daos_testing/test_ec_multiple_targets_on_same_rank_daos_server_1.log.165458
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     220K	/var/tmp/daos_testing/test_ec_multiple_targets_on_same_rank_daos_server_helper.log
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     4.0K	/var/tmp/daos_testing/test_ec_single_target_rank_failure
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     124K	/var/tmp/daos_testing/test_ec_single_target_rank_failure_daos_control.log
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     60K	/var/tmp/daos_testing/test_ec_single_target_rank_failure_daos_server_0.log.210830
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     60K	/var/tmp/daos_testing/test_ec_single_target_rank_failure_daos_server_0.log.217340
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     84K	/var/tmp/daos_testing/test_ec_single_target_rank_failure_daos_server_0.log.223953
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     92K	/var/tmp/daos_testing/test_ec_single_target_rank_failure_daos_server_0.log.230281
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     36K	/var/tmp/daos_testing/test_ec_single_target_rank_failure_daos_server_0.log.237138
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     1.9G	/var/tmp/daos_testing/test_ec_single_target_rank_failure_daos_server_1.log.211017
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     1.4G	/var/tmp/daos_testing/test_ec_single_target_rank_failure_daos_server_1.log.217530
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     1.3G	/var/tmp/daos_testing/test_ec_single_target_rank_failure_daos_server_1.log.223764
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     2.1G	/var/tmp/daos_testing/test_ec_single_target_rank_failure_daos_server_1.log.223764.old
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     1.4G	/var/tmp/daos_testing/test_ec_single_target_rank_failure_daos_server_1.log.230470
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     2.1G	/var/tmp/daos_testing/test_ec_single_target_rank_failure_daos_server_1.log.230470.old
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     1.3G	/var/tmp/daos_testing/test_ec_single_target_rank_failure_daos_server_1.log.236950
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     184K	/var/tmp/daos_testing/test_ec_single_target_rank_failure_daos_server_helper.log
2024-10-18 02:03:39,474 run_utils        L0341 DEBUG|     4.0K	/var/tmp/daos_testing/user
2024-10-18 02:03:39,475 run_utils        L0341 DEBUG|     4.0K	/var/tmp/daos_testing/valgrind_logs
2024-10-18 02:03:39,475 test             L0788 INFO | ----------------------------------------------------------------------------------------------------
  • launch/functional_hardware_large/01-erasurecode-multiple_failure-launch.log
    • wolf-323 (rc=1): mkdir: cannot create directory /var/tmp/daos_testing/configs/daos_configs: No space left on device
  • 01-./erasurecode/multiple_failure.py:EcodOnlineMultFail.test_ec_multiple_rank_failure
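
To see which engine is filling /var/tmp on wolf-323, the `du -sh` listing above can be totalled per engine log. The following is a minimal sketch, not part of the DAOS test harness: `parse_size` and `usage_by_engine` are hypothetical helper names, and the code assumes each line ends with a `du -h` style size followed by a path, as in the log excerpt.

```python
import re
from collections import defaultdict

# Binary multipliers for the suffixes printed by `du -h`.
_UNITS = {"": 1, "K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}


def parse_size(text):
    """Convert a `du -h` size such as '1.3G' or '76K' to bytes."""
    match = re.fullmatch(r"([0-9.]+)([KMGT]?)", text.strip())
    if match is None:
        raise ValueError(f"unrecognized size: {text!r}")
    return int(float(match.group(1)) * _UNITS[match.group(2)])


def usage_by_engine(du_lines):
    """Sum the sizes of *_daos_server_<N>.log* files per engine index N.

    Only the last two whitespace-separated fields of each line are used,
    so lines that still carry a logging prefix are accepted as well.
    """
    totals = defaultdict(int)
    for line in du_lines:
        fields = line.split()
        if len(fields) < 2:
            continue
        size, path = fields[-2], fields[-1]
        found = re.search(r"daos_server_(\d+)\.log", path)
        if found:
            totals[int(found.group(1))] += parse_size(size)
    return dict(totals)
```

Run against the listing above, this kind of grouping makes it easy to confirm that the `daos_server_1` logs (hundreds of megabytes to about 2 GB each) account for nearly all of the space, while the `daos_server_0` logs stay in the kilobyte to megabyte range.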

In the Functional HW Large MD on SSD stage, the 24-./erasurecode/multiple_failure.py:EcodOnlineMultFail.test_ec_single_target_rank_failure test passed, and the highest /var/tmp usage on any host was 51% (wolf-110):

2024-10-18 04:34:05,371 test             L0773 INFO | ----------------------------------------------------------------------------------------------------
2024-10-18 04:34:05,371 test             L0776 DEBUG| Common test directory (/var/tmp/daos_testing) contents (check > 90%):
2024-10-18 04:34:05,371 run_utils        L0470 DEBUG| Running on wolf-[51,110-117] with a 120 second timeout: df -h /var/tmp/daos_testing
2024-10-18 04:34:05,592 run_utils        L0336 DEBUG|   wolf-110 (rc=0):
2024-10-18 04:34:05,592 run_utils        L0341 DEBUG|     Filesystem      Size  Used Avail Use% Mounted on
2024-10-18 04:34:05,592 run_utils        L0341 DEBUG|     /dev/sda7        28G   14G   13G  51% /var/tmp
2024-10-18 04:34:05,592 run_utils        L0336 DEBUG|   wolf-114 (rc=0):
2024-10-18 04:34:05,592 run_utils        L0341 DEBUG|     Filesystem      Size  Used Avail Use% Mounted on
2024-10-18 04:34:05,592 run_utils        L0341 DEBUG|     /dev/sda7        28G  357M   26G   2% /var/tmp
2024-10-18 04:34:05,592 run_utils        L0336 DEBUG|   wolf-113 (rc=0):
2024-10-18 04:34:05,592 run_utils        L0341 DEBUG|     Filesystem      Size  Used Avail Use% Mounted on
2024-10-18 04:34:05,592 run_utils        L0341 DEBUG|     /dev/sda7        28G  713M   26G   3% /var/tmp
2024-10-18 04:34:05,592 run_utils        L0336 DEBUG|   wolf-[51,115] (rc=0):
2024-10-18 04:34:05,592 run_utils        L0341 DEBUG|     Filesystem      Size  Used Avail Use% Mounted on
2024-10-18 04:34:05,593 run_utils        L0341 DEBUG|     /dev/sda7        28G  2.8G   24G  11% /var/tmp
2024-10-18 04:34:05,593 run_utils        L0336 DEBUG|   wolf-117 (rc=0):
2024-10-18 04:34:05,593 run_utils        L0341 DEBUG|     Filesystem      Size  Used Avail Use% Mounted on
2024-10-18 04:34:05,593 run_utils        L0341 DEBUG|     /dev/sda7        28G   12M   26G   1% /var/tmp
2024-10-18 04:34:05,593 run_utils        L0336 DEBUG|   wolf-112 (rc=0):
2024-10-18 04:34:05,593 run_utils        L0341 DEBUG|     Filesystem      Size  Used Avail Use% Mounted on
2024-10-18 04:34:05,593 run_utils        L0341 DEBUG|     /dev/sda7        28G   13G   14G  50% /var/tmp
2024-10-18 04:34:05,593 run_utils        L0336 DEBUG|   wolf-111 (rc=0):
2024-10-18 04:34:05,593 run_utils        L0341 DEBUG|     Filesystem      Size  Used Avail Use% Mounted on
2024-10-18 04:34:05,593 run_utils        L0341 DEBUG|     /dev/sda7        28G  362M   26G   2% /var/tmp
2024-10-18 04:34:05,593 run_utils        L0336 DEBUG|   wolf-116 (rc=0):
2024-10-18 04:34:05,593 run_utils        L0341 DEBUG|     Filesystem      Size  Used Avail Use% Mounted on
2024-10-18 04:34:05,593 run_utils        L0341 DEBUG|     /dev/sda7        28G   11M   26G   1% /var/tmp
2024-10-18 04:34:05,594 test             L0788 INFO | ----------------------------------------------------------------------------------------------------
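
The two `df -h` listings can be compared mechanically against the 90% threshold named in the log header ("check > 90%"). This is a rough sketch under stated assumptions: `over_threshold` is a hypothetical helper, the input is a plain dict mapping host name to raw `df -h` output, and it is not how the harness itself performs the check.

```python
def over_threshold(df_output_by_host, limit=90):
    """Return {host: use_percent} for hosts whose Use% exceeds limit."""
    flagged = {}
    for host, output in df_output_by_host.items():
        for line in output.splitlines():
            fields = line.split()
            # Data rows look like: /dev/sda7 28G 26G 4.0K 100% /var/tmp
            # The header row has 'Use%' in the same column and is skipped
            # because 'Use' is not numeric.
            if len(fields) < 6 or not fields[4].endswith("%"):
                continue
            percent = fields[4][:-1]
            if percent.isdigit() and int(percent) > limit:
                flagged[host] = int(percent)
    return flagged
```

With the values from the first listing, only wolf-323 (100%) would be flagged; with the values from the second listing, no host would be, since the highest usage there is 51%.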

@phender phender requested a review from a team October 21, 2024 18:03
@phender phender merged commit b913d3e into release/2.6 Oct 21, 2024
42 of 47 checks passed
@phender phender deleted the pahender/DAOS-16265 branch October 21, 2024 18:19