mjmac/dead ranks #15609

Merged
jolivier23 merged 7 commits into google/2.6 from mjmac/dead_ranks
Dec 13, 2024

mjmac
Contributor

@mjmac mjmac commented Dec 12, 2024

jolivier23 and others added 3 commits December 11, 2024 14:10
Required-githooks: true

Change-Id: Ifd3f793661ea9f64aa47162a791b17b4987164ba
Signed-off-by: Jeff Olivier <[email protected]>
* DAOS-16702 rebuild: restart rebuild for a massive failure case

In a special massive-failure case:
1. Some engines go down and trigger a rebuild.
2. One engine that is participating in the rebuild also goes down before
   the rebuild finishes; the number of failures then exceeds the pool RF,
   so the pool map is not changed.
3. An administrator restarts that engine.

In that case the rebuild task should be recovered on that engine. To keep it
simple, for now the global rebuild task is just aborted and retried.
The typical recovery approach, which restarts the whole system including
the PS leader, does not have this issue.

Signed-off-by: Xuezhao Liu <[email protected]>
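
The three-step scenario in the commit message above can be sketched as a small state model. This is an illustrative sketch only, not DAOS code: `PoolState`, `on_engine_down`, and `on_engine_restart` are hypothetical names, and the RF check is reduced to a simple count of failed ranks.

```python
from dataclasses import dataclass, field


@dataclass
class PoolState:
    """Hypothetical, simplified view of a pool; not a DAOS data structure."""
    redundancy_factor: int                              # pool RF
    excluded: set = field(default_factory=set)          # ranks excluded in the pool map
    dead_unexcluded: set = field(default_factory=set)   # down, but left in the pool map
    rebuild_running: bool = False
    rebuild_restarts: int = 0


def on_engine_down(pool, rank):
    """Exclude the rank only while the failure count stays within RF."""
    failures = len(pool.excluded) + len(pool.dead_unexcluded) + 1
    if failures > pool.redundancy_factor:
        # Step 2: too many failures, so the pool map is left unchanged.
        pool.dead_unexcluded.add(rank)
        return
    # Step 1: the exclusion triggers a rebuild.
    pool.excluded.add(rank)
    pool.rebuild_running = True


def on_engine_restart(pool, rank):
    """Step 3: the rank returns without ever having left the pool map."""
    if rank not in pool.dead_unexcluded:
        return
    pool.dead_unexcluded.discard(rank)
    if pool.rebuild_running:
        # Simplification from the commit message: abort and retry the
        # global rebuild rather than recover the task on this one engine.
        pool.rebuild_restarts += 1


pool = PoolState(redundancy_factor=1)
on_engine_down(pool, 1)      # within RF: excluded, rebuild starts
on_engine_down(pool, 2)      # exceeds RF: pool map unchanged
on_engine_restart(pool, 2)   # administrator restart: rebuild is retried
print(pool.rebuild_restarts)
```

The point of the model is that the restarted rank never left the pool map, so no map change exists to drive recovery; retrying the whole rebuild sidesteps per-engine recovery.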
#14436)

Allow the enabled and disabled ranks options to be used simultaneously (DAOS-10250).
Update and add cmocka unit tests of engine management related functions (DAOS-10253).

Fix memory leaks of ranks string in function ds_mgmt_drpc_pool_query().

Required-githooks: true

Signed-off-by: Cedric Koch-Hofer <[email protected]>

Errors are: component not formatted correctly, Ticket number prefix incorrect, PR title is malformatted. See https://daosio.atlassian.net/wiki/spaces/DC/pages/11133911069/Commit+Comments, Unable to load ticket data
https://daosio.atlassian.net/browse/mjmac/dead

@mjmac mjmac requested a review from jolivier23 December 12, 2024 21:27
knard-intel and others added 4 commits December 12, 2024 21:36
Always display the disabled targets and remove the old associated options.

Required-githooks: true

Signed-off-by: Cedric Koch-Hofer <[email protected]>
Fix regression on pool/list_verbose.py functional test introduced with
DAOS-14419.

Signed-off-by: Cedric Koch-Hofer <[email protected]>
After significant failures, the system may leave behind some suspect
engines: engines that the SWIM protocol marked as DEAD but that, to
prevent data loss, were not excluded from the system. An administrator
can bring these ranks back online by restarting them.

This PR provides an administrative interface for querying
suspect engines following a massive failure. These suspect engines
can be listed with the --health-only option of the daos/dmg pool query command.

Example output of dmg pool query --health-only:

Pool 6f450a68-8c7d-4da9-8900-02691650f6a2, ntarget=8, disabled=2, leader=3, version=4, state=Degraded
Pool health info:
- Disabled ranks: 1
- Suspect ranks: 2
- Rebuild busy, 0 objs, 0 recs
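
For scripting against this output, a minimal parsing sketch follows. It assumes the text layout is exactly as in the sample above; `parse_pool_health` is a hypothetical helper, not part of dmg. Rank values are kept as raw strings because their general format is not shown here, and a later commit in this PR changes the name of the suspect status, so the label to match may differ.

```python
import re


def parse_pool_health(text):
    """Parse the sample text layout into a plain dict (hypothetical helper)."""
    lines = [ln.strip() for ln in text.strip().splitlines() if ln.strip()]
    header = lines[0]
    # The second token of the header is the pool UUID, followed by a comma.
    info = {"uuid": header.split()[1].rstrip(",")}
    # Header fields look like "key=value", separated by commas.
    for key, value in re.findall(r"(\w+)=(\w+)", header):
        info[key] = int(value) if value.isdigit() else value
    # Health entries look like "- Label: value"; lines without a colon
    # (such as the rebuild status line) are skipped.
    for ln in lines[1:]:
        m = re.match(r"-\s*([^:]+):\s*(.*)", ln)
        if m:
            label = m.group(1).strip().lower().replace(" ", "_")
            info[label] = m.group(2).strip()
    return info


sample = """\
Pool 6f450a68-8c7d-4da9-8900-02691650f6a2, ntarget=8, disabled=2, leader=3, version=4, state=Degraded
Pool health info:
- Disabled ranks: 1
- Suspect ranks: 2
- Rebuild busy, 0 objs, 0 recs
"""

health = parse_pool_health(sample)
print(health["state"], health["disabled_ranks"], health["suspect_ranks"])
```

Where a structured output mode is available, it is likely a sturdier basis for automation than parsing human-readable text.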

Required-githooks: true

Signed-off-by: Wang Shilong <[email protected]>
Signed-off-by: Phil Henderson <[email protected]>
Co-authored-by: Phil Henderson <[email protected]>
Change the name to more closely reflect the underlying
SWIM status, and reduce user confusion. An engine that
has been marked DEAD by SWIM cannot participate in pool
services, and has most likely already SIGKILL-ed itself.

Allow-unstable-test: true
Features: pool
Required-githooks: true
Signed-off-by: Michael MacDonald <[email protected]>
@daosbuild1
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15609/2/execution/node/1211/log

@jolivier23 jolivier23 merged commit 8088155 into google/2.6 Dec 13, 2024
48 of 51 checks passed
@jolivier23 jolivier23 deleted the mjmac/dead_ranks branch December 13, 2024 04:53
6 participants