mjmac/dead ranks #15609
mjmac
commented
Dec 12, 2024
- Partial backport of debug macro patch
- DAOS-16702 rebuild: restart rebuild for a massive failure case (#15343)
- DAOS-10250 control: Get enabled and disabled ranks with dmg pool query (#14436)
- DAOS-14419 control: Display disabled ranks by default (#15112)
- DAOS-16669 test: fix pool list ftest (#15373)
- DAOS-16477 mgmt: return suspect engines for pool healthy query (#15458)
- DAOS-16477 pool: Rename Suspect state to Dead
Required-githooks: true Change-Id: Ifd3f793661ea9f64aa47162a791b17b4987164ba Signed-off-by: Jeff Olivier <[email protected]>
* DAOS-16702 rebuild: restart rebuild for a massive failure case
In a special massive-failure case:
1. Some engines go down and trigger a rebuild.
2. One engine that is participating in the rebuild goes down again before the rebuild finishes; the number of failures now exceeds the pool RF, so the pool map is not changed.
3. The administrator restarts that engine.
In this case the rebuild task on that engine should be recovered; to keep things simple, the global rebuild task is for now aborted and retried. The issue does not arise with the typical recovery approach of restarting the whole system, including the PS leader.
Signed-off-by: Xuezhao Liu <[email protected]>
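The scenario in the commit message above can be modeled as a small decision function. This is only an illustrative sketch in Python, not the DAOS implementation (which is in C): `RebuildState`, `should_restart_rebuild`, and all of their parameters are hypothetical names introduced here, and the condition simply encodes the three steps as described.

```python
from dataclasses import dataclass


@dataclass
class RebuildState:
    """Hypothetical snapshot of a pool's global rebuild task."""
    in_progress: bool        # a global rebuild task is currently running
    participants: set[int]   # ranks taking part in that rebuild


def should_restart_rebuild(state, restarted_rank, failed_ranks, pool_rf):
    """Return True if the global rebuild should be aborted and retried.

    Assumed model of the commit message: a rank that was still rebuilding
    went down, the failure count exceeded the pool RF (so the pool map was
    left unchanged), and the administrator has now restarted that rank.
    """
    if not state.in_progress:
        return False  # nothing to abort
    if restarted_rank not in state.participants:
        return False  # the restarted rank was not part of the rebuild
    # More failures than the redundancy factor: the pool map did not change,
    # so the rebuild cannot pick up the restarted rank on its own.
    return len(failed_ranks) > pool_rf


state = RebuildState(in_progress=True, participants={1, 2, 3})
print(should_restart_rebuild(state, restarted_rank=2, failed_ranks={0, 2}, pool_rf=1))
```

Aborting and retrying the whole task, rather than recovering per-engine state, mirrors the simplification the commit message describes.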
#14436) Allow the enabled and disabled ranks options to be used simultaneously (DAOS-10250). Update and add cmocka unit tests of engine management related functions (DAOS-10253). Fix memory leaks of the ranks string in function ds_mgmt_drpc_pool_query(). Required-githooks: true Signed-off-by: Cedric Koch-Hofer <[email protected]>
Errors are component not formatted correctly, Ticket number prefix incorrect, PR title is malformatted. See https://daosio.atlassian.net/wiki/spaces/DC/pages/11133911069/Commit+Comments, Unable to load ticket data
Always display the disabled targets and remove the old associated options. Required-githooks: true Signed-off-by: Cedric Koch-Hofer <[email protected]>
Fix regression on pool/list_verbose.py functional test introduced with DAOS-14419. Signed-off-by: Cedric Koch-Hofer <[email protected]>
After significant failures, the system may leave behind some suspect engines that were marked as DEAD by the SWIM protocol, but were not excluded from the system to prevent data loss. An administrator can bring these ranks back online by restarting them. This PR aims to provide an administrative interface for querying suspect engines following a massive failure. These suspect engines can be retrieved using the daos/dmg --health-only command.

An example of the output of dmg pool query --health-only:

    Pool 6f450a68-8c7d-4da9-8900-02691650f6a2, ntarget=8, disabled=2, leader=3, version=4, state=Degraded
    Pool health info:
    - Disabled ranks: 1
    - Suspect ranks: 2
    - Rebuild busy, 0 objs, 0 recs

Required-githooks: true
Signed-off-by: Wang Shilong <[email protected]>
Signed-off-by: Phil Henderson <[email protected]>
Co-authored-by: Phil Henderson <[email protected]>
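For scripting against this output, a minimal parsing sketch in Python follows. It assumes the layout of the example output above and that ranks are printed as comma-separated integers; `parse_health_ranks` is a hypothetical helper, not part of DAOS, and it does not handle rank ranges or any other format the real tool may emit.

```python
import re

# Matches e.g. "Disabled ranks: 1" or "Suspect ranks: 2,5".
# Assumption: ranks are printed as comma-separated integers.
_RANKS_RE = re.compile(r"(Disabled|Suspect) ranks:\s*(\d+(?:\s*,\s*\d+)*)")


def parse_health_ranks(output: str) -> dict:
    """Return {"disabled": [...], "suspect": [...]} from health-only text."""
    ranks = {"disabled": [], "suspect": []}
    for label, rank_list in _RANKS_RE.findall(output):
        # int() tolerates surrounding whitespace, so "2, 5" also parses.
        ranks[label.lower()].extend(int(r) for r in rank_list.split(","))
    return ranks


sample = """Pool 6f450a68-8c7d-4da9-8900-02691650f6a2, ntarget=8, disabled=2, leader=3, version=4, state=Degraded
Pool health info:
- Disabled ranks: 1
- Suspect ranks: 2
- Rebuild busy, 0 objs, 0 recs"""

print(parse_health_ranks(sample))  # {'disabled': [1], 'suspect': [2]}
```

Where the tool offers structured output, parsing that is likely more robust than matching human-readable text; the regular expression here is tied to the labels shown in the example.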
Change the name to more closely reflect the underlying SWIM status, and reduce user confusion. An engine that has been marked DEAD by SWIM cannot participate in pool services, and has most likely already SIGKILL-ed itself. Allow-unstable-test: true Features: pool Required-githooks: true Signed-off-by: Michael MacDonald <[email protected]>
2f560a5 to c57c674
Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15609/2/execution/node/1211/log