DAOS-16702 rebuild: restart rebuild for a massive failure case #15343

liuxuezhao · 2024-10-18T10:46:13Z

In a special massive-failure case:

1. Some engines go down and trigger a rebuild.
2. One engine that participates in the rebuild goes down again before the rebuild finishes; the number of failures now exceeds the pool RF, so the pool map is not changed.
3. That engine is restarted by the administrator.

In that case the rebuild task should be recovered on the restarted engine; to keep things simple for now, just abort and retry the global rebuild task.
The issue does not occur with the typical recovery approach of restarting the whole system, including the PS leader.
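The abort-and-retry decision described above can be modeled roughly as follows. This is a minimal Python sketch, not DAOS code: `RebuildTask`, `on_engine_restarted`, and the `generation` counter are hypothetical names chosen for illustration, and the sketch assumes the leader tracks which ranks participate in the current rebuild.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RebuildTask:
    """Hypothetical model of the global rebuild task tracked by the leader."""
    generation: int          # assumed counter, bumped on every retry
    participants: frozenset  # ranks taking part in this rebuild
    finished: bool = False


def on_engine_restarted(task: RebuildTask, rank: int) -> RebuildTask:
    """Return the task that should run after `rank` has been restarted.

    If the restarted rank took part in an unfinished rebuild, its local
    rebuild state is assumed lost, so the whole task is aborted and retried
    under a new generation instead of being recovered on that one engine.
    """
    if task.finished or rank not in task.participants:
        # Nothing to redo: the rebuild is done or the rank was not involved.
        return task
    # Abort and retry globally with the same participants.
    return RebuildTask(generation=task.generation + 1,
                       participants=task.participants)
```

For example, restarting rank 1 while a generation-1 rebuild over ranks {0, 1, 2} is still running yields a generation-2 task, while restarting an unrelated rank leaves the task untouched.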

Before requesting gatekeeper:

Two review approvals and any prior change requests have been resolved.
Testing is complete and all tests passed or there is a reason documented in the PR why it should be force landed and forced-landing tag is set.
Features: (or Test-tag*) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.
Commit messages follow the guidelines outlined here.
Any tests skipped by the ticket being addressed have been run and passed in the PR.

Gatekeeper:

github-actions · 2024-10-18T10:46:30Z

Ticket title is 'Rebuilding cannot be completed after restarting ranks in cases of massive failures.'
Status is 'In Progress'
https://daosio.atlassian.net/browse/DAOS-16702

daosbuild1 · 2024-10-19T00:19:19Z

Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15343/2/execution/node/1397/log

src/pool/srv_pool.c

wangshilong

I ran the test locally; somehow the rebuild finished, but with 2007 errors, and it retries forever.

liuxuezhao · 2024-10-21T08:03:14Z

I ran the test locally; somehow the rebuild finished, but with 2007 errors, and it retries forever.

On my side, I tested that DER_STALE will retry and finish. I'll check the details with you offline.

wangshilong · 2024-10-21T13:51:30Z

I ran the test locally; somehow the rebuild finished, but with 2007 errors, and it retries forever.

On my side, I tested that DER_STALE will retry and finish. I'll check the details with you offline.

See logs here:

https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15354/1/artifact/Functional%20on%20EL%208.8/control/dmg_pool_query_ranks.py/job.log/*view*/

dmg pool query timed out after restarting the rank, and the rebuild did not finish either.

liuxuezhao · 2024-10-21T17:07:57Z

https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15354/1/artifact/Functional%20on%20EL%208.8/control/dmg_pool_query_ranks.py/job.log/*view*/

dmg pool query timed out after restarting the rank, and the rebuild did not finish either.

I ran several tests on wolf with similar steps and was able to reproduce a rebuild timeout issue.
The problem is that the ALIVE event comes from CRT_EVS_GRPMOD rather than CRT_EVS_SWIM, so CRT_EVS_GRPMOD events cannot be ignored.
I changed handle_event a little bit; @liw, please check whether it looks good to you.
@wangshilong, please retest with the new version. Thanks.
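The event-source point above can be illustrated with a small filter. This is a Python sketch under assumptions, not the actual handle_event change: the enums only mirror the CaRT names mentioned in this thread, `should_handle` is a hypothetical helper, and the rule that non-ALIVE events are acted on only when they come from SWIM is an assumption made for illustration.

```python
from enum import Enum


class EventSource(Enum):
    """Stand-ins for the CaRT event sources named in this thread."""
    CRT_EVS_SWIM = "swim"
    CRT_EVS_GRPMOD = "grpmod"


class EventType(Enum):
    """Stand-ins for CaRT event types; CRT_EVT_DEAD is used as the non-ALIVE case."""
    CRT_EVT_ALIVE = "alive"
    CRT_EVT_DEAD = "dead"


def should_handle(src: EventSource, evt_type: EventType) -> bool:
    """Decide whether an event should be acted on (hypothetical helper)."""
    if evt_type is EventType.CRT_EVT_ALIVE:
        # ALIVE may be reported by GRPMOD rather than SWIM, so a filter
        # that drops every GRPMOD event would miss the restarted rank.
        return src in (EventSource.CRT_EVS_SWIM, EventSource.CRT_EVS_GRPMOD)
    # Assumption for this sketch: other events are taken only from SWIM.
    return src is EventSource.CRT_EVS_SWIM
```

With this filter an ALIVE event is accepted from either source, which is the behavior the retest is meant to confirm.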

liuxuezhao · 2024-10-24T09:20:07Z

Pushed a new commit to resolve a race condition where the rpt is not stopped in time when a new rebuild is restarted.
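One way to picture the fix for that race is to key the tracker lookup on the rebuild generation, so a request for a new generation never reuses a tracker that is still shutting down. The Python sketch below is only a model: `Tracker`, `TrackerTable`, and `get_or_start` are hypothetical names, and rejecting requests from an older generation is an added assumption.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Tracker:
    """Hypothetical stand-in for a per-pool rebuild tracker (rpt)."""
    rebuild_gen: int
    aborted: bool = False


class TrackerTable:
    """Keeps at most one live tracker per pool."""

    def __init__(self) -> None:
        self._by_pool: Dict[str, Tracker] = {}

    def get_or_start(self, pool_id: str, rebuild_gen: int) -> Optional[Tracker]:
        current = self._by_pool.get(pool_id)
        if current is not None:
            if rebuild_gen < current.rebuild_gen:
                # Assumption: a request from an older generation is stale.
                return None
            if rebuild_gen == current.rebuild_gen:
                # Same rebuild: keep using the existing tracker.
                return current
            # Newer generation: flag the old tracker as aborted and do not
            # wait for it to finish stopping before starting a fresh one.
            current.aborted = True
        fresh = Tracker(rebuild_gen)
        self._by_pool[pool_id] = fresh
        return fresh
```

In this model a generation-2 request arriving while the generation-1 tracker is still winding down gets its own tracker, and the old one is only marked as aborted.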

wangshilong

Thanks for the fix.

* DAOS-16702 rebuild: restart rebuild for a massive failure case

  In special massive failure case -
  1. some engines down and triggered rebuild.
  2. one engine participated the rebuild, not finished yet, it down again, the #failures exceeds pool RF and will not change pool map.
  3. That engine restarted by administrator.

  In that case should recover the rebuild task on the engine, to simplify it now just abort and retry the global rebuild task.
  No such issue by the typical recover approach that restart the whole system including the PS leader.

  Signed-off-by: Xuezhao Liu <[email protected]>

liuxuezhao marked this pull request as ready for review October 18, 2024 10:55

liuxuezhao requested review from a team as code owners October 18, 2024 10:55

liuxuezhao removed request for a team October 18, 2024 10:55

liuxuezhao force-pushed the lxz/massive_rb branch from 5e68d53 to 967b84e Compare October 18, 2024 13:30

liuxuezhao requested review from wangshilong and wangdi1 October 18, 2024 13:31

liuxuezhao requested a review from liw October 21, 2024 01:03

liw requested changes Oct 21, 2024

src/pool/srv_pool.c (outdated, resolved)

src/pool/srv_pool.c (outdated, resolved)

liuxuezhao requested a review from liw October 21, 2024 01:57

liuxuezhao force-pushed the lxz/massive_rb branch from f76cdb6 to 01c8b62 Compare October 21, 2024 01:58

liw reviewed Oct 21, 2024

src/pool/srv_pool.c (outdated, resolved)

src/pool/srv_pool.c (outdated, resolved)

liuxuezhao force-pushed the lxz/massive_rb branch from 01c8b62 to a952a14 Compare October 21, 2024 02:05

liuxuezhao requested a review from liw October 21, 2024 02:07

liw previously approved these changes Oct 21, 2024

wangshilong reviewed Oct 21, 2024

src/pool/srv_pool.c (resolved)

wangshilong reviewed Oct 21, 2024

liuxuezhao dismissed liw’s stale review via 0a34490 October 21, 2024 17:05

liuxuezhao requested review from liw and wangshilong October 23, 2024 01:13

liuxuezhao added 4 commits October 24, 2024 09:18

DAOS-16702 rebuild: address comment

b99edd8

Skip-nlt: true Signed-off-by: Xuezhao Liu <[email protected]>

DAOS-16702 rebuild: CRT_EVT_ALIVE possibly from CRT_EVS_GRPMOD

1d2de7d

refine handle_event. Skip-nlt: true Signed-off-by: Xuezhao Liu <[email protected]>

DAOS-16702 rebuild: start new rpt for new rebuild_gen

6e5935d

Skip-nlt: true Signed-off-by: Xuezhao Liu <[email protected]>

liuxuezhao force-pushed the lxz/massive_rb branch from 0a34490 to 6e5935d Compare October 24, 2024 09:18

wangshilong approved these changes Oct 25, 2024

liw approved these changes Oct 25, 2024

liuxuezhao requested review from a team and gnailzenh October 25, 2024 08:11

gnailzenh merged commit 50128bd into master Oct 29, 2024
53 of 54 checks passed

gnailzenh deleted the lxz/massive_rb branch October 29, 2024 15:30

mjmac mentioned this pull request Nov 13, 2024

mjmac/DAOS 16787 google 2.6 #15498

Closed

mjmac mentioned this pull request Dec 12, 2024

mjmac/dead ranks #15609

Merged
