DAOS-16702 rebuild: restart rebuild for a massive failure case #15343
Ticket title is 'Rebuilding cannot be completed after restarting ranks in cases of massive failures.'
5e68d53 to 967b84e
Test stage Functional Hardware Medium Verbs Provider completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15343/2/execution/node/1397/log
f76cdb6 to 01c8b62
01c8b62 to a952a14
I ran the test locally; somehow the rebuild finished, but with 2007 errors, and it retries forever.
On my side, I tested that DER_STALE is retried and the rebuild finishes. I'll check the details with you offline.
See logs here: dmg pool query timed out after restarting the rank, and the rebuild did not finish either.
I ran several tests on wolf with similar steps and was able to reproduce a rebuild timeout issue.
In a special massive-failure case:
1. Some engines go down and trigger a rebuild.
2. One engine that is participating in the rebuild goes down before the rebuild finishes; the number of failures now exceeds the pool RF, so the pool map is not changed.
3. The administrator restarts that engine.
In that case the rebuild task on the restarted engine should be recovered; to keep it simple, for now we just abort and retry the global rebuild task.
The issue does not occur with the typical recovery approach, which restarts the whole system including the PS leader.
Skip-nlt: true
Signed-off-by: Xuezhao Liu <[email protected]>
Skip-nlt: true Signed-off-by: Xuezhao Liu <[email protected]>
refine handle_event. Skip-nlt: true Signed-off-by: Xuezhao Liu <[email protected]>
Skip-nlt: true Signed-off-by: Xuezhao Liu <[email protected]>
0a34490 to 6e5935d
Pushed a new commit to resolve a race condition in which the rpt was not stopped in time when a new rebuild is restarted.
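For readers unfamiliar with this kind of race, here is a minimal Python sketch of one common way to avoid it: signal the old tracker to abort and wait until it has fully stopped before starting the next rebuild generation. This is an illustration only, not the change made in this PR. `Tracker`, `abort_and_wait`, and `restart_rebuild` are hypothetical names and do not correspond to DAOS code.

```python
import threading
import time


class Tracker:
    """Hypothetical stand-in for a per-pool rebuild tracker (the 'rpt')."""

    def __init__(self, generation):
        self.generation = generation
        self._abort = threading.Event()
        self._stopped = threading.Event()
        self._thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self._thread.start()

    def _run(self):
        # Keep "rebuilding" until an abort is requested.
        while not self._abort.is_set():
            time.sleep(0.01)  # placeholder for one unit of rebuild work
        self._stopped.set()

    def abort_and_wait(self, timeout=5.0):
        """Request an abort and block until the worker has really stopped."""
        self._abort.set()
        return self._stopped.wait(timeout)

    @property
    def stopped(self):
        return self._stopped.is_set()


def restart_rebuild(old):
    """Start the next generation only after the old tracker has stopped."""
    if not old.abort_and_wait():
        raise TimeoutError("old tracker did not stop in time")
    new = Tracker(old.generation + 1)
    new.start()
    return new
```

The point of the sketch is the ordering: the new tracker is created only after the old one has confirmed that it stopped, so two generations never run at the same time.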
Thanks for the fix.
* DAOS-16702 rebuild: restart rebuild for a massive failure case
In a special massive-failure case:
1. Some engines go down and trigger a rebuild.
2. One engine that is participating in the rebuild goes down before the rebuild finishes; the number of failures now exceeds the pool RF, so the pool map is not changed.
3. The administrator restarts that engine.
In that case the rebuild task on the restarted engine should be recovered; to keep it simple, for now we just abort and retry the global rebuild task.
The issue does not occur with the typical recovery approach, which restarts the whole system including the PS leader.
Signed-off-by: Xuezhao Liu <[email protected]>
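In a special massive-failure case:
1. Some engines go down and trigger a rebuild.
2. One engine that is participating in the rebuild goes down before the rebuild finishes; the number of failures now exceeds the pool RF, so the pool map is not changed.
3. The administrator restarts that engine.
In that case the rebuild task on the restarted engine should be recovered; to keep it simple, for now we just abort and retry the global rebuild task.
The issue does not occur with the typical recovery approach, which restarts the whole system including the PS leader.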
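The scenario and the abort-and-retry decision can be sketched as a small Python model. This is a simplified illustration under stated assumptions, not the DAOS implementation: `PoolModel`, `RebuildTask`, `rank_down`, and `rank_restarted` are hypothetical names, and the model assumes a failure is recorded in the pool map only while the number of recorded failures is below RF.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class RebuildTask:
    participants: Set[int]  # ranks taking part in this rebuild
    generation: int = 0     # bumped on every abort-and-retry


@dataclass
class PoolModel:
    rf: int                                                  # pool redundancy factor
    all_ranks: Set[int]
    excluded: Set[int] = field(default_factory=set)          # failures recorded in the pool map
    unrecorded_down: Set[int] = field(default_factory=set)   # failures beyond RF
    rebuild: Optional[RebuildTask] = None

    def rank_down(self, rank: int) -> None:
        if len(self.excluded) < self.rf:
            # Within RF: the pool map changes and a rebuild is scheduled.
            self.excluded.add(rank)
            self.rebuild = RebuildTask(self.all_ranks - self.excluded)
        else:
            # Beyond RF: the pool map is left unchanged.
            self.unrecorded_down.add(rank)

    def rank_restarted(self, rank: int) -> bool:
        """Return True if the global rebuild was aborted and retried."""
        if rank not in self.unrecorded_down:
            return False
        self.unrecorded_down.discard(rank)
        task = self.rebuild
        if task is None or rank not in task.participants:
            return False
        # The restarted engine lost its local rebuild state, so the whole
        # task is restarted instead of being recovered on that engine.
        self.rebuild = RebuildTask(task.participants, task.generation + 1)
        return True
```

For example, with `rf=1` and ranks 0 to 3, taking rank 3 down starts a rebuild on ranks 0, 1, and 2; taking rank 2 down next leaves the pool map unchanged; restarting rank 2 then bumps the rebuild generation.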