-
Notifications
You must be signed in to change notification settings - Fork 302
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
DAOS-16329 chk: maintenance mode after checking pool with dryrun - b26 #14985
Conversation
Ticket title is 'Maintenance mode after CR checking the pool with dryrun option' |
Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14985/1/testReport/ |
Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14985/1/testReport/ |
f7a7ee1
to
792a624
Compare
Test stage Unit Test on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14985/2/testReport/ |
Test stage Unit Test with memcheck on EL 8.8 completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-14985/2/testReport/ |
792a624
to
055d8bd
Compare
@@ -209,7 +209,7 @@ def subtract(self, val): | |||
"DER_NOTLEADER(-2008): 'Not service leader'") | |||
|
|||
# Functions that are never reported as errors. | |||
IGNORED_FUNCTIONS = ('sched_watchdog_post', 'rdb_timerd') | |||
IGNORED_FUNCTIONS = ('sched_watchdog_post', 'rdb_timerd', 'cont_open') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are we ignoring cont_open here instead of fixing the errors? And what are the errors we're ignoring?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is some temporary workaround for the following NLT warning:
https://build.hpdd.intel.com/job/daos-stack/job/daos/view/change-requests/job/PR-14912/10/VM_test/folder.-412761578/
cont_open() 07139793/8d9ab078/6908b833: No permission to open the container with flags 2, capas 7/0 warning in strict mode
I do not know how the new warning is valuable for us.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not know how to fix such NLT warning because even if I modified some existing logic line in cont_open
, the CI test will also report similar NLT warning. @daltonbohning , any suggestion for that? Thanks!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should not disable the test because something introduced a new warning.
This PR adds the line that it's complaining about:
daos/src/container/srv_container.c
Lines 2292 to 2301 in 055d8bd
if (((flags & DAOS_COO_EX) && !ds_sec_cont_can_open_ex(sec_capas)) || | |
((flags & DAOS_COO_RW) && !ds_sec_cont_can_modify(sec_capas))) { | |
D_ERROR(DF_UUID "/" DF_UUID "/" DF_UUID ": permission denied opening the " | |
"container with flags " DF_X64 ", capas " DF_X64 "/" DF_X64 "\n", | |
DP_UUID(cont->c_svc->cs_pool_uuid), DP_UUID(pool_hdl->sph_uuid), | |
DP_UUID(cont->c_uuid), flags, pool_hdl->sph_sec_capas, sec_capas); | |
daos_prop_free(prop); | |
rc = -DER_NO_PERM; | |
goto out; | |
} |
I think the problem is it doesn't follow the standard format with the rc in the error. The same would apply for other error messages as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, that was the wrong code snippet. But I think the problem is the same
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
According to the commit edeafc1 that introduced the IGNORED_FUNCTIONS list, I did not find any special reason for adding "rdb_timerd" into such list except reducing unnecessary "fail" report.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In fact, such NLT warning for cont_open
log exists on current master, it is not reported as CI failure if no patch touch related code. But my patch touches it because I need more information in the log, then it failed the CI test.
From the test logic perspective (although I do not agree), it seems that such log line should not be there.
@daltonbohning , do you still think it is not an acceptable solution to "skip" the existing cont_open
NLT warning?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here is an example for the NLT warning existence:
https://build.hpdd.intel.com/job/daos-stack/job/daos/job/PR-15055/1/VM_test/
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not very familiar with NLT so I don't know what is best here. It just seems bad to disable a subset of testing because it's failing. @ashleypittman is the best source here but is on sabbatical.
If it's pre-existing, maybe this applies? Copied from our dev channel
Tips from Ashley:
the way Jenkins works is it checks if messages are the same, and obviously it can't just check the line number for this as that changes. It takes a snapshot of the 5 lines before and after the line where the error is reported, and if these are the same then the error is treated as the same. If these lines are modified then the error does not match and therefore you get "one removed error", "one error added"in which case the procedure is the same as for adding a new error, complete your PR as you want it, run with Allow-unstable-test: true commit tags, the PR will need force-landing but once it's landed master will run and create new reference builds so other PRs should not be affected.
I.e. add Allow-unstable-test: true
to your commit pragmas and request force landing, since it's a pre-existing error
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let me try.
Sometimes, after system shutdown unexpectedly, the users may expect to check their critical data under some kind of maintenance mode. Under such mode, no user data can be modified or moved or aggregated. That will guarantee no further potential (DAOS logic caused) damage can happen during the check. For such purpose, we will enhance current DAOS CR logic with --dryrun option to allow the pool (after check) to be opened as immutable with disabling some mechanism that may potentially cause data modification or movement (such as rebuild or aggregation). Under such mode, if client wants to connect to the pool, the read-only option must be specified. Similarly for opening container in such pool. Test-tag: pr cat_recov Allow-unstable-test: true Signed-off-by: Fan Yong <[email protected]>
055d8bd
to
d10aad6
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will reserve approval for those familiar with core code, I think we will need to force land for the existing NLT failure
Ping reviewers for the backport patch. Thanks! |
FYI - will need to wait for 2.6.1 GA, then merge approval for this ticket, then we can merge. |
Ticket does not have approval to merge to 2.6.2 |
So we will not land this patch to any 2.6 based branch any longer, right? |
It means we need to go through the merge approval process for this ticket/PR |
It is shown on the Jira ticket DAOS-16329 that |
To be clear, I posted this on Oct 8 and |
Sometimes, after system shutdown unexpectedly, the users may expect to check their critical data under some kind of maintenance mode. Under such mode, no user data can be modified or moved or aggregated. That will guarantee no further potential (DAOS logic caused) damage can happen during the check.
For such purpose, we will enhance current DAOS CR logic with --dryrun option to allow the pool (after check) to be opened as immutable with disabling some mechanism that may potentially cause data modification or movement (such as rebuild or aggregation).
Under such mode, if client wants to connect to the pool, the read-only option must be specified. Similarly for opening container in such pool.
Test-tag: pr cat_recov
Allow-unstable-test: true
Before requesting gatekeeper:
Features:
(orTest-tag*
) commit pragma was used or there is a reason documented that there are no appropriate tags for this PR.Gatekeeper: