Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

DAOS-16477 pool: Rename Suspect state to Dead #15584

Merged
merged 2 commits into from
Dec 13, 2024
Merged

Conversation

mjmac
Copy link
Contributor

@mjmac mjmac commented Dec 11, 2024

Change the name to more closely reflect the underlying
SWIM status, and reduce user confusion. An engine that
has been marked DEAD by SWIM cannot participate in pool
services, and has most likely already SIGKILL-ed itself.

Features: pool
Required-githooks: true

Signed-off-by: Michael MacDonald [email protected]

Change the name to more closely reflect the underlying
SWIM status, and reduce user confusion. An engine that
has been marked DEAD by SWIM cannot participate in pool
services, and has most likely already SIGKILL-ed itself.

Features: pool
Required-githooks: true

Signed-off-by: Michael MacDonald <[email protected]>
@mjmac mjmac requested review from a team as code owners December 11, 2024 01:57
Copy link

Ticket title is 'Provide admin interface to query hanging engines after massive failure'
Status is 'Awaiting backport'
Job should run at elevated priority (1)
https://daosio.atlassian.net/browse/DAOS-16477

@github-actions github-actions bot added the priority Ticket has high priority (automatically managed) label Dec 11, 2024
@@ -162,8 +162,8 @@ enum daos_pool_info_bit {
DPI_ENGINES_ENABLED = 1ULL << 2,
/** true to include (in \a ranks) engines with some or all targets disabled (down). */
DPI_ENGINES_DISABLED = 1ULL << 3,
/** true to include (in \a ranks) suspect engines. */
DPI_ENGINES_SUSPECT = 1ULL << 4,
/** true to include (in \a ranks) engines marked DEAD by SWIM. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[Nit; not requesting changes] Maybe this degree of inaccuracy is acceptable, but I just want to clarify that these "dead" engines are actually the set of pool engines who are not "alive". In other words, they may also include engines absent from the SWIM membership.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if that is the case then maybe suspect is a more accurate term?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@liw: When would we have engines in the pool map that are absent from the SWIM membership, but aren't dead?

Copy link
Contributor

@liw liw Dec 11, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When the MS has excluded the engines from the system (and the new system map has been distributed out), but the PS hasn't excluded the engines from the pool. In this state, from the pool's point of view, these engines are not strictly SWIM-DEAD, just unavailable---loosely speaking, "dead". (I think we should generally not differentiate between these two kinds of unavailability, SWIM-DEAD versus absent.)

Hmm, personally I don't think "suspect" is more accurate. :)

@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15584/1/execution/node/1210/log

Copy link
Contributor

@daltonbohning daltonbohning left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ftest LGTM

@@ -162,8 +162,8 @@ enum daos_pool_info_bit {
DPI_ENGINES_ENABLED = 1ULL << 2,
/** true to include (in \a ranks) engines with some or all targets disabled (down). */
DPI_ENGINES_DISABLED = 1ULL << 3,
/** true to include (in \a ranks) suspect engines. */
DPI_ENGINES_SUSPECT = 1ULL << 4,
/** true to include (in \a ranks) engines marked DEAD by SWIM. */
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if that is the case then maybe suspect is a more accurate term?

@mjmac
Copy link
Contributor Author

mjmac commented Dec 11, 2024

Test failure appears to be https://daosio.atlassian.net/issues/DAOS-16829, which was introduced with the PR that added the suspect state. I could re-run without Features: pool, but probably best to wait until the fix is merged and then re-run after merging.

Features: pool
Allow-unstable-test: true
Required-githooks: true
@mjmac
Copy link
Contributor Author

mjmac commented Dec 12, 2024

I've merged master and re-pushed with the Allow-unstable-test: true pragma set.

@daosbuild1
Copy link
Collaborator

Test stage Functional on EL 8.8 completed with status FAILURE. https://build.hpdd.intel.com//job/daos-stack/job/daos/view/change-requests/job/PR-15584/2/execution/node/1210/log

@daosbuild1
Copy link
Collaborator

Test stage Functional Hardware Medium completed with status UNSTABLE. https://build.hpdd.intel.com/job/daos-stack/job/daos//view/change-requests/job/PR-15584/2/testReport/

@mjmac
Copy link
Contributor Author

mjmac commented Dec 13, 2024

Two known failures:

@jolivier23 jolivier23 merged commit 29ad64d into master Dec 13, 2024
50 of 53 checks passed
@jolivier23 jolivier23 deleted the mjmac/DAOS-16477-dead branch December 13, 2024 05:01
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
priority Ticket has high priority (automatically managed)
Development

Successfully merging this pull request may close these issues.

6 participants