Recovery status reported for any status other than COMPLETE #41

Closed

ofaaland
Contributor

In a multi-MDT system, under Lustre 2.10, all MDTs connect to each other via exports/imports. If one MDT cannot connect to one or more other MDTs, it will not service requests and will refuse connections from clients. There are other target states which may indicate action is required by an admin. These states are reflected in the "recovery_status" procfile exported by Lustre targets.

However, in LMT 3.2.7 and some earlier releases, MDTs in such a state were not reported as such in ltop, because ltop checked for "RECOV" in the status field, indicating recovery, but did not check for the strings corresponding to any other states.

According to lprocfs_recovery_status_seq_show() in Lustre 2.13,
valid "status" values in recovery_status are (roughly):

COMPLETE             The target is active and handling requests
WAITING              The target is active but waiting for another MDT
WAITING_FOR_CLIENTS  The target is active but no clients have connected
RECOVERY             The target is active and recovering after failover
INACTIVE             The target is inactive
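
As a rough illustration of how a monitoring tool might read this value, here is a minimal Python sketch. It is not LMT code. It assumes the procfile text consists of "key: value" lines with one "status:" line, and the helper name parse_recovery_status is hypothetical.

```python
# Hypothetical helper, not part of LMT or Lustre.
# Assumption: recovery_status text is made of "key: value" lines,
# one of which has the key "status".

KNOWN_STATES = {
    "COMPLETE",
    "WAITING",
    "WAITING_FOR_CLIENTS",
    "RECOVERY",
    "INACTIVE",
}


def parse_recovery_status(text):
    """Return the value of the "status" line, or None if there is none."""
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "status":
            return value.strip()
    return None


def is_known_state(status):
    """True if status is one of the states listed above."""
    return status in KNOWN_STATES
```

A caller would pass in the text read from a target's recovery_status file and compare the result against the states listed above.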

For individual targets, for all states other than COMPLETE, display the
recov_status field instead of metric values. This makes it easier for
the admin to see unhealthy targets.

At the top of the window, report the lowest-numbered MDT which is not
COMPLETE or INACTIVE. If an MDT is INACTIVE, it was set that way by
an admin and she likely already knows - but other states may not be
expected and should be brought to her attention.
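
The selection rule for the top-of-window report can be sketched as follows. This is a minimal Python illustration under the assumption that statuses are available as a mapping from MDT index to status string. The function name first_mdt_needing_attention is hypothetical and does not come from LMT.

```python
# Hypothetical sketch of the summary rule; not the ltop implementation.

def first_mdt_needing_attention(statuses):
    """Return (index, status) for the lowest-numbered MDT whose status
    is neither COMPLETE nor INACTIVE, or None if there is no such MDT.

    statuses: dict mapping MDT index (int) to status string.
    """
    for index in sorted(statuses):
        if statuses[index] not in ("COMPLETE", "INACTIVE"):
            return index, statuses[index]
    return None
```

INACTIVE is skipped on purpose: an admin set that state deliberately, so it does not need to be called out again.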

Signed-off-by: Olaf Faaland <[email protected]>

ofaaland commented Nov 8, 2019

@morrone or @tonyhutter , can you take a look? Thanks!

Member

@morrone morrone left a comment

Looks reasonable to me.

@ofaaland

Cherry-picked to master at 7c7266e

@ofaaland ofaaland closed this Nov 11, 2019
@ofaaland ofaaland deleted the b-recovery-status-not-complete branch November 11, 2019 19:10