"whereis" output can be wrong when the storage node is slow #1090
IMHO, this should be fixed. What do you think?
Since the additional error column is so lengthy, and outputting the error detail in an existing column looks weird, how about showing the error detail at the bottom, apart from the table, like below?
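A rough sketch of what that could look like. The layout below is invented and simplified for illustration only; the column set, host names, and error wording are assumptions, not LeoFS's actual `whereis` format:

```
 del? | node         | ring address | size | clock | when
------+--------------+--------------+------+-------+------
      | node05@host  | a1b2c3d4...  | 512K | ...   | ...
    * | node02@host  |              |      |       |
------+--------------+--------------+------+-------+------
 [Errors]
 * node02@host: timeout while querying the storage node
```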
@mocchira I agree that the result of `whereis` should be clarified.
I have a suggestion. I think that outputting "File not found" is confusing, because it's in the list of failures and looks like some kind of internal error ("File not found" sounds like a very typical error where some actual file is missing on the filesystem), so the real meaning, that the object is simply missing on that node, is a bit hard to grasp. I suggest either not listing "File not found" under "Failure Nodes" at all and just keeping the old output, where all-empty fields mean the object is not present, or at least keeping it separate from "Failure Nodes" and using some other wording.
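Purely for illustration (the layout below is invented, not the real `whereis` table), the two options would read roughly like this:

```
# Option 1: keep the old output; an all-empty row just means
# "not stored on this node"
      | node02@host  |              |      |       |

# Option 2: report it outside "Failure Nodes" with clearer wording
 [Not-found Nodes]
 node02@host
```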
@vstax Thanks for your suggestion.
Makes sense to me. The former sounds good to me.
@yosukehara Thoughts?
@mocchira Sorry for replying so late. LGTM.
Situation: in a cluster of 6 nodes running 1.4.2, a new node (node02) is being recovered in a takeover operation (node09 -> node02). This appears in the log of node02:
To check what's going on, I execute `whereis`; it is very slow to respond, and after some time it produces this output:
I execute `whereis` again and get this output:
There are two problems. The first (minor) one is that the object hasn't been recovered on node02 yet; however, I suspect that it will be recovered at some point in the future (maybe it's in some queue on node02 right now that will be processed later, or maybe it will be pushed to this node by node05 or node08 later on, I have no idea). The queues on all nodes except node02 are filled with millions of messages as part of the recovery process at the moment, so maybe read-repair just hasn't had time to work yet.
The second, main problem is that the first `whereis` output was lying: I'm pretty sure the object was present on node05 at that moment, and it was the manager's request to check its state that failed. The output of `whereis` should clearly differentiate between a missing object and an inability to contact the node (whether because it isn't running at all or because the request timed out), and produce different output for these two cases.
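To make the distinction concrete, here is a minimal Erlang sketch, not LeoFS's actual code (the module name, function, and result atoms are all invented), of how a per-node lookup result could be classified so the two cases never collapse into one:

```erlang
%% Minimal sketch, assuming each storage node's reply (or the RPC
%% layer's failure) arrives as one of the terms matched below.
-module(whereis_sketch).
-export([classify/1]).

%% The object genuinely exists on the node.
classify({ok, Metadata})     -> {found, Metadata};
%% The node answered and the object is really absent: "not stored".
classify({error, not_found}) -> not_stored;
%% The node could not be queried: report "unreachable", not "not found".
classify({error, timeout})   -> {unreachable, timeout};
classify({badrpc, Reason})   -> {unreachable, Reason}.
```

With something along these lines, a slow or down node would show up as unreachable instead of looking identical to a node that simply doesn't hold the object.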
Just in case, here are the logs related to this object (the timeout I hit during the first `whereis` operation is probably in there somewhere as well). Error logs from node02 (nothing in the info logs):
Error logs from node05:
Info logs from node05:
Info logs from node08 (nothing in error logs):