fix(manager): update expected error message for enospc restore test #9590
Conversation
@fruch same here
LGTM
mgmt_cli_test.py (outdated diff)
f"but with an ill fitting error message: {full_progress_string}" | ||
full_progress_string = restore_task.progress_string(parse_table_res=False, | ||
is_verify_errorless_result=True).stdout | ||
assert "failed to restore sstables from location" in full_progress_string.lower(), \ |
I'm not sure what the point of asserting the error message is at all.
Right now it is really vague, and the real problem can be identified by looking at the Scylla logs or metrics.
Also, if we change the error message slightly, it will keep causing pain in future test runs.
I'm just pondering the nature of this test and the SM implementation.
What SM does is: before downloading each batch, it checks whether the node that is downloading the batch has at least 10% free disk space - but it does not check the other nodes, because we don't know where the data will end up. That's the reason why we don't get the "not enough disk space" error anymore - some node with enough disk space downloaded the batch, but it was unable to stream it to the node without the disk space.
Perhaps that's the real issue with SM: we should not only validate that the node doing the download has enough disk space, but also that the same holds for all other nodes in the cluster. Maybe 10% is a little too strict for such a check (with the current effort of reaching 90% disk utilization), but something like 5% might be just fine.
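For illustration only, a minimal sketch of what such a cluster-wide check could look like, written as Python pseudocode (the real check lives in Scylla Manager's Go code; the node attributes and the 5% threshold here are assumptions, not actual SM APIs):

# Hypothetical helper: succeed only if every node in the cluster keeps at least
# min_free_ratio of its disk free, not just the node that downloads the batch.
def cluster_has_enough_free_space(nodes, min_free_ratio=0.05):
    return all(
        node.free_disk_bytes / node.total_disk_bytes >= min_free_ratio
        for node in nodes
    )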
So for now I would either drop the error message assertion or the whole test in general.
I created an issue to evaluate SM behavior in such cases.
For now, I'd probably drop this assertion.
With the current, very generic error message, it doesn't add much value.
In the future, once we implement the "smart" free-space checker you mentioned, we can think about reworking this test.
Fixed.
The error message Manager returns for the enospc scenario has been changed to a more generic one (#1), so it doesn't make much sense to verify it. Moreover, there is a plan to fix the free-disk-space check behaviour, and the whole test will probably require rework anyway (#2).
refs:
#1 - scylladb/scylla-manager#4087
#2 - scylladb/scylla-manager#4184
Force-pushed from 0d3c8f1 to 11a4ab4
Besides one comment,
LGTM
@@ -830,13 +829,6 @@ def test_enospc_before_restore(self):
        assert final_status == TaskStatus.ERROR, \
            f"The restore task is supposed to fail, since node {target_node} lacks the disk space to download" \
            f"the snapshot files"
It could be useful to at least print the error message at log.info level, for clarity, in case someone ever needs to verify the flow.
What kind of message would you like to see in log.info for such a case?
I suppose the assertion message is quite enough to understand the testing scenario if someone wants to verify it.
I meant to log full_progress_string = restore_task.progress_string(parse_table_res=False, is_verify_errorless_result=True).stdout
instead of asserting on it (as it's prone to change).
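A minimal sketch of that suggestion, assuming the test's existing restore_task object and a module-level logger (the logger name is an assumption, not necessarily the test's actual logger):

import logging

LOGGER = logging.getLogger(__name__)

# Log the restore task's progress output instead of asserting on its wording,
# so the flow can still be verified from the SCT log without a brittle assertion.
full_progress_string = restore_task.progress_string(parse_table_res=False,
                                                    is_verify_errorless_result=True).stdout
LOGGER.info("Restore task progress output:\n%s", full_progress_string)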
I got your point, but it seems to me it makes little sense. When we analyze Manager failures, we usually look into the SCT python log, which contains all logged sctool outputs, so full_progress_string
can easily be found there.
OK, looks like it's good to go then.
It was mistakenly closed without merging, reopening it
@fruch Could you please merge it?
LGTM
The error message that Manager returns for the enospc scenario has been changed to a more generic one. This change therefore updates the expected error message and removes the skip-per-issue, since that issue was closed without resolution.
refs:
scylladb/scylla-manager#4087
Testing
PR pre-checks (self review)
backport labels