Block extensions disallowed by policy #3259
b440696
to
a37508f
Compare
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files
@@             Coverage Diff              @@
##           develop    #3259      +/-   ##
===========================================
+ Coverage    71.97%   72.77%   +0.79%
===========================================
  Files          103      114      +11
  Lines        15692    17081    +1389
  Branches      2486     2277     -209
===========================================
+ Hits         11295    12431    +1136
- Misses        3881     4107     +226
- Partials       516      543      +27
Did an initial review not including tests
    """
    # TODO: when CRP adds terminal error code for policy-related extension failures, set that as the default code.
    def __init__(self, msg, inner=None, code=-1):
        msg = "Extension is disallowed by agent policy and will not be processed: {0}".format(msg)
In the case where the agent failed to parse the policy, I'm not sure we should say 'Extension is disallowed by policy'. In that case, the extension is disallowed because there was an issue reading or parsing the policy.
I'm also hesitant about 'agent policy', since the policy is provided by the customer.
I could change this to "Extension will not be processed: "
Parsing errors (InvalidPolicyError) would look like "Extension will not be processed: customer-provided policy file (path) is invalid, please correct the following error..."
Extension disallowed errors (ExtensionPolicyError) would look like "Extension will not be processed: failed to enable extension CustomScript because extension is not specified in policy allowlist. To enable, add extension to the allowed list in policy file (path)."
I've updated the error message as discussed above.
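The message format discussed above could be assembled roughly as follows. This is only a minimal sketch: the helper names, the `operation` argument, and the example path are hypothetical and are not taken from the agent's code; only the message wording comes from the thread.

```python
# Hypothetical helpers illustrating the message wording proposed in the thread.
# None of these names exist in the agent; they are assumptions for illustration.
PREFIX = "Extension will not be processed: "


def invalid_policy_message(policy_path, detail):
    """Message for a policy file that could not be read or parsed."""
    return PREFIX + ("customer-provided policy file ({0}) is invalid, "
                     "please correct the following error: {1}").format(policy_path, detail)


def disallowed_extension_message(operation, ext_name, policy_path):
    """Message for an extension that is not on the policy allowlist."""
    return PREFIX + ("failed to {0} extension {1} because extension is not specified "
                     "in policy allowlist. To {0}, add extension to the allowed list "
                     "in policy file ({2}).").format(operation, ext_name, policy_path)


print(disallowed_extension_message("enable", "CustomScript", "/example/policy.json"))
```

Keeping the shared prefix in one constant means parsing errors and allowlist errors stay distinguishable while still starting with the same text that the tests search for.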
azurelinuxagent/ga/exthandlers.py
Outdated
policy_op, policy_err_code = policy_err_map.get(ext_handler.state)
if policy_error is not None:
    err = ExtensionPolicyError(msg="", inner=policy_error, code=policy_err_code)
    self.__handle_and_report_ext_handler_errors(handler_i, err,
Does this create .status files for single config extensions?
I added a new function __handle_and_report_policy_error() - this should create a status file for any extension with settings.
azurelinuxagent/ga/exthandlers.py
Outdated
ext_handler.name,
conf.get_policy_file_path())
err = ExtensionPolicyError(msg, code=policy_err_code)
self.__handle_and_report_ext_handler_errors(handler_i, err,
same question here about .status file for single config extensions
I added a new function __handle_and_report_policy_error() - this should create a status file for any extension with settings.
Changes look good. I'm going to spend tomorrow going through each e2e scenario and unit test, sorry for the slow review :/
azurelinuxagent/ga/exthandlers.py
Outdated
ExtensionRequestedState.Enabled: ('enable', ExtensionErrorCodes.PluginEnableProcessingFailed),
# TODO: CRP does not currently have a terminal error code for uninstall. Once CRP adds
# an error code for uninstall or for policy, use this code instead of PluginDisableProcessingFailed
# Note that currently, CRP waits for 90 minutes to time out for a failed uninstall operation, instead of
Could we add some more detail to this comment?
Something like:
Note that currently, CRP will poll until the agent does not report a status for an extension that should be uninstalled. In the case of a policy error, the agent will report a failed status on behalf of the extension, which will cause CRP to poll for the full timeout period, instead of failing fast.
Added
azurelinuxagent/ga/exthandlers.py
Outdated
@@ -692,6 +734,26 @@ def __handle_and_report_ext_handler_errors(ext_handler_i, error, report_op, mess
            add_event(name=name, version=handler_version, op=report_op, is_success=False, log_event=True,
                      message=message)

    @staticmethod
    def __handle_and_report_policy_error(ext_handler_i, error, report_op, message, report=True, extension=None):
        # TODO: Consider merging this function with __handle_and_report_ext_handler_errors() above.
Could you please leave a comment explaining why we broke this into a separate function? For policy-related failures, we want to fail extensions fast. CRP will continue to poll for single-config ext status until timeout, so the agent should write a status for single-config extensions. The other function does not create that status, and we didn't want to touch it without investigating the impact of that change further.
Added
azurelinuxagent/ga/exthandlers.py
Outdated
# Create status file for extensions with settings (single and multi config).
if extension is not None:
    ext_handler_i.create_status_file_if_not_exist(extension, status=ExtensionStatusValue.error, code=error.code,
create_status_file_if_not_exist() will not overwrite existing status file (for the current sequence number). Is this behavior acceptable?
we should overwrite the existing file with the policy error
We now overwrite the existing file with the policy error. I've added an "overwrite" parameter and changed the function name to create_status_file().
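The difference between "create if not exist" and "overwrite" can be sketched as below. This is a standalone illustration under stated assumptions: the function signature, the JSON layout, and the file name are hypothetical and do not reproduce the agent's `create_status_file()` method; only the `overwrite` flag idea comes from the thread.

```python
import json
import os


def create_status_file(path, status, code, operation, message, overwrite=False):
    """Write a status file at `path`; return True if a file was written.

    Hypothetical standalone helper: when `overwrite` is False, an existing
    file is left untouched, which is the behavior the reviewer questioned.
    """
    if os.path.exists(path) and not overwrite:
        return False
    # Assumed payload shape, for illustration only.
    payload = [{
        "status": {
            "status": status,
            "code": code,
            "operation": operation,
            "formattedMessage": {"lang": "en-US", "message": message},
        }
    }]
    with open(path, "w") as status_file:
        json.dump(payload, status_file)
    return True
```

With `overwrite=True`, a stale status for the current sequence number is replaced by the policy error, so the reported status reflects the block instead of an earlier result.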
azurelinuxagent/ga/exthandlers.py
Outdated
ExtensionRequestedState.Enabled: ('enable', ExtensionErrorCodes.PluginEnableProcessingFailed),
# Note: currently, when uninstall is requested for an extension, CRP polls until the agent does not
# report status for that extension, or until timeout is reached. In the case of a policy error, the
# agent reports failed status on behalf of the extension, which will cause CRP to for the full timeout,
Suggested change:
- # agent reports failed status on behalf of the extension, which will cause CRP to for the full timeout,
+ # agent reports failed status on behalf of the extension, which will cause CRP to poll for the full timeout,
nit
Fixed, thanks!
tests_e2e/test_suites/ext_policy.yml
Outdated
name: "ExtensionPolicy"
tests:
  - "ext_policy/ext_policy.py"
images: "random(endorsed)"
I think we should run this on more distros so we can get better coverage before releasing the changes
will running on all endorsed distros add too much overhead to the daily runs?
    fail(f"The agent should have reported an error trying to {operation} {extension_case.extension.__str__()} "
         f"because the extension is disallowed by policy.")
except Exception as error:
    assert_that("Extension will not be processed" in str(error)) \
Could we also check for [ExtensionPolicyError] in the message to confirm the failure was due to policy
updated
@@ -630,6 +630,70 @@ def test_it_should_handle_and_report_enable_errors_properly(self):
            }
            self._assert_extension_status(sc_handler, expected_extensions)

    def test_it_should_handle_and_report_disallowed_extensions_properly(self):
Could you also please add a case for multi config ext allowed by policy
Added
name: "ExtPolicyWithDependencies"
tests:
  - "ext_policy/ext_policy_with_dependencies.py"
images: "random(endorsed)"
same comment here, we should get more coverage than 1 run per day, maybe consider running on all endorsed, or 5-10 endorsed images per day
Updated to all endorsed, but I can change to 5-10 if this adds too much overhead.
e909568
to
86de0c5
Compare
Posting comments for Agent code.
Will post comments for test code separately.
    """
    Error raised during agent extension policy enforcement.
    """
    def __init__(self, msg, code, inner=None):
The 'code' and 'inner' parameters are not in the same order as in the base class, which can lead to subtle coding errors.
I wrote it this way because I wanted "code" to be a required parameter in ExtensionPolicyError, but not "inner". But I can set a default value for "code", to keep them in the same order as in the base class.
I ended up removing this class, based on the other comments
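The "subtle coding errors" mentioned above can be shown with a small sketch. The base class here is a simplified stand-in whose `(msg, inner, code)` order is assumed from the earlier snippet in this review; the two subclasses are hypothetical and exist only to contrast the parameter orders.

```python
class ExtensionError(Exception):
    """Simplified stand-in for the base class; parameter order assumed to be (msg, inner, code)."""

    def __init__(self, msg=None, inner=None, code=-1):
        super().__init__(msg)
        self.inner = inner
        self.code = code


class PolicyErrorSwapped(ExtensionError):
    """Hypothetical subclass with 'code' before 'inner', as in the reviewed snippet."""

    def __init__(self, msg, code, inner=None):
        super().__init__(msg, inner, code)


class PolicyErrorAligned(ExtensionError):
    """Hypothetical subclass that keeps the base-class order."""

    def __init__(self, msg, inner=None, code=-1):
        super().__init__(msg, inner, code)


cause = ValueError("bad policy")
# A caller used to the base class passes the inner exception second, positionally.
swapped = PolicyErrorSwapped("blocked", cause)  # the exception silently becomes the 'code'
aligned = PolicyErrorAligned("blocked", cause)  # the exception is stored as 'inner'
```

No error is raised in the swapped case, which is what makes the mismatch easy to miss; passing `inner=` and `code=` by keyword avoids it regardless of order.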
azurelinuxagent/ga/exthandlers.py
Outdated
# Invoke policy engine to determine if extension is allowed. If not, block extension and report error on
# behalf of the extension.
policy_err_map = {
    ExtensionRequestedState.Enabled: ('enable', ExtensionErrorCodes.PluginEnableProcessingFailed),
can we add a comment describing the elements in the tuple?
Added and moved this to the class level.
azurelinuxagent/ga/exthandlers.py
Outdated
for extension, ext_handler in all_extensions:

    handler_i = ExtHandlerInstance(ext_handler, self.protocol, extension=extension)

    # Invoke policy engine to determine if extension is allowed. If not, block extension and report error on
    # behalf of the extension.
    policy_err_map = {
seems like this is a constant... define it at the class level?
Done!
azurelinuxagent/ga/exthandlers.py
Outdated
# Invoke policy engine to determine if extension is allowed. If not, block extension and report error on
# behalf of the extension.
policy_err_map = {
    ExtensionRequestedState.Enabled: ('enable', ExtensionErrorCodes.PluginEnableProcessingFailed),
'enable' and 'disable' are internal CRP/Agent operations; users are not aware of them. They should not be propagated to error messages displayed to the user
Updated this to "run" and "uninstall"
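Taken together, the three requests on this map (document the tuple, make it a class-level constant, use user-facing operation names) might look like the sketch below. Everything here is an assumption for illustration: the stand-in classes use placeholder values rather than the agent's real ones, the constant and method names are invented, and the pairing of "run" with the enabled state is a guess based on the reply above.

```python
class ExtensionRequestedState:
    """Stand-in for the agent's class; values are placeholders."""
    Enabled = "enabled"
    Uninstall = "uninstall"


class ExtensionErrorCodes:
    """Stand-in with placeholder strings, not the agent's real numeric codes."""
    PluginEnableProcessingFailed = "PluginEnableProcessingFailed"
    PluginDisableProcessingFailed = "PluginDisableProcessingFailed"


class ExtHandlersHandler:
    # Maps requested state -> (user-facing operation name, error code reported for a policy failure).
    # Defined once at class level because it never changes between goal states.
    _POLICY_ERROR_MAP = {
        ExtensionRequestedState.Enabled: ("run", ExtensionErrorCodes.PluginEnableProcessingFailed),
        ExtensionRequestedState.Uninstall: ("uninstall", ExtensionErrorCodes.PluginDisableProcessingFailed),
    }

    def policy_error_info(self, requested_state):
        """Return (operation, code) for the state, or None if the state is not mapped."""
        return self._POLICY_ERROR_MAP.get(requested_state)
```

Returning `None` for an unmapped state makes the caller decide what to do, whereas unpacking the result of `.get()` directly would raise a `TypeError` for a state that is missing from the map.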
azurelinuxagent/ga/exthandlers.py
Outdated
}
policy_op, policy_err_code = policy_err_map.get(ext_handler.state)
if policy_error is not None:
    err = ExtensionPolicyError(msg="", inner=policy_error, code=policy_err_code)
what is the intention of creating an exception object here? seems like it is only used to pass the error code, but it is never raised
I initially implemented the ExtensionPolicyError class to have a centralized error message for extensions blocked by policy, and also to pass the code. But you make a good point - since we never actually raise the exception, I've removed the ExtensionPolicyError class and now pass the code/message directly into the reporting function.
azurelinuxagent/ga/exthandlers.py
Outdated
# Create status file for extensions with settings (single and multi config).
if extension is not None:
    ext_handler_i.create_status_file_if_not_exist(extension, status=ExtensionStatusValue.error, code=error.code,
we should overwrite the existing file with the policy error
azurelinuxagent/ga/exthandlers.py
Outdated
    ext_handler_i.create_status_file_if_not_exist(extension, status=ExtensionStatusValue.error, code=error.code,
                                                  operation=report_op, message=message)

if report:
when would report be False?
Currently it is never False. I initially wrote it this way because I was copying the exact structure of __handle_and_report_ext_handler_errors(), but I've removed it since that parameter isn't being used for now.
azurelinuxagent/ga/exthandlers.py
Outdated
@@ -990,7 +1061,10 @@ def report_ext_handler_status(self, vm_status, ext_handler, goal_state_changed):
            # extension even if HandlerState == NotInstalled (Sample scenario: ExtensionsGoalStateError, DecideVersionError, etc)
            # We also need to report extension status for an uninstalled handler if extensions are disabled because CRP
            # waits for extension runtime status before failing the extension operation.
            if handler_state != ExtHandlerState.NotInstalled or ext_handler.supports_multi_config or not conf.get_extensions_enabled():
            # In the case of policy failures, we want to report extension status with a terminal code so CRP fails fast. If
Let's change this
# We also need to report extension status for an uninstalled handler if extensions are disabled because CRP
# waits for extension runtime status before failing the extension operation.
# In the case of policy failures, we want to report extension status with a terminal code so CRP fails fast. If
# extension status is not present, collect_ext_status() will set a default transitioning status, and CRP will
# wait for timeout.
to
# We also need to report extension status for an uninstalled handler if extensions are disabled, or if the extension
# failed due to policy, because CRP waits for extension runtime status before failing the extension operation.
The intention of the change is to enter this condition when the extension fails due to policy, but this change means that we enter the condition whenever policy is enabled.
Is there any negative effect to calling ext_handler_i.get_extension_handler_statuses... whenever policy is enabled? Why is this behind the if condition in the first place?
Discussed offline - removed this condition, because it would cause us to enter the if condition even for non-policy-related uninstall failures.
Left comments mainly for e2e tests. I'll review unit tests once the comments in exthandlers.py are resolved
# Only allowlisted extensions should be processed.
# We only allowlist CustomScript: CustomScript should be enabled, RunCommand and AzureMonitor should fail.
# (Note that CustomScript blocked by policy is tested in a later test case.)
Adding comments to the review so I can follow the scenarios more easily. Consider adding these as comments in the code, but ultimately up to you:
This policy tests the following scenarios:
- single config ext (CSE) enable operation succeeds when allowed by policy
- no-config ext (AzureMonitor) enable operation fails fast when disallowed by policy
- single multi-config instance (RunCommandHandler) enable operation fails fast when disallowed by policy
Added, thanks!
self._operation_should_succeed("delete", azure_monitor)

# Should not uninstall disallowed extensions.
# CustomScript is removed from the allowlist: delete operation should fail.
This policy tests the following scenarios:
- a delete operation on a previously enabled single-config ext (CSE) which is now disallowed by policy fails fast
- multiple multi-config instances (RunCommandHandler and RunCommandHandler2) enable operations fail fast when disallowed by policy
- single-config ext (CSE) enable operation fails fast when disallowed by policy
Added, thanks!
log.info("CRP returned an error for deletion operation, may be a false error. Checking agent log to determine if operation succeeded. Exception: {0}".format(crp_err))
try:
    for ssh_client in ssh_clients.values():
        msg = ("Remove the extension slice: {0}".format(str(ext_to_delete)))
This message is related to cgroup. Right now it's logged even when cgroup isn't enabled (which might and probably should change in the future).
Instead, we should check that the handler was successfully uninstalled. i.e. the last ext status reported by the agent shouldn't include the handler:
2024-11-26T23:54:04.306568Z INFO ExtHandler ExtHandler Extension status: [('Microsoft.Azure.Monitor.AzureMonitorLinuxAgent', 'Ready')]
You might also consider doing this by checking the instance view
Updated this - we now check for the last status reported by the agent and confirm that there is no handler status reported. (Instance view doesn't work because CRP reports a stale status)
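The check described here, reading the last "Extension status" line from the agent log and confirming the handler is absent, could be sketched as follows. This is a minimal illustration that assumes the log line format shown in the example above; the function names are hypothetical and this is not the code used in the test suite.

```python
import ast
import re

# Matches the list printed after "Extension status:" in the sample log line above.
_STATUS_RE = re.compile(r"Extension status: (\[.*\])\s*$")


def last_reported_handlers(log_lines):
    """Return handler names from the last 'Extension status' line, or None if there is none."""
    last = None
    for line in log_lines:
        match = _STATUS_RE.search(line)
        if match:
            # The payload looks like a Python list of (name, state) tuples.
            last = ast.literal_eval(match.group(1))
    if last is None:
        return None
    return [name for name, _state in last]


def handler_was_uninstalled(log_lines, handler_name):
    """True if a status was reported and the handler is not in the most recent one."""
    handlers = last_reported_handlers(log_lines)
    return handlers is not None and handler_name not in handlers
```

Requiring that some status line exists avoids treating an empty or truncated log as proof of a successful uninstall.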
_test_cases = [
    _should_fail_single_config_depends_on_disallowed_no_config,
    _should_fail_single_config_depends_on_disallowed_single_config,
    # TODO: RunCommand is unable to be installed properly, so these tests are currently disabled. Investigate the
Let's specify 'RunCommandHandler' since RunCommand is a different extension (confusing, I know :)
Also is it that RunCommandHandler is unable to be "installed properly" or uninstalled?
Good catch - it's that RunCommandHandler is unable to be uninstalled. I've updated this.
    return policy, template, expected_errors, deletion_order


def _should_no_dependencies():
Looks like this may be leftover code, I don't see it referenced
Removed, thanks!
from pathlib import Path
from tests_e2e.tests.lib.agent_log import AgentLog


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", dest='data', required=True)
    parser.add_argument("--after-timestamp", dest='after_timestamp', required=False)
nice :) thanks for adding this!
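One way an optional `--after-timestamp` argument like the one above can be used is to ignore log lines written before a given point in time. The sketch below is an assumption about intent, not the script under review: the timestamp format is inferred from the sample log line earlier in this thread, and `lines_after` is a hypothetical helper.

```python
import argparse
from datetime import datetime

# Assumed format, matching timestamps such as 2024-11-26T23:54:04.306568Z.
TS_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def parse_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("--data", dest="data", required=True)
    parser.add_argument("--after-timestamp", dest="after_timestamp", required=False)
    return parser.parse_args(argv)


def lines_after(log_lines, after_timestamp=None):
    """Keep only lines whose leading timestamp is later than `after_timestamp`."""
    if after_timestamp is None:
        return list(log_lines)
    cutoff = datetime.strptime(after_timestamp, TS_FORMAT)
    kept = []
    for line in log_lines:
        try:
            stamp = datetime.strptime(line.split(" ", 1)[0], TS_FORMAT)
        except ValueError:
            continue  # skip lines that do not start with a timestamp
        if stamp > cutoff:
            kept.append(line)
    return kept
```

Filtering by timestamp lets a test that shares a VM look only at messages produced by its own operation.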
Comments for test code
@@ -630,6 +630,98 @@ def test_it_should_handle_and_report_enable_errors_properly(self):
            }
            self._assert_extension_status(sc_handler, expected_extensions)

    def test_it_should_handle_and_report_extensions_disallowed_by_policy_properly(self):
what does "properly" mean? (what is the expected behavior?)
changed this to "test_it_should_report_successful_status_for_extensions_allowed_by_policy" and "test_it_should_report_failed_status_for_extensions_disallowed_by_policy"
def test_it_should_handle_and_report_extensions_disallowed_by_policy_properly(self):
    """If multiconfig extension is disallowed by policy, all instances should be blocked."""
    policy_path = os.path.join(self.tmp_dir, "waagent_policy.json")
    patch('azurelinuxagent.common.conf.get_policy_file_path', return_value=str(policy_path)).start()
this should be done using the 'with' statement
Updated, thanks
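The reason for preferring `with` over `.start()` is that the patch is undone automatically when the block exits, even if an assertion fails, whereas `.start()` needs a matching `.stop()`. A minimal sketch follows; `conf` here is a local stand-in object and both paths are made up, so that the example runs without the agent installed.

```python
import types
from unittest.mock import patch

# Local stand-in for the agent's conf module; the path is hypothetical.
conf = types.SimpleNamespace(get_policy_file_path=lambda: "/original/policy.json")


def read_policy_path():
    return conf.get_policy_file_path()


# The patch is active only inside the block and is reverted on exit.
with patch.object(conf, "get_policy_file_path", return_value="/tmp/test/waagent_policy.json"):
    inside = read_policy_path()

outside = read_policy_path()
print(inside, outside)
```

In the actual test the same shape applies with `patch('azurelinuxagent.common.conf.get_policy_file_path', ...)` as the context manager, wrapping the body of the test.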
  - "ext_policy/ext_policy_with_dependencies.py"
images: "endorsed"
executes_on_scale_set: true
# This test should run on its own VMSS, because other tests may leave behind extensions
we should handle this in the test and allow it to share the vm
Added code at the start of each test to remove leftover extensions
# RunCommandHandler is a multi-config extension, so we set up two instances (configurations) here and test both.
run_command = ExtPolicy.TestCase(
    VirtualMachineExtensionClient(self._context.vm, VmExtensionIds.RunCommandHandler,
should use VirtualMachineRunCommand instead of VirtualMachineExtensionClient
Updated
self._create_policy_file(policy)
self._operation_should_succeed("enable", custom_script)
self._operation_should_fail("enable", run_command)
if VmExtensionIds.AzureMonitorLinuxAgent.supports_distro((self._ssh_client.run_command("get_distro.py").rstrip())):
how much coverage are we getting for this case?
I've changed ext_policy.yml to run on all endorsed distros, so this case will be run on all distros that support AzureMonitorLinuxAgent.
why do we need the check on distro if it's going to fail due to policy anyways?
I figured I would avoid testing this extension at all on an unsupported distro, but I can remove the distro check specifically for this case, if you think it would be useful.
yes, no need for that check
I realized this is causing an error - when the policy is changed to allow all, the agent tries to re-enable AMA with the next goal state, and the install fails on unsupported distros.
I've added the supported distro check again. Another option is to uninstall AMA before sending any other goal state and then check agent log to validate the uninstall (race condition will result in an error). If you think it's necessary, I could add that change.
if VmExtensionIds.AzureMonitorLinuxAgent.supports_distro((self._ssh_client.run_command("get_distro.py").rstrip())):
    self._operation_should_fail("enable", azure_monitor)

# When allowlist is turned off, all extensions should be processed.
How about CustomScript?
The final test case tests this - line 256
self._operation_should_succeed("enable", azure_monitor)
self._operation_should_succeed("delete", azure_monitor)

# Should not uninstall disallowed extensions.
seems like
# Only allowlisted extensions should be processed.
and
# Should not uninstall disallowed extensions.
or are we trying to test something different?
I've updated the comments to make it more clear what is being tested with each policy
    }
}
self._create_policy_file(policy)
# # Known CRP issue - delete/uninstall operation times out instead of reporting an error.
This is not a CRP issue, but rather a design issue. Uninstall is best effort and never fails. You should consider checking the agent log for the error and then the instance view to confirm the extension was not uninstalled
I've enabled the failed deletion case and updated the test to validate the agent log and instance view.
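The two-step validation discussed above (find the error in the agent log, then confirm in the instance view that the extension is still installed) could look roughly like the sketch below. `uninstall_was_blocked`, the sample log text, and the list standing in for the instance view are all hypothetical; the real test helpers are not shown in this thread.

```python
def uninstall_was_blocked(agent_log_text, installed_extensions, extension_name, expected_error):
    """Return True if the log shows the expected error for the extension AND it is still installed."""
    # Step 1: the agent should have logged the error for this extension.
    error_logged = any(
        expected_error in line and extension_name in line
        for line in agent_log_text.splitlines()
    )
    # Step 2: the extension should still be listed (stand-in for the instance view).
    still_installed = extension_name in installed_extensions
    return error_logged and still_installed


# Made-up sample data for illustration only; not real agent output.
name = "AzureMonitorLinuxAgent"
sample_log = (
    "INFO processing goal state\n"
    "ERROR [AzureMonitorLinuxAgent] Extension will not be processed: disallowed\n"
)
print(uninstall_was_blocked(sample_log, [name], name, "Extension will not be processed"))
```

Checking both conditions matters because uninstall is best effort: the operation itself never reports a failure, so neither check alone proves the extension was left in place for the expected reason.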
azurelinuxagent/ga/exthandlers.py
Outdated
# - For extensions with settings (install/enable errors): report at both handler and extension levels.

# Keep a list of disallowed extensions so that report_ext_handler_status() can report status for them.
self.disallowed_ext_handlers.append(ext_handler_i.ext_handler)
When does an extension get removed from disallowed_ext_handlers? The ExtHandlersHandler is instantiated only once on agent init
Appending to the list in a method named "report_policy_error" does not feel right (reporting should not change the state of the object). Rename it to handle_policy_error?
I can change this to "handle_ext_disallowed_error"
Changed
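The two concerns raised in this thread (the list must not outlive a goal state, and a method that mutates state should be named "handle" rather than "report") can be illustrated with a small sketch. `DisallowedTracker` and its method names other than `handle_ext_disallowed_error` are hypothetical and only stand in for the relevant part of `ExtHandlersHandler`.

```python
class DisallowedTracker(object):
    """Tracks disallowed handlers for the current goal state only."""

    def __init__(self):
        self._disallowed_ext_handlers = []

    def start_goal_state(self):
        # The owning object is created once, so the list is reset per goal state.
        self._disallowed_ext_handlers = []

    def handle_ext_disallowed_error(self, handler_name, message):
        # "handle" rather than "report": this method changes object state.
        self._disallowed_ext_handlers.append(handler_name)
        return {"name": handler_name, "status": "NotReady", "code": -1, "message": message}

    def handlers_to_report(self):
        # Return a copy so callers cannot mutate the internal list.
        return list(self._disallowed_ext_handlers)


tracker = DisallowedTracker()
tracker.start_goal_state()
tracker.handle_ext_disallowed_error("ExampleExtension", "disallowed by policy")
print(tracker.handlers_to_report())
tracker.start_goal_state()
print(tracker.handlers_to_report())
```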
azurelinuxagent/ga/exthandlers.py
Outdated
@@ -498,11 +530,22 @@ def handle_ext_handlers(self, goal_state_id):
logger.info("{0}: {1}".format(ext_full_name, msg))
add_event(op=WALAEventOperation.ExtensionProcessing, message="{0}: {1}".format(ext_full_name, msg))
handler_i.set_handler_status(status=ExtHandlerStatusValue.not_ready, message=msg, code=-1)
If we want to handle the extensions_disabled scenario in this PR, I think we should remove the logic that creates handler status and extension status here and call __report_policy_error instead.
If we do that, though, we should rename __report_policy_error to something like '__report_ext_disallowed_error' and update the comments in that function to indicate it can be called for either an extension disallowed by policy OR extension processing disabled via config.
We should also rename the policy_error_map so that it is generic enough to cover the extensions-disabled case too.
@@ -0,0 +1,11 @@
#
# The test suite verifies that disallowed extensions are not processed, but the agent should still report status.
nit, suggested change from:
# The test suite verifies that disallowed extensions are not processed, but the agent should still report status.
to:
# The test suite verifies that disallowed extensions and any extensions dependent on disallowed extensions are not processed, but the agent should still report status.
Changed
# The full CRP timeout period for extension operation failure is 90 minutes. For efficiency, we reduce the
# timeout limit to 15 minutes here. We expect "delete" operations on disallowed VMs to reach timeout instead of
# failing fast, because delete is a best effort operation by-design and should not fail.
"because delete is a best effort operation by-design and should not fail."
I think we should add these details to this comment for future readers:
- Delete operations on disallowed VMs reach timeout because the agent reports status for the failure, but CRP is waiting for no status to be reported for the extension.
- CRP continues to poll for no status to be reported because delete is treated as a best effort operation and should not fail.
- Essentially, we're forcing a delete failure, which is unexpected by CRP.
The test no longer updates the CRP timeout, but I've added these comments in _operation_should_fail().
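The timeout behavior described in this thread can be illustrated with a small simulation. This is only a sketch of the reasoning: `wait_for_delete`, `blocked_agent`, and `compliant_agent` are hypothetical names, and the actual CRP polling logic is not part of this PR.

```python
def wait_for_delete(poll_status, max_polls):
    """Poll until no status is reported for the extension, or give up after max_polls."""
    for _ in range(max_polls):
        if poll_status() is None:
            # No status reported: the caller treats the delete as complete.
            return "deleted"
    # Status was still being reported on every poll, so the caller times out.
    return "timeout"


def blocked_agent():
    # The agent keeps reporting an error status for the disallowed extension.
    return {"status": "error", "message": "extension is disallowed"}


def compliant_agent():
    # After a normal uninstall, the agent stops reporting status for the extension.
    return None


print(wait_for_delete(blocked_agent, max_polls=5))
print(wait_for_delete(compliant_agent, max_polls=5))
```

Because the poller only has a "no status" success condition and no failure condition, a reported error never ends the wait early; it can only end in a timeout.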
self._operation_should_fail("enable", run_command)
self._operation_should_fail("enable", run_command_2)
# Only call enable on AMA if supported. The agent will try to re-enable AMA as a part of the next goal state, when
# policy is changed to allow all. This will cause errors on an unsupported distro.
This comment is a little confusing.
I think you added this part to explain why you only call enable on supported distros: "The agent will try to re-enable AMA as a part of the next goal state, when policy is changed to allow all. This will cause errors on an unsupported distro."
But it reads like an explanation of what is going to happen even if you only call enable on supported distros.
Updated the comment
assert_that(expected_msg in str(error)) \
    .described_as(
        f"Error message is expected to contain '{expected_msg}', but actual error message was '{error}'").is_true()
log.info(f"{extension_case.extension} {operation} failed as expected")
Suggested change from:
log.info(f"{extension_case.extension} {operation} failed as expected")
to:
log.info(f"{extension_case.extension} {operation} failed as expected due to policy")
Updated
f"The agent should have reported an error trying to {operation} {extension_case.extension} "
f"because the extension is disallowed by policy.")
except Exception as error:
    expected_msg = "Extension will not be processed: failed to run extension"
I think the expected_msg should include that the ext failed to run due to being disallowed
Updated
azurelinuxagent/ga/exthandlers.py
Outdated
@@ -276,6 +297,8 @@ class ExtHandlersHandler(object):
def __init__(self, protocol):
    self.protocol = protocol
    self.ext_handlers = None
    # Maintain a list of extensions that are disallowed, and always report extension status for disallowed extensions.
    self.disallowed_ext_handlers = []
This should be private.
Updated
@@ -19,7 +19,8 @@ tar --exclude='journal/*' --exclude='omsbundle' --exclude='omsagent' --exclude='
     -czf "$logs_file_name" \
     /var/log \
     /var/lib/waagent/ \
-    $waagent_conf
+    $waagent_conf \
+    /etc/waagent_policy.json
No issue with collecting this, just curious: why do we need it?
Earlier, I found a bug where the policy file was not being updated by the test correctly. Collecting the policy file was useful for debugging that scenario, and I figured it might be helpful for future debugging too.
Yes, but for debugging the tests this is of limited use: they change the policy file multiple times, and this collects only the last update. You should remove it from here and add it to the goal state history (ok to mark as a TODO for the next PR).
Added as a TODO
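To illustrate why a single copy at log-collection time is not enough, here is a minimal sketch that keeps a numbered snapshot every time the policy file is rewritten. It is a simpler, test-side stand-in, not the goal state history integration proposed above; `write_policy_with_snapshot`, the snapshot naming, and the placeholder policy contents are all assumptions.

```python
import json
import os
import shutil
import tempfile


def write_policy_with_snapshot(policy, policy_path, history_dir):
    """Write the policy file and keep a numbered copy of every version written."""
    os.makedirs(history_dir, exist_ok=True)
    with open(policy_path, "w") as policy_file:
        json.dump(policy, policy_file, indent=2)
    # Number snapshots by how many already exist so earlier versions are never overwritten.
    index = len(os.listdir(history_dir)) + 1
    snapshot_path = os.path.join(history_dir, "waagent_policy_{0:03d}.json".format(index))
    shutil.copyfile(policy_path, snapshot_path)
    return snapshot_path


# Demo in a temporary directory; the policy contents are placeholders, not the real schema.
work_dir = tempfile.mkdtemp()
policy_path = os.path.join(work_dir, "waagent_policy.json")
history_dir = os.path.join(work_dir, "history")
first = write_policy_with_snapshot({"example": "allow_all"}, policy_path, history_dir)
second = write_policy_with_snapshot({"example": "allowlist_only"}, policy_path, history_dir)
print(sorted(os.listdir(history_dir)))
```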
azurelinuxagent/ga/exthandlers.py
Outdated
agent_conf_file_path = get_osutil().agent_conf_file_path
msg = "Extension '{0}' will not be processed since extension processing is disabled. To enable extension " \
      "processing, set Extensions.Enabled=y in '{1}'".format(ext_full_name, agent_conf_file_path)
# logger.info(msg)
left-over?
This was removed in commit ce5cf20
azurelinuxagent/ga/exthandlers.py
Outdated
# it instead of PluginDisableProcessingFailed below.
#
# Note: currently, when uninstall is requested for an extension, CRP polls until the agent does not
# report status for that extension, or until timeout is reached. In the case of a policy error, the
nit, suggested change from:
# report status for that extension, or until timeout is reached. In the case of a policy error, the
to:
# report status for that extension, or until timeout is reached. In the case of an extension disallowed error, the
updated
azurelinuxagent/ga/exthandlers.py
Outdated
@@ -276,6 +297,9 @@ class ExtHandlersHandler(object):
def __init__(self, protocol):
    self.protocol = protocol
    self.ext_handlers = None
    # Maintain a list of extension handler objects that are disallowed (e.g. blocked by policy, extensions disabled, etc.).
    # Extension status is always reported for the extensions in this list. List is reset for each goal state.
nit: I think this comment should indicate that extension status is always reported for the extensions in the list if an extension status exists (i.e. it's not a no-config ext)
Updated this wording to:
"Extension status, if it exists, is always reported for the extensions in this list."
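The distinction behind "if it exists" can be sketched as follows: handler-level status is always produced, while extension-level status is produced only when the extension has settings (a no-config extension has none). `build_disallowed_status` and the dictionary field names are hypothetical and do not reflect the agent's actual status format.

```python
def build_disallowed_status(handler_name, message, extension_settings):
    """Build a status report for a disallowed handler."""
    # Handler-level status is always reported.
    report = {
        "handler": {"name": handler_name, "status": "NotReady", "code": -1, "message": message}
    }
    # Extension-level status exists only when there are settings to report against.
    if extension_settings:
        report["extensions"] = [
            {"name": setting_name, "status": "error", "code": -1, "message": message}
            for setting_name in extension_settings
        ]
    return report


no_config = build_disallowed_status("ExampleNoConfigExt", "disallowed", [])
with_settings = build_disallowed_status("ExampleExt", "disallowed", ["settingsA", "settingsB"])
print("extensions" in no_config, len(with_settings["extensions"]))
```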
azurelinuxagent/ga/exthandlers.py
Outdated
code=-1,
operation=handler_i.operation,
message=msg)
agent_conf_file_path = get_osutil().agent_conf_file_path
nit: should probably use get_agent_conf_file_path() instead
Updated
Description
Issue #
PR #2 for the policy engine allowlist feature:
PR information
Quality of Code and Contribution Guidelines