Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Block extensions disallowed by policy #3259

Merged
merged 43 commits into from
Jan 22, 2025
Merged

Conversation

mgunnala
Copy link

Description

Issue #


PR #2 for the policy engine allowlist feature:

  • invoke policy engine from exthandlers.py
  • block all extensions and report status if any errors are thrown during engine initialization
  • block any extensions that are disallowed by policy
  • add unit and e2e tests

PR information

  • The title of the PR is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For information on cleaning up the commits in your pull request, see this page.
  • If applicable, the PR references the bug/issue that it fixes in the description.
  • New Unit tests were added for the changes made

Quality of Code and Contribution Guidelines

Copy link

codecov bot commented Nov 13, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.77%. Comparing base (3aebcdd) to head (a37508f).
Report is 328 commits behind head on develop.

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #3259      +/-   ##
===========================================
+ Coverage    71.97%   72.77%   +0.79%     
===========================================
  Files          103      114      +11     
  Lines        15692    17081    +1389     
  Branches      2486     2277     -209     
===========================================
+ Hits         11295    12431    +1136     
- Misses        3881     4107     +226     
- Partials       516      543      +27     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@maddieford maddieford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did an initial review not including tests

azurelinuxagent/ga/policy/policy_engine.py Outdated Show resolved Hide resolved
"""
# TODO: when CRP adds terminal error code for policy-related extension failures, set that as the default code.
def __init__(self, msg, inner=None, code=-1):
msg = "Extension is disallowed by agent policy and will not be processed: {0}".format(msg)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case where agent failed to parse policy, I'm not sure we should say 'Extension is disallowed by policy'. In this case, extension is disallowed because there's some issue reading or parsing the policy.

I also am hesitant about 'agent policy' since policy is provided by customer

Copy link
Author

@mgunnala mgunnala Nov 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I could change this to "Extension will not be processed: "

Parsing errors (InvalidPolicyError) would look like "Extension will not be processed: customer-provided policy file (path) is invalid, please correct the following error..."

Extension disallowed errors (ExtensionPolicyError) would look like "Extension will not be processed: failed to enable extension CustomScript because extension is not specified in policy allowlist. To enable, add extension to the allowed list in policy file (path)."

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the error message as discussed above.

policy_op, policy_err_code = policy_err_map.get(ext_handler.state)
if policy_error is not None:
err = ExtensionPolicyError(msg="", inner=policy_error, code=policy_err_code)
self.__handle_and_report_ext_handler_errors(handler_i, err,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this create .status files for single config extensions?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a new function __handle_and_report_policy_error() - this should create a status file for any extension with settings.

ext_handler.name,
conf.get_policy_file_path())
err = ExtensionPolicyError(msg, code=policy_err_code)
self.__handle_and_report_ext_handler_errors(handler_i, err,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same question here about .status file for single config extensions

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a new function __handle_and_report_policy_error() - this should create a status file for any extension with settings.

Copy link
Contributor

@maddieford maddieford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changes look good. I'm going to spend tomorrow going through each e2e scenario and unit test, sorry for the slow review :/

ExtensionRequestedState.Enabled: ('enable', ExtensionErrorCodes.PluginEnableProcessingFailed),
# TODO: CRP does not currently have a terminal error code for uninstall. Once CRP adds
# an error code for uninstall or for policy, use this code instead of PluginDisableProcessingFailed
# Note that currently, CRP waits for 90 minutes to time out for a failed uninstall operation, instead of
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we add some more detail to this comment?
Something like:

Note that currently, CRP will poll until the agent does not report a status for an extension that should be uninstalled. In the case of a policy error, the agent will report a failed status on behalf of the extension, which will cause CRP to poll for the full timeout period, instead of failing fast.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

@@ -692,6 +734,26 @@ def __handle_and_report_ext_handler_errors(ext_handler_i, error, report_op, mess
add_event(name=name, version=handler_version, op=report_op, is_success=False, log_event=True,
message=message)

@staticmethod
def __handle_and_report_policy_error(ext_handler_i, error, report_op, message, report=True, extension=None):
# TODO: Consider merging this function with __handle_and_report_ext_handler_errors() above.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please leave some comment explaining why we broke this into a separate function? For policy related failures, we want to fail extensions fast. CRP will continue to poll for single-config ext status until timeout, so agent should write a status for single-config extensions. The other function does not create that status and we didn't want to touch the other function without investigating the impact of that change further

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added


# Create status file for extensions with settings (single and multi config).
if extension is not None:
ext_handler_i.create_status_file_if_not_exist(extension, status=ExtensionStatusValue.error, code=error.code,
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

create_status_file_if_not_exist() will not overwrite existing status file (for the current sequence number). Is this behavior acceptable?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should overwrite the existing file with the policy error

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We now overwrite the existing file with policy error. I've added an "overwrite" parameter and changed the function name to create_status_file( ).

ExtensionRequestedState.Enabled: ('enable', ExtensionErrorCodes.PluginEnableProcessingFailed),
# Note: currently, when uninstall is requested for an extension, CRP polls until the agent does not
# report status for that extension, or until timeout is reached. In the case of a policy error, the
# agent reports failed status on behalf of the extension, which will cause CRP to for the full timeout,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
# agent reports failed status on behalf of the extension, which will cause CRP to for the full timeout,
# agent reports failed status on behalf of the extension, which will cause CRP to poll for the full timeout,

nit

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed, thanks!

tests/ga/test_extension.py Show resolved Hide resolved
name: "ExtensionPolicy"
tests:
- "ext_policy/ext_policy.py"
images: "random(endorsed)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should run this on more distros so we can get better coverage before releasing the changes

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will running on all endorsed distros add too much overhead to the daily runs?

tests_e2e/tests/ext_policy/ext_policy.py Show resolved Hide resolved
fail(f"The agent should have reported an error trying to {operation} {extension_case.extension.__str__()} "
f"because the extension is disallowed by policy.")
except Exception as error:
assert_that("Extension will not be processed" in str(error)) \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could we also check for [ExtensionPolicyError] in the message to confirm the failure was due to policy

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

azurelinuxagent/ga/exthandlers.py Outdated Show resolved Hide resolved
@@ -630,6 +630,70 @@ def test_it_should_handle_and_report_enable_errors_properly(self):
}
self._assert_extension_status(sc_handler, expected_extensions)

def test_it_should_handle_and_report_disallowed_extensions_properly(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also please add a case for multi config ext allowed by policy

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added

name: "ExtPolicyWithDependencies"
tests:
- "ext_policy/ext_policy_with_dependencies.py"
images: "random(endorsed)"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

same comment here, we should get more coverage than 1 run per day, maybe consider running on all endorsed, or 5-10 endorsed images per day

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated to all endorsed, but I can change to 5-10 if this adds too much overhead.

tests_e2e/tests/ext_policy/ext_policy.py Show resolved Hide resolved
Copy link
Member

@narrieta narrieta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Posting comments for Agent code.

Will post comments for test code separately.

"""
Error raised during agent extension policy enforcement.
"""
def __init__(self, msg, code, inner=None):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The 'code' and 'inner' parameters are not in the same order as in the base class, which can lead to subtle coding errors.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wrote it this way because I wanted "code" to be a required parameter in ExtensionPolicyEngine, but not "inner". But I can set a default value for "code", to keep them in the same order as in the base class.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ended up removing this class, based on the other comments

# Invoke policy engine to determine if extension is allowed. If not, block extension and report error on
# behalf of the extension.
policy_err_map = {
ExtensionRequestedState.Enabled: ('enable', ExtensionErrorCodes.PluginEnableProcessingFailed),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we add a comment describing the elements in the tuple?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added and moved this to the class level.

for extension, ext_handler in all_extensions:

handler_i = ExtHandlerInstance(ext_handler, self.protocol, extension=extension)

# Invoke policy engine to determine if extension is allowed. If not, block extension and report error on
# behalf of the extension.
policy_err_map = {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like this is a constant... define it at the class level?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done!

# Invoke policy engine to determine if extension is allowed. If not, block extension and report error on
# behalf of the extension.
policy_err_map = {
ExtensionRequestedState.Enabled: ('enable', ExtensionErrorCodes.PluginEnableProcessingFailed),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

'enable' and 'disable' are internal CRP/Agent operations; users are not aware of them. They should not be propagated to error messages displayed to the user

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated this to "run" and "uninstall"

}
policy_op, policy_err_code = policy_err_map.get(ext_handler.state)
if policy_error is not None:
err = ExtensionPolicyError(msg="", inner=policy_error, code=policy_err_code)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what is the intention of creating an exception object here? seems like it is only used to pass the error code, but it is never raised

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I initially implemented the ExtensionPolicyError class to have a centralized error message for extensions blocked by policy, and also to pass the code. But you make a good point - since we never actually raise the exception, I've removed the ExtensionPolicyError class and now pass the code/message directly into the reporting function.

azurelinuxagent/ga/exthandlers.py Show resolved Hide resolved
azurelinuxagent/ga/exthandlers.py Outdated Show resolved Hide resolved

# Create status file for extensions with settings (single and multi config).
if extension is not None:
ext_handler_i.create_status_file_if_not_exist(extension, status=ExtensionStatusValue.error, code=error.code,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should overwrite the existing file with the policy error

ext_handler_i.create_status_file_if_not_exist(extension, status=ExtensionStatusValue.error, code=error.code,
operation=report_op, message=message)

if report:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

when would report be False?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Currently it isn't ever false, I initially wrote it this way because I was copying the exact structure of __handle_and_report_ext_handler_errors(). But I've removed it since that parameter isn't being used for now.

@@ -990,7 +1061,10 @@ def report_ext_handler_status(self, vm_status, ext_handler, goal_state_changed):
# extension even if HandlerState == NotInstalled (Sample scenario: ExtensionsGoalStateError, DecideVersionError, etc)
# We also need to report extension status for an uninstalled handler if extensions are disabled because CRP
# waits for extension runtime status before failing the extension operation.
if handler_state != ExtHandlerState.NotInstalled or ext_handler.supports_multi_config or not conf.get_extensions_enabled():
# In the case of policy failures, we want to report extension status with a terminal code so CRP fails fast. If
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's change this

        # We also need to report extension status for an uninstalled handler if extensions are disabled because CRP
        # waits for extension runtime status before failing the extension operation.
        # In the case of policy failures, we want to report extension status with a terminal code so CRP fails fast. If
        # extension status is not present, collect_ext_status() will set a default transitioning status, and CRP will
        # wait for timeout.

to

        # We also need to report extension status for an uninstalled handler if extensions are disabled, or if the extension
        # failed due to policy, because CRP waits for extension runtime status before failing the extension operation.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intention of the change is to enter this condition when the extension fails due to policy, but this change means that we enter the condition whenever policy is enabled.

Is there any negative effect to calling ext_handler_i.get_extension_handler_statuses... whenever policy is enabled? Why is this behind the if condition in the first place?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Discussed offline - removed this condition, because it would cause us to enter the if condition even for non-policy-related uninstall failures.

Copy link
Contributor

@maddieford maddieford left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left comments mainly for e2e tests. I'll review unit tests once the comments in exthandlers.py are resolved


# Only allowlisted extensions should be processed.
# We only allowlist CustomScript: CustomScript should be enabled, RunCommand and AzureMonitor should fail.
# (Note that CustomScript blocked by policy is tested in a later test case.)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Adding comments to the review so I can follow the scenarios easier. Consider adding these as comments in the code, but ultimately up to you:
This policy tests the following scenarios:
- single config ext (CSE) enable operation succeeds when allowed by policy
- no-config ext (AzureMonitor) enable operation fails fast when disallowed by policy
- single multi-config instance (RunCommandHandler) enable operation fails fast when disallowed by policy

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, thanks!

tests_e2e/tests/ext_policy/ext_policy.py Show resolved Hide resolved
tests_e2e/tests/ext_policy/ext_policy.py Show resolved Hide resolved
self._operation_should_succeed("delete", azure_monitor)

# Should not uninstall disallowed extensions.
# CustomScript is removed from the allowlist: delete operation should fail.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This policy tests the following scenarios:
- a delete operation on a previously enabled single-config ext (CSE) which is now disallowed by policy fails fast
- multiple multi-config instances (RunCommandHandler and RunCommandHandler2) enable operations fail fast when disallowed by policy
- single-config ext (CSE) enable operation fails fast when disallowed by policy

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added, thanks!

tests_e2e/tests/ext_policy/ext_policy.py Show resolved Hide resolved
log.info("CRP returned an error for deletion operation, may be a false error. Checking agent log to determine if operation succeeded. Exception: {0}".format(crp_err))
try:
for ssh_client in ssh_clients.values():
msg = ("Remove the extension slice: {0}".format(str(ext_to_delete)))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This message is related to cgroup. Right now it's logged even when cgroup isn't enabled (which might and probably should change in the future).

Instead, we should check that the handler was successfully uninstalled. i.e. the last ext status reported by the agent shouldn't include the handler:

2024-11-26T23:54:04.306568Z INFO ExtHandler ExtHandler Extension status: [('Microsoft.Azure.Monitor.AzureMonitorLinuxAgent', 'Ready')]

You might also consider doing this by checking the instance view

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated this - we now check for the last status reported by the agent and confirm that there is no handler status reported. (Instance view doesn't work because CRP reports a stale status)

_test_cases = [
_should_fail_single_config_depends_on_disallowed_no_config,
_should_fail_single_config_depends_on_disallowed_single_config,
# TODO: RunCommand is unable to be installed properly, so these tests are currently disabled. Investigate the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's specify 'RunCommandHandler' since RunCommand is a different extension (confusing, I know :)

Also is it that RunCommandHandler is unable to be "installed properly" or uninstalled?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch - it's that RunCommandHandler is unable to be uninstalled. I've updated this.

return policy, template, expected_errors, deletion_order


def _should_no_dependencies():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this may be leftover code, I don't see it referenced

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removed, thanks!

from pathlib import Path
from tests_e2e.tests.lib.agent_log import AgentLog


def main():
parser = argparse.ArgumentParser()
parser.add_argument("--data", dest='data', required=True)
parser.add_argument("--after-timestamp", dest='after_timestamp', required=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice :) thanks for adding this!

Copy link
Member

@narrieta narrieta left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comments for test code

tests/ga/test_extension.py Show resolved Hide resolved
@@ -630,6 +630,98 @@ def test_it_should_handle_and_report_enable_errors_properly(self):
}
self._assert_extension_status(sc_handler, expected_extensions)

def test_it_should_handle_and_report_extensions_disallowed_by_policy_properly(self):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does "properly" mean? (what is the expected behavior?)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed this to "test_it_should_report_successful_status_for_extensions_allowed_by_policy" and "test_it_should_report_failed_status_for_extensions_disallowed_by_policy"

def test_it_should_handle_and_report_extensions_disallowed_by_policy_properly(self):
"""If multiconfig extension is disallowed by policy, all instances should be blocked."""
policy_path = os.path.join(self.tmp_dir, "waagent_policy.json")
patch('azurelinuxagent.common.conf.get_policy_file_path', return_value=str(policy_path)).start()
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this should be done using the 'with' statement

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, thanks

- "ext_policy/ext_policy_with_dependencies.py"
images: "endorsed"
executes_on_scale_set: true
# This test should run on its own VMSS, because other tests may leave behind extensions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should handle this in the test and allow it to share the vm

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added code at the start of each test to remove leftover extensions

tests_e2e/tests/ext_policy/ext_policy.py Show resolved Hide resolved

# RunCommandHandler is a multi-config extension, so we set up two instances (configurations) here and test both.
run_command = ExtPolicy.TestCase(
VirtualMachineExtensionClient(self._context.vm, VmExtensionIds.RunCommandHandler,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should use VirtualMachineRunCommand instead of VirtualMachineExtensionClient

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

self._create_policy_file(policy)
self._operation_should_succeed("enable", custom_script)
self._operation_should_fail("enable", run_command)
if VmExtensionIds.AzureMonitorLinuxAgent.supports_distro((self._ssh_client.run_command("get_distro.py").rstrip())):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how much coverage are we getting for this case?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've changed ext_policy.yml to run on all endorsed distros, so this case will be run on all distros that support AzureMonitorLinuxAgent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need the check on distro if it's going to fail due to policy anyways?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I figured I would avoid testing this extension at all on an unsupported distro, but I can remove the distro check specifically for this case, if you think it would be useful.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, no need for that check

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I realized this is causing an error - when the policy is changed to allow all, the agent tries to re-enable AMA with the next goal state, and the install fails on unsupported distros.

I've added the supported distro check again. Another option is to uninstall AMA before sending any other goal state and then check agent log to validate the uninstall (race condition will result in an error). If you think it's necessary, I could add that change.

if VmExtensionIds.AzureMonitorLinuxAgent.supports_distro((self._ssh_client.run_command("get_distro.py").rstrip())):
self._operation_should_fail("enable", azure_monitor)

# When allowlist is turned off, all extensions should be processed.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about CustomScript?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The final test case tests this - line 256

self._operation_should_succeed("enable", azure_monitor)
self._operation_should_succeed("delete", azure_monitor)

# Should not uninstall disallowed extensions.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

seems like

# Only allowlisted extensions should be processed.

and

 # Should not uninstall disallowed extensions.

or are we trying to test something different?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've updated the comments to make it more clear what is being tested with each policy

}
}
self._create_policy_file(policy)
# # Known CRP issue - delete/uninstall operation times out instead of reporting an error.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not a CRP issue, but rather a design issue. Uninstall is best effort and never fails. You should consider checking the agent log for the error and then the instance view to confirm the extension was not uninstalled

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've enabled the failed deletion case and updated the test to validate the agent log and instance view.

# - For extensions with settings (install/enable errors): report at both handler and extension levels.

# Keep a list of disallowed extensions so that report_ext_handler_status() can report status for them.
self.disallowed_ext_handlers.append(ext_handler_i.ext_handler)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When does an extension get removed from disallowed_ext_handlers? The ExtHandlersHandler is instantiated only once on agent init

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

appending to the list in a method named "report_policy_error" does not feel right (reporting should not change the state of the object). rename to handle_policy_error?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can change this to "handle_ext_disallowed_error"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed

@@ -498,11 +530,22 @@ def handle_ext_handlers(self, goal_state_id):
logger.info("{0}: {1}".format(ext_full_name, msg))
add_event(op=WALAEventOperation.ExtensionProcessing, message="{0}: {1}".format(ext_full_name, msg))
handler_i.set_handler_status(status=ExtHandlerStatusValue.not_ready, message=msg, code=-1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we want to handle extensions_disabled scenario in this PR, I think we should remove the logic to create handler status and extension status here and call __report_policy_error instead.

Although, if we do that, I think we should rename __report_policy_error to something like '__report_ext_disallowed_error' and update the comments in that function to indicate it can be called for either extension disallowed by policy OR extension processing disabled via config.

We should also rename the policy_error_map to be generic to extensions disabled too

azurelinuxagent/ga/exthandlers.py Show resolved Hide resolved
tests/data/test_waagent.conf Show resolved Hide resolved
@@ -0,0 +1,11 @@
#
# The test suite verifies that disallowed extensions are not processed, but the agent should still report status.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
# The test suite verifies that disallowed extensions are not processed, but the agent should still report status.
# The test suite verifies that disallowed extensions and any extensions dependent on disallowed extensions are not processed, but the agent should still report status.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed


# The full CRP timeout period for extension operation failure is 90 minutes. For efficiency, we reduce the
# timeout limit to 15 minutes here. We expect "delete" operations on disallowed VMs to reach timeout instead of
# failing fast, because delete is a best effort operation by-design and should not fail.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because delete is a best effort operation by-design and should not fail.
I think we should these details to this comment for future readers:

  • Delete operations on disallowed VMs reach timeout because the agent reports status for the failure, but CRP is waiting for no status to be reported for the extension.
  • CRP continues to poll for no status to be reported because delete is treated as a best effort operation and should not fail.
  • Essentially, we're forcing a delete failure, which is unexpected by CRP

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The test no longer updates the crp timeout, but I've added these comments in _operation_should_fail()

self._operation_should_fail("enable", run_command)
self._operation_should_fail("enable", run_command_2)
# Only call enable on AMA if supported. The agent will try to re-enable AMA as a part of the next goal state, when
# policy is changed to allow all. This will cause errors on an unsupported distro.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment is a little confusing.
I think you added this part as an explanation for why you only call enable on supported distros: The agent will try to re-enable AMA as a part of the next goal state, when policy is changed to allow all. This will cause errors on an unsupported distro.
But it sounds like this is an explanation of what is going to happen even if you only call enable on supported distros

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated the comment

assert_that(expected_msg in str(error)) \
.described_as(
f"Error message is expected to contain '{expected_msg}', but actual error message was '{error}'").is_true()
log.info(f"{extension_case.extension} {operation} failed as expected")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
log.info(f"{extension_case.extension} {operation} failed as expected")
log.info(f"{extension_case.extension} {operation} failed as expected due to policy")

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

f"The agent should have reported an error trying to {operation} {extension_case.extension} "
f"because the extension is disallowed by policy.")
except Exception as error:
expected_msg = "Extension will not be processed: failed to run extension"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the expected_msg should include that the ext failed to run due to being disallowed

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

tests_e2e/tests/ext_policy/ext_policy.py Show resolved Hide resolved
@@ -276,6 +297,8 @@ class ExtHandlersHandler(object):
def __init__(self, protocol):
self.protocol = protocol
self.ext_handlers = None
# Maintain a list of extensions that are disallowed, and always report extension status for disallowed extensions.
self.disallowed_ext_handlers = []
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be private

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

@@ -19,7 +19,8 @@ tar --exclude='journal/*' --exclude='omsbundle' --exclude='omsagent' --exclude='
-czf "$logs_file_name" \
/var/log \
/var/lib/waagent/ \
$waagent_conf
$waagent_conf \
/etc/waagent_policy.json
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no issue with collecting this, just curious: why do we need it?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Earlier, I found a bug where the policy file was not being updated by the test correctly. Collecting the policy file was useful for debugging that scenario, and I figured it might be helpful for future debugging too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, but as far as debugging the tests, they change the policy file multiple times and this collects only the last update. You should remove it from here and add it to the goal state history (ok to mark as TODO for next PR)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added as a TODO

agent_conf_file_path = get_osutil().agent_conf_file_path
msg = "Extension '{0}' will not be processed since extension processing is disabled. To enable extension " \
"processing, set Extensions.Enabled=y in '{1}'".format(ext_full_name, agent_conf_file_path)
# logger.info(msg)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left-over?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was removed in commit ce5cf20

narrieta
narrieta previously approved these changes Jan 21, 2025
# it instead of PluginDisableProcessingFailed below.
#
# Note: currently, when uninstall is requested for an extension, CRP polls until the agent does not
# report status for that extension, or until timeout is reached. In the case of a policy error, the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit

Suggested change
# report status for that extension, or until timeout is reached. In the case of a policy error, the
# report status for that extension, or until timeout is reached. In the case of an extension disallowed error, the

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updated

@@ -276,6 +297,9 @@ class ExtHandlersHandler(object):
def __init__(self, protocol):
self.protocol = protocol
self.ext_handlers = None
# Maintain a list of extension handler objects that are disallowed (e.g. blocked by policy, extensions disabled, etc.).
# Extension status is always reported for the extensions in this list. List is reset for each goal state.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I think this comment should indicate that extension status is always reported for the extensions in the list if an extension status exists (i.e. it's not a no-config ext)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated this wording to

Extension status, if it exists, is always reported for the extensions in this list.

code=-1,
operation=handler_i.operation,
message=msg)
agent_conf_file_path = get_osutil().agent_conf_file_path
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: should probably use get_agent_conf_file_path() instead

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated

azurelinuxagent/ga/exthandlers.py Show resolved Hide resolved
@narrieta narrieta merged commit 9e2ada3 into Azure:develop Jan 22, 2025
11 checks passed
@mgunnala mgunnala deleted the allowlist_2 branch January 22, 2025 16:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants