Azure Multi-Disk VM successful boot with errors Azure Linux Agent #1253

Open
bignay2000 opened this issue Nov 14, 2023 · 4 comments
Labels: kind/bug (Something isn't working), platform/Azure

Comments

bignay2000 commented Nov 14, 2023

Description

Launching a new Azure VM with multiple disks produces several errors during boot, although the boot itself succeeds.
These errors seem to be unique to Azure.

Azure_Flatcar_Successful_Boot_With_Errors.txt

[   15.197228] ignition[1044]: CRITICAL : files: createFilesystemsFiles: createFiles: op(e): op(f): [failed]   mounting "/dev/disk/by-label/OEM" at "/mnt/oem1037826206": device or resource busy
[   15.202443] ignition[1044]: ERROR    : files: createFilesystemsFiles: createFiles: op(e): failed to mount ext4 device "/dev/disk/by-label/OEM" at "/mnt/oem1037826206", trying btrfs: device or resource busy
2023-11-14T18:11:49.020850Z INFO Daemon Daemon Azure Linux Agent Version:2.6.0.2
2023-11-14T18:11:49.022269Z INFO Daemon Daemon OS: flatcar 3602.2.1
2023-11-14T18:11:49.023217Z INFO Daemon Daemon Python: 3.10.10
2023-11-14T18:11:49.024276Z INFO Daemon Daemon Run daemon
2023-11-14T18:11:49.025137Z INFO Daemon Daemon No RDMA handler exists for distro='Flatcar Container Linux by Kinvolk' version='3602.2.1'
2023-11-14T18:11:49.037300Z INFO Daemon Daemon Unable to get cloud-init enabled status from systemctl: Command '['systemctl', 'is-enabled', 'cloud-init-local.service']' returned non-zero exit status 1.
2023-11-14T18:11:49.043683Z INFO Daemon Daemon Unable to get cloud-init enabled status from service: [Errno 2] No such file or directory: 'service'
2023-11-14T18:11:49.048121Z INFO Daemon Daemon cloud-init is enabled: False
2023-11-14T18:11:49.050555Z INFO Daemon Daemon Using waagent for provisioning
2023-11-14T18:11:49.053263Z INFO Daemon Daemon Activate resource disk
2023-11-14T18:11:49.055581Z INFO Daemon Daemon Searching gen1 prefix 00000000-0001 or gen2 f8b3781a-1e82-4818-a1c3-63d806ec15bb
2023-11-14T18:11:49.065311Z INFO Daemon Daemon Found device: None
2023-11-14T18:11:49.067684Z ERROR Daemon Daemon Failed to mount resource disk [ResourceDiskError] unable to detect disk topology
2023-11-14T18:11:49.071412Z ERROR Daemon Daemon Event: name=WALinuxAgent, op=ActivateResourceDisk, message=[ResourceDiskError] unable to detect disk topology, duration=0

waagent.log

2023-11-14T18:11:51.070247Z ERROR ExtHandler ExtHandler Unable to setup the persistent firewall rules: [Errno 30] Read-only file system: '/lib/systemd/system/waagent-network-setup.service'
2023-11-14T18:11:51.091747Z WARNING ExtHandler ExtHandler Fetch failed: [HttpError] HTTPS is unavailable and required
2023-11-14T18:11:51.093042Z INFO ExtHandler ExtHandler [PERIODIC] Request failed using the direct channel. Error: 'NoneType' object has no attribute 'getheaders'
2023-11-14T18:11:51.093788Z ERROR EnvHandler ExtHandler Failed to get the PID of the DHCP client: invalid literal for int() with base 10: 'MainPID=1640'
2023-11-14T22:11:55.641723Z ERROR ExtHandler ExtHandler Event: name=Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux, op=Install, message=[ExtensionOperationError] Non-zero exit code: 51, /var/lib/waagent/Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux-1.17.2/omsagent_shim.sh -install
[stdout]
2023/11/14 22:11:54 [Microsoft.EnterpriseCloud.Monitoring.OmsAgentForLinux-1.17.2] Install,failed,51,Unsupported operating system: flatcar 3602.2.1
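
To triage an excerpt like the one above by severity, a small script can help. This is only a sketch: the line layout (timestamp, level, two name fields, message) is inferred from the lines quoted in this issue, and `count_levels` is a hypothetical helper, not part of WALinuxAgent.

```python
import re
from collections import Counter

# Assumed layout, inferred from the excerpt above:
# <UTC timestamp> <LEVEL> <name> <name> <message>
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2}T\S+Z) (?P<level>[A-Z]+) "
    r"(?P<thread>\S+) (?P<prefix>\S+) (?P<msg>.*)$"
)


def count_levels(lines):
    """Count log lines per severity; lines that do not match are ignored."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            counts[match.group("level")] += 1
    return counts


# Sample lines copied from the logs quoted in this issue.
sample = [
    "2023-11-14T18:11:49.065311Z INFO Daemon Daemon Found device: None",
    "2023-11-14T18:11:49.067684Z ERROR Daemon Daemon Failed to mount resource disk "
    "[ResourceDiskError] unable to detect disk topology",
    "2023-11-14T18:11:51.091747Z WARNING ExtHandler ExtHandler Fetch failed: "
    "[HttpError] HTTPS is unavailable and required",
    "[stdout]",
]

if __name__ == "__main__":
    for level, count in sorted(count_levels(sample).items()):
        print(level, count)
```

Lines that do not follow the assumed layout, such as the `[stdout]` marker, are skipped rather than counted.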

Impact

I am unsure whether these disk errors are benign, and I am worried that I may run into a VM disk issue.

I am also concerned that the Microsoft Azure Linux Agent may not be fully working and is incorrectly reporting Flatcar as an unsupported operating system. I am not sure how this impacts Azure metrics, monitoring, and security; the firewall error is particularly concerning.

Environment and steps to reproduce

Microsoft Azure using Azure CLI

  1. az login
  2. az account set --subscription "linux-nonproduction"
  3. az group create --location westus2 --name linux-nonproduction
  4. az network vnet create --name linux-nonproduction --resource-group linux-nonproduction --address-prefix 172.17.0.0/16
  5. az network vnet subnet create --name vmdecld01 --address-prefixes 172.17.1.0/24 --resource-group linux-nonproduction --vnet-name linux-nonproduction
  6. az disk create --name vmdecld01_DataDisk_sdb_var_lib_docker --resource-group linux-nonproduction --size-gb 32 --sku Premium_LRS --tier P50
  7. az disk create --name vmdecld01_DataDisk_sdc_srv_dockercompose --resource-group linux-nonproduction --size-gb 4 --sku Premium_LRS --tier P50
  8. az disk create --name vmdecld01_DataDisk_sdd_srv_dockervolumes --resource-group linux-nonproduction --size-gb 64 --sku Premium_LRS --tier P50
  9. az disk create --name vmdecld01_DataDisk_sde_srv_dockerlogs --resource-group linux-nonproduction --size-gb 10 --sku Premium_LRS --tier P50
  10. az disk create --name vmdecld01_DataDisk_sdf_srv_dockerbackups --resource-group linux-nonproduction --size-gb 32 --sku Premium_LRS --tier P50
  11. Download the attached flatcar.bu.yaml.txt and remove the .txt extension
  12. ./butane-x86_64-pc-windows-gnu.exe --strict --output flatcar-container-linux-config.json flatcar.bu.yaml
  13. az vm create --custom-data flatcar-container-linux-config.json --resource-group linux-nonproduction --name vmdecld01 --size Standard_E4-2as_v5 --os-disk-size-gb 16 --storage-sku Premium_LRS --attach-data-disks vmdecld01_DataDisk_sdb_var_lib_docker vmdecld01_DataDisk_sdc_srv_dockercompose vmdecld01_DataDisk_sdd_srv_dockervolumes vmdecld01_DataDisk_sde_srv_dockerlogs vmdecld01_DataDisk_sdf_srv_dockerbackups --admin-username abcadmin --generate-ssh-keys --accept-term --image kinvolk:flatcar-container-linux:stable-gen2:latest
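
Steps 6 to 10 repeat the same `az disk create` call with different names and sizes, so they can be generated from one table. The sketch below only builds and prints the command lines; it uses only the flags shown in the steps above, and `disk_create_argv` is a hypothetical helper name.

```python
import shlex

# (disk name, size in GB) pairs copied from steps 6-10 above.
DISKS = [
    ("vmdecld01_DataDisk_sdb_var_lib_docker", 32),
    ("vmdecld01_DataDisk_sdc_srv_dockercompose", 4),
    ("vmdecld01_DataDisk_sdd_srv_dockervolumes", 64),
    ("vmdecld01_DataDisk_sde_srv_dockerlogs", 10),
    ("vmdecld01_DataDisk_sdf_srv_dockerbackups", 32),
]


def disk_create_argv(name, size_gb, resource_group="linux-nonproduction"):
    """Build the argument list for one `az disk create` call from the steps above."""
    return [
        "az", "disk", "create",
        "--name", name,
        "--resource-group", resource_group,
        "--size-gb", str(size_gb),
        "--sku", "Premium_LRS",
        "--tier", "P50",
    ]


if __name__ == "__main__":
    # Print the commands instead of running them, so this is safe to try anywhere.
    for disk_name, disk_size in DISKS:
        print(shlex.join(disk_create_argv(disk_name, disk_size)))
```

To actually create the disks, each argument list could be passed to `subprocess.run(argv, check=True)` on a machine where `az login` has already been done.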

Expected behavior

Boot completes without ERROR or CRITICAL log entries. If these messages are benign, they should be logged as warnings instead.

Additional information

I had to update my flatcar.bu.yaml to set wipe_filesystem to true. I have used this same configuration for years on ESXi, and this year on Proxmox, with wipe_filesystem set to false. I am not sure why this flag is required on Azure but not on other platforms. Is something different in how Azure presents a disk to Flatcar? As shown above, the disks are new and empty, created with the az disk create command.

Submitted bug report Azure/WALinuxAgent#2984 because the Microsoft Azure Linux Agent does not handle the Flatcar setup and logs an unsupported operating system, even though Flatcar 3374.2.x+ is a supported operating system.

bignay2000 added the kind/bug label Nov 14, 2023
bignay2000 changed the title from "Azure Multi-disk vm successful boot but with errors" to "Azure Multi-Disk VM successful boot but with errors" Nov 15, 2023
bignay2000 (Author) commented Nov 15, 2023

@sayanchowdhury Big thanks for making the latest images available in the Microsoft Azure Marketplace. It appears I may need your help getting the Microsoft Azure Linux Agent updated to be compatible with these latest Flatcar images.

bignay2000 changed the title from "Azure Multi-Disk VM successful boot but with errors" to "Azure Multi-Disk VM successful boot with errors" Nov 15, 2023
bignay2000 changed the title from "Azure Multi-Disk VM successful boot with errors" to "Azure Multi-Disk VM successful boot with errors Azure Linux Agent Version" Nov 15, 2023
bignay2000 changed the title from "Azure Multi-Disk VM successful boot with errors Azure Linux Agent Version" to "Azure Multi-Disk VM successful boot with errors Azure Linux Agent" Nov 15, 2023
jepio (Member) commented Nov 15, 2023

> 2023-11-14T18:11:49.067684Z ERROR Daemon Daemon Failed to mount resource disk [ResourceDiskError] unable to detect disk topology
> 2023-11-14T18:11:49.071412Z ERROR Daemon Daemon Event: name=WALinuxAgent, op=ActivateResourceDisk, message=[ResourceDiskError] unable to detect disk topology, duration=0

This message is expected when you are not using an Azure instance size with a temporary disk (indicated by a small 'd' in the instance size name). See also Azure/WALinuxAgent#2110.
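
As a rough illustration of the "small 'd'" rule, the sketch below checks a size name for that letter. It is a naming heuristic only: the regular expression and `has_temp_disk_flag` are assumptions made for this example, older size series may not follow this naming at all, and the Azure documentation for the specific size remains the authority.

```python
import re

# Assumed name layout: Standard_<FAMILY><vCPUs>[-<constrained vCPUs>]<features>[_<suffix>...]
# e.g. "Standard_E4-2as_v5" -> family "E", 4 vCPUs, constrained to 2, features "as".
_SIZE_RE = re.compile(r"^Standard_([A-Z]+)(\d+)(?:-(\d+))?([a-z]*)(?:_\w+)*$")


def has_temp_disk_flag(vm_size: str) -> bool:
    """Return True if the lowercase feature letters of the size name contain 'd'."""
    match = _SIZE_RE.match(vm_size)
    if match is None:
        raise ValueError(f"unrecognized VM size name: {vm_size!r}")
    return "d" in match.group(4)


if __name__ == "__main__":
    # The size used in this issue has features "as", so no 'd'.
    print(has_temp_disk_flag("Standard_E4-2as_v5"))
```

For the size used in this issue, `Standard_E4-2as_v5`, the feature letters are `as`, which is consistent with the agent finding no resource disk.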

> 2023-11-14T18:11:51.093788Z ERROR EnvHandler ExtHandler Failed to get the PID of the DHCP client: invalid literal for int() with base 10: 'MainPID=1640'

Fixed in Azure/WALinuxAgent#2784. I'll need to update the WALinuxAgent version.
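
For context, the error text is what Python's `int()` raises when it is handed the whole `MainPID=1640` string, which is the `key=value` form that `systemctl show -p MainPID` prints. The sketch below reproduces the failure and shows one tolerant way to parse it. `parse_main_pid` is a hypothetical helper, and the actual change in Azure/WALinuxAgent#2784 may look different.

```python
def parse_main_pid(output: str) -> int:
    """Parse a PID from either 'MainPID=1640' or a bare '1640'."""
    text = output.strip()
    _, sep, value = text.partition("=")
    # With no '=' present, partition() leaves the whole string in the first slot.
    return int(value if sep else text)


if __name__ == "__main__":
    try:
        int("MainPID=1640")  # reproduces the failure from the log line above
    except ValueError as exc:
        print(exc)
    print(parse_main_pid("MainPID=1640\n"))
```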

> 2023-11-14T18:11:51.070247Z ERROR ExtHandler ExtHandler Unable to setup the persistent firewall rules: [Errno 30] Read-only file system: '/lib/systemd/system/waagent-network-setup.service'

get_systemd_unit_file_install_path needs fixing. This error is not concerning: it only affects persisting an OS config that would apply the firewall rule, and WALinuxAgent applies the rule on every boot instead.
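
One possible shape for such a fix, shown purely as an illustration and not as the agent's code, is to probe candidate unit directories and use the first writable one. The function name `pick_unit_dir`, the candidate list, and the `/etc/systemd/system` fallback are all assumptions made for this sketch.

```python
import os


def pick_unit_dir(candidates=("/lib/systemd/system", "/etc/systemd/system"),
                  is_writable=None):
    """Return the first candidate directory that is writable, or None.

    `is_writable` can be injected for testing; by default the directory must
    exist and pass an os.access() write check.
    """
    check = is_writable or (lambda path: os.path.isdir(path) and os.access(path, os.W_OK))
    for path in candidates:
        if check(path):
            return path
    return None


if __name__ == "__main__":
    # The result depends on the machine and on the privileges of the caller.
    print(pick_unit_dir())
```

On a system where `/lib/systemd/system` sits on a read-only file system, a check like this would skip it instead of failing with `[Errno 30]`.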

> Not sure how this impacts Azure metrics, monitoring, and security

Unfortunately not every Azure extension works on Flatcar.

bignay2000 (Author) commented

@jepio Let me know when your changes are available in the kinvolk:flatcar-container-linux:stable-gen2:latest image so I can retest. Thanks! Very helpful!

bignay2000 (Author) commented Oct 3, 2024

@jepio Did this get fully implemented in the stable release? If the new code is now in stable, I would consider this resolved.
