[BUG] EC2 AMIs do not start user-data provisioning or amazon-init.service #137
Comments
First of all, could you check whether the same problem occurs on 24.05beta? https://nixos.github.io/amis/ 23.11 will be EOL in a few days.
This part I don't understand. However, in my experience those are eventually consistent. It can take minutes before they populate (even on non-NixOS). A better way is to connect directly to the serial console, either in the AWS Console or through the aws cli. |
Note that the ssh key is set up by https://github.com/NixOS/nixpkgs/blob/95bdd7fc6cb47a43ef1716bf4e9beb30c6ca364b/nixos/modules/virtualisation/ec2-data.nix#L21 not Both are ordered after but have no strict requirement https://github.com/NixOS/nixpkgs/blob/95bdd7fc6cb47a43ef1716bf4e9beb30c6ca364b/nixos/modules/virtualisation/amazon-image.nix#L72 So I think what might be happening is that https://github.com/NixOS/nixpkgs/blob/95bdd7fc6cb47a43ef1716bf4e9beb30c6ca364b/nixos/modules/virtualisation/ec2-metadata-fetcher.sh is failing for some reason. |
the output of the |
To get debug logs programmatically (only works for Nitro instances!), this command can be used:
Output of that command on a failing instance would be very useful. In the meantime I'm gonna spawn 10 instances and see if I can reproduce
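The command itself did not survive in this copy of the thread. As a rough sketch only, assuming it resembles the `aws ec2 get-console-output --latest` call used later in the thread, fetching the log could look like the following. The helper name `fetch_console_log` and the default region are illustrative assumptions, not taken from the original comment.

```shell
#!/bin/sh
# Sketch: fetch the most recent serial console output for one instance.
# fetch_console_log is a hypothetical helper name.
# --latest is only supported on Nitro-based instance types.
fetch_console_log() {
    instance_id=$1
    region=${2:-us-west-2}   # assumed default, matching the region used in this thread
    aws ec2 get-console-output \
        --region "$region" \
        --latest \
        --instance-id "$instance_id" \
        --query Output \
        --output text
}
```

Calling `fetch_console_log i-0123456789abcdef0` (a placeholder instance ID) shortly after boot should print whatever the instance has written to its serial console so far.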
|
Understood, thank you for explaining. Yes, I'll give it a try on the new 24.05 beta. FWIW we did do a test where user-data was this script:
on 23.11 and saw the same issue. I'll try this out on the newer AMI version 24.05 and see how it behaves. |
I spawned 10 instances in a row and none of them reproduce what you're reporting. Can you provide more info about your deployment? The commands I used:
#!/bin/sh
id=$(aws ec2 run-instances --region us-west-2 --key-name [email protected] --image-id ami-0f1471461676aaa00 --instance-type t3.nano --security-groups launch-wizard-1 --query 'Instances[0].InstanceId' --output text)
ip=$(aws ec2 describe-instances --region us-west-2 --instance-ids $id --query 'Reservations[0].Instances[0].PublicIpAddress' --output text)
sleep 30
ssh root@$ip
aws ec2 get-console-output --region us-west-2 --latest --instance-id $id
aws ec2 terminate-instances --region us-west-2 --instance-ids $id
Did you set a NixOS config in |
@arianvp I was able to reproduce with a minimal user-data just We're using instance type: |
Also with that user data I'm not able to reproduce on a
Last thing I can think of. Did you maybe disable IMDSv2 on your instance and are using the legacy IMDSv1? Otherwise it must be something with that specific instance type. Which is kinda odd. |
So yeah, please provide the output of the following command
on a machine that didn't get the SSH key set up properly, shortly after booting up the instance, and share it here. It should give us a clue about what is failing. |
@arianvp Here you go! No. We're requiring IMDSv2 via the launch template. Output: https://gist.github.com/hjkatz/d454f2ccb0f86e39217b9293d69dadf8 Command: |
Are you sure this is the log of a server that failed to set up? Because all three services succeeded in that run:
The only thing fishy in those logs is that it took all 3 retries to fetch the IMDSv2 Token. Though it did succeed on the last try.
I'm wondering if the number of retries is for some reason not sufficient on your instance type and we need to bump it up, and whether that is what fails in 70% of the cases. However, it's super weird that IMDS is so slow to start up. E.g. on my t3 instance the IMDS call succeeds immediately:
|
IMDS not being available 16 seconds into the boot sounds like the kind of thing I'd open an AWS Support Ticket for to be honest. This is extremely puzzling if this is indeed what is failing for you. |
Agreed that IMDS not being available immediately is odd and a huge smell. Actually, after pulling those logs I found out that this server did set up successfully and was launched using my Here are the new interesting logs: https://gist.github.com/hjkatz/daa4935f532f8947e0ce48892e3955b9
It does appear to be an IMDS issue. I'll try to follow up with AWS support too. Though I don't think that NixOS should continue booting through this failure. It says it put the keys there, but in reality we still get a password prompt for Also just for my case why not add a backoff + retry longer? I don't see a huge downside personally. BTW those keys' fingerprints as loaded/printed do not match the key pair we've uploaded and use for our servers. |
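The "backoff + retry longer" idea could be sketched roughly as below. This is a minimal illustration, not the actual ec2-metadata-fetcher.sh: `retry_with_backoff` and `fetch_imds_token` are hypothetical names, and the attempt count and delays are arbitrary. The token URL and TTL header are the documented IMDSv2 ones.

```shell
#!/bin/sh
# retry_with_backoff MAX_ATTEMPTS INITIAL_DELAY COMMAND [ARGS...]
# Runs COMMAND until it succeeds, doubling the delay after each failure.
# Hypothetical helper; not part of the upstream fetcher script.
retry_with_backoff() {
    max_attempts=$1
    delay=$2
    shift 2
    attempt=1
    while :; do
        if "$@"; then
            return 0
        fi
        if [ "$attempt" -ge "$max_attempts" ]; then
            echo "giving up after $attempt attempts: $*" >&2
            return 1
        fi
        echo "attempt $attempt failed, retrying in ${delay}s" >&2
        sleep "$delay"
        delay=$((delay * 2))
        attempt=$((attempt + 1))
    done
}

# Request an IMDSv2 session token over the IPv4 link-local address.
fetch_imds_token() {
    curl --silent --show-error --fail --connect-timeout 1 \
        -X PUT \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 600" \
        http://169.254.169.254/latest/api/token
}
```

With `token=$(retry_with_backoff 8 1 fetch_imds_token) || exit 1`, the script would sleep 1, 2, 4, ... 64 seconds between attempts (about two minutes in total) and exit non-zero instead of continuing the boot without metadata.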
Yep that makes sense. The IMDS fetcher is defined here So you'll need to modify that and then build and upload a new AMI yourself. (Side-note; I wonder if the script can be simplified if we add a
Those are the host key fingerprints, not the fingerprints of the client keys you upload. The host key is what the server uses to authenticate itself to you (not the other way around), and host keys are generated fresh on startup. The idea is that you can retrieve the fingerprint from the console output and compare it with the fingerprint that is printed the first time you connect to the instance, e.g. the
prompt. |
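A sketch of the fingerprint comparison described above. It assumes the console output wraps the fingerprints between `-----BEGIN SSH HOST KEY FINGERPRINTS-----` and `-----END SSH HOST KEY FINGERPRINTS-----` marker lines; that marker format is an assumption here and should be checked against your own console log. `extract_host_fingerprints` is a hypothetical helper name.

```shell
#!/bin/sh
# Read console output on stdin and print only the host key fingerprint lines.
# Assumes (unverified for every image) that they sit between the two marker lines.
extract_host_fingerprints() {
    sed -n '/-----BEGIN SSH HOST KEY FINGERPRINTS-----/,/-----END SSH HOST KEY FINGERPRINTS-----/p' \
        | grep -v -- '-----'
}
```

The result can then be compared by eye with what the server actually presents, for example `aws ec2 get-console-output --latest --instance-id "$id" --query Output --output text | extract_host_fingerprints` on one side and `ssh-keyscan "$ip" 2>/dev/null | ssh-keygen -lf -` on the other.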
Beautiful, thank you for the information and pointers to the script. I'll find time this weekend to open a PR. |
Yeh I agree that |
Hey @mswilson, long shot: does IMDS not being reachable up to 20 seconds into boot for c6id instances ring a bell to you at all? |
That would be unexpected behavior. Are you sure that networking is getting
fully initialized?
Matt
On Fri, May 24, 2024 at 2:57 PM Arian van Putten wrote:
> Hey @mswilson, long shot. Does having IMDS not being reachable up to 20 seconds into boot for c6id instances ring a bell to you at all?
|
cc @arianvp, the PR: NixOS/nixpkgs#314427 I would love any advice for building the AMI to test this. This is my first PR for the nixpkgs ecosystem and I do not run a NixOS machine personally. |
I am pretty sure the network is properly initialized before we attempt to reach IMDS:
is printed before the imds fetcher script starts which means we successfully got a DHCP lease. I'd love to debug this more but the instance type is really expensive.
Could you join https://app.element.io/#/room/#aws:nixos.org ? I think Matrix is a better place to help you out with this. |
I tried to reproduce on your instance type and I can't. IMDS always responds instantly, and I am always able to SSH. I spawned 15 instances of type All of them retrieved the IMDS data on the first try, as follows:
At this point I really don't know what is going on in your environment, but I'm afraid I cannot reproduce any of the behaviour you're reporting. I suggest contacting AWS support, and I am going to close this issue as "can not reproduce". The script that I used:
|
(Feel free to reopen if you have more info) |
I have successfully been able to reproduce your issue. It reproduces as soon as you launch an instance in an ipv6 dual-stack subnet. The reason why it's happening:
I think the easiest and most correct fix here is that we should adjust the script to try both the IPv6 and IPv4 IMDS address. |
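Trying both addresses could look roughly like this. It is only a sketch of the idea, not the change that went into nixpkgs: `try_endpoint` and `first_reachable_endpoint` are hypothetical names. The two addresses are the documented IPv4 and IPv6 IMDS endpoints; note that the IPv6 endpoint only answers if it has been enabled in the instance metadata options.

```shell
#!/bin/sh
# try_endpoint BASE_URL: succeed if BASE_URL hands out an IMDSv2 token.
# --globoff stops curl from treating the [..] of an IPv6 literal as a glob.
try_endpoint() {
    curl --silent --fail --globoff --connect-timeout 1 \
        -X PUT \
        -H "X-aws-ec2-metadata-token-ttl-seconds: 600" \
        "$1/latest/api/token"
}

# Print the first IMDS endpoint that responds; return non-zero if none does.
first_reachable_endpoint() {
    for endpoint in "http://169.254.169.254" "http://[fd00:ec2::254]"; do
        if try_endpoint "$endpoint" >/dev/null 2>&1; then
            echo "$endpoint"
            return 0
        fi
    done
    return 1
}
```

A fetcher script could call `endpoint=$(first_reachable_endpoint)` once and then use `$endpoint` as the base URL for all later metadata requests.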
@arianvp Hooray for us! We were able to avoid the issue by turning off IPv6 address assignment on our EC2 instances. We launched ~40 of them and all came up without issues. Would you like to take a stab at those IPv6 adjustments to the metadata script? |
Check out the PR here: NixOS/nixpkgs#314427 Happy to make adjustments and work back and forth there. |
Did not mean to close |
Given the complicated nature of
#!/bin/sh
# /bin/sh should be set up by the NixOS image to point to bash-interactive in the Nix store.
# Regenerate the hardware configuration?
nixos-generate-config
# Overwrite the existing configuration.
cat > /etc/nixos/configuration.nix << END_OF_FILE
{ config, pkgs, ... }:
{
  imports = [
    ./hardware-configuration.nix
  ];
  nix = {
    settings = {
      experimental-features = [
        "nix-command"
        "flakes"
      ];
    };
  };
  # ...
}
END_OF_FILE
# Rebuild the system.
nixos-rebuild switch
# Maybe create a flake.nix if they would rather a flake configuration instead.
Alternatively, the user script uses a That way, the NixOS AMIs provide:
All of these properly wait for IMDS and are maintained by upstream (i.e.
Here's the Amazon Linux 2023 package list for reference: https://docs.aws.amazon.com/linux/al2023/ug/image-comparison.html
The only thing missing would be a systemd oneshot unit that looks for an EC2 key pair from IMDS and registers it with the SSH daemon. Personally, I don't really view that as high priority given that they can use SSM Session Manager with the SSM agent installed (also needs an SSM IAM role for the instance profile, but AWS provides a managed policy for that) or they can use EC2 Instance Connect to push an ephemeral SSH key through IAM-protected APIs. Long-lived credentials like EC2 key pairs are bad practice to begin with.
The only other optional things I can think of are:
There's also the AWS CodeDeploy Agent, though the deploy pattern for that with mutable environments feels like something most NixOS users won't use. Instead, I imagine NixOS users are more likely to bake immutable AMIs ahead of time and then use EC2 Instance Refresh to replace ASG instances. I have a separate feature request to the AWS CloudFormation team to allow triggering that feature natively with a CloudFormation stack update. aws-cloudformation/cloudformation-coverage-roadmap#2119 To support setups that bake a NixOS disk image and upload those to S3 for EC2's |
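For the oneshot unit mentioned above (look up the EC2 key pair in IMDS and register it for SSH), the script such a unit runs might look roughly like this. It is a sketch under assumptions: `fetch_ec2_public_key` and `register_ssh_key` are hypothetical names, the token is assumed to have been obtained separately, and writing to an authorized_keys file is just one possible way to register the key. The `public-keys/0/openssh-key` path is the documented IMDS location of the first key pair.

```shell
#!/bin/sh
# fetch_ec2_public_key TOKEN: print the public key of the first EC2 key pair.
fetch_ec2_public_key() {
    curl --silent --show-error --fail \
        -H "X-aws-ec2-metadata-token: $1" \
        http://169.254.169.254/latest/meta-data/public-keys/0/openssh-key
}

# register_ssh_key KEY FILE: append KEY to FILE unless it is already listed,
# so the unit can run on every boot without duplicating entries.
register_ssh_key() {
    key=$1
    file=$2
    mkdir -p "$(dirname "$file")"
    touch "$file"
    chmod 600 "$file"
    if ! grep -qxF -- "$key" "$file"; then
        printf '%s\n' "$key" >> "$file"
    fi
}
```

A unit script could then run `key=$(fetch_ec2_public_key "$token") && register_ssh_key "$key" /root/.ssh/authorized_keys`, failing the unit visibly if the key cannot be fetched.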
On a separate note for the IPv4 and IPv6 lease race condition, it looks like systemd-networkd has some options around that. https://search.nixos.org/options?channel=unstable&show=systemd.network.wait-online.anyInterface I think even in IPv6-only subnets, IMDS is always available over IPv4? https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/configuring-instance-metadata-options.html
https://docs.aws.amazon.com/vpc/latest/userguide/configure-subnets.html#subnet-ip-address-range
|
Ah nice! We just have to configure networkd to wait on both ipv6 and ipv4 then I think.
|
While launching 23.11 AMIs in the us-west-2 region we observe that not every instance launched starts the amazon-init.service nor seems to set the key pair properly for the root user.
We are presented with the following:
This happens ~70% of the time, other times the instance comes up just fine.
AMI ID: ami-0f1471461676aaa00
This also means that we see no logs in the boot logs in the AWS console. Which makes debugging very difficult.
Another note: Rebooting the instance a few times seems to get past the problem.
Observed heuristic: if the SSH keys are not set up by ~30s after boot, then the instance never comes up properly at all.