Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

localhost_is_ec2() == false on ECS running in AWS Batch #24

Closed
ianfiske opened this issue Feb 2, 2018 · 22 comments
Closed

localhost_is_ec2() == false on ECS running in AWS Batch #24

ianfiske opened this issue Feb 2, 2018 · 22 comments

Comments

@ianfiske
Copy link

ianfiske commented Feb 2, 2018

I am trying to get credentials working in jobs submitted by AWS Batch, which uses ECS under the hood. It seems like the intention of AWSCore.jl is to get creds from aws_config() or directly from ecs_instance_credentials(). However, these fail because localhost_is_ec2() assertions fail:

ERROR: LoadError: AssertionError: localhost_is_ec2()
Stacktrace:
[1] ecs_instance_credentials() at /root/.julia/v0.6/AWSCore/src/AWSCredentials.jl:214
[2] include_from_node1(::String) at ./loading.jl:576
[3] include(::String) at ./sysimg.jl:14
[4] process_options(::Base.JLOptions) at ./client.jl:305
[5] _start() at ./client.jl:371
while loading /testcreds.jl, in expression starting on line 4
@ianfiske
Copy link
Author

ianfiske commented Feb 2, 2018

Investigating and providing a minimal reproducible example. Running the following in AWS Batch:

print("isfile(\"/sys/hypervisor/uuid\") is $(isfile("/sys/hypervisor/uuid"))")

shows

isfile("/sys/hypervisor/uuid") is false

@samoconnor
Copy link
Contributor

HI @ianfiske
Thanks for reporting this. It is a while since I've used AWS Batch. I wonder if the EC2 container is now deliberately hiding the EC2 instance that is underneath it?

I've made a change 1e5964e to detect ECS by haskey(ENV, "AWS_CONTAINER_CREDENTIALS_RELATIVE_URI"). Let me know if that works.

Reference: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/identify_ec2_instances.html

@jademackay
Copy link

Hi,
I have just encountered this issue too. For me it is a consequence of how new EC2 ami types are virtualised. It appears that c5 and m5 instances don't use Xen and so don't use /sys/hypervisor/, the file /sys/hypervisor/uuid is not created. Other instance types still have /sys/hyporvisor/uuid.

@samoconnor
Copy link
Contributor

Do these instances have a /sys/devices/virtual/dmi/id/product_uuid file ?
See https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/identify_ec2_instances.html

@jademackay
Copy link

Yes.

@samoconnor
Copy link
Contributor

Could you try adding something like this to localhost_is_ec2 ... ?

        return isfile("/sys/hypervisor/uuid") &&
               String(read("/sys/hypervisor/uuid",3)) == "ec2" ||
               isfile("/sys/devices/virtual/dmi/id/product_uuid") &&
               String(read("/sys/devices/virtual/dmi/id/product_uuid",3)) == "EC2"

https://github.com/JuliaCloud/AWSCore.jl/blob/master/src/AWSCredentials.jl#L112
(note uppercase "EC2")

@jademackay
Copy link

I certainly can :-)

@samoconnor
Copy link
Contributor

Let me know if it works or if you need to tweak it a bit to get it to work...

@jademackay
Copy link

I will let you know how I get on.

@jademackay
Copy link

jademackay commented Mar 2, 2018

Unfortunately the /sys/devices/virtual/dmi/id/product_uuid is owned by root and has permissions -r--------. This explains why the documentation specifies sudo cat /sys/devices/virtual/dmi/id/product_uuid.
Another approach would be to make an http request to http://169.254.169.254/latest ... with a short time-out.

@samoconnor
Copy link
Contributor

@samoconnor
Copy link
Contributor

Some issues with a connect timeout:

So, the options seem to be:

  • Add a AWS_USE_EC2_INSTANCE_CREDENTIALS env var to override localhost_is_ec2(), or
  • file an issue requesting a timeout option for Base.connect (and wait), or
  • file an issue with Amazon asking for a reliable non-root way to detect EC2 (and wait), or
  • wait for https://patchwork.kernel.org/patch/6461521/, or
  • create and a local task to do the timeout (a bit ugly) something like this:
function localhost_has_instance_profile()
    result = Ref{Bool}(false)
    @schedule try
        info = JSON.parse(ec2_metadata("iam/info"))
        result[] = haskey(info, "InstanceProfileArn")
    catch e
    end
    sleep(0.5)
    return result[]
end

@jademackay
Copy link

I see. Thanks for your outline.
I think that a solution - or interim solution - will be needed well before the file an issue or wait for the kernel patch options are achieved. So that leaves setting an environment variable or introducing an ugly-ish timeout.
I strongly appreciate the preference for the cleaner, more principled solution of using the environment variable. Although, it's probably worth emphasizing that if the env var solution is chosen, most users will need to call something like localhost_has_instance_profile() above in their own packages to preserve any indifference to compute environment that they may have become accustomed to.

@samoconnor
Copy link
Contributor

Perhaps we could do this ping -c1 -W 1 169.254.169.254 before trying to connect,
and/or something like this: (the only thing we're trying to avoid here is a long timeout when you run this at the REPL on your laptop and you have no .awscredentials file.)

diff --git a/src/AWSCredentials.jl b/src/AWSCredentials.jl
index d2eed3a..99dec55 100644
--- a/src/AWSCredentials.jl
+++ b/src/AWSCredentials.jl
@@ -11,7 +11,7 @@
 
 export AWSCredentials,
        localhost_is_lambda,
-       localhost_is_ec2,
+       localhost_maybe_ec2,
        aws_user_arn,
        aws_account_number
 
@@ -75,7 +75,7 @@ function AWSCredentials()
 
         creds = ecs_instance_credentials()
 
-    elseif localhost_is_ec2()
+    elseif localhost_maybe_ec2()
 
         creds = ec2_instance_credentials()
 
@@ -103,7 +103,7 @@ localhost_is_lambda() = haskey(ENV, "LAMBDA_TASK_ROOT")
 Is Julia running on an EC2 virtual machine?
 """
 
-function localhost_is_ec2()
+function localhost_maybe_ec2()
 
     if localhost_is_lambda()
         return false
@@ -111,7 +111,8 @@ function localhost_is_ec2()
 
     @static if VERSION < v"0.7.0-DEV" ? is_unix() : Sys.isunix()
         return isfile("/sys/hypervisor/uuid") &&
-               String(read("/sys/hypervisor/uuid",3)) == "ec2"
+               String(read("/sys/hypervisor/uuid",3)) == "ec2" ||
+               isfile("/sys/devices/virtual/dmi/id/product_uuid")
     end
 
     return false
@@ -169,8 +170,6 @@ for `key`.
 
 function ec2_metadata(key)
 
-    @assert localhost_is_ec2()
-
     String(http_get("http://169.254.169.254/latest/meta-data/$key").body)
 end
 
@@ -185,8 +184,6 @@ for EC2 virtual machine.
 
 function ec2_instance_credentials()
 
-    @assert localhost_is_ec2()
-
     info  = ec2_metadata("iam/info")
     info  = JSON.parse(info, dicttype=Dict{String,String})
 

@jademackay
Copy link

Yes, you are right, that will solve all the problems without having to add any user package code.

@samoconnor
Copy link
Contributor

It looks like 169.254.169.254 does not respond to ping, so that's out.

@samoconnor
Copy link
Contributor

Please try this branch: #33

@jademackay
Copy link

I have successfully gotten credentials (AWSCore.AWSCredentials()) in the following circumstances using your branch localhost_maybe_ec2:

  • On my laptop
  • EC2 m5.xlarge (in docker container)
    And have been unsuccessful using AWSCore prior to your branch on
  • EC2 m5.xlarge (in docker container)

samoconnor added a commit that referenced this issue Mar 5, 2018
Improve EC2 detection to handle new non-Xen c5 and m5 instances, See #24
@samoconnor
Copy link
Contributor

Thanks for testing this @jademackay.
@ianfiske is this issue resolved for you now?

@ianfiske
Copy link
Author

@samoconnor I apologize for the long delay verifying your fix. I just tested the latest release of AWSCore, v0.3.8 both locally and in AWS Batch. Your fix is working great. A simple AWSCore.aws_config() successfully pulls credentials in both cases.

Thanks for the fix!

@samoconnor
Copy link
Contributor

HI @ianfiske, good to hear!
Note that v0.3.8 also has a change to check for an ExpiredTokenException error code (in addition to ExpiredToken) to trigger loading of new credentials. I would be interested to know if you have any long running jobs where the credentials expire, and are they automatically refreshing for you?
On EC2 it seems to take about 8 hours for the credentials to expire.

Another thing to be aware of: Many of the julia AWS API functions now allow you to leave out the 1st argument AWSConfig. Internally they use AWSCore.default_aws_config() in this case. This mechanism ensures that a single global credentials object is shared by all requests and can improve efficiency of credentials refresh.

@ianfiske
Copy link
Author

Hi @samoconnor,
I don't have jobs that run long enough to validate the credential expiration/refreshing functionality. My jobs tend to be around 5-10 minutes at this point.

Thank you for the heads up about the changes to the AWSConfig credential API. That will certainly help me remove some unneeded wrapper code on my application side.

phyatt-corp added a commit to Conning/AWSCore.jl that referenced this issue Jan 14, 2020
Also added check for isfile and isreadable, in case we aren't root.
See issue JuliaCloud#24
bors bot added a commit that referenced this issue Jan 14, 2020
104: Added check for sys_vendor file in localhost_is_ec2(), also added che… r=mattBrzezinski a=phyatt-corp

…ck for isfile and isreadable, in case we aren't root

Included comments for how to use Instance Metadata Service for checking a local_hostname

#24

This adds support for c5 and m5 instance types without root access, and will likely work with any future ec2 instances on Nitro Hypervisor.

Co-authored-by: Peter Hyatt <[email protected]>
bors bot added a commit that referenced this issue Jan 15, 2020
104: Added check for sys_vendor file in localhost_is_ec2 r=mattBrzezinski a=phyatt-corp

…ck for isfile and isreadable, in case we aren't root

Included comments for how to use Instance Metadata Service for checking a local_hostname

#24

This adds support for c5 and m5 instance types without root access, and will likely work with any future ec2 instances on Nitro Hypervisor.

Co-authored-by: Peter Hyatt <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants