Feature/allow holo clients #12951

joelthompson · 2017-03-22T05:16:30Z

This fixes #12704 by falling through on caller checks in case IAMInfo fails, and it refactors some of the testing code to support an added test for this condition in a cleaner fashion.

NOTE: This doesn't address potential regressions when SkipRequestingAccountId check is set, only when using something like AdRoll's Hologram.

This makes the metadata API routes more flexible in terms of which routes it supports, in order to support future commits that will need different metadata responses. It also eliminates the JSON-encoding of the routes representation.

This fixes hashicorp#12704 by having GetAccountInfo continue on to the GetCallerIdentity method of getting account info if the retrieval of account info via the EC2 instance metadata service fails. Not attempting to add resiliency when skip_requesting_account_id is set to true.

radeksimko · 2017-03-22T13:01:44Z

Hi @joelthompson
I'm happy for Terraform to support alternative implementations of the API but I'm not too inclined to break any existing functionality for users that are using the real AWS APIs (most likely majority) in favour of that.

Without going too much into the implementation details (for now) - is there any specific reason Hologram can't implement the iam/info endpoint?

Thanks.

joelthompson · 2017-03-22T13:18:52Z

I haven't asked them, but one simple answer is that the iam/info endpoint returns the arn of the instance profile, while when using hologram, there might not actually be an instance profile (just an assumed role for which you're getting credentials). Sure, an instance profile could be fabricated, but that seems like the dirtier solution than just falling through to sts:GetCallerIdentity as I'm proposing here.

And this shouldn't break any existing functionality for any users. It just adds a fall back in case anything existing fails.

I'm also curious why you labeled this an enhancement but the issue it fixes, #12704, was labeled (correctly, IMHO) as a bug.

copumpkin · 2017-04-05T17:45:02Z

Can we get this merged? Not having it is causing me and a few others quite a bit of pain, since I need to compile my own Terraform with this patch

akazakov · 2017-04-05T17:54:29Z

👍

joelthompson · 2017-04-14T05:05:39Z

@radeksimko -- are there any plans to address this?

joelthompson · 2017-04-27T21:59:43Z

/cc @catsby since you approved and merged #13992 which created a merge conflict with this PR. I'm more than happy to fix the merge conflict, but only if there's a likelihood that it'll get merged. Is this something you or @radeksimko would be willing to merge? If not, can somebody at least explain why not?

…clients

joelthompson · 2017-05-25T05:27:28Z

I've fixed the merge conflict, and I'd really appreciate a look at this, thanks! :)

radeksimko

Hi @joelthompson
sorry for the delay in reviewing this.

I'm generally not in favour of patches that change (and potentially break) the behaviour for users which use the real API. It is our priority to make it work well with the real AWS API.

I know the chance of breaking things here is low, but I'm not convinced it's a good enough justification.

With that said I think there's a (cleaner) solution and that's patching Holo to expose such "fake" endpoint. As Holo is aiming to act like the real API I believe that they'd be happy to accept a patch which brings them closer to the API, don't you think?

radeksimko · 2017-05-25T05:25:16Z

builtin/providers/aws/auth_helpers.go

+		// the other methods available.
+		// If we have creds from something that looks like an IAM instance profile, but
+		// we were unable to retrieve account info from the instance profile, it's probably
+		// a safe assumption that we're not an IAM user


I don't think this is a safe assumption we can make. The metadata API may be unavailable or broken and we'd like to inform the user in such case rather than falling through.

Why?

As a user who has experienced occasional errors that might or might not be attributable to the instance metadata service being flaky, I can say with confidence that I have never once said, "I'm glad that my process has died due to the instance metadata service returning an error instead of trying to recover through some alternative means." These types of errors are fairly infrequent and highly inconsistent and flaky (i.e., the worst kind to debug), so I'd always prefer something try to fall through.

Also, if you assume that AWS will never lie to you via the instance metadata service (which is a different assumption than the instance metadata service being always available), and that AWS will never let you map an instance profile to an IAM role in a different account from the instance profile, then there is NEVER a case where falling through to GetCallerIdentity is wrong. The worst case is that you'll get a different error because a node doesn't have network connectivity to the STS API endpoints (which seems extremely unlikely to me to be a problem because that node will need access to AWS API endpoints to make AWS API calls as part of TF provisioning), and the real error will be as extremely hard to track down as it is today (because errors here are basically swallowed up silently). Even if you invalidate the assumptions I'm making here, I would argue that GetCallerIdentity is a better source of truth than the EC2 instance metadata service because GetCallerIdentity tells you information about the IAM principal actually making the calls, not making a number of inferences.

And in the case of the particular bug I'm trying to fix, you're not even actually informing users of the error. Instead, you're letting obscure errors like being unable to destroy an SNS topic policy propagate to the end user. So really the choice is between letting very obscure and hard-to-understand error messages make their ways to users, or to try to recover first before letting those very same error messages make their way to users. It seems pretty clear to me that the latter is the right behavior.

joelthompson

I'm generally not in favour of patches that change (and potentially break) the behaviour for users which use the real API.... I know the chance of breaking things here is low, but I'm not convinced it's a good enough justification.

The chance of breaking anything that's not already broken is zero. I'm merely adding a recovery attempt in an existing error path. How is that dangerous? How can that break anything?

With that said I think there's a (cleaner) solution and that's patching Holo to expose such "fake" endpoint. As Holo is aiming to act like the real API I believe that they'd be happy to accept a patch which brings them closer to the API, don't you think?

No, I don't think so. As I explained earlier, the instance metadata service exposes the ARN of the instance profile. (Note that it's important to distinguish it from the IAM role ARN -- the instance profile ARN being a mapping to an IAM role ARN.) So, the question is, "What is the ARN of the instance profile hologram should expose?". There is no good answer because there just isn't an instance profile. Hologram just gives you IAM role credentials with no instance profile involved. If hologram tried to expose the ARN of a non-existent instance profile, it would break things that expect whatever is returned by this endpoint to be an actual instance profile ARN.

joelthompson · 2017-05-25T06:32:46Z

builtin/providers/aws/auth_helpers.go

+		// the other methods available.
+		// If we have creds from something that looks like an IAM instance profile, but
+		// we were unable to retrieve account info from the instance profile, it's probably
+		// a safe assumption that we're not an IAM user


Why?

As a user who has experienced occasional errors that might or might not be attributable to the instance metadata service being flaky, I can say with confidence that I have never once said, "I'm glad that my process has died due to the instance metadata service returning an error instead of trying to recover through some alternative means." These types of errors are fairly infrequent and highly inconsistent and flaky (i.e., the worst kind to debug), so I'd always prefer something try to fall through.

Also, if you assume that AWS will never lie to you via the instance metadata service (which is a different assumption than the instance metadata service being always available), and that AWS will never let you map an instance profile to an IAM role in a different account from the instance profile, then there is NEVER a case where falling through to GetCallerIdentity is wrong. The worst case is that you'll get a different error because a node doesn't have network connectivity to the STS API endpoints (which seems extremely unlikely to me to be a problem because that node will need access to AWS API endpoints to make AWS API calls as part of TF provisioning), and the real error will be as extremely hard to track down as it is today (because errors here are basically swallowed up silently). Even if you invalidate the assumptions I'm making here, I would argue that GetCallerIdentity is a better source of truth than the EC2 instance metadata service because GetCallerIdentity tells you information about the IAM principal actually making the calls, not making a number of inferences.

And in the case of the particular bug I'm trying to fix, you're not even actually informing users of the error. Instead, you're letting obscure errors like being unable to destroy an SNS topic policy propagate to the end user. So really the choice is between letting very obscure and hard-to-understand error messages make their ways to users, or to try to recover first before letting those very same error messages make their way to users. It seems pretty clear to me that the latter is the right behavior.

radeksimko · 2017-05-27T13:01:08Z

Hi Joel,
I really appreciate the time you spent implementing the PR and then waiting for the feedback. 🙈

I admit you're right in your arguments about near zero chance of breaking things.

Nonetheless I discussed this with another maintainer this week and we agreed that the best place to fix this is upstream (that is Hologram).

Hologram aims to imitate the real API (using their own words) and it does already expose certain fake data, like instance ID, AZ name, DNS server hostname, role name, etc.:
https://github.com/AdRoll/hologram/blob/master/agent/metadata_service.go#L94-L123

So exposing fake IAM instance role is in line with the current implementation.

In the worst case scenario the name of the instance profile could be configurable the same way as the the role and default to something sensible - e.g. role name. Per AWS docs:

An instance profile can contain only one role, and this limit cannot be increased.

This 1-1 mapping makes you name your instance profile according to the role name that is assigned to it anyway.

I would argue that GetCallerIdentity is a better source of truth than the EC2 instance metadata service because GetCallerIdentity tells you information about the IAM principal actually making the calls, not making a number of inferences.

That's true and the API method didn't exist when we were initially implementing this. It was carefully added later on as the last option. The expectation here however is that if we're running on EC2 we expect the metadata API to be available. If it is just flaky it's up to the upstream SDK to deal with such problem, like it already does with many other APIs. All the API methods may fail for number of different reasons, and yes - sometimes it does make sense to just fall through to another method. This however means the current implementation is already quite complex and if we can fix this problem elsewhere, we should - merely to avoid increased complexity.

TL;DR We don't think Terraform is the right place to fix this.

joelthompson · 2017-05-31T14:56:54Z

Hi @radeksimko -- technical arguments aside, just to be clear, I'm not trying to add a new feature. I'm trying to fix a regression introduced in #11359 and which I reported in #12704. Are you saying that you're OK with introducing regressions and breaking existing customers, then refusing a PR to fix the regressions? Because that's what's going on here, and it's a pretty poor way to treat customers.

I'd be open to other ways of fixing #12704 (e.g., I think setting a default partition of aws if GetAccountInfo fails would work, but I haven't tested it, and adding special cases increases complexity); if you can suggest an alternate implementation that you'd be willing to merge, I would probably be happy to write the code for it and open a PR.

The expectation here however is that if we're running on EC2 we expect the metadata API to be available.

Sure, but the reverse isn't true -- just because you have credentials at 169.254.169.254 doesn't necessarily mean you're running on an EC2 instance.

This however means the current implementation is already quite complex and if we can fix this problem elsewhere, we should - merely to avoid increased complexity.

I don't think my code increases the complexity -- it keeps the actual number of lines of code the same, except in the testing where I add a test for this case. In fact, this seems like the least complex way of fixing the regression in #12704.

radeksimko · 2017-06-05T06:03:47Z

Hi @joelthompson
I took some time to discuss this with other team members, once again - to avoid any potential biases and make sure we collectively agree on the solution of this problem.

I came to the conclusion that it's 👌 to pull this in after the discussion with a few tweaks:

We'd like to avoid loosing errors if things do fail - if we do return an error in any stage (sts, iam, metadata API) it should contain all errors - we can use go-multierror to collect those errors.
To make testing code a bit more readable would you mind avoiding function with 3 bool arguments? e.g. we could expose each route as constant and then pass slice of those constants as a single argument. That way it will be clearer what endpoints are being used for any given test without looking at the implementation.

awsTs := awsEnv(generateMetadataApiRoutes([]*endpoint{
    instanceIdEndpoint,
    iamInfoEndpoint,
    securityCredentialsEndpoint,
}))

After those 2 things are addressed I'm happy to merge it.

With all that said as maintainers we cannot guarantee Terraform will keep working with a 3rd party tool like Hologram in the future, we might introduce some other dependencies on the metadata API structure in some other contexts and that is why we agreed that the problem should be solved upstream either way. If you want to avoid any similar disruption in the future we'd highly recommend sending PR to Hologram.

joelthompson · 2017-06-06T04:30:01Z

Thanks @radeksimko! I'll update this PR in a couple days with your tweaks, as they seem like great additions.

As best as I understand your position on future changes, you're saying that future changes may break hologram clients because you don't explicitly test against hologram nor do you explicitly consider these clients, that you're willing to accept PRs which don't increase the code complexity to support hologram to fix previous breaks, but that you don't want to bend over backwards to support hologram. If so, that seems completely reasonable to me, thanks!

I'd also like to add that none of what we've discussed fixes the symptoms of the issue where skip_requesting_account_id is set. I believe that's still a regression, but one I'd be happy to work with you on fixing :)

radeksimko · 2017-06-06T05:32:09Z

future changes may break hologram clients because you don't explicitly test against hologram nor do you explicitly consider these clients, that you're willing to accept PRs which don't increase the code complexity to support hologram to fix previous breaks, but that you don't want to bend over backwards to support hologram

Yes, you're right in all points here. Just for completeness we did discuss official support & testing in the past: #6535

joelthompson · 2017-06-13T07:24:35Z

OK, I think I've addressed your feedback, thanks!

joelthompson · 2017-06-13T07:57:27Z

Looks like I need to move to the terraform-providers/terraform-provider-aws repo, so I've opened a new PR in hashicorp/terraform-provider-aws#17 Please confirm that's the right place to continue this discussion, and then feel free to close this.

This is a port of hashicorp/terraform#12951 into the new repository. Partially fixes hashicorp/terraform#12704 in the case of hologram clients, but doesn't fix the regression when SkipRequestingAccountId is set.

* Fix handling of AdRoll's hologram clients This is a port of hashicorp/terraform#12951 into the new repository. Partially fixes hashicorp/terraform#12704 in the case of hologram clients, but doesn't fix the regression when SkipRequestingAccountId is set. * Refactoring of tests

ghost · 2020-04-08T02:36:10Z

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

joelthompson added 2 commits March 17, 2017 01:50

Refactor aws auth_helpers_test metadata api routes

b83450b

This makes the metadata API routes more flexible in terms of which routes it supports, in order to support future commits that will need different metadata responses. It also eliminates the JSON-encoding of the routes representation.

radeksimko added enhancement provider/aws labels Mar 22, 2017

radeksimko added the waiting-response An issue/pull request is waiting for a response from the community label Mar 22, 2017

paddycarver removed the waiting-response An issue/pull request is waiting for a response from the community label May 8, 2017

Merge remote-tracking branch 'origin/master' into feature/allow_holo_…

802f795

…clients

radeksimko requested changes May 25, 2017

View reviewed changes

joelthompson commented May 25, 2017

View reviewed changes

radeksimko self-assigned this Jun 5, 2017

radeksimko added the waiting-response An issue/pull request is waiting for a response from the community label Jun 5, 2017

Respond to PR feedback

4d1b69c

joelthompson mentioned this pull request Jun 13, 2017

Support for iam/info endpoint? AdRoll/hologram#90

Open

joelthompson mentioned this pull request Jun 13, 2017

Fix handling of AdRoll's hologram clients hashicorp/terraform-provider-aws#17

Merged

radeksimko closed this Jun 15, 2017

ghost locked and limited conversation to collaborators Apr 8, 2020

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Feature/allow holo clients #12951

Feature/allow holo clients #12951

joelthompson commented Mar 22, 2017

radeksimko commented Mar 22, 2017

joelthompson commented Mar 22, 2017

copumpkin commented Apr 5, 2017

akazakov commented Apr 5, 2017

joelthompson commented Apr 14, 2017

joelthompson commented Apr 27, 2017

joelthompson commented May 25, 2017

radeksimko left a comment

radeksimko May 25, 2017

joelthompson May 25, 2017

joelthompson left a comment

joelthompson May 25, 2017

radeksimko commented May 27, 2017

joelthompson commented May 31, 2017

radeksimko commented Jun 5, 2017

joelthompson commented Jun 6, 2017

radeksimko commented Jun 6, 2017

joelthompson commented Jun 13, 2017

joelthompson commented Jun 13, 2017

ghost commented Apr 8, 2020

Feature/allow holo clients #12951

Feature/allow holo clients #12951

Conversation

joelthompson commented Mar 22, 2017

radeksimko commented Mar 22, 2017

joelthompson commented Mar 22, 2017

copumpkin commented Apr 5, 2017

akazakov commented Apr 5, 2017

joelthompson commented Apr 14, 2017

joelthompson commented Apr 27, 2017

joelthompson commented May 25, 2017

radeksimko left a comment

Choose a reason for hiding this comment

radeksimko May 25, 2017

Choose a reason for hiding this comment

joelthompson May 25, 2017

Choose a reason for hiding this comment

joelthompson left a comment

Choose a reason for hiding this comment

joelthompson May 25, 2017

Choose a reason for hiding this comment

radeksimko commented May 27, 2017

joelthompson commented May 31, 2017

radeksimko commented Jun 5, 2017

joelthompson commented Jun 6, 2017

radeksimko commented Jun 6, 2017

joelthompson commented Jun 13, 2017

joelthompson commented Jun 13, 2017

ghost commented Apr 8, 2020