Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature/allow holo clients #12951

Closed

Conversation

joelthompson
Copy link
Contributor

This fixes #12704 by falling through on caller checks in case IAMInfo fails, and it refactors some of the testing code to support an added test for this condition in a cleaner fashion.

NOTE: This doesn't address potential regressions when SkipRequestingAccountId check is set, only when using something like AdRoll's Hologram.

This makes the metadata API routes more flexible in terms of which
routes it supports, in order to support future commits that will need
different metadata responses. It also eliminates the JSON-encoding of
the routes representation.
This fixes hashicorp#12704 by having GetAccountInfo continue on to the
GetCallerIdentity method of getting account info if the retrieval of
account info via the EC2 instance metadata service fails.  Not
attempting to add resiliency when skip_requesting_account_id is set to
true.
@radeksimko
Copy link
Member

Hi @joelthompson
I'm happy for Terraform to support alternative implementations of the API but I'm not too inclined to break any existing functionality for users that are using the real AWS APIs (most likely majority) in favour of that.

Without going too much into the implementation details (for now) - is there any specific reason Hologram can't implement the iam/info endpoint?

Thanks.

@radeksimko radeksimko added the waiting-response An issue/pull request is waiting for a response from the community label Mar 22, 2017
@joelthompson
Copy link
Contributor Author

I haven't asked them, but one simple answer is that the iam/info endpoint returns the arn of the instance profile, while when using hologram, there might not actually be an instance profile (just an assumed role for which you're getting credentials). Sure, an instance profile could be fabricated, but that seems like the dirtier solution than just falling through to sts:GetCallerIdentity as I'm proposing here.

And this shouldn't break any existing functionality for any users. It just adds a fall back in case anything existing fails.

I'm also curious why you labeled this an enhancement but the issue it fixes, #12704, was labeled (correctly, IMHO) as a bug.

@copumpkin
Copy link

Can we get this merged? Not having it is causing me and a few others quite a bit of pain, since I need to compile my own Terraform with this patch

@akazakov
Copy link

akazakov commented Apr 5, 2017

👍

@joelthompson
Copy link
Contributor Author

@radeksimko -- are there any plans to address this?

@joelthompson
Copy link
Contributor Author

/cc @catsby since you approved and merged #13992 which created a merge conflict with this PR. I'm more than happy to fix the merge conflict, but only if there's a likelihood that it'll get merged. Is this something you or @radeksimko would be willing to merge? If not, can somebody at least explain why not?

@paddycarver paddycarver removed the waiting-response An issue/pull request is waiting for a response from the community label May 8, 2017
@joelthompson
Copy link
Contributor Author

I've fixed the merge conflict, and I'd really appreciate a look at this, thanks! :)

Copy link
Member

@radeksimko radeksimko left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @joelthompson
sorry for the delay in reviewing this.

I'm generally not in favour of patches that change (and potentially break) the behaviour for users which use the real API. It is our priority to make it work well with the real AWS API.

I know the chance of breaking things here is low, but I'm not convinced it's a good enough justification.

With that said I think there's a (cleaner) solution and that's patching Holo to expose such "fake" endpoint. As Holo is aiming to act like the real API I believe that they'd be happy to accept a patch which brings them closer to the API, don't you think?

// the other methods available.
// If we have creds from something that looks like an IAM instance profile, but
// we were unable to retrieve account info from the instance profile, it's probably
// a safe assumption that we're not an IAM user
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is a safe assumption we can make. The metadata API may be unavailable or broken and we'd like to inform the user in such case rather than falling through.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

As a user who has experienced occasional errors that might or might not be attributable to the instance metadata service being flaky, I can say with confidence that I have never once said, "I'm glad that my process has died due to the instance metadata service returning an error instead of trying to recover through some alternative means." These types of errors are fairly infrequent and highly inconsistent and flaky (i.e., the worst kind to debug), so I'd always prefer something try to fall through.

Also, if you assume that AWS will never lie to you via the instance metadata service (which is a different assumption than the instance metadata service being always available), and that AWS will never let you map an instance profile to an IAM role in a different account from the instance profile, then there is NEVER a case where falling through to GetCallerIdentity is wrong. The worst case is that you'll get a different error because a node doesn't have network connectivity to the STS API endpoints (which seems extremely unlikely to me to be a problem because that node will need access to AWS API endpoints to make AWS API calls as part of TF provisioning), and the real error will be as extremely hard to track down as it is today (because errors here are basically swallowed up silently). Even if you invalidate the assumptions I'm making here, I would argue that GetCallerIdentity is a better source of truth than the EC2 instance metadata service because GetCallerIdentity tells you information about the IAM principal actually making the calls, not making a number of inferences.

And in the case of the particular bug I'm trying to fix, you're not even actually informing users of the error. Instead, you're letting obscure errors like being unable to destroy an SNS topic policy propagate to the end user. So really the choice is between letting very obscure and hard-to-understand error messages make their ways to users, or to try to recover first before letting those very same error messages make their way to users. It seems pretty clear to me that the latter is the right behavior.

Copy link
Contributor Author

@joelthompson joelthompson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm generally not in favour of patches that change (and potentially break) the behaviour for users which use the real API.... I know the chance of breaking things here is low, but I'm not convinced it's a good enough justification.

The chance of breaking anything that's not already broken is zero. I'm merely adding a recovery attempt in an existing error path. How is that dangerous? How can that break anything?

With that said I think there's a (cleaner) solution and that's patching Holo to expose such "fake" endpoint. As Holo is aiming to act like the real API I believe that they'd be happy to accept a patch which brings them closer to the API, don't you think?

No, I don't think so. As I explained earlier, the instance metadata service exposes the ARN of the instance profile. (Note that it's important to distinguish it from the IAM role ARN -- the instance profile ARN being a mapping to an IAM role ARN.) So, the question is, "What is the ARN of the instance profile hologram should expose?". There is no good answer because there just isn't an instance profile. Hologram just gives you IAM role credentials with no instance profile involved. If hologram tried to expose the ARN of a non-existent instance profile, it would break things that expect whatever is returned by this endpoint to be an actual instance profile ARN.

// the other methods available.
// If we have creds from something that looks like an IAM instance profile, but
// we were unable to retrieve account info from the instance profile, it's probably
// a safe assumption that we're not an IAM user
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why?

As a user who has experienced occasional errors that might or might not be attributable to the instance metadata service being flaky, I can say with confidence that I have never once said, "I'm glad that my process has died due to the instance metadata service returning an error instead of trying to recover through some alternative means." These types of errors are fairly infrequent and highly inconsistent and flaky (i.e., the worst kind to debug), so I'd always prefer something try to fall through.

Also, if you assume that AWS will never lie to you via the instance metadata service (which is a different assumption than the instance metadata service being always available), and that AWS will never let you map an instance profile to an IAM role in a different account from the instance profile, then there is NEVER a case where falling through to GetCallerIdentity is wrong. The worst case is that you'll get a different error because a node doesn't have network connectivity to the STS API endpoints (which seems extremely unlikely to me to be a problem because that node will need access to AWS API endpoints to make AWS API calls as part of TF provisioning), and the real error will be as extremely hard to track down as it is today (because errors here are basically swallowed up silently). Even if you invalidate the assumptions I'm making here, I would argue that GetCallerIdentity is a better source of truth than the EC2 instance metadata service because GetCallerIdentity tells you information about the IAM principal actually making the calls, not making a number of inferences.

And in the case of the particular bug I'm trying to fix, you're not even actually informing users of the error. Instead, you're letting obscure errors like being unable to destroy an SNS topic policy propagate to the end user. So really the choice is between letting very obscure and hard-to-understand error messages make their ways to users, or to try to recover first before letting those very same error messages make their way to users. It seems pretty clear to me that the latter is the right behavior.

@radeksimko
Copy link
Member

Hi Joel,
I really appreciate the time you spent implementing the PR and then waiting for the feedback. 🙈

I admit you're right in your arguments about near zero chance of breaking things.

Nonetheless I discussed this with another maintainer this week and we agreed that the best place to fix this is upstream (that is Hologram).

Hologram aims to imitate the real API (using their own words) and it does already expose certain fake data, like instance ID, AZ name, DNS server hostname, role name, etc.:
https://github.com/AdRoll/hologram/blob/master/agent/metadata_service.go#L94-L123

So exposing fake IAM instance role is in line with the current implementation.

In the worst case scenario the name of the instance profile could be configurable the same way as the the role and default to something sensible - e.g. role name. Per AWS docs:

An instance profile can contain only one role, and this limit cannot be increased.

This 1-1 mapping makes you name your instance profile according to the role name that is assigned to it anyway.

I would argue that GetCallerIdentity is a better source of truth than the EC2 instance metadata service because GetCallerIdentity tells you information about the IAM principal actually making the calls, not making a number of inferences.

That's true and the API method didn't exist when we were initially implementing this. It was carefully added later on as the last option. The expectation here however is that if we're running on EC2 we expect the metadata API to be available. If it is just flaky it's up to the upstream SDK to deal with such problem, like it already does with many other APIs. All the API methods may fail for number of different reasons, and yes - sometimes it does make sense to just fall through to another method. This however means the current implementation is already quite complex and if we can fix this problem elsewhere, we should - merely to avoid increased complexity.

TL;DR We don't think Terraform is the right place to fix this.

@joelthompson
Copy link
Contributor Author

Hi @radeksimko -- technical arguments aside, just to be clear, I'm not trying to add a new feature. I'm trying to fix a regression introduced in #11359 and which I reported in #12704. Are you saying that you're OK with introducing regressions and breaking existing customers, then refusing a PR to fix the regressions? Because that's what's going on here, and it's a pretty poor way to treat customers.

I'd be open to other ways of fixing #12704 (e.g., I think setting a default partition of aws if GetAccountInfo fails would work, but I haven't tested it, and adding special cases increases complexity); if you can suggest an alternate implementation that you'd be willing to merge, I would probably be happy to write the code for it and open a PR.

The expectation here however is that if we're running on EC2 we expect the metadata API to be available.

Sure, but the reverse isn't true -- just because you have credentials at 169.254.169.254 doesn't necessarily mean you're running on an EC2 instance.

This however means the current implementation is already quite complex and if we can fix this problem elsewhere, we should - merely to avoid increased complexity.

I don't think my code increases the complexity -- it keeps the actual number of lines of code the same, except in the testing where I add a test for this case. In fact, this seems like the least complex way of fixing the regression in #12704.

@radeksimko
Copy link
Member

Hi @joelthompson
I took some time to discuss this with other team members, once again - to avoid any potential biases and make sure we collectively agree on the solution of this problem.

I came to the conclusion that it's 👌 to pull this in after the discussion with a few tweaks:

  • We'd like to avoid loosing errors if things do fail - if we do return an error in any stage (sts, iam, metadata API) it should contain all errors - we can use go-multierror to collect those errors.
  • To make testing code a bit more readable would you mind avoiding function with 3 bool arguments? e.g. we could expose each route as constant and then pass slice of those constants as a single argument. That way it will be clearer what endpoints are being used for any given test without looking at the implementation.
awsTs := awsEnv(generateMetadataApiRoutes([]*endpoint{
    instanceIdEndpoint,
    iamInfoEndpoint,
    securityCredentialsEndpoint,
}))

After those 2 things are addressed I'm happy to merge it.


With all that said as maintainers we cannot guarantee Terraform will keep working with a 3rd party tool like Hologram in the future, we might introduce some other dependencies on the metadata API structure in some other contexts and that is why we agreed that the problem should be solved upstream either way. If you want to avoid any similar disruption in the future we'd highly recommend sending PR to Hologram.

@radeksimko radeksimko self-assigned this Jun 5, 2017
@radeksimko radeksimko added the waiting-response An issue/pull request is waiting for a response from the community label Jun 5, 2017
@joelthompson
Copy link
Contributor Author

Thanks @radeksimko! I'll update this PR in a couple days with your tweaks, as they seem like great additions.

As best as I understand your position on future changes, you're saying that future changes may break hologram clients because you don't explicitly test against hologram nor do you explicitly consider these clients, that you're willing to accept PRs which don't increase the code complexity to support hologram to fix previous breaks, but that you don't want to bend over backwards to support hologram. If so, that seems completely reasonable to me, thanks!

I'd also like to add that none of what we've discussed fixes the symptoms of the issue where skip_requesting_account_id is set. I believe that's still a regression, but one I'd be happy to work with you on fixing :)

@radeksimko
Copy link
Member

future changes may break hologram clients because you don't explicitly test against hologram nor do you explicitly consider these clients, that you're willing to accept PRs which don't increase the code complexity to support hologram to fix previous breaks, but that you don't want to bend over backwards to support hologram

Yes, you're right in all points here. Just for completeness we did discuss official support & testing in the past: #6535

@joelthompson
Copy link
Contributor Author

OK, I think I've addressed your feedback, thanks!

@joelthompson
Copy link
Contributor Author

Looks like I need to move to the terraform-providers/terraform-provider-aws repo, so I've opened a new PR in hashicorp/terraform-provider-aws#17 Please confirm that's the right place to continue this discussion, and then feel free to close this.

joelthompson added a commit to joelthompson/terraform-provider-aws that referenced this pull request Jun 15, 2017
This is a port of hashicorp/terraform#12951 into the new repository.

Partially fixes hashicorp/terraform#12704 in the case of hologram
clients, but doesn't fix the regression when SkipRequestingAccountId is
set.
radeksimko pushed a commit to hashicorp/terraform-provider-aws that referenced this pull request Jun 15, 2017
* Fix handling of AdRoll's hologram clients

This is a port of hashicorp/terraform#12951 into the new repository.

Partially fixes hashicorp/terraform#12704 in the case of hologram
clients, but doesn't fix the regression when SkipRequestingAccountId is
set.

* Refactoring of tests
@radeksimko radeksimko closed this Jun 15, 2017
bflad pushed a commit to hashicorp/aws-sdk-go-base that referenced this pull request Feb 18, 2019
* Fix handling of AdRoll's hologram clients

This is a port of hashicorp/terraform#12951 into the new repository.

Partially fixes hashicorp/terraform#12704 in the case of hologram
clients, but doesn't fix the regression when SkipRequestingAccountId is
set.

* Refactoring of tests
@ghost
Copy link

ghost commented Apr 8, 2020

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@ghost ghost locked and limited conversation to collaborators Apr 8, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
enhancement provider/aws waiting-response An issue/pull request is waiting for a response from the community
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AWSClient.partition and AWSClient.accountid not set on clients using AdRoll's hologram
5 participants