-
Notifications
You must be signed in to change notification settings - Fork 9.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Feature/allow holo clients #12951
Feature/allow holo clients #12951
Conversation
This makes the metadata API routes more flexible in terms of which routes it supports, in order to support future commits that will need different metadata responses. It also eliminates the JSON-encoding of the routes representation.
This fixes hashicorp#12704 by having GetAccountInfo continue on to the GetCallerIdentity method of getting account info if the retrieval of account info via the EC2 instance metadata service fails. Not attempting to add resiliency when skip_requesting_account_id is set to true.
Hi @joelthompson Without going too much into the implementation details (for now) - is there any specific reason Hologram can't implement the Thanks. |
I haven't asked them, but one simple answer is that the And this shouldn't break any existing functionality for any users. It just adds a fall back in case anything existing fails. I'm also curious why you labeled this an enhancement but the issue it fixes, #12704, was labeled (correctly, IMHO) as a bug. |
Can we get this merged? Not having it is causing me and a few others quite a bit of pain, since I need to compile my own Terraform with this patch |
👍 |
@radeksimko -- are there any plans to address this? |
/cc @catsby since you approved and merged #13992 which created a merge conflict with this PR. I'm more than happy to fix the merge conflict, but only if there's a likelihood that it'll get merged. Is this something you or @radeksimko would be willing to merge? If not, can somebody at least explain why not? |
I've fixed the merge conflict, and I'd really appreciate a look at this, thanks! :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @joelthompson
sorry for the delay in reviewing this.
I'm generally not in favour of patches that change (and potentially break) the behaviour for users which use the real API. It is our priority to make it work well with the real AWS API.
I know the chance of breaking things here is low, but I'm not convinced it's a good enough justification.
With that said I think there's a (cleaner) solution and that's patching Holo to expose such "fake" endpoint. As Holo is aiming to act like the real API I believe that they'd be happy to accept a patch which brings them closer to the API, don't you think?
// the other methods available. | ||
// If we have creds from something that looks like an IAM instance profile, but | ||
// we were unable to retrieve account info from the instance profile, it's probably | ||
// a safe assumption that we're not an IAM user |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is a safe assumption we can make. The metadata API may be unavailable or broken and we'd like to inform the user in such case rather than falling through.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
As a user who has experienced occasional errors that might or might not be attributable to the instance metadata service being flaky, I can say with confidence that I have never once said, "I'm glad that my process has died due to the instance metadata service returning an error instead of trying to recover through some alternative means." These types of errors are fairly infrequent and highly inconsistent and flaky (i.e., the worst kind to debug), so I'd always prefer something try to fall through.
Also, if you assume that AWS will never lie to you via the instance metadata service (which is a different assumption than the instance metadata service being always available), and that AWS will never let you map an instance profile to an IAM role in a different account from the instance profile, then there is NEVER a case where falling through to GetCallerIdentity is wrong. The worst case is that you'll get a different error because a node doesn't have network connectivity to the STS API endpoints (which seems extremely unlikely to me to be a problem because that node will need access to AWS API endpoints to make AWS API calls as part of TF provisioning), and the real error will be as extremely hard to track down as it is today (because errors here are basically swallowed up silently). Even if you invalidate the assumptions I'm making here, I would argue that GetCallerIdentity is a better source of truth than the EC2 instance metadata service because GetCallerIdentity tells you information about the IAM principal actually making the calls, not making a number of inferences.
And in the case of the particular bug I'm trying to fix, you're not even actually informing users of the error. Instead, you're letting obscure errors like being unable to destroy an SNS topic policy propagate to the end user. So really the choice is between letting very obscure and hard-to-understand error messages make their ways to users, or to try to recover first before letting those very same error messages make their way to users. It seems pretty clear to me that the latter is the right behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm generally not in favour of patches that change (and potentially break) the behaviour for users which use the real API.... I know the chance of breaking things here is low, but I'm not convinced it's a good enough justification.
The chance of breaking anything that's not already broken is zero. I'm merely adding a recovery attempt in an existing error path. How is that dangerous? How can that break anything?
With that said I think there's a (cleaner) solution and that's patching Holo to expose such "fake" endpoint. As Holo is aiming to act like the real API I believe that they'd be happy to accept a patch which brings them closer to the API, don't you think?
No, I don't think so. As I explained earlier, the instance metadata service exposes the ARN of the instance profile. (Note that it's important to distinguish it from the IAM role ARN -- the instance profile ARN being a mapping to an IAM role ARN.) So, the question is, "What is the ARN of the instance profile hologram should expose?". There is no good answer because there just isn't an instance profile. Hologram just gives you IAM role credentials with no instance profile involved. If hologram tried to expose the ARN of a non-existent instance profile, it would break things that expect whatever is returned by this endpoint to be an actual instance profile ARN.
// the other methods available. | ||
// If we have creds from something that looks like an IAM instance profile, but | ||
// we were unable to retrieve account info from the instance profile, it's probably | ||
// a safe assumption that we're not an IAM user |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why?
As a user who has experienced occasional errors that might or might not be attributable to the instance metadata service being flaky, I can say with confidence that I have never once said, "I'm glad that my process has died due to the instance metadata service returning an error instead of trying to recover through some alternative means." These types of errors are fairly infrequent and highly inconsistent and flaky (i.e., the worst kind to debug), so I'd always prefer something try to fall through.
Also, if you assume that AWS will never lie to you via the instance metadata service (which is a different assumption than the instance metadata service being always available), and that AWS will never let you map an instance profile to an IAM role in a different account from the instance profile, then there is NEVER a case where falling through to GetCallerIdentity is wrong. The worst case is that you'll get a different error because a node doesn't have network connectivity to the STS API endpoints (which seems extremely unlikely to me to be a problem because that node will need access to AWS API endpoints to make AWS API calls as part of TF provisioning), and the real error will be as extremely hard to track down as it is today (because errors here are basically swallowed up silently). Even if you invalidate the assumptions I'm making here, I would argue that GetCallerIdentity is a better source of truth than the EC2 instance metadata service because GetCallerIdentity tells you information about the IAM principal actually making the calls, not making a number of inferences.
And in the case of the particular bug I'm trying to fix, you're not even actually informing users of the error. Instead, you're letting obscure errors like being unable to destroy an SNS topic policy propagate to the end user. So really the choice is between letting very obscure and hard-to-understand error messages make their ways to users, or to try to recover first before letting those very same error messages make their way to users. It seems pretty clear to me that the latter is the right behavior.
Hi Joel, I admit you're right in your arguments about near zero chance of breaking things. Nonetheless I discussed this with another maintainer this week and we agreed that the best place to fix this is upstream (that is Hologram). Hologram aims to imitate the real API (using their own words) and it does already expose certain fake data, like instance ID, AZ name, DNS server hostname, role name, etc.: So exposing fake IAM instance role is in line with the current implementation. In the worst case scenario the name of the instance profile could be configurable the same way as the the role and default to something sensible - e.g. role name. Per AWS docs:
This 1-1 mapping makes you name your instance profile according to the role name that is assigned to it anyway.
That's true and the API method didn't exist when we were initially implementing this. It was carefully added later on as the last option. The expectation here however is that if we're running on EC2 we expect the metadata API to be available. If it is just flaky it's up to the upstream SDK to deal with such problem, like it already does with many other APIs. All the API methods may fail for number of different reasons, and yes - sometimes it does make sense to just fall through to another method. This however means the current implementation is already quite complex and if we can fix this problem elsewhere, we should - merely to avoid increased complexity. TL;DR We don't think Terraform is the right place to fix this. |
Hi @radeksimko -- technical arguments aside, just to be clear, I'm not trying to add a new feature. I'm trying to fix a regression introduced in #11359 and which I reported in #12704. Are you saying that you're OK with introducing regressions and breaking existing customers, then refusing a PR to fix the regressions? Because that's what's going on here, and it's a pretty poor way to treat customers. I'd be open to other ways of fixing #12704 (e.g., I think setting a default partition of
Sure, but the reverse isn't true -- just because you have credentials at 169.254.169.254 doesn't necessarily mean you're running on an EC2 instance.
I don't think my code increases the complexity -- it keeps the actual number of lines of code the same, except in the testing where I add a test for this case. In fact, this seems like the least complex way of fixing the regression in #12704. |
Hi @joelthompson I came to the conclusion that it's 👌 to pull this in after the discussion with a few tweaks:
awsTs := awsEnv(generateMetadataApiRoutes([]*endpoint{
instanceIdEndpoint,
iamInfoEndpoint,
securityCredentialsEndpoint,
})) After those 2 things are addressed I'm happy to merge it. With all that said as maintainers we cannot guarantee Terraform will keep working with a 3rd party tool like Hologram in the future, we might introduce some other dependencies on the metadata API structure in some other contexts and that is why we agreed that the problem should be solved upstream either way. If you want to avoid any similar disruption in the future we'd highly recommend sending PR to Hologram. |
Thanks @radeksimko! I'll update this PR in a couple days with your tweaks, as they seem like great additions. As best as I understand your position on future changes, you're saying that future changes may break hologram clients because you don't explicitly test against hologram nor do you explicitly consider these clients, that you're willing to accept PRs which don't increase the code complexity to support hologram to fix previous breaks, but that you don't want to bend over backwards to support hologram. If so, that seems completely reasonable to me, thanks! I'd also like to add that none of what we've discussed fixes the symptoms of the issue where |
Yes, you're right in all points here. Just for completeness we did discuss official support & testing in the past: #6535 |
OK, I think I've addressed your feedback, thanks! |
Looks like I need to move to the terraform-providers/terraform-provider-aws repo, so I've opened a new PR in hashicorp/terraform-provider-aws#17 Please confirm that's the right place to continue this discussion, and then feel free to close this. |
This is a port of hashicorp/terraform#12951 into the new repository. Partially fixes hashicorp/terraform#12704 in the case of hologram clients, but doesn't fix the regression when SkipRequestingAccountId is set.
* Fix handling of AdRoll's hologram clients This is a port of hashicorp/terraform#12951 into the new repository. Partially fixes hashicorp/terraform#12704 in the case of hologram clients, but doesn't fix the regression when SkipRequestingAccountId is set. * Refactoring of tests
* Fix handling of AdRoll's hologram clients This is a port of hashicorp/terraform#12951 into the new repository. Partially fixes hashicorp/terraform#12704 in the case of hologram clients, but doesn't fix the regression when SkipRequestingAccountId is set. * Refactoring of tests
I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues. If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further. |
This fixes #12704 by falling through on caller checks in case IAMInfo fails, and it refactors some of the testing code to support an added test for this condition in a cleaner fashion.
NOTE: This doesn't address potential regressions when SkipRequestingAccountId check is set, only when using something like AdRoll's Hologram.