
Fails to renew STS token with "Credential expiration ... is less than10 minutes in the future. Disabling auto-refresh" #138

Closed · alexmbird opened this issue Feb 1, 2021 · 13 comments
Labels: bug (Something isn't working)

@alexmbird

I'm trying to get FluentBit up and running for an EKS cluster, with the intention of replacing a creaking Fluentd setup. Presently I'm running public.ecr.aws/aws-observability/aws-for-fluent-bit:2.10.1, configured with [OUTPUT] sections for five CloudWatch log groups. Each looks like this:

[OUTPUT]
    Name                cloudwatch_logs
    Match               dataplane.*
    region              ${AWS_REGION}
    log_group_name      /aws/eks/mycluster-v1/dataplane
    log_stream_prefix   ${HOSTNAME}-
    extra_user_agent    container-insights

The rest of the config is based on Amazon's sample and the application itself is installed with the fluent-bit chart from https://fluent.github.io/helm-charts. For access to CloudWatch I'm using Kiam, which has been running well with other applications on the cluster (including fluentd, writing to the exact same log groups) for a year.

FluentBit is struggling to use the STS token correctly, however. It starts happily enough, but after running for a few minutes it emits these warnings:

[2021/02/01 12:10:56] [ info] [output:cloudwatch_logs:cloudwatch_logs.1] Sent 1 events to CloudWatch
[2021/02/01 12:10:56] [ info] [output:cloudwatch_logs:cloudwatch_logs.4] Sent 9 events to CloudWatch
...
[2021/02/01 12:10:56] [ warn] [aws_credentials] Credential expiration '2021-02-01T12:17:56Z' is less than10 minutes in the future. Disabling auto-refresh.
[2021/02/01 12:10:56] [ warn] [aws_credentials] 'Expiration' was invalid or could not be parsed. Disabling auto-refresh of credentials.

Another few minutes pass, then the credential expires and the output is a stream of errors like:

[2021/02/01 13:46:19] [error] [output:cloudwatch_logs:cloudwatch_logs.5] Failed to send events
[2021/02/01 13:46:19] [ warn] [engine] chunk '1-1612187172.216188278.flb' cannot be retried: task_id=6, input=tail.6 > output=cloudwatch_logs.5
[2021/02/01 13:46:20] [error] [output:cloudwatch_logs:cloudwatch_logs.5] PutLogEvents API responded with error='ExpiredTokenException', message='The security token included in the request is expired'

With an expiring credential, disabling auto-refresh is the absolute last thing I want it to do :)

I've tried modifying my Kiam config so the session-duration is 30m rather than the default 15m, but all that happens is that FluentBit takes a while longer before emitting the error.
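For reference, the change is just a longer --session-duration on the kiam server (flag name as in kiam's docs; exactly where it gets set depends on how kiam is deployed):

    # kiam-server flag excerpt - illustrative only, other flags omitted
    --session-duration=30m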

Hence a question - do I have it misconfigured, or have I run into a bug with FluentBit's handling of session token renewal?

PettitWesley self-assigned this Feb 1, 2021
@PettitWesley (Contributor)

[2021/02/01 13:46:20] [error] [output:cloudwatch_logs:cloudwatch_logs.5] PutLogEvents API responded with error='ExpiredTokenException', message='The security token included in the request is expired'

After this error, does it retry and succeed?

The code is supposed to "auto-refresh" when it knows the credentials will expire in a few minutes, unless the expiration cannot be parsed or is already only a few minutes away, in which case it just waits until the creds have actually expired.

But once a real auth error occurs, it should refresh the creds.
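Roughly, the decision looks like the sketch below (a simplified sketch in C, not the actual aws_credentials source; the constant and function names are made up for illustration):

    /* Simplified sketch of the refresh decision described above - not the
     * real Fluent Bit aws_credentials code; names are illustrative only. */
    #include <time.h>

    #define REFRESH_WINDOW_SECS (10 * 60)  /* refresh when < 10 min remain */

    /* expiration == 0 means 'Expiration' was missing or could not be parsed */
    static int should_schedule_auto_refresh(time_t now, time_t expiration)
    {
        if (expiration == 0) {
            /* unknown expiry: auto-refresh is disabled and we only find out
             * via an auth error */
            return 0;
        }
        if (expiration - now < REFRESH_WINDOW_SECS) {
            /* already close to expiry: the current code also gives up here
             * instead of refreshing immediately - the behavior being fixed */
            return 0;
        }
        return 1;  /* otherwise, refresh shortly before the creds expire */
    }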

@alexmbird (Author)

After this error, does it retry and succeed?

No, it continues outputting errors saying it failed to send events. Here's the latest output from one of my FluentBit pods, which has been in this state for a couple of hours:

[2021/02/01 14:50:23] [error] [output:cloudwatch_logs:cloudwatch_logs.4] Failed to send events
[2021/02/01 14:50:23] [ warn] [engine] chunk '1-1612191017.214682030.flb' cannot be retried: task_id=8, input=tail.7 > output=cloudwatch_logs.5
[2021/02/01 14:50:23] [ warn] [engine] chunk '1-1612191007.352821208.flb' cannot be retried: task_id=3, input=tail.6 > output=cloudwatch_logs.4
[2021/02/01 14:50:24] [error] [output:cloudwatch_logs:cloudwatch_logs.4] PutLogEvents API responded with error='ExpiredTokenException', message='The security token included in the request is expired'

@PettitWesley (Contributor)

I see. Okay, I'll put up a pull request to fix the refresh behavior. I remember when I wrote this code I was a little uncertain of the logic.

I guess there's no real reason why it needs to "disable auto-refresh".

I'll check the logic for refreshing after errors too - that seems like a bug. Once it hits an auth error, it should try to refresh the credentials.

PettitWesley added the bug label Feb 1, 2021
@PettitWesley (Contributor)

@alexmbird while you wait for my fix, you can switch from the cloudwatch_logs plugin to the older cloudwatch plugin (which is slower, has higher memory usage and lower max throughput, but uses the AWS SDK for Go instead of Wesley's unofficial AWS SDK for C, and thus won't have this bug).
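The swap should mostly be a matter of changing the plugin name in each [OUTPUT] section - something like the following, keeping your existing keys (double-check extra_user_agent against the Go plugin's docs):

[OUTPUT]
    Name                cloudwatch
    Match               dataplane.*
    region              ${AWS_REGION}
    log_group_name      /aws/eks/mycluster-v1/dataplane
    log_stream_prefix   ${HOSTNAME}-
    extra_user_agent    container-insights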

@alexmbird (Author)

Many thanks for responding so promptly - it's appreciated.

As a stopgap measure I've followed your suggestion of switching from cloudwatch_logs to cloudwatch. The rest of the configuration remains the same. I've restarted FluentBit and it's happily transferring logs to CloudWatch. I'll keep an eye on the system to see if it's happy renewing tokens too.

@PettitWesley (Contributor)

@alexmbird I put up a PR to the upstream repo to fix this (linked above). I also built a container image with the code: 094266400487.dkr.ecr.us-west-2.amazonaws.com/refresh-issue:latest

That image can be pulled from any AWS account (its repo policy trusts *). I was wondering if you could test it out to verify that it fixes the issue.

The code built into that image is from the latest master commit, which is unreleased as of now. It works as far as I can tell. You might see some warning messages in the Fluent Bit logs about socket status; ignore those.

@alexmbird (Author)

@PettitWesley thanks for this. I've updated my development cluster to run that image.

I'm afraid all it does at present is go into an infinite loop printing:

[2021/02/09 15:17:23] [error] [socket] could not validate socket status for #158
[2021/02/09 15:17:23] [error] [socket] could not validate socket status for #158
[2021/02/09 15:17:23] [error] [socket] could not validate socket status for #158
...

@PettitWesley (Contributor)

@alexmbird I shouldn't have tried to build off of master; it's too unstable right now. Please re-pull that image - I just updated it with a build based on 1.6, which should be stable.

@alexmbird (Author)

No worries. I've deployed the new image - it starts correctly and is transmitting events to CloudWatch. It's the end of the working day here, so I'll leave it overnight to see if it correctly renews the token.

@alexmbird (Author)

Good morning! The new build has been running happily all night on my dev cluster. I don't see any messages in the (info-level) logs about renewing the tokens, but perhaps that's expected.

One tiny (& somewhat off-topic) request: could the regular "Sent 8 events to CloudWatch" log messages be debug rather than info level? I ask because with five separate outputs they get spammy, and I think that's the behaviour the old cloudwatch plugin followed.
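The only workaround I can see in the meantime is lowering the global Log_Level in the [SERVICE] section, but that's blunt - it hides every info message, not just these:

[SERVICE]
    Log_Level    warn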

@alexmbird (Author)

Hey there. The special image you baked is still running well on my test cluster. Do you happen to have an ETA for when the fix will hit an official release?

@PettitWesley (Contributor)

@alexmbird If you're comfortable using the upstream image distro, then fluent/fluent-bit:1.7.1 has this change.

We'll do a release of AWS for Fluent Bit probably next week once 1.7.2 comes out.
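If you're on the fluent-bit helm chart you mentioned, switching images should just be a values override (key names as in that chart's default values.yaml - worth double-checking against your chart version):

image:
  repository: fluent/fluent-bit
  tag: 1.7.1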

@alexmbird (Author)

I've seen the AWS for Fluent Bit 1.7.2 release and updated our clusters to that. I can confirm that the STS renewal bug is now fixed.

We have discovered another problem, but it probably isn't related, so I'll open a new ticket.
