
ECS reporter throttled by AWS API #2050

Open
2opremio opened this issue Dec 1, 2016 · 23 comments
Labels
accuracy Incorrect information is being shown to the user; usually a bug bug Broken end user or developer functionality; not working as the developers intended it ecs Pertains to integration with Amazon Elastic Container Service
Comments

@2opremio
Contributor

2opremio commented Dec 1, 2016

@2opremio 2opremio added the bug Broken end user or developer functionality; not working as the developers intended it label Dec 1, 2016
@2opremio 2opremio added this to the November2016 milestone Dec 1, 2016
@2opremio
Contributor Author

2opremio commented Dec 1, 2016

This is the Sock Shop, run with the CloudFormation template (3x m4.xlarge instances) in Weave Cloud

@2opremio
Contributor Author

2opremio commented Dec 1, 2016

It seems AWS is throttling us:

<probe> WARN: 2016/11/30 19:23:17.144169 Error listing ECS services, ECS service report may be incomplete: ThrottlingException: Rate exceeded
        status code: 400, request id: 7210cbfd-b732-11e6-a879-0b8af0abb45a
<probe> ERRO: 2016/11/30 19:23:17.154211 error applying tagger: ThrottlingException: Rate exceeded
        status code: 400, request id: 721252a0-b732-11e6-a879-0b8af0abb45a
<probe> WARN: 2016/11/30 19:23:18.549526 Error describing some ECS services, ECS service report may be incomplete: ThrottlingException: Rate exceeded

@2opremio
Contributor Author

2opremio commented Dec 1, 2016

We should also fix the printf format string of some warnings:

<probe> WARN: 2016/11/30 20:54:25.202746 Failed to describe ECS task %!s(*string=0xc42121bf10), ECS service report may be incomplete: %!s(*string=0xc42121bf30)

@2opremio
Contributor Author

2opremio commented Dec 1, 2016

First 1000 lines of the logs: http://sprunge.us/SNIb

@ekimekim
Contributor

ekimekim commented Dec 1, 2016

my short-term thoughts on a long-term solution: we may need to get clever here with caching and careful use of immutable fields. For example, StartedBy for a task isn't going to change, which means we don't need to DescribeTasks every time - except this means any other metadata we may want to collect will also get stale. :S

@2opremio
Contributor Author

2opremio commented Dec 1, 2016

I've worked around it for now by creating the cluster in a separate region (AWS rate limits per region)

@2opremio
Contributor Author

2opremio commented Dec 3, 2016

A robust way to fix this would be to use the ECS event stream: https://aws.amazon.com/blogs/compute/monitor-cluster-state-with-amazon-ecs-event-stream/

However, I am not sure whether or how easily we can plug Scope in.

@ekimekim
Contributor

ekimekim commented Dec 6, 2016

I think it's too much to ask the user to create CloudWatch rules and an SQS topic as part of setup - or even if we go that route later, it'd be nice if it were optional.
So for now I'm moving forward with caching. My thoughts so far:

  • We can largely treat Tasks as immutable. Tasks always progress PENDING -> RUNNING -> STOPPED (with a few error-handling paths in between), never backwards - i.e. you can't start a stopped task, only a pending one - so fields like StartedBy and StartedAt can be considered immutable by the time we see them (when state = RUNNING).
  • Services are harder. They have many mutable fields, both ones that are important for maintaining the tasks->services map (the deployment list) and ones that we just don't want to be stale for display (e.g. running count). I've mapped out a solution where we still re-fetch, on each report, all services that we report on (those with at least one task running on the local machine), but we only need to re-scan other services when a new task appears that doesn't map to any known service.

Taken together, these improvements will cut down at least 50% of all queries, and likely more in most situations (since there are more tasks than services, and we won't be fetching services that aren't present on the machine).
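The Tasks half of this plan could be sketched as follows (type and field names are illustrative, not Scope's actual implementation): cache DescribeTasks results by task ARN, and only keep them once the task has reached a state in which the fields we care about can no longer change.

```go
package main

import "fmt"

// taskInfo holds fields that are immutable once a task reaches RUNNING,
// such as StartedBy and StartedAt.
type taskInfo struct {
	StartedBy string
	StartedAt string
}

// taskCache skips repeat DescribeTasks calls for tasks already seen in a
// state where the fields we care about can no longer change.
type taskCache struct {
	seen map[string]taskInfo // keyed by task ARN
}

func newTaskCache() *taskCache { return &taskCache{seen: map[string]taskInfo{}} }

// get returns cached info if present; otherwise it calls describe (an AWS
// round trip in real code) and caches the result only for RUNNING or
// STOPPED tasks, since a PENDING task's fields may still change.
func (c *taskCache) get(arn, status string, describe func(string) taskInfo) taskInfo {
	if info, ok := c.seen[arn]; ok {
		return info
	}
	info := describe(arn)
	if status == "RUNNING" || status == "STOPPED" {
		c.seen[arn] = info
	}
	return info
}

func main() {
	cache := newTaskCache()
	calls := 0
	describe := func(arn string) taskInfo {
		calls++ // stands in for a DescribeTasks API call
		return taskInfo{StartedBy: "ecs-svc/example", StartedAt: "2016-11-30T19:00:00Z"}
	}
	arn := "arn:aws:ecs:us-east-1:123456789012:task/example"
	cache.get(arn, "RUNNING", describe)
	cache.get(arn, "RUNNING", describe) // served from cache, no API call
	fmt.Println(calls)                  // 1
}
```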

@ekimekim
Contributor

ekimekim commented Dec 6, 2016

We could cut down on requests further by allowing our data to be stale out to some refresh rate (say, 1 minute), but still doing a shortcut refresh if needed to find the correct task for a service. But I'd like to avoid stale data in the details panel if at all possible - even a single instance of that can undermine user confidence in its accuracy in all cases.

@2opremio 2opremio added the ecs Pertains to integration with Amazon Elastic Container Service label Dec 12, 2016
@2opremio 2opremio modified the milestones: December2016, EOY 2016, Next Dec 13, 2016
@pidster
Contributor

pidster commented Jan 3, 2017

I think it's too much to ask the user to create CloudWatch rules and an SQS topic as part of setup - or even if we go that route later, it'd be nice if it were optional.

Would this be something for the launch-generator to take care of? cc @lukemarsden @errordeveloper

@2opremio
Contributor Author

2opremio commented Jan 3, 2017

Would this be something for the launch-generator to take care of?

I don't see how, at least not in the way we are currently using the launch-generator. You cannot create CloudWatch rules from cluster resources (be it Kubernetes, ECS, or what have you).

@pidster
Contributor

pidster commented Jan 3, 2017

The AWS Blox project purportedly provides a CFN template for doing this. The launch-generator could do the same, no? Or at least provide a fragment.

@2opremio
Contributor Author

2opremio commented Jan 3, 2017

Sure, we could create the AWS resources through a CFN template but I don't think the launch-generator would be involved.

Scope could detect at startup whether the ECS SQS queue is available, through the presence of a parameter in /etc/weave.scope.config (e.g. the SQS credentials), and otherwise fall back to using the AWS API directly.

In order to propagate the SQS credentials to Scope, I guess we could:

  • When using the Weave AMI: add them to /etc/scope.conf through User Data (I am not sure this would be secure enough)
  • When using the CFN template: simply extend the template with the creation of the extra resources and an AWS::CloudFormation::Init section to add the SQS credentials to /etc/scope.conf

@errordeveloper Does this make sense? If it does, let's create separate issues for it here and in https://github.com/weaveworks/integrations (we still need a minimally performant solution when SQS is not available, and I would like to use this issue for that).

@rade rade added the accuracy Incorrect information is being shown to the user; usually a bug label Jan 11, 2017
@2opremio
Contributor Author

2opremio commented Feb 9, 2017

A user is experiencing this even after #2065 (Scope 1.2) in a 5-node cluster: https://weaveworks.slack.com/archives/weave-users/p1486634036001678 . Reopening.

@2opremio 2opremio reopened this Feb 9, 2017
@2opremio 2opremio modified the milestones: Next, EOY 2016 Feb 9, 2017
@pidster
Contributor

pidster commented Feb 14, 2017

See also: prometheus/prometheus#2309

@pecigonzalo

pecigonzalo commented Sep 1, 2017

@2opremio I really like the idea of using CW to keep the state of the resources if an SQS parameter is provided, otherwise falling back to API + cache.
Maybe something like:

  1. Do an initial scan to get the state and clear the SQS queue (e.g. after a reboot it could contain outdated events)
  2. Subscribe to the SQS queue
  3. Periodically query the API directly and clean SQS, to ensure reconciliation.

Some things to keep in mind:

  • CW events can be duplicated; you can detect this by keeping a record of the last version you processed.
    On a similar project (using CW Events to automate other things) we do this by keeping a record, with a 1m TTL, of the last version seen for a task ARN (as you don't expect older or duplicate events after 1m); this could well reuse a similar implementation to the current cache.

  • CW Events do not contain the full task description, although I think they contain enough information for Scope.

@bboreham
Collaborator

Is every probe polling the API and getting the same information?
If so, could we configure just one probe to poll?

@errordeveloper
Contributor

errordeveloper commented Jan 22, 2018

@bboreham I'd agree; however, currently nothing stops you from running Scope probes in different clusters, in which case how can we tell which probe is in which cluster?

It'd make a lot of sense to externalise the Kubernetes and ECS code into plugins, and deploy just one of those per cluster.

@bboreham
Collaborator

By “just one” I meant one per cluster.

@rade
Member

rade commented Jan 22, 2018

could we configure just one probe to poll?

And if the node on which that runs gets removed, we'd have to enable the polling on another one. Not easy.

@errordeveloper
Contributor

And if the node on which that runs gets removed, we'd have to enable the polling on another one. Not easy.

The only sensible thing I can imagine would be to run these integrations outside of the probe process, as containers, and let the orchestrator take care of where they run. That's probably a little easier than doing some kind of election among the probes, but a big change nevertheless, although it could help the plugin story.

@pidster
Contributor

pidster commented Feb 19, 2018

And if the node on which that runs gets removed, we'd have to enable the polling on another one. Not easy.

Can Kubernetes run this as a deployment of 1? Or do the plugins need to be sidecars?

Obvs, not going to work for ECS...

@errordeveloper
Contributor

Can Kubernetes run this as a deployment of 1? Or do the plugins need to be sidecars?

Yes it can, as long as there is also a probe pod on the same node (which should be the case under normal conditions).

Obvs, not going to work for ECS...

I think it could be made to work...


7 participants