
ECS reporter throttled by AWS API #2050

Open
2opremio opened this issue Dec 1, 2016 · 23 comments
Labels
accuracy Incorrect information is being shown to the user; usually a bug bug Broken end user or developer functionality; not working as the developers intended it ecs Pertains to integration with Amazon Elastic Container Service
Comments

@2opremio
Contributor

2opremio commented Dec 1, 2016

@2opremio 2opremio added the bug Broken end user or developer functionality; not working as the developers intended it label Dec 1, 2016
@2opremio 2opremio added this to the November2016 milestone Dec 1, 2016
@2opremio
Contributor Author

2opremio commented Dec 1, 2016

This is the Sock Shop, run with the CloudFormation template (3x m4.xlarge instances) in Weave Cloud

@2opremio
Contributor Author

2opremio commented Dec 1, 2016

It seems AWS is throttling us:

<probe> WARN: 2016/11/30 19:23:17.144169 Error listing ECS services, ECS service report may be incomplete: ThrottlingException: Rate exceeded
        status code: 400, request id: 7210cbfd-b732-11e6-a879-0b8af0abb45a
<probe> ERRO: 2016/11/30 19:23:17.154211 error applying tagger: ThrottlingException: Rate exceeded
        status code: 400, request id: 721252a0-b732-11e6-a879-0b8af0abb45a
<probe> WARN: 2016/11/30 19:23:18.549526 Error describing some ECS services, ECS service report may be incomplete: ThrottlingException: Rate exceeded

@2opremio
Contributor Author

2opremio commented Dec 1, 2016

We should also fix the printf format string of some warnings:

<probe> WARN: 2016/11/30 20:54:25.202746 Failed to describe ECS task %!s(*string=0xc42121bf10), ECS service report may be incomplete: %!s(*string=0xc42121bf30)

@2opremio
Contributor Author

2opremio commented Dec 1, 2016

First 1000 lines of the logs: http://sprunge.us/SNIb

@ekimekim
Contributor

ekimekim commented Dec 1, 2016

my short-term thoughts on a long-term solution: we may need to get clever here with caching and careful use of immutable fields. For example, StartedBy for a task isn't going to change, which means we don't need to DescribeTasks every time - except this means any other metadata we may want to collect will also get stale. :S

@2opremio
Contributor Author

2opremio commented Dec 1, 2016

I've worked around it for now by creating the cluster in a separate region (AWS rate limits per region)

@2opremio
Contributor Author

2opremio commented Dec 3, 2016

A robust way to fix this would be to use the ECS event stream: https://aws.amazon.com/blogs/compute/monitor-cluster-state-with-amazon-ecs-event-stream/

However, I am not sure whether or how easily we can plug Scope in.

@ekimekim
Contributor

ekimekim commented Dec 6, 2016

I think it's too much to ask the user to create CloudWatch rules and an SQS topic as part of setup - or even if we go that route later, it'd be nice if it were optional.
So for now I'm moving forward with caching. My thoughts so far:

  • We can largely treat Tasks as immutable. Tasks always progress PENDING -> RUNNING -> STOPPED (with a few error-handling paths in between), never backwards - i.e. you can't start a stopped task, only a pending one - so fields like StartedBy and StartedAt can be considered immutable by the time we see them (when state = RUNNING).
  • Services are harder. They have many mutable fields, both ones that are important for maintaining the tasks->services map (the deployment list) and ones that we just don't want to be stale for display (e.g. running count). I've mapped out a solution where we still re-fetch, on each report, all services that we report on (those with at least one task running on the local machine), but we only need to re-scan other services when a new task appears that doesn't map to any known service.

Taken together, these improvements will cut down at least 50% of all queries, and likely more in most situations (since there are more tasks than services, and we won't be fetching services that aren't present on the machine).
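The Tasks half of this plan could be sketched as follows (type and field names are illustrative, not Scope's actual implementation): cache DescribeTasks results by task ARN, and only keep them once the task has reached a state in which the fields we care about can no longer change.

```go
package main

import "fmt"

// taskInfo holds fields that are immutable once a task reaches RUNNING,
// such as StartedBy and StartedAt.
type taskInfo struct {
	StartedBy string
	StartedAt string
}

// taskCache skips repeat DescribeTasks calls for tasks already seen in a
// state where the fields we care about can no longer change.
type taskCache struct {
	seen map[string]taskInfo // keyed by task ARN
}

func newTaskCache() *taskCache { return &taskCache{seen: map[string]taskInfo{}} }

// get returns cached info if present; otherwise it calls describe (an AWS
// round trip in real code) and caches the result only for RUNNING or
// STOPPED tasks, since a PENDING task's fields may still change.
func (c *taskCache) get(arn, status string, describe func(string) taskInfo) taskInfo {
	if info, ok := c.seen[arn]; ok {
		return info
	}
	info := describe(arn)
	if status == "RUNNING" || status == "STOPPED" {
		c.seen[arn] = info
	}
	return info
}

func main() {
	cache := newTaskCache()
	calls := 0
	describe := func(arn string) taskInfo {
		calls++ // stands in for a DescribeTasks API call
		return taskInfo{StartedBy: "ecs-svc/example", StartedAt: "2016-11-30T19:00:00Z"}
	}
	arn := "arn:aws:ecs:us-east-1:123456789012:task/example"
	cache.get(arn, "RUNNING", describe)
	cache.get(arn, "RUNNING", describe) // served from cache, no API call
	fmt.Println(calls)                  // 1
}
```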

@ekimekim
Contributor

ekimekim commented Dec 6, 2016

We could cut down on requests further by allowing our data to be stale out to some refresh rate (say, 1 minute), but still doing a shortcut refresh if needed to find the correct task for a service. But I'd like to avoid stale data in the details panel if at all possible - even a single instance of that can undermine user confidence in its accuracy in all cases.

@2opremio 2opremio added the ecs Pertains to integration with Amazon Elastic Container Service label Dec 12, 2016
@2opremio 2opremio modified the milestones: December2016, EOY 2016, Next Dec 13, 2016
@pidster
Contributor

pidster commented Jan 3, 2017

I think it's too much to ask the user to create CloudWatch rules and an SQS topic as part of setup - or even if we go that route later, it'd be nice if it were optional.

Would this be something for the launch-generator to take care of? cc @lukemarsden @errordeveloper

@2opremio
Contributor Author

2opremio commented Jan 3, 2017

Would this be something for the launch-generator to take care of?

I don't see how, at least not in the way we are currently using the launch-generator. You cannot create CloudWatch rules from cluster resources (be it Kubernetes, ECS, or what have you).

@pidster
Contributor

pidster commented Jan 3, 2017

The AWS Blox project purportedly provides a CFN template for doing this. The launch-generator could do the same, no? Or at least provide a fragment.

@2opremio
Contributor Author

2opremio commented Jan 3, 2017

Sure, we could create the AWS resources through a CFN template but I don't think the launch-generator would be involved.

Scope could detect at startup whether the ECS SQS queue is available, through the presence of a parameter in /etc/weave.scope.config (e.g. the SQS credentials), and otherwise fall back to using the AWS API directly.

In order to propagate the SQS credentials to Scope, I guess we could:

  • When using the Weave AMI: add them to /etc/scope.conf through User Data (I am not sure this would be secure enough)
  • When using the CFN template: simply extend the template with the creation of the extra resources and an AWS::CloudFormation::Init section to add the SQS credentials to /etc/scope.conf

@errordeveloper Does this make sense? If it does, let's create separate issues for it here and in https://github.com/weaveworks/integrations (we still need a minimally performant solution when SQS is not available, and I would like to use this issue for that).

@rade rade added the accuracy Incorrect information is being shown to the user; usually a bug label Jan 11, 2017
@2opremio
Contributor Author

2opremio commented Feb 9, 2017

A user is experiencing this even after #2065 (Scope 1.2) in a 5-node cluster: https://weaveworks.slack.com/archives/weave-users/p1486634036001678 . Reopening.

@2opremio 2opremio reopened this Feb 9, 2017
@2opremio 2opremio modified the milestones: Next, EOY 2016 Feb 9, 2017
@pidster
Contributor

pidster commented Feb 14, 2017

See also: prometheus/prometheus#2309

@pecigonzalo

pecigonzalo commented Sep 1, 2017

@2opremio I really like the idea of using CW to keep the state of the resources if an SQS parameter is provided, otherwise falling back to API + cache.
Maybe something like:

  1. Do an initial scan to get the state and clear the SQS queue (e.g. after a reboot it could contain outdated events)
  2. Subscribe to the SQS queue
  3. Periodically query the API directly and clean SQS, to ensure reconciliation.

Some things to keep in mind:

  • CW events can be duplicated; you can detect this by keeping a record of the last version you processed.
    On a similar project (using CW Events to automate other things) we do this by keeping a record, with a 1m TTL, of the last version seen for a task ARN (as you don't expect older or duplicate events after 1m); this could well reuse a similar implementation to the current cache.

  • CW Events do not contain the full task description, although I think they contain enough information for Scope.

@bboreham
Collaborator

Is every probe polling the API and getting the same information?
If so, could we configure just one probe to poll?

@errordeveloper
Contributor

errordeveloper commented Jan 22, 2018

@bboreham I'd agree; however, currently nothing stops you from running Scope probes in different clusters, in which case how can we tell which probe is in which cluster?

It'd make a lot of sense to externalise the Kubernetes and ECS code into plugins, and deploy just one of those per cluster.

@bboreham
Collaborator

By “just one” I meant one per cluster.

@rade
Member

rade commented Jan 22, 2018

could we configure just one probe to poll?

And if the node on which that runs gets removed, we'd have to enable the polling on another one. Not easy.

@errordeveloper
Contributor

And if the node on which that runs gets removed, we'd have to enable the polling on another one. Not easy.

The only sensible thing I can imagine would be to run these integrations outside of the probe process, as containers, and let the orchestrator take care of where they run. That's probably a little easier than doing some kind of election among the probes, but a big change nevertheless, although it could help the plugin story.

@pidster
Contributor

pidster commented Feb 19, 2018

And if the node on which that runs gets removed, we'd have to enable the polling on another one. Not easy.

Can Kubernetes run this as a deployment of 1? Or do the plugins need to be sidecars?

Obvs, not going to work for ECS...

@errordeveloper
Contributor

Can Kubernetes run this as a deployment of 1? Or do the plugins need to be sidecars?

Yes it can, as long as there is also a probe pod on the same node (which should be the case under normal conditions).

Obvs, not going to work for ECS...

I think it could be made to work...


7 participants