
Ingester flush queue is backing up again #254

Closed
9 tasks done
tomwilkie opened this issue Jan 31, 2017 · 9 comments

@tomwilkie
Contributor

tomwilkie commented Jan 31, 2017

Things we should do to help:

Noticed when deploying to prod. Slack logs:

[11:14 AM]  
tom hmm cortex in prod seems unhealthy: https://cloud.weave.works/admin/grafana/dashboard/file/cortex-chunks.json?panelId=6&fullscreen&from=1485746034561&to=1485861234561
huge flush queue

[11:14 AM]  
tom 6 chunks per series
massive backlog
we should have an alert for this...
in the meantime, I’m going to up the dynamodb capacity to see if that helps

[11:15 AM]  
jml do we have any way of measuring for hotspotting?

[11:16 AM]  
tom not really
turned off table manager for now
Okay our graphs aren’t wrong:
From amazon

[screenshot: 2017-01-31 11:19]

[11:19 AM]  
tom nowhere near provisioned capacity
upped by 3x to 15k
this is the thing to reread: http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/GuidelinesForTables.html#GuidelinesForTables.Partitions
> A single partition can support a maximum of 3,000 read capacity units or 1,000 write capacity units
since we’re seeing about 1k writes, I guess thats it
> A single partition can hold approximately 10 GB of data
a week for us is about 50GB
so we’re at about 5 shards (5k write throughout, 50GB data)
and we’re not getting balance....
ridiculous
so this flush is clearly going to fail
that doesn’t mean we’ll loose data
as its replicated, and only one ingester is flushing
the deployment will stop after one ingester fails
to give us some time to figure this out
this has been going on since friday 27th
interesting

[screenshot: 2017-01-31 11:33]

[11:33 AM]  
tom effect of daily buckets
at midnight, a bunch of chunks have to be written to both buckets
so on thursday at midnight we moved to a new table
and ever since then, we’ve been failing to flush
actually our flush rate seems to have been okay, this does seem to just be load from users
right I think I might have a hypothesis
I think they’re stuck flushing to an old table
that we’ve reduced to 1 qps
@tomwilkie
Contributor Author

Running theory is/was that the ingesters were “stuck” flushing a chunk to last week’s table; however, this graph seems to disprove that:

[screenshot: 2017-01-31 12:24]

@tomwilkie
Contributor Author

Also worth noting only two ingesters seem "stuck" in prod (but all are in dev):

[screenshot: 2017-01-31 12:27]

@tomwilkie
Contributor Author

Those two ingesters have very high S3 latency:

[screenshot: 2017-01-31 13:35]

@tomwilkie
Contributor Author

I think S3 latency was a red herring; those instances were doing more writes.

We're also seeing this on dev. Took a stack dump (https://gist.github.com/tomwilkie/a1159d8974d231965fd04a6c26d0105b) and all the flush goroutines were in backoff sleeps.

Upped read throughput on dev to match write throughput, and progress started to be made:

[screenshot: 2017-01-31 15:27]

Have upped read throughput on prod to match.
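
For context on the “backoff sleeps” in the stack dump, a minimal sketch of the kind of retry-with-backoff loop the flush goroutines sit in. This is illustrative only, not Cortex’s actual flush code:

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// flushWithBackoff is a minimal sketch of a retry loop with exponential
// backoff, the sort of loop the stack dump shows every flush goroutine
// sleeping in. Illustrative only; this is not Cortex's actual flush code.
func flushWithBackoff(ctx context.Context, write func() error) error {
	backoff := 100 * time.Millisecond
	const maxBackoff = 30 * time.Second

	for {
		if err := write(); err == nil {
			return nil
		}
		// On a throttled (or otherwise failed) write, sleep and retry. While
		// every flush goroutine sits here, the flush queue keeps growing.
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(backoff):
		}
		if backoff < maxBackoff {
			backoff *= 2
		}
	}
}

func main() {
	attempts := 0
	_ = flushWithBackoff(context.Background(), func() error {
		attempts++
		if attempts < 3 {
			return errors.New("simulated throttling error")
		}
		return nil
	})
	fmt.Println("flushed after", attempts, "attempts")
}
```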

@tomwilkie
Contributor Author

Looks like this might be self-inflicted: flux has some very high-cardinality metrics (see fluxcd/flux#417).

@jml
Contributor

jml commented Feb 2, 2017

Still happening even though we've disabled flux scraping.

@tomwilkie
Contributor Author

The error was ResourceNotFoundException, which indicated the table we're writing to didn't exist. This is because I had turned down the table manager so I could tweak the tables manually. I have turned it back up, and it is flushing again.
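
For anyone hitting the same symptom, a quick way to confirm is a DescribeTable call against the table the ingesters are flushing to. A minimal sketch with the AWS Go SDK follows; the table name is a placeholder, not one of our actual tables:

```go
package main

import (
	"fmt"
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/awserr"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/dynamodb"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := dynamodb.New(sess)

	// "cortex_chunks_2017_05" is a placeholder for whichever weekly table
	// the ingesters are currently trying to flush to.
	out, err := svc.DescribeTable(&dynamodb.DescribeTableInput{
		TableName: aws.String("cortex_chunks_2017_05"),
	})
	if err != nil {
		if aerr, ok := err.(awserr.Error); ok && aerr.Code() == dynamodb.ErrCodeResourceNotFoundException {
			// Same error the ingesters were retrying on: the table simply
			// doesn't exist, e.g. because the table manager isn't running.
			log.Fatalf("table not found: %v", aerr)
		}
		log.Fatalf("describe table failed: %v", err)
	}
	fmt.Println("table status:", aws.StringValue(out.Table.TableStatus))
}
```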

@tomwilkie
Contributor Author

This is resolved in dev; will deploy to prod tomorrow:

[screenshot: 2017-02-08 10:30]

@tomwilkie
Contributor Author

This is in prod now.
