Hotspotting on keys #733

Closed · bboreham opened this issue Mar 4, 2018 · 2 comments

Labels: postmortem (an issue arising out of a serious production issue)

bboreham commented Mar 4, 2018

Similar to #254, we often see periods where the achieved throughput is much lower than provisioned capacity on DynamoDB. This issue is a bit of an umbrella / brain-dump.

We could use some better tools to investigate this, e.g. logging of keys that suffer multiple retries. The retry histogram tells me 99.9% of operations need zero retries, which is nice but not very helpful.
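Roughly what I have in mind, as a minimal sketch (the names and threshold are made up, not actual Cortex code; a real version would need locking and probably a metric rather than a log line):

```go
// Sketch only: count retries per DynamoDB hash key and log a key once it has
// been retried more than a threshold number of times, so hot keys become
// visible instead of hiding in an aggregate histogram.
package main

import "log"

const retryLogThreshold = 2 // arbitrary; tune to taste

type retryTracker struct {
	counts map[string]int // retries observed per hash key
}

func newRetryTracker() *retryTracker {
	return &retryTracker{counts: map[string]int{}}
}

// recordRetry is called each time a write to hashKey is throttled and retried.
// A real implementation would need a mutex around the map.
func (t *retryTracker) recordRetry(hashKey string) {
	t.counts[hashKey]++
	if t.counts[hashKey] == retryLogThreshold {
		log.Printf("hot key: %q retried %d times", hashKey, retryLogThreshold)
	}
}

func main() {
	t := newRetryTracker()
	for i := 0; i < 3; i++ {
		t.recordRetry("2:d17599:container_cpu_usage_seconds_total")
	}
}
```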

I wonder if the "series index" added in #442 is causing trouble - the hash (partition) key is the same for every chunk for a particular user (instance). [EDIT] This index is only used to iterate through time series for queries that don't have a metric name, and it's unusably slow.

Maybe add some more diversity to the hash key, e.g. append a hex digit derived from the SHA; then you have to do 16 reads instead of 1 to scan the whole row, but each of those 16 reads will go much faster. Sketched below.
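A sketch of that idea (not the real schema code; shardedHashKey, allShards and the choice of SHA-256 are just for illustration):

```go
// Sketch: spread one logical row across 16 DynamoDB partition keys by
// appending a hex digit taken from a SHA of the range key. Writes pick one
// shard; a full scan of the logical row issues 16 reads, one per shard,
// which can proceed in parallel against different partitions.
package main

import (
	"crypto/sha256"
	"fmt"
)

// shardedHashKey appends one hex digit derived from the range key's SHA-256,
// so chunks for the same user/metric land on 16 different partitions.
func shardedHashKey(hashKey, rangeKey string) string {
	sum := sha256.Sum256([]byte(rangeKey))
	return fmt.Sprintf("%s:%x", hashKey, sum[0]>>4) // one hex digit: 0..f
}

// allShards returns the 16 hash keys a reader must query to see the whole row.
func allShards(hashKey string) []string {
	shards := make([]string, 0, 16)
	for i := 0; i < 16; i++ {
		shards = append(shards, fmt.Sprintf("%s:%x", hashKey, i))
	}
	return shards
}

func main() {
	fmt.Println(shardedHashKey("2:d17599:container_cpu_usage_seconds_total", "chunk-1234"))
	fmt.Println(allShards("2:d17599:container_cpu_usage_seconds_total"))
}
```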

It looks like writes from ingester_flush.go to the chunk store do exponential back-off up to the timeout (1 minute), then error out and go back onto the flush queue, whereupon the exponential backoff starts again at 100ms. And when we start again we re-write all the keys even though only one was outstanding. So it would be better to keep trying for longer.
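Something like this is what I mean, as a sketch (writeKeys, flushWithBackoff and the deadlines are made up, not the code in ingester_flush.go): retry only the keys that are still outstanding, cap the backoff instead of resetting it, and allow a much longer overall deadline.

```go
// Sketch of the suggested flush behaviour, not the real ingester code.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

const (
	minBackoff    = 100 * time.Millisecond
	maxBackoff    = 10 * time.Second
	flushDeadline = 10 * time.Minute // hypothetical; much longer than 1 minute
)

// writeKeys is a stand-in for the chunk-store write; it returns the keys
// that were throttled and still need to be written.
func writeKeys(ctx context.Context, keys []string) ([]string, error) {
	// a real implementation would call DynamoDB BatchWriteItem here
	return nil, nil
}

func flushWithBackoff(ctx context.Context, keys []string) error {
	ctx, cancel := context.WithTimeout(ctx, flushDeadline)
	defer cancel()

	backoff := minBackoff
	for len(keys) > 0 {
		remaining, err := writeKeys(ctx, keys)
		if err != nil {
			return err
		}
		keys = remaining // retry only the keys that are still outstanding
		if len(keys) == 0 {
			return nil
		}
		select {
		case <-time.After(backoff):
		case <-ctx.Done():
			return errors.New("flush deadline exceeded")
		}
		backoff *= 2 // capped, never reset back to 100ms mid-flush
		if backoff > maxBackoff {
			backoff = maxBackoff
		}
	}
	return nil
}

func main() {
	fmt.Println(flushWithBackoff(context.Background(), []string{"k1", "k2"}))
}
```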

Related: #724

bboreham commented Mar 9, 2018

Update: we have moved our biggest environment back from v8 schema to v6, removing the most-contended keys. Throughput is much better.

The next most-contended keys (revealed by #734) include things like:
2:d17599:kube_replicaset_status_fully_labeled_replicas
3:d17599:container_cpu_usage_seconds_total:image

bboreham commented Jun 4, 2018

Having addressed the worst problem, we covered the remainder of this issue ("better tools") in #755, #761 and #772.

bboreham closed this as completed Jun 4, 2018