Auto-scale DynamoDB provision based on Prometheus metrics #841
Conversation
Force-pushed from b7990d1 to 8497e4f
At the weekly table changeover (which happens Wednesday night for me) it initially does a scale-down of the new table, because the error rate is zero. Having thought about it a bit, I think this is best addressed by setting the minimum write capacity to a level which is enough to handle the expected traffic at midnight. I like the idea that the core algorithm should operate without knowing anything specific about the individual tables.
Scaling up by a fixed fraction of current provision isn't very good. As a simple improvement I might floor the increment to a fraction of the max, e.g. if max is 10K then min increment is 1K.
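A minimal Go sketch of that rule; the function name is illustrative, the 20% step and 10%-of-max floor are taken from the discussion further down, and this is not the PR's actual code:

```go
// computeScaleUp is a hypothetical sketch: grow by 20% of the current
// provision, but floor the increment at 10% of the configured max
// (e.g. max 10K => minimum increment 1K), and never exceed max.
func computeScaleUp(current, max int64) int64 {
	step := current / 5 // 20% of current provision
	if floor := max / 10; step < floor {
		step = floor // floor the increment at 10% of max
	}
	if target := current + step; target < max {
		return target
	}
	return max
}
```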
func chunkTagsToDynamoDB(ts chunk.Tags) []*dynamodb.Tag {
	if ts == nil {
		return nil
	}
Is this redundant? Looks like `result` will be nil if len(ts) == 0 (or ts is nil).
yes; it's cut/pasted from what was there before
	err = d.disableAutoScaling(ctx, expected)
} else if current.WriteScale != expected.WriteScale {
	err = d.enableAutoScaling(ctx, expected)
	if expected.WriteScale.Enabled && d.metrics != nil {
Perhaps the check `expected.WriteScale.Enabled && d.metrics != nil` should be put in the constructor so we can fail early? Then we can just check `expected.WriteScale.Enabled` here.
Actually, in what situation is metrics nil?
`metrics` is nil when you want AWS auto-scaling.
I don't see how - must be missing something. Would you mind explaining a little further for me?
plus or minus a bug...
I fixed the bug.
@@ -319,7 +353,11 @@ func (d dynamoTableClient) UpdateTable(ctx context.Context, current, expected ch
			return err
		})
	}); err != nil {
		return err
		if awsErr, ok := err.(awserr.Error); ok && awsErr.Code() == "LimitExceededException" {
			level.Warn(util.Logger).Log("msg", "update limit exceeded", "err", err)
Perhaps export a counter for this too?
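A hedged sketch of such a counter using the client_golang library; the metric name and registration are illustrative only, not the PR's actual code:

```go
import "github.com/prometheus/client_golang/prometheus"

// Hypothetical counter, bumped next to the Warn log above whenever a
// table update is rejected with LimitExceededException.
var tableUpdateLimitExceeded = prometheus.NewCounter(prometheus.CounterOpts{
	Namespace: "cortex",
	Name:      "dynamo_table_update_limit_exceeded_total",
	Help:      "Total DynamoDB table updates rejected with LimitExceededException.",
})

func init() {
	prometheus.MustRegister(tableUpdateLimitExceeded)
}
```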
pkg/chunk/aws/metrics_autoscaling.go
if err != nil {
	return nil, err
}
promAPI = promV1.NewAPI(client)
If I read this right, this will return a non-functioning but non-nil `metricsData` if the addr is empty? That seems odd to me; I'd expect either an error or a fully-functioning `metricsData` - perhaps this logic should be pushed up to the caller?
pkg/chunk/aws/metrics_autoscaling.go
}

func extractRates(matrix model.Matrix) (map[string]float64, error) {
	ret := make(map[string]float64)
IIRC throughout the codebase we tend to prefer `map[string]float64{}`, and only use `make` when we know the size.
pkg/chunk/aws/metrics_autoscaling.go
}

// fetch write error rate per DynamoDB table
deMatrix, err := promQuery(ctx, m.promAPI, `sum(rate(cortex_dynamo_failures_total{error="ProvisionedThroughputExceededException",operation=~".*Write.*"}[1m])) by (table) > 0`, 0, time.Second)
I think we should be able to inject extra selectors into these queries; for instance, we run multiple Cortex clusters monitored by the same Prometheus.
I have made the whole PromQL expression configurable, on the basis that other things like rate periods might need tweaking too.
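A minimal sketch of that hook, reusing the error-rate query from the diff above; the flag name and config field are assumptions for illustration:

```go
// Hypothetical flag making the error-rate PromQL expression tunable,
// so selectors or rate periods can be adjusted per deployment.
f.StringVar(&cfg.ErrorRateQuery, argPrefix+".scale.error-rate-query",
	`sum(rate(cortex_dynamo_failures_total{error="ProvisionedThroughputExceededException",operation=~".*Write.*"}[1m])) by (table) > 0`,
	"PromQL expression giving the DynamoDB write-throttle rate per table.")
```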
return chunk.TableDesc{
	Name: name,
}, dynamodb.TableStatusActive, nil
}, true, nil
👍
f.Int64Var(&cfg.OutCooldown, argPrefix+".out-cooldown", 3000, "DynamoDB minimum time between each autoscaling event that increases provision capacity.")
f.Int64Var(&cfg.InCooldown, argPrefix+".in-cooldown", 3000, "DynamoDB minimum time between each autoscaling event that decreases provision capacity.")
f.Int64Var(&cfg.OutCooldown, argPrefix+".out-cooldown", 1800, "DynamoDB minimum seconds between each autoscale up.")
f.Int64Var(&cfg.InCooldown, argPrefix+".in-cooldown", 1800, "DynamoDB minimum seconds between each autoscale down.")
All usages of this are as a `time.Duration`, so perhaps we should make this a `DurationVar`?
Good idea, but not backwards-compatible. I guess we could add two new flags and deprecate the old.
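A hedged sketch of that deprecation path; the new flag names and config fields are assumptions, with the 30-minute default matching the production value mentioned below:

```go
// Hypothetical Duration-typed replacements; the old integer flags
// would remain registered but be marked deprecated.
f.DurationVar(&cfg.OutCooldownPeriod, argPrefix+".out-cooldown-period", 30*time.Minute,
	"Minimum time between autoscaling events that increase provisioned capacity.")
f.DurationVar(&cfg.InCooldownPeriod, argPrefix+".in-cooldown-period", 30*time.Minute,
	"Minimum time between autoscaling events that decrease provisioned capacity.")
```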
	})
}
return result
}
👍
pkg/chunk/aws/metrics_autoscaling.go
queueLengths []float64
errorRates   map[string]float64
usageRates   map[string]float64
}
There is quite a lot of coupling between `metricsData` and `dynamoTableClient`; I wonder if more work needs to be done to tease out a clean interface?
OK, I think much of this was simply the metrics methods were on the wrong struct, but I went ahead and created an abstract interface so the choice between old-style and new-style is in the constructor. Let me know what you think.
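A hedged sketch of what that constructor-time choice might look like; `newMetrics` appears in the diff below, while the config field and the old-style constructor name are assumptions:

```go
// Hypothetical wiring: pick the autoscale implementation once, in the
// constructor, based on whether a metrics URL was configured.
func newDynamoTableClient(cfg DynamoDBConfig) (*dynamoTableClient, error) {
	var scaler autoscale
	var err error
	if cfg.Metrics.URL != "" {
		scaler, err = newMetrics(cfg) // new-style: scale on Prometheus metrics
	} else {
		scaler, err = newAWSAutoscale(cfg) // old-style AWS autoscaling (name assumed)
	}
	if err != nil {
		return nil, err
	}
	return &dynamoTableClient{autoscale: scaler}, nil
}
```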
Looks good now!
A few minor nits; main concern is about coupling between the `metricsData` and `tableClient`. I wonder if a better interface could be teased out?
pkg/chunk/aws/metrics_autoscaling.go
	}
}

func newMetrics(cfg DynamoDBConfig) (*metricsData, error) {
This probably belongs at the top of the file now?
}, []string{"operation", "status_code"})
// Pluggable auto-scaler implementation
type autoscale interface {
	CreateTable(ctx context.Context, desc chunk.TableDesc) error
This is really `PostCreateTable`, right?
yeah, I was wrestling with how similar it is to the higher-level interface
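For reference, a hedged reconstruction of the interface with the suggested rename; only the create hook appears in the diff above, and the update hook is an assumption mirroring the table-client methods:

```go
// Pluggable auto-scaler implementation (hypothetical renamed form).
type autoscale interface {
	PostCreateTable(ctx context.Context, desc chunk.TableDesc) error
	UpdateTable(ctx context.Context, current chunk.TableDesc, expected *chunk.TableDesc) error
}
```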
If the current provision is …
and bump the failure counter too
so we can supply multiple sets of results
Don't want to scale down if all our metrics are returning zero, or all the ingesters have crashed, etc.
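A sketch of that guard, assuming the `queueLengths` samples held by the `metricsData` struct shown earlier; the method name is illustrative:

```go
// canScaleDown is a hypothetical safety check: refuse to scale down
// when every recent queue-length sample is zero, which more likely
// means dead ingesters or a broken metrics pipeline than low load.
func (m *metricsData) canScaleDown() bool {
	for _, l := range m.queueLengths {
		if l > 0 {
			return true // at least one live, non-zero sample
		}
	}
	return false // all-zero or empty readings: don't trust them
}
```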
50 minutes seems weird. Currently using 30 minutes in production.
Move old-style AWS autoscaling to a separate file; make metrics autoscaling implement the common interface.
If the current capacity is very low, adding 20% doesn't get us very far, so add minimum 10% of max capacity. Refactored to extract 'compute' functions for readability.
Force-pushed from 47eaed8 to 0dd6116
Force-pushed from 0dd6116 to f93ace9
LGTM!
Fixes #735
To use this feature, set `-dynamodb....scale.enabled` on in command-line args but do not supply an AWS autoscaler `-applicationautoscaling.url`. Also set `-metrics.url` to point at a Prometheus API-compatible server which will serve the metrics. (At the time of writing this can't be a multi-tenant server, because I didn't put in an option for the tenant.)
PR also includes much refactoring of existing tests because I couldn't stand so much repetition.
Outstanding things I should do:
- `queueLengthAcceptable` to be changed
- `out-cooldown` and `in-cooldown`
For extra credit: