A case for etcd controller. #135

Closed
132 changes: 132 additions & 0 deletions doc/controller.md
# A case for an etcd controller

## Abstract

This document tries to show that introducing a Kubernetes controller would increase the options for solving the various problems encountered in the life-cycle management of `etcd` instances.

## Goal

To justify the introduction of a Kubernetes controller into the life-cycle management of `etcd` instances in terms of the increased number of options to solve the known problems in the domain.

## Non-goal

It is not a goal of this document to recommend rewriting all the current functionality in etcd-backup-restore as a Kubernetes [operator](https://coreos.com/operators/).
The options highlighted below are just that: options.

Even if there is a consensus on adopting the controller approach for some subset of the functionality below, it is not the intention of this document to advocate for implementing all those features right away.

## Content

* [A case for an etcd controller](#a-case-for-an-etcd-controller)
* [Abstract](#abstract)
* [Goal](#goal)
* [Non\-goal](#non-goal)
* [Content](#content)
* [Single\-node and Multi\-node](#single-node-and-multi-node)
* [Database Restoration](#database-restoration)
* [Database Incremental Backup Compaction](#database-incremental-backup-compaction)
* [Database Verification](#database-verification)
* [Autoscaling](#autoscaling)
* [Non\-disruptive Maintenance](#non-disruptive-maintenance)
* [Co\-ordination outside of Gardener Reconciliation](#co-ordination-outside-of-gardener-reconciliation)
* [Backup Health Verification](#backup-health-verification)
* [Major Version Upgrades](#major-version-upgrades)
* [Summary](#summary)

## Single-node and Multi-node

We currently support only single-node `etcd` instances.
There might be valid reasons to also support multi-node `etcd` clusters, including non-disruptive maintenance and support for the Gardener Ring.

A multi-node `etcd` cluster requires co-ordination between the `etcd` nodes not just for consensus management but also for life-cycle management tasks.
This means that co-ordination is required across nodes for some (or all) tasks such as backups (full and delta), verification, restoration and scaling.
Some of that co-ordination could be done with the current sidecar approach.
But many of these tasks, especially restoration and scaling, are better handled by a controller.

For backups, a better option might be to run the backup component as a separate pod for the whole `etcd` cluster rather than as a sidecar container for each `etcd` instance in the cluster.
This could be used only in the multi-node scenario, while the sidecar container approach perhaps continues to be used for the single-node scenario.
This could be the first step; introducing a controller to co-ordinate the other life-cycle operations can be considered later.

Member:

Yes, multi-node is definitely something that will require a coordinator/controller. So far I see three use cases:

  • Cross-cluster Gardener (whether it’s a self-managed ring is a separate topic, but not relevant for this question)
  • Smooth rolling update with hand-over between two etcd instances (what you call, more concisely, just “non-disruptive maintenance”)
  • Multi-zonal control planes (desired from customer PoV, but less so from technical/our PoV, so with the lowest priority of the three at present)

## Database Restoration

Database restoration is currently also done on startup (or restart), when database verification fails, within the same backup-restore sidecar's main process.

Introducing a controller enables the option to perform database restoration as a separate job.
The main advantage of this approach is to decouple the memory requirement of a database restoration from the regular backup (full and delta) tasks.
This is of particular interest because delta snapshot restoration requires an embedded `etcd` instance, which makes the memory requirement for database restoration almost certainly proportional to the database size. The memory requirement for backups (full and delta), on the other hand, need not be proportional to the database size at all; in fact, it is realistic to expect it to be more or less independent of the database size.

The current backup-restore component already supports a subcommand for restoration. So, the existing functionality for restoration can be fully reused even at the level of the binary docker image.
The additional co-ordination steps (scaling down the statefulset, triggering the restoration job, waiting for the restoration to complete and scaling the statefulset back up) would have to be encoded in a controller.

This could be a second step: only a lean controller for restoration would be needed, and it might even be possible to do this without introducing a CRD.
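
A minimal sketch of this co-ordination, assuming a controller built with controller-runtime and a restoration `Job` that reuses the existing backup-restore image, could look as follows. The image reference, subcommand, flags and object names are illustrative placeholders, not the actual interface.

```go
// Sketch: scale the etcd StatefulSet down, run a restoration Job reusing the
// existing backup-restore image, wait for completion, then scale back up.
// Image, arguments and object names are illustrative assumptions.
package restore

import (
	"context"
	"fmt"
	"time"

	appsv1 "k8s.io/api/apps/v1"
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func int32Ptr(i int32) *int32 { return &i }

// RestoreDatabase co-ordinates a full restoration for a single-node etcd.
func RestoreDatabase(ctx context.Context, c client.Client, namespace, name string) error {
	// 1. Scale the etcd StatefulSet down to zero replicas.
	sts := &appsv1.StatefulSet{}
	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, sts); err != nil {
		return err
	}
	sts.Spec.Replicas = int32Ptr(0)
	if err := c.Update(ctx, sts); err != nil {
		return err
	}

	// 2. Trigger the restoration Job (hypothetical image and arguments).
	job := &batchv1.Job{
		ObjectMeta: metav1.ObjectMeta{Namespace: namespace, Name: name + "-restore"},
		Spec: batchv1.JobSpec{
			Template: corev1.PodTemplateSpec{
				Spec: corev1.PodSpec{
					RestartPolicy: corev1.RestartPolicyOnFailure,
					Containers: []corev1.Container{{
						Name:  "restore",
						Image: "etcd-backup-restore:latest",                     // assumption
						Args:  []string{"restore", "--data-dir=/var/etcd/data"}, // assumption
					}},
				},
			},
		},
	}
	if err := c.Create(ctx, job); err != nil {
		return err
	}

	// 3. Wait for the Job to complete (a real controller would re-queue instead of polling).
	for {
		if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: job.Name}, job); err != nil {
			return err
		}
		if job.Status.Succeeded > 0 {
			break
		}
		if job.Status.Failed > 0 {
			return fmt.Errorf("restoration job %s failed", job.Name)
		}
		time.Sleep(10 * time.Second)
	}

	// 4. Scale the StatefulSet back up to one replica.
	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: name}, sts); err != nil {
		return err
	}
	sts.Spec.Replicas = int32Ptr(1)
	return c.Update(ctx, sts)
}
```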

Member:

You argue that the restore operation, if executed in one pod, will result in high memory demand (etcd plus full restoration happening all in one pod). Yes, I see that if we can decouple that it’s better for us, i.e. detect the need before having scheduled etcd, restore when necessary and only then schedule etcd. By that we can decouple everything more nicely and solve the PV-can-only-be-mounted-once issue. Then again, it feels to me as if this could be ranked lower, wouldn’t you agree? The effort is very high and we have a solution that works today and that’s a rather rare operation anyways.

Collaborator Author:

The effort is not really that high. In fact, the current backup-restore image already supports a subcommand for restoration. So, technically, we can use the existing backup-restore image as a job to perform restoration. All that would be required is a controller to co-ordinate the sequence of scaling down the statefulset, triggering the job, waiting for completion and then scaling the statefulset back up. The existing functionality could be fully re-used even in terms of the binary docker image.

Collaborator Author:

I agree that restoration is rare, but given the likelihood that our customers might hit this problem before us, and the impact being quite high when the problem is encountered, I feel it is important to optimise the database restoration both in terms of memory and time. At least prio-2, if not prio-1.

## Database Incremental Backup Compaction

Incremental/continuous backup is used for finer-granularity backup (in the order of minutes), with full snapshots being taken at much larger intervals (in the order of hours). This makes the backup efficient in terms of disk, network bandwidth and backup storage space utilization, as well as compute resource utilization during backup.

However, if the proportion of changes in the incremental backups is large, this impacts the restoration time because incremental backups can only be restored in sequence, as mentioned in [#61](https://github.com/gardener/etcd-backup-restore/issues/61).

A controller can be used to periodically compact the incremental backups in the backup storage. This can optimize both the restoration time and the backup storage space utilization without affecting the regular backup performance, because such compaction would be done asynchronously.

This feature could be implemented as an enhancement of the [database restoration](#database-restoration) functionality.
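
As a sketch of how such asynchronous compaction could be scheduled with throttled resources, the controller might maintain a low-priority `CronJob` per etcd instance. The schedule, priority class, image and arguments below are illustrative assumptions, not the actual interface.

```go
// Sketch: a low-priority, resource-throttled CronJob per etcd instance that
// compacts the delta snapshots in the backup store asynchronously.
// Schedule, priority class, image and arguments are illustrative assumptions.
package compaction

import (
	batchv1 "k8s.io/api/batch/v1"
	corev1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

func compactionCronJob(namespace, name string) *batchv1.CronJob {
	return &batchv1.CronJob{
		ObjectMeta: metav1.ObjectMeta{Namespace: namespace, Name: name + "-compact"},
		Spec: batchv1.CronJobSpec{
			Schedule:          "0 */6 * * *",            // assumption: compact every six hours
			ConcurrencyPolicy: batchv1.ForbidConcurrent, // never run two compactions in parallel
			JobTemplate: batchv1.JobTemplateSpec{
				Spec: batchv1.JobSpec{
					Template: corev1.PodTemplateSpec{
						Spec: corev1.PodSpec{
							RestartPolicy:     corev1.RestartPolicyOnFailure,
							PriorityClassName: "backup-compaction-low", // assumption: low-priority class
							Containers: []corev1.Container{{
								Name:  "compact",
								Image: "etcd-backup-restore:latest",                  // assumption
								Args:  []string{"compact", "--store=<backup-store>"}, // hypothetical subcommand/flags
								Resources: corev1.ResourceRequirements{
									// Throttled resources so compaction never competes with etcd itself.
									Limits: corev1.ResourceList{
										corev1.ResourceCPU:    resource.MustParse("500m"),
										corev1.ResourceMemory: resource.MustParse("1Gi"),
									},
								},
							}},
						},
					},
				},
			},
		},
	}
}
```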

Member:

Optimization for incremental backup compaction and backup health verification could be a use case, though an expensive one (needs disk, compute and network to compact what is already in store for the rare case of a restoration). E.g. I am not sure whether restore optimisation is (already) that important. Also, shoot cluster migration to another seed should be a rare operation; it’s basically a DR measure. And as for the backup health verification, I have not heard that people have reason not to trust S3 or similar services. We also write new full backups again and again. That I really don’t see as an argument until we learn that it’s not trustworthy.

Collaborator Author:

The incremental backup compaction need not be that expensive at all. Yes, it would require disks, compute and network. But since it would be done in a separate job, it would be possible to do this with low-priority jobs with throttled resources and possibly on dedicated low-cost hardware (which could even be spot instances). They could be done serially (effectively, due to the throttled resources).

Collaborator Author:

There might be alternative solutions to the problem of optimising the restoration time (other than compaction in the background), e.g. increasing the full-snapshot frequency to reduce the number of incremental backup files. These should be pursued too. But my (not very deep) calculations indicate such approaches may not fully address the issue. I could be wrong, of course.

Collaborator Author:

I have addressed the rareness of a restoration above.

Collaborator Author:

Health check and major version upgrade were minor points. I agree they are not urgent at all. Health check, we might not want to do at all. I have updated the document to reflect this more clearly.

## Database Verification

Database verification is currently done on startup (or a restart) within the same backup-restore sidecar's main process.
There is a co-ordination shell script that makes sure that the etcd process is not started until verification is completed successfully.

Introducing a controller enables the option to perform the database verification only when needed as a separate job.
This has two advantages.
1. Decouple database verification from `etcd` restart. This has the potential to avoid unnecessary delays during every single `etcd` restart.
1. Decouple the memory requirement of a database verification from the regular backup (full and delta) tasks. This has the potential to reduce the memory requirement of the backup sidecar and isolate the memory spike of a database verification.

Of course, it is possible to decouple database verification from `etcd` restart without introducing a controller, but that would require the co-ordination shell script to be more complicated, as can be seen in [#93](https://github.com/gardener/etcd-backup-restore/pull/93). Such complicated shell scripts are generally better avoided, especially when they are not even part of any docker image.

Another alternative is to create a custom image for the etcd container that includes the co-ordination logic. This is also not very desirable.

As mentioned in the case of [database restoration](#database-restoration), database verification is a part of a subcommand of the existing backup-restore component. So, the existing verification functionality can be fully reused even at the level of the binary docker image.
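
To illustrate the first advantage, here is a rough sketch of how a controller might decide when verification is actually needed instead of paying for it on every restart. The signals used are illustrative assumptions, not an existing API; the verification itself would then run as a Job reusing the backup-restore image, analogous to the restoration sketch above.

```go
// Sketch: decide whether a verification Job is needed before the etcd
// StatefulSet is (re)started or scaled back up, instead of verifying on
// every restart. The signals below are illustrative assumptions.
package verification

// verificationNeeded returns true if the data directory's health is in doubt.
func verificationNeeded(dataDirInitialized bool, lastExitReason string) bool {
	switch {
	case !dataDirInitialized:
		// Fresh or restored volume: verify (and possibly restore) before serving.
		return true
	case lastExitReason == "Error" || lastExitReason == "OOMKilled":
		// Unclean shutdown: the database file may be corrupt.
		return true
	default:
		// Clean restarts (e.g. rolling updates) can skip verification entirely,
		// avoiding the delay currently paid on every etcd restart.
		return false
	}
}
```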

Member:

Yes, I mentioned database verification also on top, but how can you decouple it from etcd restart? You need the former before you can do the latter. So I only see the memory (like above), and only that, as a pro argument. Or do you mean an internal optimisation because of the current hand-over between the two? Then it is not a runtime improvement, but just an implementation improvement. OK, agreed, still not a compelling reason to change everything right now, especially since we have too much on our plate with etcd at present (the pressing needs, but also the etcd VPA topic).

Collaborator Author:

I agree with the priority. But I would like to address the effort aspect. Like in the case of restoration, this is already part of a subcommand in the backup-restore component. So, we can re-use practically everything in the sidecar as is, even at the binary docker image level. The only thing that would change is the shell script that does the co-ordination, which would have to be moved to the controller.

## Autoscaling

The [`VerticalPodAutoscaler`](https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler) supports multiple update policies including `recreate`, `initial` and `off`.
The `recreate` policy is clearly not suitable for single-node `etcd` instances because of the implications of frequent and unpredictable down-time.
The `initial` policy makes more sense when coupled with unlimited resource limits (but very clear autoscaled resource requests).

With a controller, even the `off` option becomes feasible, because the time for applying the `VerticalPodAutoscaler`'s recommendations could be decided in a very custom way while still relying on the `VerticalPodAutoscaler` for the recommendations themselves.
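
A sketch of how a controller could consume the recommendations while keeping the `updatePolicy` at `off` is given below; the VPA object name and the target container name are assumptions, and the VPA API group is assumed to be registered in the client's scheme.

```go
// Sketch: keep VPA's recommender (updatePolicy "off") but let the controller
// read the recommendation and decide itself when to apply it.
// The VPA object name and container name are illustrative assumptions.
package autoscaling

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	vpav1 "k8s.io/autoscaler/vertical-pod-autoscaler/pkg/apis/autoscaling.k8s.io/v1"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// recommendedRequests returns VPA's current target requests for the etcd container.
func recommendedRequests(ctx context.Context, c client.Client, namespace, vpaName string) (corev1.ResourceList, error) {
	vpa := &vpav1.VerticalPodAutoscaler{}
	if err := c.Get(ctx, client.ObjectKey{Namespace: namespace, Name: vpaName}, vpa); err != nil {
		return nil, err
	}
	if vpa.Status.Recommendation == nil {
		return nil, nil // no recommendation yet
	}
	for _, rec := range vpa.Status.Recommendation.ContainerRecommendations {
		if rec.ContainerName == "etcd" {
			return rec.Target, nil
		}
	}
	return nil, nil
}
```

When, and to which object (the `StatefulSet` directly, or a CRD's pod template as discussed in the comments below), the returned requests are applied would then be entirely up to the controller, for example only during a maintenance window or via a temporary scale-out.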

Member:

As for VPA, if it lacks a feature, shouldn’t we rather help it gain the missing feature? It needs to be crystal clear what’s missing, and that we won’t get it, before deciding to replicate/redo what VPA already has. So today this is no argument, at least for me.

Collaborator Author:

I am not advocating for an alternative to VPA. VPA is being evaluated with the `updatePolicy` `off` to begin with (just like for prometheus) and the intention is to use it with `updatePolicy` `initial` as the next step. There is no question of replicating or redoing VPA whatsoever.

But that is for now. It is very clear what is missing in VPA for non-disruptive autoscaling of etcd. I see no way to make the current mutating webhook approach of VPA work for non-disruptive autoscaling. The recommender part of VPA would still work very well.

This is the line of reasoning. VPA with anything other than `off` or `initial` (i.e. `recreate` or `auto`) would kill the pod directly and rely on the mutating webhook to inject the right resource recommendations. But this would be disruptive unless we have multi-node support. If we go for an always-multi-node etcd approach, this would not be such a problem. But our inclination has so far been to stick to single-node etcd for most scenarios and go for multi-node only if necessary.

So, if we want to go temporarily multi-node during VPA scaling, we need to sit between VPA’s recommendation and the updater. One way to do this would be to use a CRD for etcd with a pod template (and hence, resource recommendations). If we avoid VPA’s mutating webhook, we can still rely on VPA and its recommender (pretty much fully) but use a custom updater (or enhance VPA’s own updater) to update our CRD’s resource recommendations. The controller can then scale out the statefulset temporarily with the newer resource requirements (or permanently, if we opt for the always-multi-node approach) in a rolling manner and scale back in similarly.

This is all theory, of course, with a big if. Practice will have its say when we try it.

## Non-disruptive Maintenance

We currently support only single-node `etcd` instances.
So, any change to the `etcd` `StatefulSet` or its pods has a disruptive impact with down-time implications.

One way to make such changes less disruptive could be to temporarily scale the cluster into a multi-node (3) `etcd` cluster, perform the disruptive change in a rolling manner on each of the individual nodes of the multi-node `etcd` cluster and then scale the `etcd` cluster back down to a single instance.
This kind of multi-step process is better implemented as a controller.
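
A sketch of the sequence such a controller might drive is shown below; the interface and function names are placeholders, and a real controller would implement each step as one or more reconciliation passes rather than a single blocking call.

```go
// Sketch of the non-disruptive maintenance flow: temporarily grow the
// single-node etcd, roll the disruptive change member by member, shrink back.
// The interface and names are placeholders, not an existing API.
package maintenance

import "context"

// Cluster abstracts the operations the controller needs; a real implementation
// would act on the StatefulSet and the etcd members API.
type Cluster interface {
	ScaleTo(ctx context.Context, replicas int32) error
	WaitUntilHealthy(ctx context.Context) error
	RollOutChange(ctx context.Context) error // e.g. new etcd version or new resources
}

// NonDisruptiveUpdate performs the multi-step process described above.
func NonDisruptiveUpdate(ctx context.Context, c Cluster) error {
	steps := []func(context.Context) error{
		func(ctx context.Context) error { return c.ScaleTo(ctx, 3) }, // two members may also suffice (see comment below)
		c.WaitUntilHealthy, // all members in sync before any disruption
		c.RollOutChange,    // one member at a time, quorum stays intact
		c.WaitUntilHealthy,
		func(ctx context.Context) error { return c.ScaleTo(ctx, 1) }, // back to single-node
	}
	for _, step := range steps {
		if err := step(ctx); err != nil {
			return err
		}
	}
	return nil
}
```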

Member:

Just a minor comment: non-disruptive maintenance can also be implemented with two etcds and a controller. I would not invoke a three-node etcd cluster if not necessary.

Collaborator Author:

That could be possible. But as you mentioned, it would still require a controller, whether with 2 nodes or 3.

## Co-ordination outside of Gardener Reconciliation

Currently, the `etcd` `StatefulSet` is provisioned by Gardener and this is the only point of co-ordination for the `etcd` life-cycle management.
This couples the `etcd` life-cycle management with Gardener Reconciliation.

Because of the disruptive nature of scaling single-node `etcd` instances, it might make sense to restrict some of the low-priority life-cycle operations (for example, scaling down) to the maintenance window of the `Shoot` cluster that is backed by the given `etcd` instance.
It would be possible to implement this with the current sidecar approach, but it might be cleaner to do it as a controller (possibly with a `CustomResourceDefinition` to capture co-ordination information such as the maintenance window).
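
A sketch of what such a hypothetical custom resource could look like is given below; all type and field names are illustrative assumptions.

```go
// Sketch of a hypothetical custom resource the controller could reconcile.
// All type and field names are illustrative assumptions.
package v1alpha1

import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

// Etcd describes one etcd instance and the co-ordination information the
// controller needs outside of Gardener reconciliation.
type Etcd struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec EtcdSpec `json:"spec"`
}

// EtcdSpec carries the desired state plus the Shoot's maintenance window.
type EtcdSpec struct {
	Replicas int32 `json:"replicas"`

	// MaintenanceWindow mirrors the Shoot's maintenance window; the controller
	// defers disruptive, low-priority operations (e.g. scale-down) to it.
	MaintenanceWindow MaintenanceWindow `json:"maintenanceWindow,omitempty"`
}

// MaintenanceWindow is a daily time window, e.g. begin "220000+0000".
type MaintenanceWindow struct {
	Begin string `json:"begin"`
	End   string `json:"end"`
}
```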

Member:

Yes, more advanced functions like scale-down during maintenance windows could be implemented better by means of a controller. Then again, I am not sure whether this alone would make it necessary. We will roll seed cluster nodes fairly regularly, so there are natural windows of opportunity.

Collaborator Author:

Yes. But it might be better to make this possible even independently of such events.

## Backup Health Verification

Currently, we rely on the database backups in the storage provider remaining healthy. There are no additional checks to verify whether the backups are still healthy after upload.
A controller can be used to perform such backup health verification asynchronously.

This is a minor point which may not be important to implement.

Member:

See above, comment on "Database Incremental Backup Compaction".

## Major Version Upgrades

So far, the `etcd` versions we have been using have provided backward compatibility for the database. But it is possible that some future version breaks compatibility and requires a data migration. Without a controller, this would have to be done in an ad hoc manner. A controller would provide a good place to encode such logic.

This feature could be implemented on demand when a new etcd version requires such data migration.

Member:

Yes, migration will most likely always require a coordinator of sorts.

## Summary

As seen above, many of the problems encountered in the life-cycle management of `etcd` clusters can be addressed with just the current sidecar approach without introducing a controller.
However, introducing a controller provides many more alternative approaches to solving these problems while not removing the possibility of using the existing sidecar approach.
It can be argued that some of the problems, such as restoration, backup compaction, non-disruptive maintenance, multi-node support and co-ordination outside of Gardener reconciliation, are better addressed with a controller.

To be clear, introducing a controller need not preclude the continued use of the existing sidecar approach. For example, backups (full and delta) are probably better done via a sidecar.