Add Kubernetes topo implementation #5703
Conversation
@derekperkins feedback and discussion appreciated.
A couple of small comments.
I think this is awesome and I think all the getting started docs should be updated after this merges to remove the etcd-operator dependency. I'm very unfamiliar with the topo, so I'm not the best person to comment on the actual implementation, but I've got a couple of questions:
To answer @derekperkins's question: it's ~20MB added to every binary that includes client-go. Here are the size increases in detail:
| Component | Before | After | Increase |
|---|---|---|---|
| bin/vtcombo | 42M | 63M | 67% |
| bin/vtctl | 41M | 61M | 67% |
| bin/vtctld | 47M | 67M | 70% |
| bin/vtgate | 30M | 50M | 60% |
| bin/vttablet | 41M | 62M | 66% |
| bin/vtworker | 33M | 53M | 62% |
go/vt/topo/kubernetestopo/file.go (outdated diff):

```go
node.value = string(contents)

// Create "file" object
return &corev1.ConfigMap{
```
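For readers skimming the diff, here is a minimal, self-contained sketch of the ConfigMap-as-file pattern the snippet above hints at; the function name, label, and data key are hypothetical illustrations rather than the PR's actual code, and the `Create` signature assumes a client-go version that takes a context:

```go
package main

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// createTopoFile stores one topo "file" as a ConfigMap: a name derived from
// the file path identifies the object, and the contents go into Data.
func createTopoFile(ctx context.Context, client kubernetes.Interface, namespace, name string, contents []byte) (*corev1.ConfigMap, error) {
	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{
			Name:      name,
			Namespace: namespace,
			// Hypothetical label so topo objects can be listed together.
			Labels: map[string]string{"app": "vitess-topo"},
		},
		Data: map[string]string{"value": string(contents)},
	}
	return client.CoreV1().ConfigMaps(namespace).Create(ctx, cm, metav1.CreateOptions{})
}
```

One wrinkle the sketch glosses over: Kubernetes object names cannot contain `/`, so a real implementation has to encode or hash topo paths into valid names.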
Rather than using ConfigMaps, I think you should probably define a CustomResourceDefinition for these. Doing it via a custom resource definition means that:
- You can easily define separate ACLs for ConfigMaps and the Vitess Topology objects
- You could create a schema to define your data if it is well-defined
- You could make kubectl display useful default columns for vitess topology objects
- Cluster administrators can shard each object type into separate etcd clusters behind the scenes (ex: massive kube clusters may actually put `events` objects in a different etcd cluster due to the update volume)
- You don't pollute the output of `kubectl get configmaps` with thousands of topology-related objects
For context, there's a separate Vitess operator that uses CRDs to manage the whole cluster, so not using them is a choice, not a lack of understanding. @carsonoid actually spoke about operators at KubeCon in December.
The purpose of this is to have a dead simple topo available for free, especially for quick onboarding and smaller Vitess installations. If you care to shard etcd or do anything else, I think you would then just use the direct etcd topo implementation and manage it yourself, which is what the current helm implementation uses via etcd-operator.
`vtctlclient` is the Vitess way to connect to the topo, so managing resources via `kubectl` seems unnecessary here. Maybe I'm incorrect, but I believe that using CRDs would require us to deploy a separate controller to manage them, which is what this PR is specifically trying to avoid.
All that being said, maybe there is a good enough reason to use CRDs or change how we're using ConfigMaps; I just wanted to step back to make sure we're looking at the overall reasons for adding a new topo implementation.
> Rather than using ConfigMaps, I think you should probably define a CustomResourceDefinition for these. ...
I 100% agree that there are a lot of benefits that come with using a CRD for the resource instead of a ConfigMap. In fact, I have gone back and forth on using them here. As @derekperkins stated: the goal is to allow for a dead-simple topology server for small Kubernetes-based Vitess clusters. But we do need to balance that with scale and stability concerns.
There is added overhead when using a CRD for the objects:
- We would have to use https://github.com/kubernetes/code-generator to create the API and client code, and fit that into the current build system (see the sketch below for the kind of typed API this involves).
- We would need to figure out how to install the CRD definition when deploying Vitess to a new cluster. This may be added work for the end user or for the code maintainers, depending on who ends up owning this process: do we have a Vitess component create and maintain the CRD definition automatically, or do we require that it be done as part of the first-time work of running Vitess in a cluster?
TL;DR: the ConfigMap vs. CRD decision is still up in the air.
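For the first point above, here is a rough sketch of the kind of typed API file that code-generator consumes; the group, version, kind, and fields are invented for illustration and are not from this PR:

```go
// Package v1alpha1 holds a hypothetical typed API for topo objects.
// code-generator reads the marker comments below to emit DeepCopy methods
// and a typed clientset; wiring that generation step into the build is the
// overhead the first point refers to.
package v1alpha1

import (
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// +genclient
// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object

// VitessTopoNode stores a single topo entry as a custom resource.
type VitessTopoNode struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	// Data holds the raw value for this topo node.
	Data string `json:"data,omitempty"`
}

// +k8s:deepcopy-gen:interfaces=k8s.io/apimachinery/pkg/runtime.Object

// VitessTopoNodeList is the list type required by the generated client.
type VitessTopoNodeList struct {
	metav1.TypeMeta `json:",inline"`
	metav1.ListMeta `json:"metadata,omitempty"`
	Items           []VitessTopoNode `json:"items"`
}
```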
> For context, there's a separate Vitess operator that uses CRDs to manage the whole cluster, so not using them is a choice, not a lack of understanding. @carsonoid actually spoke about operators at KubeCon in December.
Yeah, for background on my end, I lead the team that owns the Kubernetes clusters over at HubSpot, so I definitely keep up with Vitess development a bit.
> The purpose of this is to have a dead simple topo available for free, especially for quick onboarding and smaller Vitess installations. If you care to shard etcd or do anything else, I think you would then just use the direct etcd topo implementation and manage it yourself, which is what the current helm implementation uses via etcd-operator.
There's no real reason this implementation cannot scale, however, and even for larger clusters there is a lot of benefit to doing this via the Kubernetes API versus etcd. The etcd-operator alone isn't enough to give larger orgs production-grade etcd clusters with appropriate auditing, backups, and authentication. The SQL team must then either own their own topology server, or depend on an etcd, ZooKeeper, or Consul team. Removing that requirement and using the Kubernetes API directly eliminates that burden.
> `vtctlclient` is the Vitess way to connect to the topo, so managing resources via `kubectl` seems unnecessary here.
It's not totally unprecedented for us to end up in the underlying topology store to debug or fix issues by hand, however. But yeah, the fields to display as `kubectl` columns would be pretty limited (maybe just the keyspace and tablet id).
> Maybe I'm incorrect, but I believe that using CRDs would require us to deploy a separate controller to manage them, which is what this PR is specifically trying to avoid.
Creating a new CRD definition in the cluster is just a `kubectl apply` away, and then it's just another object you can put data in, so no additional controller code is necessary.
> We would need to figure out how to install the CRD definition when deploying Vitess to a new cluster. This may be added work for the end user or for the code maintainers, depending on who ends up owning this process: do we have a Vitess component create and maintain the CRD definition automatically, or do we require that it be done as part of the first-time work of running Vitess in a cluster?
For the case of creating the CRD definitions in the API, it seems reasonable enough to make someone `kubectl apply` it, but it also seems pretty reasonable to automatically create it if it doesn't exist and tell the user to do it manually if you hit a permission error (see the sketch below). They already have to apply other things, so it shouldn't be too much of a burden on the user's end (and far, far less work than setting up etcd or ZooKeeper), so long as the error messaging is clear when it hasn't been created yet.
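A hypothetical sketch of that create-or-instruct fallback, assuming a recent `apiextensions-apiserver` client; the group `topo.vitess.io`, the kind `VitessTopoNode`, and the helper name are all invented for illustration:

```go
package main

import (
	"context"
	"fmt"

	apiextensionsv1 "k8s.io/apiextensions-apiserver/pkg/apis/apiextensions/v1"
	apiextensionsclient "k8s.io/apiextensions-apiserver/pkg/client/clientset/clientset"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/rest"
)

// ensureCRD tries to register a hypothetical topo CRD, succeeding quietly if
// it already exists and telling the user what to do when RBAC forbids it.
func ensureCRD(ctx context.Context, cfg *rest.Config) error {
	client, err := apiextensionsclient.NewForConfig(cfg)
	if err != nil {
		return err
	}
	preserve := true
	crd := &apiextensionsv1.CustomResourceDefinition{
		ObjectMeta: metav1.ObjectMeta{Name: "vitesstoponodes.topo.vitess.io"},
		Spec: apiextensionsv1.CustomResourceDefinitionSpec{
			Group: "topo.vitess.io",
			Scope: apiextensionsv1.NamespaceScoped,
			Names: apiextensionsv1.CustomResourceDefinitionNames{
				Plural: "vitesstoponodes", Singular: "vitesstoponode", Kind: "VitessTopoNode",
			},
			Versions: []apiextensionsv1.CustomResourceDefinitionVersion{{
				Name: "v1alpha1", Served: true, Storage: true,
				// Accept arbitrary payloads instead of a strict schema.
				Schema: &apiextensionsv1.CustomResourceValidation{
					OpenAPIV3Schema: &apiextensionsv1.JSONSchemaProps{
						Type: "object", XPreserveUnknownFields: &preserve,
					},
				},
			}},
		},
	}
	_, err = client.ApiextensionsV1().CustomResourceDefinitions().Create(ctx, crd, metav1.CreateOptions{})
	switch {
	case err == nil || apierrors.IsAlreadyExists(err):
		return nil // created, or someone already applied it
	case apierrors.IsForbidden(err):
		return fmt.Errorf("no permission to create the topo CRD; please apply it manually with kubectl: %w", err)
	default:
		return err
	}
}
```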
Anyway, I think we probably agree that CRDs are the correct technical solution. I do understand that ConfigMaps are slightly easier to use, as they don't require the code-generation bit or applying the definition into the cluster, but if you can nail those two pieces down, those are really the only cons.
> There's no real reason this implementation cannot scale, however, and even for larger clusters there is a lot of benefit to doing this via the Kubernetes API versus etcd. The etcd-operator alone isn't enough to give larger orgs production-grade etcd clusters with appropriate auditing, backups, and authentication. The SQL team must then either own their own topology server, or depend on an etcd, ZooKeeper, or Consul team. Removing that requirement and using the Kubernetes API directly eliminates that burden.
That's great feedback. I hadn't expected someone at HubSpot's scale to be interested in putting the topo into k8s, so we can definitely revisit the expectations for this topo implementation. Questions I have around that:
- Curious how that scales when using GKE or another hosted system that controls the masters
- What about global vs separate zone topos? Unless you did something to shard those under the hood, you'd have a SPOF. (to be fair, that's how helm does it with everything in a single etcd-operator cluster)
- Where do we draw the line between this topo implementation and the actual operator?
Another important question - what happens when you're upgrading a cluster? How does that impact topo availability?
> Curious how that scales when using GKE or another hosted system that controls the masters
What is your update frequency on these objects? Our Kubernetes clusters push infinitely more QPS through etcd than Vitess does to its topology backend; I doubt we'd even perceive a load increase. Consider that every pod or node status change or event incurs a write to the Kubernetes API. We don't even run particularly large etcd servers and often push 100-400MiB/sec through them from Kubernetes without breaking a sweat. A given pod is likely more load on the Kubernetes API than your topology data for that pod. Best of all, in this case someone else owns your production-grade etcd cluster and the ops aren't your problem.
> What about global vs separate zone topos? Unless you did something to shard those under the hood, you'd have a SPOF. (to be fair, that's how helm does it with everything in a single etcd-operator cluster)
As it currently stands, we do not make use of global vs local, as our current intention for multi-region is that every region is a fully independent shard of our entire stack. However, we actually do operate a "global" Kubernetes API which is distributed across regions: it's etcd plus a Kubernetes apiserver with no Kubernetes nodes, which we use to stash custom resources that we then replicate into region-local Kubernetes clusters. So in our case we actually would be capable of using a global Kubernetes API for topology if we wanted to, but as long as the global topology can use a different backend configuration than the local topology, I think you'd cover all use cases.
> Where do we draw the line between this topo implementation and the actual operator?
The topology backend should definitely be independent from the operator; you wouldn't want to force people onto the operator in order to use it. In general, I perceive the Kubernetes API as a CRUD service on typed objects that all other components use to stash their state in, so it's as good a topology store as any other. When you create a pod (or anything else) against the Kubernetes API, nothing occurs synchronously: the scheduler, controllers, and nodes all just watch the API to see if there are any declarative objects they are supposed to take action on.
> Another important question - what happens when you're upgrading a cluster? How does that impact topo availability?
Kubernetes cluster upgrades involve zero API downtime. The controller-manager and scheduler may have a blip in activity while the masters aren't all on the same version, but the API is up the whole time.
> as long as the global topology can use a different backend configuration than the local topology, I think you'd cover all use cases.
If someone wanted to use different topology types, I'm pretty sure that would require an upstream change, since there's only the one topo flag I'm aware of that controls both the global and local plugin: `-topo_implementation`. In your case, where you'd still be using the k8s topo for both global and local, that shouldn't require any extra work.
This all sounds great. I think I'm now on team CRD. Whether someone is getting started with helm or with the operator when it's more available, we should be able to register the CRDs silently so that the user doesn't have to know about the topo configuration. More advanced users can just apply them directly.
@enisoc any feedback on this discussion? Anything that we're not considering?
FWIW, I'm also convinced about the CRD change (I was already leaning that way anyway), and I will work on moving the current implementation over. Using ConfigMaps worked as a quick test of the concept, but the CRD path will be much better now that this is getting more traction.
I've got the …
I've got it completely moved over to CRDs now. I just want to run through the local example and make sure everything works, and then we can talk about next steps.
The full test is done! I've been able to run through the entire local example using the k8s API as a topology store and all of the examples worked!
@carsonoid can you update the copyright on all the files to 2020, and also fix the DCO (…
@deepthi updated as requested. It looks like some of the tests are currently failing due to an unrelated Maven repo timeout, though. I'll retry them in a bit.
Very nice work.
I understand that we need to run some kind of k8s server to act as the topo, and k3s is the most lightweight choice, though it is limited to Linux.
It's great that you have included all the topo tests that are available for the other flavors.
My comments are mostly nits and/or naming except for the incorrect copyright notices.
@deepthi thanks for all the nits :) I have updated nearly all of them as requested. I have also explicitly set up the tests to report as skipped when run on a non-Linux machine.
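For reference, a skip guard like this is the standard Go testing pattern; the sketch below is illustrative rather than the PR's actual test code:

```go
package kubernetestopo

import (
	"runtime"
	"testing"
)

// TestKubernetesTopo depends on a local k3s server, which only runs on
// Linux, so it reports itself as skipped everywhere else.
func TestKubernetesTopo(t *testing.T) {
	if runtime.GOOS != "linux" {
		t.Skip("k3s-backed topo tests require Linux")
	}
	// ... start k3s and run the shared topo test suite here ...
}
```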
Looks good with the changes (except for one copyright notice).
I can approve once tests are all passing.
Signed-off-by: Carson Anderson <[email protected]>
@deepthi updated as requested and the check page shows all green.
Note to all: there was one check showing as yellow, the local example on Ubuntu. This is because that test is required (and renamed in this PR to "local example using etcd on ubuntu"). I have temporarily disabled the required test. We can re-enable it a couple of days after this PR merges, so that it gives in-flight PRs a chance to merge without having to rebase on master to pick up the rename.
LGTM. I'm excited to have this merged!
@PaulFurtado Any last feedback before we merge this?
This reverts commit f97eef3, reversing changes made to 9e74b07. Signed-off-by: Andrew Mason <[email protected]>
I wanted to get some initial feedback on my implementation of the Kubernetes topology service. This resolves #5701.
The code currently works to bring up a basic cluster on a Linux host using: …

Known Issues that warrant discussion:

- `k3s` running locally to provide a minimal Kubernetes API. This means that the topology is not available to anyone but Linux users.
- All topology data is synced from the API server. This may not scale, and could slow down `vtctl` or other operations that are done against the topology with processes that are not long-lived, if there is too much topology data.
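To make the second point concrete, here is a rough sketch (not the PR's code) of a watch-backed local cache built with client-go informers; it assumes ConfigMap-backed storage for simplicity, and the function name and resync period are illustrative:

```go
package main

import (
	"context"
	"errors"
	"time"

	"k8s.io/client-go/informers"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// buildTopoCache warms an in-memory copy of all topo objects by listing and
// then watching the API server. With a lot of topology data, this initial
// sync is the startup cost the known issue above worries about.
func buildTopoCache(ctx context.Context, client kubernetes.Interface, namespace string) (cache.Store, error) {
	factory := informers.NewSharedInformerFactoryWithOptions(
		client, 30*time.Second, informers.WithNamespace(namespace))
	informer := factory.Core().V1().ConfigMaps().Informer()

	factory.Start(ctx.Done())
	// Block until every existing object has landed in the local store once.
	if !cache.WaitForCacheSync(ctx.Done(), informer.HasSynced) {
		return nil, errors.New("timed out waiting for topo cache sync")
	}
	return informer.GetStore(), nil
}
```

Long-lived servers amortize that warm-up across their lifetime, which is why the concern is specifically about short-lived processes like `vtctl`.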