
EPIC: High-Level Technical Plan migration plan #10363

Open
yuval-k opened this issue Nov 19, 2024 · 11 comments

Comments

@yuval-k
Contributor

yuval-k commented Nov 19, 2024

Background

Solo.io is donating its gloo-gateway project to the CNCF. As part of this process, we will need to restructure some of the code for the following reasons:

  • Currently the Gloo APIs contain some proprietary fields that should not be present in a CNCF project.
  • Currently Gloo supports a variety of APIs: Knative, the k8s Ingress API, gloo's own gateway API, and the Kubernetes Gateway API. With the donation, we'll want to focus on supporting just the Kubernetes Gateway API.

Goals

  • Remove gloo's proprietary APIs. We also want to change the group and kind of the CRDs to KubeGateway.
    • This will be done with care, to make sure not to break existing production users.
  • Support extension points to customize translation behavior using extension krt collections (a sketch follows this list).
  • Use existing APIs as much as possible (i.e. no Upstream of Kube type where we can use Services), to reduce cognitive load on users.
  • Use familiar development patterns (i.e. kubebuilder for CRDs vs SoloKit) for easier DevEx for community developers.
  • Allow existing users a migration path.
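To make the extension-collection idea concrete, here is a minimal sketch (krt is Istio's istio.io/istio/pkg/kube/krt package; RouteOption and RoutePolicyIR are hypothetical names, not final APIs):

package extensions

import (
	"istio.io/istio/pkg/kube/krt"
)

// RouteOption stands in for a policy CRD; RoutePolicyIR is its pre-processed
// form. Both names are hypothetical placeholders.
type RouteOption struct {
	Namespace, Name, TargetRefName string
}

func (r RouteOption) ResourceName() string { return r.Namespace + "/" + r.Name }

type RoutePolicyIR struct {
	Key       string
	TargetRef string
	// pre-computed envoy config would live here
}

func (r RoutePolicyIR) ResourceName() string { return r.Key }

// BuildPolicyCollection derives an IR collection from the raw CRD collection;
// the core translator consumes RoutePolicyIR and never sees the concrete CRD type.
func BuildPolicyCollection(routeOptions krt.Collection[RouteOption]) krt.Collection[RoutePolicyIR] {
	return krt.NewCollection(routeOptions, func(ctx krt.HandlerContext, ro RouteOption) *RoutePolicyIR {
		return &RoutePolicyIR{
			Key:       ro.ResourceName(),
			TargetRef: ro.TargetRefName,
		}
	})
}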

Non-Goals

  • No need to support all of gloo's plugins if they are hard to port over, especially the ones that are not in wide use by the community (HTTP tunneling comes to mind).

Plan

Overview:

  • Fix CI
  • Bisect gloo and KubeGateway
  • Create an IR that allows opaquely extending the control plane
  • Helm chart

Order of operations:

  • The CI work should be done first; the Helm chart is independent and can be done in parallel with the rest.
  • After CI is done, "Bisecting gloo" and "abstracting the api snapshot" can be done in parallel.
  • "Translation logic using IR" can start after "abstracting the api snapshot" is done.
  • "Iterate on the IR" represents multiple tasks that can be done in parallel after "translation logic using IR" is done.

Fix CI (should be done first)

  • We probably don't need the bulldozer bot, as GitHub has added the features it gave us - we can replace it with merge queues and the merge-when-ready button.
  • Changelog bot: we need to figure something out here, but we can start without it.
  • Consolidate on GitHub Actions (no gcloud build).

Ideally this should be done first, so we have CI for the next steps.

Bisect gloo and KubeGateway

Goal:

  • In KubeGateway - we keep the translation part of gloo, and remove gloo's SetupSync, EnvoySyncer, and most other gloo code that is not related to translation and plugins and that this project does not depend on. If easier, we can initially just disable the gloo syncer before we delete it.
  • In gloo - we remove the k8s gateway support (this is the easy part). Initially, keep the plugins and translator. Later on, re-write the plugins to fit the k8sgateway plugin model, and have k8sgateway be an extension for gloo.

Steps:

  • Move the xds-server serving to k8s gateway (should be easy; we should consider using upstream go-control-plane, as it now supports delta xDS - a sketch follows this list).
  • We can probably drop some of the snapshot sanitizers in favor of envoy options (i.e. no need to handle a nonexistent cluster in the snapshot).
  • Disable calling gloo's setup function in our main function.
  • At this point KubeGateway can still read gloo's Upstreams and RouteOptions if the CRDs are installed.
    • KubeGateway will not install them on its own.
    • See the questions section below on whether we should write statuses.
  • Separating the xds-server:
    • Today, KubeGateway uses gloo's xds-server; going forward, KubeGateway will have its own xds-server. See more in the questions section below.
  • Admin page:
    • KubeGateway currently uses gloo's admin page. We will need an admin page of its own instead.
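For reference, a minimal sketch of a standalone xds-server on upstream go-control-plane (the port and setup details are placeholders, not decisions); note that the same Server instance serves both state-of-the-world and delta xDS streams, and for the sanitizer point above, route configs can set validate_clusters to false instead of us sanitizing missing clusters out of the snapshot:

package main

import (
	"context"
	"net"

	discoverygrpc "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
	cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	serverv3 "github.com/envoyproxy/go-control-plane/pkg/server/v3"
	"google.golang.org/grpc"
)

func main() {
	ctx := context.Background()

	// Snapshot cache keyed by node ID (ads=false here for simplicity).
	snapshotCache := cachev3.NewSnapshotCache(false, cachev3.IDHash{}, nil)

	// One server instance handles both SotW and delta xDS.
	srv := serverv3.NewServer(ctx, snapshotCache, nil)

	grpcServer := grpc.NewServer()
	discoverygrpc.RegisterAggregatedDiscoveryServiceServer(grpcServer, srv)

	lis, err := net.Listen("tcp", ":18000") // placeholder port
	if err != nil {
		panic(err)
	}
	_ = grpcServer.Serve(lis)
}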

Opaque extensibility

Goal:

  • Stop using solo-kit's ApiSnapshot object, as it is coupled to gloo's APIs (this step also enables performance benefits).
  • Create an extension model, such that things that targetRef a route (like RouteOptions) can mutate it without the translation being aware of their underlying type.

Steps:

  • TBD, but probably something like a collection of mutation functions (a sketch follows).
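As a rough sketch of that idea (all names here are hypothetical, since the step is still TBD):

package extensions

import (
	routev3 "github.com/envoyproxy/go-control-plane/envoy/config/route/v3"
)

// RouteMutation mutates an Envoy route during translation; the translation
// core invokes it without knowing the policy's underlying type.
type RouteMutation func(out *routev3.Route)

// RouteMutationIndex collects mutations keyed by the route they targetRef
// (the key format here is a placeholder).
type RouteMutationIndex map[string][]RouteMutation

// applyMutations runs every mutation registered for a route.
func applyMutations(idx RouteMutationIndex, routeKey string, out *routev3.Route) {
	for _, m := range idx[routeKey] {
		m(out)
	}
}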

Helm chart (can be done in parallel)

  • I think we can start a new helm chart for KubeGateway - it will be simpler as it won't include the proxy deployment parts. I don't believe we need code-gen for the helm chart.

Questions / Discussion

  • Do we want KubeGateway to write Status to gloo's RouteOptions? Given that they are "deprecated" in KubeGateway, I think not. Upstream status writing also requires some thought.
  • Regarding the control plane and snapshot cache: currently, KubeGateway re-uses gloo's xds-server. To keep things simple, I suggest we don't support re-using KubeGateway's xds-server in gloo. Instead, gloo will have 2 control planes (on different ports).
  • How do we handle extensions that require the old proxy? cc @lgadban (Note: we just need to allow for it here; I think eventually we will need to re-write this logic so it doesn't rely on the proxy object.)
  • I suspect these plugins are not worth porting over (lack of use). They can still remain in gloo:
    • http tunnel
    • gloo's rest/grpc (though we may want to keep the regular json->grpc plugin)
    • consul/nomad support
    • ec2
    • service_spec \ subset_spec on upstream
  • Any reason not to use upstream go-control-plane? It seems to support more features now than the solo-kit one.
@yuval-k
Contributor Author

yuval-k commented Nov 20, 2024

Some more issues to handle:

  • How do we do RouteOptions/Upstream statuses
  • The Settings CRD has a gloo subsection in it
  • Stats names that have gloo in them

@yuval-k
Contributor Author

yuval-k commented Nov 21, 2024

Re: migration - it doesn't seem likely that we can share plugin code. We can export the complex parts of the plugin code as utility functions so that these can be re-used.

@nfuden
Contributor

nfuden commented Nov 25, 2024

I think that KubeGateway likely shouldn't write statuses to deprecated objects. If implementors want to wrap KubeGateway, they should be able to do so in a way where there is an extension point to handle additional status setting if needed.

@ilrudie
Contributor

ilrudie commented Nov 25, 2024

> I think that KubeGateway likely shouldn't write statuses to deprecated objects. If implementors want to wrap KubeGateway, they should be able to do so in a way where there is an extension point to handle additional status setting if needed.

It seems like it may be problematic to read a deprecated type, causing internal state and the resulting proxy config to change, but not write any status noting that we did this. 🤔

@yuval-k
Contributor Author

yuval-k commented Nov 30, 2024

Proposal for plugins

This proposal is not very different from what we have today in ggv2; it needed to be updated, as we now translate directly to envoy resources, and not to gloo resources.

Base interfaces:

type ObjectSource struct {
	Group     string `json:"group,omitempty"`
	Kind      string `json:"kind,omitempty"`
	Namespace string `json:"namespace,omitempty"`
	Name      string `json:"name"`
}

type Upstream struct {
	// Ref to the source object. Sometimes the group and kind are not populated
	// by the api-server, so set them explicitly here, and pass this around as the reference.
	ObjectSource `json:",inline"`

	// Prefix the cluster name with this string to distinguish it from other GVKs.
	// Explicit here as it shows up in stats. Each (group, kind) pair should have a unique prefix.
	GvPrefix string
	// For things that integrate with destination rule, we need to know what hostname to use.
	CanonicalHostname string
	// The original object. Opaque to us other than metadata.
	Obj metav1.ObjectMetaAccessor
}

type UpstreamCtx struct {
	Upstream Upstream
	KrtCtx   KCtx
	Client   Ucc
}

type TranslationCtx struct {
	KrtCtx KCtx
	Client Ucc
}

// Upstream plugin: directly translates an upstream to a cluster.
type UpstreamPlugin interface {
	ApplyForUpstream(context.Context, UpstreamCtx, Cluster)
}

// Proxy plugin: translates a proxy IR to a gateway and its listeners.
type ProxyPlugin interface {
	// Returns a new context for proxy translation.
	// See the definition of the ProxyTranslationPass interface below.
	// The object returned as ProxyTranslationPass may hold internal-only state. It should not have
	// global state, or state tied to the parent proxy plugin. It may read state from the ProxyPlugin
	// (i.e. krt collections), but not modify it.
	// Could be called from multiple go-routines in parallel.
	NewTranslationPass(context.Context, TranslationCtx) ProxyTranslationPass
}

A translation pass operates on a "proxy". A proxy is the flattened representation of all our route rules in a listener.

A translation pass must not do any IO (like DNS resolution) or spawn any goroutines.
Translation must be O(1) in input size (it must not do any lookups that are not indexed).
All heavy operations should have already been done in the plugin lifecycle by
building collections that pre-process things. We can preprocess things that we currently do in the
plugin itself.

For example:

  • Validation can be done ahead of time.
  • Creating perFilterConfig can be done ahead of time (sketched after this list).
  • Merging multiple CRDs can be done ahead of time.
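For the perFilterConfig case, a minimal sketch of the split (precompute and attach are hypothetical helpers): the heavy anypb marshalling happens once in the collection transform, and the pass only does a map insert.

package extensions

import (
	routev3 "github.com/envoyproxy/go-control-plane/envoy/config/route/v3"
	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/known/anypb"
)

// precompute runs once, in the collection transform; heavy work is fine here.
func precompute(policy proto.Message) (*anypb.Any, error) {
	return anypb.New(policy)
}

// attach runs per route in the translation pass: O(1), just a map insert.
func attach(out *routev3.Route, filterName string, cfg *anypb.Any) {
	if out.TypedPerFilterConfig == nil {
		out.TypedPerFilterConfig = map[string]*anypb.Any{}
	}
	out.TypedPerFilterConfig[filterName] = cfg
}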

Ideally, no errors are returned in translation - the translation semantics are that it always completes. Each
possible error needs to have clear semantics (likely removal of the offending object). Each
plugin reports errors to the reporter; the reporter can debug-log them. (This needs further validation.)

type ProxyTranslationPass interface {
	// Called once for each listener.
	ApplyListenerPlugin(
		ctx context.Context,
		pCtx *plugins.ListenerContext,
		out *envoy.Listener,
	)

	ApplyVhostPlugin(
		ctx context.Context,
		pCtx *plugins.VirtualHostContext,
		out *envoy.VirtualHost,
	)

	// Called for each route rule.
	ApplyForRoute(
		ctx context.Context,
		pCtx *plugins.RouteContext,
		out *envoy.Route,
	)

	// Called exactly once.
	// If a plugin emits new filters, they must have a unique name (so different plugins can emit the same filters).
	HttpFilters(ctx, ...)
}
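As a sketch of how a plugin could implement this (timeoutPlugin and timeoutPass are hypothetical names, and the context types follow the sketch interfaces above):

// Hypothetical plugin: read-only state (e.g. krt collections of pre-processed
// policy) lives on the plugin itself and is never mutated by a pass.
type timeoutPlugin struct {
	// e.g. policies krt.Collection[RoutePolicyIR]
}

func (p *timeoutPlugin) NewTranslationPass(ctx context.Context, tctx TranslationCtx) ProxyTranslationPass {
	// A fresh pass per translation, so multiple translations can run in parallel.
	return &timeoutPass{}
}

// Per-pass, internal-only state; never shared across translations.
type timeoutPass struct {
	needsFilter bool
}

func (t *timeoutPass) ApplyForRoute(ctx context.Context, pCtx *plugins.RouteContext, out *envoy.Route) {
	// O(1): read a pre-computed policy from pCtx and set it on out;
	// remember that this pass needs its HTTP filter emitted in HttpFilters.
	t.needsFilter = true
}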

@yuval-k
Contributor Author

yuval-k commented Nov 30, 2024

Looking at the plugins from the open source repo:
most of them are easy to port. I have notes on the following:

  • consul: no need to port
  • Deprecated ciphers passthrough: this one uses the matcher framework for filter chains - we need to make a decision here: do we plan to use the matcher framework for all of our system? How do we abstract the filter chain matches so that plugins can modify them?
  • dynamic-forward-proxy: currently this is implemented as a destination, so not usable with the Gateway API, I believe.
  • grpc: no need to port
  • kubernetes: no need to port
  • local-ratelimit: combines config from listener and route. Shouldn't be hard to port over, but route plugins need access to the listener, so we can't preprocess it.
  • ratelimit: need to figure out the interaction between the rate limit server and the inline rate limits (perhaps deprecate inline server config?). A basic version with a BYO server should be easy to port.
  • rest: no need to port
  • tcp: not sure how this interacts with the GW API. Needs further analysis.
  • tunneling: can't port this over, needs a refactor

@lgadban
Contributor

lgadban commented Dec 2, 2024

> 	ApplyVhostPlugin(
> 		ctx context.Context,
> 		pCtx *plugins.VirtualHostContext,
> 		out *envoy.VirtualHost,
> 	)

Why do we want to keep VHost around? IMO we should do everything possible to get rid of this concept and map only to the Gateway API concepts, i.e.:

  • Gateway/Listener
    • Probably just Listener, where Gateway attachment means all Listeners in the Gateway
  • Route

@lgadban
Contributor

lgadban commented Dec 2, 2024

> I think we can start a new helm chart for KubeGateway - it will be simpler as it won't include the proxy deployment parts.

Initially this could be fine, but we will need some way for users to create a static proxy to enable static gateway provisioning. I also think we should revisit our implementation in general here, related to https://github.com/solo-io/solo-projects/issues/7302

@yuval-k
Contributor Author

yuval-k commented Dec 2, 2024

> 	ApplyVhostPlugin(
> 		ctx context.Context,
> 		pCtx *plugins.VirtualHostContext,
> 		out *envoy.VirtualHost,
> 	)
>
> Why do we want to keep VHost around? IMO we should do everything possible to get rid of this concept and map only to the Gateway API concepts, i.e.:
>
> • Gateway/Listener
>   • Probably just Listener, where Gateway attachment means all Listeners in the Gateway
> • Route

This is more of a sketch; I don't mind the specifics; I'm OK with doing the GW API semantics.

@yuval-k
Contributor Author

yuval-k commented Dec 2, 2024

> > I think we can start a new helm chart for KubeGateway - it will be simpler as it won't include the proxy deployment parts.
>
> Initially this could be fine, but we will need some way for users to create a static proxy to enable static gateway provisioning. I also think we should revisit our implementation in general here, related to solo-io/solo-projects#7302

Makes sense; maybe we can do 2 helm charts? So users can deploy the gateway manually if they want to, and we will use the same chart for auto-deploy?

@yuval-k
Contributor Author

yuval-k commented Dec 2, 2024

> 	ApplyVhostPlugin(
> 		ctx context.Context,
> 		pCtx *plugins.VirtualHostContext,
> 		out *envoy.VirtualHost,
> 	)
>
> Why do we want to keep VHost around? IMO we should do everything possible to get rid of this concept and map only to the Gateway API concepts, i.e.:
>
> • Gateway/Listener
>   • Probably just Listener, where Gateway attachment means all Listeners in the Gateway
> • Route
>
> This is more of a sketch; I don't mind the specifics; I'm OK with doing the GW API semantics.

To clarify a bit more: the idea here is to tweak the plugin model a bit so that each translation means a new call to NewTranslationPass (the sketch part is what happens inside a translation pass). This way translation can happen concurrently (which cannot happen today; a sketch follows).
This melds together the ggv2 and gloo plugin models.
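A minimal sketch of the concurrency this enables (proxies, plugins, tctx, and translate are hypothetical stand-ins): each proxy translation builds its own set of passes, so proxies can translate in parallel.

var wg sync.WaitGroup
for _, proxy := range proxies {
	wg.Add(1)
	go func(p Proxy) {
		defer wg.Done()
		// Fresh passes per translation; no state is shared between goroutines.
		passes := make([]ProxyTranslationPass, 0, len(plugins))
		for _, plug := range plugins {
			passes = append(passes, plug.NewTranslationPass(ctx, tctx))
		}
		translate(p, passes) // hypothetical: invokes the Apply* callbacks per listener/vhost/route
	}(proxy)
}
wg.Wait()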
