
EPIC: High-Level Technical Plan migration plan #10363

Open
yuval-k opened this issue Nov 19, 2024 · 11 comments

Comments

@yuval-k
Contributor

yuval-k commented Nov 19, 2024

Background

Solo.io is donating its gloo-gateway project to the CNCF. As part of this process, we will need to restructure some of the code for the following reasons:

  • Currently the Gloo APIs contain some proprietary fields that should not be present in a CNCF project.
  • Currently Gloo supports a variety of APIs: Knative, the k8s Ingress API, gloo's own gateway API, and the Kubernetes Gateway API. With the donation, we'll want to focus on supporting just the Kubernetes Gateway API.

Goals

  • Remove gloo's proprietary APIs. We also want to change the group and kind of the CRDs to KubeGateway.
    • This will be done with care, to make sure not to break existing production users.
  • Support extension points to customize translation behavior using extension krt collections (a sketch follows this list).
  • Use existing APIs as much as possible (i.e. no Upstream of Kube type where we can use Services), to reduce cognitive load on users.
  • Use familiar development patterns (i.e. kubebuilder for CRDs vs SoloKit) for easier DevEx for community developers.
  • Allow existing users a migration path.
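To make the extension-collection idea concrete, here is a minimal sketch (krt is Istio's istio.io/istio/pkg/kube/krt package; RouteOption and RoutePolicyIR are hypothetical names, not final APIs):

package extensions

import (
	"istio.io/istio/pkg/kube/krt"
)

// RouteOption stands in for a policy CRD; RoutePolicyIR is its pre-processed
// form. Both names are hypothetical placeholders.
type RouteOption struct {
	Namespace, Name, TargetRefName string
}

func (r RouteOption) ResourceName() string { return r.Namespace + "/" + r.Name }

type RoutePolicyIR struct {
	Key       string
	TargetRef string
	// pre-computed envoy config would live here
}

func (r RoutePolicyIR) ResourceName() string { return r.Key }

// BuildPolicyCollection derives an IR collection from the raw CRD collection;
// the core translator consumes RoutePolicyIR and never sees the concrete CRD type.
func BuildPolicyCollection(routeOptions krt.Collection[RouteOption]) krt.Collection[RoutePolicyIR] {
	return krt.NewCollection(routeOptions, func(ctx krt.HandlerContext, ro RouteOption) *RoutePolicyIR {
		return &RoutePolicyIR{
			Key:       ro.ResourceName(),
			TargetRef: ro.TargetRefName,
		}
	})
}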

Non-Goals

  • No need to support all of gloo's plugins if they are hard to port over, especially the ones that are not in wide use by the community (HTTP tunneling comes to mind).

Plan

Overview:

  • Fix CI
  • Bisect gloo and KubeGateway
  • Create an IR that allows opaquely extending the control plane
  • Helm chart

Order of operations:

  • The CI work should be done first; the Helm chart is independent and can be done in parallel with the rest.
  • After CI is done, "Bisecting gloo" and "abstracting the api snapshot" can be done in parallel.
  • "Translation logic using IR" can start after "abstracting the api snapshot" is done.
  • "Iterate on the IR" represents multiple tasks that can be done in parallel after "translation logic using IR" is done.

Fix CI (should be done first)

  • We probably don't need the bulldozer bot, as GitHub has added the features it gave us - we can replace it with merge queues and the merge-when-ready button.
  • Changelog bot: we need to figure something out here, but we can start without it.
  • Consolidate on GitHub Actions (no gcloud build).

Ideally this should be done first, so we have CI for the next steps.

Bisect gloo and KubeGateway

Goal:

  • In KubeGateway - we keep the translation part of gloo, and remove gloo's SetupSync, EnvoySyncer, and most other gloo code that is not related to translation and plugins and that this project does not depend on. If easier, we can initially just disable the gloo syncer before we delete it.
  • In gloo - we remove the k8s gateway support (this is the easy part). Initially, keep the plugins and translator. Later on, re-write the plugins to fit the k8sgateway plugin model, and have k8sgateway be an extension for gloo.

Steps:

  • Move the xds-server serving to k8s gateway (should be easy; we should consider using upstream go-control-plane, as it now supports delta xDS - a sketch follows this list).
  • We can probably drop some of the snapshot sanitizers in favor of envoy options (i.e. no need to handle a nonexistent cluster in the snapshot).
  • Disable calling gloo's setup function in our main function.
  • At this point KubeGateway can still read gloo's Upstreams and RouteOptions if the CRDs are installed.
    • KubeGateway will not install them on its own.
    • See the questions section below on whether we should write statuses.
  • Separating the xds-server:
    • Today, KubeGateway uses gloo's xds-server; going forward, KubeGateway will have its own xds-server. See more in the questions section below.
  • Admin page:
    • KubeGateway currently uses gloo's admin page. We will need an admin page of its own instead.
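For reference, a minimal sketch of a standalone xds-server on upstream go-control-plane (the port and setup details are placeholders, not decisions); note that the same Server instance serves both state-of-the-world and delta xDS streams, and for the sanitizer point above, route configs can set validate_clusters to false instead of us sanitizing missing clusters out of the snapshot:

package main

import (
	"context"
	"net"

	discoverygrpc "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
	cachev3 "github.com/envoyproxy/go-control-plane/pkg/cache/v3"
	serverv3 "github.com/envoyproxy/go-control-plane/pkg/server/v3"
	"google.golang.org/grpc"
)

func main() {
	ctx := context.Background()

	// Snapshot cache keyed by node ID (ads=false here for simplicity).
	snapshotCache := cachev3.NewSnapshotCache(false, cachev3.IDHash{}, nil)

	// One server instance handles both SotW and delta xDS.
	srv := serverv3.NewServer(ctx, snapshotCache, nil)

	grpcServer := grpc.NewServer()
	discoverygrpc.RegisterAggregatedDiscoveryServiceServer(grpcServer, srv)

	lis, err := net.Listen("tcp", ":18000") // placeholder port
	if err != nil {
		panic(err)
	}
	_ = grpcServer.Serve(lis)
}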

Opaque extensibility

Goal:

  • Stop using solo-kit's ApiSnapshot object, as it is coupled to gloo's APIs (this step also enables performance benefits).
  • Create an extension model, such that things that targetRef a route (like RouteOptions) can mutate it without the translation being aware of their underlying type.

Steps:

  • TBD, but probably something like a collection of mutation functions (a sketch follows).
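As a rough sketch of that idea (all names here are hypothetical, since the step is still TBD):

package extensions

import (
	routev3 "github.com/envoyproxy/go-control-plane/envoy/config/route/v3"
)

// RouteMutation mutates an Envoy route during translation; the translation
// core invokes it without knowing the policy's underlying type.
type RouteMutation func(out *routev3.Route)

// RouteMutationIndex collects mutations keyed by the route they targetRef
// (the key format here is a placeholder).
type RouteMutationIndex map[string][]RouteMutation

// applyMutations runs every mutation registered for a route.
func applyMutations(idx RouteMutationIndex, routeKey string, out *routev3.Route) {
	for _, m := range idx[routeKey] {
		m(out)
	}
}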

Helm chart (can be done in parallel)

  • I think we can start a new helm chart for KubeGateway - it will be simpler as it won't include the proxy deployment parts. I don't believe we need code-gen for the helm chart.

Questions / Discussion

  • Do we want KubeGateway to write Status to gloo's RouteOptions? Given that they are "deprecated" in KubeGateway, I think not. Upstream status writing also requires some thought.
  • Regarding the control plane and snapshot cache: currently, KubeGateway re-uses gloo's xds-server. To keep things simple, I suggest we don't support re-using KubeGateway's xds-server in gloo. Instead, gloo will have 2 control planes (on different ports).
  • How do we handle extensions that require the old proxy? cc @lgadban (Note: we just need to allow for it here; I think eventually we will need to re-write this logic so it doesn't rely on the proxy object.)
  • I suspect these plugins are not worth porting over (lack of use). They can still remain in gloo:
    • http tunnel
    • gloo's rest/grpc (though we may want to keep the regular json->grpc plugin)
    • consul/nomad support
    • ec2
    • service_spec \ subset_spec on upstream
  • Any reason not to use upstream go-control-plane? It seems to support more features now than the solo-kit one.
@yuval-k
Contributor Author

yuval-k commented Nov 20, 2024

Some more issues to handle:

  • How do we do RouteOptions/Upstream statuses
  • The Settings CRD has a gloo subsection in it
  • Stats names that have gloo in them

@yuval-k
Contributor Author

yuval-k commented Nov 21, 2024

Re: migration - it doesn't seem likely that we can share plugin code. We can export the complex parts of the plugin code as utility functions so that these can be re-used.

@nfuden
Contributor

nfuden commented Nov 25, 2024

I think that KubeGateway likely shouldn't write statuses to deprecated objects. If implementors want to wrap KubeGateway, they should be able to do so in a way where there is an extension point to handle additional status setting if needed.

@ilrudie
Contributor

ilrudie commented Nov 25, 2024

> I think that KubeGateway likely shouldn't write statuses to deprecated objects. If implementors want to wrap KubeGateway, they should be able to do so in a way where there is an extension point to handle additional status setting if needed.

It seems like it may be problematic to read a deprecated type, causing internal state and the resulting proxy config to change, but not write any status noting that we did this. 🤔

@yuval-k
Contributor Author

yuval-k commented Nov 30, 2024

Proposal for plugins

This proposal is not very different from what we have today in ggv2; it needed to be updated, as we now translate directly to envoy resources, and not to gloo resources.

Base interfaces:

type ObjectSource struct {
	Group     string `json:"group,omitempty"`
	Kind      string `json:"kind,omitempty"`
	Namespace string `json:"namespace,omitempty"`
	Name      string `json:"name"`
}

type Upstream struct {
	// Ref to the source object. Sometimes the group and kind are not populated
	// by the api-server, so set them explicitly here, and pass this around as the reference.
	ObjectSource `json:",inline"`

	// Prefix the cluster name with this string to distinguish it from other GVKs.
	// Explicit here as it shows up in stats. Each (group, kind) pair should have a unique prefix.
	GvPrefix string
	// For things that integrate with destination rule, we need to know what hostname to use.
	CanonicalHostname string
	// The original object. Opaque to us other than metadata.
	Obj metav1.ObjectMetaAccessor
}

type UpstreamCtx struct {
	Upstream Upstream
	KrtCtx   KCtx
	Client   Ucc
}

type TranslationCtx struct {
	KrtCtx KCtx
	Client Ucc
}

// Upstream plugin: directly translates an upstream to a cluster.
type UpstreamPlugin interface {
	ApplyForUpstream(context.Context, UpstreamCtx, Cluster)
}

// Proxy plugin: translates a proxy IR to a gateway and its listeners.
type ProxyPlugin interface {
	// Returns a new context for proxy translation.
	// See the definition of the ProxyTranslationPass interface below.
	// The object returned as ProxyTranslationPass may hold internal-only state. It should not have
	// global state, or state tied to the parent proxy plugin. It may read state from the ProxyPlugin
	// (i.e. krt collections), but not modify it.
	// Could be called from multiple go-routines in parallel.
	NewTranslationPass(context.Context, TranslationCtx) ProxyTranslationPass
}

A translation pass operates on a "proxy". A proxy is the flattened representation of all our route rules in a listener.

A translation pass must not do any IO (like DNS resolution) or spawn any goroutines.
Translation must be O(1) in input size (it must not do any lookups that are not indexed).
All heavy operations should have already been done in the plugin lifecycle by
building collections that pre-process things. We can preprocess things that we currently do in the
plugin itself.

For example:

  • Validation can be done ahead of time.
  • Creating perFilterConfig can be done ahead of time (sketched after this list).
  • Merging multiple CRDs can be done ahead of time.
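For the perFilterConfig case, a minimal sketch of the split (precompute and attach are hypothetical helpers): the heavy anypb marshalling happens once in the collection transform, and the pass only does a map insert.

package extensions

import (
	routev3 "github.com/envoyproxy/go-control-plane/envoy/config/route/v3"
	"google.golang.org/protobuf/proto"
	"google.golang.org/protobuf/types/known/anypb"
)

// precompute runs once, in the collection transform; heavy work is fine here.
func precompute(policy proto.Message) (*anypb.Any, error) {
	return anypb.New(policy)
}

// attach runs per route in the translation pass: O(1), just a map insert.
func attach(out *routev3.Route, filterName string, cfg *anypb.Any) {
	if out.TypedPerFilterConfig == nil {
		out.TypedPerFilterConfig = map[string]*anypb.Any{}
	}
	out.TypedPerFilterConfig[filterName] = cfg
}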

Ideally, no errors are returned in translation - the translation semantics are that it always completes. Each
possible error needs to have clear semantics (likely removal of the offending object). Each
plugin reports errors to the reporter; the reporter can debug-log them. (This needs further validation.)

type ProxyTranslationPass interface {
	// Called once for each listener.
	ApplyListenerPlugin(
		ctx context.Context,
		pCtx *plugins.ListenerContext,
		out *envoy.Listener,
	)

	ApplyVhostPlugin(
		ctx context.Context,
		pCtx *plugins.VirtualHostContext,
		out *envoy.VirtualHost,
	)

	// Called for each route rule.
	ApplyForRoute(
		ctx context.Context,
		pCtx *plugins.RouteContext,
		out *envoy.Route,
	)

	// Called exactly once.
	// If a plugin emits new filters, they must have a unique name (so different plugins can emit the same filters).
	HttpFilters(ctx, ...)
}
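As a sketch of how a plugin could implement this (timeoutPlugin and timeoutPass are hypothetical names, and the context types follow the sketch interfaces above):

// Hypothetical plugin: read-only state (e.g. krt collections of pre-processed
// policy) lives on the plugin itself and is never mutated by a pass.
type timeoutPlugin struct {
	// e.g. policies krt.Collection[RoutePolicyIR]
}

func (p *timeoutPlugin) NewTranslationPass(ctx context.Context, tctx TranslationCtx) ProxyTranslationPass {
	// A fresh pass per translation, so multiple translations can run in parallel.
	return &timeoutPass{}
}

// Per-pass, internal-only state; never shared across translations.
type timeoutPass struct {
	needsFilter bool
}

func (t *timeoutPass) ApplyForRoute(ctx context.Context, pCtx *plugins.RouteContext, out *envoy.Route) {
	// O(1): read a pre-computed policy from pCtx and set it on out;
	// remember that this pass needs its HTTP filter emitted in HttpFilters.
	t.needsFilter = true
}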

@yuval-k
Contributor Author

yuval-k commented Nov 30, 2024

Looking at the plugins from the open source repo:
most of them are easy to port. I have notes on the following:

  • consul: no need to port
  • Deprecated ciphers passthrough: this one uses the matcher framework for filter chains - we need to make a decision here: do we plan to use the matcher framework for all of our system? How do we abstract the filter chain matches so that plugins can modify them?
  • dynamic-forward-proxy: currently this is implemented as a destination, so not usable with the Gateway API, I believe.
  • grpc: no need to port
  • kubernetes: no need to port
  • local-ratelimit: combines config from listener and route. Shouldn't be hard to port over, but route plugins need access to the listener, so we can't preprocess it.
  • ratelimit: need to figure out the interaction between the rate limit server and the inline rate limits (perhaps deprecate inline server config?). A basic version with a BYO server should be easy to port.
  • rest: no need to port
  • tcp: not sure how this interacts with the GW API. Needs further analysis.
  • tunneling: can't port this over, needs a refactor

@lgadban
Contributor

lgadban commented Dec 2, 2024

> 	ApplyVhostPlugin(
> 		ctx context.Context,
> 		pCtx *plugins.VirtualHostContext,
> 		out *envoy.VirtualHost,
> 	)

Why do we want to keep VHost around? IMO we should do everything possible to get rid of this concept and map only to the Gateway API concepts, i.e.:

  • Gateway/Listener
    • Probably just Listener, where Gateway attachment means all Listeners in the Gateway
  • Route

@lgadban
Contributor

lgadban commented Dec 2, 2024

> I think we can start a new helm chart for KubeGateway - it will be simpler as it won't include the proxy deployment parts.

Initially this could be fine, but we will need some way for users to create a static proxy to enable static gateway provisioning. I also think we should revisit our implementation in general here, related to https://github.com/solo-io/solo-projects/issues/7302

@yuval-k
Contributor Author

yuval-k commented Dec 2, 2024

> 	ApplyVhostPlugin(
> 		ctx context.Context,
> 		pCtx *plugins.VirtualHostContext,
> 		out *envoy.VirtualHost,
> 	)
>
> Why do we want to keep VHost around? IMO we should do everything possible to get rid of this concept and map only to the Gateway API concepts, i.e.:
>
> • Gateway/Listener
>   • Probably just Listener, where Gateway attachment means all Listeners in the Gateway
> • Route

This is more of a sketch; I don't mind the specifics; I'm OK with doing the GW API semantics.

@yuval-k
Contributor Author

yuval-k commented Dec 2, 2024

> > I think we can start a new helm chart for KubeGateway - it will be simpler as it won't include the proxy deployment parts.
>
> Initially this could be fine, but we will need some way for users to create a static proxy to enable static gateway provisioning. I also think we should revisit our implementation in general here, related to solo-io/solo-projects#7302

Makes sense; maybe we can do 2 helm charts? So users can deploy the gateway manually if they want to, and we will use the same chart for auto-deploy?

@yuval-k
Contributor Author

yuval-k commented Dec 2, 2024

> 	ApplyVhostPlugin(
> 		ctx context.Context,
> 		pCtx *plugins.VirtualHostContext,
> 		out *envoy.VirtualHost,
> 	)
>
> Why do we want to keep VHost around? IMO we should do everything possible to get rid of this concept and map only to the Gateway API concepts, i.e.:
>
> • Gateway/Listener
>   • Probably just Listener, where Gateway attachment means all Listeners in the Gateway
> • Route
>
> This is more of a sketch; I don't mind the specifics; I'm OK with doing the GW API semantics.

To clarify a bit more: the idea here is to tweak the plugin model a bit so that each translation means a new call to NewTranslationPass (the sketch part is what happens inside a translation pass). This way translation can happen concurrently (which cannot happen today; a sketch follows).
This melds together the ggv2 and gloo plugin models.
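A minimal sketch of the concurrency this enables (proxies, plugins, tctx, and translate are hypothetical stand-ins): each proxy translation builds its own set of passes, so proxies can translate in parallel.

var wg sync.WaitGroup
for _, proxy := range proxies {
	wg.Add(1)
	go func(p Proxy) {
		defer wg.Done()
		// Fresh passes per translation; no state is shared between goroutines.
		passes := make([]ProxyTranslationPass, 0, len(plugins))
		for _, plug := range plugins {
			passes = append(passes, plug.NewTranslationPass(ctx, tctx))
		}
		translate(p, passes) // hypothetical: invokes the Apply* callbacks per listener/vhost/route
	}(proxy)
}
wg.Wait()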
