feat: Wasm OCI image #3564

Merged
merged 36 commits into from
Jun 28, 2024

Member

@zhaohuabing zhaohuabing commented Jun 7, 2024

This PR adds support for Wasm OCI image format to EG.
It allows EG to download Wasm images from remote registries and serve them to the Envoy fleet through a local HTTP server inside EG that listens on port 18002.

  • The EG Wasm cache fetches Wasm modules from their original URLs and stores them locally. Currently, the cache directory resides within the EG container; it can be made configurable to use a mounted volume if necessary.
  • Gateway API translator updates the original Wasm downloading URLs in the EnvoyExtensionPolicy to point to the local EG HTTP server.
  • To ensure a consistent UI experience, HTTP Wasm modules are also cached and served locally.
  • To prevent unauthorized access to private images, the OIDC HMAC secret is used as a hash salt to generate unguessable downloading paths for Wasm modules.
  • The cached Wasm modules are purged periodically. A cached Wasm module is deleted if it hasn't been requested by Envoy proxies within a certain period. The period currently defaults to 24 hours and can be made configurable if needed.
  • The total cache size is limited to 10G by default and can be made configurable if needed.
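
As an illustration of the "unguessable downloading path" idea above, here is a minimal sketch. The description only says that the OIDC HMAC secret is used as a hash salt; keying HMAC-SHA256 with the secret is one possible way to do that and is an assumption here, as are the function name `servingPath`, the `.wasm` suffix, and the example URL.

```go
package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// servingPath derives a local path for a Wasm module from its original URL.
// secret stands in for the OIDC HMAC secret; without it, the path cannot be
// recomputed from the URL alone. All names here are hypothetical.
func servingPath(secret []byte, originalURL string) string {
	mac := hmac.New(sha256.New, secret)
	mac.Write([]byte(originalURL)) // hash.Hash.Write never returns an error
	return "/" + hex.EncodeToString(mac.Sum(nil)) + ".wasm"
}

func main() {
	secret := []byte("example-secret")
	fmt.Println(servingPath(secret, "oci://registry.example.com/filters/auth:v1"))
}
```

The same URL and secret always map to the same path, so Envoy proxies can keep fetching a module across translations, while a client that only knows the original image URL cannot guess where the cached copy is served.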

Implements: #3304
Design: #3313

@zhaohuabing zhaohuabing requested a review from a team as a code owner June 7, 2024 02:03
@zhaohuabing zhaohuabing marked this pull request as draft June 7, 2024 02:03
@zhaohuabing zhaohuabing force-pushed the wasm-oci-image branch 5 times, most recently from 7169309 to 3916bdf Compare June 7, 2024 23:27

codecov bot commented Jun 8, 2024

Codecov Report

Attention: Patch coverage is 74.40758% with 216 lines in your changes missing coverage. Please review.

Project coverage is 68.75%. Comparing base (2a86997) to head (86016ee).

Files                                          Patch %   Lines
internal/wasm/imagefetcher.go                  64.32%    35 Missing and 26 partials ⚠️
internal/gatewayapi/runner/runner.go           19.35%    50 Missing ⚠️
internal/gatewayapi/envoyextensionpolicy.go    74.03%    17 Missing and 10 partials ⚠️
internal/wasm/httpfetcher.go                   74.76%    14 Missing and 13 partials ⚠️
internal/wasm/cache.go                         89.76%    13 Missing and 9 partials ⚠️
internal/provider/kubernetes/controller.go     0.00%     13 Missing ⚠️
internal/wasm/httpserver.go                    89.42%    5 Missing and 6 partials ⚠️
internal/provider/kubernetes/predicates.go     70.00%    2 Missing and 1 partial ⚠️
internal/provider/kubernetes/indexers.go       87.50%    1 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3564      +/-   ##
==========================================
+ Coverage   68.55%   68.75%   +0.20%     
==========================================
  Files         170      175       +5     
  Lines       20690    21484     +794     
==========================================
+ Hits        14183    14771     +588     
- Misses       5492     5636     +144     
- Partials     1015     1077      +62     


@zhaohuabing zhaohuabing removed the request for review from a team June 8, 2024 07:47
@zhaohuabing zhaohuabing force-pushed the wasm-oci-image branch 10 times, most recently from 4164371 to a2f8b6c Compare June 11, 2024 21:57
@zhaohuabing zhaohuabing marked this pull request as ready for review June 11, 2024 21:57
@zhaohuabing zhaohuabing requested review from arkodg, guydc and zirain June 11, 2024 21:57
Member Author

zhaohuabing commented Jun 11, 2024

Hi @envoyproxy/gateway-reviewers my apologies for the large PR. I would have broken it into multiple smaller ones if possible :-). A lot of *.out.yaml test files have been updated due to the addition of the new wasm HTTP server static cluster to the Envoy bootstrap config.

If you're going to review it, the most significant changes are in the internal/wasm package, which implements the Wasm file cache and an HTTP server to serve these files to the Envoy fleet.

Additionally, the Gateway API runner has been modified to initialize the Wasm cache server.

@zhaohuabing zhaohuabing requested a review from a team June 11, 2024 22:25
@zhaohuabing zhaohuabing force-pushed the wasm-oci-image branch 5 times, most recently from db8908d to 236e316 Compare June 13, 2024 02:28
Signed-off-by: Huabing Zhao <[email protected]>
internal/wasm/options.go (outdated)
)

wasmRemoteFetchCount = metrics.NewCounter(
"wasm_remote_fetch_count",
Contributor

prob needs to be split up into 2 - error, total
cc @shawnh2 any prior art here for naming

Member Author

There are two different naming styles in EG:

1. xds_snapshot_creation_total
   xds_snapshot_creation_failed
   xds_snapshot_creation_success
2. status_update_total
   status_update_failed_total
   status_update_success_total
   status_update_conflict_total

I prefer the first one. We also need to settle on a single term among "failed", "failure", and "error".

Contributor

1 looks more like envoy stats
2 looks like prom naming https://prometheus.io/docs/practices/naming/
@Xunzhuo @shawnh2 can you help drive/unify this naming

Member Author

Tracked with #3684
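
To make the "split into error and total" suggestion from this thread concrete, here is a small sketch that uses a hand-rolled counter rather than EG's metrics package. The metric names are placeholders, since the naming question was deferred to #3684, and `counter`, `recordFetch`, and `errFetch` are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// counter is a minimal stand-in for a metrics counter.
type counter struct {
	name  string
	value atomic.Int64
}

func newCounter(name string) *counter { return &counter{name: name} }

func (c *counter) inc()       { c.value.Add(1) }
func (c *counter) get() int64 { return c.value.Load() }

// Placeholder names; the final convention was still under discussion.
var (
	fetchTotal  = newCounter("wasm_remote_fetch_total")
	fetchFailed = newCounter("wasm_remote_fetch_failed_total")
)

// errFetch is a sample error used to exercise the failure path.
var errFetch = errors.New("fetch failed")

// recordFetch counts every attempt and, separately, every failed attempt.
func recordFetch(err error) {
	fetchTotal.inc()
	if err != nil {
		fetchFailed.inc()
	}
}

func main() {
	recordFetch(nil)
	recordFetch(errFetch)
	fmt.Println(fetchTotal.name, fetchTotal.get())
	fmt.Println(fetchFailed.name, fetchFailed.get())
}
```

With two counters, a failure ratio can be computed from the pair, which a single combined count does not allow.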

Signed-off-by: Huabing Zhao <[email protected]>
Signed-off-by: Huabing Zhao <[email protected]>
Signed-off-by: Huabing Zhao <[email protected]>
Signed-off-by: Huabing Zhao <[email protected]>
internal/ir/xds.go (outdated, resolved)
transport := http.DefaultTransport.(*http.Transport).Clone()
// nolint: gosec
// This is only when a user explicitly sets a flag to enable insecure mode
transport.TLSClientConfig = &tls.Config{InsecureSkipVerify: true}
Contributor

@arkodg arkodg Jun 25, 2024

is this configurable in the API ?

Member Author

Not yet, we can expose it to API when needed.
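
If the flag is exposed later, the opt-in could look roughly like the sketch below, which expands the snippet under review into a self-contained form. The function name `newFetchClient`, the `allowInsecure` parameter, the helper `skipsVerify`, and the 30 second timeout are assumptions for illustration, not part of the PR.

```go
package main

import (
	"crypto/tls"
	"fmt"
	"net/http"
	"time"
)

// newFetchClient builds an HTTP client for downloading Wasm modules.
// Certificate verification is skipped only when allowInsecure is set.
func newFetchClient(allowInsecure bool) *http.Client {
	transport := http.DefaultTransport.(*http.Transport).Clone()
	if allowInsecure {
		// #nosec G402 -- only reached when the user explicitly opts in.
		transport.TLSClientConfig = &tls.Config{InsecureSkipVerify: true}
	}
	return &http.Client{Transport: transport, Timeout: 30 * time.Second}
}

// skipsVerify reports whether a client built above skips TLS verification.
func skipsVerify(c *http.Client) bool {
	t, ok := c.Transport.(*http.Transport)
	return ok && t.TLSClientConfig != nil && t.TLSClientConfig.InsecureSkipVerify
}

func main() {
	fmt.Println("default client skips verification:", skipsVerify(newFetchClient(false)))
	fmt.Println("opt-in client skips verification:", skipsVerify(newFetchClient(true)))
}
```

Cloning the default transport keeps its proxy and connection pool settings, so the insecure path differs from the default only in the TLS configuration.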

arkodg previously approved these changes Jun 26, 2024
Contributor

@arkodg arkodg left a comment

overall LGTM !

there are some non blocking comments added in the PR, that can be tackled in follow ups

long term, my vote would be to move the wasm server into its own runner (like ratelimit)
this eliminates the shared fate problem of delaying xDS due to one slow URL download.

this also creates an eventual consistency problem, which can be resolved by implementing retries in the wasm cluster in envoy

@arkodg arkodg requested review from a team June 26, 2024 22:50
Signed-off-by: Huabing Zhao <[email protected]>
Contributor

@guydc guydc left a comment

In general, can we have some sort of feature toggle here? Some components, like the runner and the servers, will execute even when the feature is not in use. Later we can enable this by default, once we're confident in it and/or it has been extracted into a different component.

func (r *Runner) Name() string {
	return string(egv1a1.LogComponentGatewayAPIRunner)
}

// Start starts the gateway-api translator runner
func (r *Runner) Start(ctx context.Context) (err error) {
	r.Logger = r.Logger.WithName(r.Name()).WithValues("runner", r.Name())

	go r.startWasmCache(ctx)
Contributor

if cache fails to start, should we fail EG or disable cache-related translation?

Member Author

@zhaohuabing zhaohuabing Jun 27, 2024

Fail the Wasm translation in EEP gateway API translation if cache fails to start.

require (
fortio.org/fortio v1.65.0
fortio.org/log v1.12.2
github.com/Masterminds/semver/v3 v3.2.1
github.com/cncf/xds/go v0.0.0-20240423153145-555b57ec207b
github.com/davecgh/go-spew v1.1.1
github.com/davecgh/go-spew v1.1.2-0.20180830191138-d8f796af33cc
github.com/docker/cli v26.1.3+incompatible
Contributor

any way to avoid the docker dependency and use more generic oci libs here?

Member Author

The docker lib is used to parse docker compatible config file to get auth info.
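
For context on what that parsing involves, a Docker-compatible config file stores registry credentials under an `auths` map, usually as a base64-encoded `user:password` string in the `auth` field. The sketch below reads that format with the standard library only; it is not the docker/cli API the PR uses, it ignores credential helpers, and the names `dockerConfig`, `authEntry`, and `credentialsFor` are hypothetical.

```go
package main

import (
	"encoding/base64"
	"encoding/json"
	"fmt"
	"strings"
)

// authEntry is one registry entry in the "auths" section.
type authEntry struct {
	Auth     string `json:"auth"`
	Username string `json:"username"`
	Password string `json:"password"`
}

// dockerConfig models only the part of config.json needed here.
type dockerConfig struct {
	Auths map[string]authEntry `json:"auths"`
}

// credentialsFor returns the username and password stored for a registry.
func credentialsFor(configJSON []byte, registry string) (string, string, error) {
	var cfg dockerConfig
	if err := json.Unmarshal(configJSON, &cfg); err != nil {
		return "", "", fmt.Errorf("parse docker config: %w", err)
	}
	entry, ok := cfg.Auths[registry]
	if !ok {
		return "", "", fmt.Errorf("no credentials for registry %q", registry)
	}
	if entry.Auth == "" {
		return entry.Username, entry.Password, nil
	}
	raw, err := base64.StdEncoding.DecodeString(entry.Auth)
	if err != nil {
		return "", "", fmt.Errorf("decode auth for %q: %w", registry, err)
	}
	user, pass, found := strings.Cut(string(raw), ":")
	if !found {
		return "", "", fmt.Errorf("malformed auth for %q", registry)
	}
	return user, pass, nil
}

func main() {
	auth := base64.StdEncoding.EncodeToString([]byte("alice:s3cret"))
	cfg := []byte(`{"auths":{"registry.example.com":{"auth":"` + auth + `"}}}`)
	user, _, err := credentialsFor(cfg, "registry.example.com")
	fmt.Println(user, err)
}
```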

Member Author

zhaohuabing commented Jun 27, 2024

long term, my vote would be to move the wasm server into its own runner (like ratelimit)
this eliminates the shared fate problem of delaying xDS due to one slow URL download.

We probably could:

  • move the Wasm cache server into a standalone runner.
  • fail httproutes containing Wasm extensions if that Wasm module has not been downloaded yet. This way, the download won't block the translation and only the httproutes with Wasm will fail. (Or just send the route configuration to Envoy without Wasm, like what EG currently does if any policy translation fails?)
  • trigger the API translation again when a Wasm module is available in the cache.
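
The three steps proposed above could fit together roughly as in this sketch: a lookup never blocks, a miss starts a background download, and a notification fires once the module is available so that translation can be triggered again. Everything here (`asyncCache`, `lookup`, the `notify` channel and its buffer size) is hypothetical and only illustrates the idea under discussion.

```go
package main

import (
	"fmt"
	"sync"
)

// asyncCache is a hypothetical cache whose lookups never wait for downloads.
type asyncCache struct {
	mu       sync.Mutex
	ready    map[string]string // remote URL -> local path
	inFlight map[string]bool
	fetch    func(url string) (string, error)
	notify   chan string // receives a URL once its module is cached
}

func newAsyncCache(fetch func(string) (string, error)) *asyncCache {
	return &asyncCache{
		ready:    make(map[string]string),
		inFlight: make(map[string]bool),
		fetch:    fetch,
		notify:   make(chan string, 16),
	}
}

// lookup returns the local path if the module is cached. On a miss it starts
// one background download per URL and reports "not ready" immediately.
func (c *asyncCache) lookup(url string) (string, bool) {
	c.mu.Lock()
	defer c.mu.Unlock()
	if path, ok := c.ready[url]; ok {
		return path, true
	}
	if !c.inFlight[url] {
		c.inFlight[url] = true
		go c.download(url)
	}
	return "", false
}

// download fetches the module and signals that translation can be retried.
func (c *asyncCache) download(url string) {
	path, err := c.fetch(url)
	c.mu.Lock()
	delete(c.inFlight, url)
	if err == nil {
		c.ready[url] = path
	}
	c.mu.Unlock()
	if err == nil {
		c.notify <- url // blocks if the buffer is full; fine for a sketch
	}
}

func main() {
	c := newAsyncCache(func(url string) (string, error) {
		return "/var/cache/wasm/module.wasm", nil
	})
	_, ok := c.lookup("oci://registry.example.com/filter:v1")
	fmt.Println("ready on first lookup:", ok)
	fmt.Println("downloaded:", <-c.notify)
	path, ok := c.lookup("oci://registry.example.com/filter:v1")
	fmt.Println(path, ok)
}
```

A translation runner would treat a value received on `notify` as a reason to re-run translation for the affected routes, which is the third step in the list.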

Contributor

@guydc guydc left a comment

Seeing as most of the implementation that will execute regardless of EEP presence is taken from istio and already has sufficient production burn time, I take back my request for a feature gate.

@zhaohuabing zhaohuabing requested a review from arkodg June 28, 2024 17:45
@zhaohuabing zhaohuabing merged commit 9ebcfac into envoyproxy:main Jun 28, 2024
27 checks passed
guydc pushed a commit to guydc/gateway that referenced this pull request Jul 1, 2024
* support Wasm OCI image

Signed-off-by: Huabing Zhao <[email protected]>

* set up test registry

Signed-off-by: Huabing Zhao <[email protected]>

* add test for registry authn

Signed-off-by: Huabing Zhao <[email protected]>

* fix lint

Signed-off-by: Huabing Zhao <[email protected]>

* fix e2e

Signed-off-by: Huabing Zhao <[email protected]>

* fix e2e

Signed-off-by: Huabing Zhao <[email protected]>

* add test for unauthed private image

Signed-off-by: Huabing Zhao <[email protected]>

* fix e2e

Signed-off-by: Huabing Zhao <[email protected]>

* fix e2e

Signed-off-by: Huabing Zhao <[email protected]>

* fix lint

Signed-off-by: Huabing Zhao <[email protected]>

* refactor

Signed-off-by: Huabing Zhao <[email protected]>

* add max failed attempts limit

Signed-off-by: Huabing Zhao <[email protected]>

* remove retries

Signed-off-by: Huabing Zhao <[email protected]>

* clean up e2e tests

Signed-off-by: Huabing Zhao <[email protected]>

* add e2e test for wrong password

Signed-off-by: Huabing Zhao <[email protected]>

* Update api/v1alpha1/authorization_types.go

Co-authored-by: Arko Dasgupta <[email protected]>
Signed-off-by: Huabing Zhao <[email protected]>

* Update api/v1alpha1/wasm_types.go

Co-authored-by: Arko Dasgupta <[email protected]>
Signed-off-by: Huabing Zhao <[email protected]>

* remove unnecessary replace

Signed-off-by: Huabing Zhao <[email protected]>

* remove set package

Signed-off-by: Huabing Zhao <[email protected]>

* fix gen check

Signed-off-by: Huabing Zhao <[email protected]>

* add test for failed attempts

Signed-off-by: Huabing Zhao <[email protected]>

* address comments

Signed-off-by: Huabing Zhao <[email protected]>

* address comments

Signed-off-by: Huabing Zhao <[email protected]>

* minor wording

Signed-off-by: Huabing Zhao <[email protected]>

* move sha256 inside code source

Signed-off-by: Huabing Zhao <[email protected]>

* address comments

Signed-off-by: Huabing Zhao <[email protected]>

* fix e2e

Signed-off-by: Huabing Zhao <[email protected]>

* fix flaky test

Signed-off-by: Huabing Zhao <[email protected]>

* change comments

Signed-off-by: Huabing Zhao <[email protected]>

* address comments

Signed-off-by: Huabing Zhao <[email protected]>

* address comments

Signed-off-by: Huabing Zhao <[email protected]>

* fail the eep translation if the wasm cache failed to start

Signed-off-by: Huabing Zhao <[email protected]>

---------

Signed-off-by: Huabing Zhao <[email protected]>
Co-authored-by: Arko Dasgupta <[email protected]>
3 participants