Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Converge [New]GrpcMuxImpl #11477

Open
htuch opened this issue Jun 5, 2020 · 8 comments
Open

Converge [New]GrpcMuxImpl #11477

htuch opened this issue Jun 5, 2020 · 8 comments
Assignees

Comments

@htuch
Copy link
Member

htuch commented Jun 5, 2020

We have two distinct gRPC mux impls for xDS, which impacts velocity as we need to do the same thing twice when working with the xDS transport and will be a source of inconsistency long term. This is some hangover from earlier work on delta xDS.

CC @wgallagher @fredlas

@mattklein123
Copy link
Member

What is left to do here? I thought we already merged most of the code?

@htuch
Copy link
Member Author

htuch commented Jun 5, 2020

That's not the case, there are distinct paths for delta and SotW, I'm making the same change in both places right now for #11362. @wgallagher has more complete state and a provisional plan here.

@wgallagher
Copy link

The remaining work is to merge GrpcMuxImpl and NewGrpcMuxImpl (the SoTW / Delta mux)

@fredlas
Copy link
Contributor

fredlas commented Jun 8, 2020

At this point @wgallagher is probably more plugged in here than I am. However, I think there's one bit of important information to share here. When #8974, which would have resolved this, caused the weird obscure hang with that one user, the reporter and I pored over it together for like a week. There was nothing in the behavior of the new SotW code going wrong in any way. Rather, Envoy was simply not even asking for the config (IIRC; it was definitely something in the vein of "when the problem happens, it's because the unchanged bulk of Envoy never even calls the changed code"). I suspect a pre-existing bug in the init manager, especially because the user found that the hang could be broken out of by intentionally having the xDS server send a rejection.

@htuch
Copy link
Member Author

htuch commented Jun 8, 2020

Yeah, there have been quite a few changes around init manager since then as well. @fredlas @rgs1 would it be easy to validate whether or not a revived #8974 would be safe to merge again? Is there a concrete enough reproducer we can build into a test?

@htuch
Copy link
Member Author

htuch commented Jul 13, 2020

Another thing to look at while doing this work; today we have some pretty different behaviors for how discovery requests are handled during warming in SotW and delta. In SotW, we send on each new subscription, regardless of existing state. In delta, we take into account whether an existing request is pending and don't send when a new cluster is added (for example). I think this has (to some extent) to do with the differences in GrpcMuxImpl and NewGrpcMuxImpl.

@dmitri-d
Copy link
Contributor

I'm picking this up as @wgallagher is unable to continue working on this. @fredlas & @rgs1 is there a way to reproduce the issue you were seeing in #8974?

@htuch htuch assigned dmitri-d and unassigned wgallagher Sep 21, 2020
@rgs1
Copy link
Member

rgs1 commented Sep 22, 2020

@dmitri-d it's been too long and even back then we couldn't get a repro. I am however happy to give a rebased branch a spin and see how it goes. The breakage was deterministic and quick.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants