Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor communication layer to use socket.io #10514

Merged
merged 4 commits into from
Feb 15, 2022
Merged

Conversation

msujew
Copy link
Member

@msujew msujew commented Dec 7, 2021

What it does

Closes #10403

Replaces most of our communication layer (previously using simple websockets + http-fallback) with the socket.io framework.

This removes the need for a custom long polling implementation, which is done by socket.io automatically. It works by first initiating a long-polling connection and after a successful handshake tries to create a websocket connection. If that fails, it will simply continue using the long-polling connection. This greatly increases the performance of the http-fallback, which now seems to be on par with the websocket solution (performance wise). For more info on the actual protocol used here, see socket.io-protocol repo.

There is some weird behavior in socket.io that strips origin headers for some reason. This is undocumented and seems to be used by the cors feature. I've worked around this for now using a fix-origin header that gets transmitted to the backend and can then be used for validation. Hopefully the comments around these parts explain it well enough.

I've tried to keep the APIs as backwards compatible as possible. Hopefully this change shouldn't break too much.

How to test

  1. Start the browser application and assert that the frontend connects and everything works as expected
  2. Do the same for electron
  3. Also execute the testing steps for the http-fallback (Implement http fallback for unavailable websockets #9731)

Review checklist

Reminder for reviewers

@msujew msujew added the messaging issues related to messaging label Dec 7, 2021
@msujew
Copy link
Member Author

msujew commented Dec 7, 2021

@paul-marechal Your changes over at https://github.com/eclipse-theia/theia/tree/mp/refact-connections look interesting. Are you interested in me taking some of these ideas to encapsulate the socket.io stuff? That would make the change probably quite a bit more breaking. Or should we do that as a second step? I don't expect a lot of stuff to break due to my changes, aside from the changes to the handleHttpUpgrade stuff.

@paul-marechal
Copy link
Member

paul-marechal commented Dec 7, 2021

I want to look at how much we should wrap the broken APIs bits with abstract interfaces instead of either the old ws type or the new Socket one. This to shield dependents from future internal changes/improvements. I'll take some time for this during this week.

edit: I couldn't come up with something yet.

@tsmaeder
Copy link
Contributor

I would like to point to #9514 (comment) and the lessons learned there.

I think we should make sure that we get a real good separation of concerns: there are multiple concerns that we're addressing, like

  • Establishing a communication channel to a remote process
  • Request batching
  • Remote function/method invocation
  • Multiplexing multiple connections over s single connection

I think we should take really good care to separate the different concerns into reusable, layered components that can be consumed separately.

@paul-marechal
Copy link
Member

@tsmaeder In this case only the transport layer is touched, so we should try to keep at that in this PR and only minimally touch the other parts if required.

@tsmaeder
Copy link
Contributor

tsmaeder commented Jan 5, 2022

@paul-marechal I agree as far as this PR is concerned. It's more a comment on future directions we should take.

@colin-grant-work
Copy link
Contributor

A propos the discussion in today's developer meeting, I think my inclination would be to merge this PR before we undertake further work on the messaging system. This PR cleans up the code a fair bit, and in some smoke tests, I didn't notice any regressions. I agree with @paul-marechal that the abstractions linking our front and back ends could use a fair bit of work - both to clean up the code and make things more navigable and to improve performance - but this PR is a step forward.

@paul-marechal
Copy link
Member

It's just that Socket.io isn't directly related to WebSockets but it is now retrofitted into our various WebSocket-centric abstractions. Feel free to merge before we come around to break and refactor the communication APIs.

@colin-grant-work
Copy link
Contributor

It's just that Socket.io isn't directly related to WebSockets.

That may overstate the case a little bit. As their website says:

In most cases, the connection will be established with WebSocket, providing a low-overhead communication channel between the server and the client.

So most of the time we will in fact be dealing with WebSockets, and they've also handled the cases when we aren't using a pretty slick interface that conceals the details.

@paul-marechal
Copy link
Member

paul-marechal commented Jan 26, 2022

they've also handled the cases when we aren't using a pretty slick interface that conceals the details.

My main issue being that we used to rely on those details to some extent: We have validation logic for when WebSockets upgrade, but this will not always happen now. Socket.io also supports multiplexing out of the box, but right now we don't make use of that, and I believe we should.

From the dependents of the communication API, using websockets or not shouldn't really matter. What the frontend wants are connections and proxies. How this is achieved should be an implementation detail, and if we better abstracted those use-cases I believe we could make a better use of Socket.io, or whatever other library could be used instead.

@tsmaeder
Copy link
Contributor

My main issue being that we used to rely on those details to some extent: We have validation logic for when WebSockets upgrade, but this will not always happen now.

How this is achieved should be an implementation detail, and if we better abstracted those use-cases I believe we could make a better use of Socket.io, or whatever other library could be used instead.

But these two goals are at loggerheads: you can't rely on implementation details and abstract from them at the same time. If we want connections and proxies at a higher level, we need to have an abstraction at that level that captures the relevant properties. I could imagine that we're conceptualize this as a connection being multiplexed over a main and a fallback connection. Maybe we need some "quality-of-service" attribute on the connections to express events like: "this connection has switched from being slow to being fast".

@paul-marechal I find it hard to know what exactly would be worse for the user if we lose the validation logic? Do you have a good handle on that? Might it be worth the sacrifice if we can improve the layering and simplify the code?

@paul-marechal
Copy link
Member

I could imagine that we're conceptualize this as a connection being multiplexed over a main and a fallback connection.

I think clients to our messaging API shouldn't care about whether they are running on a main or fallback connection, that's kind of the point of using Socket.io. Lower layers of course will need to deal with the details, but that's the goal of the abstraction: To hide the nitty-gritty details and expose something easily usable.

I find it hard to know what exactly would be worse for the user if we lose the validation logic?

Validation logic would be pulled down into the lower layers. Validation is required in absence of a solution to control the HTTP origin of WebSocket connection outside of Theia (with a proxy or whatever else) as WebSockets aren't affected by CORS.

[...] you can't rely on implementation details and abstract from them at the same time.

Agreed.

To illustrate some of the abstractions we would be missing: Right now if you want a proxy from the FE to the BE you have to go through WebSocketConnectionProvider.createProxy: Now that we are thinking of not exactly using WebSockets it doesn't make as much sense. This is just one example.

@colin-grant-work
Copy link
Contributor

To illustrate some of the abstractions we would be missing: Right now if you want a proxy from the FE to the BE you have to go through WebSocketConnectionProvider.createProxy: Now that we are thinking of not exactly using WebSockets it doesn't make as much sense. This is just one example.

I'd make a distinction here between missing abstractions and bad naming. The name WebSocketConnectionProvider indeed doesn't make much sense in a world where we're not using WebSockets, and it likely never made sense to use that specific a name in a world where it wasn't contrasting with a different kind of FE-BE connection provider: we should always have just called it ConnectionProvider, because it provided the only kind of connections we were using. Originally, that was WebSockets, but already in our current code it's actuall WebSockets with an HTTP fallback. The bad name hasn't stopped us from implementing useful functionality under the existing abstraction provided by the interface we defined for WebSocket connections. I'm certainly not opposed to rectifying our names, but that's a separate issue from whether our abstractions (interfaces) are useful and correct.

@paul-marechal
Copy link
Member

I'm certainly not opposed to rectifying our names, but that's a separate issue from whether our abstractions (interfaces) are useful and correct.

I'm not sure what to think of https://github.com/eclipse-theia/theia/blob/master/packages/core/src/node/messaging/messaging-service.ts#L22-L47 in that regard, I find most of this infrastructure confusing hence why I would be looking into a way to simplify those things.

@matthewfisch
Copy link

FWIW from a user perspective:

For what it's worth, I'm running a "Che Theia" cluster with people connecting from around the planet -- individuals on the opposite side of the planet (200-350ms) experience almost unusuable environments. The solution is a complete fail for them and probably reduces their working efficiency by a third or more.

Reminds me of the TCP/BDP issues I had to solve trans-pacific 20 years ago (that have mostly been resolved by OS TCP stack improvements).

  • long Theia load/initialization times (3-5 minutes, sometimes longer)
  • load/init failures / Task failure
  • Complete desynchronization (requiring reload at 3-5 minutes or more)
  • RPC failures that result in poor user experience / failed terminal opens / delays of up to a minute in some IDE actions

I may end up regionalizing their working environment, but it's worth pointing our their around-the-world video and voice connections are very reliable (rarely any disruptions during very long calls). They can also download directly from our http servers at better than 10mbit (despite the high BDP) no problem.

Thanks for any multiplexing improvements that may make their life better, and please do tag me if I can help with any testing. It's possible you my be able to simulate at least some of the problem space by artificially adding 300ms to your RTT in the lab.

I'm currently running Theia 1.22.1 from August 23 2021.

@tsmaeder
Copy link
Contributor

tsmaeder commented Feb 1, 2022

@matthewfisch 250-300ms of what? Ping time? Round trip latency? I regularly would run on a che-theia cluster (Red Hat) from the US and Theia being slow was never a problem. Not saying it ain't so, but I'm trying to understand what scenario we're looking at.

@matthewfisch
Copy link

matthewfisch commented Feb 1, 2022

@tsmaeder
RTT (round-trip time). Servers are in eastern US, problems reported in south asia and further east. I have not had issues reported from the EU (but that's 150ms closer).

Not all remote developers have reported the issue, and I'm still in diagnosis phase. I'm not sure if it is universal of affecting some individuals. I have traced the issue to websocket/rpc comms, but I don't know much about how it works in Theia (so investigation has been slow).

I'm reviewing #9731 now, looks like http fallback is available in a Theia that is one-day newer and I may try to get this working.

I presumed the issue was just that the BDP was a little too much and the RPC events were falling behind and eventually timing out. A multiplexed channel or multiple channels would resolve this. HTTP fallback may also help.

@tsmaeder
Copy link
Contributor

tsmaeder commented Feb 1, 2022

Round trip time should be less of a problem than throughput as message handling is async anyway. Theia already uses a single Websocket for all front-end/back-end service communications. Looking forward to seeing your analysis.

@matthewfisch
Copy link

matthewfisch commented Feb 1, 2022 via email

@JonasHelming JonasHelming added the decision_needed issues that require decisions on ways forward (ex: if they should be updated or closed) label Feb 15, 2022
Copy link
Member

@paul-marechal paul-marechal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The changes are minimal which is nice! Let's merge this now and look at improving our messaging abstractions later.

@paul-marechal paul-marechal merged commit 1b44c0f into master Feb 15, 2022
@paul-marechal paul-marechal deleted the msujew/socket-io branch February 15, 2022 16:55
@paul-marechal paul-marechal restored the msujew/socket-io branch February 15, 2022 16:55
@github-actions github-actions bot added this to the 1.23.0 milestone Feb 15, 2022
thegecko pushed a commit to ARMmbed/theia that referenced this pull request Feb 17, 2022
NexediGitlab pushed a commit to SlapOS/slapos that referenced this pull request Apr 8, 2022
Since 1.23.0 ( eclipse-theia/theia#10514 ),
Theia uses socket.io and it needs /socket.io to support websocket,
otherwise it falls back to HTTP polling, which is significantly slower.
@msujew msujew deleted the msujew/socket-io branch October 2, 2024 08:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
decision_needed issues that require decisions on ways forward (ex: if they should be updated or closed) messaging issues related to messaging
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Integrate socket.io for messaging
7 participants