SSE durable consumer groups? #7604
Comments
@wkloucek Is that a blue ticket?
@micbar Maybe we should tag this as a release blocker until we know that it's a non-issue or have a fix?
Okay, then p2 is ok because it blocks a release.
Do the queues for the […]?
That's basically my question, because I don't know how many durable groups can exist, what their retention is, etc. It's a similar question to #7245.
After some up-/down-scaling of the sse service: [output omitted]
I think we're lacking something to close the connection. In other systems, the client (nats in this case) might need to read or write on the connection in order to know whether the server (sse in this case) is alive; not sure if this is the case here. The other option is that nats is stuck in a reconnection loop: the connection is closed, but nats keeps trying to reconnect, so it makes sense that there is connection data still available. It seems nats has an "allowReconnect" option on the connection, defaulting to true, but neither go-micro nor reva seem to expose that option. On the other hand, I might be mixing things up between nats and natsjs...
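For reference, the underlying nats.go client does expose these knobs directly; a minimal sketch, assuming a plain nats.go connection (the URL and limits are illustrative):

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	// Reconnection is on by default in nats.go (the Options.AllowReconnect
	// flag mentioned above); these options bound the retry behaviour so a
	// client cannot loop forever against a dead server.
	nc, err := nats.Connect("nats://127.0.0.1:4222",
		nats.MaxReconnects(5),             // give up after 5 attempts
		nats.ReconnectWait(2*time.Second), // pause between attempts
		nats.ClosedHandler(func(_ *nats.Conn) {
			log.Println("connection permanently closed")
		}),
	)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()
}
```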
From what I know, durable consumers are not related to open / closed connections. From https://docs.nats.io/nats-concepts/jetstream/consumers: [quoted documentation omitted]
From what I understand, nats implements a queue (maybe not exactly that, but it's easier to visualize it) where events are published. The queue can mainly behave in 2 different ways (to be checked with Nats):
- an event is delivered to one consumer and then removed from the queue, no matter what the consumer does with it;
- an event stays in the queue until a consumer explicitly acknowledges it, and can be redelivered otherwise.
I guess the first option is what they call an "at most once" delivery guarantee (it's impossible to deliver the same event twice), while the second option is what they call an "at least once" delivery guarantee.

Assuming that what we want is the second option, and the queue is persisted, we'll need to take into account retention policies for the events. As far as I know, it should be impossible to store the events for an unlimited amount of time, so the events must be deleted one way or another. Nats should have some policies to decide when an event is removed. I guess the common option is to remove events older than 1 or 2 weeks, for example. A policy based on the size of the stream might also be common. In any case, I expect that we actively remove processed events from the queue. This is done by the consumer sending an ACK message to the queue.

I think the easiest solution is to give fixed names to each SSE service replica. This is similar to what we have now, but each SSE replica must ALWAYS connect to its own Nats queue, not a random one. The main problem here is the naming: giving a random name will generate a new queue each time the service is restarted, which is what's happening now. Note that we don't know when the queues will be deleted. Basically, we could have replica 1 connected to the sse1 queue, replica 2 connected to sse2 and replica 3 connected to sse3. If replica 2 goes down, we can still publish messages to the sse2 queue; they will be read once replica 2 is up again and reconnected to the sse2 queue.

One of the problems I see with this solution (and likely the current one) is that any kind of event will be in the queue, even those that can't be delivered right now. For example, user1 could share a file with user2, and user2 should receive a notification, but he isn't connected at that moment; in that case the event might be processed, and since user2 isn't available, the event might be ignored even if user2 connects 10 seconds later. This could be considered a lost event, and might be important, so we must have clear expectations of what events we can send through this mechanism and what could happen.

Another thing to investigate on our side is that, even if Nats keeps a queue, when our sse server comes up, the queue might be drained because the server will read the events and send them nowhere (there might be no client connected). There might be alternatives such as having one queue per user, but that would likely require some communication between replicas to ensure the events are sent to all the connected clients (web, desktop or mobile) regardless of which replica they're connected to.
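To make the ACK and retention mechanics above concrete, here is a minimal sketch against the nats.go JetStream API; the stream name, subject, 14-day MaxAge and the fixed durable name "sse1" are illustrative assumptions, not the actual ocis configuration:

```go
package main

import (
	"log"
	"time"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Retention policy: events older than 14 days are dropped by the
	// server even if no consumer ever acknowledged them.
	if _, err := js.AddStream(&nats.StreamConfig{
		Name:     "SSE_EVENTS",
		Subjects: []string{"sse.>"},
		MaxAge:   14 * 24 * time.Hour,
	}); err != nil {
		log.Fatal(err)
	}

	// Fixed durable name instead of a random one: after a restart the
	// replica resumes the same consumer and receives what it missed.
	if _, err := js.Subscribe("sse.>", func(m *nats.Msg) {
		// process the event, then ACK so JetStream marks it delivered
		m.Ack()
	}, nats.Durable("sse1"), nats.ManualAck()); err != nil {
		log.Fatal(err)
	}
	select {} // keep consuming
}
```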
I think you are missing the concept of a […]. However, this is a problem for […]. We already tackled that by defining one consumer group for each sse instance. However, since they are durable, I think the proper solution here is to make the durable setting configurable.
Setting an empty group name should make the consumer ephemeral instead of durable; at least that's how it seems to be wired up. I assume that the "consumer groups" feature won't be active with that setup, so all the SSE replicas should receive all the events (not sure if from the beginning of the queue, or from when they connected). On the other hand, for a subject-based messaging system such as Nats, having all the events sent to a single "main-queue" subject / queue... I'm not an expert, but it doesn't seem like a good idea. I hope we don't need to trash half of our SSE system in the future.
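The difference is visible in the nats.go API: the durable name is just a subscription option, so leaving it out yields an ephemeral consumer. A sketch (subject and durable name assumed, as above):

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()
	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	handler := func(m *nats.Msg) { log.Printf("got %s", m.Subject) }

	// Durable: JetStream tracks delivery state under the name "sse1",
	// and the consumer survives restarts of the subscriber.
	if _, err := js.Subscribe("sse.>", handler, nats.Durable("sse1")); err != nil {
		log.Fatal(err)
	}

	// Ephemeral: no Durable option, so the server creates a throwaway
	// consumer and cleans it up once the subscription is gone. Every
	// replica subscribed like this gets its own copy of each event.
	if _, err := js.Subscribe("sse.>", handler); err != nil {
		log.Fatal(err)
	}
	select {}
}
```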
Probably also possible, yes 👍 It would need to be tested though.
The reason is that we want to make all events available to all services. That way a service can just listen for a certain event if it wants to update its data asynchronously. We could also have one queue for each service. Then a […]
I don't see the connection. Can you elaborate what's wrong with the sse system?
I think one of the points of using a "big" message system is to persist the events so we can eventually process all of them even if one of the subscribers is down. There might also be a need to have a cluster of Nats servers (or at least multiple servers) in order to have fault tolerance. In this case, I don't know how having only one queue could affect the performance and / or reliability of the system, or whether we can scale Nats properly if there is a performance issue.
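On the fault-tolerance point: the nats.go client can be handed several seed servers of a cluster and fails over between them automatically. A sketch with illustrative host names:

```go
package main

import (
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	// The list is only a seed; the client discovers the remaining
	// cluster members and reconnects to a healthy one on failure.
	nc, err := nats.Connect("nats://nats-1:4222,nats://nats-2:4222,nats://nats-3:4222")
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()
	log.Println("connected to", nc.ConnectedUrl())
}
```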
Tried to use an empty group name, but it seems we need more changes.
It seems reva uses the group name as a consumer group, and as such, the name can't be empty (target code in https://github.com/cs3org/reva/blob/edge/pkg/events/events.go#L83). I've tried to bypass that restriction with a patch in reva, but there are still problems: [patch omitted]
The next problem seems to be in go-micro, around https://github.com/go-micro/plugins/blob/6c2dd051b8004c679895363a6e0a842bef428902/v4/events/natsjs/nats.go#L206-L208. The options set there always include the "durable" option based on the group name. go-micro doesn't seem to have options to configure the consumer other than https://github.com/go-micro/go-micro/blob/ca6190f5f289e01b0792f678b3b52dbc07f691e3/events/options.go#L69; in particular, the "durable" option isn't exposed. It seems we'll need changes in go-micro to proceed further down this path.
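A hypothetical sketch of the kind of change being discussed, phrased against the raw nats.go API rather than go-micro's actual internals (the package, function and parameter names are illustrative): only request a durable consumer when a group name was actually given.

```go
package natsjs

import "github.com/nats-io/nats.go"

// subscribe is an illustrative stand-in for the plugin's subscription
// code path, not the real go-micro function.
func subscribe(js nats.JetStreamContext, topic, group string, h nats.MsgHandler) (*nats.Subscription, error) {
	opts := []nats.SubOpt{nats.AckExplicit()}
	if group != "" {
		// non-empty group: durable consumer shared by the queue group
		opts = append(opts, nats.Durable(group))
		return js.QueueSubscribe(topic, group, h, opts...)
	}
	// empty group: plain ephemeral consumer
	return js.Subscribe(topic, h, opts...)
}
```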
We are already using a full nats in production. Isn't that what you would call a "big" message system? It is already running with multiple instances and persists events, so if services go down they will still get all events when they come back.
This is not correct: the SSE service is not CREATING nats events, it is just CONSUMING them! The SSE service only "translates" a nats event to a server-sent event. NOTE: if the SSE service is down, there is no need to persist messages for it, because it will only send events to subscribed users, and users cannot subscribe while the service is down. Please don't confuse […].
We already have that on our production server. So far we haven't encountered any issues with it.
Yes, a PR to go-micro will be needed to make the durable group configurable 👍
Creating non-durable consumers is allowed now: micro/plugins#131. We just need to configure them correctly in the sse service.
Needs a follow-up in ocis. The go-micro part was merged.
#7871 contains some code, but either it wasn't working or I was looking in the wrong place. There is a pending refactor there because we don't want to expose the configuration, but that's another problem.
There was still an issue with the natsjs package. Needs another roundtrip: micro/plugins#135
Describe the bug
From what I know, our NATS consumers are durable (they have a "durable_name" specified).
#7382 added functionality that creates a random consumer group for every SSE server (restart).
If they are really durable, we may hit some limits.
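One way to check would be to list the consumers on the events stream after a few SSE restarts; a sketch with the nats.go client (the stream name "main-queue" is an assumption, adjust to the deployment):

```go
package main

import (
	"fmt"
	"log"

	"github.com/nats-io/nats.go"
)

func main() {
	nc, err := nats.Connect(nats.DefaultURL)
	if err != nil {
		log.Fatal(err)
	}
	defer nc.Drain()

	js, err := nc.JetStream()
	if err != nil {
		log.Fatal(err)
	}

	// Each leftover durable group from a restarted SSE replica would show
	// up here under its random name, with pending messages piling up.
	for info := range js.ConsumersInfo("main-queue") {
		fmt.Printf("%s durable=%v pending=%d\n",
			info.Name, info.Config.Durable != "", info.NumPending)
	}
}
```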
cc @kobergj