SSE durable consumer groups ? #7604

Closed
wkloucek opened this issue Oct 27, 2023 · 20 comments · Fixed by #7986
Assignees
Labels
Interaction:Release-blocker Priority:p2-high Escalation, on top of current planning, release blocker Type:Bug
Milestone

Comments

@wkloucek
Contributor

Describe the bug

From what I know our NATS consumers are durable (they have a "durable_name" specified).

#7382 added functionality that creates a random consumer group for every SSE server (re)start.

If they are really durable, we may hit some limits.

cc @kobergj

@micbar micbar added the Priority:p2-high Escalation, on top of current planning, release blocker label Oct 27, 2023
@micbar
Contributor

micbar commented Oct 27, 2023

@wkloucek Is that a blue ticket?

@wkloucek
Contributor Author

wkloucek commented Oct 27, 2023

@wkloucek Is that a blue ticket?

No, #7382 is actually not part of any release as far as I know. If this turns out to be a bug, it would be nice if it did not make it into any release. (I'm not talking about pre-releases.)

@wkloucek
Contributor Author

@micbar maybe we should tag this as a release blocker until we know that it's a non-issue or have a fix?

@micbar
Contributor

micbar commented Oct 27, 2023

okay, then p2 is ok because it blocks a release.

@micbar micbar moved this from Qualification to Prio 2 in Infinite Scale Team Board Oct 27, 2023
@kobergj
Collaborator

kobergj commented Oct 27, 2023

Do the queues for the sse service really need to be durable? The sse connection is cut anyway when the service restarts. Maybe the best solution would be to just make the sse consumer groups not durable?

@wkloucek
Contributor Author

Maybe the best solution would be to just make the sse consumer groups not durable?

That's basically my question, because I don't know how many durable groups can exist or what their retention is.

It's basically a similar question to #7245.

@wkloucek
Contributor Author

after some up- / down-scaling of the sse service:


~ # nats consumer ls
[default] ? Select a Stream main-queue
Consumers for Stream main-queue:

	audit
	clientlog
	dcfs
	evhistory
	frontend
	postprocessing
	search
	sse-009dc237-e979-4a71-8c33-dbe2b1676235
	sse-017a931f-b312-4c0a-9eec-eae27970d2f7
	sse-0247f0da-4f31-4c46-bb6b-6c895fb9e772
	sse-03bb5bd9-06e8-4a61-a21a-dd8ac5989250
	sse-04a8a318-6947-4560-9736-d95f4e56bd47
	sse-0ab44145-c016-43a8-bc62-a30c4f08c17c
	sse-0acec146-67ca-44bc-9482-3d27b6d78cf7
	sse-0bdc93cd-f5ef-4052-b4ab-381bb716f0ec
	sse-0e35f127-0e5d-4573-ba70-274a6bd9739c
	sse-0f2ba682-4e2a-4482-a50d-f0bb278dc60f
	sse-1028d48d-e0c0-49fc-8f06-124a47a4cb22
	sse-13d7747d-fb30-4535-8a33-7240e38e9cf9
	sse-1402133d-c6f0-4080-ac7b-02917c76a806
	sse-140c5c6b-3a51-46a8-9243-64b5e6257209
	sse-18687fb3-8ae9-4202-ae76-dc14df6c9905
	sse-19cb8253-2c31-4c7c-ac81-3351e57e7836
	sse-1a17272e-d1b5-4263-b515-ef1bc9b2cf4c
	sse-1baa37f4-322a-4d4e-bab6-d69b36a12ded
	sse-1e9c703a-9503-47ca-a29b-ceb69d23b03b
	sse-215afa79-12cb-4545-9a40-c6fffad2dcbe
	sse-220484bf-504d-4f52-84e6-255b5ff82e11
	sse-26b041be-48a9-4a21-bfea-e4cc56a3c40e
	sse-272662bc-56da-4d7f-b9b4-e60175e62b80
	sse-2973859c-3154-4f73-acd4-b63df4011448
	sse-2eb2865c-32da-4bd0-b939-d47807a9e760
	sse-2f418ebc-bc1f-49d4-be89-0a844f7bf4db
	sse-305ee5ea-ae8d-4f29-8e7f-ec38cb9a6fd9
	sse-307e1c29-61d5-4908-8d12-b5fd25b9297e
	sse-3222d01f-f1ed-437e-8e49-82666e12704f
	sse-3267885b-6d57-4499-9fc8-ec789675698c
	sse-339397ae-3d4a-4a63-af6f-41de8625f00e
	sse-36f474f4-f040-4e0a-b269-aad36323f86a
	sse-3b51b8b2-e987-4898-ac79-0f683dfb786f
	sse-3c62bae7-7f77-4de2-99b0-a47c670230df
	sse-40728367-3a53-445b-bd57-29c528e40f68
	sse-4359cc73-3f03-4436-b4cb-971cfc812284
	sse-4744199e-0d4a-498b-b719-7c42357fa355
	sse-4785cc97-a6df-40fa-bf38-f79cf284ce55
	sse-4b68b9fa-e8a9-4e9e-b2fc-32d5819089f3
	sse-4b7a844b-a74e-4182-b702-b2e4e0c3d780
	sse-4d796095-8972-4b2b-bcb6-ab253c2575d3
	sse-4ec6c7b5-3157-4108-bc21-46b48311e469
	sse-5098dde5-0a13-4169-8905-366f5c68f571
	sse-54a26c6b-0706-4159-be2b-2b7518e50aeb
	sse-5eaa230e-18e6-4933-9c83-3ac3012ddd4a
	sse-61febc08-3727-4874-8d66-c0b157aa332b
	sse-6603b337-bf2a-40c3-806a-e619303d75c3
	sse-698b3335-81e6-43f1-8eb5-fa1d0c569de2
	sse-71bd12ef-fb75-46dd-899e-510099746b96
	sse-71e6cd1f-5a2b-4ec5-9c06-a21177122a2a
	sse-7335141d-c7dc-4408-ac3a-94515ed7eb88
	sse-76daabf2-600a-4ee0-9ea0-a202d6705665
	sse-7905b630-ae8a-4683-a145-3ccb23e99145
	sse-7992d552-5c32-47fa-bc5e-57cd98ea63aa
	sse-7c162490-de46-4299-8b77-ce0d2e41b8f5
	sse-7c2074df-6c38-4868-a3d5-0366c1d7a9d5
	sse-7de6732a-87a5-43e4-aeb9-1e8ad4353286
	sse-7eb895e7-ab74-45fc-acca-36a18d617f57
	sse-7efd0b6c-df0a-4b9c-80ca-b6254c715058
	sse-7f85d263-c417-41fe-9e25-aae835418a91
	sse-7fad12d4-2999-4d4b-b019-5abedfcc4f1b
	sse-847418af-9d9e-4702-8ace-94750289cb54
	sse-87124767-8062-461f-85b1-933fcf5719bb
	sse-87aa1253-729d-4f94-a714-b960529657ba
	sse-8b5cd77d-7ffb-4bc7-a25e-9ba6526bf4b8
	sse-90e5fae4-17fa-4488-99f4-d9a48183a04d
	sse-91895811-934c-436f-9730-8f258480ac2b
	sse-91f38a3b-9750-42b2-871b-bebec2527b78
	sse-96838843-04b8-40f1-936e-a7e673d96ec9
	sse-988ae96c-e21a-44dc-89ed-8b7cf858d8ed
	sse-98ea0979-1c97-4493-911d-c3a02517b64c
	sse-99810a09-aa44-4e1e-b70a-f8de968723aa
	sse-99b1c5c9-fac6-4de9-be27-23e8e073ec01
	sse-9c3516cf-8c5b-4471-92e6-e7b671ba6195
	sse-9cc98598-10c1-43c8-90e1-d645bd8b06a3
	sse-9d04b8b8-907d-48d0-8459-454a5b87dfc9
	sse-a0dd351d-6ded-43c3-9269-bd59c4966ca2
	sse-a2fbae05-0fc8-4998-9945-2351aac7e33d
	sse-a5de8c57-62b9-4529-8557-14a6dae2a0e1
	sse-a5eb1294-2cf2-4824-b82a-7213b6f72f8f
	sse-a72320aa-c1ba-4cc4-a516-5c7f42baca14
	sse-a82992db-446f-48f4-9e46-c2042aa3ee8a
	sse-a87805be-7c76-4903-a12b-649a8cc2ee83
	sse-ad298d64-4584-4f84-bad5-95d3c86271f1
	sse-addb9b3f-8730-4222-bfbc-6b0ee7846724
	sse-aefc873c-bc0f-402c-9905-c89cf3c9b7b3
	sse-af67c211-4873-4ba2-88aa-6c90278d4466
	sse-afb4e750-6755-4bdf-b963-1debfe281938
	sse-afc803d1-cba7-4304-8f70-7a1eeb184c41
	sse-b46b807c-6d8e-444d-bb33-4f7ac483225a
	sse-bac1945f-3b24-4c89-9f8d-dfb82cd73754
	sse-be3430b6-f259-4e64-a198-37193d1bd036
	sse-bea52fce-e054-4395-8360-8ae01ad43146
	sse-c5bf1842-69ec-49a8-b9f6-8dd217611a62
	sse-c8b9ff40-5c61-4860-ac92-1e14b5941193
	sse-ca8c7ad0-e5f5-466c-ac48-115ec01059b6
	sse-cd2c0b7a-cd93-4ee5-917f-0c71b2daeb1e
	sse-cf641edd-2fea-4a17-a51d-8e4aa9e1c515
	sse-d0275b29-249e-488c-a7d0-576f66348217
	sse-d2fa1ff5-ea02-44e5-bee4-0736f612f63c
	sse-de2b36a7-5308-4c78-8bc5-2a06f92e9ea6
	sse-de33dccb-e399-47d9-aeb4-35b70a3b9e95
	sse-de69f1de-0715-486d-85bb-ec0404830904
	sse-e153082e-dc26-48c5-a1df-6c9640ff376a
	sse-e1d1e772-9265-4bd4-8b33-946fabd3b023
	sse-e27f6d75-9755-495e-8d46-2a99a5516e2b
	sse-e5e254af-da26-45c2-b6e0-cb54aace7983
	sse-e75ca8b6-1805-446f-b131-1aefb6283bc6
	sse-e7778018-ec13-46fe-8904-91b2cfcb99c6
	sse-eb950885-675e-4ead-9825-5a279d16631b
	sse-ec4a96ed-f2c1-439a-8c66-cae199b64871
	sse-ec5781a9-c4da-4919-a9d8-cf265d5e584d
	sse-ed02b01b-f758-47b8-bc79-ca52b370722a
	sse-ed1d2718-a60b-4b44-8894-ceb0eb5a1fa1
	sse-edb581d9-94f4-4c64-93b9-5bd97254c643
	sse-edf4bddd-93d3-4e5c-b3fe-8cdb5f65428c
	sse-ee7865b4-5fa0-4c67-8b85-d6ad95844db0
	sse-efbf11a8-49a0-44b6-8397-442cf0d60921
	sse-efef1e09-b610-4a4a-8bb4-e7420fa65548
	sse-f0131988-ca28-4626-a927-e8bf71ceb3d2
	sse-f24fe3ab-d57c-448b-a4e0-792c541c210c
	sse-f26c4405-7036-48ac-8891-3c156983e98e
	sse-f41de0ad-58f3-43df-88a3-b3ec2cafa8b4
	sse-f56f1f51-da00-468e-8a83-3597e0e23a6d
	sse-f6475936-e7f6-424f-9e2c-b74806fea141
	sse-f65c4e7b-ba8d-44f8-8d00-72d88f899531
	sse-fcdbe6f5-d4c8-42e5-8190-7f7cd6c165cd
	sse-fd1ad6e0-6254-4206-a4e2-91c8f9ff351b
	sse-fe9bb8bb-c910-4738-a58a-3f2af392c7ea
	storage-users
	userlog
~ # nats consumer info --json
[default] ? Select a Stream main-queue
[default] ? Select a Consumer sse-009dc237-e979-4a71-8c33-dbe2b1676235
{
  "stream_name": "main-queue",
  "name": "sse-009dc237-e979-4a71-8c33-dbe2b1676235",
  "config": {
    "ack_policy": "explicit",
    "ack_wait": 30000000000,
    "deliver_policy": "new",
    "deliver_subject": "_INBOX.4BvN1l41C3PjF4LX9ChFqo",
    "deliver_group": "sse-009dc237-e979-4a71-8c33-dbe2b1676235",
    "durable_name": "sse-009dc237-e979-4a71-8c33-dbe2b1676235",
    "name": "sse-009dc237-e979-4a71-8c33-dbe2b1676235",
    "filter_subject": "main-queue",
    "max_ack_pending": 1000,
    "max_deliver": -1,
    "replay_policy": "instant",
    "num_replicas": 0
  },
  "created": "2023-10-27T11:16:16.494526346Z",
  "delivered": {
    "consumer_seq": 0,
    "stream_seq": 0
  },
  "ack_floor": {
    "consumer_seq": 0,
    "stream_seq": 0
  },
  "num_ack_pending": 0,
  "num_redelivered": 0,
  "num_waiting": 0,
  "num_pending": 0,
  "cluster": {
    "name": "nats",
    "leader": "nats-1",
    "replicas": [
      {
        "name": "nats-0",
        "current": true,
        "active": 436664270
      },
      {
        "name": "nats-2",
        "current": true,
        "active": 436691997
      }
    ]
  },
  "push_bound": true,
  "ts": "2023-10-27T11:29:03.127697701Z"
}
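
For reading such output programmatically, here is a minimal sketch using only the Go standard library. The struct covers just the fields relevant to this discussion, and the helper names (`consumerInfo`, `parseConsumerInfo`, `isDurable`) are made up for illustration. It assumes `ack_wait` is expressed in nanoseconds, which matches the value shown above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// consumerInfo holds the few fields of the `nats consumer info --json`
// output that matter for this discussion.
type consumerInfo struct {
	Name   string `json:"name"`
	Config struct {
		DurableName  string        `json:"durable_name"`
		DeliverGroup string        `json:"deliver_group"`
		AckWait      time.Duration `json:"ack_wait"` // nanoseconds in the JSON
	} `json:"config"`
	PushBound bool `json:"push_bound"`
}

// parseConsumerInfo decodes the JSON document into consumerInfo.
func parseConsumerInfo(raw []byte) (consumerInfo, error) {
	var ci consumerInfo
	err := json.Unmarshal(raw, &ci)
	return ci, err
}

// isDurable reports whether the consumer was created with a durable name.
func isDurable(ci consumerInfo) bool { return ci.Config.DurableName != "" }

// sample is trimmed from the output above.
const sample = `{
  "name": "sse-009dc237-e979-4a71-8c33-dbe2b1676235",
  "config": {
    "ack_wait": 30000000000,
    "deliver_group": "sse-009dc237-e979-4a71-8c33-dbe2b1676235",
    "durable_name": "sse-009dc237-e979-4a71-8c33-dbe2b1676235"
  },
  "push_bound": true
}`

func main() {
	ci, err := parseConsumerInfo([]byte(sample))
	if err != nil {
		panic(err)
	}
	fmt.Printf("%s durable=%v ack_wait=%s bound=%v\n",
		ci.Name, isDurable(ci), ci.Config.AckWait, ci.PushBound)
}
```

A non-empty `durable_name` is what marks the consumer as durable here, independent of whether a subscription is currently bound to it.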

@jvillafanez
Member

I think we're lacking something to close the connection.
I'd expect that, when the sse microservice stops, it sends a close message to nats or closes the connection. So far that doesn't happen, and it seems nats doesn't know whether the connection is active or not.

From other systems, the client (nats in this case) might need to read or write on the connection in order to know whether the server (sse in this case) is alive, not sure if this is the case.

The other option is that nats is stuck in a reconnection loop. The connection is closed, but nats keeps trying to reconnect, so it makes sense that connection data is still available.

It seems nats has an "allowReconnect" option on the connection, defaulting to true, but neither go-micro nor reva seems to expose that option. On the other hand, I might be mixing things up between nats and natsjs....

@wkloucek
Contributor Author

From what I know durable consumers are not related to open / closed connections.

From https://docs.nats.io/nats-concepts/jetstream/consumers:

durable: If set, clients can have subscriptions bind to the consumer and resume until the consumer is explicitly deleted. A durable name cannot contain whitespace, ., *, >, path separators (forward or backwards slash), and non-printable characters.
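
The naming restrictions in that quote can be checked up front. A minimal sketch follows; `validDurableName` is a hypothetical helper, not part of any NATS client, and rejecting the empty string is an extra assumption on my side.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// validDurableName reports whether name satisfies the restrictions quoted
// above: no whitespace, no '.', '*', '>', no path separators and no
// non-printable characters. Rejecting "" is an assumption of this sketch.
func validDurableName(name string) bool {
	if name == "" {
		return false
	}
	for _, r := range name {
		if unicode.IsSpace(r) || !unicode.IsPrint(r) || strings.ContainsRune(".*>/\\", r) {
			return false
		}
	}
	return true
}

func main() {
	names := []string{
		"sse-009dc237-e979-4a71-8c33-dbe2b1676235",
		"sse.replica",
		"sse replica",
		"sse/1",
	}
	for _, n := range names {
		fmt.Printf("%q valid=%v\n", n, validDurableName(n))
	}
}
```

The `sse-<uuid>` names from the listing pass this check, so the naming itself is not the problem; the number of consumers left behind is.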

@jvillafanez
Member

From what I understand, nats implements a queue (maybe not exactly that, but it's easier to visualize it) where events are published.

The queue can mainly behave in 2 different ways (to be checked with Nats):

  • A random consumer "attached" to the queue reads and removes the next event in the queue. This means that the event is delivered only to one consumer.
  • The queue is persisted, and any consumer can read from either the start or the end of the persisted queue.

I guess the first option is what they call a "at most once delivery guarantee" (it's impossible to deliver the same event twice), while the second option is what they call "at least once delivery guarantee"

Assuming that what we want is the second option, and the queue is persisted, we'll need to take into account retention policies for the events. As far as I know, it should be impossible to store the events for an unlimited amount of time, so the events must be deleted one way or another.

Nats should have some policies to decide when an event is removed. I guess the common option is to remove events older than 1 or 2 weeks, for example. A policy based on the size of the stream might be also common.
My point is that Nats will remove events from the stream eventually.
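
To make the two kinds of limits concrete, here is a toy model of age-based and count-based pruning. It is only a sketch of the idea; the `limits` type and its field names are invented for illustration and are not the NATS configuration.

```go
package main

import (
	"fmt"
	"time"
)

type message struct {
	seq      int
	storedAt time.Time
}

// limits is a hypothetical retention configuration for this sketch.
type limits struct {
	maxAge  time.Duration // 0 means no age limit
	maxMsgs int           // 0 means no count limit
}

const threeDays = 72 * time.Hour

// prune drops messages older than maxAge, then keeps only the newest maxMsgs.
func prune(msgs []message, l limits, now time.Time) []message {
	var kept []message
	for _, m := range msgs {
		if l.maxAge > 0 && now.Sub(m.storedAt) > l.maxAge {
			continue
		}
		kept = append(kept, m)
	}
	if l.maxMsgs > 0 && len(kept) > l.maxMsgs {
		kept = kept[len(kept)-l.maxMsgs:]
	}
	return kept
}

// demo builds five messages (seq 1 is five days old, seq 5 is one day old)
// and returns the sequence numbers that survive pruning.
func demo(l limits) []int {
	now := time.Date(2023, 10, 27, 12, 0, 0, 0, time.UTC)
	var msgs []message
	for i := 1; i <= 5; i++ {
		age := time.Duration(6-i) * 24 * time.Hour
		msgs = append(msgs, message{seq: i, storedAt: now.Add(-age)})
	}
	var seqs []int
	for _, m := range prune(msgs, l, now) {
		seqs = append(seqs, m.seq)
	}
	return seqs
}

func main() {
	fmt.Println("age limit of three days keeps:", demo(limits{maxAge: threeDays}))
	fmt.Println("count limit of two keeps:", demo(limits{maxMsgs: 2}))
}
```

Either limit removes events regardless of whether any consumer has processed them, which is the point made above.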

In any case, I expect that we actively remove processed events from the queue. This is done by the consumer sending an ACK message to the queue.
I see some options in go-micro to auto-ack the events or do it manually. This "ACKing" should remove the processed event (the one we're "ACKing") from the queue, so the queue stays within the limits specified by Nats.
This is where the "at least once delivery guarantee" comes into play. Basically, "consumer-A" reads the first event, "consumer-B" also reads the same event, both of them ACK the event (meaning, both of them have processed the event), and then "consumer-C" can't read the first event (removed by the ACK message) but reads the next one.
In any case, we must NOT consider that all the events will be delivered to all the consumers of the queue.


I think the easiest solution is to give fixed names to each SSE service replica.

This is similar to what we have now, but each SSE replica must ALWAYS connect to its own Nats queue, not a random one.

The main problem here is the naming: giving a random name will generate a new queue each time the service is reset, which is what's happening now. Note that we don't know when the queues will be deleted.
In addition, I think the names must be known, because we'll need to publish the events in all known queues. Publishing in every queue might be bad because we don't know which queues are ours; and we could publish in 100 queues instead of 3, which could have a performance impact.

Basically, we could have replica 1 connected to the sse1 queue, replica 2 connected to sse2 and replica 3 connected to sse3. If replica 2 goes down, we can still publish messages to the sse2 queue that will be read once replica 2 is up again and reconnected to the sse2 queue.
Note that I'm pretty sure we can't publish to unknown queues, so creating a new queue implies that messages sent up until that moment won't be available to the new queue, which means that events will be lost for those connecting to the new queue.
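
The difference between random and fixed names can be shown with a toy registry. This is a sketch under the assumption that a durable entry is never removed on disconnect; `registry`, `randomName` and `simulateRestarts` are invented for illustration, and the random name uses plain random hex rather than a real UUID.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// registry is a toy stand-in for the server-side list of durable consumers.
// Entries survive disconnects; nothing in this sketch removes them.
type registry map[string]struct{}

// attach binds to the named durable consumer, creating it on first use.
func (r registry) attach(name string) { r[name] = struct{}{} }

// randomName mimics the "sse-"+uuid naming with random hex instead of a UUID.
func randomName() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return "sse-" + hex.EncodeToString(b)
}

// simulateRestarts attaches once per restart and returns how many
// consumers exist afterwards.
func simulateRestarts(restarts int, name func() string) int {
	r := registry{}
	for i := 0; i < restarts; i++ {
		r.attach(name())
	}
	return len(r)
}

func main() {
	fmt.Println("random names:", simulateRestarts(5, randomName))
	fmt.Println("fixed name:", simulateRestarts(5, func() string { return "sse1" }))
}
```

With random names the registry grows by one entry per restart, while a fixed name keeps reusing the same entry.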

One of the problems I see with this solution (and likely the current one) is that any kind of event will be in the queue, even those that can't be delivered right now. For example, user1 could share a file with user2, and user2 should receive a notification, but he isn't connected at that moment; in that case the event might be processed, and since user2 isn't available, the event might be ignored even if user2 connects 10 seconds later. This could be considered a lost event, and might be important, so we must have clear expectations of what events we can send through this mechanism and what could happen.
The other option is that we mark that event as not ack so it's kept in the queue, but as far as I know we'll keep fetching new events without a chance to get back. There is no second consumer in that queue that could get the event in order to retry processing it.
I think that, right now, the event will be read from Nats and then delivered to all the connected sse clients (which might be none, or they might ignore the event because it isn't for them)

Another thing to investigate on our side is that, even if Nats will keep a queue, when our sse server goes up, the queue might be drained because the server will read the events and send them nowhere (there might be no client connected).

There might be alternatives such as having one queue per user, but it would likely require some communication between replicas to ensure the events are sent to all the connected clients (web, desktop or mobile) regardless of which replica they're connected to.

@kobergj
Collaborator

kobergj commented Nov 27, 2023

I think you are missing the concept of a consumer group. In ocis we have only one queue (called main-queue). Each service connecting to it is part of a consumer group. This guarantees that each service (which might have multiple instances) only gets the event once.

However this is a problem for sse as all sse services need to get all events. (Only one of the service instances holds the connection to a specific user). Therefore the sse service instances must NOT be in a common consumer group.

We already tackled that by defining one consumer group for each sse instance. However since they are durable they will not be deleted.
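
A toy model of the consumer-group behaviour described here, as a sketch only: every group sees each event once, and within a group one member gets it. The round-robin choice of member is a simplification of this sketch, and all type and function names are made up.

```go
package main

import "fmt"

// stream is a toy model of a single stream with consumer groups: every
// group sees each event once, handed to one of its members.
type stream struct {
	groups map[string][]string // group name -> member instances
	next   map[string]int      // group name -> round-robin cursor
}

func newStream() *stream {
	return &stream{groups: map[string][]string{}, next: map[string]int{}}
}

// join adds an instance to a consumer group.
func (s *stream) join(group, instance string) {
	s.groups[group] = append(s.groups[group], instance)
}

// publish emits the given number of events and counts deliveries per instance.
func (s *stream) publish(events int) map[string]int {
	received := map[string]int{}
	for e := 0; e < events; e++ {
		for g, members := range s.groups {
			m := members[s.next[g]%len(members)]
			s.next[g]++
			received[m]++
		}
	}
	return received
}

// sharedGroup puts both sse instances into one group.
func sharedGroup(events int) map[string]int {
	s := newStream()
	s.join("sse", "sse-a")
	s.join("sse", "sse-b")
	return s.publish(events)
}

// groupPerInstance gives every sse instance its own group.
func groupPerInstance(events int) map[string]int {
	s := newStream()
	s.join("sse-a", "sse-a")
	s.join("sse-b", "sse-b")
	return s.publish(events)
}

func main() {
	fmt.Println("shared group:", sharedGroup(4))
	fmt.Println("group per instance:", groupPerInstance(4))
}
```

In the shared group the events are split between the instances, so a user connected to one instance would miss part of them; with one group per instance every instance sees everything.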

I think the proper solution here is to make the durable flag of the nats consumers configurable and set it to false for the sse service only.

@jvillafanez
Member

I think the proper solution here is to make the durable flag of the nats consumers configurable and set it to false for the sse service only.

Setting an empty group name should make the consumer ephemeral instead of durable. At least that's how it seems to be wired up. I assume that the "consumer groups" feature won't be active with that setup, so all the SSE replicas should receive all the events (not sure if from the beginning of the queue, or from when they connected).

On the other hand, for a subject messaging system such as Nats to have all the events sent to a "main-queue" subject / queue... I'm not an expert, but it doesn't seem a good idea. I hope we don't need to trash half of our SSE system in the future

@kobergj
Collaborator

kobergj commented Nov 27, 2023

Setting an empty group name should make the consumer ephemeral instead of durable. At least that's how it seems to be wired up. I assume that the "consumer groups" feature won't be active with that setup, so all the SSE replicas should receive all the events (not sure if from the beginning of the queue, or from when they connected).

Probably also possible, yes 👍 It would need to be tested though.

On the other hand, for a subject messaging system such as Nats to have all the events sent to a "main-queue" subject / queue... I'm not an expert, but it doesn't seem a good idea.

The reason is that we want to make all events available to all services. That way a service can just listen for a certain event if it wants to update its data asynchronously. We could also have one queue for each service. Then a publish would need to publish to all service queues. (probably through a fan-out queue or something similar) This would however mean we store each event multiple times. Hence we did go for the single queue first, because we can still switch to multiple queues when we see fit.
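
The storage trade-off can be illustrated with a small count. This is only a sketch: the `queues` type and both publish helpers are invented, and the service names are borrowed from the consumer listing earlier in the thread.

```go
package main

import "fmt"

// queues maps a queue name to the events stored in it.
type queues map[string][]string

// publishSingle stores the event once, in the shared queue.
func publishSingle(q queues, event string) {
	q["main-queue"] = append(q["main-queue"], event)
}

// publishFanOut stores one copy of the event per service queue.
func publishFanOut(q queues, services []string, event string) {
	for _, s := range services {
		q[s] = append(q[s], event)
	}
}

// stored counts the events held across all queues.
func stored(q queues) int {
	n := 0
	for _, evs := range q {
		n += len(evs)
	}
	return n
}

func main() {
	services := []string{"audit", "search", "userlog", "sse"}
	single, fanOut := queues{}, queues{}
	for _, ev := range []string{"e1", "e2", "e3"} {
		publishSingle(single, ev)
		publishFanOut(fanOut, services, ev)
	}
	fmt.Println("single queue stores:", stored(single))
	fmt.Println("per-service queues store:", stored(fanOut))
}
```

With per-service queues the stored volume grows with the number of services, which is the cost mentioned above.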

I hope we don't need to trash half of our SSE system in the future

I don't see the connection. Can you elaborate on what's wrong with the sse system?

@jvillafanez
Member

I hope we don't need to trash half of our SSE system in the future

I don't see the connection. Can you elaborate on what's wrong with the sse system?

I think one of the points of using a "big" message system is to persist the events so we can eventually process all the events even if one of the subscribers is down.
If our SSE service goes down, ownCloud could still generate events (user1 shared with user2, for example). Ideally, that event could be persisted in Nats for a while (maybe days or weeks, depending on the configured limits) so eventually the event is processed when the SSE service is finally up again. The event could be sent to the destination after 2 days or 2 weeks. This would increase the reliability of the system because no event would be lost (there might still be cases, but they should be rare).
Without persistence, if the SSE service goes down, all the events generated during that time will be lost because there is no service consuming the events. It might be acceptable for now, but it might limit what events can be sent in the future.

There might be also a need to have a cluster of Nats servers (or at least multiple servers) in order to have fault tolerance. In this case, I don't know how having only one queue could affect the performance and / or reliability of the system. There could be a performance issue if we can't scale Nats properly.
Maybe Nats handles this problem nicely and there is no such issue.

@jvillafanez
Member

Tried to use an empty group name, but it seems we need more changes.

diff --git a/services/sse/pkg/server/http/server.go b/services/sse/pkg/server/http/server.go
index 010d63ed98..61aae8545c 100644
--- a/services/sse/pkg/server/http/server.go
+++ b/services/sse/pkg/server/http/server.go
@@ -8,7 +8,6 @@ import (
        "github.com/cs3org/reva/v2/pkg/events"
        "github.com/go-chi/chi/v5"
        chimiddleware "github.com/go-chi/chi/v5/middleware"
-       "github.com/google/uuid"
        "github.com/owncloud/ocis/v2/ocis-pkg/account"
        "github.com/owncloud/ocis/v2/ocis-pkg/cors"
        "github.com/owncloud/ocis/v2/ocis-pkg/middleware"
@@ -79,7 +78,7 @@ func Server(opts ...Option) (http.Service, error) {
                ),
        )
 
-       ch, err := events.Consume(options.Consumer, "sse-"+uuid.New().String(), options.RegisteredEvents...)
+       ch, err := events.Consume(options.Consumer, "", options.RegisteredEvents...)
        if err != nil {
                return http.Service{}, err
        }

It seems reva uses the group name as a consumer group, and as such, the name can't be empty (target code in https://github.com/cs3org/reva/blob/edge/pkg/events/events.go#L83)

I've tried to bypass that restriction with the following patch in reva, but there are still problems:

diff --git a/pkg/events/events.go b/pkg/events/events.go
index 662a56f2..35bf6eca 100644
--- a/pkg/events/events.go
+++ b/pkg/events/events.go
@@ -80,7 +80,11 @@ type (
 // group defines the service type: One group will get exactly one copy of a event that is emitted
 // NOTE: uses reflect on initialization
 func Consume(s Consumer, group string, evs ...Unmarshaller) (<-chan Event, error) {
-       c, err := s.Consume(MainQueueName, events.WithGroup(group))
+       consumeOpts := []events.ConsumeOption{}
+       if group != "" {
+               consumeOpts = append(consumeOpts, events.WithGroup(group))
+       }
+       c, err := s.Consume(MainQueueName, consumeOpts...)
        if err != nil {
                return nil, err
        }

The next problem seems to be in go-micro, around https://github.com/go-micro/plugins/blob/6c2dd051b8004c679895363a6e0a842bef428902/v4/events/natsjs/nats.go#L206-L208

The options set there always include the "durable" option based on the group name. go-micro doesn't seem to have options to configure the consumer other than https://github.com/go-micro/go-micro/blob/ca6190f5f289e01b0792f678b3b52dbc07f691e3/events/options.go#L69 . In particular, the "durable" option isn't exposed.
The problem is that, with the patches above, if we don't set a group name, go-micro will generate a random uuid and use it as group name, so the only option we have is to overwrite the group name. However, overwriting the group name with an empty string isn't possible.

It seems we'll need changes in go-micro to proceed further through this path.
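
The wiring problem can be reproduced with a stripped-down functional-options model. This is a sketch of the behaviour as described above, not the go-micro code; every name in it is made up, and `withoutDurable` is a hypothetical option showing what an explicit opt-out could look like.

```go
package main

import (
	"crypto/rand"
	"encoding/hex"
	"fmt"
)

// consumeOptions is a toy version of the options a Consume call resolves.
type consumeOptions struct {
	group   string
	durable bool
}

type consumeOption func(*consumeOptions)

// withGroup sets the consumer group name.
func withGroup(g string) consumeOption {
	return func(o *consumeOptions) { o.group = g }
}

// withoutDurable is HYPOTHETICAL: an explicit opt-out from durable consumers.
func withoutDurable() consumeOption {
	return func(o *consumeOptions) { o.durable = false }
}

func randomID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return hex.EncodeToString(b)
}

// resolve mirrors the behaviour described above: consumers are durable by
// default and an empty group is replaced by a random one.
func resolve(opts ...consumeOption) consumeOptions {
	o := consumeOptions{durable: true}
	for _, opt := range opts {
		opt(&o)
	}
	if o.group == "" {
		o.group = randomID()
	}
	return o
}

// durableName is the name that would be registered, or "" for ephemeral.
func durableName(o consumeOptions) string {
	if !o.durable {
		return ""
	}
	return o.group
}

func main() {
	fmt.Printf("no group:       durable=%q\n", durableName(resolve()))
	fmt.Printf("empty group:    durable=%q\n", durableName(resolve(withGroup(""))))
	fmt.Printf("withoutDurable: durable=%q\n", durableName(resolve(withoutDurable())))
}
```

In this model, passing an empty group still ends up with a random durable name, so only a dedicated option can produce an ephemeral consumer.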

@kobergj
Collaborator

kobergj commented Nov 29, 2023

I think one of the points of using a "big" message system is to persist the events so we can eventually process all the events even if one of the subscribers is down.

We are already using a full nats in production. Isn't that what you would call a "big" message system? It is already running with multiple instances and persists events. So if services go down they will still get all events when they come back.

Without persistence, if the SSE service goes down, all the events generated during that time will be lost because there is no service consuming the events

This is not correct. The SSE service is not CREATING nats events, it is just CONSUMING them! The SSE service is only "translating" a nats event to a server sent event. NOTE: If the SSE service is down, there is no need to persist messages for it, because it will only send events to subscribed users - and users cannot subscribe if the service is down.

Please don't confuse server sent events with nats events. These are two different things!
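
To illustrate the "translating" step, here is a minimal sketch: a consumed event is rendered in the `text/event-stream` wire format and forwarded only to users connected to this instance. The `hub` type, the `deliver` method and the event name used in `main` are invented for this example and are not the ocis implementation.

```go
package main

import (
	"fmt"
	"strings"
)

// formatSSE renders one message in the text/event-stream wire format:
// an "event:" line, one "data:" line per payload line, and a blank line.
func formatSSE(event, data string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "event: %s\n", event)
	for _, line := range strings.Split(data, "\n") {
		fmt.Fprintf(&b, "data: %s\n", line)
	}
	b.WriteString("\n")
	return b.String()
}

// hub maps a user ID to the open connections of that user on this instance.
type hub map[string][]chan string

// deliver forwards the message to every connection of the user and returns
// how many connections were reached. Users without a connection get nothing.
func (h hub) deliver(userID, event, data string) int {
	msg := formatSSE(event, data)
	n := 0
	for _, c := range h[userID] {
		select {
		case c <- msg:
			n++
		default: // slow client: drop instead of blocking the consumer loop
		}
	}
	return n
}

func main() {
	h := hub{}
	conn := make(chan string, 1)
	h["user2"] = append(h["user2"], conn)

	// "share-created" is a made-up event name for this sketch.
	fmt.Println("delivered to user2:", h.deliver("user2", "share-created", `{"item":"file.txt"}`))
	fmt.Println("delivered to user3:", h.deliver("user3", "share-created", `{"item":"file.txt"}`))
	fmt.Print(<-conn)
}
```

An event for a user with no open connection is simply dropped by this instance, which is why persisting nats events for a stopped SSE instance would not help.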

There might be also a need to have a cluster of Nats servers (or at least multiple servers) in order to have fault tolerance. In this case, I don't know how having only one queue could affect the performance and / or reliability of the system.

We already have that on our production server. So far we didn't encounter any issues regarding that.

It seems we'll need changes in go-micro to proceed further through this path.

Yes, a PR to go-micro will be needed to make the durable group configurable 👍

@kobergj
Collaborator

kobergj commented Nov 30, 2023

Creating non-durable consumers is allowed now: micro/plugins#131

We just need to configure them correctly in the sse service.

@micbar
Contributor

micbar commented Dec 11, 2023

Needs a follow-up in ocis. The go-micro part was merged.

@jvillafanez
Member

#7871 contains some code, but either it wasn't working or I was looking at the wrong place. There is a pending refactor there because we don't want to expose the configuration, but that's another problem.

@kobergj
Collaborator

kobergj commented Dec 13, 2023

There was still an issue with the natsjs package. It needs another round trip: micro/plugins#135

@kobergj kobergj self-assigned this Dec 15, 2023
@github-project-automation github-project-automation bot moved this from Prio 2 to Done in Infinite Scale Team Board Dec 18, 2023
@micbar micbar added this to the Release 5.0.0 milestone Jan 22, 2024
5 participants