
[Background reports] Infinite loader #11752

Closed
RachL opened this issue Nov 2, 2023 · 51 comments · Fixed by #12782
Labels
bug-s3 The bug is stopping a critical or non-critical feature but there is a usable workaround. prod-test

Comments

@RachL
Contributor

RachL commented Nov 2, 2023

Description

A test scenario which ends up with an infinite loader for the user.

Steps to Reproduce

  1. Go to https://coopcircuits.fr/admin/reports/packing/customer as hub 922 or super admin with hub 922 as filter
  2. Change the start date to January 1st 2023 (keep the end date as is)
  3. Ask for xlsx file
  4. See that you get an email after 3 minutes with the file, while the report page still shows the loader

Severity

S3 as we get the email in the end

Your Environment

  • Version used: v4.4.17
  • Browser name and version: Firefox and Chrome
  • Operating System and version (desktop or mobile): desktop
@RachL RachL added the bug-s3 The bug is stopping a critical or non-critical feature but there is a usable workaround. label Nov 2, 2023
@github-project-automation github-project-automation bot moved this to All the things in OFN Delivery board Nov 2, 2023
@mkllnk

This comment was marked as outdated.

@RachL

This comment was marked as outdated.

@mkllnk

This comment was marked as outdated.

@RachL
Contributor Author

RachL commented Jan 25, 2024

@mkllnk FYI I've reproduced it. Can it be linked to the number of reports run in parallel/at the same time?

@mkllnk
Member

mkllnk commented Jan 30, 2024

I've reproduced it.

Great, how?

Can it be linked to the number of reports run in parallel/at the same time?

Are you running multiple reports in the same tab? I would need to look into that. Hm, no, you can't be running another report because you are waiting for the first one in that tab. But maybe you are running another one in another tab. It should work though. Let me know how I can reproduce it and then I may find the culprit.

@RachL
Contributor Author

RachL commented Jan 30, 2024

@mkllnk same scenario as in the issue. Only one tab opened. However, this scenario passes if I try during low-traffic hours (evening, for example). So maybe when you try in your timezone it will always work?

I guess maybe we can't fix this one atm and we need more occurrences before we can do something.

@mkllnk
Member

mkllnk commented Jan 31, 2024

Interesting, I just tried to load that report and I didn't even see the loading screen. The console gave me a hint though:

The connection to wss://coopcircuits.fr/cable was interrupted while the page was loading.

The site was generally very slow.

@mkllnk
Member

mkllnk commented Jan 31, 2024

Nothing in Bugsnag. The report format doesn't matter, by the way.

@mkllnk
Member

mkllnk commented Jan 31, 2024

I also didn't need the big date range. The default of one month still triggered the problem.

Firefox used to have a bug related to the error message here, and I tried the suggested workaround of setting the Firefox option network.http.http2.websockets to false, but it didn't help.

@mkllnk
Member

mkllnk commented Jan 31, 2024

I haven't gotten a single report rendered in the browser yet. The connection is slow but the server is not busy. It's only 5am in France now. UK staging works without any problem. AU production works without problem. There's something weird about the websocket connection.

@mkllnk
Member

mkllnk commented Jan 31, 2024

fr_prod is the only server with increased puma workers and threads. I tried setting it back to our default but that didn't change anything.

@mkllnk
Member

mkllnk commented Jan 31, 2024

I was wondering what's different about fr_prod. It runs Ubuntu 20 which is otherwise only used by India and New Zealand. On nz_prod I found intermittent problems as well.

When I load the page, there's one cable connection initiated straight away. Another cable connection request is sent several seconds later. If I try to render the report before the second request, nothing happens. But if I do it after the second request, the loading indicator comes up and then the report.

I still need to find out how to debug websockets properly.

@mkllnk
Member

mkllnk commented Jan 31, 2024

Turns out that Firefox doesn't have the messages tab in my dev tools. 🤷 Chromium has it though. I got the loading message and plenty of ping messages but when the report finished, it didn't get sent to the browser even though the scoped channel was connected properly.

@mkllnk
Member

mkllnk commented Jan 31, 2024

There's a weird thing: the browser is subscribing to the channels twice, and in a different order.

uk-staging (success): [screenshot of the channel subscription messages]

fr-prod (fail): [screenshot of the channel subscription messages]

@mkllnk
Member

mkllnk commented Jan 31, 2024

I have to stop here for now. Ideas for investigation:

  • Why does fr-prod initiate the session channel straight away and the scoped channel has to wait until the page is loaded while uk-staging does it the other way round?
  • Could the asset compilation make a difference?
  • Are we loading some JS twice?
  • Could a firewall make a difference?

@dacook
Member

dacook commented Feb 13, 2024

@rioug found this in the logs:
WebSocket error occurred: Broken pipe
See https://discuss.rubyonrails.org/t/getting-a-broken-pipe-error-with-websockets/80841, which points toward an issue with our Redis config, also mentioned in the stackoverflow link above. I reckon we should start with that.

Also my investigations of the logs on au_staging for the bulk products screen:
StimulusReflex issues (WebSocket broken pipe)

@sigmundpetersen sigmundpetersen moved this from All the things to Prioritized in OFN Delivery board Feb 16, 2024
@dacook dacook changed the title [Background reports] Infinite loader [Background reports][BUU] Infinite loader Feb 20, 2024
@rioug
Collaborator

rioug commented Feb 20, 2024

This has an example of the Redis config that could potentially fix the issue: https://stackoverflow.com/questions/76255963/rails-actioncable-sporadically-failing-to-subscribe

@rioug rioug self-assigned this Feb 20, 2024
@rioug rioug moved this from Prioritised to In Dev 💪 in OFN Delivery board Feb 20, 2024
@rioug
Collaborator

rioug commented Feb 21, 2024

I had a look at the solution offered in the link above. I tried increasing the number of Redis connections as explained, but it broke:

New unhandled error in staging from OpenFoodNetwork Aus in gems/redis-client-0.20.0/lib/redis_client/config.rb:21 (details)
Unhandled error
ArgumentError: unknown keyword: :pool_size
Location
gems/redis-client-0.20.0/lib/redis_client/config.rb:21 - initialize

I had a look at the various gems we use. redis-client does support a connection pool via the connection_pool gem, see https://github.com/redis-rb/redis-client?tab=readme-ov-file#usage. But the 'redis' gem doesn't seem to be using it. Instead they recommend the following: https://github.com/redis/redis-rb?tab=readme-ov-file#connection-pooling-and-thread-safety
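For reference, a minimal sketch of the pattern that README describes, assuming the connection_pool gem is available (pool size and timeout are illustrative values):

require "connection_pool"
require "redis"

# Wrap plain Redis clients in a pool so each thread checks out its own
# connection instead of sharing one client across all threads.
REDIS_POOL = ConnectionPool.new(size: 5, timeout: 5) { Redis.new }

REDIS_POOL.with do |conn|
  conn.set("example_key", "example_value")
end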

I tried to use the redis-client connection pool directly via redis but it broke. So I ended up trying this patch, adding a ConnectionPool::Wrapper when initialising the redis client

# @client = initialize_client(@options)
@client = ConnectionPool::Wrapper.new(size: 50, timeout: 3) { initialize_client(@options) }

https://github.com/redis/redis-rb/blob/7cc45e5e3f33ece7e235434de5fbd24c9b9d3180/lib/redis.rb#L73
It didn't break! But it doesn't seem to fix our problem. I feel like I am chasing the wrong issue.

Further idea to try:

@rioug rioug moved this from In Dev 💪 to Dev ready 👋 in OFN Delivery board Feb 26, 2024
@mkllnk
Member

mkllnk commented Feb 28, 2024

I've read in some threads that connecting to the same channel multiple times in one page causes issues. So one approach we could try would be to create more channels for different uses. Not as DRY but maybe a bit clearer in the communication.
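A minimal sketch of what a dedicated channel could look like (ReportsChannel is a hypothetical name, not an existing class in the codebase):

# Hypothetical dedicated channel for report notifications, instead of
# piggybacking on a shared session/StimulusReflex channel.
class ReportsChannel < ApplicationCable::Channel
  def subscribed
    # Scope the stream to the current user so only their report results
    # arrive on this subscription.
    stream_for current_user
  end
end

# The report job could then broadcast with:
#   ReportsChannel.broadcast_to(user, report_payload)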

@dacook dacook self-assigned this Mar 6, 2024
@dacook dacook moved this from Dev ready 👋 to In Dev 💪 in OFN Delivery board Mar 6, 2024
@dacook dacook moved this from Dev ready 👋 to Test Ready 🧪 in OFN Delivery board Mar 26, 2024
@dacook
Member

dacook commented Mar 26, 2024

The related PR has been deployed for this, so I will test shortly.

@dacook
Member

dacook commented Mar 27, 2024

I've tested on au_prod and fr_prod, and unfortunately couldn't see any existing problems, or conclusive improvements for the bulk loading of records. (This might be partly because I forgot to test before deploying 🤦).
However, it's good to see the packing report screen loads much quicker now, and I've found a potential lead for improving the BUU products screen loading!

To fix:

  • [BUU] Products screen: Rails is often downloading the image for each product.

@dacook dacook moved this from Test Ready 🧪 to Dev ready 👋 in OFN Delivery board Mar 27, 2024
@dacook
Member

dacook commented Mar 27, 2024

Hokay, replicated on fr_prod (v4.4.36-modified, with latest Redis fix) this morning while I was looking for something else.
I think this is consistent with what we were seeing before, but I feel like I have a new perspective.

  1. Page loads, opens up a websocket and spams the server trying to subscribe to both SessionChannel and StimulusReflex::Channel for 11 seconds until 09:37:55
  2. After a few seconds break, a new websocket is opened. SR subscribes successfully, and requests reflex Products#fetch. A second later, the server stops sending back a ping...
  3. After a 10 second break, a third websocket is opened. It achieves subscription, but knows nothing about what requests were sent already, so does nothing.

The "Broken pipe" error is occurring regularly (several times in the last hour, I can see shoppers are currently using the site). Interestingly it wasn't logged at the same time as this example. The next one was over 5 mins later.. maybe the request took that long.. (but I wasn't able to interpret the logs to prove it)
E, [2024-03-27 22:45:34 #2192118] ERROR -- : WebSocket error occurred: Broken pipe

There was a spike in CPU usage, and a minor spike in Redis usage at that time, quite likely due to the Products#fetch. As I mentioned, there are other shoppers using the system at the same time.

So, while the server was busy, it stopped sending a ping; is that normal? And after 10 seconds, the client decided to re-establish the connection. That's probably because the WebSocket and ActionCable layers are not aware of the application, which was waiting for a response. In fact, I don't think even StimulusReflex has a concept of waiting for a response.
If that's the case, maybe we should reconsider what we use it for.

@mkllnk
Member

mkllnk commented Mar 27, 2024

while the server was busy, it stopped sending a ping, is that normal?

I don't know. It's very disappointing. The server has two workers (processes) with three threads each. It should be able to serve 6 requests in parallel. We have 8 CPUs, some of which may also be busy with the database or Redis, but I would have expected the server to process a simple ping within a second. Can we extend the timeout?

@mariocarabotta mariocarabotta moved this from Dev ready 👋 to In Dev 💪 in OFN Delivery board Apr 3, 2024
@mariocarabotta mariocarabotta added bug-s3 The bug is stopping a critical or non-critical feature but there is a usable workaround. and removed bug-s1 The bug is stopping the platform from working, and there is no workaround. Impact of lot of users. labels Apr 3, 2024
@mariocarabotta mariocarabotta moved this from In Dev 💪 to Dev ready 👋 in OFN Delivery board Apr 3, 2024
@mariocarabotta
Collaborator

From now onwards this issue is going to be related to Background Reports only.
@dacook will create a separate issue for BUU

@mariocarabotta mariocarabotta changed the title [Background reports][BUU] Infinite loader [Background reports] Infinite loader Apr 3, 2024
@rioug rioug removed their assignment Apr 3, 2024
@mkllnk
Member

mkllnk commented Apr 5, 2024

Summary of identified problems

Large on-screen payloads don't reach the browser

We found that generating a large report to show on the screen results in the infinite loader. The HTML can be several megabytes in size and is sent from the report job but then disappears somewhere. We don't know what the limit is and which layer drops it (cable_ready, ActionCable or Redis?). But a 4MB report is dropped. Requesting the same report in a download format sends only the link to the report to the browser and that works.

The easiest workaround would be to check the size of the HTML and only link to the blob, like with downloadable reports. The quickest solution is to just display a link, as with the other reports. Next, we could render the HTML in an iframe, or load the content with JS, e.g. element.innerHTML = await (await fetch(url)).text() (not tested).
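A hypothetical sketch of that size check; the names (deliver_report, broadcast_html, broadcast_link) and the threshold are illustrative, not the actual job API:

# Illustrative only: fall back to a download link when the rendered HTML
# is too large to push through ActionCable reliably.
MAX_INLINE_REPORT_SIZE = 1_000_000 # bytes; the real limit is unknown (a 4MB report was dropped)

def deliver_report(report_html, blob_url)
  if report_html.bytesize > MAX_INLINE_REPORT_SIZE
    broadcast_link(blob_url)    # same path as the downloadable formats
  else
    broadcast_html(report_html) # small enough to render on screen directly
  end
end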

A busy server results in unstable web sockets

When the server is very busy, for example compiling a big report, it stops sending pings to the browser. The client has a timeout of 10 seconds after which the connection is closed and it tries to open a new one. During this time, messages are lost and don't reach the browser. ActionCable is a simple broadcasting service and discards messages when nobody is listening. The client timeout can be configured:

ActionCable.ConnectionMonitor.staleThreshold = 10;

We could also increase the number of threads handling web sockets:

config.action_cable.worker_pool_size = 4

We also need to increase the number of database connections for this one.
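A minimal sketch of those two changes together; the value and file location are illustrative, assuming a standard Rails setup:

# config/environments/production.rb (illustrative)
Rails.application.configure do
  # Extra threads handling ActionCable work. Each worker thread can check out
  # a database connection, so the Active Record pool in config/database.yml
  # needs to grow by roughly the same amount.
  config.action_cable.worker_pool_size = 4
end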

@dacook
Member

dacook commented Apr 9, 2024

Great summary, thanks Maikel. Just one comment:

The client timeout can be configured:

True, but that may only put off the problem until later, when it might occur less predictably and be harder to track. Also worth noting that's an unsupported config, as noted on that post:

Please note that this approach is generally a really bad idea: messing with internal variables is one of the direct ways to bugs and issues.

@mkllnk
Member

mkllnk commented May 16, 2024

On the problem of report size on screen:

  • Previously I found that a 4MB report never reached the browser when testing in production.
  • Locally, I need to go to 30MB to make it fail.
  • In specs, it fails earlier but some of it may be timeouts during HTML rendering. A request spec using websockets would be good but I don't know how to do that.

So finding the limitation and solving it may be tricky. The limitation may be in the browser, or in ActionCable or in cable_ready. Instead, it would be much easier to offer a download link.

@mkllnk mkllnk self-assigned this May 23, 2024
@mkllnk mkllnk moved this from Dev ready 👋 to In Progress ⚙ in OFN Delivery board May 23, 2024
@RachL
Contributor Author

RachL commented Jun 6, 2024

copying a comment from Bethan on slack:

3-4 complaints each month that reports haven't 'loaded' when they've actually been emailed - this inevitably results in lots of emails though as they try and regenerate the report multiple times (one customer yesterday had 7 'report ready' emails all in the same minute!). It might be helpful to have a 'your report is being generated and will be emailed to you shortly' message in the generated report section when the email process is initiated, to minimise users clicking 'go' lots...

@mkllnk
Member

mkllnk commented Jun 7, 2024

Cool, I have a fix for one part of the problem. I'll create a PR next week to support displaying big reports on the screen. Then we just have the unreliable websocket connection to think about...

@amidaOFN
Collaborator

Hi team, just commenting that our user reporting this is still having the same or a worse issue with it: they have had to try 40 times on their last attempt to get reports.
