
[Background reports] Infinite loader #11752

Closed
RachL opened this issue Nov 2, 2023 · 51 comments · Fixed by #12782
Labels
bug-s3 The bug is stopping a critical or non-critical feature but there is a usable workaround. prod-test

Comments

@RachL
Contributor

RachL commented Nov 2, 2023

Description

A test scenario which ends up with an infinite loader for the user.

Steps to Reproduce

  1. Go to https://coopcircuits.fr/admin/reports/packing/customer as hub 922 or super admin with hub 922 as filter
  2. Change the start date to January 1st 2023 (keep the end date as is)
  3. Ask for xlsx file
  4. See that you get an email after 3 minutes with the file, while the report page still shows the loader

Severity

S3 as we get the email in the end

Your Environment

  • Version used: v4.4.17
  • Browser name and version: Firefox and Chrome
  • Operating System and version (desktop or mobile): desktop
@RachL RachL added the bug-s3 The bug is stopping a critical or non-critical feature but there is a usable workaround. label Nov 2, 2023
@github-project-automation github-project-automation bot moved this to All the things in OFN Delivery board Nov 2, 2023
@mkllnk

This comment was marked as outdated.

@RachL

This comment was marked as outdated.

@mkllnk

This comment was marked as outdated.

@RachL
Contributor Author

RachL commented Jan 25, 2024

@mkllnk FYI I've reproduced it. Can it be linked to the number of reports run in parallel/at the same time?

@mkllnk
Member

mkllnk commented Jan 30, 2024

I've reproduced it.

Great, how?

Can it be linked to the number of reports run in parallel/at the same time?

Are you running multiple reports in the same tab? I would need to look into that. Hm, no, you can't be running another report because you are waiting for the first one in that tab. But maybe you are running another one in another tab. It should work though. Let me know how I can reproduce it and then I may find the culprit.

@RachL
Contributor Author

RachL commented Jan 30, 2024

@mkllnk same scenario as in the issue. Only one tab opened. However, this scenario passes if I try during low-traffic hours (evening, for example). So maybe when you try in your timezone it will always work?

I guess maybe we can't fix this one atm and we need more occurrences before we can do something.

@mkllnk
Member

mkllnk commented Jan 31, 2024

Interesting, I just tried to load that report and I didn't even see the loading screen. The console gave me a hint though:

The connection to wss://coopcircuits.fr/cable was interrupted while the page was loading.

The site was generally very slow.

@mkllnk
Member

mkllnk commented Jan 31, 2024

Nothing in Bugsnag. The report format doesn't matter, by the way.

@mkllnk
Member

mkllnk commented Jan 31, 2024

I also didn't need the big date range. The default of one month still triggered the problem.

Firefox used to have a bug related to the error message here, and I tried the suggested workaround of setting the Firefox option network.http.http2.websockets to false, but it didn't help.

@mkllnk
Member

mkllnk commented Jan 31, 2024

I haven't gotten a single report rendered in the browser yet. The connection is slow but the server is not busy. It's only 5am in France now. UK staging works without any problem. AU production works without problem. There's something weird about the websocket connection.

@mkllnk
Member

mkllnk commented Jan 31, 2024

fr_prod is the only server with increased puma workers and threads. I tried setting it back to our default but that didn't change anything.

@mkllnk
Member

mkllnk commented Jan 31, 2024

I was wondering what's different about fr_prod. It runs Ubuntu 20 which is otherwise only used by India and New Zealand. On nz_prod I found intermittent problems as well.

When I load the page, there's one cable connection initiated straight away. Another cable connection request is sent several seconds later. If I try to render the report before the second request, nothing happens. But if I do it after the second request, the loading indicator comes up and then the report.

I still need to find out how to debug websockets properly.

@mkllnk
Member

mkllnk commented Jan 31, 2024

Turns out that Firefox doesn't have the messages tab in my dev tools. 🤷 Chromium has it though. I got the loading message and plenty of ping messages but when the report finished, it didn't get sent to the browser even though the scoped channel was connected properly.

@mkllnk
Member

mkllnk commented Jan 31, 2024

There's a weird thing: the browser is subscribing to the channels twice, and in a different order.

uk-staging (success): [screenshot of the channel subscription messages]

fr-prod (fail): [screenshot of the channel subscription messages]

@mkllnk
Member

mkllnk commented Jan 31, 2024

I have to stop here for now. Ideas for investigation:

  • Why does fr-prod initiate the session channel straight away and the scoped channel has to wait until the page is loaded while uk-staging does it the other way round?
  • Could the asset compilation make a difference?
  • Are we loading some JS twice?
  • Could a firewall make a difference?

@dacook
Member

dacook commented Feb 13, 2024

@rioug found this in the logs:
WebSocket error occurred: Broken pipe
See https://discuss.rubyonrails.org/t/getting-a-broken-pipe-error-with-websockets/80841, which points toward an issue with our Redis config, also mentioned in the stackoverflow link above. I reckon we should start with that.

Also my investigations of the logs on au_staging for the bulk products screen:
StimulusReflex issues (WebSocket broken pipe)

@sigmundpetersen sigmundpetersen moved this from All the things to Prioritized in OFN Delivery board Feb 16, 2024
@dacook dacook changed the title [Background reports] Infinite loader [Background reports][BUU] Infinite loader Feb 20, 2024
@rioug
Collaborator

rioug commented Feb 20, 2024

This has an example of the Redis config that could potentially fix the issue: https://stackoverflow.com/questions/76255963/rails-actioncable-sporadically-failing-to-subscribe

@rioug rioug self-assigned this Feb 20, 2024
@rioug rioug moved this from Prioritised to In Dev 💪 in OFN Delivery board Feb 20, 2024
@rioug
Collaborator

rioug commented Feb 21, 2024

I had a look at the solution offered in the link above. I tried increasing the number of Redis connections as explained, but it broke:

New unhandled error in staging from OpenFoodNetwork Aus in gems/redis-client-0.20.0/lib/redis_client/config.rb:21 (details)
Unhandled error
ArgumentError: unknown keyword: :pool_size
Location
gems/redis-client-0.20.0/lib/redis_client/config.rb:21 - initialize

I had a look at the various gems we use. redis-client does support a connection pool via the connection_pool gem, see https://github.com/redis-rb/redis-client?tab=readme-ov-file#usage. But the 'redis' gem doesn't seem to be using it. Instead they recommend the following: https://github.com/redis/redis-rb?tab=readme-ov-file#connection-pooling-and-thread-safety
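For reference, a minimal sketch of the pattern that README describes, assuming the connection_pool gem is available (pool size and timeout are illustrative values):

require "connection_pool"
require "redis"

# Wrap plain Redis clients in a pool so each thread checks out its own
# connection instead of sharing one client across all threads.
REDIS_POOL = ConnectionPool.new(size: 5, timeout: 5) { Redis.new }

REDIS_POOL.with do |conn|
  conn.set("example_key", "example_value")
end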

I tried to use the redis-client connection pool directly via redis but it broke. So I ended up trying this patch, adding a ConnectionPool::Wrapper when initialising the redis client

# @client = initialize_client(@options)
@client = ConnectionPool::Wrapper.new(size: 50, timeout: 3) { initialize_client(@options) }

https://github.com/redis/redis-rb/blob/7cc45e5e3f33ece7e235434de5fbd24c9b9d3180/lib/redis.rb#L73
It didn't break! But it doesn't seem to fix our problem. I feel like I am chasing the wrong issue.

Further idea to try:

@rioug rioug moved this from In Dev 💪 to Dev ready 👋 in OFN Delivery board Feb 26, 2024
@mkllnk
Member

mkllnk commented Feb 28, 2024

I've read in some threads that connecting to the same channel multiple times in one page causes issues. So one approach we could try would be to create more channels for different uses. Not as DRY but maybe a bit clearer in the communication.
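A minimal sketch of what a dedicated channel could look like (ReportsChannel is a hypothetical name, not an existing class in the codebase):

# Hypothetical dedicated channel for report notifications, instead of
# piggybacking on a shared session/StimulusReflex channel.
class ReportsChannel < ApplicationCable::Channel
  def subscribed
    # Scope the stream to the current user so only their report results
    # arrive on this subscription.
    stream_for current_user
  end
end

# The report job could then broadcast with:
#   ReportsChannel.broadcast_to(user, report_payload)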

@dacook dacook self-assigned this Mar 6, 2024
@dacook dacook moved this from Dev ready 👋 to In Dev 💪 in OFN Delivery board Mar 6, 2024
@dacook dacook moved this from Dev ready 👋 to Test Ready 🧪 in OFN Delivery board Mar 26, 2024
@dacook
Member

dacook commented Mar 26, 2024

The related PR has been deployed for this, so I will test shortly.

@dacook
Member

dacook commented Mar 27, 2024

I've tested on au_prod and fr_prod, and unfortunately couldn't see any existing problems, or conclusive improvements for the bulk loading of records. (This might be partly because I forgot to test before deploying 🤦).
However, it's good to see the packing report screen loads much quicker now, and I've found a potential lead for improving the BUU products screen loading!

To fix:

  • [BUU] Products screen: Rails is often downloading the image for each product.

@dacook dacook moved this from Test Ready 🧪 to Dev ready 👋 in OFN Delivery board Mar 27, 2024
@dacook
Member

dacook commented Mar 27, 2024

Hokay, replicated on fr_prod (v4.4.36-modified, with latest Redis fix) this morning while I was looking for something else.
I think this is consistent with what we were seeing before, but I feel like I have a new perspective.

  1. Page loads, opens up a websocket and spams the server trying to subscribe to both SessionChannel and StimulusReflex::Channel for 11 seconds until 09:37:55
  2. After a few seconds break, a new websocket is opened. SR subscribes successfully, and requests reflex Products#fetch. A second later, the server stops sending back a ping...
  3. After a 10 second break, a third websocket is opened. It achieves subscription, but knows nothing about what requests were sent already, so does nothing.

The "Broken pipe" error is occurring regularly (several times in the last hour, I can see shoppers are currently using the site). Interestingly it wasn't logged at the same time as this example. The next one was over 5 mins later.. maybe the request took that long.. (but I wasn't able to interpret the logs to prove it)
E, [2024-03-27 22:45:34 #2192118] ERROR -- : WebSocket error occurred: Broken pipe

There was a spike in CPU usage, and a minor spike in Redis usage at that time, quite likely due to the Products#fetch. As I mentioned, there are other shoppers using the system at the same time.

So, while the server was busy, it stopped sending a ping; is that normal? And after 10 seconds, the client decided to re-establish the connection. That's probably because the WebSocket and ActionCable layers are not aware of the application, which was waiting for a response. In fact, I don't think even StimulusReflex has a concept of waiting for a response.
If that's the case, maybe we should reconsider what we use it for.

@mkllnk
Member

mkllnk commented Mar 27, 2024

while the server was busy, it stopped sending a ping, is that normal?

I don't know. It's very disappointing. The server has two workers (processes) with three threads each. It should be able to serve 6 requests in parallel. We have 8 CPUs, some of which may also be busy with the database or Redis, but I would have expected the server to process a simple ping within a second. Can we extend the timeout?

@mariocarabotta mariocarabotta moved this from Dev ready 👋 to In Dev 💪 in OFN Delivery board Apr 3, 2024
@mariocarabotta mariocarabotta added bug-s3 The bug is stopping a critical or non-critical feature but there is a usable workaround. and removed bug-s1 The bug is stopping the platform from working, and there is no workaround. Impact of lot of users. labels Apr 3, 2024
@mariocarabotta mariocarabotta moved this from In Dev 💪 to Dev ready 👋 in OFN Delivery board Apr 3, 2024
@mariocarabotta
Collaborator

From now onwards this issue is going to be related to Background Reports only.
@dacook will create a separate issue for BUU

@mariocarabotta mariocarabotta changed the title [Background reports][BUU] Infinite loader [Background reports] Infinite loader Apr 3, 2024
@rioug rioug removed their assignment Apr 3, 2024
@mkllnk
Member

mkllnk commented Apr 5, 2024

Summary of identified problems

Large on-screen payloads don't reach the browser

We found that generating a large report to show on the screen results in the infinite loader. The HTML can be several megabytes in size and is sent from the report job but then disappears somewhere. We don't know what the limit is and which layer drops it (cable_ready, ActionCable or Redis?). But a 4MB report is dropped. Requesting the same report in a download format sends only the link to the report to the browser and that works.

The easiest workaround would be to check the size of the HTML and only link to the blob, like with downloadable reports. The quickest solution is to just display a link, as with the other reports. Next, we could render the HTML in an iframe, or load the content with JS, e.g. element.innerHTML = await (await fetch(url)).text() (not tested).
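A hypothetical sketch of that size check; the names (deliver_report, broadcast_html, broadcast_link) and the threshold are illustrative, not the actual job API:

# Illustrative only: fall back to a download link when the rendered HTML
# is too large to push through ActionCable reliably.
MAX_INLINE_REPORT_SIZE = 1_000_000 # bytes; the real limit is unknown (a 4MB report was dropped)

def deliver_report(report_html, blob_url)
  if report_html.bytesize > MAX_INLINE_REPORT_SIZE
    broadcast_link(blob_url)    # same path as the downloadable formats
  else
    broadcast_html(report_html) # small enough to render on screen directly
  end
end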

A busy server results in unstable web sockets

When the server is very busy, for example compiling a big report, it stops sending pings to the browser. The client has a timeout of 10 seconds after which the connection is closed and it tries to open a new one. During this time, messages are lost and don't reach the browser. ActionCable is a simple broadcasting service and discards messages when nobody is listening. The client timeout can be configured:

ActionCable.ConnectionMonitor.staleThreshold = 10;

We could also increase the number of threads handling web sockets:

config.action_cable.worker_pool_size = 4

We also need to increase the number of database connections for this one.
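A minimal sketch of those two changes together; the value and file location are illustrative, assuming a standard Rails setup:

# config/environments/production.rb (illustrative)
Rails.application.configure do
  # Extra threads handling ActionCable work. Each worker thread can check out
  # a database connection, so the Active Record pool in config/database.yml
  # needs to grow by roughly the same amount.
  config.action_cable.worker_pool_size = 4
end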

@dacook
Member

dacook commented Apr 9, 2024

Great summary, thanks Maikel. Just one comment:

The client timeout can be configured:

True, but that may only put off the problem until later, when it might occur less predictably and be harder to track. Also worth noting that's an unsupported config, as noted on that post:

Please note that this approach is generally a really bad idea: messing with internal variables is one of the direct ways to bugs and issues.

@mkllnk
Member

mkllnk commented May 16, 2024

On the problem of report size on screen:

  • Previously I found that a 4MB report never reached the browser when testing in production.
  • Locally, I need to go to 30MB to make it fail.
  • In specs, it fails earlier but some of it may be timeouts during HTML rendering. A request spec using websockets would be good but I don't know how to do that.

So finding the limitation and solving it may be tricky. The limitation may be in the browser, or in ActionCable or in cable_ready. Instead, it would be much easier to offer a download link.

@mkllnk mkllnk self-assigned this May 23, 2024
@mkllnk mkllnk moved this from Dev ready 👋 to In Progress ⚙ in OFN Delivery board May 23, 2024
@RachL
Contributor Author

RachL commented Jun 6, 2024

copying a comment from Bethan on slack:

3-4 complaints each month that reports haven't 'loaded' when they've actually been emailed - this inevitably results in lots of emails though as they try and regenerate the report multiple times (one customer yesterday had 7 'report ready' emails all in the same minute!). It might be helpful to have a 'your report is being generated and will be emailed to you shortly' message in the generated report section when the email process is initiated, to minimise users clicking 'go' lots...

@mkllnk
Member

mkllnk commented Jun 7, 2024

Cool, I have a fix for one part of the problem. I'll create a PR next week to support displaying big reports on the screen. Then we just have the unreliable websocket connection to think about...

@amidaOFN
Collaborator

Hi team, just commenting that our user reporting this is still having the same or a worse issue with it: they have had to try 40 times on their last attempt to get reports.
