[Background reports] Infinite loader #11752
Comments
@mkllnk FYI I've reproduced it. Could it be linked to the number of reports running in parallel at the same time?
Great, how?
Are you running multiple reports in the same tab? I would need to look into that. Hm, no, you can't be running another report because you are waiting for the first one in that tab. But maybe you are running another one in another tab. It should work though. Let me know how I can reproduce it and then I may find the culprit.
@mkllnk Same scenario as in the issue, with only one tab open. However, the scenario passes when I try during quiet hours (in the evening, for example). So maybe it will always work when you try in your timezone? I guess we can't fix this one at the moment and need more occurrences before we can do something.
Interesting, I just tried to load that report and I didn't even see the loading screen. The console gave me a hint though:
The site was generally very slow.
Nothing in Bugsnag. The report format doesn't matter, by the way.
I also didn't need the big date range. The default of one month still triggered the problem. Firefox used to have a bug related to the error message here and I tried the suggested workaround by changing the Firefox option to false but it didn't help:
I haven't gotten a single report rendered in the browser yet. The connection is slow but the server is not busy. It's only 5am in France now. UK staging works without any problem. AU production works without any problem. There's something weird about the websocket connection.
fr_prod is the only server with increased puma workers and threads. I tried setting it back to our default but that didn't change anything.
I was wondering what's different about fr_prod. It runs Ubuntu 20, which is otherwise only used by India and New Zealand. On nz_prod I found intermittent problems as well. When I load the page, one cable connection is initiated straight away. Another cable connection request is sent several seconds later. If I try to render the report before the second request, nothing happens. But if I do it after the second request, the loading indicator comes up and then the report. I still need to find out how to debug websockets properly.
Turns out that Firefox doesn't have the messages tab in my dev tools. 🤷 Chromium has it though. I got the loading message and plenty of ping messages but when the report finished, it didn't get sent to the browser even though the scoped channel was connected properly.
I have to stop here for now. Ideas for investigation:
@rioug found this: Also my investigations of the logs on au_staging for the bulk products screen:
This has an example of the Redis config to potentially fix the issue: https://stackoverflow.com/questions/76255963/rails-actioncable-sporadically-failing-to-subscribe
I had a look at the solution offered in the link above. I tried increasing the number of Redis connections as explained but it broke:
I had a look at the various gems we use. I tried to use the
# @client = initialize_client(@options)
@client = ConnectionPool::Wrapper.new(size: 50, timeout: 3) { initialize_client(@options) }
https://github.com/redis/redis-rb/blob/7cc45e5e3f33ece7e235434de5fbd24c9b9d3180/lib/redis.rb#L73
Further idea to try:
I've read in some threads that connecting to the same channel multiple times in one page causes issues. So one approach we could try would be to create more channels for different uses. Not as DRY but maybe a bit clearer in the communication.
The related PR has been deployed; I will test shortly.
I've tested on au_prod and fr_prod, and unfortunately couldn't see any existing problems, or conclusive improvements for the bulk loading of records. (This might be partly because I forgot to test before deploying 🤦). To fix:
I don't know. It's very disappointing. The server has two workers (processes) with three threads each. It should be able to serve 6 requests in parallel. We have 8 CPUs, some of which may also be busy with the database or Redis, but I would have expected the server to process a simple ping within a second. Can we extend the timeout?
From now on, this issue is going to be about Background Reports only.
Summary of identified problems
Large on-screen payloads don't reach the browser
We found that generating a large report to show on the screen results in the infinite loader. The HTML can be several megabytes in size and is sent from the report job but then disappears somewhere. We don't know what the limit is and which layer drops it (cable_ready, ActionCable or Redis?). But a 4MB report is dropped. Requesting the same report in a download format sends only the link to the report to the browser and that works.
The easiest workaround would be to check for the size of the HTML and only link to the blob like with downloadable reports. Quickest solution is to just display a link like with the other reports. Next, we could render HTML with an iframe or load the content with JS as in
A busy server results in unstable web sockets
When the server is very busy, for example compiling a big report, it stops sending pings to the browser. The client has a timeout of 10 seconds after which the connection is closed and it tries to open a new one. During this time, messages are lost and don't reach the browser. ActionCable is a simple broadcasting service and discards messages when nobody is listening.
The client timeout can be configured:
ActionCable.ConnectionMonitor.staleThreshold = 10;
We could also increase the number of threads handling web sockets:
config.action_cable.worker_pool_size = 4
We also need to increase the number of database connections for this one.
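To illustrate the "discards messages when nobody is listening" behaviour described in the summary, here is a toy model in plain Ruby. It is only a sketch and not ActionCable code: TinyBroadcaster and its methods are hypothetical names made up for this example, and it assumes delivery is purely fire-and-forget with no queueing or replay.

```ruby
# Toy model of fire-and-forget broadcasting. This is NOT ActionCable code;
# TinyBroadcaster and its methods are hypothetical names for illustration.
class TinyBroadcaster
  def initialize
    @subscribers = Hash.new { |hash, key| hash[key] = [] }
  end

  # Register a subscriber (any object responding to #call) on a channel.
  def subscribe(channel, subscriber)
    @subscribers[channel] << subscriber
  end

  # Drop a subscriber, e.g. when the client closes a stale connection.
  def unsubscribe(channel, subscriber)
    @subscribers[channel].delete(subscriber)
  end

  # Deliver to whoever is listening right now and return how many received it.
  # Nothing is queued for later, so zero listeners means the message is gone.
  def broadcast(channel, message)
    listeners = @subscribers[channel].dup
    listeners.each { |subscriber| subscriber.call(message) }
    listeners.size
  end
end

received = []
browser = ->(message) { received << message }
broadcaster = TinyBroadcaster.new

broadcaster.subscribe("reports", browser)
broadcaster.broadcast("reports", "loading")              # delivered

broadcaster.unsubscribe("reports", browser)              # client timed out
lost = broadcaster.broadcast("reports", "report ready")  # nobody listening

broadcaster.subscribe("reports", browser)                # client reconnected
broadcaster.broadcast("reports", "ping")                 # delivered

puts received.inspect
puts "listeners for the report message: #{lost}"
```

In this model the browser sees "loading" and later "ping" but never "report ready", which would match the infinite loader if the real stack behaves the same way during a reconnect.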
Great summary, thanks Maikel. Just one comment: True, but that may only put off the problem until later, when it might occur less predictably and be harder to track. Also worth noting that's an unsupported config, as noted on that post:
On the problem of report size on screen:
So finding the limitation and solving it may be tricky. The limitation may be in the browser, or in ActionCable or in cable_ready. Instead, it would be much easier to offer a download link.
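The download-link fallback could look roughly like the sketch below. This is a hedged illustration, not code from this project: report_payload, MAX_INLINE_BYTES and the blob URLs are hypothetical, and the 1 MB threshold is a guess, since the thread only establishes that a 4MB report gets dropped.

```ruby
# Hypothetical helper for illustration; not code from this project.
# The threshold is an assumption: the real limit is unknown.
MAX_INLINE_BYTES = 1_000_000

# Returns the payload to broadcast: the HTML itself when it is small enough,
# otherwise only a link to the stored report, as downloadable formats do.
# bytesize is used instead of length because multi-byte characters count
# towards the size on the wire.
def report_payload(html, blob_url, max_bytes: MAX_INLINE_BYTES)
  if html.bytesize <= max_bytes
    { type: :inline, body: html }
  else
    { type: :link, url: blob_url }
  end
end

small = report_payload("<table>...</table>", "/hypothetical/blob/1")
large = report_payload("x" * 4_000_000, "/hypothetical/blob/2")

puts small[:type]
puts large[:type]
```

Keeping the decision in one small function would make the threshold easy to tune once the actual limit is known.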
copying a comment from Bethan on slack:
Cool, I have a fix for one part of the problem. I'll create a PR next week to support displaying big reports on the screen. Then we just have the unreliable websockets connection to think about...
Hi team, just commenting that the user who reported this is still having the same or a worse issue: on their last attempt they had to try 40 times to get reports.
Description
A test scenario which ends up with an infinite loader for the user.
Steps to Reproduce
Severity
S3 as we get the email in the end
Your Environment