Socket still open post onDisconnect #10796

Open
laurensbl opened this issue Dec 23, 2024 · 2 comments
Labels
bug Something isn't working unconfirmed

Comments

@laurensbl

laurensbl commented Dec 23, 2024

The CODE server is periodically unresponsive, but only for a single request at a time, and the Apache reverse proxy logs a 502. CODE logs "Socket still open post onDisconnect(), forced shutdown." I discovered the issue through frequent errors in the Nextcloud log.

The CODE server runs in a Debian 12 (Bookworm) container on Proxmox 8, behind an Apache frontend that terminates SSL. The server is running fine; manually connecting to https://office.redacteddomain.tld/hosting/capabilities from a desktop gives:
{"convert-to":{"available":false},"hasDocumentSigningSupport":true,"hasMobileSupport":true,"hasProxyPrefix":false,"hasTemplateSaveAs":false,"hasTemplateSource":true,"hasWASMSupport":false,"hasZoteroSupport":true,"productName":"Collabora Online Development Edition","productVersion":"24.04.10.2","productVersionHash":"a4b67a7664","serverId":"a3b65e3d"}

The CODE server is used with Nextcloud 30.0.4 (richdocuments app), which runs in an Ubuntu 24.04 container on Proxmox 8. Everything works as expected: editing documents via the richdocuments app functions without any problems, and https://office.redacteddomain.tld/hosting/capabilities is accessible from the Nextcloud container.

The only problem is the periodic error in the logs (2-8 times a day, according to the Nextcloud logs):

Nextcloud Ubuntu container
Nextcloud log

[richdocuments] Error: Failed to fetch capabilities: Server error: `GET https://office.redacteddomain.tld/hosting/capabilities` resulted in a `502 Proxy Error` response:
<!DOCTYPE html><html lang="en"><head><title>502 Bad Gateway</title></head><body><h1>Bad Gateway</h1></body></html>
from ? by -- at 23 dec 2024 16:48:18

CODE server Debian container
Apache access log:

192.168.1.10 - - [23/Dec/2024:16:48:18 +0100] "GET /hosting/capabilities HTTP/1.1" 502 2937 "-" "Nextcloud Server Crawler" office.redacteddomain.tld In:- Out:-:-pct.
192.168.1.10 - - [23/Dec/2024:16:48:18 +0100] "GET /hosting/discovery HTTP/1.1" 200 37606 "-" "Nextcloud Server Crawler" office.redacteddomain.tld In:- Out:-:-pct.

Apache error log:

[Mon Dec 23 16:48:18.292052 2024] [proxy_http:error] [pid 177:tid 200] (20014)Internal error (specific information not available): [client 192.168.1.10:53888] AH01102: error reading status line from remote server 127.0.0.1:9980
[Mon Dec 23 16:48:18.292095 2024] [proxy:error] [pid 177:tid 200] [client 192.168.1.10:53888] AH00898: Error reading from remote server returned by /hosting/capabilities

CODE log:

Dec 23 16:48:18 office coolwsd[149]: wsd-00149-00383 2024-12-23 16:48:18.291481 +0100 [ websrv_poll ] WRN  #28: Socket still open post onDisconnect(), forced shutdown.| net/Socket.cpp:1488
Dec 23 16:48:18 office coolwsd[149]: wsd-00149-00383 2024-12-23 16:48:18.291471 +0100 [ websrv_poll ] WRN  #28: CheckRemoval: Timeout: {Inactive true, Termination false}, Stats[dur[total 3600301ms, last 3600301 ms], kBps[in 8e-05, ou>
Dec 23 16:48:18 office coolwsd[149]: wsd-00149-00383 2024-12-23 16:48:18.291430 +0100 [ websrv_poll ] WRN  #27: Socket still open post onDisconnect(), forced shutdown.| net/Socket.cpp:1488
Dec 23 16:48:18 office coolwsd[149]: wsd-00149-00383 2024-12-23 16:48:18.291394 +0100 [ websrv_poll ] WRN  #27: CheckRemoval: Timeout: {Inactive true, Termination false}, Stats[dur[total 3600381ms, last 3600381 ms], kBps[in 8e-05, ou>
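The two `CheckRemoval` lines report how long each socket had existed when it was removed. The following is a minimal sketch, assuming only the log format visible above, that extracts the `total` duration and converts it to minutes. The function name and the trimmed sample lines are illustrative, not part of coolwsd.

```python
import re

# Matches the "dur[total 3600301ms" fragment of a CheckRemoval log line.
TOTAL_MS = re.compile(r"dur\[total (\d+)\s*ms")

def socket_lifetimes_minutes(log_text):
    """Return the total lifetime, in minutes, of each socket found in log_text."""
    return [int(ms) / 60_000 for ms in TOTAL_MS.findall(log_text)]

# Trimmed copies of the two CheckRemoval lines from the CODE log above.
sample = """\
WRN  #28: CheckRemoval: Timeout: {Inactive true, Termination false}, Stats[dur[total 3600301ms, last 3600301 ms]
WRN  #27: CheckRemoval: Timeout: {Inactive true, Termination false}, Stats[dur[total 3600381ms, last 3600381 ms]
"""

for minutes in socket_lifetimes_minutes(sample):
    print(f"{minutes:.2f} min")
```

Both sockets had been open and inactive for just over 3600 seconds, which is consistent with a one-hour inactivity timeout. Whether that timeout is related to the 502 is not established by these logs.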

If you need more info, please ask.

@Ashod
Contributor

Ashod commented Jan 6, 2025

Hi @laurensbl,

Thanks for reporting this.

By unresponsive I assume you mean loading a document doesn't work. I also expect that wget localhost:9980/hosting/capabilities (or /hosting/discovery) would fail.

Can you please confirm?

What does prlimit -n return for the user running coolwsd? I expect the soft limit isn't the default 1024, which would be too restrictive. Best to increase it to something more generous.

If you still experience this, can you please set the log level to trace, run wget as above, and extract the relevant part of the logs? (I expect an EMFILE error or the equivalent.)

Finally, when you experience this issue, are you able to run netstat -tulpn (with sudo) and capture the sockets open by COOL?

Thank you.

@laurensbl
Author

> By unresponsive I assume you mean loading a document doesn't work. I also expect that wget localhost:9980/hosting/capabilities (or /hosting/discovery) would fail.
>
> Can you please confirm?

No. Loading docs works fine. Also, wget https://office.redacteddomain.tld/hosting/capabilities works fine, both from the host running Nextcloud and from a test station in the same network. I cannot reproduce the error manually; I only see the occasional error in the Nextcloud log, with corresponding errors in the Apache log and the coolwsd log on the host running CODE, as described in the opening post above.

> What does prlimit -n return for the user running coolwsd? I expect the soft limit isn't the default 1024, which would be too restrictive. Best to increase it to something more generous.

# sudo -u cool prlimit -n
RESOURCE DESCRIPTION              SOFT   HARD UNITS
NOFILE   max number of open files 1024 524288 files

I just changed the soft limit to 10000. I will have a look at the Nextcloud log in a few days to see whether the error has disappeared.

# sudo -u cool prlimit -n
RESOURCE DESCRIPTION               SOFT   HARD UNITS
NOFILE   max number of open files 10000 524288 files
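One caveat: `prlimit -n` without `--pid` reports the limits of the process that `sudo` just started, which are not necessarily those of the already-running `coolwsd` daemon (a service manager can set them separately, for example). On Linux the effective limits of a running process can be read from `/proc/<pid>/limits`. Below is a minimal sketch of that check; the function names are hypothetical and the PID has to be supplied by the caller (149 in the logs above).

```python
def parse_nofile_limit(limits_text):
    """Return (soft, hard) for 'Max open files' from /proc/<pid>/limits text.

    Values are returned as strings because either may be 'unlimited'.
    """
    label = "Max open files"
    for line in limits_text.splitlines():
        if line.startswith(label):
            soft, hard = line[len(label):].split()[:2]
            return soft, hard
    raise ValueError("no 'Max open files' line found")

def read_nofile_limit(pid):
    """Read the effective open-files limit of a running process (Linux only)."""
    with open(f"/proc/{pid}/limits") as f:
        return parse_nofile_limit(f.read())
```

For example, `read_nofile_limit(149)` run as root on the CODE host would show whether the new soft limit actually applies to the daemon or only to newly started shells.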

> If you still experience this, can you please set the log level to trace, run wget as above, and extract the relevant part of the logs? (I expect an EMFILE error or the equivalent.)

As the error does not occur with wget, I have to set the log level to TRACE and wait for it to happen. The error usually occurs every 4-6 hours or so, so that will produce a lot of TRACE output. I will do this later, as I don't have time now. To be continued.
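Since a single manual request does not trigger the error, one option is to poll the endpoint at a fixed interval and record only the failures, so that the timestamps can be matched against the TRACE log. The following is a minimal sketch using only the Python standard library. The function names, the 60-second interval, and the `opener` parameter (present only to make the function testable) are assumptions for illustration.

```python
import time
import urllib.error
import urllib.request

def probe(url, timeout=10.0, opener=urllib.request.urlopen):
    """Return the HTTP status code for url, or None if no response arrived."""
    try:
        with opener(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code   # an error status such as 502 from the reverse proxy
    except (urllib.error.URLError, OSError):
        return None       # connection refused, timeout, DNS failure, ...

def watch(url, interval=60.0, rounds=None, sleep=time.sleep):
    """Probe url every `interval` seconds and print each non-200 result."""
    n = 0
    while rounds is None or n < rounds:
        status = probe(url)
        if status != 200:
            print(time.strftime("%Y-%m-%d %H:%M:%S"), "status:", status)
        n += 1
        sleep(interval)
```

Calling `watch("https://office.redacteddomain.tld/hosting/capabilities")` from the Nextcloud host exercises the path through Apache. Running it on the CODE host against port 9980 on 127.0.0.1 (the backend address in the Apache error log; plain HTTP is an assumption here) would bypass the proxy, which is what the original `wget localhost:9980/...` suggestion tests.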

> Finally, when you experience this issue, are you able to run netstat -tulpn (with sudo) and capture the sockets open by COOL?

# ss -ltnup
Netid          State           Recv-Q          Send-Q                     Local Address:Port                      Peer Address:Port          Process
tcp            LISTEN          0               64                             127.0.0.1:9980                           0.0.0.0:*              users:(("coolwsd",pid=149,fd=14))
tcp            LISTEN          0               100                            127.0.0.1:25                             0.0.0.0:*              users:(("master",pid=367,fd=13))
tcp            LISTEN          0               511                                    *:443                                  *:*              users:(("apache2",pid=6829,fd=4),("apache2",pid=6828,fd=4),("apache2",pid=6825,fd=4))
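Note that the `-l` flag limits `ss` to listening sockets, so the output above does not show established or half-closed connections between Apache and coolwsd; `ss -tanp` would include them. As an alternative, here is a minimal sketch that counts IPv4 TCP sockets on a given local port by state, using the text of `/proc/net/tcp`. The function name is hypothetical, and only a subset of the kernel state codes is mapped to names.

```python
from collections import Counter

# Subset of the TCP state codes used in /proc/net/tcp (hexadecimal).
TCP_STATES = {
    "01": "ESTABLISHED",
    "06": "TIME_WAIT",
    "08": "CLOSE_WAIT",
    "0A": "LISTEN",
}

def count_states(proc_net_tcp_text, port):
    """Count sockets by state whose local port equals `port`."""
    counts = Counter()
    for line in proc_net_tcp_text.splitlines()[1:]:  # skip the header row
        fields = line.split()
        if len(fields) < 4:
            continue
        # fields[1] is the local address as HEXIP:HEXPORT, fields[3] the state.
        local_port = int(fields[1].rsplit(":", 1)[1], 16)
        if local_port == port:
            counts[TCP_STATES.get(fields[3], fields[3])] += 1
    return counts
```

On the CODE host, `count_states(open("/proc/net/tcp").read(), 9980)` taken at the moment of a 502 would show how many proxy connections to coolwsd exist and in which states. IPv6 sockets are listed separately in `/proc/net/tcp6`.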
