Complement TestWriteMDirectAccountData is flakey #13334

Comments
The errors here all look like something is killing Synapse before the /sync request and the test complete. Maybe a test timeout?
Concentrating on https://github.com/matrix-org/synapse/runs/7184887268?check_suite_focus=true for a moment: the Go testing.T logs report a connection reset, and the Synapse entrypoint logs report a SIGTERM request.

It's not clear to me whether the SIGTERM was sent before or after the /sync connection was reset. That is: did complement send a SIGTERM to shut down the container after the test failed, or did the test fail because of the SIGTERM? Timestamps on the testing.T logs would help here.
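For instance, a minimal sketch of that kind of timestamping (the helper and its name are hypothetical, not complement's actual API): wrap t.Logf so each line carries a wall-clock timestamp that can be ordered against the container's entrypoint logs.

```go
package client

import (
	"testing"
	"time"
)

// tlogf is a hypothetical helper that prefixes each testing.T log line with a
// UTC timestamp, so test output can be compared against docker/nginx/Synapse logs.
func tlogf(t *testing.T, format string, args ...interface{}) {
	t.Helper()
	ts := time.Now().UTC().Format(time.RFC3339Nano)
	t.Logf("%s "+format, append([]interface{}{ts}, args...)...)
}
```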
I will buy anyone who fixes this a nice meal or beverage.
The errors referred to here are:
This is, indeed, very mysterious. I don't understand what would cause the connection to drop other than Synapse being killed, which shouldn't happen while complement is still calling /sync.
Extra logging to help diagnose matrix-org/synapse#13334
I am still mystified. It definitely looks like that /sync request is being dropped by Synapse before complement shuts down the container. I hope #13914 will shed more light.
Have nginx send its logs to stderr/out, so that we can debug #13334.
It feels significant that this is the first test that gets run by complement.
Yet another straw to grasp at for debugging matrix-org/synapse#13334
...it seems suspicious that we haven't seen this since turning on the debug mode.
It really does. It really feels like something weird is happening in the Go HTTP client stack.
https://github.com/matrix-org/complement/blob/21646b51e62c25196ed978b6eb76d7f1c6ff95ff/internal/client/client.go#L542 gets side-eye from me. The call to
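One way to get more visibility into that stack (a sketch using the standard net/http/httptrace package, not anything complement does today; the sync URL is illustrative): record whether the failing request was sent on a reused keep-alive connection.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httptrace"
)

func main() {
	// Illustrative request; in complement this would be the /sync call.
	req, err := http.NewRequest("GET", "http://localhost:8008/_matrix/client/v3/sync", nil)
	if err != nil {
		log.Fatal(err)
	}
	trace := &httptrace.ClientTrace{
		// GotConn fires once a connection is selected for the request. Reused
		// tells us whether it was an idle keep-alive connection, which is the
		// precondition for the suspected close/re-use race.
		GotConn: func(info httptrace.GotConnInfo) {
			log.Printf("got conn: reused=%v wasIdle=%v idleTime=%v",
				info.Reused, info.WasIdle, info.IdleTime)
		},
	}
	req = req.WithContext(httptrace.WithClientTrace(req.Context(), trace))
	if resp, err := http.DefaultClient.Do(req); err != nil {
		log.Printf("request failed: %v", err)
	} else {
		resp.Body.Close()
	}
}
```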
Well, we could. It's going to drive me crazy not knowing though.
Oh wait, that happens after the error. Bah.
I think I want to turn off Debug mode again and see if it still happens, and if so inspect the nginx logs for clues.
Done in #538.
We can leak additional connections when starting homeservers: as part of startup, we poll the homeserver until it responds. But when I try leaking ~1012 connections on my machine, it produces a different error once the per-process file descriptor limit is reached, so that explanation doesn't work.
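For reference, a sketch of that leak experiment (the address and port are illustrative): dial and deliberately hold open TCP connections until the per-process fd limit bites, then print the error the dialer returns.

```go
package main

import (
	"log"
	"net"
)

func main() {
	// Keep every connection referenced so its file descriptor stays open.
	var conns []net.Conn
	for i := 0; ; i++ {
		c, err := net.Dial("tcp", "127.0.0.1:49152") // illustrative listener
		if err != nil {
			// With a default ulimit of 1024, this fires after ~1000 dials.
			log.Fatalf("dial #%d failed: %v", i, err)
		}
		conns = append(conns, c)
	}
}
```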
It's strange that the
The nginx logs have only this:
... which makes it look like nginx never saw the request in question, putting this pretty squarely in the domain of being a complement bug rather than a synapse one. Is it possible that complement is attempting to re-use an existing HTTP connection, but that is racing with nginx deciding to close the connection?
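If that race is the culprit, it should vanish when connection re-use is disabled. A minimal sketch with standard net/http knobs (not complement's actual configuration):

```go
package main

import (
	"net/http"
	"time"
)

// newNoReuseClient builds a client that opens a fresh TCP connection for every
// request, ruling out the keep-alive close/re-use race at the cost of speed.
func newNoReuseClient() *http.Client {
	return &http.Client{
		Transport: &http.Transport{DisableKeepAlives: true},
		Timeout:   30 * time.Second,
	}
}
```

A cheaper alternative would be a short Transport.IdleConnTimeout, but that only narrows the race window rather than closing it.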
I'm having trouble figuring out how we'd see a "connection reset by peer" here. From stackoverflow, a reset generally means the remote end sent a TCP RST, for example because data arrived on a socket it had already closed. In this case, the remote end would be the docker-proxy.

If nginx were to close the connection early, docker-proxy would close the connection to complement too and we would usually see an end-of-stream*. I believe when this happens, there is retry logic in the Go HTTP client that transparently attempts another connection, which can then reset if docker-proxy fails to connect to nginx (this can happen if the container is shut down prematurely during testing).

*I'm unable to get docker-proxy to send a reset once the connection to the backend has been established.

Test setup

docker-proxy
nb: docker-proxy is hardcoded to expect fd 3 to exist, otherwise it fails to start.

backend which instantly closes the connection

```python
#!/usr/bin/env python3
import socket

s = socket.socket(socket.AF_INET, socket.SOCK_STREAM, 0)
s.bind(("127.0.0.1", 49153))
s.listen(1)
while True:
    # Accept each connection and close it immediately, without reading the request.
    c, ip = s.accept()
    c.close()
```

curl sending endless requests

```sh
# 1 KiB of random data for the POST body
head -c 1024 /dev/urandom > 1m
while true; do curl "http://localhost:49152" -X POST --data @1m -H "Expect:" 2>&1 | grep reset; done
```

When running the curl loop for 10 minutes locally, no resets are printed.
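For completeness, there is a way to make a backend reliably produce a reset (my own sketch, separate from the setup above): set SO_LINGER to zero so Close() sends an RST instead of a FIN.

```go
package main

import (
	"log"
	"net"
)

func main() {
	// Same illustrative port as the Python backend above.
	ln, err := net.Listen("tcp", "127.0.0.1:49153")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		tcp := conn.(*net.TCPConn)
		// SO_LINGER=0 makes Close() discard unsent data and send an RST,
		// which the peer sees as "connection reset by peer".
		_ = tcp.SetLinger(0)
		tcp.Close()
	}
}
```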
Would we expect a log line when nginx is shut down before an in-flight request completes? Or would nginx wait for the request to complete before stopping? |
I'm not sure, but what we do know is that complement has received its "reset by peer" before it shuts down the containers. So yes, it's possible that nginx received the request and still considers it in-progress, but in that case, where has the "reset by peer" come from?
Not seen for a while, closing.