You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
An issue occurred in an Artio instance where a surge of Resend Requests caused
Artio to hang where it did not process any new messages or heartbeat. After
investigation, the only plausible explanation we found was that something caused
the Framer thread to get stuck. We have been unable to reproduce this behavior
locally and after our own investigation of the code were unable to come up with
a reasonable circumstance that could have led to the behavior observed.
Artio Version
0.160
Observed Behavior
Most of our investigation was performed on a pcap of external traffic going out
of the Artio instance.The issue was caused by a Session disconnecting,
reconnecting and sending a flurry of large Resend Requests. After two Resend
Requests were successfully completed, Artio stopped replying to all FIX traffic.
It stopped sending outbound business logic messages, heartbeating, responding to
Test Requests, Logons, Logouts, etc. There were no Library disconnects or any
logging besides the exception attached below with Artio set to -Dfix.core.debug=CLOSE,LIBRARY_MANAGEMENT.
Sequence of events according to the pcap:
Time
Fix Message Type
Sender
SeqNum
Notes
11:43:52.317299461
Logout (35=5)
Client A
124
Initial logout sent by client
11:43:52.317338433
Logout (35=5)
Artio
788890
Many messages go out to other sessions
1000s of messages are queued for Session A
11:53:50.896666308
Logon (35=A)
Client A
125
Client A logs back on
11:53:50.906870350
Logon (35=A)
Artio
831603
11:53:50.908934773
Resend Request (35=2)
Client A
126
BeginSeqNo: 786876, EndSeqNo: 789375
11:53:51.078517049 - 11:53:51.118294104
Multiple, mostly 35=8
Artio
786876-789375
First request is successfully replayed
11:53:51.118294104
Execution Report (35=8)
Artio
831604-831609
New messages
11:53:51.122397176
Resend Request (35=2)
Client A
127
BeginSeqNo: 789376, EndSeqNo: 0
11:53:51.122620027
Resend Request (35=2)
Client A
128
BeginSeqNo: 789376, EndSeqNo: 791875
11:53:51.122697839
Resend Request (35=2)
Client A
129
BeginSeqNo: 789376, EndSeqNo: 791875
11:53:51.122806842
Resend Request (35=2)
Client A
130
BeginSeqNo: 789376, EndSeqNo: 791875
11:53:51.122873949
Resend Request (35=2)
Client A
131
BeginSeqNo: 789376, EndSeqNo: 791875
11:53:51.122957440
Resend Request (35=2)
Client A
132
BeginSeqNo: 789376, EndSeqNo: 791875
11:53:51.123018806
Resend Request (35=2)
Client A
133
BeginSeqNo: 789376, EndSeqNo: 791875
11:53:51.324842891
Resend Request (35=2)
Client A
134
BeginSeqNo: 791876, EndSeqNo: 0
11:53:51.297865300 - 11:53:51.562065798
Multiple, mostly 35=8
Artio
789376-831609
Second request is successfully replayed, 831609 is last message sent by Artio, starting at seqnum=815340 many TCP Window Full messages pop up in Wireshark and continue for the rest of the replay
The following exception was logged but doesn’t seem like it should have caused
any issues. There was no other logging by Artio.
11:53:51 uk.co.real_logic.Artio.FixGatewayException: Ignore resend request for sessionId=42,connectionId=5096738714653978623 as 5 requests in flight
11:53:51 at uk.co.real_logic.Artio.engine.logger.Replayer.onResendRequest(Replayer.java:292)
11:53:51 at uk.co.real_logic.Artio.engine.logger.Replayer.onFragment(Replayer.java:208)
11:53:51 at io.aeron.Image.controlledPoll(Image.java:385)
11:53:51 at io.aeron.Subscription.controlledPoll(Subscription.java:235)
11:53:51 at uk.co.real_logic.Artio.engine.logger.Replayer.doWork(Replayer.java:475)
11:53:51 at org.agrona.concurrent.CompositeAgent.doWork(CompositeAgent.java:120)
11:53:51 at org.agrona.concurrent.AgentRunner.doWork(AgentRunner.java:304)
11:53:51 at org.agrona.concurrent.AgentRunner.workLoop(AgentRunner.java:296)
11:53:51 at org.agrona.concurrent.AgentRunner.run(AgentRunner.java:162)
11:53:51 at java.base/java.lang.Thread.run(Thread.java:1
Relevant Details
Shortly after messages stopped going out, framer started to backpressure the
library thread. The Framer's Term Buffer roughly match with the number of
bytes that should have been sent by the Framer starting out at 11:53:14.
The TCP socket was serviced for another 2 minutes before it was closed by the
client.
Prior to the second resend request, Artio handled messages to and from the
other sessions as normal. The second request was the last message Artio
processed.
Reproduction Attempts
We have attempted to reproduce this issue locally but have been unsuccessful.
Using a custom FIX client, we were able to replicate the traffic with analogous
sequence numbers and a scaled down number of messages. This reproduced the
exception shown above but did not replicate the spinning behavior. It also
reproduced similar TCP Window Full messages in Wireshark. Attempts at making the
Term Buffer sizes and TCP Window small did not make any difference. As a result,
we believe this issue is due to a complex interaction in Artio with multiple
Resend Requests.
The text was updated successfully, but these errors were encountered:
An issue occurred in an Artio instance where a surge of Resend Requests caused
Artio to hang where it did not process any new messages or heartbeat. After
investigation, the only plausible explanation we found was that something caused
the Framer thread to get stuck. We have been unable to reproduce this behavior
locally and after our own investigation of the code were unable to come up with
a reasonable circumstance that could have led to the behavior observed.
Artio Version
0.160
Observed Behavior
Most of our investigation was performed on a pcap of external traffic going out
of the Artio instance.The issue was caused by a Session disconnecting,
reconnecting and sending a flurry of large Resend Requests. After two Resend
Requests were successfully completed, Artio stopped replying to all FIX traffic.
It stopped sending outbound business logic messages, heartbeating, responding to
Test Requests, Logons, Logouts, etc. There were no Library disconnects or any
logging besides the exception attached below with Artio set to
-Dfix.core.debug=CLOSE,LIBRARY_MANAGEMENT
.Sequence of events according to the pcap:
Related Exception
The following exception was logged but doesn’t seem like it should have caused
any issues. There was no other logging by Artio.
Relevant Details
library thread. The Framer's Term Buffer roughly match with the number of
bytes that should have been sent by the Framer starting out at 11:53:14.
client.
other sessions as normal. The second request was the last message Artio
processed.
Reproduction Attempts
We have attempted to reproduce this issue locally but have been unsuccessful.
Using a custom FIX client, we were able to replicate the traffic with analogous
sequence numbers and a scaled down number of messages. This reproduced the
exception shown above but did not replicate the spinning behavior. It also
reproduced similar TCP Window Full messages in Wireshark. Attempts at making the
Term Buffer sizes and TCP Window small did not make any difference. As a result,
we believe this issue is due to a complex interaction in Artio with multiple
Resend Requests.
The text was updated successfully, but these errors were encountered: