Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Artio Silently Hangs After Several Large Resend Requests #544

Open
maxgawason opened this issue Feb 5, 2025 · 0 comments
Open

Artio Silently Hangs After Several Large Resend Requests #544

maxgawason opened this issue Feb 5, 2025 · 0 comments

Comments

@maxgawason
Copy link

An issue occurred in an Artio instance where a surge of Resend Requests caused
Artio to hang where it did not process any new messages or heartbeat. After
investigation, the only plausible explanation we found was that something caused
the Framer thread to get stuck. We have been unable to reproduce this behavior
locally and after our own investigation of the code were unable to come up with
a reasonable circumstance that could have led to the behavior observed.

Artio Version

0.160

Observed Behavior

Most of our investigation was performed on a pcap of external traffic going out
of the Artio instance.The issue was caused by a Session disconnecting,
reconnecting and sending a flurry of large Resend Requests. After two Resend
Requests were successfully completed, Artio stopped replying to all FIX traffic.
It stopped sending outbound business logic messages, heartbeating, responding to
Test Requests, Logons, Logouts, etc. There were no Library disconnects or any
logging besides the exception attached below with Artio set to
-Dfix.core.debug=CLOSE,LIBRARY_MANAGEMENT.

Sequence of events according to the pcap:

Time Fix Message Type Sender SeqNum Notes
11:43:52.317299461 Logout (35=5) Client A 124 Initial logout sent by client
11:43:52.317338433 Logout (35=5) Artio 788890
Many messages go out to other sessions
1000s of messages are queued for Session A
11:53:50.896666308 Logon (35=A) Client A 125 Client A logs back on
11:53:50.906870350 Logon (35=A) Artio 831603
11:53:50.908934773 Resend Request (35=2) Client A 126 BeginSeqNo: 786876, EndSeqNo: 789375
11:53:51.078517049 - 11:53:51.118294104 Multiple, mostly 35=8 Artio 786876-789375 First request is successfully replayed
11:53:51.118294104 Execution Report (35=8) Artio 831604-831609 New messages
11:53:51.122397176 Resend Request (35=2) Client A 127 BeginSeqNo: 789376, EndSeqNo: 0
11:53:51.122620027 Resend Request (35=2) Client A 128 BeginSeqNo: 789376, EndSeqNo: 791875
11:53:51.122697839 Resend Request (35=2) Client A 129 BeginSeqNo: 789376, EndSeqNo: 791875
11:53:51.122806842 Resend Request (35=2) Client A 130 BeginSeqNo: 789376, EndSeqNo: 791875
11:53:51.122873949 Resend Request (35=2) Client A 131 BeginSeqNo: 789376, EndSeqNo: 791875
11:53:51.122957440 Resend Request (35=2) Client A 132 BeginSeqNo: 789376, EndSeqNo: 791875
11:53:51.123018806 Resend Request (35=2) Client A 133 BeginSeqNo: 789376, EndSeqNo: 791875
11:53:51.324842891 Resend Request (35=2) Client A 134 BeginSeqNo: 791876, EndSeqNo: 0
11:53:51.297865300 - 11:53:51.562065798 Multiple, mostly 35=8 Artio 789376-831609 Second request is successfully replayed, 831609 is last message sent by Artio, starting at seqnum=815340 many TCP Window Full messages pop up in Wireshark and continue for the rest of the replay
11:53:51.640852448 - 11:53:51.643038915 Resend Request (35=2) Client A 135-150 BeginSeqNo: 829528, EndSeqNo: 0, 16x Identical Resend Request
11:53:51.648413074 Resend Request (35=2) Client A 151 BeginSeqNo: 830323, EndSeqNo: 0

Related Exception

The following exception was logged but doesn’t seem like it should have caused
any issues. There was no other logging by Artio.

11:53:51 uk.co.real_logic.Artio.FixGatewayException: Ignore resend request for sessionId=42,connectionId=5096738714653978623 as 5 requests in flight
11:53:51        at uk.co.real_logic.Artio.engine.logger.Replayer.onResendRequest(Replayer.java:292)
11:53:51        at uk.co.real_logic.Artio.engine.logger.Replayer.onFragment(Replayer.java:208)
11:53:51        at io.aeron.Image.controlledPoll(Image.java:385)
11:53:51        at io.aeron.Subscription.controlledPoll(Subscription.java:235)
11:53:51        at uk.co.real_logic.Artio.engine.logger.Replayer.doWork(Replayer.java:475)
11:53:51        at org.agrona.concurrent.CompositeAgent.doWork(CompositeAgent.java:120)
11:53:51        at org.agrona.concurrent.AgentRunner.doWork(AgentRunner.java:304)
11:53:51        at org.agrona.concurrent.AgentRunner.workLoop(AgentRunner.java:296)
11:53:51        at org.agrona.concurrent.AgentRunner.run(AgentRunner.java:162)
11:53:51        at java.base/java.lang.Thread.run(Thread.java:1

Relevant Details

  • Shortly after messages stopped going out, framer started to backpressure the
    library thread. The Framer's Term Buffer roughly match with the number of
    bytes that should have been sent by the Framer starting out at 11:53:14.
  • The TCP socket was serviced for another 2 minutes before it was closed by the
    client.
  • Prior to the second resend request, Artio handled messages to and from the
    other sessions as normal. The second request was the last message Artio
    processed.

Reproduction Attempts

We have attempted to reproduce this issue locally but have been unsuccessful.
Using a custom FIX client, we were able to replicate the traffic with analogous
sequence numbers and a scaled down number of messages. This reproduced the
exception shown above but did not replicate the spinning behavior. It also
reproduced similar TCP Window Full messages in Wireshark. Attempts at making the
Term Buffer sizes and TCP Window small did not make any difference. As a result,
we believe this issue is due to a complex interaction in Artio with multiple
Resend Requests.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

1 participant