[ECO-5033] fix: race condition when calling AblyRealtime#connect() on terminated state #1042

Merged: ttypic merged 1 commit into main from ECO-5033/fix-rc-on-connect on Oct 17, 2024

Conversation

ttypic
Contributor

@ttypic ttypic commented Oct 14, 2024

Resolves #1041

Summary by CodeRabbit

  • New Features

    • Enhanced error handling during connection startup to improve reliability.
    • Improved thread management to ensure clean termination of connection handler threads.
    • Refined state transition logic for better handling of connection states, particularly during reconnection attempts.
  • Bug Fixes

    • Addressed potential concurrency issues related to handler thread management.
  • Logging Enhancements

    • Added more detailed logging for state transitions and error handling for improved visibility.

coderabbitai bot commented Oct 14, 2024

Walkthrough

The changes in the ConnectionManager class focus on enhancing error handling, thread management, and state transition logic. Key modifications include the introduction of a try-catch block in the connect() method to manage InterruptedException, a while loop in the startup() method to ensure proper thread termination before starting a new handler thread, and refined logic in the onTimeout() method for better state transition handling. Additional logging has been implemented to improve visibility into the connection management process.

Changes

File path: lib/src/main/java/io/ably/lib/transport/ConnectionManager.java
Change summary:
  • Enhanced error handling in connect() method with try-catch for InterruptedException.
  • Improved thread management in startup() to prevent multiple handler threads.
  • Refined state transition logic in onTimeout() for Disconnected and Suspended states.
  • Added logging for better visibility during state transitions and error handling.
  • Updated method signatures to include throws InterruptedException.

Assessment against linked issues

Objective Addressed Explanation
Prevent race condition when calling AblyRealtime#connect() on terminated state (#1041)

🐰 In the land of code where rabbits play,
A ConnectionManager found a better way.
With threads in line and errors caught,
It hops along, no longer fraught.
Logs now sing of states and more,
A happy dance on the coding floor! 🐇✨


📜 Recent review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 29c1395 and a70c827.

📒 Files selected for processing (1)
  • lib/src/main/java/io/ably/lib/transport/ConnectionManager.java (5 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • lib/src/main/java/io/ably/lib/transport/ConnectionManager.java
🧰 Additional context used
📓 Learnings (1)
📓 Common learnings
Learnt from: sacOO7
PR: ably/ably-java#1042
File: lib/src/main/java/io/ably/lib/transport/ConnectionManager.java:74-77
Timestamp: 2024-10-16T07:55:49.878Z
Learning: In the `ConnectionManager` class, prefer using `terminated` instead of `terminating` to represent the final state, as it is less confusing and more readable.

@github-actions github-actions bot temporarily deployed to staging/pull/1042/features October 14, 2024 11:12 Inactive
@github-actions github-actions bot temporarily deployed to staging/pull/1042/javadoc October 14, 2024 11:13 Inactive

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 5155516 and 375817f.

📒 Files selected for processing (1)
  • lib/src/main/java/io/ably/lib/transport/ConnectionManager.java (3 hunks)
🧰 Additional context used
🔇 Additional comments (2)
lib/src/main/java/io/ably/lib/transport/ConnectionManager.java (2)

698-698: Proper synchronization for thread termination

The use of ConnectionManager.this.notifyAll(); after setting handlerThread to null correctly notifies any threads waiting on the ConnectionManager instance. This ensures that threads blocked in the startup() method's wait() call are properly awakened when the handler thread terminates.


1347-1354: Proper synchronization and interruption handling in startup() method

The synchronized startup() method now correctly waits for the handlerThread to become null before starting a new ActionHandler thread. By using a while loop with wait(), you ensure that only one handler thread is active at any time, preventing potential concurrency issues.

Declaring throws InterruptedException allows the method to propagate the interruption to the caller, which aligns with best practices for handling thread interruptions.
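
The wait/notifyAll handshake described in these two comments can be sketched roughly as follows. This is a minimal illustration, not the actual ConnectionManager code: the class name HandlerLifecycle, the startedCount counter, and the Runnable parameter are all hypothetical and exist only to make the sketch self-contained.

```java
/**
 * Illustrative sketch only. HandlerLifecycle and its members are hypothetical
 * names and do not reproduce the real ConnectionManager.
 */
final class HandlerLifecycle {
    private Thread handlerThread; // guarded by "this"
    private int startedCount;     // guarded by "this"

    /** Blocks until any previous handler thread has cleared itself, then starts a new one. */
    synchronized void startup(Runnable work) throws InterruptedException {
        while (handlerThread != null) {
            // wait() releases the monitor while blocked; the loop rechecks the
            // condition after every wakeup, which also covers spurious wakeups.
            wait();
        }
        startedCount++;
        handlerThread = new Thread(() -> {
            try {
                work.run();
            } finally {
                synchronized (HandlerLifecycle.this) {
                    handlerThread = null;              // mark the handler as gone
                    HandlerLifecycle.this.notifyAll(); // wake callers blocked in startup()
                }
            }
        });
        handlerThread.start();
    }

    synchronized boolean isRunning() {
        return handlerThread != null;
    }

    synchronized int startedCount() {
        return startedCount;
    }
}
```

Because the reference is cleared and the waiters are notified under the same monitor that startup() holds, a second startup() call cannot observe a half-terminated handler.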

@ttypic ttypic force-pushed the ECO-5033/fix-rc-on-connect branch from 375817f to d1637a3 Compare October 14, 2024 11:35
@github-actions github-actions bot temporarily deployed to staging/pull/1042/features October 14, 2024 11:36 Inactive
@github-actions github-actions bot temporarily deployed to staging/pull/1042/javadoc October 14, 2024 11:37 Inactive
@ttypic ttypic requested a review from sacOO7 October 14, 2024 11:37

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (4)
lib/src/main/java/io/ably/lib/transport/ConnectionManager.java (4)

1348-1355: Approve changes in startup() with a minor suggestion

The addition of the while loop to wait for the termination of the handler thread is a good improvement. It addresses potential race conditions by ensuring proper thread cleanup before starting a new one, which aligns with the PR objective of fixing race conditions.

Consider adding a timeout to the wait operation to prevent potential deadlocks. Here's a suggested improvement:

 private synchronized void startup() throws InterruptedException {
+    long startTime = System.currentTimeMillis();
+    long timeout = 5000; // 5 seconds timeout
     while (handlerThread != null) {
         Log.v(TAG, "Waiting for termination action to clean up handler thread");
-        wait();
+        wait(timeout);
+        if (System.currentTimeMillis() - startTime > timeout) {
+            Log.w(TAG, "Timeout waiting for handler thread termination");
+            break;
+        }
     }

     (handlerThread = new Thread(new ActionHandler())).start();
     startConnectivityListener();
 }
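
One caveat with a bounded wait: calling wait(timeout) with the full timeout on every iteration can overshoot the intended limit when the thread is woken early. A common alternative is to compute a deadline once and wait only for the time remaining. The sketch below shows that pattern in isolation; TimedGate, awaitRelease, and release are hypothetical names and are not part of ConnectionManager.

```java
import java.util.concurrent.TimeUnit;

/** Illustrative sketch only; all names here are hypothetical. */
final class TimedGate {
    private boolean busy = true; // guarded by "this"

    /** Waits until release() is called or the timeout elapses. Returns true if released. */
    synchronized boolean awaitRelease(long timeoutMillis) throws InterruptedException {
        final long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMillis);
        while (busy) {
            long remainingNanos = deadline - System.nanoTime();
            if (remainingNanos <= 0) {
                return false; // timed out while still busy
            }
            // Round up to at least 1 ms: wait(0) would block with no timeout at all.
            long remainingMillis = (remainingNanos + 999_999) / 1_000_000;
            wait(remainingMillis);
        }
        return true;
    }

    /** Clears the busy flag and wakes every waiter. */
    synchronized void release() {
        busy = false;
        notifyAll();
    }
}
```

Returning a boolean lets the caller decide what a timeout means, instead of silently breaking out of the loop and starting a second handler thread.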

Line range hint 1099-1138: Approve changes in onAuthUpdatedAsync() with a minor suggestion

The use of a single-threaded executor to handle the state transition waiting is a good improvement. It prevents potential blocking of the main thread and aligns with the PR objective of improving thread handling in connection management.

Consider adding error handling for the executor's execution. Here's a suggested improvement:

 singleThreadExecutor.execute(() -> {
+    try {
         boolean waitingForConnected = true;
         while (waitingForConnected) {
             final ErrorInfo reason = waiter.waitForChange();
             final ConnectionState connectionState = currentState.state;
             switch (connectionState) {
                 case connected:
                     authUpdateResult.onUpdate(true, null);
                     Log.v(TAG, "onAuthUpdated: got connected");
                     waitingForConnected = false;
                     break;

                 case connecting:
                 case disconnected:
                     Log.v(TAG, "onAuthUpdated: " + connectionState);
                     break;

                 default:
                     /* suspended/closed/error: throw the error. */
                     Log.v(TAG, "onAuthUpdated: throwing exception");
                     authUpdateResult.onUpdate(false, reason);
                     waitingForConnected = false;
             }
         }
         waiter.close();
+    } catch (Exception e) {
+        Log.e(TAG, "Error in onAuthUpdatedAsync execution", e);
+        authUpdateResult.onUpdate(false, new ErrorInfo("Internal error in auth update", 50000));
+    }
 });
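
The general shape of this suggestion, catching failures inside the task so that the callback is always invoked, can be sketched independently of the Ably types. GuardedTask and its BiConsumer callback are hypothetical stand-ins for the executor task and authUpdateResult above, chosen only to keep the example self-contained.

```java
import java.util.concurrent.ExecutorService;
import java.util.function.BiConsumer;

/** Illustrative sketch only; GuardedTask is a hypothetical helper. */
final class GuardedTask {
    /**
     * Runs body on the executor and reports the outcome exactly once:
     * (true, null) on success, (false, message) if body threw.
     */
    static void submit(ExecutorService executor, Runnable body, BiConsumer<Boolean, String> callback) {
        executor.execute(() -> {
            String failure = null;
            try {
                body.run();
            } catch (RuntimeException e) {
                failure = String.valueOf(e.getMessage());
            }
            // The callback is invoked outside the try block, so an exception thrown
            // by the callback itself is not reported back to the same callback.
            callback.accept(failure == null, failure);
        });
    }
}
```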

Line range hint 1165-1186: Approve changes in onConnected() with a minor suggestion for consistency

The additional logging for connection resume scenarios and handling of failed resumes without errors are good improvements. They enhance visibility into the connection process and improve the robustness of the code.

For consistency, consider capitalizing the first word of each log message. Here's a suggested improvement:

 if(message.connectionId.equals(connection.id)) { // RTN15c6 - resume success
     if(message.error == null) {
-        Log.d(TAG, "connection has reconnected and resumed successfully");
+        Log.d(TAG, "Connection has reconnected and resumed successfully");
     } else {
-        Log.d(TAG, "connection resume success with non-fatal error: " + message.error.message);
+        Log.d(TAG, "Connection resume success with non-fatal error: " + message.error.message);
     }
     addPendingMessagesToQueuedMessages(false);
 } else { // RTN15c7, RTN16d - resume     failure
     if (message.error != null) {
-        Log.d(TAG, "connection resume failed with error: " + message.error.message);
+        Log.d(TAG, "Connection resume failed with error: " + message.error.message);
     } else { // This shouldn't happen but, putting it here for safety
-        Log.d(TAG, "connection resume failed without error" );
+        Log.d(TAG, "Connection resume failed without error" );
     }

     addPendingMessagesToQueuedMessages(true);
     channels.transferToChannelQueue(extractConnectionQueuePresenceMessages());
 }

Line range hint 1203-1222: Approve changes in addPendingMessagesToQueuedMessages() with a minor suggestion

The changes in this method improve the handling of message serials and ensure proper retrying of messages after a failed resume. This aligns well with the PR objective of improving connection resume handling.

Consider extracting the logic for resetting the message serial into a separate method for improved readability. Here's a suggested improvement:

 private void addPendingMessagesToQueuedMessages(boolean resetMessageSerial) {
     synchronized (this) {
         List<QueuedMessage> allPendingMessages = pendingMessages.popAll();

-        if (resetMessageSerial){  // failed resume, so all new published messages start with msgSerial = 0
-            msgSerial = 0; //msgSerial will increase in sendImpl when messages are sent, RTN15c7
-        } else if (!allPendingMessages.isEmpty()) { // pendingMessages needs to expect next msgSerial to be the earliest previously unacknowledged message
-            msgSerial = allPendingMessages.get(0).msg.msgSerial;
-        }
+        resetMessageSerialIfNeeded(resetMessageSerial, allPendingMessages);

         // Add messages from pending messages to front of queuedMessages in order to retry them
         queuedMessages.addAll(0, allPendingMessages);
     }
 }

+private void resetMessageSerialIfNeeded(boolean resetMessageSerial, List<QueuedMessage> allPendingMessages) {
+    if (resetMessageSerial) {
+        // Failed resume, so all new published messages start with msgSerial = 0
+        msgSerial = 0; // msgSerial will increase in sendImpl when messages are sent, RTN15c7
+    } else if (!allPendingMessages.isEmpty()) {
+        // pendingMessages needs to expect next msgSerial to be the earliest previously unacknowledged message
+        msgSerial = allPendingMessages.get(0).msg.msgSerial;
+    }
+}
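
To make the serial handling above easier to follow, here is a small standalone model of the same rule: reset to 0 after a failed resume, otherwise rewind to the earliest unacknowledged serial, then retry pending messages ahead of queued ones. SerialRequeue and Msg are hypothetical simplifications and do not mirror the real QueuedMessage or pendingMessages types.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative sketch only; SerialRequeue and Msg are hypothetical simplifications. */
final class SerialRequeue {
    static final class Msg {
        final String name;
        final long serial;

        Msg(String name, long serial) {
            this.name = name;
            this.serial = serial;
        }
    }

    long msgSerial;
    final List<Msg> pending = new ArrayList<>(); // sent but not yet acknowledged
    final List<Msg> queued = new ArrayList<>();  // waiting to be sent

    synchronized void requeuePending(boolean resetMessageSerial) {
        List<Msg> allPending = new ArrayList<>(pending);
        pending.clear();

        if (resetMessageSerial) {
            // Failed resume: newly published messages start again from serial 0 (RTN15c7).
            msgSerial = 0;
        } else if (!allPending.isEmpty()) {
            // Successful resume: continue from the earliest unacknowledged serial.
            msgSerial = allPending.get(0).serial;
        }

        // Pending messages go to the front so they are retried before newer ones.
        queued.addAll(0, allPending);
    }
}
```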
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 375817f and d1637a3.

📒 Files selected for processing (1)
  • lib/src/main/java/io/ably/lib/transport/ConnectionManager.java (3 hunks)
🧰 Additional context used

@ttypic ttypic force-pushed the ECO-5033/fix-rc-on-connect branch from d1637a3 to 1d28882 Compare October 14, 2024 11:47
@github-actions github-actions bot temporarily deployed to staging/pull/1042/features October 14, 2024 11:47 Inactive
@github-actions github-actions bot temporarily deployed to staging/pull/1042/javadoc October 14, 2024 11:48 Inactive

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)
lib/src/main/java/io/ably/lib/transport/ConnectionManager.java (3)

1348-1355: Improved connection resume handling and logging in onConnected() method

The changes in the onConnected() method significantly improve the handling of connection resumes. The additional logging provides better visibility into the connection process, which is crucial for debugging. The method now properly handles both successful and failed connection resumes, aligning with the specifications mentioned in the comments (RTN15c, RTN15c6, RTN15c7, RTN16d).

However, there's a minor improvement that could be made:

Consider using Log.w() instead of Log.d() for logging connection resume failures. This would make it easier to identify issues in production environments where debug logging might be disabled.

-    Log.d(TAG, "connection resume failed with error: " + message.error.message);
+    Log.w(TAG, "Connection resume failed with error: " + message.error.message);

1348-1355: New method for extracting presence messages

The addition of the extractConnectionQueuePresenceMessages() method is a good improvement. It safely extracts presence messages from the queued messages without risking concurrent modification exceptions. The use of an iterator is appropriate for this purpose.

The method is well-implemented and the comment about Android compatibility is helpful for future maintainers.

Consider whether a LinkedList would suit queuedPresenceMessages better than an ArrayList if many presence messages are expected:

-    final List<QueuedMessage> queuedPresenceMessages = new ArrayList<>();
+    final List<QueuedMessage> queuedPresenceMessages = new LinkedList<>();

LinkedList appends are O(1) and never trigger a resize, but ArrayList appends are amortized O(1) as well, with only an occasional array copy. Any gain is therefore likely to be small and is worth measuring before adopting the change.
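
For readers unfamiliar with the iterator-based extraction mentioned in this comment, the pattern looks roughly like the sketch below. It uses plain strings and a prefix check instead of the real QueuedMessage type; PresenceExtractor and extractWithPrefix are hypothetical names introduced for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

/** Illustrative sketch only; PresenceExtractor is a hypothetical helper. */
final class PresenceExtractor {
    /** Removes every item starting with prefix from queue and returns the removed items in order. */
    static List<String> extractWithPrefix(List<String> queue, String prefix) {
        List<String> extracted = new ArrayList<>();
        Iterator<String> it = queue.iterator();
        while (it.hasNext()) {
            String item = it.next();
            if (item.startsWith(prefix)) {
                extracted.add(item);
                // Removing through the iterator avoids ConcurrentModificationException,
                // which calling queue.remove(item) inside a for-each loop would risk.
                it.remove();
            }
        }
        return extracted;
    }
}
```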


1348-1355: Improved handling of pending messages in addPendingMessagesToQueuedMessages()

The changes in the addPendingMessagesToQueuedMessages() method significantly improve the handling of pending messages during connection state changes. The method now correctly handles the resetting of message serials based on whether a connection resume failed, aligning with the specifications mentioned (RTN19a, RTN19a1, RTN19a2).

The addition of pending messages to the front of the queued messages list ensures they are sent first, which is the correct behavior for maintaining message order.

Consider adding a brief comment explaining the significance of resetting msgSerial to 0 in the case of a failed resume. This would improve code readability:

 if (resetMessageSerial){  // failed resume, so all new published messages start with msgSerial = 0
+    // Reset msgSerial to 0 for a failed resume to ensure proper message ordering (RTN15c7)
     msgSerial = 0; //msgSerial will increase in sendImpl when messages are sent, RTN15c7
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between d1637a3 and 1d28882.

📒 Files selected for processing (1)
  • lib/src/main/java/io/ably/lib/transport/ConnectionManager.java (3 hunks)
🧰 Additional context used
🔇 Additional comments (3)
lib/src/main/java/io/ably/lib/transport/ConnectionManager.java (3)

794-800: Improved error handling in connect() method

The addition of the try-catch block for InterruptedException is a good improvement. It properly handles thread interruptions during connection startup by logging the error, restoring the interrupt status, and returning early. This change prevents further execution after an interruption, which is a good practice for maintaining thread safety and consistency.
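
The interruption handling described here follows a standard Java idiom: catch InterruptedException, restore the interrupt flag, and return early. A minimal sketch, assuming a hypothetical Startup callback in place of the real startup() method:

```java
/** Illustrative sketch only; InterruptAwareConnector and Startup are hypothetical names. */
final class InterruptAwareConnector {
    /** Stand-in for a startup step that may block and be interrupted. */
    interface Startup {
        void run() throws InterruptedException;
    }

    /** Returns true if startup completed, false if the thread was interrupted. */
    static boolean connect(Startup startup) {
        try {
            startup.run();
        } catch (InterruptedException e) {
            // Catching InterruptedException clears the interrupt flag, so set it
            // again to let callers further up the stack observe the interruption.
            Thread.currentThread().interrupt();
            return false; // skip the rest of the connect logic
        }
        return true;
    }
}
```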


1348-1355: Enhanced thread management in startup() method

The addition of the while loop to wait for the termination of the handler thread is a significant improvement. This change ensures that only one handler thread is active at any given time, preventing potential concurrency issues. The method now declares throws InterruptedException, so an interruption during the wait propagates up the call stack instead of being swallowed. Calling wait() inside the loop releases the monitor while blocked and rechecks the condition after every wakeup, which also guards against spurious wakeups.


Line range hint 1-2230: Overall assessment of changes in ConnectionManager.java

The modifications in this file significantly improve the robustness and reliability of the connection management system. Key improvements include:

  1. Enhanced error handling, particularly for thread interruptions.
  2. Improved thread management in the startup() method.
  3. Better handling of connection resumes, including proper logging.
  4. New method for safely extracting presence messages.
  5. Improved handling of pending messages during connection state changes.

These changes effectively address the race condition mentioned in the PR summary and align well with the specified objectives. The code is now more resilient to edge cases in connection handling, which should lead to a more stable connection management system.

Collaborator

@sacOO7 sacOO7 left a comment

Can we add tests around this covering the happy and failing paths?

@ttypic ttypic force-pushed the ECO-5033/fix-rc-on-connect branch from 1d28882 to 29c1395 Compare October 14, 2024 14:32
@github-actions github-actions bot temporarily deployed to staging/pull/1042/features October 14, 2024 14:33 Inactive
@github-actions github-actions bot temporarily deployed to staging/pull/1042/javadoc October 14, 2024 14:34 Inactive

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
lib/src/main/java/io/ably/lib/transport/ConnectionManager.java (1)

1355-1365: LGTM: Enhanced startup process with improved synchronization

The changes in the startup() method significantly improve the synchronization between startup and termination processes. The while loop ensures that startup doesn't proceed until any ongoing termination is complete, preventing potential race conditions. Throwing InterruptedException allows for better handling of thread interruptions.

One minor suggestion:

Consider using a more descriptive log message for the case when connect() is called twice. For example:

-            Log.v(TAG, "`connect()` has been called twice on uninitialized or terminal state");
+            Log.v(TAG, "`connect()` called on already initialized or non-terminal state. Ignoring redundant call.");

This provides more context about why the second call is being ignored.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 1d28882 and 29c1395.

📒 Files selected for processing (1)
  • lib/src/main/java/io/ably/lib/transport/ConnectionManager.java (5 hunks)
🧰 Additional context used
🔇 Additional comments (4)
lib/src/main/java/io/ably/lib/transport/ConnectionManager.java (4)

74-77: LGTM: New termination flag improves state management

The addition of the terminating flag is a good improvement. It will help in managing the connection state more effectively, especially during the shutdown process. This can prevent potential race conditions and improve the overall reliability of the connection management.


800-806: LGTM: Improved error handling in connect() method

The changes in the connect() method enhance error handling by properly catching and managing InterruptedException. The early return after logging the error and restoring the thread's interrupt status is a good practice. This improvement helps maintain the consistency of the connection state and aids in debugging.


704-705: LGTM: Improved cleanup process in ActionHandler

The changes in the ActionHandler class enhance the cleanup process when the handler thread is about to terminate. Setting terminating to false allows for a potential restart of the connection manager, while the notifyAll() call ensures that any waiting threads are properly notified of this state change. These improvements contribute to a more robust and reliable connection management system.


Line range hint 1-1839: Overall, these changes significantly improve connection management

The modifications made to the ConnectionManager class effectively address the race condition mentioned in the PR objectives. The introduction of the terminating flag, along with the enhancements in error handling and thread synchronization, contribute to a more robust and reliable connection management system.

Key improvements include:

  1. Better handling of connection termination states
  2. Enhanced error management in the connect() method
  3. Improved synchronization between startup and termination processes
  4. More efficient cleanup in the ActionHandler class

These changes align well with the goal of enhancing the stability and reliability of the connection process in the Ably Java library.

@ttypic
Contributor Author

ttypic commented Oct 14, 2024

@sacOO7 Unfortunately, we don’t have the test infrastructure for unit testing thread synchronization, and introducing it is beyond the scope of this PR. It’s also impossible to reliably reproduce this in e2e tests.

@ttypic ttypic requested a review from sacOO7 October 14, 2024 15:24
@ttypic ttypic requested a review from sacOO7 October 15, 2024 10:59
Collaborator

@sacOO7 sacOO7 left a comment

LGTM

Collaborator

@sacOO7 sacOO7 left a comment

LGTM

@ttypic ttypic merged commit 6a66eec into main Oct 17, 2024
12 checks passed
@ttypic ttypic deleted the ECO-5033/fix-rc-on-connect branch October 17, 2024 10:37