-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle Reader thread termination gracefully #476
Conversation
Signed-off-by: Khushboo Rajput <[email protected]>
Codecov Report
@@ Coverage Diff @@
## main #476 +/- ##
============================================
- Coverage 74.78% 74.44% -0.34%
+ Complexity 2668 2664 -4
============================================
Files 316 317 +1
Lines 16243 16301 +58
Branches 1272 1277 +5
============================================
- Hits 12147 12136 -11
- Misses 3581 3651 +70
+ Partials 515 514 -1
|
src/main/java/org/opensearch/performanceanalyzer/PerformanceAnalyzerApp.java
Outdated
Show resolved
Hide resolved
src/main/java/org/opensearch/performanceanalyzer/PerformanceAnalyzerApp.java
Outdated
Show resolved
Hide resolved
StatsCollector.instance().logException(StatExceptionCode.READER_ERROR_RCA_AGENT_STOPPED); | ||
|
||
// Terminate Java Runtime, executes {@link #shutDownGracefully(ClientServers clientServers)} | ||
System.exit(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case users want(or try) to enable PA/RCA back via CLI dynamically, I guess it wouldn't work?
How do we handle such scenarios?
Does it make sense if we just terminate the reader thread? As killing rca process, disabling PA plugin(via 9200) all seems too intrusive.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree, killing only the PA thread while keeping the process idle is wasting the system resource. The RCA process as we know depends on reader subcomponent for providing the data, which then flows through analysis graph and made available to nodes/users via the grpc and web server resp.
In case users want(or try) to enable PA/RCA back via CLI dynamically, I guess it wouldn't work?
How do we handle such scenarios?
PA can be enabled using the REST API call, while starting RCA will require bringing up the RCA Agent via the performance-analyzer-agent tool. This updated behavior will be added to the documentation.
Let me know your thoughts.
@@ -51,6 +53,7 @@ public class PerformanceAnalyzerApp { | |||
|
|||
private static final Logger LOG = LogManager.getLogger(PerformanceAnalyzerApp.class); | |||
|
|||
public static final int READER_RESTART_MAX_ATTEMPTS = 3; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should probably do some test runs in the docker env to understand what could be the correct value for READER_RESTART_MAX_ATTEMPTS
. We can also consider exponential backoff here in case the env set up takes time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will increase the number of attempts here to 12(1min).
Exponential Backoff makes sense for API calls but here, a thread is crashing while attempting to read data from disk - unlikely, the backoff retry will help here.
src/main/java/org/opensearch/performanceanalyzer/PerformanceAnalyzerApp.java
Outdated
Show resolved
Hide resolved
Signed-off-by: Khushboo Rajput <[email protected]>
Signed-off-by: Khushboo Rajput <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM! Thanks @khushbr for making the changes!
…nd PROCESS_STATE_FATAL handling Signed-off-by: Khushboo Rajput <[email protected]>
The backport to
To backport manually, run these commands in your terminal: # Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-476-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 08ce04e21f20628d40383c355d73f2865d7b216c
# Push it to GitHub
git push --set-upstream origin backport/backport-476-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x Then, create a pull request where the |
Is your feature request related to a problem? Please provide an existing Issue # , or describe.
#468
How can one reproduce the bug?
Update
docker/docker-compose.cluster.yml
to:Describe the solution you are proposing
RCA Agent process creates 4 thread as part of process bootstrap. The Reader thread is most critical as it is responsibly for reading, processing and cleaning up upstream metrics generated by Performance Analyzer(PA) plugin. In case of Reader thread crash, the metrics will start accumulating and filling up the disk, while the other components (RCA Graph, Webserver and GRPC Server) perform no work as the metrics are the leaf nodes to all the processing.
The initial design, thus, was to keep creating new reader thread in an infinite loop. This, however, isn't efficient in case of non-retryable/permanent failures. This code changes adds:
Describe alternatives you've considered
The alternative to not terminating the RCA process is to keep the other threads and the process running, while the Reader thread has crashed. In the absence of no metrics flowing through analysis graph, the Controller thread and Servers (web and grpc) do no work.
Testing
1. PerformanceAnalyzer.log with failure simulation
2. supervisord.log after RCA process termination
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.