-
Notifications
You must be signed in to change notification settings - Fork 57
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
SearchBackPressure Service Node/Cluster RCA #437
SearchBackPressure Service Node/Cluster RCA #437
Conversation
…zon.com) Signed-off-by: CoderJeffrey <[email protected]>
Signed-off-by: CoderJeffrey <[email protected]>
….com) Signed-off-by: CoderJeffrey <[email protected]>
…amazon.com) Signed-off-by: CoderJeffrey <[email protected]>
Signed-off-by: CoderJeffrey <[email protected]>
…@amazon.com) Signed-off-by: CoderJeffrey <[email protected]>
…y Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]>
…@amazon.com) Signed-off-by: CoderJeffrey <[email protected]>
… Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]>
… (signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]>
…[email protected]) Signed-off-by: CoderJeffrey <[email protected]>
…ed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]>
…f-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]>
…e level (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]>
…zon.com) Signed-off-by: CoderJeffrey <[email protected]>
…mazon.com) Signed-off-by: CoderJeffrey <[email protected]>
…@amazon.com) Signed-off-by: CoderJeffrey <[email protected]>
…: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]>
…rchBP (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]>
…[email protected]) Signed-off-by: CoderJeffrey <[email protected]>
…amazon.com) Signed-off-by: CoderJeffrey <[email protected]>
…@amazon.com) Signed-off-by: CoderJeffrey <[email protected]>
Signed-off-by: CoderJeffrey <[email protected]>
…Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]>
…[email protected]) Signed-off-by: CoderJeffrey <[email protected]>
…[email protected]) Signed-off-by: CoderJeffrey <[email protected]>
Signed-off-by: CoderJeffrey <[email protected]>
…iu [email protected]) Signed-off-by: CoderJeffrey <[email protected]>
build.gradle
Outdated
@@ -371,7 +371,7 @@ dependencies { | |||
implementation("org.mockito:mockito-core") { | |||
version { | |||
strictly "2.23.0" | |||
} | |||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: avoid trailing space
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
|
||
// cancellationCount due to heap is less than 30% of all task cancellations in task level | ||
public static final int DEFAULT_TASK_MIN_HEAP_CANCELLATION_THRESHOLD = 30; | ||
private Integer minTaskHeapCancellationPercentageThreshold; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will the values of these thresholds be modified during runtime? If not we can make them final
as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, these values like minTaskHeapCancellationPercentageThreshold will read values from config file in runtime thus we cannot mark them as final
minHeapDecreasePercentageThreshold = conf.readRcaConfig( CONFIG_NAME, RCA_CONF_KEY_CONSTANTS.MAX_HEAP_USAGE_DECREASE_FIELD, DEFAULT_MIN_HEAP_DECREASE_THRESHOLD, (s) -> s >= 0 && s <= 100, Integer.class);
} | ||
|
||
// name for the configuration field | ||
public static class RCA_CONF_KEY_CONSTANTS { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It might be better to use enum
instead of public static class
here to define a list of related constants.
In general, enums offer better readability, type safety, and if we need to iterating over those constants we can simply use the values()
method. This is more straightforward than manually iterating over constants in a public static final class.
you can do something like:
public enum RCA_CONF_KEY_CONSTANTS {
MAX_HEAP_USAGE_INCREASE_FIELD("max-heap-usage-increase"),
MAX_SHARD_HEAP_CANCELLATION_PERCENTAGE_FIELD("max-shard-heap-cancellation-percentage");
// ...
private final String value;
RCA_CONF_KEY_CONSTANTS(final String value) {
this.value = value;
}
@Override
public String toString() {
return value;
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you so much for the suggestion. I have changed the design from static class to Enum.
RcaConsts.RcaTagConstants.TAG_AGGREGATE_UPSTREAM, | ||
RcaConsts.RcaTagConstants.LOCUS_DATA_NODE); | ||
|
||
// To Do SearchBackPressure Decider |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: use TODO
instead of To Do
for get better IDE support
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Fixed
@@ -250,4 +250,61 @@ public double readMin() { | |||
return Double.NaN; | |||
} | |||
} | |||
|
|||
/** | |||
* Sliding window to check the max/min olg gen usage within a given time frame |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: olg -> old
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed
|
||
@Override | ||
public void next(SlidingWindowData e) { | ||
boolean pollFirstCondition; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This variable pollFirstCondition
is not used. We can remove it
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removed
} | ||
|
||
/* | ||
* operate() is used for local build |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
operate() is used for local build
is this related to this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the operate() line is a dead comment for that function, removed.
|
||
updateAllSlidingWindows(searchBackPressureRCAMetric, currentTimeMillis); | ||
|
||
LOG.info("SearchBackPressureRCA currentIterationNumber is {}", currentIterationNumber); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we can change some of the LOG.info
messages to LOG.debug
. LOG.info
should only contains essential information to store in the service log.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Change unnecessary log message level to debug
…[email protected]) Signed-off-by: CoderJeffrey <[email protected]>
Signed-off-by: CoderJeffrey <[email protected]>
Signed-off-by: CoderJeffrey <[email protected]>
Signed-off-by: CoderJeffrey <[email protected]>
Signed-off-by: CoderJeffrey <[email protected]>
Signed-off-by: CoderJeffrey <[email protected]>
Signed-off-by: CoderJeffrey <[email protected]>
…@amazon.com) Signed-off-by: CoderJeffrey <[email protected]>
* Remove log files and add DCO (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Remove extra files (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Remove styling difference (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Remove unnecessary file changes (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Add RCA_Decider (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Extract Heap Usage from SQlitedb (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Extract required searchbp metrics for deciders (signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Add SearchBackPressureRCA Metric (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Use SearchBackPressureRCAMetrics to aggregate metrics (signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Add the conf file extracted part for SearchBackPressureRcaConfig.java (signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Add MinMaxSlidingWindow in OldGenRca (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Rename SearchBackPressureClusterRCA and add it to AnalysisGraph (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Add basic UTs for SearchBackPressureRCA cluster/node level (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Add unhealthy/healthy stats UTs for SearchBackPressureRCA cluster/node level (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Add healthy resource unit UT (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Add UT s both shard/task level (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Add a new SearchBp Resource Unit (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Add UTs to test shard/task level resource include-ness (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Remove styling changes for Version.java (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Add metadata to resourceSummary (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Update to more general framework (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Refactor the MinMaxSlidingWindow and bug fix (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Refactor Heap Stats Metrics Getter(Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Refactor HeapUsed and HeapMax Getters (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Refactor operate() (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Refactor operate() and remove dead comments (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Merged Main (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Merged Main (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * remove trailing space in build.gradle (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * nit javadoc update (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * nit javadoc updates (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * nit javadoc updates (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Remove dead comments (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * update javadoc (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * LOG Level Change (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Change from static class to enum (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> --------- Signed-off-by: CoderJeffrey <[email protected]> (cherry picked from commit 2b970e3)
* Remove log files and add DCO (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Remove extra files (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Remove styling difference (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Remove unnecessary file changes (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Add RCA_Decider (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Extract Heap Usage from SQlitedb (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Extract required searchbp metrics for deciders (signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Add SearchBackPressureRCA Metric (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Use SearchBackPressureRCAMetrics to aggregate metrics (signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Add the conf file extracted part for SearchBackPressureRcaConfig.java (signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Add MinMaxSlidingWindow in OldGenRca (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Rename SearchBackPressureClusterRCA and add it to AnalysisGraph (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Add basic UTs for SearchBackPressureRCA cluster/node level (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Add unhealthy/healthy stats UTs for SearchBackPressureRCA cluster/node level (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Add healthy resource unit UT (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Add UT s both shard/task level (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Add a new SearchBp Resource Unit (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Add UTs to test shard/task level resource include-ness (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Remove styling changes for Version.java (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Add metadata to resourceSummary (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Update to more general framework (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Refactor the MinMaxSlidingWindow and bug fix (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Refactor Heap Stats Metrics Getter(Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Refactor HeapUsed and HeapMax Getters (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Refactor operate() (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Refactor operate() and remove dead comments (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Merged Main (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Merged Main (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * remove trailing space in build.gradle (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * nit javadoc update (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * nit javadoc updates (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * nit javadoc updates (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Remove dead comments (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * update javadoc (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * LOG Level Change (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> * Change from static class to enum (Signed-off-by: Jeffrey Liu [email protected]) Signed-off-by: CoderJeffrey <[email protected]> --------- Signed-off-by: CoderJeffrey <[email protected]> (cherry picked from commit 2b970e3) Co-authored-by: Jeffrey Liu <[email protected]>
Is your feature request related to a problem? Please provide an existing Issue # , or describe.
After Reader Component of RCA digest Search Back Pressure service related stats and populate the stats in sqlitedb, we need Cluster/Node level RCA to analyze the stats and send node/cluster level Resource Flow Units to deciders. Then, deciders can use the resource flow units to suggest actions for downstream classes.
Describe the solution you are proposing
Create Cluster/Node Level RCA to generate Resource FlowUnit by comparing stats of SearchBackPressure Service and threshold, and added corresponding UTs.
Search Back Pressure RCA analyze the stats of
Searchbp_Stats Table
in sqlitedb from previous PR of Reader Component for RCA (#427). Example stats (e.g heap_cancellation_count with threshold, generate healthy/unhealthy resource flow units) are used to determine the health status of a node.Describe alternatives you've considered
N/A
Additional context
The logic to decide whether the node level rca healthiness is not decided yet, we will talk with other teams (e.g. SearchBackPressure team) to discuss the final threshold for RCA, and the logic will likely change in near future.
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.