-
Notifications
You must be signed in to change notification settings - Fork 20
Reader changes for dynamic enable/disable of RCA graph components #325
Conversation
Codecov Report
@@ Coverage Diff @@
## master #325 +/- ##
=========================================
Coverage ? 67.15%
Complexity ? 2074
=========================================
Files ? 299
Lines ? 13280
Branches ? 1099
=========================================
Hits ? 8918
Misses ? 3958
Partials ? 404 Continue to review full report at Codecov.
|
General comment on the description message:
Should this be done at collator level or decider level? I was thinking that if an action is muted, it returns |
Aah yes, I started this before I started on collator, so I missed the fact that deciders can suggest multiple actions based. Will change this. |
src/test/java/com/amazon/opendistro/elasticsearch/performanceanalyzer/net/GRPCTest.java
Outdated
Show resolved
Hide resolved
...on/opendistro/elasticsearch/performanceanalyzer/config/overrides/ConfigOverridesWrapper.java
Show resolved
Hide resolved
...istro/elasticsearch/performanceanalyzer/decisionmaker/actions/ModifyQueueCapacityAction.java
Outdated
Show resolved
Hide resolved
src/main/java/com/amazon/opendistro/elasticsearch/performanceanalyzer/rca/RcaController.java
Outdated
Show resolved
Hide resolved
src/main/java/com/amazon/opendistro/elasticsearch/performanceanalyzer/rca/RcaController.java
Outdated
Show resolved
Hide resolved
src/main/java/com/amazon/opendistro/elasticsearch/performanceanalyzer/rca/RcaController.java
Outdated
Show resolved
Hide resolved
.../amazon/opendistro/elasticsearch/performanceanalyzer/rca/framework/core/ConfJsonWrapper.java
Outdated
Show resolved
Hide resolved
...on/opendistro/elasticsearch/performanceanalyzer/config/overrides/ConfigOverridesApplier.java
Show resolved
Hide resolved
...on/opendistro/elasticsearch/performanceanalyzer/config/overrides/ConfigOverridesApplier.java
Outdated
Show resolved
Hide resolved
@@ -75,15 +87,18 @@ public void processEvent(Event event) { | |||
|
|||
// An example node_metrics data is something like this for a two node cluster: | |||
// {"current_time":1566414001749} | |||
// 1566414001749 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file used to be a json object per line. It would be nice if we can keep that and push the override timestamp inside as follows:
{"overrides": {"enabled": {}, "disabled": {}}, ts:54674895}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This needs a change in the plugin as well. Will follow up with a separate PR for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cool. But let's change it.
...n/java/com/amazon/opendistro/elasticsearch/performanceanalyzer/rca/framework/core/Stats.java
Show resolved
Hide resolved
...amazon/opendistro/elasticsearch/performanceanalyzer/reader/ClusterDetailsEventProcessor.java
Outdated
Show resolved
Hide resolved
...on/opendistro/elasticsearch/performanceanalyzer/config/overrides/ConfigOverridesApplier.java
Show resolved
Hide resolved
...on/opendistro/elasticsearch/performanceanalyzer/config/overrides/ConfigOverridesApplier.java
Show resolved
Hide resolved
...on/opendistro/elasticsearch/performanceanalyzer/config/overrides/ConfigOverridesApplier.java
Show resolved
Hide resolved
...java/com/amazon/opendistro/elasticsearch/performanceanalyzer/rca/framework/core/RcaConf.java
Outdated
Show resolved
Hide resolved
this(new ConfigOverridesApplier()); | ||
} | ||
|
||
public ClusterDetailsEventProcessor(final ConfigOverridesApplier overridesApplier) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we don't need to pass ConfigOverridesApplier
as a constructor. No state is kept in it. It reads the config and applies it to the rca.conf
. Maybe we can just use it as a helper static class ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It does keep the lastAppliedTimestamp
as an instance variable and uses that to check if the config overrides string received from the ClusterDetailsEventProcessor
. Making it a static helper will make all the consumers see the same value of last applied time(in an IT for example).
LOG.debug("New set of muted rcas: {}", currentMutedRcaSet); | ||
LOG.debug("New set of muted deciders: {}", currentMutedDeciderSet); | ||
LOG.debug("New set of muted actions: {}", currentMutedActionSet); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We should be applying change when someone calls the API or at startup, which shouldn't be that many. Should we make it info logs ? Or add a metric.
LOG.debug("New set of muted rcas: {}", currentMutedRcaSet); | ||
LOG.debug("New set of muted deciders: {}", currentMutedDeciderSet); | ||
LOG.debug("New set of muted actions: {}", currentMutedActionSet); | ||
rcaConf.updateAllRcaConfFiles(currentMutedRcaSet, currentMutedDeciderSet, currentMutedActionSet); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now we are updating three rca.conf
files. updateAllRcaConfFiles
has a void return type. What happens if it succeeded on a subset of files but not all? We will still update lastAppliedTimestamp
and never look back until someone makes a change again. So, should we make lastAppliedTimestamp
return true when writes on all three files succeed and and then only update the lastAppliedTimestamp
. If not, then we will retry again on the next run and eventually all three files will be updated or we likely have a disk corruption or disk full issues. thoughts ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's a good point! I'm checking all three updates before updating the lastAppliedTimestamp.
I'm also emitting a metric if we fail to update an rca conf file. If we end up in a state where we can't do these file system changes, I think we should be failing in more places than just this, so yea, it's probably ok to do the timestamp update only if all three succeed.
@@ -39,10 +40,12 @@ | |||
// initiate a node config cache within each AppContext space | |||
// to store node config settings from ES | |||
private final NodeConfigCache nodeConfigCache; | |||
private Set<String> mutedActions; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The RCAController thread updates this but it is read by the SchedulerTask threads. We should make this volatile
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
mutedActions); | ||
if (!updateStatus) { | ||
LOG.error("Failed to update the conf file at path: {}", confFilePath); | ||
StatsCollector.instance().logMetric(RcaConsts.WRITE_UPDATED_RCA_CONF_ERROR); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can use the new metric framework. There your metric can also tell you which file erred out ?
Changes for 1 package (OpenDistroPerformanceAnalyzerEngine), pushed in snapshot... https://code.amazon.com/snapshots/partsrut/2020-08-08T02-49-04 Changes for OpenDistroPerformanceAnalyzerEngine package: 0194f9f Fix failing reaction wheel tests bfdb93d Merge remote-tracking branch 'upstream/master' e03a120 Use StringUtils instead of NumberUtils to check timestamp string (#360) 79841f2 Expose queue rejection time period setting in rca.conf (#356) f9dfda5 Implement isMuted method in DummyAction (#355) 7ede0b9 Reader changes for dynamic enable/disable of RCA graph components (#325) ee58207 Fix bug in publisher to support cool off period on a per node basis (#351) cr https://code.amazon.com/reviews/CR-31344155
Issue #:
Fixes #327
Description of changes:
This change adds functionality to read the overrides written by the NodeDetailsCollector and update the rca conf file accordingly. To that effect this change adds two new sections to the
rca[_idle][_master].conf
:mutedActions: []
mutedDeciders: []
Updating RCA conf and the muted nodes work for RCAs and deciders. For actions, the list of muted actions are looked at by the collator and prunes the ones that are muted.
Tests:
Ran tests with different overrides and made sure that the rca conf was updated correctly and muted correctly. Verified same things for the actions as well.
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.