
separate doc-level monitor query indices created by detectors #1324

Merged · 4 commits · Sep 26, 2024

Conversation


@sbcd90 sbcd90 commented Sep 25, 2024

Description

separate doc-level monitor query indices created by detectors

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@@ -887,7 +887,7 @@ private IndexMonitorRequest createDocLevelMonitorMatchAllRequest(

Monitor monitor = new Monitor(monitorId, Monitor.NO_VERSION, monitorName, false, detector.getSchedule(), detector.getLastUpdateTime(), null,
Monitor.MonitorType.DOC_LEVEL_MONITOR.getValue(), detector.getUser(), 1, docLevelMonitorInputs, triggers, Map.of(),
- new DataSources(detector.getRuleIndex(),
+ new DataSources(detector.getRuleIndex() + "_chained_findings",
Member

why do we need a separate index?

Collaborator Author

@sbcd90 sbcd90 Sep 25, 2024

each doc-level monitor has its own query index now, so the chained-findings doc-level monitor also has its own index. The Sigma rules doc-level monitor and the bucket-level monitor's chained monitor run asynchronously and thus cannot share the same query index, as it is created and deleted on every run.
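The chained-findings monitor's query index name is derived by suffixing the detector's rule index, per the diff above. A minimal standalone sketch of that derivation (the class and helper names here are illustrative, not project code):

```java
public class ChainedFindingsIndexName {
    // Derive the dedicated query index name for the chained-findings
    // doc-level monitor from the detector's rule index, as in the diff above.
    static String chainedFindingsQueryIndex(String ruleIndex) {
        return ruleIndex + "_chained_findings";
    }

    public static void main(String[] args) {
        String name = chainedFindingsQueryIndex(
                ".opensearch-sap-windows-detectors-queries-optimized-123");
        System.out.println(name);
        assert name.endsWith("_chained_findings");
    }
}
```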

}

public static String getRuleIndexOptimized(String logType) {
return String.format(Locale.getDefault(), ".opensearch-sap-%s-detectors-queries-optimized-%s", logType, UUID.randomUUID());
Member

we are moving from index pattern to index?

Collaborator Author

no. this method is only used by a few integ tests, hence the change.

@@ -1369,7 +1369,11 @@ void onGetResponse(Detector currentDetector, User user) {
request.getDetector().setAlertsHistoryIndexPattern(DetectorMonitorConfig.getAlertsHistoryIndexPattern(ruleTopic));
request.getDetector().setFindingsIndex(DetectorMonitorConfig.getFindingsIndex(ruleTopic));
request.getDetector().setFindingsIndexPattern(DetectorMonitorConfig.getFindingsIndexPattern(ruleTopic));
request.getDetector().setRuleIndex(DetectorMonitorConfig.getRuleIndex(ruleTopic));
if (currentDetector.getRuleIndex().contains("optimized")) {
Member

if name contains optimized shouldn't we set rule index name as getRuleIndexOptimized value?

is this if else condition flipped?

Collaborator Author

the if-else condition checks whether the detector is already using a dedicated query index; if so, it keeps using it. Otherwise, the detector is switched from the shared query index to a new dedicated query index when it is updated.
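A standalone sketch of that selection logic, based on the diffs shown in this thread (the final merged code may differ after review fixes). The `dedicatedQueryIndicesEnabled` flag stands in for the cluster setting, and the class and method names are illustrative:

```java
import java.util.Locale;
import java.util.UUID;

public class RuleIndexSelection {
    // Shared query index pattern for a log type (note the trailing '*').
    static String getRuleIndex(String logType) {
        return String.format(Locale.getDefault(),
                ".opensearch-sap-%s-detectors-queries*", logType);
    }

    // Dedicated (per-detector) query index name.
    static String getRuleIndexOptimized(String logType) {
        return String.format(Locale.getDefault(),
                ".opensearch-sap-%s-detectors-queries-optimized-%s", logType, UUID.randomUUID());
    }

    // Decide which rule index an updated detector should use.
    static String selectRuleIndex(String currentRuleIndex, String ruleTopic,
                                  boolean dedicatedQueryIndicesEnabled) {
        if (currentRuleIndex.contains("optimized")) {
            return currentRuleIndex; // already dedicated: keep it
        }
        if (dedicatedQueryIndicesEnabled) {
            return getRuleIndexOptimized(ruleTopic); // switch to a new dedicated index
        }
        String pattern = getRuleIndex(ruleTopic);
        return pattern.substring(0, pattern.length() - 1); // strip '*' for the shared alias name
    }

    public static void main(String[] args) {
        // Setting off, not yet optimized: stays on the shared alias.
        System.out.println(selectRuleIndex(
                ".opensearch-sap-windows-detectors-queries", "windows", false));
        // Setting on, not yet optimized: switches to a dedicated index.
        System.out.println(selectRuleIndex(
                ".opensearch-sap-windows-detectors-queries", "windows", true));
    }
}
```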

" }\n" +
" }\n" +
"}";
SearchResponse response = executeSearchAndGetResponse(DetectorMonitorConfig.getRuleIndex(randomDetectorType()), request, true);
Member

@eirsep eirsep Sep 25, 2024

why has this been removed?

Please don't reduce test coverage;
fix the test as necessary

Collaborator Author

query indices are created and deleted on every run now, so it is difficult to assert the number of queries in the query index.

Member

@eirsep eirsep left a comment

what happens when someone updates a detector that doesn't have an exclusive query index?

@@ -825,21 +825,10 @@ public void testMultipleAggregationAndDocRules_alertSuccess() throws IOException

Response createResponse = makeRequest(client(), "POST", SecurityAnalyticsPlugin.DETECTOR_BASE_URI, Collections.emptyMap(), toHttpEntity(detector));


String request = "{\n" +
Member

please revert and fix the test as necessary

Collaborator Author

query indices are created and deleted on every run now, so it is difficult to assert the number of queries in the query index.

@@ -1209,21 +1209,11 @@ public void testCreateDetectorWithNotCondition_verifyFindingsAndNoFindings_succe

Response createResponse = makeRequest(client(), "POST", SecurityAnalyticsPlugin.DETECTOR_BASE_URI, Collections.emptyMap(), toHttpEntity(detector));

String request = "{\n" +
Member

please revert

Collaborator Author

query indices are created and deleted on every run now, so it is difficult to assert the number of queries in the query index.

@sbcd90
Collaborator Author

sbcd90 commented Sep 25, 2024

what happens when someone updates a detector that doesn't have an exclusive query index?

The detector is updated to use exclusive query index.

Signed-off-by: Subhobrata Dey <[email protected]>
@sbcd90
Collaborator Author

sbcd90 commented Sep 26, 2024

security tests fail as the docker image is not updated.

@eirsep
Member

eirsep commented Sep 26, 2024

looks like we create aliases, not indices, from the `queryIndex` field in monitor data sources

@eirsep
Member

eirsep commented Sep 26, 2024

can we add tests for these scenarios:

1. Setting is off. Create 2 detectors; verify the query index names. Turn the setting on. Update 1 detector; verify the query index name.
2. Setting is on. Create 1 detector; verify the query index name. Turn the setting off. Update the detector; verify the query index name. Create a new detector and verify its query index (it should be old-style since the setting is off).

request.getDetector().setRuleIndex(DetectorMonitorConfig.getRuleIndexOptimized(ruleTopic));
} else {
String ruleTopicIndex = DetectorMonitorConfig.getRuleIndex(ruleTopic);
request.getDetector().setRuleIndex(ruleTopicIndex.substring(0, ruleTopicIndex.length()-1));
Member

why did we change the existing code from
request.getDetector().setRuleIndex(DetectorMonitorConfig.getRuleIndex(ruleTopic));

Collaborator Author

ruleIndex is changing to the dedicated ruleIndexOptimized now; that's why the if-else condition.

@@ -22,7 +24,11 @@ public class DetectorMonitorConfig {
public static final String OPENSEARCH_SAP_RULE_INDEX_TEMPLATE = ".opensearch-sap-detectors-queries-index-template";

public static String getRuleIndex(String logType) {
- return String.format(Locale.getDefault(), ".opensearch-sap-%s-detectors-queries", logType);
+ return String.format(Locale.getDefault(), ".opensearch-sap-%s-detectors-queries*", logType);
Member

why `*`?
this is an alias

Collaborator Author

fixed it.

@@ -1369,7 +1379,16 @@ void onGetResponse(Detector currentDetector, User user) {
request.getDetector().setAlertsHistoryIndexPattern(DetectorMonitorConfig.getAlertsHistoryIndexPattern(ruleTopic));
request.getDetector().setFindingsIndex(DetectorMonitorConfig.getFindingsIndex(ruleTopic));
request.getDetector().setFindingsIndexPattern(DetectorMonitorConfig.getFindingsIndexPattern(ruleTopic));
request.getDetector().setRuleIndex(DetectorMonitorConfig.getRuleIndex(ruleTopic));
if (currentDetector.getRuleIndex().contains("optimized")) {
request.getDetector().setRuleIndex(currentDetector.getRuleIndex());
Member

please add a comment that if we turn the setting OFF after turning it on, updating the detector will not change it from optimized back to normal, and we need to re-create the detector

Collaborator Author

added a comment. added a test too.

request.getDetector().setRuleIndex(DetectorMonitorConfig.getRuleIndexOptimized(ruleTopic));
} else {
String ruleTopicIndex = DetectorMonitorConfig.getRuleIndex(ruleTopic);
request.getDetector().setRuleIndex(ruleTopicIndex.substring(0, ruleTopicIndex.length() - 1));
Member

this line is very confusing

Collaborator Author

fixed it.

@@ -266,6 +266,7 @@ void setDebugLogLevel() throws IOException, InterruptedException {


makeRequest(client(), "PUT", "_cluster/settings", Collections.emptyMap(), se, new BasicHeader("Content-Type", "application/json"));
updateClusterSetting("plugins.security_analytics.enable_detectors_with_dedicated_query_indices", "true");
Member

are there tests with the setting turned off?

Collaborator Author

added a test for setting turned off.

@@ -1722,12 +1725,16 @@ public void testAzureMappings() throws IOException {
DetectorInput input = new DetectorInput("windows detector for security analytics", List.of(indexName), List.of(),
getPrePackagedRules("azure").stream().map(DetectorRule::new).collect(Collectors.toList()));
Detector detector = randomDetectorWithInputs(List.of(input), "azure");
createDetector(detector);

List<SearchHit> hits = executeSearch(".opensearch-sap-azure-detectors-queries-000001", matchAllSearchBody);
Member

can you retain this test and add a new one: basically, test for both the setting turned on and turned off

Collaborator Author

@sbcd90 sbcd90 Sep 26, 2024

added a scenario for the flag turned off in the test cases. Updated this test case to fetch queries from the index pattern: https://github.com/sbcd90/security-analytics/blob/query_idx/src/test/java/org/opensearch/securityanalytics/mapper/MapperRestApiIT.java#L1727

@@ -1104,21 +1094,11 @@ public void testCreateDetector_verifyWorkflowCreation_success_WithGroupByRulesIn

Response createResponse = makeRequest(client(), "POST", SecurityAnalyticsPlugin.DETECTOR_BASE_URI, Collections.emptyMap(), toHttpEntity(detector));

String request = "{\n" +
Member

we should not remove this code; we lose test coverage, and that's an anti-pattern. Let's fix these tests.

Collaborator Author

fixed it.

Signed-off-by: Subhobrata Dey <[email protected]>
@sbcd90
Collaborator Author

sbcd90 commented Sep 26, 2024

can we add tests for scenarios

setting is off. create 2 detectors. verify query index name. turn on setting. update 1 detector, verify query index name. setting is on. create 1 detector. verify query index name. turn off setting. update detector, verify query index name. create new detector and verify query index(should be old style since setting is off)

added both test cases.

1st scenario: https://github.com/sbcd90/security-analytics/blob/c514a780d4b3f65ac828e3ff868c56f46f2a82b7/src/test/java/org/opensearch/securityanalytics/resthandler/DetectorRestApiIT.java#L1729
2nd scenario: https://github.com/sbcd90/security-analytics/blob/c514a780d4b3f65ac828e3ff868c56f46f2a82b7/src/test/java/org/opensearch/securityanalytics/resthandler/DetectorRestApiIT.java#L1831

@@ -83,7 +83,8 @@ private void getAllRuleIndices(ActionListener<List<String>> listener) {
listener.onResponse(
logTypes
.stream()
.map(logType -> DetectorMonitorConfig.getRuleIndex(logType))
// use index pattern here to define rule topic index template for all query indices which match the pattern
.map(logType -> DetectorMonitorConfig.getRuleIndex(logType) + "*")
Member

shouldn't this be detector-specific?
why do we have this method getAllRuleIndices?

Collaborator Author

a single index template containing specific settings for sigma rule execution is applied to all rule indices.
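The template's match patterns come from appending `*` to each per-log-type rule index name, as in the diff above. A standalone sketch (the class name and log-type list are illustrative):

```java
import java.util.List;
import java.util.Locale;
import java.util.stream.Collectors;

public class RuleIndexPatterns {
    // Base rule index name for a log type (per DetectorMonitorConfig.getRuleIndex).
    static String getRuleIndex(String logType) {
        return String.format(Locale.getDefault(),
                ".opensearch-sap-%s-detectors-queries", logType);
    }

    // One pattern per log type; the trailing '*' makes the rule topic index
    // template match every dedicated query index created for that log type.
    static List<String> getAllRuleIndexPatterns(List<String> logTypes) {
        return logTypes.stream()
                .map(logType -> getRuleIndex(logType) + "*")
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        System.out.println(getAllRuleIndexPatterns(List.of("windows", "azure")));
    }
}
```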

@eirsep
Member

eirsep commented Sep 26, 2024

CI is failing

@sbcd90
Collaborator Author

sbcd90 commented Sep 26, 2024

CI is failing

CI passed now.

@eirsep
Member

eirsep commented Sep 26, 2024

Security tests failed

@sbcd90 sbcd90 merged commit 1ca4090 into opensearch-project:main Sep 26, 2024
13 of 19 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Sep 26, 2024
Signed-off-by: Subhobrata Dey <[email protected]>
(cherry picked from commit 1ca4090)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@opensearch-trigger-bot
Contributor

The backport to 2.15 failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/security-analytics/backport-2.15 2.15
# Navigate to the new working tree
pushd ../.worktrees/security-analytics/backport-2.15
# Create a new branch
git switch --create backport-1324-to-2.15
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 1ca40908ac7b050cc7eb90df3108ab8a21a1f6d1
# Push it to GitHub
git push --set-upstream origin backport-1324-to-2.15
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/security-analytics/backport-2.15

Then, create a pull request where the base branch is 2.15 and the compare/head branch is backport-1324-to-2.15.

@opensearch-trigger-bot
Contributor

The backport to 2.16 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Navigate to the root of your repository
cd $(git rev-parse --show-toplevel)
# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/security-analytics/backport-2.16 2.16
# Navigate to the new working tree
pushd ../.worktrees/security-analytics/backport-2.16
# Create a new branch
git switch --create backport-1324-to-2.16
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 1ca40908ac7b050cc7eb90df3108ab8a21a1f6d1
# Push it to GitHub
git push --set-upstream origin backport-1324-to-2.16
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/security-analytics/backport-2.16

Then, create a pull request where the base branch is 2.16 and the compare/head branch is backport-1324-to-2.16.

sbcd90 added a commit to sbcd90/security-analytics that referenced this pull request Sep 26, 2024
sbcd90 pushed a commit that referenced this pull request Sep 30, 2024
opensearch-trigger-bot bot added a commit that referenced this pull request Oct 7, 2024
sbcd90 added a commit to sbcd90/security-analytics that referenced this pull request Oct 7, 2024
sbcd90 added a commit that referenced this pull request Oct 7, 2024
sbcd90 pushed a commit that referenced this pull request Oct 16, 2024
…#1330) (#1343)

Signed-off-by: Subhobrata Dey <[email protected]>
(cherry picked from commit 038d60a)

Co-authored-by: opensearch-trigger-bot[bot] <98922864+opensearch-trigger-bot[bot]@users.noreply.github.com>
@ELKHostmaster

@sbcd90 Good morning,
Regarding this change: we have a large cluster running lots of Sigma-rule security detectors. We recently upgraded from 2.13 to 2.16, but cluster and detector performance has been absolutely terrible ever since. We see two nodes in our cluster that seem to perform the doc-level fan-out, and these tasks run for days until nodes in the cluster start having heap issues, circuit breakers, nodes dropping out, etc.

It appears that the change to document-level fan-out in 2.14 (opensearch-project/alerting#1521) could be the issue.

Is the change above, to the way fan-outs are handled, an attempt to fix the performance of the security analytics detections?

@sbcd90
Collaborator Author

sbcd90 commented Nov 21, 2024

hi @ELKHostmaster, please bounce (stop and start) the OpenSearch process on every data node. This would clean up those tasks.

Please delete and re-create the detectors on a rolling index alias, with the write index pointing to the current index where data is ingested (https://opensearch.org/docs/latest/im-plugin/index-alias/#index-alias-options). After deleting the detectors, you may also want to delete the .opensearch-sap-detectors* indices and clean up the monitors linked to the two detectors you mentioned from the .opendistro-alerting-config index.

In the end, it should result in better performance.

@mimicbox

Hi @sbcd90

I am having a similar issue and have been following this. I want to try your suggestion, but the part I do not get is:

create detectors on a rolling index alias with write index pointing to the current index where data is ingested

How do I define this during detector creation? I use the API to create detectors. Are you talking about the index pattern you define in the inputs.detector_input.indices value? I currently use an alias that points to a data stream.

Thank you for any help at all!

@sbcd90
Collaborator Author

sbcd90 commented Nov 21, 2024

hi @mimicbox, if you're currently using an index pattern with your detector, just change it to an index alias instead. The term rolling refers to an index alias with rollovers.
e.g. we have an index:

api_output-000001 -> currently the write index for the index alias api_output

After some time the index rolls over to api_output-000002; the write index also moves if you're using an ISM policy for rollover: https://opensearch.org/docs/latest/im-plugin/ism/policies/#rollover

Provide api_output as input to the detector. If there is no auto-suggestion, manually type it into the text box.
Lastly, keep the index alias name api_output a prefix of the index names behind it, e.g. api_output-000001, api_output-000002, and so on, or api_output-test-000001, api_output-test-000002, and so on.
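The prefix convention above (the alias name must be a prefix of every backing index name) can be sanity-checked with a small helper. This is purely an illustrative check, not project code:

```java
public class AliasPrefixCheck {
    // The alias must be a prefix of every backing index name,
    // e.g. alias "api_output" backs api_output-000001, api_output-000002, ...
    static boolean followsPrefixConvention(String alias, String indexName) {
        return indexName.startsWith(alias);
    }

    public static void main(String[] args) {
        System.out.println(followsPrefixConvention("api_output", "api_output-000001"));
        System.out.println(followsPrefixConvention("api_output", "api_output-test-000002"));
        System.out.println(followsPrefixConvention("api_output", "other_index-000001"));
    }
}
```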

@ELKHostmaster

@sbcd90 If I understand correctly, you're saying point the detector to the index alias and use ISM to roll over the indices at regular intervals.

I'm already doing that. There is something related to the fan-out change introduced in 2.14 that I think has really caused the performance issues.

Is this fixed in 2.18? Is upgrading from 2.16 a possible solution?

@sbcd90
Collaborator Author

sbcd90 commented Nov 21, 2024

@ELKHostmaster , are you using Sigma Aggregation Rules?

@ELKHostmaster

@sbcd90 do you mean using any of the aggregation functions in sigma rules (https://help.sigmacomputing.com/docs/aggregate-functions)?

No, I don't believe so, but I am scanning through them now.

@ELKHostmaster

No correlation rules either. I've stripped it down to just our sigma detections as we've been trying to get to the bottom of the issue for the last month.

@sbcd90
Collaborator Author

sbcd90 commented Nov 21, 2024

hi @ELKHostmaster, please bounce (stop and start) the OpenSearch process on every data node. This would clean up those tasks.
Please re-create (delete and create) the detectors. This will help clean up old metadata.
I would suggest moving to an odd release, e.g. 2.15 or 2.17, before doing these steps. They will have the latest fixes.

@ELKHostmaster

@sbcd90 thank you. Yes, I have stopped/started every data node several times over the last month; it's the only way to keep the cluster running at this point. I'm currently on 2.16.

I thought this fix was added in 2.18. Are you saying it was backported to 2.17.1? I didn't think backports resulted in a new release package being built, so I didn't think the fix would be included.

Also, when you say "odd" release, is there something different about odd/even releases?

Thanks again for the responses!

@sbcd90
Collaborator Author

sbcd90 commented Nov 21, 2024

hi @ELKHostmaster, I don't think data nodes need to be stopped/started frequently. I would suggest taking a thread dump of a data node when doc-level fan-out tasks are stuck; that would help root-cause the issue.

Yes, I think all these fixes should be in 2.18. But to stay up to date with future fixes, following an odd release helps.

@ELKHostmaster

ELKHostmaster commented Nov 22, 2024

@sbcd90 Good evening. I waited until a couple data nodes got really busy and took jstack dumps for you to look at. I also included a list of tasks running on the cluster. You can see a lot of tasks are running for 8 hours now and growing since the last reboot of the nodes. I appreciate any help you can give. Thanks!
_cat_tasks.txt
data-21.txt
data-13.txt

@ELKHostmaster

@sbcd90 Good morning. Could I ask a favor of you? Could you take a look at the thread dumps above and see if the issue is related to the fan-out change? Our cluster is unusable right now with detectors enabled/running: circuit-breaker issues, Java heap exceptions, nodes dropping out, 99% CPU on data nodes, etc. This was an upgrade from 2.13 where everything was fine; the only change was upgrading to 2.16. Same detectors/rules, etc.

Do you think the fixes to fanout in 2.18 will help us?

Thank You in advance!
