
[ML] Anomaly detection job throws errors when not all remote clusters have source data #100311

Closed
wwang500 opened this issue Oct 5, 2023 · 3 comments
Labels
>bug · :ml (Machine learning) · Team:ML (Meta label for the ML team)

Comments


wwang500 commented Oct 5, 2023

Stack Version:

8.10.2

Error:

[instance-0000000034] [ccs_using_wildcard_for_both_clusters] error while extracting data
org.elasticsearch.ResourceNotFoundException: [3] remote clusters out of [4] were skipped when performing datafeed search
	at org.elasticsearch.xpack.core.ml.datafeed.extractor.DataExtractor.checkForSkippedClusters(DataExtractor.java:61) ~[?:?]
	at org.elasticsearch.xpack.ml.datafeed.extractor.chunked.ChunkedDataExtractor.executeSearchRequest(ChunkedDataExtractor.java:150) ~[?:?]
	at org.elasticsearch.xpack.ml.datafeed.extractor.chunked.ChunkedDataExtractor$DataSummaryFactory.newAggregatedDataSummary(ChunkedDataExtractor.java:248) ~[?:?]
	at org.elasticsearch.xpack.ml.datafeed.extractor.chunked.ChunkedDataExtractor$DataSummaryFactory.buildDataSummary(ChunkedDataExtractor.java:221) ~[?:?]
	at org.elasticsearch.xpack.ml.datafeed.extractor.chunked.ChunkedDataExtractor.setUpChunkedSearch(ChunkedDataExtractor.java:123) ~[?:?]
	at org.elasticsearch.xpack.ml.datafeed.extractor.chunked.ChunkedDataExtractor.next(ChunkedDataExtractor.java:116) ~[?:?]
	at org.elasticsearch.xpack.ml.datafeed.DatafeedJob.run(DatafeedJob.java:376) ~[?:?]
	at org.elasticsearch.xpack.ml.datafeed.DatafeedJob.runRealtime(DatafeedJob.java:226) ~[?:?]
	at org.elasticsearch.xpack.ml.datafeed.DatafeedRunner$Holder.executeRealTime(DatafeedRunner.java:561) ~[?:?]
	at org.elasticsearch.xpack.ml.datafeed.DatafeedRunner$3.doRun(DatafeedRunner.java:307) ~[?:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983) ~[elasticsearch-8.10.2.jar:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.10.2.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1623) ~[?:?]

Possible root cause:

#97731

Steps to reproduce:

  1. Deploy three clusters:
     • main_cluster
     • remote_cluster_1
     • remote_cluster_2
  2. From main_cluster, set up remote cluster connections to remote_cluster_1 and remote_cluster_2.
  3. In remote_cluster_1, load some data; in my case I have a dataset named gallery-2023.
  4. In main_cluster, create an anomaly detection job whose datafeed is configured with:
"indices": [
      "*:gallery-*"
    ]
  5. Start the AD job (a minimal sketch of these requests follows this list).
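
For reference, a minimal sketch of steps 2–5, assuming a @timestamp time field and a simple count detector; the datafeed ID, seed addresses, and analysis config below are illustrative, not the exact ones used in the test:

PUT _cluster/settings
{
  "persistent": {
    "cluster.remote.remote_cluster_1.seeds": ["remote1.example.com:9300"],
    "cluster.remote.remote_cluster_2.seeds": ["remote2.example.com:9300"]
  }
}

PUT _ml/anomaly_detectors/ccs_using_wildcard_for_both_clusters
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [ { "function": "count" } ]
  },
  "data_description": { "time_field": "@timestamp" }
}

PUT _ml/datafeeds/datafeed-ccs_using_wildcard_for_both_clusters
{
  "job_id": "ccs_using_wildcard_for_both_clusters",
  "indices": [ "*:gallery-*" ]
}

POST _ml/datafeeds/datafeed-ccs_using_wildcard_for_both_clusters/_start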

Observed:

The AD job is in the started state, but it didn't process any data, and in the job messages there is an error: Datafeed is encountering errors extracting data: [3] remote clusters out of [4] were skipped when performing datafeed search
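
The job and datafeed state can be checked from main_cluster; the IDs below are assumed to match the job name in the log above:

GET _ml/anomaly_detectors/ccs_using_wildcard_for_both_clusters/_stats
GET _ml/datafeeds/datafeed-ccs_using_wildcard_for_both_clusters/_stats

The extraction error itself was observed in the job messages (the Job Messages tab in Kibana).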

Note:
With the same configuration, the job was working fine on stack version 8.9.0.

  • In 8.9.0, for this setup (multiple remote clusters, only one cluster has the gallery data), the search GET *:gallery-*/_search returns:
"_clusters": {
    "total": 2,
    "successful": 2,
    "skipped": 0
  },
  • In 8.10.x, the same search GET *:gallery-*/_search returns:
"_clusters": {
    "total": 2,
    "successful": 1,
    "skipped": 1,
    "details": {
      "remote_cluster_1": {
        "status": "successful",
        "indices": "gallery-*",
        "took": 19,
        "timed_out": false,
        "_shards": {
          "total": 126,
          "successful": 126,
          "skipped": 0,
          "failed": 0
        }
      },
      "remote_cluster_2": {
        "status": "failed",
        "indices": "gallery-*",
        "took": 2,
        "timed_out": false,
        "_shards": {
          "total": 0,
          "successful": 0,
          "skipped": 0,
          "failed": 0
        }
      }
    }
  },
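
To confirm that both remotes are connected and that only remote_cluster_1 holds indices matching the wildcard, the remote info and resolve index APIs can be used (a sketch; these calls are not part of the original report):

GET _remote/info

GET _resolve/index/*:gallery-*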
elasticsearchmachine (Collaborator) commented

Pinging @elastic/ml-core (Team:ML)

droberts195 (Contributor) commented

Setting ccs_minimize_roundtrips=false is a possible workaround for this problem.

Hopefully #100354 will fix it. We should retest once that is merged and then assess whether any further fixes on top are required in the ML datafeed code.
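
For anyone hitting this before the fix lands, the workaround can be verified on a manual cross-cluster search like the one above (a sketch only; whether the datafeed's own searches can be configured this way is not covered in this thread):

GET *:gallery-*/_search?ccs_minimize_roundtrips=false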

wwang500 (Author) commented

> Setting ccs_minimize_roundtrips=false is a possible workaround for this problem.
>
> Hopefully #100354 will fix it. We should retest once that is merged and then assess whether any further fixes on top are required in the ML datafeed code.

After the PR #100354 was merged, I retested using the latest 8.12.0-SNAPSHOT build and can confirm this issue is fixed.

Now, when I run GET *:gallery-*/_search?size=1 (with and without the Skip if unavailable option enabled for the remote clusters), I get this:

{
  "took": 6,
  "timed_out": false,
  "num_reduce_phases": 3,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  },
  "_clusters": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "running": 0,
    "partial": 0,
    "failed": 0,
    "details": {
      "ccs_1": {
        "status": "successful",
        "indices": "gallery-*",
        "took": 3,
        "timed_out": false,
        "_shards": {
          "total": 2,
          "successful": 2,
          "skipped": 0,
          "failed": 0
        }
      },
      "ccs_2": {
        "status": "successful",
        "indices": "gallery-*",
        "took": 0,
        "timed_out": false,
        "_shards": {
          "total": 0,
          "successful": 0,
          "skipped": 0,
          "failed": 0
        }
      }
    }
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "ccs_1:gallery-2023-10",
        "_id": "AVvTrgje9UgloJtm0xiN",
        "_score": 1,
        "_source": {
          "referer": "http://www.galleryjasminewhite.com/",
          "ver": null,
          "referer_domain": "http://www.galleryjasminewhite.com",
          "method": "GET",
          "useragent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0",
          "uri": "/wp-content/uploads/2013/06/dune_house_oil_on_canvas_24x20-298x298.jpg",
          "version": "HTTP/1.1",
          "@timestamp": "2023-10-10T14:14:14.000Z",
          "file": "dune_house_oil_on_canvas_24x20-298x298.jpg",
          "bytes": "21627",
          "v": null,
          "clientip": "77.99.79.189",
          "root": "wp-content",
          "action": null,
          "user": "-",
          "status": "200"
        }
      }
    ]
  }
}

No further fixes are needed on the ML side, as the ML job now runs successfully.

cc: @droberts195 and @quux00
