
[ML] Anomaly detection job throws errors when not all remote clusters have source data #100311

Closed
wwang500 opened this issue Oct 5, 2023 · 3 comments
Labels
>bug · :ml (Machine learning) · Team:ML (Meta label for the ML team)

Comments


wwang500 commented Oct 5, 2023

Stack Version:

8.10.2

Error:

[instance-0000000034] [ccs_using_wildcard_for_both_clusters] error while extracting data
org.elasticsearch.ResourceNotFoundException: [3] remote clusters out of [4] were skipped when performing datafeed search
	at org.elasticsearch.xpack.core.ml.datafeed.extractor.DataExtractor.checkForSkippedClusters(DataExtractor.java:61) ~[?:?]
	at org.elasticsearch.xpack.ml.datafeed.extractor.chunked.ChunkedDataExtractor.executeSearchRequest(ChunkedDataExtractor.java:150) ~[?:?]
	at org.elasticsearch.xpack.ml.datafeed.extractor.chunked.ChunkedDataExtractor$DataSummaryFactory.newAggregatedDataSummary(ChunkedDataExtractor.java:248) ~[?:?]
	at org.elasticsearch.xpack.ml.datafeed.extractor.chunked.ChunkedDataExtractor$DataSummaryFactory.buildDataSummary(ChunkedDataExtractor.java:221) ~[?:?]
	at org.elasticsearch.xpack.ml.datafeed.extractor.chunked.ChunkedDataExtractor.setUpChunkedSearch(ChunkedDataExtractor.java:123) ~[?:?]
	at org.elasticsearch.xpack.ml.datafeed.extractor.chunked.ChunkedDataExtractor.next(ChunkedDataExtractor.java:116) ~[?:?]
	at org.elasticsearch.xpack.ml.datafeed.DatafeedJob.run(DatafeedJob.java:376) ~[?:?]
	at org.elasticsearch.xpack.ml.datafeed.DatafeedJob.runRealtime(DatafeedJob.java:226) ~[?:?]
	at org.elasticsearch.xpack.ml.datafeed.DatafeedRunner$Holder.executeRealTime(DatafeedRunner.java:561) ~[?:?]
	at org.elasticsearch.xpack.ml.datafeed.DatafeedRunner$3.doRun(DatafeedRunner.java:307) ~[?:?]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983) ~[elasticsearch-8.10.2.jar:?]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26) ~[elasticsearch-8.10.2.jar:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) ~[?:?]
	at java.lang.Thread.run(Thread.java:1623) ~[?:?]

Possible root cause:

#97731

Steps to reproduce:

  1. Deploy three clusters:
     • main_cluster
     • remote_cluster_1
     • remote_cluster_2
  2. From main_cluster, set up remote cluster connections to remote_cluster_1 and remote_cluster_2.
  3. In remote_cluster_1, load some data; in my case I have a dataset named gallery-2023.
  4. In main_cluster, create an anomaly detection job whose datafeed is configured with:
"indices": [
      "*:gallery-*"
    ]
  5. Start the AD job (a minimal sketch of these requests follows this list).
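
For reference, a minimal sketch of steps 2–5, assuming a @timestamp time field and a simple count detector; the datafeed ID, seed addresses, and analysis config below are illustrative, not the exact ones used in the test:

PUT _cluster/settings
{
  "persistent": {
    "cluster.remote.remote_cluster_1.seeds": ["remote1.example.com:9300"],
    "cluster.remote.remote_cluster_2.seeds": ["remote2.example.com:9300"]
  }
}

PUT _ml/anomaly_detectors/ccs_using_wildcard_for_both_clusters
{
  "analysis_config": {
    "bucket_span": "15m",
    "detectors": [ { "function": "count" } ]
  },
  "data_description": { "time_field": "@timestamp" }
}

PUT _ml/datafeeds/datafeed-ccs_using_wildcard_for_both_clusters
{
  "job_id": "ccs_using_wildcard_for_both_clusters",
  "indices": [ "*:gallery-*" ]
}

POST _ml/datafeeds/datafeed-ccs_using_wildcard_for_both_clusters/_start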

Observed:

The AD job is in the started state, but it didn't process any data, and in the job messages there is an error: Datafeed is encountering errors extracting data: [3] remote clusters out of [4] were skipped when performing datafeed search
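
The job and datafeed state can be checked from main_cluster; the IDs below are assumed to match the job name in the log above:

GET _ml/anomaly_detectors/ccs_using_wildcard_for_both_clusters/_stats
GET _ml/datafeeds/datafeed-ccs_using_wildcard_for_both_clusters/_stats

The extraction error itself was observed in the job messages (the Job Messages tab in Kibana).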

Note:
With the same configuration, the job was working fine on stack version 8.9.0.

  • In 8.9.0, for this setup (multiple remote clusters, only one cluster has the gallery data), the search GET *:gallery-*/_search returns:
"_clusters": {
    "total": 2,
    "successful": 2,
    "skipped": 0
  },
  • In 8.10.x, the same search GET *:gallery-*/_search returns:
"_clusters": {
    "total": 2,
    "successful": 1,
    "skipped": 1,
    "details": {
      "remote_cluster_1": {
        "status": "successful",
        "indices": "gallery-*",
        "took": 19,
        "timed_out": false,
        "_shards": {
          "total": 126,
          "successful": 126,
          "skipped": 0,
          "failed": 0
        }
      },
      "remote_cluster_2": {
        "status": "failed",
        "indices": "gallery-*",
        "took": 2,
        "timed_out": false,
        "_shards": {
          "total": 0,
          "successful": 0,
          "skipped": 0,
          "failed": 0
        }
      }
    }
  },
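
To confirm that both remotes are connected and that only remote_cluster_1 holds indices matching the wildcard, the remote info and resolve index APIs can be used (a sketch; these calls are not part of the original report):

GET _remote/info

GET _resolve/index/*:gallery-*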
elasticsearchmachine (Collaborator) commented

Pinging @elastic/ml-core (Team:ML)

droberts195 (Contributor) commented

Setting ccs_minimize_roundtrips=false is a possible workaround for this problem.

Hopefully #100354 will fix it. We should retest once that is merged and then assess whether any further fixes on top are required in the ML datafeed code.
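
For anyone hitting this before the fix lands, the workaround can be verified on a manual cross-cluster search like the one above (a sketch only; whether the datafeed's own searches can be configured this way is not covered in this thread):

GET *:gallery-*/_search?ccs_minimize_roundtrips=false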

wwang500 (Author) commented

> Setting ccs_minimize_roundtrips=false is a possible workaround for this problem.
>
> Hopefully #100354 will fix it. We should retest once that is merged and then assess whether any further fixes on top are required in the ML datafeed code.

After the PR #100354 was merged, I retested using the latest 8.12.0-SNAPSHOT build and can confirm this issue is fixed.

Now, when I run GET *:gallery-*/_search?size=1 (with and without the Skip if unavailable option enabled for the remote clusters), I get this:

{
  "took": 6,
  "timed_out": false,
  "num_reduce_phases": 3,
  "_shards": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "failed": 0
  },
  "_clusters": {
    "total": 2,
    "successful": 2,
    "skipped": 0,
    "running": 0,
    "partial": 0,
    "failed": 0,
    "details": {
      "ccs_1": {
        "status": "successful",
        "indices": "gallery-*",
        "took": 3,
        "timed_out": false,
        "_shards": {
          "total": 2,
          "successful": 2,
          "skipped": 0,
          "failed": 0
        }
      },
      "ccs_2": {
        "status": "successful",
        "indices": "gallery-*",
        "took": 0,
        "timed_out": false,
        "_shards": {
          "total": 0,
          "successful": 0,
          "skipped": 0,
          "failed": 0
        }
      }
    }
  },
  "hits": {
    "total": {
      "value": 10000,
      "relation": "gte"
    },
    "max_score": 1,
    "hits": [
      {
        "_index": "ccs_1:gallery-2023-10",
        "_id": "AVvTrgje9UgloJtm0xiN",
        "_score": 1,
        "_source": {
          "referer": "http://www.galleryjasminewhite.com/",
          "ver": null,
          "referer_domain": "http://www.galleryjasminewhite.com",
          "method": "GET",
          "useragent": "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:5.0) Gecko/20100101 Firefox/5.0",
          "uri": "/wp-content/uploads/2013/06/dune_house_oil_on_canvas_24x20-298x298.jpg",
          "version": "HTTP/1.1",
          "@timestamp": "2023-10-10T14:14:14.000Z",
          "file": "dune_house_oil_on_canvas_24x20-298x298.jpg",
          "bytes": "21627",
          "v": null,
          "clientip": "77.99.79.189",
          "root": "wp-content",
          "action": null,
          "user": "-",
          "status": "200"
        }
      }
    ]
  }
}

No further fixes are needed on the ML side, as the ML job now runs successfully.

cc: @droberts195 and @quux00
