
Internal action like shard started/failure should not trigger circuit breaker #92783

Closed
xiaoyuan0821 opened this issue Jan 10, 2023 · 1 comment
Labels
>bug needs:triage Requires assignment of a team area label

Comments

@xiaoyuan0821

Elasticsearch Version

7.x

Installed Plugins

No response

Java Version

bundled

OS Version

Linux HOSTNAME 3.10.0-1160.31.1.el7.x86_64 #1 SMP Thu Jun 10 13:30:10 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

Problem Description

I have a cluster with shards stuck in the initializing state:

{
  "cluster_name" : "prod-es",
  "status" : "yellow",
  "timed_out" : false,
  "number_of_nodes" : 3,
  "number_of_data_nodes" : 3,
  "active_primary_shards" : 33612,
  "active_shards" : 61211,
  "relocating_shards" : 0,
  "initializing_shards" : 41,
  "unassigned_shards" : 5961,
  "delayed_unassigned_shards" : 0,
  "number_of_pending_tasks" : 0,
  "number_of_in_flight_fetch" : 0,
  "task_max_waiting_in_queue_millis" : 0,
  "active_shards_percent_as_number" : 91.07017987591686
}
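
As a side note, the reported percentage is consistent with the shard counts above. A quick sketch (plain Python, using only the numbers from the health response):

```python
# Recompute active_shards_percent_as_number from the health output above.
active = 61211
initializing = 41
unassigned = 5961

total = active + initializing + unassigned  # 67213 shards tracked by the cluster
percent = active / total * 100

print(percent)  # matches the reported 91.07017987591686
```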

However, the cat recovery API returns nothing:

curl -X GET  http://192.168.0.208:9200/_cat/recovery?active_only=true
# empty response

The node log shows that the shard-started action is rejected by the master node because of the circuit breaker:

[2023-01-10T07:45:10,145][WARN ][o.e.c.a.s.ShardStateAction] [prod-es-ess-esn-2-1] unexpected failure while sending request [internal:cluster/shard/started] to [{prod-es-ess-esn-3-1}{J3pDbQD9TRayTunq585UDg}{WSW5HgtPSiGCzC2_PONNJg}{192.168.0.164}{192.168.0.164:9300}{dimr}] for shard entry [StartedShardEntry{shardId [[2059-bpmbussprojectdmg_209-bpmm209_bpdmgmodel_219_log_replica_sdm_archive_es-20221216093814][2]], allocationId [HzHB6jpSSKajaJK5De3kNw], primary term [3], message [after peer recovery]}]
org.elasticsearch.transport.RemoteTransportException: [prod-es-ess-esn-3-1][192.168.0.164:9300][internal:cluster/shard/started]
Caused by: org.elasticsearch.common.breaker.CircuitBreakingException: [parent] Data too large, data for [internal:cluster/shard/started] would be [8172567858/7.6gb], which is larger than the limit of [8160437862/7.5gb], real usage: [8172567536/7.6gb], new bytes reserved: [322/322b], usages [request=0/0b, fielddata=0/0b, in_flight_requests=322/322b, accounting=25496308/24.3mb]
	at org.elasticsearch.indices.breaker.HierarchyCircuitBreakerService.checkParentLimit(HierarchyCircuitBreakerService.java:364) ~[elasticsearch-7.10.2.jar:7.10.2]
	at org.elasticsearch.common.breaker.ChildMemoryCircuitBreaker.addEstimateBytesAndMaybeBreak(ChildMemoryCircuitBreaker.java:109) ~[elasticsearch-7.10.2.jar:7.10.2]
	at org.elasticsearch.transport.InboundAggregator.checkBreaker(InboundAggregator.java:211) ~[elasticsearch-7.10.2.jar:7.10.2]
	at org.elasticsearch.transport.InboundAggregator.finishAggregation(InboundAggregator.java:120) ~[elasticsearch-7.10.2.jar:7.10.2]
	at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:140) ~[elasticsearch-7.10.2.jar:7.10.2]
	at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:117) ~[elasticsearch-7.10.2.jar:7.10.2]
	at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:82) ~[elasticsearch-7.10.2.jar:7.10.2]
	at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:77) ~[?:?]
	at org.elasticsearch.transport.netty4.Netty4HeartBeatChannelHandler.channelRead(Netty4HeartBeatChannelHandler.java:40) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.handler.logging.LoggingHandler.channelRead(LoggingHandler.java:280) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.handler.ssl.SslHandler.unwrap(SslHandler.java:1371) ~[?:?]
	at io.netty.handler.ssl.SslHandler.decodeJdkCompatible(SslHandler.java:1234) ~[?:?]
	at io.netty.handler.ssl.SslHandler.decode(SslHandler.java:1283) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.decodeRemovalReentryProtection(ByteToMessageDecoder.java:507) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:446) ~[?:?]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:276) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:365) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:357) ~[?:?]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410) ~[?:?]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:379) ~[?:?]

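For reference, the parent breaker check that fires here boils down to comparing current real memory usage plus the newly reserved bytes against the configured limit. A minimal sketch (not the actual `HierarchyCircuitBreakerService` code), using the figures from the log line above:

```python
# Sketch of the parent circuit breaker decision, with values from the log:
# real usage 8172567536 bytes, new bytes reserved 322, limit 8160437862 bytes.
def would_trip(real_usage_bytes: int, new_bytes: int, limit_bytes: int) -> bool:
    """Return True if reserving new_bytes would push usage past the parent limit."""
    total = real_usage_bytes + new_bytes  # the "would be" value in the message
    return total > limit_bytes

real_usage = 8172567536
new_bytes = 322
limit = 8160437862

print(real_usage + new_bytes)                     # 8172567858, the "would be" figure
print(would_trip(real_usage, new_bytes, limit))   # True: the request is rejected
```

Note that the request itself is tiny (322 bytes); the breaker trips because the node's real heap usage is already over the limit, so any inbound message is rejected.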
I have to call the reroute API with `retry_failed=true` to get shard initialization going again, but after a while it gets stuck once more...

I think internal actions like shard started/failed should not trip the circuit breaker. We should change this code https://github.com/elastic/elasticsearch/blob/main/server/src/main/java/org/elasticsearch/cluster/action/shard/ShardStateAction.java#L100 to:

        // forceExecution=false, canTripCircuitBreaker=false
        transportService.registerRequestHandler(SHARD_STARTED_ACTION_NAME, ThreadPool.Names.SAME, false, false, StartedShardEntry::new,
            new ShardStartedTransportHandler(clusterService,
                new ShardStartedClusterStateTaskExecutor(allocationService, rerouteService, () -> followUpRerouteTaskPriority, logger),
                logger));
        // forceExecution=false, canTripCircuitBreaker=false
        transportService.registerRequestHandler(SHARD_FAILED_ACTION_NAME, ThreadPool.Names.SAME, false, false, FailedShardEntry::new,
            new ShardFailedTransportHandler(clusterService,
                new ShardFailedClusterStateTaskExecutor(allocationService, rerouteService, () -> followUpRerouteTaskPriority, logger),
                logger));
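
If I read the `registerRequestHandler` signature correctly, the fourth boolean is `canTripCircuitBreaker`. The intended effect of passing `false` there can be sketched as follows (illustrative Python, not Elasticsearch code; the action names and dispatch model are simplified assumptions):

```python
# Illustrative model: handlers registered with can_trip=False would bypass the
# inbound breaker check, so internal actions get through even under memory pressure.
def handle_inbound(action: str, registry: dict, breaker_would_trip: bool) -> str:
    """Deliver or reject an inbound transport message based on the handler's flag."""
    can_trip = registry[action]
    if can_trip and breaker_would_trip:
        return "rejected: CircuitBreakingException"
    return "delivered"

registry = {
    "internal:cluster/shard/started": False,  # proposed: exempt from the breaker
    "indices:data/write/bulk": True,          # ordinary actions stay breaker-checked
}

# Under memory pressure (the breaker would trip):
print(handle_inbound("internal:cluster/shard/started", registry, True))  # delivered
print(handle_inbound("indices:data/write/bulk", registry, True))         # rejected
```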

Steps to Reproduce

A cluster with high JVM heap usage can reproduce this issue.

Logs (if relevant)

No response

@xiaoyuan0821 xiaoyuan0821 added >bug needs:triage Requires assignment of a team area label labels Jan 10, 2023
@DaveCTurner
Contributor

Circuit-breaking on these messages is the correct behaviour: if the master is overloaded, we want to push back on the rest of the cluster. If we didn't, the master would just go OOM.

Moreover, quoting the bug report form:

Please also check your OS is supported, and that the version of Elasticsearch has not passed end-of-life. If you are using an unsupported OS or an unsupported version then the issue is likely to be closed.

You are using 7.10.2 which is long past EOL, and newer versions are much less likely to circuit-break on the master (see e.g. #77466). Therefore I am closing this.
