
Exceeding the maximum doc count of a shard fails the shard #51136

Closed
DaveCTurner opened this issue Jan 17, 2020 · 4 comments · Fixed by #63273
Labels: >bug, :Distributed Indexing/Engine, Team:Distributed (Obsolete), v7.5.0

Comments

DaveCTurner (Contributor) commented Jan 17, 2020

Today (7.5.0), if we try to index a document into a shard that already contains 2147483519 documents, the operation is rejected by Lucene and a no-op is written to the translog to record the failure. However, since #30226 we also record no-ops themselves as documents in Lucene; the index is already full, so we fail to add the tombstone too. The failure to add this tombstone is fatal to the shard:

[2020-01-17T09:20:02,405][WARN ][o.e.i.c.IndicesClusterStateService] [node-0] [i][0] marking and sending shard failed due to [shard failure, reason [no-op origin[PRIMARY] seq#[2147483519] failed at document level]]
java.lang.IllegalArgumentException: number of documents in the index cannot exceed 2147483519
        at org.apache.lucene.index.DocumentsWriterPerThread.reserveOneDoc(DocumentsWriterPerThread.java:225) ~[lucene-core-8.3.0.jar:8.3.0 2aa586909b911e66e1d8863aa89f173d69f86cd2 - ishan - 2019-10-25 23:10:03]
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234) ~[lucene-core-8.3.0.jar:8.3.0 2aa586909b911e66e1d8863aa89f173d69f86cd2 - ishan - 2019-10-25 23:10:03]
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:495) ~[lucene-core-8.3.0.jar:8.3.0 2aa586909b911e66e1d8863aa89f173d69f86cd2 - ishan - 2019-10-25 23:10:03]
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594) ~[lucene-core-8.3.0.jar:8.3.0 2aa586909b911e66e1d8863aa89f173d69f86cd2 - ishan - 2019-10-25 23:10:03]
        at org.apache.lucene.index.IndexWriter.addDocument(IndexWriter.java:1213) ~[lucene-core-8.3.0.jar:8.3.0 2aa586909b911e66e1d8863aa89f173d69f86cd2 - ishan - 2019-10-25 23:10:03]
        at org.elasticsearch.index.engine.InternalEngine.innerNoOp(InternalEngine.java:1533) ~[elasticsearch-7.5.0.jar:7.5.0]
        at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:937) ~[elasticsearch-7.5.0.jar:7.5.0]
        at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:796) ~[elasticsearch-7.5.0.jar:7.5.0]
        at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:768) ~[elasticsearch-7.5.0.jar:7.5.0]
        at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnPrimary(IndexShard.java:725) ~[elasticsearch-7.5.0.jar:7.5.0]
        at org.elasticsearch.action.bulk.TransportShardBulkAction.executeBulkItemRequest(TransportShardBulkAction.java:258) ~[elasticsearch-7.5.0.jar:7.5.0]
        at org.elasticsearch.action.bulk.TransportShardBulkAction$2.doRun(TransportShardBulkAction.java:161) ~[elasticsearch-7.5.0.jar:7.5.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.5.0.jar:7.5.0]
        at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:193) ~[elasticsearch-7.5.0.jar:7.5.0]
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:118) ~[elasticsearch-7.5.0.jar:7.5.0]
        at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:79) ~[elasticsearch-7.5.0.jar:7.5.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:917) ~[elasticsearch-7.5.0.jar:7.5.0]
        at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:108) ~[elasticsearch-7.5.0.jar:7.5.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.runWithPrimaryShardReference(TransportReplicationAction.java:394) ~[elasticsearch-7.5.0.jar:7.5.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.lambda$doRun$0(TransportReplicationAction.java:316) ~[elasticsearch-7.5.0.jar:7.5.0]
        at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:63) ~[elasticsearch-7.5.0.jar:7.5.0]
        at org.elasticsearch.index.shard.IndexShard.lambda$wrapPrimaryOperationPermitListener$21(IndexShard.java:2752) ~[elasticsearch-7.5.0.jar:7.5.0]
        at org.elasticsearch.action.ActionListener$3.onResponse(ActionListener.java:113) ~[elasticsearch-7.5.0.jar:7.5.0]
        at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:285) ~[elasticsearch-7.5.0.jar:7.5.0]
        at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:237) ~[elasticsearch-7.5.0.jar:7.5.0]
        at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2726) ~[elasticsearch-7.5.0.jar:7.5.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryOperationPermit(TransportReplicationAction.java:858) ~[elasticsearch-7.5.0.jar:7.5.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:312) ~[elasticsearch-7.5.0.jar:7.5.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.5.0.jar:7.5.0]
        at org.elasticsearch.action.support.replication.TransportReplicationAction.handlePrimaryRequest(TransportReplicationAction.java:275) ~[elasticsearch-7.5.0.jar:7.5.0]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler$1.doRun(SecurityServerTransportInterceptor.java:257) ~[?:?]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.5.0.jar:7.5.0]
        at org.elasticsearch.xpack.security.transport.SecurityServerTransportInterceptor$ProfileSecuredRequestHandler.messageReceived(SecurityServerTransportInterceptor.java:315) ~[?:?]
        at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:63) ~[elasticsearch-7.5.0.jar:7.5.0]
        at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:752) ~[elasticsearch-7.5.0.jar:7.5.0]
        at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:773) ~[elasticsearch-7.5.0.jar:7.5.0]
        at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-7.5.0.jar:7.5.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
        at java.lang.Thread.run(Thread.java:830) [?:?]

However, we immediately restart the shard:

[2020-01-17T09:20:02,408][INFO ][o.e.c.r.a.AllocationService] [node-0] Cluster health status changed from [GREEN] to [RED] (reason: [shards failed [[i][0]]]).
[2020-01-17T09:20:02,963][INFO ][o.e.c.r.a.AllocationService] [node-0] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[i][0]]]).

I think this is ok - the operation fails before it makes it to the translog so there's nothing to replay - but it would be good to confirm that there's no risk we do something bad (e.g. leak a seqno) here.
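
For context, the 2147483519 figure is not something Elasticsearch chose but Lucene's own hard cap, exposed as `IndexWriter.MAX_DOCS` (`Integer.MAX_VALUE - 128`). Here is a minimal standalone sketch (plain Lucene, not Elasticsearch code) showing where the limit lives; `addDocument` is the call that throws the `IllegalArgumentException` in the trace above once the index is at capacity:

```java
// Standalone Lucene sketch (not Elasticsearch code) illustrating the limit.
// The 2147483519 cap is IndexWriter.MAX_DOCS = Integer.MAX_VALUE - 128; once
// the live doc count reaches it, addDocument() throws the same
// IllegalArgumentException seen in the stack trace above.
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.ByteBuffersDirectory;

public class MaxDocsDemo {
    public static void main(String[] args) throws Exception {
        System.out.println("Lucene hard limit: " + IndexWriter.MAX_DOCS); // 2147483519

        try (IndexWriter writer = new IndexWriter(
                new ByteBuffersDirectory(), new IndexWriterConfig(new StandardAnalyzer()))) {
            Document doc = new Document();
            doc.add(new StringField("_id", "1", Field.Store.YES));
            // Succeeds while the index is under MAX_DOCS; at the limit this
            // call throws "number of documents in the index cannot exceed
            // 2147483519", which is also why the no-op tombstone cannot be added.
            writer.addDocument(doc);
            writer.commit();
        }
    }
}
```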

DaveCTurner added the >bug and :Distributed Indexing/Engine labels on Jan 17, 2020
elasticmachine (Collaborator)

Pinging @elastic/es-distributed (:Distributed/Engine)

ywelsch (Contributor) commented Jan 24, 2020

While what the implementation does here is very harsh, it's currently the only way to guarantee that sequence numbers are not leaked. This is a particularly bad kind of event, and since the primary and replicas can have different doc counts (due to deletes), it can hit either copy at an arbitrary point. On the primary, we could start rejecting requests at an earlier point (based on some kind of pre-flight check, before generating the sequence number). On the replica, we have no choice other than to fail the copy, otherwise it would be out of sync with the primary.
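
To make the pre-flight idea concrete, a rough sketch of what such a check on the primary could look like. All names here (`MAX_ALLOWED_DOCS`, `ensureCapacity`, the headroom value) are illustrative, not the actual Elasticsearch code nor necessarily how the change that later landed in #63273 works:

```java
// Hypothetical sketch of a primary-side pre-flight check -- illustrative only,
// not the actual Elasticsearch implementation.
final class PreflightDocCountCheck {

    // Stay below Lucene's IndexWriter.MAX_DOCS (2147483519), leaving some
    // headroom (the 1000 here is an arbitrary illustrative value) so that
    // internal writes such as no-op tombstones can still be indexed.
    private static final long MAX_ALLOWED_DOCS = 2_147_483_519L - 1_000;

    /**
     * Called on the primary before a sequence number is generated. If the shard
     * is already at capacity, the write is rejected up front instead of letting
     * Lucene throw after the seqno has been assigned, which is what fails the
     * shard today.
     */
    static void ensureCapacity(long currentDocCount, int docsInRequest) {
        if (currentDocCount + docsInRequest > MAX_ALLOWED_DOCS) {
            throw new IllegalArgumentException(
                "rejecting write: shard already holds " + currentDocCount
                    + " docs; limit is " + MAX_ALLOWED_DOCS);
        }
    }
}
```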

rjernst added the Team:Distributed (Obsolete) label on May 4, 2020
JalehD commented Aug 13, 2020

I have seen this issue on a cloud cluster.

In this case, the cluster kept flipping to RED as more and more data was sent to the affected index. The fact that the index hitting the issue was named functionbeat-7.5.0-2020.01.30-000001 suggests we may need better prevention of this issue within ILM policies.

Error:

[instance-0000000012] failing shard [failed shard, shard [functionbeat-7.5.0-2020.01.30-000001][0], node[PmZKd7TcSd-uh1KVX_uOYA], [P], s[STARTED], a[id=Dsmath-YQnySjr66fx7WUQ], message [shard failure, reason [no-op origin[PRIMARY] seq#[2391692890] failed at document level]], failure [IllegalArgumentException[number of documents in the index cannot exceed 2147483519]], markAsStale [true]]
java.lang.IllegalArgumentException: number of documents in the index cannot exceed 2147483519
        at org.apache.lucene.index.DocumentsWriterPerThread.reserveOneDoc(DocumentsWriterPerThread.java:225) ~[lucene-core-8.3.0.jar:8.3.0 2aa586909b911e66e1d8863aa89f173d69f86cd2 - ishan - 2019-10-25 23:10:03]
        at org.apache.lucene.index.DocumentsWriterPerThread.updateDocument(DocumentsWriterPerThread.java:234) ~[lucene-core-8.3.0.jar:8.3.0 2aa586909b911e66e1d8863aa89f173d69f86cd2 - ishan - 2019-10-25 23:10:03]
        at org.apache.lucene.index.DocumentsWriter.updateDocument(DocumentsWriter.java:495) ~[lucene-core-8.3.0.jar:8.3.0 2aa586909b911e66e1d8863aa89f173d69f86cd2 - ishan - 2019-10-25 23:10:03]
        at org.apache.lucene.index.IndexWriter.updateDocument(IndexWriter.java:1594) ~[lucene-core-8.3.0.jar:8.3.0 2aa586909b911e66e1d8863aa89f173d69f86cd2 - ishan - 2019-10-25 23:10:03]

dnhatn (Member) commented Oct 5, 2020

I've opened #63273.

dnhatn added a commit that referenced this issue Oct 13, 2020
Today indexing to a shard with 2147483519 documents will fail that 
shard. We should check the number of documents and reject the write
requests instead.

Closes #51136
dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Oct 13, 2020
Today indexing to a shard with 2147483519 documents will fail that
shard. We should check the number of documents and reject the write
requests instead.

Closes elastic#51136
dnhatn added a commit to dnhatn/elasticsearch that referenced this issue Oct 13, 2020
Today indexing to a shard with 2147483519 documents will fail that
shard. We should check the number of documents and reject the write
requests instead.

Closes elastic#51136
dnhatn added a commit that referenced this issue Oct 13, 2020
Today indexing to a shard with 2147483519 documents will fail that
shard. We should check the number of documents and reject the write
requests instead.

Closes #51136
dnhatn added a commit that referenced this issue Oct 13, 2020
Today indexing to a shard with 2147483519 documents will fail that
shard. We should check the number of documents and reject the write
requests instead.

Closes #51136