Elasticsearch 6.2.2 nodes crash after reaching ulimit setting #29097

redlus · 2018-03-15T19:31:15Z

Hi,

We're experiencing a critical production issue in elasticsearch 6.2.2 related to open_file_descriptors. The cluster is an exact replica (as much as possible) of a 5.2.2 cluster, and documents are indexed into both clusters in parallel.
While the indexing performance of the new cluster seems to be at least as good as the 5.2.2 cluster, the new nodes open_file_descriptors is reaching record-breaking levels (especially when compared to v5.2.2).

Elasticsearch version
6.2.2

Plugins installed:
ingest-attachment
ingest-geoip
mapper-murmur3
mapper-size
repository-azure
repository-gcs
repository-s3

JVM version:
openjdk version "1.8.0_151"
OpenJDK Runtime Environment (build 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12)
OpenJDK 64-Bit Server VM (build 25.151-b12, mixed mode)

OS version:
Linux 4.13.0-1011-azure #14-Ubuntu SMP 2018 x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

All machines have ulimit of 65536, as recommended by the official documentation.
All nodes on the v5.2.2 cluster have up to 4,500 open_file_descripors, while the new v6.2.2 nodes are divided: some have up to 4,500 open_file_descriptors, while others consistently open more and more file descriptors until reaching the limit and crashing with java.nio.file.FileSystemException: Too many open files -

[WARN ][o.e.c.a.s.ShardStateAction] [prod-elasticsearch-master-002] [newlogs_20180315-01][0] received shard failed for shard id [[newlogs_20180315-01][0]], allocation id [G8NGOPNHRNuqNKYKzfiPcg], primary term [0], message [shard failure, reason [already closed by tragic event on the translog]], failure [FileSystemException[/mnt/nodes/0/indices/nIgarkzwRwe0DmT-nmLhvg/0/translog/translog.ckp: Too many open files]]
java.nio.file.FileSystemException: /mnt/nodes/0/indices/nIgarkzwRwe0DmT-nmLhvg/0/translog/translog.ckp: Too many open files
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:?]
at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:177) ~[?:?]
at java.nio.channels.FileChannel.open(FileChannel.java:287) ~[?:1.8.0_151]
at java.nio.channels.FileChannel.open(FileChannel.java:335) ~[?:1.8.0_151]
at org.elasticsearch.index.translog.Checkpoint.write(Checkpoint.java:179) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.translog.TranslogWriter.writeCheckpoint(TranslogWriter.java:419) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.translog.TranslogWriter.syncUpTo(TranslogWriter.java:362) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.translog.Translog.ensureSynced(Translog.java:699) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.translog.Translog.ensureSynced(Translog.java:724) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShard$3.write(IndexShard.java:2336) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.processList(AsyncIOProcessor.java:107) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.drainAndProcess(AsyncIOProcessor.java:99) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.AsyncIOProcessor.put(AsyncIOProcessor.java:82) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShard.sync(IndexShard.java:2358) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportWriteAction$AsyncAfterWriteAction.run(TransportWriteAction.java:362) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportWriteAction$WritePrimaryResult.(TransportWriteAction.java:160) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnPrimary(TransportShardBulkAction.java:133) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:110) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnPrimary(TransportShardBulkAction.java:72) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:1034) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryShardReference.perform(TransportReplicationAction.java:1012) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.ReplicationOperation.execute(ReplicationOperation.java:103) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:359) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.onResponse(TransportReplicationAction.java:299) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:975) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$1.onResponse(TransportReplicationAction.java:972) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:238) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShard.acquirePrimaryOperationPermit(IndexShard.java:2220) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction.acquirePrimaryShardReference(TransportReplicationAction.java:984) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction.access$500(TransportReplicationAction.java:98) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncPrimaryAction.doRun(TransportReplicationAction.java:320) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:295) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$PrimaryOperationTransportHandler.messageReceived(TransportReplicationAction.java:282) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.transport.TransportService$7.doRun(TransportService.java:656) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.2.jar:6.2.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_151]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_151]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]

After this exception, some of the nodes throw many exceptions and reduce the number of open file descriptors. Other times they just crash. The issue repeats itself with interleaving nodes.

Steps to reproduce:
None yet. This is a production-scale issue, so it is not easy to reproduce.

I'd be happy to provide additional details, whatever is needed.

Thanks!

jasontedor · 2018-03-16T01:23:16Z

Would you please take the PID of an Elasticsearch process and as the number of open file descriptors grows capture the output of ls -al /proc/<PID>/fd and send the output to me privately (first name at elastic.co)? You might want to repeat this a few times as it grows and grows so we can diff and easily see what it is growing on.

When you say the node is crashing, do you have logs for this?

elasticmachine · 2018-03-16T01:24:11Z

Pinging @elastic/es-core-infra

redlus · 2018-03-16T02:26:04Z

Hi Jason,

I've sent you several files with the output of ls -al /proc/<PID>/fd on timestamps along 17 minutes, during which the open file descriptors increased from ~99k to ~105k.

Nodes reaching the ulimit actually crash the process. Here are the last few logs of a node which crashed:

[2018-03-14T03:27:47,930][WARN ][o.e.i.e.Engine ] [prod-elasticsearch-data-028] [newlogs_20180314-01][1] failed to read latest segment infos on flush
java.nio.file.FileSystemException: /mnt/nodes/0/indices/-mwGJUcDRXKRtezeXPaZSA/1/index: Too many open files
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:?]
at sun.nio.fs.UnixFileSystemProvider.newDirectoryStream(UnixFileSystemProvider.java:427) ~[?:?]
at java.nio.file.Files.newDirectoryStream(Files.java:457) ~[?:1.8.0_151]
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:215) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
at org.apache.lucene.store.FSDirectory.listAll(FSDirectory.java:234) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
at org.apache.lucene.store.FilterDirectory.listAll(FilterDirectory.java:57) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:655) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:627) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
at org.apache.lucene.index.SegmentInfos.readLatestCommit(SegmentInfos.java:434) ~[lucene-core-7.2.1.jar:7.2.1 b2b6438b37073bee1fca40374e85bf91aa457c0b - ubuntu - 2018-01-10 00:48:43]
at org.elasticsearch.common.lucene.Lucene.readSegmentInfos(Lucene.java:123) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.store.Store.readSegmentsInfo(Store.java:202) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.store.Store.readLastCommittedSegmentsInfo(Store.java:187) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.engine.InternalEngine.refreshLastCommittedSegmentInfos(InternalEngine.java:1599) [elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:1570) [elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShard.flush(IndexShard.java:1030) [elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShard$4.doRun(IndexShard.java:2403) [elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) [elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.2.2.jar:6.2.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
[2018-03-14T03:27:48,040][WARN ][o.e.i.s.IndexShard ] [prod-elasticsearch-data-028] [newlogs_20180314-01][1] failed to flush index
org.elasticsearch.index.engine.FlushFailedEngineException: Flush failed
at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:1568) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShard.flush(IndexShard.java:1030) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShard$4.doRun(IndexShard.java:2403) [elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) [elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-6.2.2.jar:6.2.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_151]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_151]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
Caused by: java.nio.file.FileSystemException: /mnt/nodes/0/indices/-mwGJUcDRXKRtezeXPaZSA/1/translog/translog-24471.ckp: Too many open files
at sun.nio.fs.UnixException.translateToIOException(UnixException.java:91) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:102) ~[?:?]
at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:107) ~[?:?]
at sun.nio.fs.UnixCopyFile.copyFile(UnixCopyFile.java:243) ~[?:?]
at sun.nio.fs.UnixCopyFile.copy(UnixCopyFile.java:581) ~[?:?]
at sun.nio.fs.UnixFileSystemProvider.copy(UnixFileSystemProvider.java:253) ~[?:?]
at java.nio.file.Files.copy(Files.java:1274) ~[?:1.8.0_151]
at org.elasticsearch.index.translog.Translog.rollGeneration(Translog.java:1539) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.engine.InternalEngine.flush(InternalEngine.java:1560) ~[elasticsearch-6.2.2.jar:6.2.2]
... 7 more
[2018-03-14T03:27:48,398][ERROR][o.e.b.ElasticsearchUncaughtExceptionHandler] [prod-elasticsearch-data-028] fatal error in thread [elasticsearch[prod-elasticsearch-data-028][bulk][T#2]], exiting
java.lang.AssertionError: Unexpected AlreadyClosedException
at org.elasticsearch.index.engine.InternalEngine.failOnTragicEvent(InternalEngine.java:1779) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.engine.InternalEngine.maybeFailEngine(InternalEngine.java:1796) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:904) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShard.index(IndexShard.java:738) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShard.applyIndexOperation(IndexShard.java:707) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShard.applyIndexOperationOnReplica(IndexShard.java:680) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.bulk.TransportShardBulkAction.performOpOnReplica(TransportShardBulkAction.java:518) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.bulk.TransportShardBulkAction.performOnReplica(TransportShardBulkAction.java:480) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:466) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.bulk.TransportShardBulkAction.shardOperationOnReplica(TransportShardBulkAction.java:72) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:567) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.onResponse(TransportReplicationAction.java:530) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShard$2.onResponse(IndexShard.java:2314) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShard$2.onResponse(IndexShard.java:2292) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShardOperationPermits.acquire(IndexShardOperationPermits.java:238) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.shard.IndexShard.acquireReplicaOperationPermit(IndexShard.java:2291) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$AsyncReplicaAction.doRun(TransportReplicationAction.java:641) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:513) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.action.support.replication.TransportReplicationAction$ReplicaOperationTransportHandler.messageReceived(TransportReplicationAction.java:493) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:66) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.transport.TcpTransport$RequestHandler.doRun(TcpTransport.java:1555) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:672) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.2.2.jar:6.2.2]
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) ~[?:1.8.0_151]
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) ~[?:1.8.0_151]
at java.lang.Thread.run(Thread.java:748) [?:1.8.0_151]
Caused by: org.apache.lucene.store.AlreadyClosedException: translog is already closed
at org.elasticsearch.index.translog.Translog.ensureOpen(Translog.java:1663) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.translog.Translog.add(Translog.java:504) ~[elasticsearch-6.2.2.jar:6.2.2]
at org.elasticsearch.index.engine.InternalEngine.index(InternalEngine.java:886) ~[elasticsearch-6.2.2.jar:6.2.2]
... 24 more

Could this be related to #25005? https://github.com/elastic/elasticsearch/pull/25005

farin99 · 2018-03-16T08:52:15Z

from running: sudo lsof -p <es process>
90% of the file descriptors are from translog-xxxxx.tlog files. Seems like elastic never deletes these files

bleskes · 2018-03-16T10:16:53Z

@farin99 can you describe how many active shards you have on each nodes - i.e., how many shards receive index operations in a 12 hours period ?

ArielCoralogix · 2018-03-16T10:23:47Z

Hey @bleskes
Answering for @farin99 here - we have a few hundreds of active shards per node.

amnons77 · 2018-03-16T11:48:52Z

Hey @bleskes,

We have about 2600 shards that received an index operations in a 24H period.
We tried to change the "index.translog.retention.age" to 30m for the indexes who holds that shards and it didn't reduced the amount of "open_file_descriptors".

amnons77 · 2018-03-16T12:36:19Z

@bleskes
Sorry, my previous comment described the number of open shards on this entire cluster. We have about 100 open shards per node per 12 hours.
Btw, this same setup works in parallel now (data fan out) on version 5.2 with no issues and about 5000 open file descriptors while the 6.2 cluster already exceeded 150,000 for some of the nodes.

bleskes · 2018-03-16T14:36:40Z

@amnons77 ok, if trimming the retention policy doesn't help we need figure out what's going on. Let's takes this to the forums to sort out. Can you post a message, including the output of nodes stats on http://discuss.elastic.co and link it here? once we know what it is we can, if needed, come back to github

ArielCoralogix · 2018-03-16T15:34:53Z

Hey @bleskes I am referring to the ticket opened by our colleague @redlus on the Discuss forum: https://discuss.elastic.co/t/elasticsearch-6-2-2-nodes-crash-after-reaching-ulimit-setting/124172
Also, Lior ran 'GET /_stats?include_segment_file_sizes&level=shards' and send the result files from 3 different times to @nhat which is also active on that thread.

jasontedor · 2018-03-17T01:58:06Z

The translog for four of your shards (from three distinct indices) is behaving in an unanticipated way. Could you please share the index settings for:
DN_QfOruR5SMVRavPDnRfw
rRrAxVZzSye4W_-exBocCA
r-cR4xWkSFinkKg_RqiDsA

(from /{index}/_settings). I do not know the name of the indices that these correspond to, you will have to backtrace.

And can you also share the shard stats for these indices (from /{index}/_stats?level=shards&include_segment_file_sizes=true). You can email them to the same email address that you used before.

In elastic#28350, we fixed an endless flushing loop which can happen on replicas by tightening the relation between the flush action and the periodically flush condition. 1. The periodically flush condition is enabled only if it will be disabled after a flush. 2. If the periodically flush condition is true then a flush will actually happen regardless of Lucene state. (1) and (2) guarantee a flushing loop will be terminated. Sadly, the condition elastic#1 can be violated in edge cases as we used two different algorithms to evaluate the current and future uncommitted size. - We use method `uncommittedSizeInBytes` to calculate current uncommitted size. It is the sum of translogs whose generation at least the minGen (determined by a given seqno). We pick a continuous range of translogs since the minGen to evaluate the current uncommitted size. - We use method `sizeOfGensAboveSeqNoInBytes` to calculate the future uncommitted size. It is the sum of translogs whose maxSeqNo at least the given seqNo. Here we don't pick a range but select translog one by one. Suppose we have 3 translogs gen1={elastic#1,elastic#2}, gen2={}, gen3={elastic#3} and seqno=elastic#1, uncommittedSizeInBytes is the sum of gen1, gen2, and gen3 while sizeOfGensAboveSeqNoInBytes is sum of gen1 and gen3. Gen2 is excluded because its maxSeqno is still -1. This commit ensures sizeOfGensAboveSeqNoInBytes use the same algorithm from uncommittedSizeInBytes Closes elastic#29097

dnhatn · 2018-03-17T14:35:50Z

I opened #29125 to propose a fix for this.

dnhatn · 2018-03-17T14:58:16Z

The problem is that one replica got into an endless loop of flushing.

elasticmachine · 2018-03-17T14:58:32Z

Pinging @elastic/es-distributed

redlus · 2018-03-22T13:56:03Z

Hey @dnhatn @bleskes and @jasontedor,

We can confirm this affects v6.0.2 as well:

At 09:22 the node enters a state where the open file descriptors monotonically increase.
This is accompanied by higher load average, cpu, total indices actions and indices operation time.

In #28350, we fixed an endless flushing loop which may happen on replicas by tightening the relation between the flush action and the periodically flush condition. 1. The periodically flush condition is enabled only if it is disabled after a flush. 2. If the periodically flush condition is enabled then a flush will actually happen regardless of Lucene state. (1) and (2) guarantee that a flushing loop will be terminated. Sadly, the condition 1 can be violated in edge cases as we used two different algorithms to evaluate the current and future uncommitted translog size. - We use method `uncommittedSizeInBytes` to calculate current uncommitted size. It is the sum of translogs whose generation at least the minGen (determined by a given seqno). We pick a continuous range of translogs since the minGen to evaluate the current uncommitted size. - We use method `sizeOfGensAboveSeqNoInBytes` to calculate the future uncommitted size. It is the sum of translogs whose maxSeqNo at least the given seqNo. Here we don't pick a range but select translog one by one. Suppose we have 3 translogs `gen1={#1,#2}, gen2={}, gen3={#3} and seqno=#1`, `uncommittedSizeInBytes` is the sum of gen1, gen2, and gen3 while `sizeOfGensAboveSeqNoInBytes` is the sum of gen1 and gen3. Gen2 is excluded because its maxSeqno is still -1. This commit removes both `sizeOfGensAboveSeqNoInBytes` and `uncommittedSizeInBytes` methods, then enforces an engine to use only `sizeInBytesByMinGen` method to evaluate the periodically flush condition. Closes #29097 Relates ##28350

redlus · 2018-04-09T09:25:13Z

Commit #29125 changes the mechanism controlling the flush operation, and although it has been committed to the master, the latest elasticsearch version does not include it. We've therefore built v6.2.4-SNAPSHOT from the master and have been running it for the last two weeks in production-scale. The issue seems to have been solved, and everything works as expected:

Thank you for the quick response and handling of the issue!

bleskes · 2018-04-09T09:29:09Z

@redlus thanks for testing and letting us know. Happy to hear it worked.

jasontedor added the :Core/Infra/Resiliency Keep running when everything is ok. Die quickly if things go horribly wrong. label Mar 16, 2018

ArielCoralogix mentioned this issue Mar 16, 2018

Introducing a translog deletion policy #24950

Merged

dnhatn mentioned this issue Mar 17, 2018

Harden periodically check to avoid endless flush loop #29125

Merged

dnhatn added :Distributed Indexing/Engine Anything around managing Lucene and the Translog in an open shard. and removed :Core/Infra/Resiliency Keep running when everything is ok. Die quickly if things go horribly wrong. labels Mar 17, 2018

dnhatn closed this as completed in #29125 Mar 22, 2018

danopia mentioned this issue Apr 12, 2018

Index translogs can get stuck after node failure, doesn't get smaller when flushed anymore #29488

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Elasticsearch 6.2.2 nodes crash after reaching ulimit setting #29097

Elasticsearch 6.2.2 nodes crash after reaching ulimit setting #29097

redlus commented Mar 15, 2018

jasontedor commented Mar 16, 2018

elasticmachine commented Mar 16, 2018

redlus commented Mar 16, 2018

farin99 commented Mar 16, 2018

bleskes commented Mar 16, 2018

ArielCoralogix commented Mar 16, 2018

amnons77 commented Mar 16, 2018

amnons77 commented Mar 16, 2018

bleskes commented Mar 16, 2018

ArielCoralogix commented Mar 16, 2018

jasontedor commented Mar 17, 2018

dnhatn commented Mar 17, 2018

dnhatn commented Mar 17, 2018

elasticmachine commented Mar 17, 2018

redlus commented Mar 22, 2018

redlus commented Apr 9, 2018

bleskes commented Apr 9, 2018

Elasticsearch 6.2.2 nodes crash after reaching ulimit setting #29097

Elasticsearch 6.2.2 nodes crash after reaching ulimit setting #29097

Comments

redlus commented Mar 15, 2018

jasontedor commented Mar 16, 2018

elasticmachine commented Mar 16, 2018

redlus commented Mar 16, 2018

farin99 commented Mar 16, 2018

bleskes commented Mar 16, 2018

ArielCoralogix commented Mar 16, 2018

amnons77 commented Mar 16, 2018

amnons77 commented Mar 16, 2018

bleskes commented Mar 16, 2018

ArielCoralogix commented Mar 16, 2018

jasontedor commented Mar 17, 2018

dnhatn commented Mar 17, 2018

dnhatn commented Mar 17, 2018

elasticmachine commented Mar 17, 2018

redlus commented Mar 22, 2018

redlus commented Apr 9, 2018

bleskes commented Apr 9, 2018