Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Race hangs while starting with many indices #320

Closed
kphelps opened this issue Sep 13, 2017 · 8 comments
Closed

Race hangs while starting with many indices #320

kphelps opened this issue Sep 13, 2017 · 8 comments
Labels
enhancement Improves the status quo :Load Driver Changes that affect the core of the load driver such as scheduling, the measurement approach etc. :Track Management New operations, changes in the track format, track download changes and the like

Comments

@kphelps
Copy link

kphelps commented Sep 13, 2017

Rally version (get with esrally --version):

esrally 0.7.1

Invoked command:

esrally race --track=small-indices --distribution-version=2.3.3

Configuration file (located in ~/.rally/rally.ini)):

[meta]
config.version = 10

[system]
env.name = local

[node]
root.dir = /Users/user/.rally/benchmarks

[source]
local.src.dir = /Users/user/.rally/benchmarks/src
remote.repo.url = https://github.com/elastic/elasticsearch.git

[build]
gradle.bin = /usr/local/bin/gradle

[runtime]
java8.home = /Library/Java/JavaVirtualMachines/jdk1.8.0_121.jdk/Contents/Home

[benchmarks]
local.dataset.cache = ${node:root.dir}/data

[reporting]
datastore.type = in-memory
datastore.host =
datastore.port =
datastore.secure =
datastore.user =
datastore.password =

[tracks]
default.url = <my-tracks-repo>

[teams]
default.url = <my-teams-repo>

[defaults]
preserve_benchmark_candidate = False

[distributions]
release.1.url = https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-{{VERSION}}.tar.gz
release.2.url = https://download.elasticsearch.org/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/{{VERSION}}/elasticsearch-{{VERSION}}.tar.gz
release.url = https://artifacts.elastic.co/downloads/elasticsearch/elasticsearch-{{VERSION}}.tar.gz
release.cache = true

JVM version:

java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)

OS version:

macOS Sierra 10.12.5

Description of the problem including expected versus actual behavior:

Attempting to start a race on a track with hundreds of indices never starts. If I reduce the number of indices, the race runs as expected. In the hanging case, the Driver never seems to receive the StartBenchmark message.

Steps to reproduce:

  1. Create a track with hundreds of indices.
  2. Attempt to start a race on the track.

Provide logs (if relevant):

...
2017-09-13 16:32:36,21 PID:88444 rally.metrics INFO Creating in-memory metrics store
2017-09-13 16:32:36,21 PID:88444 rally.metrics INFO Opening metrics store for invocation=[20170913T163217Z], track=[small-orgs], challenge=[basic], car=[defaults]
2017-09-13 16:32:36,21 PID:88444 rally.metrics INFO Creating file race store
2017-09-13 16:32:36,21 PID:88444 rally.racecontrol INFO Asking mechanic to start the engine.
2017-09-13 16:32:36,25 PID:88446 rally.mechanic INFO Received signal from race control to start engine.
2017-09-13 16:32:36,27 PID:88446 rally.metrics INFO Opening metrics store for invocation=[20170913T163217Z], track=[small-orgs], challenge=[basic], car=[defaults]
2017-09-13 16:32:36,27 PID:88446 rally.mechanic INFO Cluster consisting of [{'host': '127.0.0.1', 'port': 39200}] will be provisioned by Rally.
2017-09-13 16:32:36,31 PID:88442 rally.actor INFO Capabilities [{'coordinator': True, 'ip': '127.0.0.1', 'Convention Address.IPv4': '127.0.0.1:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 6, 2, 'final', 0), 'Thespian Generation': (3, 7), 'Thespian Version': '1505320337862'}] match requirements [{'coordinator': True}].
2017-09-13 16:32:36,31 PID:88442 rally.actor INFO Capabilities [{'coordinator': True, 'ip': '127.0.0.1', 'Convention Address.IPv4': '127.0.0.1:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 6, 2, 'final', 0), 'Thespian Generation': (3, 7), 'Thespian Version': '1505320337862'}] match requirements [{'coordinator': True}].
2017-09-13 16:32:36,50 PID:88641 rally.mechanic INFO Starting node(s) [0] on [127.0.0.1].
2017-09-13 16:32:36,54 PID:88641 rally.metrics INFO Opening metrics store for invocation=[20170913T163217Z], track=[small-orgs], challenge=[basic], car=[defaults]
2017-09-13 16:32:36,272 PID:88641 rally.repo INFO Checking out [2] in [/Users/kphelps/.rally/benchmarks/teams/default] for distribution version [2.3.3].
2017-09-13 16:32:36,291 PID:88641 rally.repo INFO Rebasing on [2] in [/Users/kphelps/.rally/benchmarks/teams/default] for distribution version [2.3.3].
2017-09-13 16:32:36,521 PID:88641 rally.supplier INFO Resolved download URL [https://download.elasticsearch.org/elasticsearch/release/org/elasticsearch/distribution/tar/elasticsearch/2.3.3/elasticsearch-2.3.3.tar.gz] for version [2.3.3]
2017-09-13 16:32:36,521 PID:88641 rally.supplier INFO Skipping download for version [2.3.3]. Found an existing binary locally at [/Users/kphelps/.rally/benchmarks/distributions/elasticsearch-2.3.3.tar.gz].
2017-09-13 16:32:36,522 PID:88641 rally.provisioner INFO Preparing candidate locally in [/Users/kphelps/.rally/benchmarks/races/2017-09-13-16-32-17/rally-node-0/install].
2017-09-13 16:32:36,523 PID:88641 rally.provisioner INFO Unzipping /Users/kphelps/.rally/benchmarks/distributions/elasticsearch-2.3.3.tar.gz to /Users/kphelps/.rally/benchmarks/races/2017-09-13-16-32-17/rally-node-0/install
2017-09-13 16:32:36,850 PID:88641 rally.provisioner INFO Deleting pre-bundled Elasticsearch configuration at [/Users/kphelps/.rally/benchmarks/races/2017-09-13-16-32-17/rally-node-0/install/elasticsearch-2.3.3/config]
2017-09-13 16:32:36,852 PID:88641 rally.provisioner INFO Reading config template file [/Users/kphelps/.rally/benchmarks/teams/default/cars/vanilla/config/elasticsearch.yml] and writing to [/Users/kphelps/.rally/benchmarks/races/2017-09-13-16-32-17/rally-node-0/install/elasticsearch-2.3.3/config/elasticsearch.yml].
2017-09-13 16:32:36,862 PID:88641 rally.provisioner INFO Reading config template file [/Users/kphelps/.rally/benchmarks/teams/default/cars/vanilla/config/logging.yml] and writing to [/Users/kphelps/.rally/benchmarks/races/2017-09-13-16-32-17/rally-node-0/install/elasticsearch-2.3.3/config/logging.yml].
2017-09-13 16:32:36,864 PID:88641 rally.process INFO Killing all processes which match [java], [elasticsearch] and [/Users/kphelps/.rally/benchmarks/races]
2017-09-13 16:32:36,890 PID:88641 rally.process INFO Skipping myself (PID [88641]).
2017-09-13 16:32:37,17 PID:88641 rally.launcher INFO Detected Java major version [8].
2017-09-13 16:32:37,17 PID:88641 rally.launcher INFO Starting node [rally-node-0] based on car [defaults].
2017-09-13 16:32:37,122 PID:88641 rally.process INFO b'java version "1.8.0_121"\n'
2017-09-13 16:32:37,123 PID:88641 rally.process INFO b'Java(TM) SE Runtime Environment (build 1.8.0_121-b13)\n'
2017-09-13 16:32:37,123 PID:88641 rally.process INFO b'Java HotSpot(TM) 64-Bit Server VM (build 25.121-b13, mixed mode)\n'
2017-09-13 16:32:37,131 PID:88641 rally.launcher INFO JVM supports [-XX:+ExitOnOutOfMemoryError]. Setting this option to detect out of memory errors during the benchmark.
2017-09-13 16:32:37,131 PID:88641 rally.launcher INFO env for [rally-node-0]: <omitted>
2017-09-13 16:32:37,881 PID:88641 rally.launcher INFO rally-node-0: [2017-09-13 12:32:37,880][INFO ][node                     ] [rally-node-0] version[2.3.3], pid[88719], build[218bdf1/2016-05-17T15:40:04Z]
2017-09-13 16:32:37,882 PID:88641 rally.launcher INFO rally-node-0: [2017-09-13 12:32:37,881][INFO ][node                     ] [rally-node-0] initializing ...
2017-09-13 16:32:38,717 PID:88641 rally.launcher INFO rally-node-0: [2017-09-13 12:32:38,717][INFO ][plugins                  ] [rally-node-0] modules [reindex, lang-expression, lang-groovy], plugins [], sites []
2017-09-13 16:32:38,742 PID:88641 rally.launcher INFO rally-node-0: [2017-09-13 12:32:38,742][INFO ][env                      ] [rally-node-0] using [1] data paths, mounts [[/ (/dev/disk0s2)]], net usable_space [48gb], net total_space [233.3gb], spins? [unknown], types [hfs]
2017-09-13 16:32:38,742 PID:88641 rally.launcher INFO rally-node-0: [2017-09-13 12:32:38,742][INFO ][env                      ] [rally-node-0] heap size [989.8mb], compressed ordinary object pointers [true]
2017-09-13 16:32:38,743 PID:88641 rally.launcher INFO rally-node-0: [2017-09-13 12:32:38,743][WARN ][env                      ] [rally-node-0] max file descriptors [10240] for elasticsearch process likely too low, consider increasing to at least [65536]
2017-09-13 16:32:40,750 PID:88641 rally.launcher INFO rally-node-0: [2017-09-13 12:32:40,750][INFO ][node                     ] [rally-node-0] initialized
2017-09-13 16:32:40,752 PID:88641 rally.launcher INFO rally-node-0: [2017-09-13 12:32:40,750][INFO ][node                     ] [rally-node-0] starting ...
2017-09-13 16:32:40,823 PID:88641 rally.launcher INFO rally-node-0: [2017-09-13 12:32:40,823][INFO ][transport                ] [rally-node-0] publish_address {127.0.0.1:39300}, bound_addresses {127.0.0.1:39300}
2017-09-13 16:32:40,827 PID:88641 rally.launcher INFO rally-node-0: [2017-09-13 12:32:40,827][INFO ][discovery                ] [rally-node-0] rally-benchmark/99DwO-p0QSGtwgefc6lZdg
2017-09-13 16:32:43,855 PID:88641 rally.launcher INFO rally-node-0: [2017-09-13 12:32:43,854][INFO ][cluster.service          ] [rally-node-0] new_master {rally-node-0}{99DwO-p0QSGtwgefc6lZdg}{127.0.0.1}{127.0.0.1:39300}{max_local_storage_nodes=1}, reason: zen-disco-join(elected_as_master, [0] joins received)
2017-09-13 16:32:43,868 PID:88641 rally.launcher INFO rally-node-0: [2017-09-13 12:32:43,868][INFO ][http                     ] [rally-node-0] publish_address {127.0.0.1:39200}, bound_addresses {127.0.0.1:39200}
2017-09-13 16:32:43,868 PID:88641 rally.launcher INFO rally-node-0: [2017-09-13 12:32:43,868][INFO ][node                     ] [rally-node-0] started
2017-09-13 16:32:43,869 PID:88641 rally.launcher INFO [rally-node-0] has successfully started.
2017-09-13 16:32:43,869 PID:88641 rally.launcher INFO Started node [rally-node-0] with PID [88719]
2017-09-13 16:32:43,869 PID:88641 rally.launcher INFO Node [rally-node-0] has successfully started. Attaching telemetry devices.
2017-09-13 16:32:43,900 PID:88641 rally.launcher INFO rally-node-0: [2017-09-13 12:32:43,900][INFO ][gateway                  ] [rally-node-0] recovered [0] indices into cluster_state
2017-09-13 16:32:44,36 PID:88641 rally.launcher INFO Telemetry devices are now attached to node [rally-node-0].
2017-09-13 16:32:44,38 PID:88446 rally.actor INFO [1] of [1] child actors have responded for transition from [starting] to [nodes_started].
2017-09-13 16:32:44,39 PID:88446 rally.actor INFO All [1] child actors have responded. Transitioning now from [starting] to [nodes_started].
2017-09-13 16:32:44,39 PID:88446 rally.client INFO Creating ES client connected to [{'host': '127.0.0.1', 'port': 39200}] with options [{'timeout': 60000, 'request_timeout': 60000}]
2017-09-13 16:32:44,41 PID:88446 rally.launcher INFO All cluster nodes have successfully started. Checking if REST API is available.
2017-09-13 16:32:44,153 PID:88446 rally.launcher INFO REST API is available. Attaching telemetry devices to cluster.
2017-09-13 16:32:44,211 PID:88446 rally.launcher INFO Telemetry devices are now attached to the cluster.
2017-09-13 16:32:44,211 PID:88446 rally.actor INFO Transitioning from [nodes_started] to [apply_meta_info].
2017-09-13 16:32:44,213 PID:88446 rally.actor INFO [1] of [1] child actors have responded for transition from [apply_meta_info] to [cluster_started].
2017-09-13 16:32:44,213 PID:88446 rally.actor INFO All [1] child actors have responded. Transitioning now from [apply_meta_info] to [cluster_started].
2017-09-13 16:32:44,214 PID:88444 rally.racecontrol INFO Mechanic has started engine successfully.
2017-09-13 16:32:44,214 PID:88444 rally.racecontrol INFO Starting lap [1/1]
2017-09-13 16:32:44,214 PID:88444 rally.racecontrol INFO Telling mechanic of benchmark start.
2017-09-13 16:32:44,215 PID:88444 rally.racecontrol INFO Telling driver to start benchmark.
2017-09-13 16:32:44,217 PID:88442 rally.actor INFO Capabilities [{'coordinator': True, 'ip': '127.0.0.1', 'Convention Address.IPv4': '127.0.0.1:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 6, 2, 'final', 0), 'Thespian Generation': (3, 7), 'Thespian Version': '1505320337862'}] match requirements [{'coordinator': True}].
2017-09-13 16:32:44,218 PID:88442 rally.actor INFO Capabilities [{'coordinator': True, 'ip': '127.0.0.1', 'Convention Address.IPv4': '127.0.0.1:1900', 'Thespian ActorSystem Name': 'multiprocTCPBase', 'Thespian ActorSystem Version': 2, 'Thespian Watch Supported': True, 'Python Version': (3, 6, 2, 'final', 0), 'Thespian Generation': (3, 7), 'Thespian Version': '1505320337862'}] match requirements [{'coordinator': True}].
2017-09-13 16:32:44,218 PID:88446 rally.telemetry INFO Gathering indices stats.
2017-09-13 16:32:44,227 PID:88446 rally.telemetry WARNING Could not determine value at path [merges,total_time_in_millis]. Returning default value [0]
2017-09-13 16:32:44,227 PID:88446 rally.telemetry WARNING Could not determine value at path [merges,total_throttled_time_in_millis]. Returning default value [0]
2017-09-13 16:32:44,227 PID:88446 rally.telemetry WARNING Could not determine value at path [indexing,index_time_in_millis]. Returning default value [0]
2017-09-13 16:32:44,227 PID:88446 rally.telemetry WARNING Could not determine value at path [refresh,total_time_in_millis]. Returning default value [0]
2017-09-13 16:32:44,228 PID:88446 rally.telemetry WARNING Could not determine value at path [flush,total_time_in_millis]. Returning default value [0]
2017-09-13 16:32:44,228 PID:88446 rally.actor INFO Transitioning from [cluster_started] to [benchmark_starting].
2017-09-13 16:32:44,230 PID:88641 rally.telemetry WARNING Process I/O counters are unsupported on this platform. Falling back to less accurate disk I/O counters.
2017-09-13 16:32:44,231 PID:88446 rally.actor INFO [1] of [1] child actors have responded for transition from [benchmark_starting] to [benchmark_started].
2017-09-13 16:32:44,232 PID:88446 rally.actor INFO All [1] child actors have responded. Transitioning now from [benchmark_starting] to [benchmark_started].
2017-09-13 16:32:44,235 PID:88446 rally.telemetry INFO Gathering indices stats.
2017-09-13 16:32:44,239 PID:88446 rally.telemetry WARNING Could not determine value at path [merges,total_time_in_millis]. Returning default value [0]
2017-09-13 16:32:44,239 PID:88446 rally.telemetry WARNING Could not determine value at path [merges,total_throttled_time_in_millis]. Returning default value [0]
2017-09-13 16:32:44,239 PID:88446 rally.telemetry WARNING Could not determine value at path [indexing,index_time_in_millis]. Returning default value [0]
2017-09-13 16:32:44,239 PID:88446 rally.telemetry WARNING Could not determine value at path [refresh,total_time_in_millis]. Returning default value [0]
2017-09-13 16:32:44,239 PID:88446 rally.telemetry WARNING Could not determine value at path [flush,total_time_in_millis]. Returning default value [0]
2017-09-13 16:32:52,221 PID:88444 rally.racecontrol INFO BenchmarkActor received unknown message [<esrally.mechanic.mechanic.BenchmarkStarted object at 0x1047842b0>] (ignoring).
@danielmitterdorfer
Copy link
Member

danielmitterdorfer commented Sep 14, 2017

I created the following track to reproduce the issue:

{% set index_count = 800 %}
{
  "short-description": "Demo benchmark",
  "description": "Demo benchmark with {{index_count}} indices.",
  "indices": [
    {% set comma = joiner() %}
    {% for item in range(index_count) %}
    {{ comma() }}
    {
      "name": "demo-{{item}}",
      "types": [
        {
          "name": "docs",
          "mapping": "mappings.json",
          "documents": "documents.json.bz2",
          "includes-action-and-meta-data": true,
          "document-count": 1,
          "compressed-bytes": 274,
          "uncompressed-bytes": 389
        }
      ]
    }
    {% endfor %}
  ],
  "operations": [
    {
      "name": "index-append",
      "operation-type": "index",
      "bulk-size": 1
    }
  ],
  "challenges": [
    {
      "name": "just-index",
      "description": "Just indexes a single document",
      "default": true,
      "index-settings": {
        "index.number_of_replicas": 0
      },
      "schedule": [
        {
          "operation": "index-append",
          "warmup-time-period": 0,
          "clients": 1
        }
      ]
    }
  ]
}

It creates 800 indices (which takes a while) but I managed to make it work on Mac OS X 10.11.6 (El Capitan). Initially I tried to run it with 1000 indices but Elasticsearch ran out of file descriptors:

2017-09-14 06:22:46,145 PID:2203 rally.driver INFO create index [demo-936] with body [{'settings': {'index.number_of_replicas': 0}, 'mappings': {'docs': {'dynamic': 'strict', '_all': {'enabled': False}, 'properties': {'elevation': {'type': 'integer'}, 'name': {'type': 'text', 'fields': {'raw': {'type': 'keyword'}}}, 'geonameid': {'type': 'long'}, 'feature_class': {'type': 'text', 'fields': {'raw': {'type': 'keyword'}}}, 'location': {'type': 'geo_point'}, 'cc2': {'type': 'text', 'fields': {'raw': {'type': 'keyword'}}}, 'timezone': {'type': 'text', 'fields': {'raw': {'type': 'keyword'}}}, 'dem': {'type': 'text', 'fields': {'raw': {'type': 'keyword'}}}, 'country_code': {'type': 'text', 'fielddata': True, 'fields': {'raw': {'type': 'keyword'}}}, 'admin1_code': {'type': 'text', 'fields': {'raw': {'type': 'keyword'}}}, 'admin2_code': {'type': 'text', 'fields': {'raw': {'type': 'keyword'}}}, 'admin3_code': {'type': 'text', 'fields': {'raw': {'type': 'keyword'}}}, 'admin4_code': {'type': 'text', 'fields': {'raw': {'type': 'keyword'}}}, 'feature_code': {'type': 'text', 'fields': {'raw': {'type': 'keyword'}}}, 'alternatenames': {'type': 'text', 'fields': {'raw': {'type': 'keyword'}}}, 'asciiname': {'type': 'text', 'fields': {'raw': {'type': 'keyword'}}}, 'population': {'type': 'long'}}}}}]
2017-09-14 06:22:46,149 PID:2114 rally.launcher INFO rally-node-0: [2017-09-14T08:22:46,148][INFO ][o.e.c.m.MetaDataCreateIndexService] [rally-node-0] [demo-936] creating index, cause [api], templates [], shards [5]/[0], mappings [docs]
2017-09-14 06:22:46,388 PID:2114 rally.launcher INFO rally-node-0: [2017-09-14T08:22:46,383][WARN ][o.e.i.c.IndicesClusterStateService] [rally-node-0] [[demo-936][2]] marking and sending shard failed due to [failed recovery]
2017-09-14 06:22:46,388 PID:2114 rally.launcher INFO rally-node-0: org.elasticsearch.indices.recovery.RecoveryFailedException: [demo-936][2]: Recovery failed on {rally-node-0}{GV6OZ-O0SWuy8SHOW5Fcuw}{esHPHfBeTbemqV87zAhv0A}{127.0.0.1}{127.0.0.1:39300}
2017-09-14 06:22:46,388 PID:2114 rally.launcher INFO rally-node-0: 	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$1(IndexShard.java:1488) [elasticsearch-5.6.0.jar:5.6.0]
2017-09-14 06:22:46,389 PID:2114 rally.launcher INFO rally-node-0: 	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.6.0.jar:5.6.0]
2017-09-14 06:22:46,389 PID:2114 rally.launcher INFO rally-node-0: 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1161) [?:?]
2017-09-14 06:22:46,389 PID:2114 rally.launcher INFO rally-node-0: 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
2017-09-14 06:22:46,389 PID:2114 rally.launcher INFO rally-node-0: 	at java.lang.Thread.run(Thread.java:844) [?:?]
2017-09-14 06:22:46,389 PID:2114 rally.launcher INFO rally-node-0: Caused by: org.elasticsearch.index.shard.IndexShardRecoveryException: failed recovery
2017-09-14 06:22:46,389 PID:2114 rally.launcher INFO rally-node-0: 	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:299) ~[elasticsearch-5.6.0.jar:5.6.0]
2017-09-14 06:22:46,389 PID:2114 rally.launcher INFO rally-node-0: 	at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:88) ~[elasticsearch-5.6.0.jar:5.6.0]
2017-09-14 06:22:46,390 PID:2114 rally.launcher INFO rally-node-0: 	at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1236) ~[elasticsearch-5.6.0.jar:5.6.0]
2017-09-14 06:22:46,390 PID:2114 rally.launcher INFO rally-node-0: 	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$1(IndexShard.java:1484) ~[elasticsearch-5.6.0.jar:5.6.0]
2017-09-14 06:22:46,390 PID:2114 rally.launcher INFO rally-node-0: 	... 4 more
2017-09-14 06:22:46,390 PID:2114 rally.launcher INFO rally-node-0: Caused by: org.elasticsearch.index.translog.TranslogException: failed to create new translog file
2017-09-14 06:22:46,390 PID:2114 rally.launcher INFO rally-node-0: 	at org.elasticsearch.index.translog.Translog.createWriter(Translog.java:381) ~[elasticsearch-5.6.0.jar:5.6.0]
2017-09-14 06:22:46,390 PID:2114 rally.launcher INFO rally-node-0: 	at org.elasticsearch.index.translog.Translog.<init>(Translog.java:203) ~[elasticsearch-5.6.0.jar:5.6.0]
2017-09-14 06:22:46,390 PID:2114 rally.launcher INFO rally-node-0: 	at org.elasticsearch.index.engine.InternalEngine.openTranslog(InternalEngine.java:272) ~[elasticsearch-5.6.0.jar:5.6.0]
2017-09-14 06:22:46,390 PID:2114 rally.launcher INFO rally-node-0: 	at org.elasticsearch.index.engine.InternalEngine.<init>(InternalEngine.java:160) ~[elasticsearch-5.6.0.jar:5.6.0]
2017-09-14 06:22:46,390 PID:2114 rally.launcher INFO rally-node-0: 	at org.elasticsearch.index.engine.InternalEngineFactory.newReadWriteEngine(InternalEngineFactory.java:25) ~[elasticsearch-5.6.0.jar:5.6.0]
2017-09-14 06:22:46,391 PID:2114 rally.launcher INFO rally-node-0: 	at org.elasticsearch.index.shard.IndexShard.newEngine(IndexShard.java:1602) ~[elasticsearch-5.6.0.jar:5.6.0]
2017-09-14 06:22:46,391 PID:2114 rally.launcher INFO rally-node-0: 	at org.elasticsearch.index.shard.IndexShard.createNewEngine(IndexShard.java:1584) ~[elasticsearch-5.6.0.jar:5.6.0]
2017-09-14 06:22:46,391 PID:2114 rally.launcher INFO rally-node-0: 	at org.elasticsearch.index.shard.IndexShard.internalPerformTranslogRecovery(IndexShard.java:1027) ~[elasticsearch-5.6.0.jar:5.6.0]
2017-09-14 06:22:46,391 PID:2114 rally.launcher INFO rally-node-0: 	at org.elasticsearch.index.shard.IndexShard.performTranslogRecovery(IndexShard.java:987) ~[elasticsearch-5.6.0.jar:5.6.0]
2017-09-14 06:22:46,391 PID:2114 rally.launcher INFO rally-node-0: 	at org.elasticsearch.index.shard.StoreRecovery.internalRecoverFromStore(StoreRecovery.java:360) ~[elasticsearch-5.6.0.jar:5.6.0]
2017-09-14 06:22:46,391 PID:2114 rally.launcher INFO rally-node-0: 	at org.elasticsearch.index.shard.StoreRecovery.lambda$recoverFromStore$0(StoreRecovery.java:90) ~[elasticsearch-5.6.0.jar:5.6.0]
2017-09-14 06:22:46,391 PID:2114 rally.launcher INFO rally-node-0: 	at org.elasticsearch.index.shard.StoreRecovery.executeRecovery(StoreRecovery.java:257) ~[elasticsearch-5.6.0.jar:5.6.0]
2017-09-14 06:22:46,391 PID:2114 rally.launcher INFO rally-node-0: 	at org.elasticsearch.index.shard.StoreRecovery.recoverFromStore(StoreRecovery.java:88) ~[elasticsearch-5.6.0.jar:5.6.0]
2017-09-14 06:22:46,391 PID:2114 rally.launcher INFO rally-node-0: 	at org.elasticsearch.index.shard.IndexShard.recoverFromStore(IndexShard.java:1236) ~[elasticsearch-5.6.0.jar:5.6.0]
2017-09-14 06:22:46,391 PID:2114 rally.launcher INFO rally-node-0: 	at org.elasticsearch.index.shard.IndexShard.lambda$startRecovery$1(IndexShard.java:1484) ~[elasticsearch-5.6.0.jar:5.6.0]
2017-09-14 06:22:46,392 PID:2114 rally.launcher INFO rally-node-0: 	... 4 more
2017-09-14 06:22:46,392 PID:2114 rally.launcher INFO rally-node-0: Caused by: java.nio.file.FileSystemException: /Users/daniel/.rally/benchmarks/races/2017-09-14-06-14-55/rally-node-0/install/elasticsearch-5.6.0/data/nodes/0/indices/wDnyVMWSRtaPj_WnBiaDdw/2/translog/translog.ckp: Too many open files in system

(this one is on 5.6.0 but I also tried (and succeeded) on 2.3.3). So I wonder whether you're running out of file descriptors and that's why it hangs?

Can you please check the output of ulimit -n (max number of open files)? You can increase it with ulimit -n NEW_VALUE. On 10.11., Apple's system integrity protection wanted me to keep that value rather low but I managed to increase it at least to 10240 with ulimit -n 10240 (note that the limit is only applied to the current shell session). It would be great if you could raise the limit and retry with esrally race --track=small-indices --distribution-version=2.3.3 --logging=console (the last parameter is not perfect and is mainly intended for development but it allows you to watch without tailing log files in this simple case).

@danielmitterdorfer danielmitterdorfer added the feedback needed An open question blocks progress label Sep 14, 2017
@kphelps
Copy link
Author

kphelps commented Sep 14, 2017

I never ran into the file descriptor issue you described, I don't think rally ever started creating indices in ES. My ulimit is set to 7168.

I was able to track this down a bit more, though. I think the problem is actually related to my mapping files. Each of my indices has a different mapping file and many of them are fairly large (megabytes). Changing all of my indices to use the same small mapping file allowed me to run the race.

@danielmitterdorfer
Copy link
Member

I don't think rally ever started creating indices

with --logging=console you should see something along the lines of:

2017-09-14 06:22:46,145 PID:2203 rally.driver INFO create index [demo-936] with body [{'settings': {'index.number_of_replicas': 0}, 'mappings': {'docs': {'dynamic': 'strict', '_all': {'enabled': False}, 'properties': {'elevation': {'type': 'integer'}, 'name': {'type': 'text', 'fields': {'raw': {'type': 'keyword'}}}, 'geonameid': {'type': 'long'}, 'feature_class': {'type': 'text', 'fields': {'raw': {'type': 'keyword'}}}, 'location': {'type': 'geo_point'}, 'cc2': {'type': 'text', 'fields': {'raw': {'type': 'keyword'}}}, 'timezone': {'type': 'text', 'fields': {'raw': {'type': 'keyword'}}}, 'dem': {'type': 'text', 'fields': {'raw': {'type': 'keyword'}}}, 'country_code': {'type': 'text', 'fielddata': True, 'fields': {'raw': {'type': 'keyword'}}}, 'admin1_code': {'type': 'text', 'fields': {'raw': {'type': 'keyword'}}}, 'admin2_code': {'type': 'text', 'fields': {'raw': {'type': 'keyword'}}}, 'admin3_code': {'type': 'text', 'fields': {'raw': {'type': 'keyword'}}}, 'admin4_code': {'type': 'text', 'fields': {'raw': {'type': 'keyword'}}}, 'feature_code': {'type': 'text', 'fields': {'raw': {'type': 'keyword'}}}, 'alternatenames': {'type': 'text', 'fields': {'raw': {'type': 'keyword'}}}, 'asciiname': {'type': 'text', 'fields': {'raw': {'type': 'keyword'}}}, 'population': {'type': 'long'}}}}}]

for every index that is created. As you have huge mapping files we should probably define an upper limit for the log output in this case.

My ulimit is set to 7168

That should be sufficient for creating a few hundred indices (I managed to create more than 800 with 10240).

Each of my indices has a different mapping file and many of them are fairly large (megabytes).

Ok. For my reproduction I just reused the mapping file for the geonames track. A multi-megabyte mapping file might indeed be problematic.

I think this could be caused by several issues:

  1. Rally logs the whole mapping file once again.
  2. You're maybe hitting some default limit in the Elasticsearch Python client that Rally is using under the hood.

Can you please tell me the size of your largest mapping file and also whether Rally has actually logged something like the log statement that I've shown above? Then I'll continue to reproduce. However, setting up that many indices (with large mappings) will always be slow. Maybe we can apply some smarts and show a progress indication in case you are creating more indices than some threshold.

@danielmitterdorfer
Copy link
Member

I've created #321 to avoid logging multi-megabyte mappings.

@kphelps
Copy link
Author

kphelps commented Sep 14, 2017

My largest mapping is 5MB. I don't ever see the create index log message. Totally get it would be slow, a progress indicator sounds very helpful!

@danielmitterdorfer
Copy link
Member

danielmitterdorfer commented Sep 15, 2017

I'm gonna have a look what's going on (or rather not going on ;)).

@danielmitterdorfer danielmitterdorfer removed the feedback needed An open question blocks progress label Sep 15, 2017
@danielmitterdorfer danielmitterdorfer self-assigned this Sep 15, 2017
@danielmitterdorfer
Copy link
Member

I could reproduce it and I know what's going on: Rally loads the mapping file when loading the track. This happens (always) on the coordinator and the intention is that the track should be self-contained so it is easier to send it between Rally nodes (e.g. when you distribute the load driver). The problem is that Rally serializes the whole track internally to send it to the load driver and this happens even when you are on a local machine because the load driver is a separate (sub)process.

I think we can revert to lazy-loading for the mapping file because it should be sufficient to do all the index setup on the coordinating node. As this also requires changes to the node placement of specific actors this is a bit more effort than just moving the code to read the mapping file. Anyway, I attempt to tackle it but it will probably take a bit of time to be ready.

@danielmitterdorfer danielmitterdorfer added :Load Driver Changes that affect the core of the load driver such as scheduling, the measurement approach etc. :Track Management New operations, changes in the track format, track download changes and the like enhancement Improves the status quo labels Sep 19, 2017
@danielmitterdorfer danielmitterdorfer added this to the Pre 1.0 milestone Sep 19, 2017
@danielmitterdorfer
Copy link
Member

I wanted to keep this ticket open to think whether we can make this lazy-loading work. While we could implement it, it would prevent or complicate other use-cases, like the definition of tracks in simple files (as opposed to a fully-fledged track repository).

I'd really like to make both cases work but for now we need to decide whether we want to go either into one or the other direction. Therefore, I am closing this ticket without adding lazy-loading for now. However, should the internal infrastructure change, we can revisit and maybe add this later.

@danielmitterdorfer danielmitterdorfer removed their assignment Mar 16, 2018
@danielmitterdorfer danielmitterdorfer removed this from the Pre 1.0 milestone Mar 16, 2018
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement Improves the status quo :Load Driver Changes that affect the core of the load driver such as scheduling, the measurement approach etc. :Track Management New operations, changes in the track format, track download changes and the like
Projects
None yet
Development

No branches or pull requests

2 participants