NullPointerException when getting list of snapshots on S3 #25878
Could you paste the full stack trace from the Elasticsearch server logs?
There are no more logs for the null pointer entry. There are a ton of logs for the headers and each of the ~2K requests; do you want me to post those? All of them responded with 200 OK though.
These should be the logs from the last request before the null pointer. I tried to censor out any possibly sensitive info. Maybe the returned payload was what triggered the issue?
FYI, we use a coordinating node and 3 data nodes. I send the snapshot requests to the coordinating node, and all the S3 requests seem to originate from the data node that's currently the master (10.127.1.203). Some more logs: I see ~1k of these logs 15 sec after the start of the request and ~500 at the end:
These pop up between requests:
These are the things logged on the master node around the time the coordinating node logged the exception (excluding the freeing search context logs mentioned above):
Hmm, I don't see any smoking gun here. I am not really sure how to move forward with this without knowing where this NPE occurs or being able to reproduce this issue locally.
OK, as I understand it there should have been a stack trace after the "caused by" line, right? Maybe we can look into why that's not present and then we'll have more info for the specific issue? Also there's that
@eirc you said that
but the timestamps from these logs are
@eirc, @ywelsch and I discussed this more and we have a couple of other things we would like you to try:
Regarding the time discrepancies: the NPE happens every time I request a listing. At 12:27 I had debug logging on, so that's why most of the logs are from that time. 12:01 was probably one of the first tests. The same NPE log appeared at 12:27 and every time I did a listing request.
OK, now there's some light at the end of the tunnel! First, if I get the listing from the master node, it actually works! Requesting on the coordinating (or any other) node fails with that same behaviour. Adding error_trace=true to the request finally yields some useful info:
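For context, the kind of request meant here looks roughly like the following sketch; the repository name and node address are placeholders, not taken from this thread:

```
# Hypothetical example: list all snapshots with full stack traces enabled via the
# generic error_trace query parameter. "my_s3_repo" and the host are assumptions.
curl -s 'http://coordinating-node:9200/_snapshot/my_s3_repo/_all?pretty&error_trace=true'
```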
Here's the formatted stack trace for your convenience:
@eirc any chance you can email me the output that you get from master? My mail is igor at elastic.co. If not, could you try getting one snapshot at a time on the coordinating node and checking what's different between the snapshots that can be retrieved and the snapshots that cause this NPE? By the way, does the coordinating node have a different ES version?
Just confirmed all Elasticsearch nodes are on 5.5.0. Is there a way I can check the plugin versions? When I upgraded the stack I remember I had to remove and reinstall plugins to get the proper versions. I'll make a script to pull each snapshot individually and see which one(s) are breaking now.
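A minimal sketch of such a script, assuming a repository called my_s3_repo and that the snapshot names are fetched from the master node (where the listing still works) before each one is requested individually through the coordinating node; plugin versions can be checked with the _cat/plugins API:

```
#!/usr/bin/env bash
# Sketch only: the host names and repository name are assumptions, not from this thread.
MASTER="http://10.127.1.203:9200"      # data node that is currently master
COORD="http://coordinating-node:9200"  # coordinating node where the NPE occurs
REPO="my_s3_repo"

# Show installed plugins and their versions on every node.
curl -s "$COORD/_cat/plugins?v"

# Fetch the snapshot names from the master (listing works there),
# then request each snapshot individually via the coordinating node.
for SNAP in $(curl -s "$MASTER/_snapshot/$REPO/_all" | jq -r '.snapshots[].snapshot'); do
  STATUS=$(curl -s -o /dev/null -w '%{http_code}' "$COORD/_snapshot/$REPO/$SNAP")
  echo "$SNAP -> HTTP $STATUS"
done
```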
In 5.5.0 all plugins should be 5.5.0; otherwise, Elasticsearch wouldn't work. In any case, based on what we know so far, I don't think it's a plugin-related issue. Our current theory is that the snapshot info serialization code breaks on one or more snapshots that you have in your repository. However, we just reviewed this code and couldn't find any obvious issues. That's why we would like to figure out which snapshot information the master is trying to send to the coordinating node, in order to reproduce and fix the problem.
I emailed you the full snapshot list. My script
No, scratch that: there is a single snapshot which produces the NPE when I get it on its own.
Here is the JSON I can get from the master but not from other nodes:
Interestingly, this snapshot includes that
🏆 I deleted the offending snapshot and the listing now works! 🥇 If you need any more info on the "bug" itself I'll be happy to provide it. Also, my issue is solved, but I'll leave this for you to close in case you want to follow the thread deeper.
Thanks @eirc. We have found the line that is causing this NPE. We are just doing some root cause analysis at the moment to see if there is more to it. It's definitely a bug. Thanks a lot for the very detailed information and your willingness to work with us on it!
@eirc I spent some time trying to reproduce the issue, but no matter what I try I cannot get my snapshot into the state where it produces
My current best guess is that the index I mentioned we deleted (wsj-iis) was deleted during the backup process and maybe that mucked things up somehow. I can check the logs from that time for more concrete info, but that has to wait until tomorrow when I get back to work :)
Yes, deletion of indices during a snapshot is the first thing I tried. It produces slightly different snapshot info that doesn't contain any nulls. It seems that I am missing some key ingredient here. I am done for today as well, but it would be awesome if you could check the logs tomorrow.
The issue I see is that the code incorrectly assumes that
I think the easiest fix for now would be to assume that reason is Nullable and adapt the serialization code accordingly. WDYT @imotov?
Seems like that index was actually deleted a few days later after all, so that was probably a red herring. OK, there's a huge spike of logs during that snapshot's creation time; I'll try to aggregate what I see as most important:
Related to the snapshot itself (i.e. searching for "20170720085856"): 29 occurrences of
and 2 of
no index state found: 1702 occurrences of the following from one data node:
and one similar from another host.
unexpected error while indexing monitoring document: a spike of ~2.5k of those at the start of the snapshot:
and a similar number of those at the end of the snapshot:
node not connected: got 9 of those, with at least one for each node.
Exception when closing export bulk: 3 of those.
I'm sure there's more stuff in there, but I don't know if this actually helps you: the log volume is so high that I can't make sense of it, so I may be missing the important ones. Is there something more specific I could search for that would help? Maybe I should just export all the Elasticsearch logs for that day and mail them to you?
@ywelsch Yes, fixing it like this would be easy; I just didn't want to assume anything. I wanted to have a test that creates this problem so we can fix it for sure, which is why I spent some time trying to reproduce it. You are right about it being null in SnapshotsInProgress, and I tried to reproduce it this way, but it looks like it's a completely different path that doesn't get resolved into a shard failure object, so this seems to be a dead end. I think the ABORTED path is more promising, and after thinking about it for a while, I think the scenario is: a snapshot gets stuck on a master, gets aborted, then another master takes over and somehow generates these nulls. The problem with this scenario is that if a snapshot is aborted, it should be deleted afterwards. So, based on the information that @eirc provided, it looks like it might be a stuck snapshot combined with some sort of node failure that prevented the aborted snapshot from being cleaned up, which might be quite difficult to reproduce.
@eirc that would be very helpful. Thanks!
Just a quick update. @ywelsch and I discussed the issue and came up with a plan for how to modify
…red in CS The failure reason for snapshot shard failures might not be propagated properly if the master node changes after the errors were reported by other data nodes. This commit ensures that the snapshot shard failure reason is preserved properly and adds a workaround for reading old snapshot files where this information might not have been preserved. Closes elastic#25878
…red in CS (#25941) The failure reason for snapshot shard failures might not be propagated properly if the master node changes after the errors were reported by other data nodes. This commit ensures that the snapshot shard failure reason is preserved properly and adds a workaround for reading old snapshot files where this information might not have been preserved. Closes #25878
…red in CS (#26127) The failure reasons for snapshot shard failures might not be propagated properly if the master node changes after errors were reported by other data nodes, which causes them to be stored as null in snapshot files. This commit adds a workaround for reading such snapshot files where this information might not have been preserved and makes sure that the reason is not null if it gets the cluster state from another master. This is a partial backport of #25941 to 5.6. Closes #25878
Elasticsearch version: 5.5.0 (issue first appeared while on 5.4.1)
Plugins installed: [x-pack, repository-s3]
JVM version (java -version):
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-1~bpo8+1-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
OS version (uname -a if on a Unix-like system):
Linux ip-10-127-1-159 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1 (2016-12-30) x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
Trying to get the list of available snapshots on an S3 backed repository fails with NullPointerException.
Elasticsearch logs:
I use curator to take the backups; after grabbing backups successfully, it fails when it tries to delete old snapshots, because that's when it requires a list too:
I have a feeling this is due to some kind of timeout. I turned on debug logging, and while I couldn't find a more specific reason for the failure, I noticed it made ~2K requests to S3 before failing and stopped at 90 seconds. Is this a configurable timeout?
In the past getting a list of snapshots took increasingly long but it eventually responded. Now it breaks earlier than that.
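For reference, the listing call that fails is essentially a plain snapshot list request; a sketch, with the repository name assumed:

```
# Hypothetical repository name; this is the kind of request that fails with the NullPointerException.
curl -s 'http://localhost:9200/_snapshot/my_s3_repo/_all?pretty'
```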
Also posted on the forums: https://discuss.elastic.co/t/nullpointerexception-when-getting-list-of-snapshots-on-s3/94458