
NullPointerException when getting list of snapshots on S3 #25878

Closed
eirc opened this issue Jul 25, 2017 · 25 comments
Assignees: imotov
Labels: >bug, :Distributed Coordination/Snapshot/Restore

eirc commented Jul 25, 2017

Elasticsearch version: 5.5.0 (issue first appeared while on 5.4.1)

Plugins installed: [x-pack, repository-s3]

JVM version (java -version): openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-1~bpo8+1-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)

OS version (uname -a if on a Unix-like system): Linux ip-10-127-1-159 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1 (2016-12-30) x86_64 GNU/Linux

Description of the problem including expected versus actual behavior:

Trying to get the list of available snapshots on an S3-backed repository fails with a NullPointerException.

curl elasticsearch:9200/_snapshot/long_term/_all
{"error":{"root_cause":[{"type":"remote_transport_exception","reason":"[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]"}],"type":"null_pointer_exception","reason":null},"status":500}

Elasticsearch logs:

[2017-07-25T12:01:47,038][WARN ][r.suppressed             ] path: /_snapshot/long_term/_all, params: {repository=long_term, snapshot=_all}
org.elasticsearch.transport.RemoteTransportException: [SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]
Caused by: java.lang.NullPointerException

I use Curator to take the backups. After completing the backups successfully it fails when it tries to delete old snapshots, because that step requires a listing too:

2017-07-25 11:53:02,191 ERROR     Failed to complete action: delete_snapshots.  <class 'curator.exceptions.FailedExecution'>: Unable to get snapshot information from repository: long_term.  Error: TransportError(500, 'null_pointer_exception', '[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]')

I have a feeling this is due to some kind of timeout. I turned on debug logging and, while I couldn't find a more specific reason for the failure, I noticed it made ~2K requests to S3 before failing, stopping at 90 seconds. Is this a configurable timeout?

In the past getting a list of snapshots took increasingly long but it eventually responded. Now it breaks earlier than that.

Also posted on the forums: https://discuss.elastic.co/t/nullpointerexception-when-getting-list-of-snapshots-on-s3/94458

colings86 commented Jul 25, 2017

Could you paste the full stack trace from the Elasticsearch server logs?

eirc commented Jul 25, 2017

There are no more logs for the null pointer entry. There are a ton of logs for the headers and each of the ~2K requests; do you want me to post those? All of them responded with 200 OK though.

eirc commented Jul 25, 2017

These should be the logs from the last request before the null pointer. I tried to censor out any possibly sensitive info. Maybe the returned payload was what triggered the issue?

[2017-07-25T12:27:45,437][DEBUG][o.a.h.c.p.RequestAddCookies] CookieSpec selected: default
[2017-07-25T12:27:45,437][DEBUG][o.a.h.c.p.RequestAuthCache] Auth cache not set in the context
[2017-07-25T12:27:45,437][DEBUG][o.a.h.c.p.RequestProxyAuthentication] Proxy auth state: UNCHALLENGED
[2017-07-25T12:27:45,437][DEBUG][c.a.h.i.c.SdkHttpClient  ] Attempt 1 to execute request
[2017-07-25T12:27:45,437][DEBUG][o.a.h.i.c.DefaultClientConnection] Sending request: GET /long_term/snap-GRrT8CKjS7qdq42NZf3T2A.dat HTTP/1.1
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire               ]  >> "GET /long_term/snap-GRrT8CKjS7qdq42NZf3T2A.dat HTTP/1.1[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire               ]  >> "Host: elastic-stack-backupsbucket-*****************.s3-eu-west-1.amazonaws.com[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire               ]  >> "x-amz-content-sha256: *********************[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire               ]  >> "Authorization: AWS4-HMAC-SHA256 Credential=****************/20170725/eu-west-1/s3/aws4_request, SignedHeaders=amz-sdk-invocation-id;amz-sdk-retry;content-type;host;user-agent;x-amz-content-sha256;x-amz-d
ate;x-amz-security-token, Signature=***************************[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire               ]  >> "X-Amz-Date: 20170725T092745Z[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire               ]  >> "User-Agent: aws-sdk-java/1.10.69 Linux/3.16.0-4-amd64 OpenJDK_64-Bit_Server_VM/25.131-b11/1.8.0_131[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire               ]  >> "X-Amz-Security-Token: **********************************[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire               ]  >> "amz-sdk-invocation-id: 23f8b7a2-93bb-46f4-a492-cf692051dc43[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire               ]  >> "amz-sdk-retry: 0/0/[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire               ]  >> "Content-Type: application/octet-stream[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire               ]  >> "Connection: Keep-Alive[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.wire               ]  >> "[\r][\n]"
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers            ] >> GET /long_term/snap-GRrT8CKjS7qdq42NZf3T2A.dat HTTP/1.1
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers            ] >> Host: elastic-stack-backupsbucket-*****************.s3-eu-west-1.amazonaws.com
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers            ] >> x-amz-content-sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers            ] >> Authorization: AWS4-HMAC-SHA256 Credential=****************/20170725/eu-west-1/s3/aws4_request, SignedHeaders=amz-sdk-invocation-id;amz-sdk-retry;content-type;host;user-agent;x-amz-content-sha256;x-amz-dat
e;x-amz-security-token, Signature=***************************
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers            ] >> X-Amz-Date: 20170725T092745Z
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers            ] >> User-Agent: aws-sdk-java/1.10.69 Linux/3.16.0-4-amd64 OpenJDK_64-Bit_Server_VM/25.131-b11/1.8.0_131
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers            ] >> X-Amz-Security-Token: **********************************
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers            ] >> amz-sdk-invocation-id: 23f8b7a2-93bb-46f4-a492-cf692051dc43
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers            ] >> amz-sdk-retry: 0/0/
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers            ] >> Content-Type: application/octet-stream
[2017-07-25T12:27:45,437][DEBUG][o.a.h.headers            ] >> Connection: Keep-Alive
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire               ]  << "HTTP/1.1 200 OK[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire               ]  << "x-amz-id-2: ************************[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire               ]  << "x-amz-request-id: 3E117E943CA08991[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire               ]  << "Date: Tue, 25 Jul 2017 09:27:46 GMT[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire               ]  << "Last-Modified: Wed, 19 Jul 2017 01:07:25 GMT[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire               ]  << "ETag: "8e87c087b7474433ba26057f74233e5a"[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire               ]  << "Accept-Ranges: bytes[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire               ]  << "Content-Type: application/octet-stream[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire               ]  << "Content-Length: 302[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire               ]  << "Server: AmazonS3[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire               ]  << "[\r][\n]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.i.c.DefaultClientConnection] Receiving response: HTTP/1.1 200 OK
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers            ] << HTTP/1.1 200 OK
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers            ] << x-amz-id-2: *************************************
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers            ] << x-amz-request-id: 3E117E943CA08991
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers            ] << Date: Tue, 25 Jul 2017 09:27:46 GMT
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers            ] << Last-Modified: Wed, 19 Jul 2017 01:07:25 GMT
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers            ] << ETag: "8e87c087b7474433ba26057f74233e5a"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers            ] << Accept-Ranges: bytes
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers            ] << Content-Type: application/octet-stream
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers            ] << Content-Length: 302
[2017-07-25T12:27:45,509][DEBUG][o.a.h.headers            ] << Server: AmazonS3
[2017-07-25T12:27:45,509][DEBUG][c.a.h.i.c.SdkHttpClient  ] Connection can be kept alive for 60000 MILLISECONDS
[2017-07-25T12:27:45,509][DEBUG][c.a.requestId            ] x-amzn-RequestId: not available
[2017-07-25T12:27:45,509][DEBUG][c.a.request              ] Received successful response: 200, AWS Request ID: 3E117E943CA08991
[2017-07-25T12:27:45,509][DEBUG][c.a.requestId            ] AWS Request ID: 3E117E943CA08991
[2017-07-25T12:27:45,509][DEBUG][o.a.h.wire               ]  << "?[0xd7]l[0x17][0x8]snapshot[0x0][0x0][0x0][0x1]DFL[0x0]l[0x92][0xcd]J[0x3]1[0x14][0x85][0xe9][0xc2][0x85][0xe2]SH]t[0xd1]![0xc9]Lg[0xda][0xee]tP[0x17]B[0x17][0xa6][0xed]B[0x90]!4[0x19][0x9a][0xd2]fln[0xd2][0x95]++[0xe]E[0x90]y[0xdc][0xfe]l[0x1c]DQ[0xe8][0x85]lr[0xf8][0xce][0xb9][0xf7][0x90][0xf4][g'[0xfb][0x12][0x8c]x[0x86]i[0xe1][0xf6]k#[0x16][0xea]IHY[0x18]h[0xff][0xaa]mFhB[0x12][0xda]#[0x94][0xc4][0x9d]h[0xed][0xbd][0x96][0xa3][0xbb][0x7];[0xec][0xa6][0xf7]3[0x9e],[0xe5]2b[0x83][0xc7]<[0x1c][0xb2][0xab][0xcd]JY[0xd0][0x85][0xc9][0xb4]l[0x9e][0xe][0xae]?[0xdf][0xb5][0x91]z[0xa2]`[0xcb]^?2W[0xf4];- [0xf5][0x89][0x10][0x91]v02[0xc1]H[0x8a][0x91]=[0x8c]D[0xed]1f?[0x9e][0x1e][0x7][0xec]83[0xe]B[0x82][0xd9][0x19]&v[0xb1][0x95][0xd0][0xee][0x18]I[0xb0][0x9a][0x14][0x9b]NCl:Z[0x13]#)[0xdb][0xbd][0x81][0x13]N[0xdd][0xf2]Q[0x9a][0xde]p[0xbe][0xa9]o[0xd6]eN/[0xd4]e#[0x18][0x9f]_[0xbc][0xbc][0x96][0xca][0xc8]?[0xa1]e[0xaa][0xf]W81[0xcf]`*[0xac][0x84]f[0xa3][0xaa][0xc0]O[0xea][0xe7][0x86][0xdc][0xff][0x13][0xcb]\[0xe8][0xb9][0xb7][0xf5]/[0xd8][0x1d][0xe]_[0x0][0x0][0x0][0xff][0xff][0x3][0x0][0xc0]([0x93][0xe8][0x0][0x0][0x0][0x0][0x0][0x0][0x0][0x0][0xf4][0x1f]J[0xbe]"
[2017-07-25T12:27:45,509][DEBUG][o.a.h.i.c.PoolingClientConnectionManager] Connection [id: 4949][route: {s}->https://elastic-stack-backupsbucket-*************.s3-eu-west-1.amazonaws.com:443] can be kept alive for 60000 MILLISECONDS
[2017-07-25T12:27:45,509][DEBUG][o.a.h.i.c.PoolingClientConnectionManager] Connection released: [id: 4949][route: {s}->https://elastic-stack-backupsbucket-******************.s3-eu-west-1.amazonaws.com:443][total kept alive: 1; route allocated: 1 of 50; total allocated: 1 of 50]

eirc commented Jul 25, 2017

FYI we use a coordinating node and 3 data nodes. I send the snapshot requests to the coordinating node, and all the S3 requests seem to originate from the data node that's currently the master (10.127.1.203).

Some more logs:

I see ~1k of these logs 15 seconds after the start of the request and ~500 at the end:

[2017-07-25T12:27:46,968][DEBUG][o.e.s.SearchService      ] [SVVyQPF] freeing search context [1977515], time [225057509], lastAccessTime [224978476], keepAlive [30000]

These pop up between requests:

[2017-07-25T12:27:45,374][DEBUG][o.a.h.i.c.PoolingClientConnectionManager] Connection released: [id: 4949][route: {s}->https://elastic-stack-backupsbucket-**********.s3-eu-west-1.amazonaws.com:443][total kept alive: 1; route allocated: 1 of 50; total allocated: 1 of 50]
[2017-07-25T12:27:45,374][DEBUG][o.a.h.i.c.PoolingClientConnectionManager] Connection [id: 4949][route: {s}->https://elastic-stack-backupsbucket-**********.s3-eu-west-1.amazonaws.com:443] can be kept alive for 60000 MILLISECONDS

These are the things logged on the master node around the time the coordinating node logged the exception (excluding the freeing search context logs mentioned above):

[2017-07-25T12:27:45,509][DEBUG][o.a.h.i.c.DefaultClientConnection] Receiving response: HTTP/1.1 200 OK
[2017-07-25T12:27:45,509][DEBUG][o.a.h.i.c.PoolingClientConnectionManager] Connection [id: 4949][route: {s}->https://elastic-stack-backupsbucket-***********.s3-eu-west-1.amazonaws.com:443] can be kept alive for 60000 MILLISECONDS
[2017-07-25T12:27:45,509][DEBUG][c.a.requestId            ] x-amzn-RequestId: not available
[2017-07-25T12:27:45,541][DEBUG][o.e.m.j.JvmGcMonitorService] [SVVyQPF] [gc][221514] overhead, spent [106ms] collecting in the last [1s]
[2017-07-25T12:27:47,497][DEBUG][o.e.x.m.a.GetDatafeedsStatsAction$TransportAction] [SVVyQPF] Get stats for datafeed '_all'
[2017-07-25T12:27:47,652][DEBUG][o.e.x.m.e.l.LocalExporter] monitoring index templates and pipelines are installed on master node, service can start
[2017-07-25T12:27:48,542][DEBUG][o.e.m.j.JvmGcMonitorService] [SVVyQPF] [gc][221517] overhead, spent [111ms] collecting in the last [1s]

imotov commented Jul 26, 2017

Hmm, I don't see any smoking gun here. I am not really sure how to move forward with this without knowing where this NPE occurs or being able to reproduce this issue locally.

eirc commented Jul 26, 2017

OK, as I understand it there should have been a stack trace after the "Caused by" line, right? Maybe we can look into why that's not present and then we'll have more info for the specific issue? Also, there's that r.suppressed thing; that would at least point to the class in which the NPE occurred, but that's not available either. Can I configure something to make that visible?

ywelsch commented Jul 26, 2017

@eirc you said that

These should be the logs from the last request before the null pointer

but the timestamp of these logs is 12:27 whereas the NPE has a timestamp of 12:01.
Can you provide the full logs from both the master node and the coordinating node? (You can share them in private with us if you don't want to post them publicly)

imotov commented Jul 26, 2017

@eirc, @ywelsch and I discussed this more and we have a couple of other things we would like you to try:

  1. Could you execute curl elasticsearch:9200/_snapshot/long_term/_all?error_trace=true and see if the stack trace shows up there?

  2. Could you execute curl localhost:9200/_snapshot/long_term/_all on the current master node? If it works there but still fails when you execute it against a coordinating node, we would really appreciate that output as well.

eirc commented Jul 26, 2017

Regarding the time discrepancies: the NPE happens every time I request a listing. At 12:27 I had debug logging on, which is why most of the logs are from that time; 12:01 was probably one of the first tests. The same NPE log appeared at 12:27 and every other time I did a listing request.

eirc commented Jul 26, 2017

Ok now there's some light at the end of the tunnel!

First, if I get the listing from the master node it actually works! Requesting it on the coordinating (or any other) node fails with that same behaviour. Adding error_trace=true to the request finally yields some useful info:

{
    "error": {
        "root_cause": [{
            "type": "remote_transport_exception",
            "reason": "[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]",
            "stack_trace": "[[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]]; nested: RemoteTransportException[[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]]; nested: NullPointerException;\n\tat org.elasticsearch.ElasticsearchException.guessRootCauses(ElasticsearchException.java:618)\n\tat org.elasticsearch.ElasticsearchException.generateFailureXContent(ElasticsearchException.java:563)\n\tat org.elasticsearch.rest.BytesRestResponse.build(BytesRestResponse.java:138)\n\tat org.elasticsearch.rest.BytesRestResponse.<init>(BytesRestResponse.java:96)\n\tat org.elasticsearch.rest.BytesRestResponse.<init>(BytesRestResponse.java:91)\n\tat org.elasticsearch.rest.action.RestActionListener.onFailure(RestActionListener.java:58)\n\tat org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:94)\n\tat org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$3.handleException(TransportMasterNodeAction.java:185)\n\tat org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1067)\n\tat org.elasticsearch.transport.TcpTransport.lambda$handleException$16(TcpTransport.java:1467)\n\tat org.elasticsearch.common.util.concurrent.EsExecutors$1.execute(EsExecutors.java:110)\n\tat org.elasticsearch.transport.TcpTransport.handleException(TcpTransport.java:1465)\n\tat org.elasticsearch.transport.TcpTransport.handlerResponseError(TcpTransport.java:1457)\n\tat org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1401)\n\tat org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)\n\tat io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)\n\tat io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297)\n\tat io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413)\n\tat io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)\n\tat io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)\n\tat io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)\n\tat io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)\n\tat io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)\n\tat io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544)\n\tat io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)\n\tat 
io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)\n\tat io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)\n\tat java.lang.Thread.run(Thread.java:748)\nCaused by: RemoteTransportException[[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]]; nested: NullPointerException;\nCaused by: java.lang.NullPointerException\n"
        }],
        "type": "null_pointer_exception",
        "reason": null,
        "stack_trace": "java.lang.NullPointerException\n"
    },
    "status": 500
}

Here's the formatted stack trace for your convenience:

[[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]]; nested: RemoteTransportException[[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]]; nested: NullPointerException;
        at org.elasticsearch.ElasticsearchException.guessRootCauses(ElasticsearchException.java:618)
        at org.elasticsearch.ElasticsearchException.generateFailureXContent(ElasticsearchException.java:563)
        at org.elasticsearch.rest.BytesRestResponse.build(BytesRestResponse.java:138)
        at org.elasticsearch.rest.BytesRestResponse.<init>(BytesRestResponse.java:96)
        at org.elasticsearch.rest.BytesRestResponse.<init>(BytesRestResponse.java:91)
        at org.elasticsearch.rest.action.RestActionListener.onFailure(RestActionListener.java:58)
        at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:94)
        at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$3.handleException(TransportMasterNodeAction.java:185)
        at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1067)
        at org.elasticsearch.transport.TcpTransport.lambda$handleException$16(TcpTransport.java:1467)
        at org.elasticsearch.common.util.concurrent.EsExecutors$1.execute(EsExecutors.java:110)
        at org.elasticsearch.transport.TcpTransport.handleException(TcpTransport.java:1465)
        at org.elasticsearch.transport.TcpTransport.handlerResponseError(TcpTransport.java:1457)
        at org.elasticsearch.transport.TcpTransport.messageReceived(TcpTransport.java:1401)
        at org.elasticsearch.transport.netty4.Netty4MessageChannelHandler.channelRead(Netty4MessageChannelHandler.java:74)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310)
        at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:297)
        at io.netty.handler.codec.ByteToMessageDecoder.callDecode(ByteToMessageDecoder.java:413)
        at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:265)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340)
        at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362)
        at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348)
        at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926)
        at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134)
        at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544)
        at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498)
        at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458)
        at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
        at java.lang.Thread.run(Thread.java:748)
Caused by: RemoteTransportException[[SVVyQPF][10.127.1.203:9300][cluster:admin/snapshot/get]]; nested: NullPointerException;
Caused by: java.lang.NullPointerException

imotov commented Jul 26, 2017

@eirc any chance you can email me the output that you get from master? My mail is igor at elastic.co. If not, could you try getting one snapshot at a time on the coordinating node and checking what's different between the snapshots that can be retrieved and the snapshots that cause this NPE?

By the way, does the coordinating node have a different ES version?

eirc commented Jul 26, 2017

Just confirmed all Elasticsearch nodes are on 5.5.0. Can I check the versions of the plugins somehow? When I upgraded the stack I remember I had to remove and reinstall the plugins to get the proper versions.

I'll make a script to pull each snapshot individually and see which one(s) are breaking now.
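
For reference, a minimal sketch of such a per-snapshot check (not the actual script used here), assuming the snapshot names are saved one per line in a hypothetical snapshots.txt file and the coordinating node is reachable on localhost:9200:

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.List;

public class SnapshotChecker {
    public static void main(String[] args) throws IOException {
        // One snapshot name per line, e.g. "wsj-snapshot-20170720085856" (hypothetical file).
        List<String> names = Files.readAllLines(Paths.get("snapshots.txt"));
        for (String name : names) {
            URL url = new URL("http://localhost:9200/_snapshot/long_term/" + name);
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            conn.setRequestMethod("GET");
            // A 200 means the snapshot info could be fetched; a snapshot that trips
            // the bug comes back as a 500 null_pointer_exception instead.
            System.out.println(name + " -> HTTP " + conn.getResponseCode());
            conn.disconnect();
        }
    }
}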

imotov commented Jul 26, 2017

In 5.5.0 all plugins should be 5.5.0; otherwise, Elasticsearch wouldn't work. In any case, based on what we know so far, I don't think it's a plugin-related issue. Our current theory is that the snapshot info serialization code breaks on one or more snapshots that you have in your repository. However, we just reviewed this code and couldn't find any obvious issues. That's why we would like to figure out which snapshot information the master is trying to send to the coordinating node, in order to reproduce and fix the problem.

eirc commented Jul 26, 2017

I emailed you the full snapshot list. My script managed to successfully grab each snapshot individually from the coordinating node (where grabbing them all failed). I noticed some of the snapshots have some shard failures, but that shouldn't be an issue, right? Maybe it's the size of the response that's the issue here? I have ~2k snapshots and the response is 1.2 MB.

eirc commented Jul 26, 2017

No, scratch that, there is a single snapshot which produces the NPE when I get it on its own.

eirc commented Jul 26, 2017

Here is the JSON I can get from the master but not from other nodes:

{
  "snapshots": [
    {
      "snapshot": "wsj-snapshot-20170720085856",
      "uuid": "yIbELYjgQN-_BgjRd4Vb0A",
      "version_id": 5040199,
      "version": "5.4.1",
      "indices": [
        "wsj-2017.07.19",
        "wsj-iis-2017.07.11",
        "wsj-2017.07.08",
        "wsj-2017.07.15",
        "wsj-2017.07.11",
        "wsj-2017.07.12",
        "wsj-2017.07.02",
        "wsj-2017.07.10",
        "wsj-2017.07.06",
        "wsj-2017.06.30",
        "wsj-2017.07.05",
        "wsj-2017.07.14",
        "wsj-2017.07.03",
        "wsj-2017.07.16",
        "wsj-2017.07.17",
        "wsj-2017.07.07",
        "wsj-2017.07.01",
        "wsj-2017.07.09",
        "wsj-2017.07.04",
        "wsj-2017.07.18",
        "wsj-2017.07.13"
      ],
      "state": "PARTIAL",
      "start_time": "2017-07-20T08:58:57.243Z",
      "start_time_in_millis": 1500541137243,
      "end_time": "2017-07-20T11:52:37.938Z",
      "end_time_in_millis": 1500551557938,
      "duration_in_millis": 10420695,
      "failures": [
        {
          "index": "wsj-2017.07.16",
          "index_uuid": "wsj-2017.07.16",
          "shard_id": 0,
          "reason": null,
          "node_id": "eIcWA_QQTByXWDrUlUOFAA",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "wsj-2017.07.18",
          "index_uuid": "wsj-2017.07.18",
          "shard_id": 1,
          "reason": "IndexNotFoundException[no such index]",
          "node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "wsj-2017.06.30",
          "index_uuid": "wsj-2017.06.30",
          "shard_id": 0,
          "reason": "IndexNotFoundException[no such index]",
          "node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "wsj-iis-2017.07.11",
          "index_uuid": "wsj-iis-2017.07.11",
          "shard_id": 4,
          "reason": "IndexNotFoundException[no such index]",
          "node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "wsj-2017.07.18",
          "index_uuid": "wsj-2017.07.18",
          "shard_id": 0,
          "reason": null,
          "node_id": "eIcWA_QQTByXWDrUlUOFAA",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "wsj-2017.07.06",
          "index_uuid": "wsj-2017.07.06",
          "shard_id": 0,
          "reason": null,
          "node_id": "eIcWA_QQTByXWDrUlUOFAA",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "wsj-iis-2017.07.11",
          "index_uuid": "wsj-iis-2017.07.11",
          "shard_id": 0,
          "reason": null,
          "node_id": "eIcWA_QQTByXWDrUlUOFAA",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "wsj-2017.07.19",
          "index_uuid": "wsj-2017.07.19",
          "shard_id": 4,
          "reason": null,
          "node_id": "eIcWA_QQTByXWDrUlUOFAA",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "wsj-2017.07.03",
          "index_uuid": "wsj-2017.07.03",
          "shard_id": 4,
          "reason": null,
          "node_id": "eIcWA_QQTByXWDrUlUOFAA",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "wsj-iis-2017.07.11",
          "index_uuid": "wsj-iis-2017.07.11",
          "shard_id": 3,
          "reason": null,
          "node_id": "eIcWA_QQTByXWDrUlUOFAA",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "wsj-2017.07.03",
          "index_uuid": "wsj-2017.07.03",
          "shard_id": 0,
          "reason": "IndexNotFoundException[no such index]",
          "node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "wsj-2017.07.16",
          "index_uuid": "wsj-2017.07.16",
          "shard_id": 3,
          "reason": null,
          "node_id": "eIcWA_QQTByXWDrUlUOFAA",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "wsj-2017.07.01",
          "index_uuid": "wsj-2017.07.01",
          "shard_id": 1,
          "reason": null,
          "node_id": "eIcWA_QQTByXWDrUlUOFAA",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "wsj-2017.07.18",
          "index_uuid": "wsj-2017.07.18",
          "shard_id": 4,
          "reason": "IndexNotFoundException[no such index]",
          "node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "wsj-2017.07.05",
          "index_uuid": "wsj-2017.07.05",
          "shard_id": 4,
          "reason": null,
          "node_id": "eIcWA_QQTByXWDrUlUOFAA",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "wsj-2017.07.05",
          "index_uuid": "wsj-2017.07.05",
          "shard_id": 1,
          "reason": null,
          "node_id": "eIcWA_QQTByXWDrUlUOFAA",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "wsj-2017.06.30",
          "index_uuid": "wsj-2017.06.30",
          "shard_id": 1,
          "reason": null,
          "node_id": "eIcWA_QQTByXWDrUlUOFAA",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "wsj-2017.06.30",
          "index_uuid": "wsj-2017.06.30",
          "shard_id": 4,
          "reason": null,
          "node_id": "eIcWA_QQTByXWDrUlUOFAA",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "wsj-2017.07.18",
          "index_uuid": "wsj-2017.07.18",
          "shard_id": 3,
          "reason": null,
          "node_id": "eIcWA_QQTByXWDrUlUOFAA",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "wsj-2017.07.01",
          "index_uuid": "wsj-2017.07.01",
          "shard_id": 4,
          "reason": null,
          "node_id": "eIcWA_QQTByXWDrUlUOFAA",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "wsj-2017.07.03",
          "index_uuid": "wsj-2017.07.03",
          "shard_id": 3,
          "reason": "IndexNotFoundException[no such index]",
          "node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "wsj-iis-2017.07.11",
          "index_uuid": "wsj-iis-2017.07.11",
          "shard_id": 1,
          "reason": "IndexNotFoundException[no such index]",
          "node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "wsj-2017.07.19",
          "index_uuid": "wsj-2017.07.19",
          "shard_id": 0,
          "reason": "IndexNotFoundException[no such index]",
          "node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "wsj-2017.07.19",
          "index_uuid": "wsj-2017.07.19",
          "shard_id": 3,
          "reason": "IndexNotFoundException[no such index]",
          "node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "wsj-2017.07.19",
          "index_uuid": "wsj-2017.07.19",
          "shard_id": 1,
          "reason": null,
          "node_id": "eIcWA_QQTByXWDrUlUOFAA",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "wsj-2017.07.16",
          "index_uuid": "wsj-2017.07.16",
          "shard_id": 4,
          "reason": "IndexNotFoundException[no such index]",
          "node_id": "GhOdYtKNTIOYMFVRHQHn_Q",
          "status": "INTERNAL_SERVER_ERROR"
        },
        {
          "index": "wsj-2017.07.03",
          "index_uuid": "wsj-2017.07.03",
          "shard_id": 1,
          "reason": null,
          "node_id": "eIcWA_QQTByXWDrUlUOFAA",
          "status": "INTERNAL_SERVER_ERROR"
        }
      ],
      "shards": {
        "total": 27,
        "failed": 27,
        "successful": 0
      }
    }
  ]
}

Interestingly, this snapshot includes the wsj-iis-2017.07.11 index, which we later deleted (since, due to the naming, it would get mixed up a lot with the wsj-* indices) and recreated with another name. Those IndexNotFoundException[no such index] messages look weird though, because the mentioned indices do exist, are still on the cluster, and I can query them.

eirc commented Jul 26, 2017

🏆 deleted the offending snapshot and the listing now works! 🥇

If you need any more info on the "bug" itself I'll be happy to provide it. My issue is solved, but I'll leave this open for you to close in case you want to follow the thread deeper.

imotov commented Jul 26, 2017

Thanks @eirc. We have found the line that is causing this NPE. We are just doing some root-cause analysis at the moment to see if there is more to it. It's definitely a bug. Thanks a lot for the very detailed information and your willingness to work with us on it!

imotov self-assigned this Jul 26, 2017
imotov added the :Distributed Coordination/Snapshot/Restore label and removed the feedback_needed and :Plugin Repository S3 labels Jul 26, 2017
imotov commented Jul 26, 2017

@eirc I spent some time trying to reproduce the issue, but no matter what I try I cannot get my snapshot into the state where it produces nulls in shard failures. It looks like the snapshot in question took place a week ago. Do you remember, by any chance, what was going on with the cluster during this time? Do you still have log files from that day?

eirc commented Jul 26, 2017

My current best guess is that the index I mentioned we deleted (wsj-iis) was deleted during the backup process and maybe that mucked things up somehow. I can check the logs from that time for more concrete info, but that has to wait until tomorrow when I get back to work :)

imotov commented Jul 26, 2017

Yes, deletion of indices during a snapshot is the first thing I tried. It produces slightly different snapshot info that doesn't contain any nulls. It seems that I am missing some key ingredient here. I am done for today as well, but it would be awesome if you could check the logs tomorrow.

ywelsch commented Jul 27, 2017

The issue I see is that the code incorrectly assumes that reason is non-null whenever there is a SnapshotShardFailure. The failure is constructed from a ShardSnapshotStatus object that is in a "failed" state (one of FAILED, ABORTED, MISSING). I see two places where we can possibly have a ShardSnapshotStatus object with a "failed" state and where the "reason" can be null:

  • cluster state serialization (to be precise: SnapshotsInProgress), because we don't serialize the "reason". This means that on master failover it can become null. This scenario can be verified by adding the assertion reason != null to the SnapshotShardFailure constructor and running the (currently disabled) test testMasterShutdownDuringFailedSnapshot a few times.
  • the call shardsBuilder.put(shardEntry.key, new ShardSnapshotStatus(status.nodeId(), State.ABORTED)) when aborting a snapshot. Here it's more difficult to come up with a scenario. But unless we can rule that one out, I would still consider it an issue.

I think the easiest fix for now would be to assume that reason is Nullable and adapt the serialization code accordingly. WDYT @imotov ?
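
To illustrate one way the serialization could be adapted, here is a small, self-contained sketch (simplified, not Elasticsearch's actual classes): write a presence flag before the optional string, the same presence-flag idea as StreamOutput#writeOptionalString / StreamInput#readOptionalString, so that a null reason survives the wire round trip instead of being assumed non-null.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;

// Simplified illustration only; the real SnapshotShardFailure uses StreamOutput/StreamInput.
public class NullableReasonWireFormat {

    static void writeOptionalString(DataOutput out, String s) throws IOException {
        out.writeBoolean(s != null);   // presence flag first
        if (s != null) {
            out.writeUTF(s);           // payload only when present
        }
    }

    static String readOptionalString(DataInput in) throws IOException {
        return in.readBoolean() ? in.readUTF() : null;
    }

    public static void main(String[] args) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bytes);
        // A shard failure whose reason is null, e.g. after a master failover where
        // SnapshotsInProgress did not carry the reason (hypothetical values).
        writeOptionalString(out, "eIcWA_QQTByXWDrUlUOFAA"); // node_id
        writeOptionalString(out, null);                     // reason may legitimately be null

        DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes.toByteArray()));
        System.out.println("node_id=" + readOptionalString(in));
        System.out.println("reason=" + readOptionalString(in)); // prints "reason=null", no NPE
    }
}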

eirc commented Jul 27, 2017

Seems like that index was actually deleted a few days later after all, so that was probably a red herring.

OK, there's a huge spike of logs during that snapshot's creation time; I'll try to aggregate what I see as most important:

Related to the snapshot itself (i.e. searching for "20170720085856"):

29 occurrences of

[2017-07-20T14:44:49,461][WARN ][o.e.s.SnapshotShardsService] [Ht8LDxX] [[wsj-iis-2017.07.11][2]] [long_term:wsj-snapshot-20170720085856/yIbELYjgQN-_BgjRd4Vb0A] failed to create snapshot
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: Failed to snapshot
	at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:397) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:88) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:335) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.4.1.jar:5.4.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: org.apache.lucene.store.AlreadyClosedException: engine is closed
	at org.elasticsearch.index.shard.IndexShard.getEngine(IndexShard.java:1446) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.index.shard.IndexShard.acquireIndexCommit(IndexShard.java:836) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:380) ~[elasticsearch-5.4.1.jar:5.4.1]
	... 7 more

and 2 of

[2017-07-20T14:44:49,459][WARN ][o.e.s.SnapshotShardsService] [Ht8LDxX] [[wsj-2017.07.19][2]] [long_term:wsj-snapshot-20170720085856/yIbELYjgQN-_BgjRd4Vb0A] failed to create snapshot
org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: Aborted
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext$AbortableInputStream.checkAborted(BlobStoreRepository.java:1501) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext$AbortableInputStream.read(BlobStoreRepository.java:1494) ~[elasticsearch-5.4.1.jar:5.4.1]
	at java.io.FilterInputStream.read(FilterInputStream.java:107) ~[?:1.8.0_131]
	at org.elasticsearch.common.io.Streams.copy(Streams.java:76) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.common.io.Streams.copy(Streams.java:57) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.repositories.s3.S3BlobContainer.writeBlob(S3BlobContainer.java:100) ~[?:?]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshotFile(BlobStoreRepository.java:1428) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository$SnapshotContext.snapshot(BlobStoreRepository.java:1370) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:967) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:382) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.snapshots.SnapshotShardsService.access$200(SnapshotShardsService.java:88) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.snapshots.SnapshotShardsService$1.doRun(SnapshotShardsService.java:335) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.4.1.jar:5.4.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]

no index state found

1702 occurrences of the following from one data node:

[2017-07-20T14:51:22,103][WARN ][o.e.c.u.IndexFolderUpgrader] [/mnt/elasticsearch-data-02/nodes/0/indices/8oH-hwzeQAmJR7TZkUxf1w] no index state found - ignoring

and one similar from another host

unexpected error while indexing monitoring document

a spike of ~ 2.5k of those at the start of the snapshot:

[2017-07-20T14:44:48,526][WARN ][o.e.x.m.e.l.LocalExporter] unexpected error while indexing monitoring document
org.elasticsearch.xpack.monitoring.exporter.ExportException: NodeClosedException[node closed {Ht8LDxX}{Ht8LDxXGQAGEna893aC57w}{vq-tK9uISPexLeENQ82FRw}{10.127.1.207}{10.127.1.207:9300}{ml.enabled=true}]
	at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$throwExportException$2(LocalBulk.java:131) ~[?:?]
	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[?:1.8.0_131]
	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_131]
	at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) ~[?:1.8.0_131]
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_131]
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_131]
	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[?:1.8.0_131]
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[?:1.8.0_131]
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_131]
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[?:1.8.0_131]
	at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:132) ~[?:?]
	at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$0(LocalBulk.java:115) ~[?:?]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:59) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:88) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:84) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.bulk.TransportBulkAction$BulkRequestModifier.lambda$wrapActionListenerIfNeeded$0(TransportBulkAction.java:583) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:59) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.finishHim(TransportBulkAction.java:389) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.onFailure(TransportBulkAction.java:384) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:94) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.finishAsFailed(TransportReplicationAction.java:827) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onClusterServiceClose(TransportReplicationAction.java:810) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onClusterServiceClose(ClusterStateObserver.java:304) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onClose(ClusterStateObserver.java:224) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.cluster.service.ClusterService.addTimeoutListener(ClusterService.java:385) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:166) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:111) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:103) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retry(TransportReplicationAction.java:802) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$1.handleException(TransportReplicationAction.java:781) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1050) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.transport.TransportService$Adapter.lambda$onConnectionClosed$6(TransportService.java:876) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.4.1.jar:5.4.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: org.elasticsearch.node.NodeClosedException: node closed {Ht8LDxX}{Ht8LDxXGQAGEna893aC57w}{vq-tK9uISPexLeENQ82FRw}{10.127.1.207}{10.127.1.207:9300}{ml.enabled=true}
	... 15 more

and a similar number of those at the end of the snapshot:

[2017-07-20T14:51:05,408][WARN ][o.e.x.m.e.l.LocalExporter] unexpected error while indexing monitoring document
org.elasticsearch.xpack.monitoring.exporter.ExportException: TransportException[transport stopped, action: indices:data/write/bulk[s][p]]
	at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$throwExportException$2(LocalBulk.java:131) ~[?:?]
	at java.util.stream.ReferencePipeline$3$1.accept(ReferencePipeline.java:193) ~[?:1.8.0_131]
	at java.util.stream.ReferencePipeline$2$1.accept(ReferencePipeline.java:175) ~[?:1.8.0_131]
	at java.util.Spliterators$ArraySpliterator.forEachRemaining(Spliterators.java:948) ~[?:1.8.0_131]
	at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:481) ~[?:1.8.0_131]
	at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471) ~[?:1.8.0_131]
	at java.util.stream.ForEachOps$ForEachOp.evaluateSequential(ForEachOps.java:151) ~[?:1.8.0_131]
	at java.util.stream.ForEachOps$ForEachOp$OfRef.evaluateSequential(ForEachOps.java:174) ~[?:1.8.0_131]
	at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234) ~[?:1.8.0_131]
	at java.util.stream.ReferencePipeline.forEach(ReferencePipeline.java:418) ~[?:1.8.0_131]
	at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:132) ~[?:?]
	at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$0(LocalBulk.java:115) ~[?:?]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:59) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:88) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:84) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.bulk.TransportBulkAction$BulkRequestModifier.lambda$wrapActionListenerIfNeeded$0(TransportBulkAction.java:583) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:59) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.finishHim(TransportBulkAction.java:389) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.onFailure(TransportBulkAction.java:384) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:94) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.finishAsFailed(TransportReplicationAction.java:827) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$1.handleException(TransportReplicationAction.java:783) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1050) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:247) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:638) [elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) [elasticsearch-5.4.1.jar:5.4.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: org.elasticsearch.transport.TransportException: transport stopped, action: indices:data/write/bulk[s][p]
	at org.elasticsearch.transport.TransportService$4.doRun(TransportService.java:246) ~[elasticsearch-5.4.1.jar:5.4.1]
	... 5 more

node not connected

Got 9 of those, with at least one for each node:

[2017-07-20T14:44:47,437][WARN ][o.e.a.a.c.n.i.TransportNodesInfoAction] [zYawxs4] not accumulating exceptions, excluding exception from response
org.elasticsearch.action.FailedNodeException: Failed node [Ht8LDxXGQAGEna893aC57w]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.onFailure(TransportNodesAction.java:246) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.access$200(TransportNodesAction.java:160) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction$1.handleException(TransportNodesAction.java:218) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:493) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.nodes.TransportNodesAction$AsyncAction.start(TransportNodesAction.java:204) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.nodes.TransportNodesAction.doExecute(TransportNodesAction.java:89) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.nodes.TransportNodesAction.doExecute(TransportNodesAction.java:52) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.TransportAction$RequestFilterChain.proceed(TransportAction.java:170) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:142) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.TransportAction.execute(TransportAction.java:84) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.client.node.NodeClient.executeLocally(NodeClient.java:83) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.client.node.NodeClient.doExecute(NodeClient.java:72) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.client.support.AbstractClient.execute(AbstractClient.java:408) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.client.support.AbstractClient$ClusterAdmin.execute(AbstractClient.java:730) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.client.support.AbstractClient$ClusterAdmin.nodesInfo(AbstractClient.java:811) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.rest.action.admin.cluster.RestNodesInfoAction.lambda$prepareRequest$0(RestNodesInfoAction.java:109) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.rest.BaseRestHandler.handleRequest(BaseRestHandler.java:80) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:260) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.rest.RestController.dispatchRequest(RestController.java:199) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.http.netty4.Netty4HttpServerTransport.dispatchRequest(Netty4HttpServerTransport.java:505) ~[transport-netty4-5.4.1.jar:5.4.1]
	at org.elasticsearch.http.netty4.Netty4HttpRequestHandler.channelRead0(Netty4HttpRequestHandler.java:72) ~[transport-netty4-5.4.1.jar:5.4.1]
	at io.netty.channel.SimpleChannelInboundHandler.channelRead(SimpleChannelInboundHandler.java:105) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at org.elasticsearch.http.netty4.pipelining.HttpPipeliningHandler.channelRead(HttpPipeliningHandler.java:63) ~[transport-netty4-5.4.1.jar:5.4.1]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) ~[netty-codec-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.handler.codec.MessageToMessageCodec.channelRead(MessageToMessageCodec.java:111) ~[netty-codec-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) ~[netty-codec-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:102) ~[netty-codec-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.fireChannelRead(ByteToMessageDecoder.java:310) [netty-codec-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.handler.codec.ByteToMessageDecoder.channelRead(ByteToMessageDecoder.java:284) [netty-codec-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.ChannelInboundHandlerAdapter.channelRead(ChannelInboundHandlerAdapter.java:86) ~[netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:340) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1334) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:362) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:348) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:926) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:134) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:644) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:544) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:498) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:458) [netty-transport-4.1.11.Final.jar:4.1.11.Final]
	at io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858) [netty-common-4.1.11.Final.jar:4.1.11.Final]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [Ht8LDxX][10.127.1.207:9300] Node not connected
	at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:630) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.transport.TcpTransport.getConnection(TcpTransport.java:116) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:513) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:489) ~[elasticsearch-5.4.1.jar:5.4.1]
	... 58 more

Then there's "Exception when closing export bulk", 3 of those:

[2017-07-20T14:44:48,536][WARN ][o.e.x.m.MonitoringService] [Ht8LDxX] monitoring execution failed
org.elasticsearch.xpack.monitoring.exporter.ExportException: Exception when closing export bulk
	at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$1$1.<init>(ExportBulk.java:106) ~[?:?]
	at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$1.onFailure(ExportBulk.java:104) ~[?:?]
	at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound$1.onResponse(ExportBulk.java:217) ~[?:?]
	at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound$1.onResponse(ExportBulk.java:211) ~[?:?]
	at org.elasticsearch.xpack.common.IteratingActionListener.onResponse(IteratingActionListener.java:108) ~[?:?]
	at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.lambda$null$0(ExportBulk.java:175) ~[?:?]
	at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:67) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:138) ~[?:?]
	at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.lambda$doFlush$0(LocalBulk.java:115) ~[?:?]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:59) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:88) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:84) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.bulk.TransportBulkAction$BulkRequestModifier.lambda$wrapActionListenerIfNeeded$0(TransportBulkAction.java:583) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:59) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.finishHim(TransportBulkAction.java:389) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.bulk.TransportBulkAction$BulkOperation$1.onFailure(TransportBulkAction.java:384) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:94) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.finishAsFailed(TransportReplicationAction.java:827) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$2.onClusterServiceClose(TransportReplicationAction.java:810) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onClusterServiceClose(ClusterStateObserver.java:304) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onClose(ClusterStateObserver.java:224) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.cluster.service.ClusterService.addTimeoutListener(ClusterService.java:385) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:166) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:111) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.cluster.ClusterStateObserver.waitForNextChange(ClusterStateObserver.java:103) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase.retry(TransportReplicationAction.java:802) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.action.support.replication.TransportReplicationAction$ReroutePhase$1.handleException(TransportReplicationAction.java:781) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1050) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.transport.TransportService$Adapter.lambda$onConnectionClosed$6(TransportService.java:876) ~[elasticsearch-5.4.1.jar:5.4.1]
	at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:569) [elasticsearch-5.4.1.jar:5.4.1]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142) [?:1.8.0_131]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617) [?:1.8.0_131]
	at java.lang.Thread.run(Thread.java:748) [?:1.8.0_131]
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: failed to flush export bulks
	at org.elasticsearch.xpack.monitoring.exporter.ExportBulk$Compound.lambda$null$0(ExportBulk.java:167) ~[?:?]
	... 27 more
Caused by: org.elasticsearch.xpack.monitoring.exporter.ExportException: bulk [default_local] reports failures when exporting documents
	at org.elasticsearch.xpack.monitoring.exporter.local.LocalBulk.throwExportException(LocalBulk.java:127) ~[?:?]
	... 25 more

I'm sure there's more stuff in there, but I don't know if this actually helps you: the log volume is so high that I can't make sense of it, so I may be missing the important entries. Is there something more specific I could search for that would help? Maybe I should just export all elasticsearch logs for that day and mail them to you?

@imotov
Copy link
Contributor

imotov commented Jul 27, 2017

I think the easiest fix for now would be to assume that reason is Nullable and adapt the serialization code accordingly. WDYT @imotov ?

@ywelsch Yes, fixing it like this would be easy; I just didn't want to assume anything. I wanted to have a test that creates this problem so we can fix it for sure, which is why I spent some time trying to reproduce it. You are right about it being null in SnapshotsInProgress, and I tried to reproduce it this way, but it looks like it's a completely different path that doesn't get resolved into a shard failure object, so this seems to be a dead end. The ABORTED path is more promising: after thinking about it for a while, I think the scenario is that a snapshot gets stuck on a master, gets aborted, and then another master takes over and somehow generates these nulls. The problem with this scenario is that if a snapshot is aborted, it should be deleted afterwards. So, based on the information that @eirc provided, it looks like it might be a stuck snapshot combined with some sort of node failure that prevented the aborted snapshot from being cleaned up, which might be quite difficult to reproduce.
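
(For illustration only: a minimal sketch of what treating the reason as Nullable in the serialization code could look like, using the optional-string helpers on StreamInput/StreamOutput. The class below is a made-up stand-in for the shard failure entry, not the actual patch.)

import java.io.IOException;
import org.elasticsearch.common.Nullable;
import org.elasticsearch.common.io.stream.StreamInput;
import org.elasticsearch.common.io.stream.StreamOutput;
import org.elasticsearch.common.io.stream.Writeable;

// Hypothetical stand-in for a shard failure entry whose reason may be absent.
public class NullableReasonFailure implements Writeable {

    @Nullable
    private final String reason;

    public NullableReasonFailure(@Nullable String reason) {
        this.reason = reason;
    }

    public NullableReasonFailure(StreamInput in) throws IOException {
        // readOptionalString() returns null when the writer sent no value.
        reason = in.readOptionalString();
    }

    @Override
    public void writeTo(StreamOutput out) throws IOException {
        // writeOptionalString() writes a presence marker first, so null is safe here.
        out.writeOptionalString(reason);
    }

    // Callers that cannot handle null get a stable placeholder instead.
    public String reason() {
        return reason == null ? "unknown" : reason;
    }
}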

Maybe I should just export all elasticsearch logs for that day and mail them to you?

@eirc that would be very helpful. Thanks!

@imotov
Copy link
Contributor

imotov commented Jul 27, 2017

Just a quick update. @ywelsch and I discussed the issue and came up with a plan for how to modify testMasterShutdownDuringFailedSnapshot to potentially reproduce the issue. I will try implementing it.

imotov added a commit to imotov/elasticsearch that referenced this issue Jul 27, 2017
…red in CS

The failure reason for snapshot shard failures might not be propagated properly if the master node changes after the errors were reported by other data nodes. This commit ensures that the snapshot shard failure reason is preserved properly and adds a workaround for reading old snapshot files where this information might not have been preserved.

Closes elastic#25878
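
(A rough illustration of the read-side workaround described in the commit message: when a failure reason is missing from an older snapshot file, substitute a placeholder instead of propagating null. The helper below is my own sketch, not code from the actual change.)

import org.elasticsearch.common.Nullable;

// Illustrative helper: never let a missing failure reason surface as null.
public final class FailureReasonCompat {

    private FailureReasonCompat() {}

    public static String orPlaceholder(@Nullable String reasonFromSnapshotFile) {
        // Snapshot files written before the fix may not contain a reason at all.
        return reasonFromSnapshotFile != null
                ? reasonFromSnapshotFile
                : "unknown failure reason (written by an older version)";
    }
}
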
imotov added a commit that referenced this issue Jul 28, 2017
…red in CS (#25941)

The failure reason for snapshot shard failures might not be propagated properly if the master node changes after the errors were reported by other data nodes. This commit ensures that the snapshot shard failure reason is preserved properly and adds a workaround for reading old snapshot files where this information might not have been preserved.

Closes #25878
imotov added a commit that referenced this issue Aug 3, 2017
…red in CS (#25941)

The failure reason for snapshot shard failures might not be propagated properly if the master node changes after the errors were reported by other data nodes. This commit ensures that the snapshot shard failure reason is preserved properly and adds a workaround for reading old snapshot files where this information might not have been preserved.

Closes #25878
imotov added a commit that referenced this issue Aug 3, 2017
…red in CS (#25941)

The failure reason for snapshot shard failures might not be propagated properly if the master node changes after the errors were reported by other data nodes. This commit ensures that the snapshot shard failure reason is preserved properly and adds a workaround for reading old snapshot files where this information might not have been preserved.

Closes #25878
imotov added a commit that referenced this issue Aug 3, 2017
Updating the version in the SnapshotsInProgress serialization method to reflect that #25941 was backported to 6.0.0-beta1.

Relates to #25878
imotov added a commit that referenced this issue Aug 14, 2017
…red in CS (#26127)

The failure reasons for snapshot shard failures might not be propagated properly if the master node changes after errors were reported by other data nodes, which causes them to be stored as null in snapshot files. This commit adds a workaround for reading such snapshot files where this information might not have been preserved and makes sure that the reason is not null if it gets the cluster state from another master. This is a partial backport of #25941 to 5.6.

Closes #25878