NullPointerException when getting list of snapshots on S3 #25878
Could you paste the full stack trace from the Elasticsearch server logs?
There are no more logs for the null pointer entry. There are a ton of logs for the headers and each of the ~2K requests; do you want me to post those? All of them responded with 200 OK though.
These should be the logs from the last request before the null pointer. I tried to censor out any possibly sensitive info. Maybe the returned payload was what triggered the issue?
FYI, we use a coordinating node and 3 data nodes. I send the snapshot requests to the coordinating node, and all the S3 requests seem to originate from the data node that's currently the master (10.127.1.203). Some more logs: I see ~1k of these logs 15 sec after the start of the request and ~500 at the end:
These pop up between requests:
These are the things logged on the master node around the time the coordinating node logged the exception (excluding the freeing search context logs mentioned above):
Hmm, I don't see any smoking gun here. I am not really sure how to move forward with this without knowing where this NPE occurs or being able to reproduce this issue locally.
OK, as I understand it there should have been a stack trace after the "caused by" line, right? Maybe we can look into why that's not present and then we'll have more info for the specific issue? Also there's that
@eirc you said that
but the timestamps from these logs are
@eirc, @ywelsch and I discussed this more and we have a couple of other things we would like you to try:
Regarding the time discrepancies: the NPE happens every time I request a listing. At 12:27 I had debug logging on, so that's why most of the logs are from that time. 12:01 was probably one of the first tests. The same NPE log appeared at 12:27 and every time I did a listing request.
OK, now there's some light at the end of the tunnel! First, if I get the listing from the master node, it actually works! Requesting on the coordinating (or any other) node fails with that same behaviour. Adding error_trace=true to the request finally yields some useful info:
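For context, the kind of request meant here looks roughly like the following sketch; the repository name and node address are placeholders, not taken from this thread:

```
# Hypothetical example: list all snapshots with full stack traces enabled via the
# generic error_trace query parameter. "my_s3_repo" and the host are assumptions.
curl -s 'http://coordinating-node:9200/_snapshot/my_s3_repo/_all?pretty&error_trace=true'
```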
Here's the formatted stack trace for your convenience:
@eirc any chance you can email me the output that you get from master? My mail is igor at elastic.co. If not, could you try getting one snapshot at a time on the coordinating node and checking what's different between the snapshots that can be retrieved and the snapshots that cause this NPE? By the way, does the coordinating node have a different ES version?
Just confirmed all Elasticsearch nodes are on 5.5.0. Is there a way I can check the plugin versions? When I upgraded the stack I remember I had to remove and reinstall plugins to get the proper versions. I'll make a script to pull each snapshot individually and see which one(s) are breaking now.
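A minimal sketch of such a script, assuming a repository called my_s3_repo and that the snapshot names are fetched from the master node (where the listing still works) before each one is requested individually through the coordinating node; plugin versions can be checked with the _cat/plugins API:

```
#!/usr/bin/env bash
# Sketch only: the host names and repository name are assumptions, not from this thread.
MASTER="http://10.127.1.203:9200"      # data node that is currently master
COORD="http://coordinating-node:9200"  # coordinating node where the NPE occurs
REPO="my_s3_repo"

# Show installed plugins and their versions on every node.
curl -s "$COORD/_cat/plugins?v"

# Fetch the snapshot names from the master (listing works there),
# then request each snapshot individually via the coordinating node.
for SNAP in $(curl -s "$MASTER/_snapshot/$REPO/_all" | jq -r '.snapshots[].snapshot'); do
  STATUS=$(curl -s -o /dev/null -w '%{http_code}' "$COORD/_snapshot/$REPO/$SNAP")
  echo "$SNAP -> HTTP $STATUS"
done
```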
In 5.5.0 all plugins should be 5.5.0; otherwise, Elasticsearch wouldn't work. In any case, based on what we know so far, I don't think it's a plugin-related issue. Our current theory is that the snapshot info serialization code breaks on one or more snapshots that you have in your repository. However, we just reviewed this code and couldn't find any obvious issues. That's why we would like to figure out which snapshot information the master is trying to send to the coordinating node, in order to reproduce and fix the problem.
I emailed you the full snapshot list. My script
No, scratch that: there is a single snapshot which produces the NPE when I get it on its own.
Here is the JSON I can get from the master but not from other nodes:
Interestingly, this snapshot includes that
🏆 I deleted the offending snapshot and the listing now works! 🥇 If you need any more info on the "bug" itself I'll be happy to provide it. Also, my issue is solved, but I'll leave this for you to close in case you want to follow the thread deeper.
Thanks @eirc. We have found the line that is causing this NPE. We are just doing some root cause analysis at the moment to see if there is more to it. It's definitely a bug. Thanks a lot for the very detailed information and your willingness to work with us on it!
@eirc I spent some time trying to reproduce the issue, but no matter what I try I cannot get my snapshot into the state where it produces
My current best guess is that the index I mentioned we deleted (wsj-iis) was deleted during the backup process and maybe that mucked things up somehow. I can check the logs from that time for more concrete info, but that has to wait until tomorrow when I get back to work :)
Yes, deletion of indices during a snapshot is the first thing I tried. It produces slightly different snapshot info that doesn't contain any nulls. It seems that I am missing some key ingredient here. I am done for today as well, but it would be awesome if you could check the logs tomorrow.
The issue I see is that the code incorrectly assumes that
I think the easiest fix for now would be to assume that reason is Nullable and adapt the serialization code accordingly. WDYT @imotov?
Seems like that index was actually deleted a few days later after all, so that was probably a red herring. OK, there's a huge spike of logs during that snapshot's creation time; I'll try to aggregate what I see as most important:
Related to the snapshot itself (i.e. searching for "20170720085856"): 29 occurrences of
and 2 of
no index state found: 1702 occurrences of the following from one data node:
and one similar from another host.
unexpected error while indexing monitoring document: a spike of ~2.5k of those at the start of the snapshot:
and a similar number of those at the end of the snapshot:
node not connected: got 9 of those, with at least one for each node.
Exception when closing export bulk: 3 of those.
I'm sure there's more stuff in there, but I don't know if this actually helps you: the log volume is so high that I can't make sense of it, so I may be missing the important ones. Is there something more specific I could search for that would help? Maybe I should just export all the Elasticsearch logs for that day and mail them to you?
@ywelsch Yes, fixing it like this would be easy; I just didn't want to assume anything. I wanted to have a test that creates this problem so we can fix it for sure, which is why I spent some time trying to reproduce it. You are right about it being null in SnapshotsInProgress, and I tried to reproduce it this way, but it looks like it's a completely different path that doesn't get resolved into a shard failure object, so this seems to be a dead end. I think the ABORTED path is more promising, and after thinking about it for a while, I think the scenario is: a snapshot gets stuck on a master, gets aborted, then another master takes over and somehow generates these nulls. The problem with this scenario is that if a snapshot is aborted, it should be deleted afterwards. So, based on the information that @eirc provided, it looks like it might be a stuck snapshot combined with some sort of node failure that prevented the aborted snapshot from being cleaned up, which might be quite difficult to reproduce.
@eirc that would be very helpful. Thanks!
Just a quick update. @ywelsch and I discussed the issue and came up with a plan for how to modify
…red in CS The failure reason for snapshot shard failures might not be propagated properly if the master node changes after the errors were reported by other data nodes. This commit ensures that the snapshot shard failure reason is preserved properly and adds a workaround for reading old snapshot files where this information might not have been preserved. Closes elastic#25878
…red in CS (#25941) The failure reason for snapshot shard failures might not be propagated properly if the master node changes after the errors were reported by other data nodes. This commit ensures that the snapshot shard failure reason is preserved properly and adds a workaround for reading old snapshot files where this information might not have been preserved. Closes #25878
…red in CS (#26127) The failure reasons for snapshot shard failures might not be propagated properly if the master node changes after errors were reported by other data nodes, which causes them to be stored as null in snapshot files. This commit adds a workaround for reading such snapshot files where this information might not have been preserved and makes sure that the reason is not null if it gets the cluster state from another master. This is a partial backport of #25941 to 5.6. Closes #25878
Elasticsearch version: 5.5.0 (issue first appeared while on 5.4.1)
Plugins installed: [x-pack, repository-s3]
JVM version (java -version):
openjdk version "1.8.0_131"
OpenJDK Runtime Environment (build 1.8.0_131-8u131-b11-1~bpo8+1-b11)
OpenJDK 64-Bit Server VM (build 25.131-b11, mixed mode)
OS version (uname -a if on a Unix-like system):
Linux ip-10-127-1-159 3.16.0-4-amd64 #1 SMP Debian 3.16.39-1 (2016-12-30) x86_64 GNU/Linux
Description of the problem including expected versus actual behavior:
Trying to get the list of available snapshots on an S3 backed repository fails with NullPointerException.
Elasticsearch logs:
I use curator to take the backups; after grabbing backups successfully, it fails when it tries to delete old snapshots, because that's when it requires a list too:
I have a feeling this is due to some kind of timeout. I turned on debug logging, and while I couldn't find a more specific reason for the failure, I noticed it made ~2K requests to S3 before failing and stopped at 90 seconds. Is this a configurable timeout?
In the past getting a list of snapshots took increasingly long but it eventually responded. Now it breaks earlier than that.
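For reference, the listing call that fails is essentially a plain snapshot list request; a sketch, with the repository name assumed:

```
# Hypothetical repository name; this is the kind of request that fails with the NullPointerException.
curl -s 'http://localhost:9200/_snapshot/my_s3_repo/_all?pretty'
```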
Also posted on the forums: https://discuss.elastic.co/t/nullpointerexception-when-getting-list-of-snapshots-on-s3/94458