IndexWriter should clean up unreferenced files when segment merge fails due to disk full #12228
as soon as you open a new IndexWriter on the index, it will delete the unnecessary files. i don't think IndexWriter should try to be a superhero and do dangerous things such as deleting files when handling exceptions.
@rmuir Thanks for the response. For non-tragic exceptions, IndexWriter already deletes unreferenced files; with the above commit, I believe it does not delete unreferenced files on a tragedy. We also tried to delete unreferenced files after a segment merge failure by creating a new IndexWriter instance. The issue is that the segment merge thread inside Lucene releases the lock asynchronously, after the tragedy event is set and the response is returned to OpenSearch. So there was a race condition: when we tried to create the new IndexWriter instance, it failed because the previous instance (on which we called force merge) had not released the lock yet. Let me know if you have any suggestions.
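The race above (the old writer's `write.lock` is released asynchronously after the tragedy) can be worked around by retrying the open with a short backoff. A minimal sketch, assuming a retry wrapper around the open call; all names here are illustrative, and in real code the catch would target `LockObtainFailedException` rather than a generic exception:

```java
import java.util.concurrent.Callable;

// Hypothetical helper: retry an operation (e.g. opening a new IndexWriter)
// that can fail transiently while the previous writer still holds the lock.
public final class LockRetry {

    public static <T> T withRetry(Callable<T> open, int maxAttempts, long backoffMillis)
            throws Exception {
        Exception last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return open.call();  // e.g. () -> new IndexWriter(dir, config)
            } catch (Exception e) {  // real code: catch LockObtainFailedException
                last = e;
                if (attempt < maxAttempts) {
                    // Give the merge thread time to release the lock.
                    Thread.sleep(backoffMillis);
                }
            }
        }
        throw last;  // lock never became available within maxAttempts
    }
}
```

This is a band-aid for the race, not a fix: it still relies on the merge thread eventually releasing the lock, and picks arbitrary attempt/backoff values.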
on a tragedy it is unsafe to delete files or anything like that. Tragic exception means just that, time to shut down. For example it can be triggered by
I agree that exceptions like disk full do not necessarily mean there is an issue with the system/disk, and that operations like deleting files should be allowed. An IOException thrown due to disk full should not be considered a tragedy, as it causes unreferenced files (some of which are huge) to continue to occupy space on the node even though the index is not available for query/update. If wasting IO/CPU cycles is the only concern, as mentioned in this issue, why not modify the merge policy itself to disallow segment merges when enough space is not available? Let me know your thoughts.
After reviewing the original issue again, I think the current behavior is correct: it should be tragic and require human intervention to free up enough space. Treating it as non-fatal would just mean that resources are wasted, as @mikemccand describes on the original issue. Modifying the merge policy to "do nothing" when there is no disk space available is not a good solution: you can't just pretend there is no problem and continue indexing without any merges happening. This will break down (not just performance; you'll run out of open file handles and hit other problems too).
Description
Current Issue
Currently, if a segment merge/force merge fails because of disk full, IndexWriter does not clean up the unreferenced files created during that merge. I believe this change got introduced as a part of this patch. Lucene considers an I/O exception caused by disk full a tragedy and does not clean up unreferenced files, to avoid wasting IO/CPU cycles (a segment merge retry could fill up the disk again).
The issue is that these unreferenced files can be huge, and they unnecessarily continue to occupy space on the node. Further, if the disk fills to 100% on a node, OpenSearch marks it as unhealthy. This is one of the major reasons for node drops (unhealthy nodes) for our customers.
Solution
Lucene should delete these unreferenced files when a segment merge fails (OpenSearch cannot handle this itself, because Lucene closes the IndexWriter on a tragic exception). To avoid wasting IO/CPU cycles, the merge policy can be changed to disallow segment merges when enough space is not available (since a segment merge can fill up the disk).
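The proposed merge-policy guard can be sketched with only `java.nio.file`: before starting a merge, check that the index directory's file store has enough usable space for the estimated merged-segment size plus headroom. The helper name and the 2x threshold are assumptions for illustration, not an existing Lucene policy:

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical guard: skip (or defer) a merge when the file store backing
// the index directory lacks room for the merge output.
public final class DiskSpaceGuard {

    public static boolean hasRoomForMerge(Path indexDir, long estimatedMergeBytes)
            throws IOException {
        FileStore store = Files.getFileStore(indexDir);
        // Require roughly twice the estimated merge size as headroom:
        // old and new segment files coexist until the old ones are deleted.
        return store.getUsableSpace() >= 2 * estimatedMergeBytes;
    }
}
```

In a real merge policy this check would wrap `findMerges`/`findForcedMerges`, returning no merge candidates when the check fails; as the discussion above notes, merges cannot be skipped indefinitely without running into open-file-handle limits.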