
IndexWriter should clean up unreferenced files when segment merge fails due to disk full #12228

Closed
RS146BIJAY opened this issue Apr 10, 2023 · 5 comments

RS146BIJAY commented Apr 10, 2023

Description

Current Issue
Currently, if a segment merge/force merge fails because the disk is full, IndexWriter does not clean up the unreferenced files created during that merge. I believe this behavior was introduced as part of this patch. Lucene treats an I/O exception caused by a full disk as a tragedy and skips cleaning up unreferenced files to avoid wasting I/O and CPU cycles (which could happen if a merge retry fills up the disk again).

The issue is that these unreferenced files can be huge, and they unnecessarily continue to occupy space on the node. Further, if the disk on a node becomes 100% full, OpenSearch marks the node as unhealthy. This is one of the major reasons for node drops (unhealthy nodes) for our customers.

Solution

Lucene should delete these unreferenced files when a segment merge fails (OpenSearch cannot handle this itself because Lucene closes the IndexWriter on a tragic exception). To avoid wasting I/O and CPU cycles, we could change the merge policy to disallow a segment merge when not enough space is available (since a segment merge can fill up the disk).


rmuir commented Apr 10, 2023

As soon as you open a new IndexWriter on the index, it will delete the unnecessary files.

I don't think IndexWriter should try to be a superhero and do dangerous things such as deleting files while handling exceptions.


RS146BIJAY commented Apr 10, 2023

@rmuir Thanks for the response. For non-tragic exceptions, IndexWriter still deletes unreferenced files even now; with the above commit, I believe it no longer deletes unreferenced files after a tragedy.

Also, we tried to delete the unreferenced files after a segment merge failure by creating a new IndexWriter instance. The issue here is that the segment merge thread inside Lucene releases the write lock asynchronously, after the tragic event is set and the response is returned to OpenSearch. So there is a race condition: when we tried to create the new IndexWriter instance, it failed because the previous instance (on which we called force merge) had not released the lock yet.
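One way to work around such a race is to retry the lock acquisition with a bounded backoff until the previous holder releases it. A minimal plain-Java sketch of that pattern (this is not Lucene's actual `write.lock` implementation; `acquireWithRetry` and the timeout values are hypothetical):

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

public class LockRetry {
    /**
     * Try to acquire an exclusive lock on lockFile, retrying with a fixed
     * backoff until maxWaitMillis elapses. Returns the lock, or null if the
     * previous holder never released it in time.
     */
    static FileLock acquireWithRetry(Path lockFile, long maxWaitMillis, long backoffMillis)
            throws IOException, InterruptedException {
        long deadline = System.currentTimeMillis() + maxWaitMillis;
        while (true) {
            FileChannel ch = FileChannel.open(lockFile,
                    StandardOpenOption.CREATE, StandardOpenOption.WRITE);
            FileLock lock = ch.tryLock();
            if (lock != null) {
                return lock; // acquired; the caller releases the lock and closes the channel
            }
            ch.close();
            if (System.currentTimeMillis() >= deadline) {
                return null; // previous writer still holds the lock
            }
            Thread.sleep(backoffMillis);
        }
    }

    public static void main(String[] args) throws Exception {
        Path lockFile = Files.createTempFile("write", ".lock");
        FileLock lock = acquireWithRetry(lockFile, 5000, 100);
        System.out.println(lock != null ? "acquired" : "timed out");
        if (lock != null) {
            lock.release();
            lock.channel().close();
        }
        Files.deleteIfExists(lockFile);
    }
}
```

Of course, a retry only helps when the old writer releases the lock eventually; it does not address the underlying question of whether the cleanup should happen at all.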

Let me know if you have any suggestions


rmuir commented Apr 10, 2023

On a tragedy it is unsafe to delete files or do anything like that. A tragic exception means just that: time to shut down. For example, it can be triggered by a VirtualMachineError, which means it is unsafe to trust the state of the JVM, so we need to avoid doing anything destructive such as deleting files.

@RS146BIJAY RS146BIJAY changed the title IndexWriter should clean up unreferenced files when segment merge fails IndexWriter should clean up unreferenced files when segment merge fails due to disk full Apr 10, 2023

RS146BIJAY commented Apr 10, 2023

I agree that exceptions like VirtualMachineError should be considered tragic, and we should avoid deleting files in those cases. My main concern is why a segment merge failure due to disk full is considered a tragedy (updated the description).

A full disk does not necessarily mean there is a problem with the system/disk, so an operation like deleting files should still be allowed. An IOException thrown because the disk is full should not be treated as a tragedy, since that causes unreferenced files (some of them huge) to keep occupying space on the node even though they are not available for query/update.

If wasting I/O/CPU cycles is the only concern, as mentioned in this issue, why not modify the merge policy itself to disallow a segment merge when enough space is not available?
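Such a merge-policy gate could, in principle, be driven by a simple usable-space check on the file store backing the index. A plain-Java sketch of just the check (in real Lucene this would live inside a MergePolicy, e.g. a wrapper that returns no merge candidates when space is low; `hasRoomForMerge` and the size estimate below are hypothetical, not Lucene API):

```java
import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;

public class DiskSpaceGate {
    /**
     * Return true only if the file store backing indexPath has at least
     * requiredBytes of usable space, e.g. an estimate of the merged
     * segment size plus some headroom.
     */
    static boolean hasRoomForMerge(Path indexPath, long requiredBytes) throws IOException {
        FileStore store = Files.getFileStore(indexPath);
        return store.getUsableSpace() >= requiredBytes;
    }

    public static void main(String[] args) throws IOException {
        Path indexDir = Files.createTempDirectory("index");
        // A merge needs roughly the sum of the input segment sizes free on disk.
        long estimatedMergeBytes = 10L * 1024 * 1024; // hypothetical 10 MB estimate
        System.out.println(hasRoomForMerge(indexDir, estimatedMergeBytes)
                ? "merge allowed" : "merge skipped");
    }
}
```

Estimating the merged size up front is the hard part; an under-estimate would still let the disk fill mid-merge, which is presumably why this is not a complete answer on its own.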

Let me know your thoughts.


rmuir commented Apr 10, 2023

After reviewing the original issue again, I think the current behavior is correct: it should be tragic, and it requires human intervention to free up enough space. Treating it as non-fatal would just mean that resources are wasted, as @mikemccand describes on the original issue.

Modifying the merge policy to "do nothing" when no disk space is available is not a good solution: you can't just pretend there is no problem and keep indexing without any merges happening. That will break down (not just performance: you'll run out of open file handles and hit other problems too).
