[RFC] Cleanup unreferenced files in case segment merge fails #8024
Thanks @RS146BIJAY for this. Option 2 makes sense, as judging from the issue you opened, Option 1 doesn't look like Lucene's problem to solve.
For remote store enabled indices, we will always trigger recovery from remote. However, since all the data is written durably in the translog and segment store, we should be good here. But we should verify this behavior for both remote-enabled and disabled indices.
Thanks @gbbafna for the feedback. Will verify this behaviour while working on the POC.
Would appreciate feedback from @shwetathareja, @nknize, @Bukhtawar, @sachinpkale, @muralikpbhat, and @reta as well, so tagging more folks for visibility. Thanks!
If the disk fills up, then pending writes for all shards writing to the given volume will fail. (Well, at least flushing those pending writes will fail.) Can we just close and reopen the IndexWriter for all shards (on the given volume)? That should clean up the unused files.
Actually, we did explore that solution, but it seemed we would need to handle a lot of edge cases, and the approach became quite complicated. Following is the approach along with its pros and cons:

Fail just the IndexWriter

Another approach to handle unreferenced file cleanup is to allow only the `IndexWriter` to close, without failing the entire shard (i.e., without failing the engine). We would perform a check in this function whether the merge failure is because of no space left on the device. If it is, we would skip failing the engine and create a new instance of `IndexWriter`.

There are a few edge cases which we need to consider for this approach. OpenSearch periodically refreshes the reader to provide "near real-time" searching, so writes made while the `IndexWriter` is closed would not become searchable. Also, during the time when the `IndexWriter` is closed and before it is reopened, reader refresh should not be allowed, because a refresh during this period will throw an `AlreadyClosedException`.

Pros
Cons
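A minimal sketch of the disk-full check this approach relies on, assuming the failure surfaces as a standard JDK `IOException` in the cause chain; `isDiskFull` and `MergeFailureInspector` are hypothetical names for illustration, not existing OpenSearch code:

```java
import java.io.IOException;

// Hypothetical helper (not actual OpenSearch code): walk the cause chain of a
// merge failure and report whether it was caused by the disk filling up.
final class MergeFailureInspector {
    static boolean isDiskFull(Throwable failure) {
        for (Throwable cause = failure; cause != null; cause = cause.getCause()) {
            // ENOSPC typically surfaces as an IOException with this message on Linux
            if (cause instanceof IOException
                    && cause.getMessage() != null
                    && cause.getMessage().contains("No space left on device")) {
                return true;
            }
        }
        return false;
    }
}
```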
@RS146BIJAY after reading the Apache Lucene issue threads, I would agree with @gbbafna that the second option (failing the shard + cleanup inside OpenSearch) is (to me) the way to move forward, with the caveat:
As was highlighted there (apache/lucene#12228 (comment)), we should not ignore the failure but do a hard failure (keep the shard UNASSIGNED? move the index to read-only?)
Thanks @RS146BIJAY for the proposal and for evaluating the different approaches. Option 2 makes sense, considering Option 1 is not feasible and simply reopening the IndexWriter has multiple edge cases. To @reta's point:
+1, if the disk is indeed full, then we shouldn't continue to take writes while segment merges are failing / stopped by the merge policy in the background. Also, this error needs to surface to the user, indicating that merges have been stopped because not enough space is available for them.
Thanks @reta and @shwetathareja for the feedback. Just one query: why do we think we need to do a hard failure after cleanup? Until cleanup happens and shard recovery completes, we won't be performing any writes. Also, we are avoiding subsequent merges after cleanup by modifying the merge policy to disallow merges if enough space is not present. Is it because the number of segments can grow in this approach, since merges will not happen after a certain point in time?
That is correct (merges are vital for Apache Lucene).
Can we put a write block on merge failure till we clean up unreferenced files, to avoid piling up more merges? Now, if we don't fail the shard, will it continue to go into loops of merge failures? If yes, at some point we need to re-allocate or, worst case, fail the shard. If we reclaim enough space, we remove the block; else we let the DiskThresholdMonitor lazily remove it.
Overview
OpenSearch periodically merges multiple smaller segments into larger segments to keep the index size at bay and to expunge deletes. A merge operation requires at least 3x the space of the segments being merged (up to 3x the shard size in the case of a force merge with `max_num_segments = 1`); for example, merging segments totalling 10 GB can require around 30 GB of free space. If enough space is not available for the segment merge to go through, the merge can fail at any intermediate step. Unreferenced files created during a failed segment merge can take up a lot of space, leading to a full data volume and potential node drops.
Unreferenced Files
In Lucene, unreferenced files are shard files which are no longer live and used by the shard: no Lucene commit (`segments_N` file) references them, nor are they actively updated by the `IndexWriter`. When a segment merge fails because the disk is full, multiple unreferenced files can be generated, and they continue to occupy a lot of space on the data volume. Lucene intentionally does not delete these unreferenced files, because it marks a disk-full failure as a tragedy. Lucene avoids wasting unnecessary IO/CPU cycles this way: if Lucene cleaned up these files, the same segment merge would be retried, filling up the space and clearing unreferenced files again, in a loop that wastes CPU/IO cycles.
Proposed Solutions
This document analyses multiple proposals to clean up these unreferenced files in case a segment merge fails.
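To make the definition concrete, here is a minimal sketch, assuming Lucene 9.x APIs, that lists the files in a shard directory not referenced by the latest commit point. Note this is an approximation for illustration only: Lucene's own `IndexFileDeleter` tracks all commit points and in-flight files, not just the latest `segments_N`.

```java
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

import java.io.IOException;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

public class UnreferencedFilesLister {
    // List files not referenced by the latest segments_N commit point.
    static List<String> unreferencedFiles(Directory dir) throws IOException {
        SegmentInfos latestCommit = SegmentInfos.readLatestCommit(dir);
        // files(true) includes the segments_N file itself
        Collection<String> referenced = latestCommit.files(true);
        List<String> unreferenced = new ArrayList<>();
        for (String file : dir.listAll()) {
            // the write lock belongs to no commit; skip it
            if (!referenced.contains(file) && !file.equals("write.lock")) {
                unreferenced.add(file);
            }
        }
        return unreferenced;
    }

    public static void main(String[] args) throws IOException {
        try (Directory dir = FSDirectory.open(Paths.get(args[0]))) {
            System.out.println(unreferencedFiles(dir));
        }
    }
}
```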
Approach 1 (Inside Lucene)
One approach is for Lucene to handle the cleanup of unreferenced files when a segment merge fails because of a full disk. As discussed above, when a segment merge fails because of a full disk, Lucene considers it a tragedy. In such scenarios, Lucene skips deleting the unreferenced files and closes the IndexWriter; later, OpenSearch closes the corresponding shard.
Since the disk getting 100% full does not necessarily mean the system cannot recover (unlike, say, a VirtualMachineError), an IOException caused by a full disk should not be tragic. Instead, in such a scenario, Lucene should delete the unreferenced files generated during the failed segment merge.
Once cleanup completes, Lucene can close the IndexWriter instance as usual to avoid any parallel operation (parallel writes, relocation, etc.) that could change the shard state during unreferenced-file cleanup. After this, OpenSearch will close the InternalEngine and mark the shard as unassigned, and then trigger a local shard recovery, which will reopen the InternalEngine and IndexWriter, bringing the shard state back to STARTED.
To avoid the above operations getting into a loop (segment merge → clean up → segment merge), we can further change the Lucene merge policy to disallow segment merges when enough space is not available, as sketched below. This will prevent wasted IO/CPU cycles.
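A minimal sketch of such a merge-policy change, assuming Lucene's `FilterMergePolicy` as the extension point; `DiskSpaceAwareMergePolicy` and `minFreeBytes` are hypothetical names, not an existing Lucene or OpenSearch class or setting:

```java
import org.apache.lucene.index.FilterMergePolicy;
import org.apache.lucene.index.MergePolicy;
import org.apache.lucene.index.MergeTrigger;
import org.apache.lucene.index.SegmentInfos;

import java.io.IOException;
import java.nio.file.FileStore;
import java.nio.file.Files;
import java.nio.file.Path;

public class DiskSpaceAwareMergePolicy extends FilterMergePolicy {
    private final Path dataPath;      // shard data directory to check
    private final long minFreeBytes;  // required free space before merging

    public DiskSpaceAwareMergePolicy(MergePolicy in, Path dataPath, long minFreeBytes) {
        super(in);
        this.dataPath = dataPath;
        this.minFreeBytes = minFreeBytes;
    }

    @Override
    public MergeSpecification findMerges(MergeTrigger mergeTrigger, SegmentInfos segmentInfos,
                                         MergeContext mergeContext) throws IOException {
        FileStore store = Files.getFileStore(dataPath);
        if (store.getUsableSpace() < minFreeBytes) {
            // returning null tells Lucene there is nothing to merge right now
            return null;
        }
        return in.findMerges(mergeTrigger, segmentInfos, mergeContext);
    }
}
```

This sketch only gates natural (background) merges; a complete version would apply the same check to forced merges and forced-deletes merges as well.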
Discussion Thread
apache/lucene#12228
Pros
Cons
Approach 2 (Inside OpenSearch)
Another approach is to handle unreferenced file cleanup within OpenSearch itself. In this approach, OpenSearch will perform the cleanup once the shard is closed and marked as unassigned, at the end of the `failEngine` function inside `InternalEngine`. Before we clean the unreferenced files, we will validate whether the engine failed because the data volume filled up during a segment merge. If yes, we will perform the cleanup.
Since only Lucene has information about which files are unreferenced (Lucene maintains this info inside `IndexFileDeleter`), we will remove unreferenced files inside OpenSearch by creating an instance of `IndexWriter` inside a try-with-resources block (this approach is already used in OpenSearch); a sketch follows below. Internally, this creates a new instance of `IndexFileDeleter`, which removes the files not referenced by any of the commits.
Once the shard is closed and cleanup is completed, OpenSearch marks this shard as unassigned. Since this is a cluster state change (shard state changed to unassigned) and there is already a valid copy of the shard on the node, OpenSearch will trigger a local recovery for this shard. Once recovery is completed and the translog is replayed on this shard, the shard state is marked as `STARTED`.
To prevent the cleanup operation and segment merges from getting into a loop (segment merge → clean up → segment merge), we can further change the OpenSearch merge policy to disallow segment merges when enough space is not available. This will prevent wasted IO/CPU cycles.
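A minimal sketch of the cleanup step described above, assuming a plain Lucene `Directory` handle for the shard; `cleanupUnreferencedFiles` is an illustrative name, not the actual OpenSearch method:

```java
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;

import java.io.IOException;

final class UnreferencedFileCleanup {
    // Open and immediately close an IndexWriter so that Lucene's internal
    // IndexFileDeleter removes files not referenced by any commit point.
    static void cleanupUnreferencedFiles(Directory directory) throws IOException {
        IndexWriterConfig config = new IndexWriterConfig()
                .setOpenMode(IndexWriterConfig.OpenMode.APPEND) // keep existing commits
                .setCommitOnClose(false);                       // don't write a new commit on close
        try (IndexWriter ignored = new IndexWriter(directory, config)) {
            // intentionally empty: constructing the writer triggers the cleanup
        }
    }
}
```

Because deleting unreferenced files frees space rather than consuming it, this step can run even on a volume that is already full.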
Pros
Cons
How Can You Help?
Next Steps
We will incorporate feedback and continue with more concrete prototypes.