Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up temporary files created during segment merge incase force segment merge fails #6324

Merged
merged 1 commit into from
Feb 15, 2023

Conversation

RS146BIJAY
Copy link
Contributor

@RS146BIJAY RS146BIJAY commented Feb 14, 2023

Description

Currently, during segment merge (both during auto merge and force merge), OpenSearch creates multiple temporary files while merging individual segment components. Once individual components of the segments are merged, IndexWriter creates a Compound file (*.cfs) containing all the Lucene components for a segment. At the end, OpenSearch deletes these temporary files.

Now, in case these temporary files consume all the space on the node (because enough amount of space is unavailable for segment merge to go through), segment merge will fail. Also, the files which got created during segment merge do not get cleaned up and it continues to occupy space on the node. This can cause FSHealthService checks to fail for this node, ultimately causing these nodes to be removed from cluster.

In case Force segment merge fails, OpenSearch should remove the temporary files created during segment merge.

Issues Resolved

#5710

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@pranikum
Copy link
Contributor

Looks ok.

@@ -2009,6 +2009,7 @@ public void forceMerge(
throw ex;
} catch (Exception e) {
try {
indexWriter.flush();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flush() calls internally flush(boolean triggerMerge, boolean applyAllDeletes). In cases where the forceMerge is failing, should we just do a flush(false, true) rather than flush(true, true)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

flush(false, true) is not a public function. We cannot call it from outside InternalEngine (Lucene).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, yes. May be let's just check and confirm that the triggerMerge is not triggering merge again after the temporary files clean up.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should move this below maybeFailEngine("force merge", e); . In corruption cases, flush is not going to help out .

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flush is optional for a given force merge . Does it guarantee for clean up the temporary files ?

Copy link
Contributor Author

@RS146BIJAY RS146BIJAY Feb 15, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As discussed offline, InternalEngine flush() is different from IndexWriter flush(). Optional force merge option is for InternalEngine flush(). Deletion of temporary new files is happening inside IndexWriter deleteNewFiles which is called from IndexWriter flush().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed the comments.

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@gbbafna gbbafna merged commit 6b35c32 into opensearch-project:main Feb 15, 2023
@gbbafna gbbafna added the backport 2.x Backport to 2.x branch label Feb 15, 2023
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-6324-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 6b35c32a9ede76d6e8aca79095e42b9ec67d22c1
# Push it to GitHub
git push --set-upstream origin backport/backport-6324-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-6324-to-2.x.

@@ -2010,6 +2010,7 @@ public void forceMerge(
} catch (Exception e) {
try {
maybeFailEngine("force merge", e);
indexWriter.flush();
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We are invoking a second merge within the context of a failed merge, which might be tricky and non-deterministic. Essentially we are saying that a second merge(merge + delete new files) will succeed when the first one failed.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Had raised the same - #6324 (comment). @RS146BIJAY you had mentioned that the merge does not get triggered when in exception block. Hoping that we have confirmed the same.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As of now, lucene only allows a flush call where a regular segment merge is triggered after we delete new lucene files. It does not exposes another overload of flush(boolean triggerMerge, boolean applyAllDeletes) (where we can set triggerMerge to false to not allow segment merge along with file deletion).

As per the corresponding Lucene issue link, this is kept intentionally that if we delete some file we perform a regular segment merge after it.

+1 on the patch and agreed that we should only expose a no-argument flush method, it's actually nice that the patch decreased the visibility of flush(boolean, boolean).

I don't think we should add anything but a no-arg flush() here.
If its going to be more engineered than that, then this is the wrong direction: lets just not expose it at all.

@RS146BIJAY RS146BIJAY deleted the segment-merge branch February 21, 2023 06:52
@dblock
Copy link
Member

dblock commented Feb 28, 2023

Did you want to backport this to 2.x @RS146BIJAY? Will need to be done manually if it wasn't / link that PR.

@andrross
Copy link
Member

This was reverted in #6403. The issue (#5710) remains open. @RS146BIJAY what is the status here?

@RS146BIJAY
Copy link
Contributor Author

RS146BIJAY commented Mar 2, 2023

Yes. There were few other issues started popping up because of this fix. So we reverted this change on the main branch. We are analysing right now how to send the fix for this issue correctly. @andrross @dblock

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants