Add circuit breaker support for file cache #6591

Merged (1 commit) on Apr 5, 2023

@kotwanikunal kotwanikunal commented Mar 8, 2023

Description

  • Adds circuit breaker logic for file cache
  • The logic reuses one of the existing child circuit breakers to check whether the parent breaker has exceeded the threshold defined in HierarchyCircuitBreakerService

Two possible solutions were assessed:

  1. Add a new child memory breaker that keeps track of memory usage by the file cache
  2. Use a pre-existing child breaker to keep track of total memory usage (tracked by the parent breaker)

Approach 1 would be ideal if cache entries had a fixed size, but the size varies with the platform architecture, OS, and JVM.

Approach 2 solves the problem by keeping track of total memory usage instead of individual entries, and checking the breaker as soon as an entry is added.
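
To illustrate Approach 2, here is a minimal sketch of why a zero-byte reservation on a child breaker is enough to consult the parent. The classes below (ParentBreakerSketch, ChildBreakerSketch, BreakerTrippedException) are simplified stand-ins written for this example and are not the OpenSearch implementations; the LongSupplier that reports total usage is likewise an assumption.

```java
import java.util.function.LongSupplier;

/** Stand-in for a breaker trip; not the OpenSearch CircuitBreakingException. */
final class BreakerTrippedException extends RuntimeException {
    BreakerTrippedException(String message) {
        super(message);
    }
}

/** Hypothetical parent breaker: compares total memory usage against one limit. */
final class ParentBreakerSketch {
    private final long limitBytes;
    private final LongSupplier totalUsageBytes; // assumption: e.g. current heap usage

    ParentBreakerSketch(long limitBytes, LongSupplier totalUsageBytes) {
        this.limitBytes = limitBytes;
        this.totalUsageBytes = totalUsageBytes;
    }

    void checkLimit(long newBytes, String label) {
        long wouldBe = totalUsageBytes.getAsLong() + newBytes;
        if (wouldBe > limitBytes) {
            throw new BreakerTrippedException(
                "[parent] data for [" + label + "] would be " + wouldBe + " bytes, limit is " + limitBytes
            );
        }
    }
}

/** Hypothetical child breaker: every reservation is also validated by the parent. */
final class ChildBreakerSketch {
    private final ParentBreakerSketch parent;
    private long usedBytes;

    ChildBreakerSketch(ParentBreakerSketch parent) {
        this.parent = parent;
    }

    void addEstimateBytesAndMaybeBreak(long bytes, String label) {
        parent.checkLimit(bytes, label); // throws before anything is reserved
        usedBytes += bytes;
    }

    long usedBytes() {
        return usedBytes;
    }
}
```

In this sketch, calling addEstimateBytesAndMaybeBreak(0, "filecache_entry") reserves nothing on the child but still throws when total usage is already over the parent limit, which is the effect the description relies on.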


Issues Resolved

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@@ -586,6 +586,8 @@ protected Node(
 pluginCircuitBreakers,
 settingsModule.getClusterSettings()
 );
+// File cache will be initialized by the node once circuit breakers are in place.
+nodeEnvironment.initializeFileCache(settings, circuitBreakerService.getBreaker(CircuitBreaker.REQUEST));
Member Author

The initialization had to be moved here because it depends on the CircuitBreakerService, which is initialized at this point. I realize this is not ideal, but the initialization order forces it.
Please chime in if you have a better approach.

Member

Digging into this a bit, it seems to me that NodeEnvironment should not hold a fileCache instance and instead should just hold information about the path of the file cache, as NodeEnvironment is defined as "A component that holds all data paths for a single node."

I think this can be refactored as follows:

  • Remove the actual file cache instance from NodeEnvironment
  • Create the FileCache instance in Node.java, where you are currently calling nodeEnvironment.initializeFileCache
  • In Node.java, create a FileCacheCleaner instance using the FileCache instance (there is nothing per-index about the cleaner, so I think a single instance is fine). Pass that FileCacheCleaner instance to IndicesService to wire it up as a listener.
  • Profit!

What do you think?
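
The refactoring proposed in the list above could be wired roughly as follows. Every class here carries a Sketch suffix because it is a hypothetical stand-in invented for this example; the constructors, the listener interface, and the onIndexDeleted callback are assumptions and do not reflect the actual OpenSearch signatures.

```java
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

/** Stand-in for NodeEnvironment: holds only the cache path, never the cache itself. */
final class NodeEnvironmentSketch {
    private final Path fileCachePath;

    NodeEnvironmentSketch(Path fileCachePath) {
        this.fileCachePath = fileCachePath;
    }

    Path fileCachePath() {
        return fileCachePath;
    }
}

/** Stand-in for the file cache; only records where it lives and how big it may grow. */
final class FileCacheSketch {
    private final Path root;
    private final long capacityBytes;

    FileCacheSketch(Path root, long capacityBytes) {
        this.root = root;
        this.capacityBytes = capacityBytes;
    }

    Path root() {
        return root;
    }

    long capacityBytes() {
        return capacityBytes;
    }
}

/** Hypothetical listener interface for index lifecycle events. */
interface IndexEventListenerSketch {
    void onIndexDeleted(String indexName);
}

/** Single, node-wide cleaner that reacts to index deletions. */
final class FileCacheCleanerSketch implements IndexEventListenerSketch {
    private final FileCacheSketch cache;
    private final List<String> cleanedIndices = new ArrayList<>();

    FileCacheCleanerSketch(FileCacheSketch cache) {
        this.cache = cache;
    }

    @Override
    public void onIndexDeleted(String indexName) {
        // A real cleaner would remove the index's entries under cache.root().
        cleanedIndices.add(indexName);
    }

    List<String> cleanedIndices() {
        return cleanedIndices;
    }
}

/** Stand-in for IndicesService: notifies registered listeners. */
final class IndicesServiceSketch {
    private final List<IndexEventListenerSketch> listeners = new ArrayList<>();

    void addListener(IndexEventListenerSketch listener) {
        listeners.add(listener);
    }

    void deleteIndex(String indexName) {
        for (IndexEventListenerSketch listener : listeners) {
            listener.onIndexDeleted(indexName);
        }
    }
}

/** Stand-in for Node: owns the cache and the cleaner, and does the wiring. */
final class NodeSketch {
    final FileCacheSketch fileCache;
    final FileCacheCleanerSketch cleaner;
    final IndicesServiceSketch indicesService;

    NodeSketch(NodeEnvironmentSketch environment, long capacityBytes) {
        this.fileCache = new FileCacheSketch(environment.fileCachePath(), capacityBytes);
        this.cleaner = new FileCacheCleanerSketch(fileCache);
        this.indicesService = new IndicesServiceSketch();
        this.indicesService.addListener(cleaner);
    }
}
```

The point of this arrangement is that the environment object stays a plain holder of paths, while the node decides when the cache is created, which also makes it straightforward to create the cache only after the circuit breakers exist.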

github-actions bot commented Mar 9, 2023

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.repositories.s3.S3BlobContainerRetriesTests.testReadRangeBlobWithRetries

codecov-commenter commented Mar 9, 2023

Codecov Report

Merging #6591 (028a7b0) into main (b2b2b67) will decrease coverage by 0.03%.
The diff coverage is 100.00%.

@@             Coverage Diff              @@
##               main    #6591      +/-   ##
============================================
- Coverage     70.82%   70.79%   -0.03%     
- Complexity    59160    59166       +6     
============================================
  Files          4803     4803              
  Lines        283188   283200      +12     
  Branches      40837    40838       +1     
============================================
- Hits         200574   200502      -72     
- Misses        66129    66214      +85     
+ Partials      16485    16484       -1     
Impacted Files | Coverage Δ
.../main/java/org/opensearch/env/NodeEnvironment.java | 78.82% <100.00%> (+0.89%) ⬆️
...search/index/store/remote/filecache/FileCache.java | 86.66% <100.00%> (+3.80%) ⬆️
...index/store/remote/filecache/FileCacheFactory.java | 93.75% <100.00%> (ø)
server/src/main/java/org/opensearch/node/Node.java | 83.59% <100.00%> (+0.02%) ⬆️

... and 492 files with indirect coverage changes

@andrross andrross left a comment

The problem with this implementation is that I could have zero active usage of the file cache, but non-active usage of a bajillion, and the circuit breaker would start to fail things but there would be nothing to start evicting the non-active entries in the cache even though they are eligible for eviction.

-return theCache.put(filePath, indexInput);
+CachedIndexInput cachedIndexInput = theCache.put(filePath, indexInput);
+// This operation ensures that the PARENT breaker is not tripped
+circuitBreaker.addEstimateBytesAndMaybeBreak(0, "filecache_entry");
Member

This fails after inserting the thing into the cache?

Member Author

Updated the logic so that file cache (FC) operations keep working after a circuit breaker trip.

@kotwanikunal
Member Author

> The problem with this implementation is that I could have zero active usage of the file cache, but non-active usage of a bajillion, and the circuit breaker would start to fail things but there would be nothing to start evicting the non-active entries in the cache even though they are eligible for eviction.

In that case, we will be designing for two size constraints: storage (defined by the capacity of the cache) and heap memory.
My idea here was that we would start failing requests so that the client notices and fixes the heap configuration (ideally by upsizing).

If we want the auto-fixing behavior, we have no choice but to track the size of entries going into the cache, which is how I had originally designed it.

I think it will be something along the lines of:

put() {
    try {
        circuitBreaker.addEstimateBytesAndMaybeBreak(<estimated_entry_size>, "filecache_entry");
    } catch (CircuitBreakingException e) {
        this.prune();
    }
    ...
    ...
}

The really messy part is estimating the size of an entry, which can vary with the platform, JVM, OS, and so on.
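
As a rough, self-contained sketch of the prune-on-trip idea above: the cache below reserves an estimated size per entry, evicts inactive entries when the breaker trips, and retries once. AccountingBreaker, PruningCacheSketch, and TrippedException are hypothetical stand-ins, the per-entry size is simply passed in by the caller, and the single retry is an assumption of this example rather than anything decided in the discussion.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

/** Stand-in for a breaker trip; not the OpenSearch CircuitBreakingException. */
final class TrippedException extends RuntimeException {
    TrippedException(String message) {
        super(message);
    }
}

/** Hypothetical breaker that accounts reserved bytes against a fixed limit. */
final class AccountingBreaker {
    private final long limitBytes;
    private long usedBytes;

    AccountingBreaker(long limitBytes) {
        this.limitBytes = limitBytes;
    }

    void addEstimateBytesAndMaybeBreak(long bytes, String label) {
        if (usedBytes + bytes > limitBytes) {
            throw new TrippedException("[" + label + "] would exceed " + limitBytes + " bytes");
        }
        usedBytes += bytes;
    }

    void release(long bytes) {
        usedBytes -= bytes;
    }

    long usedBytes() {
        return usedBytes;
    }
}

/** Hypothetical cache that evicts inactive entries when the breaker trips. */
final class PruningCacheSketch {
    private static final class Entry {
        final long sizeBytes;
        final boolean active;

        Entry(long sizeBytes, boolean active) {
            this.sizeBytes = sizeBytes;
            this.active = active;
        }
    }

    private final Map<String, Entry> entries = new LinkedHashMap<>();
    private final AccountingBreaker breaker;

    PruningCacheSketch(AccountingBreaker breaker) {
        this.breaker = breaker;
    }

    void put(String key, long estimatedEntrySize, boolean active) {
        try {
            breaker.addEstimateBytesAndMaybeBreak(estimatedEntrySize, "filecache_entry");
        } catch (TrippedException e) {
            prune();
            // Retry once; if the breaker still trips, the exception reaches the caller.
            breaker.addEstimateBytesAndMaybeBreak(estimatedEntrySize, "filecache_entry");
        }
        Entry previous = entries.put(key, new Entry(estimatedEntrySize, active));
        if (previous != null) {
            breaker.release(previous.sizeBytes); // a replaced entry no longer counts
        }
    }

    /** Removes every inactive entry and returns its bytes to the breaker. */
    void prune() {
        Iterator<Entry> it = entries.values().iterator();
        while (it.hasNext()) {
            Entry entry = it.next();
            if (!entry.active) {
                it.remove();
                breaker.release(entry.sizeBytes);
            }
        }
    }

    boolean contains(String key) {
        return entries.containsKey(key);
    }

    int size() {
        return entries.size();
    }
}
```

Requests fail only when pruning cannot free enough memory, that is, when the remaining entries are all active. The accuracy of the whole scheme still depends on how well the entry size is estimated.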

@github-actions

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness

@kotwanikunal
Member Author

@reta Any thoughts on this?

/**
* Returns the current {@link FileCacheStats}
*/
public FileCacheStats fileCacheStats() {
Member

This seems like a much better home for this method than NodeEnvironment, but we now do have both FileCache::stats and FileCache::fileCacheStats methods here. Perhaps something to revisit in a subsequent PR.

Member Author

Yep. I noticed that and thought the same. We can clean it up in another PR, but this looked like the right place for it.

* If the user doesn't configure the cache size, it fails if the node is a data + search node.
* Else it configures the size to 80% of available capacity for a dedicated search node, if not explicitly defined.
*/
public void initializeFileCache(Settings settings, CircuitBreaker circuitBreaker) throws IOException {
Member

private?

Member Author

Thanks. Updated.

/**
* Returns the {@link FileCache} instance for remote search node
*/
public FileCache fileCache() {
Member

Is this only exposed for testing? If so, is there a way to make it protected or package-private? Otherwise, at least comment that it is exposed only for testing.

Member Author

Added a comment; it is visible across packages for testing purposes.

@andrross andrross added the backport 2.x Backport to 2.x branch label Apr 4, 2023
@kotwanikunal kotwanikunal merged commit 06128a9 into opensearch-project:main Apr 5, 2023
@opensearch-trigger-bot

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 128

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add ../.worktrees/backport-2.x 2.x
# Navigate to the new working tree
pushd ../.worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-6591-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 06128a904964ce43813ebc8164417a5633d414a2
# Push it to GitHub
git push --set-upstream origin backport/backport-6591-to-2.x
# Go back to the original working tree
popd
# Delete the working tree
git worktree remove ../.worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-6591-to-2.x.

@kotwanikunal kotwanikunal added backport 2.x Backport to 2.x branch and removed backport 2.x Backport to 2.x branch labels Apr 5, 2023
mitrofmep pushed a commit to mitrofmep/OpenSearch that referenced this pull request Apr 5, 2023
kotwanikunal added a commit to kotwanikunal/OpenSearch that referenced this pull request Apr 5, 2023
@kotwanikunal
Member Author

Backport PR: #7011

kotwanikunal added a commit to kotwanikunal/OpenSearch that referenced this pull request Apr 5, 2023
kotwanikunal added a commit that referenced this pull request Apr 5, 2023
(cherry picked from commit 06128a9)
Signed-off-by: Kunal Kotwani <[email protected]>
@kotwanikunal kotwanikunal deleted the circuit-breaker branch June 12, 2023 19:02
Labels: backport 2.x (Backport to 2.x branch), skip-changelog

Successfully merging this pull request may close these issues:
Add circuit breaker on heap memory usage for file cache
3 participants