[RFC] Support for writable warm indices on OpenSearch #12809
Comments
Thanks for creating this.
Keeping the file cache the same for hot/warm indices sounds great, but currently the cache has locking semantics. Will that impact the performance of existing hot indices, especially around search, since writes will mean creating new segments at each refresh interval? I think answering this will help us decide whether FileCache should track hot index files as well or not.

One benefit of a common FileCache is managing the entire local disk space dynamically between hot/warm indices instead of defining static configs. But we can probably achieve that by keeping a common tracking component for disk usage across hot/warm indices. FileCache can then manage the warm indices' local files, provide mechanisms to trigger eviction when invoked by this tracking component, and expose usage statistics. These can later be used by disk-usage-based backpressure monitors or other consumers like migration components to decide whether more indices can be moved to the warm tier or not (in both dedicated and non-dedicated setups).
Will this be dependent on the index type or on a specific file of an index? For
It's not clear whether this component will be used to prefetch these files before making a shard active, or whether the prefetch will be asynchronous and the shard activated irrespective of whether the files have been prefetched. One useful exercise would be to evaluate which shard-level data/metadata is a must-have to make the shard active and ready to serve different types of requests, such as search/indexing and requests for segment/index-level stats. Otherwise, these requests may end up waiting on the download/prefetch to complete (which could take time when multiple shards are prefetching) and eventually fail even though the shard is deemed active.
Will users still be able to create hot remote-store-backed indices without the composite directory, or will we make the composite directory the default for such indices along with a mechanism to migrate existing indices to use it?
We'll need to move away from the Segmented Cache with its LRU-level write lock. We can use file-level locking mechanisms instead, which will ensure that there isn't any contention.
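As an illustration only (not the actual FileCache implementation), a minimal sketch of what file-level locking could look like: one lock per file, created lazily, so readers never contend on a cache-segment-wide write lock. All names here are hypothetical.

```java
import java.nio.file.Path;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReadWriteLock;
import java.util.concurrent.locks.ReentrantReadWriteLock;
import java.util.function.Supplier;

// Illustrative sketch: per-file read/write locks instead of a segment-wide LRU write lock.
class FileLockRegistry {

    private final Map<Path, ReadWriteLock> locks = new ConcurrentHashMap<>();

    // Readers of the same file proceed concurrently; readers of different files never interact.
    <T> T withReadLock(Path file, Supplier<T> action) {
        ReadWriteLock lock = locks.computeIfAbsent(file, p -> new ReentrantReadWriteLock());
        lock.readLock().lock();
        try {
            return action.get();
        } finally {
            lock.readLock().unlock();
        }
    }

    // Only per-file state changes (eviction, pinning, deletion) take the write lock.
    void withWriteLock(Path file, Runnable action) {
        ReadWriteLock lock = locks.computeIfAbsent(file, p -> new ReentrantReadWriteLock());
        lock.writeLock().lock();
        try {
            action.run();
        } finally {
            lock.writeLock().unlock();
        }
    }
}
```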
Regarding the common tracking component for disk usage, that is definitely one alternative we were considering. It can be discussed in more detail when we start working on supporting warm indices without dedicated warm nodes.
Makes sense. We're starting with the File Cache tracking only warm files for now. Exposing cache stats is also already planned and can be consumed by the mechanisms you mentioned above.
Yes, this mostly depends on the state of the file. Apart from the 2 IndexInput that you mentioned,
The intention was to encapsulate the prefetch-related logic here. For essential data, we'll start by not prefetching anything and then iterate on it based on performance benchmarking. This should help in deterministically defining the essential files to prefetch before marking a shard as active.
We'll need to make the composite directory the default if we really want to support seamless hot/warm migration. One way to do this is behind a cluster-level setting which enables it for all hot remote-backed indices.
From my understanding, making the composite directory the default for hot/warm indices vs using FileCache for both are 2 different, independent items. I am fine with keeping composite as the default (while allowing existing indices to seamlessly move towards that). However, for FileCache let's call out this alternative so we can evaluate both. With file-level locks there could be additional overhead from one lock per file and managing those (a benchmark will probably help make these choices). So if the other alternative doesn't have any serious cons then we can use that.
I was assuming
Got it. This seems more like a FileCache semantic: providing a pinning mechanism for some of the keys. Can it be implemented in the cache itself instead of being exposed to the IndexInput layer? That way the mechanism will be generic and independent of the value type stored in the cache.
Sure
@ankitkala @sohami Regarding the locking semantics of the FileCache, we need to ensure we're not over-designing here. Conceptually, the file cache is an in-memory structure that maintains a small amount of state for each file on disk (a size, a reference count, some kind of ordering to implement LRU, etc). We have a microbenchmark that shows we can get over 1000 operations per millisecond out of a single segment of this cache. My intuition is that the overhead of maintaining the file cache accounting is pretty negligible in the grand scheme of things.
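For scale, a sketch of the kind of per-file state the comment describes: roughly a size, a reference count, and a timestamp used for LRU ordering. The class and field names are hypothetical, not the actual FileCache entry.

```java
// Illustrative only: the small per-file accounting record described above.
final class FileCacheEntry {
    final long sizeInBytes;   // charged against the node-level cache capacity
    int refCount;             // > 0 means the file is in use and not evictable
    long lastAccessNanos;     // ordering key used to pick LRU victims

    FileCacheEntry(long sizeInBytes) {
        this.sizeInBytes = sizeInBytes;
        this.refCount = 1;
        this.lastAccessNanos = System.nanoTime();
    }
}
```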
@ankitkala will a new codec be added to support the writable warm tier? From reading this, I feel we might be adding a new codec which will mostly reuse the Lucene codec but will also expose functions for doing prefetch. Is this understanding correct?
Is your feature request related to a problem? Please describe
OpenSearch supports remote backed storage, which allows users to back up the cluster's data on a remote storage, thus providing a stronger durability guarantee. The data is always `hot` (present locally on disk) and the remote storage acts as a backup. We can also leverage the remote store to support shard-level data tiering where all data is not guaranteed to be present locally but instead can be fetched at runtime on demand (let's say a `warm` shard). This will allow storage to scale separately from compute, and users will be able to store more data with the same number of nodes.

Benefits:
- Allows tiering data across `hot`, `warm` & `cold` (cold being data archived on a durable storage; exact terminologies can be decided later)

Describe the solution you'd like
What is Writable warm:
A warm index is an OpenSearch index whose entire data is not guaranteed to be present locally on disk. The shards are always open & assigned on one of the eligible nodes, the index is active, and its metadata is part of the cluster state.
Possible configurations:
- Dedicated warm nodes (e.g. nodes with a `warm` role). This should help improve resource isolation between hot and warm shards and thus lower the blast radius.

Scope:
Includes:
Excludes (will be covered as a separate task):
Core Components:
Composite Directory:
The Directory abstraction provides a set of common APIs for creating, deleting, and modifying files within the directory, regardless of the underlying storage mechanism. The Composite Directory will abstract out all the data tiering logic and make the Lucene index (aka shard) agnostic of data locality.
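As a rough illustration of the idea (not the actual OpenSearch implementation), a composite directory could extend Lucene's FilterDirectory and decide per file whether to serve a read from the local delegate or through the remote store. The FileCache and RemoteStore types below are stand-in interfaces, not real classes.

```java
import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FilterDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

// Illustrative sketch: keep the Directory contract, route reads based on data locality.
public class CompositeDirectory extends FilterDirectory {

    private final FileCache fileCache;
    private final RemoteStore remoteStore;

    public CompositeDirectory(Directory localDelegate, FileCache fileCache, RemoteStore remoteStore) {
        super(localDelegate);
        this.fileCache = fileCache;
        this.remoteStore = remoteStore;
    }

    @Override
    public IndexInput openInput(String name, IOContext context) throws IOException {
        if (fileCache.isPresentLocally(name)) {
            // Fully local file: delegate straight to the wrapped local directory.
            return in.openInput(name, context);
        }
        // Warm file: return an input that pulls blocks from the remote store on demand
        // and registers them with the file cache so they become evictable local files.
        return remoteStore.openBlockInput(name, context, fileCache);
    }

    // Stand-in abstractions for the node-level file cache and remote store integration.
    interface FileCache { boolean isPresentLocally(String fileName); }
    interface RemoteStore { IndexInput openBlockInput(String name, IOContext ctx, FileCache cache) throws IOException; }
}
```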
File Cache:
The File Cache will track the Lucene files for all shards (using the composite directory) at the node level. An implementation of FileCache is already present (from the searchable snapshot integration). We'll need to build support for additional features like cache eviction, file pinning, tracking hot indices, etc.
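A minimal sketch of what those additional features could look like, assuming a hypothetical node-level cache with capacity-based LRU eviction, pinning, and a usage-stats accessor. This is illustrative only; the real FileCache structure and APIs may differ.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch of capacity-based eviction, pinning, and usage stats.
class NodeFileCache {

    private final long capacityBytes;
    private final AtomicLong usedBytes = new AtomicLong();
    private final Map<String, Entry> entries = new ConcurrentHashMap<>();

    record Entry(long sizeBytes, boolean pinned, long lastAccessNanos) {}

    NodeFileCache(long capacityBytes) { this.capacityBytes = capacityBytes; }

    void put(String file, long sizeBytes) {
        entries.put(file, new Entry(sizeBytes, false, System.nanoTime()));
        if (usedBytes.addAndGet(sizeBytes) > capacityBytes) {
            evictLeastRecentlyUsed();
        }
    }

    // Pinned files are skipped by eviction (the file pinning feature mentioned above).
    void pin(String file) {
        entries.computeIfPresent(file, (f, e) -> new Entry(e.sizeBytes(), true, e.lastAccessNanos()));
    }

    // Usage stats that a disk-usage tracker or migration component could consume.
    long usedBytes() { return usedBytes.get(); }

    private void evictLeastRecentlyUsed() {
        entries.entrySet().stream()
            .filter(e -> !e.getValue().pinned())
            .min((a, b) -> Long.compare(a.getValue().lastAccessNanos(), b.getValue().lastAccessNanos()))
            .ifPresent(victim -> {
                entries.remove(victim.getKey());
                usedBytes.addAndGet(-victim.getValue().sizeBytes());
                // The actual on-disk file would be deleted here.
            });
    }
}
```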
Index Inputs:
For all read operations, the composite directory should be able to create/send different index inputs, for example depending on whether the file is fully present locally or must be fetched on demand (see the sketch below).
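The concrete IndexInput variants referenced in the discussion aren't listed in this capture, so the following is only an illustrative sketch of one plausible variant: a block-based input that fetches fixed-size blocks from the remote store on first access. The class and method names are hypothetical.

```java
import java.io.IOException;
import org.apache.lucene.store.IndexInput;

// Illustrative only: a block-based IndexInput for warm files. A concrete subclass would
// implement the usual IndexInput/DataInput methods and call fetchBlock() whenever the
// requested offset falls in a block that is not yet present in the local file cache.
abstract class OnDemandBlockIndexInput extends IndexInput {

    private final long blockSizeBytes;

    protected OnDemandBlockIndexInput(String resourceDescription, long blockSizeBytes) {
        super(resourceDescription);
        this.blockSizeBytes = blockSizeBytes;
    }

    // Download the given block from the remote store into the local file cache.
    protected abstract void fetchBlock(long blockIndex) throws IOException;

    // Maps an absolute file offset to the block that contains it.
    protected final long blockIndexForOffset(long offset) {
        return offset / blockSizeBytes;
    }
}
```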
Prefetcher:
To maintain optimal read & write performance, we want to cache some of the data to improve various shard-level operations (read, write, flush, merges, etc.). This component should be able to trigger prefetching of certain files/blocks when the shard opens or the replica refreshes.
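A hedged sketch of how such a prefetcher might be wired up, assuming a pluggable policy that decides which files/blocks to warm and an asynchronous downloader. All names are hypothetical.

```java
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutorService;

// Illustrative wiring: the policy picks what to warm, downloads run asynchronously.
class Prefetcher {

    interface PrefetchPolicy { List<String> filesToPrefetch(String shardId); }
    interface BlockDownloader { void download(String shardId, String file); }

    private final PrefetchPolicy policy;
    private final BlockDownloader downloader;
    private final ExecutorService executor;

    Prefetcher(PrefetchPolicy policy, BlockDownloader downloader, ExecutorService executor) {
        this.policy = policy;
        this.downloader = downloader;
        this.executor = executor;
    }

    // Returning a future leaves the trade-off from the discussion open: activate the
    // shard immediately, or wait for the essential files before marking it active.
    CompletableFuture<Void> onShardOpen(String shardId) {
        return CompletableFuture.allOf(
            policy.filesToPrefetch(shardId).stream()
                .map(f -> CompletableFuture.runAsync(() -> downloader.download(shardId, f), executor))
                .toArray(CompletableFuture[]::new));
    }
}
```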
User flows:
High level changes (more details to be added in individual issues):
Performance optimizations:
Other considerations:
Hot/warm migration strategy:
To be able to provide the best user experience for data tiering (hot/warm/cold), we should be able to seamlessly migrate the cluster from hot to warm. To support this, here are a few key decisions:
- Keep the same engines for warm shards (`InternalEngine` for primary & `NRTReplicationEngine` for replica). We might still need to change a few characteristics for a warm shard (e.g. refresh interval, translog rollover duration, merge policy thresholds, etc.); see the settings sketch below.

Note: Migration strategy to be covered in a separate issue.
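For illustration, the kinds of per-index knobs that might differ for a warm shard could be expressed as index settings. The specific values below are placeholders, not recommendations, and whether these are the right knobs would come out of benchmarking.

```java
import org.opensearch.common.settings.Settings;

// Illustrative only: example overrides a warm shard might apply, values are placeholders.
public class WarmIndexSettingsExample {
    public static Settings warmOverrides() {
        return Settings.builder()
            .put("index.refresh_interval", "60s")                // refresh less often, fewer small segments
            .put("index.translog.flush_threshold_size", "1gb")   // less frequent translog-driven flushes
            .put("index.merge.policy.segments_per_tier", 5)      // keep the segment count on remote store low
            .build();
    }
}
```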
Supporting custom codecs
One caveat with prefetching is that to prefetch certain blocks you might need to read the files using methods which are not really exposed by the respective readers in Lucene. A few examples:
- Prefetching the `StoredFields` for certain documents, which requires getting the file offset for the document ID.

These use cases require us to support methods which aren't exposed, thus requiring us to maintain Lucene-version-specific logic to support the prefetch operation.
This restriction also breaks the compatibility of the prefetch logic with custom plugins. To get around this, we could disable prefetch for custom codecs, but that would essentially restrict users from using other codecs. Alternatively (preferred), we can ask the codec owners to support the prefetch logic: we can expose the required methods behind a new plugin interface that codec owners implement if they want to support warm indices.
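A sketch of what such a plugin interface might look like. This interface does not exist today; the method names are purely illustrative of the kind of information (e.g. the stored-fields offset for a document ID) the prefetcher would need a codec to expose.

```java
import java.io.IOException;
import java.util.Set;

// Illustrative only: a hypothetical interface a codec owner could implement to make
// their codec prefetch-aware for warm indices.
public interface PrefetchSupportingCodec {

    // Which file extensions of a segment this codec can compute prefetch offsets for.
    Set<String> prefetchableExtensions();

    // File offset and length of the block that must be resident locally before the
    // given document's stored fields can be read without a remote round trip.
    BlockRange storedFieldsRange(String segmentName, int docId) throws IOException;

    record BlockRange(long offset, long length) {}
}
```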
Future Enhancements
Here are some enhancements to explore after the performance benchmarking.
- `TieredMergePolicy` with separate merge settings for warm shards. Or, if required, we can explore a different policy that minimizes the number of segments on the remote store (see the sketch below).
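As an illustration of the first option, a warm shard could be given its own TieredMergePolicy thresholds so merges produce fewer, larger segments on the remote store. The numbers below are placeholders to be tuned via benchmarking, not proposed defaults.

```java
import org.apache.lucene.index.TieredMergePolicy;

// Illustrative only: warm-shard-specific thresholds on Lucene's TieredMergePolicy.
public class WarmMergePolicyExample {
    public static TieredMergePolicy warmMergePolicy() {
        TieredMergePolicy policy = new TieredMergePolicy();
        policy.setSegmentsPerTier(4.0);          // allow fewer segments per tier than the default
        policy.setMaxMergedSegmentMB(10 * 1024); // permit larger merged segments (10 GB here)
        policy.setFloorSegmentMB(16.0);          // merge away tiny segments quickly
        return policy;
    }
}
```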
FAQs:
Q: How are files evicted from disk?
Files can be evicted from disk due to:
Q: How would merges work for warm shard?
Merges can require downloading files from the remote store. Eventually we'd want support for offline merges on a dedicated node (related issue).
Q: How to determine the essential data to be fetched for a shard?
We can actively track the most-used block files based on FileCache statistics. This information can be retained in the remote store, which gives us the set of essential files to download while initializing a shard.
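A minimal sketch of such tracking, assuming a hypothetical component that counts block-file accesses reported by the file cache and periodically persists the hottest entries alongside the shard's remote metadata (persistence itself is omitted here).

```java
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.LongAdder;

// Illustrative only: access counting used to derive the essential-files set for a shard.
class EssentialFileTracker {

    private final Map<String, LongAdder> accessCounts = new ConcurrentHashMap<>();

    // Called by the file cache on every block-file access.
    void onCacheAccess(String blockFileName) {
        accessCounts.computeIfAbsent(blockFileName, k -> new LongAdder()).increment();
    }

    // Top-N most accessed block files; this list would be written with the shard's
    // remote metadata and read back when the shard is initialized elsewhere.
    List<String> essentialFiles(int n) {
        return accessCounts.entrySet().stream()
            .sorted(Comparator.comparingLong((Map.Entry<String, LongAdder> e) -> e.getValue().sum()).reversed())
            .limit(n)
            .map(Map.Entry::getKey)
            .toList();
    }
}
```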