Can we somehow make the engine's LiveVersionMap tracking optional? #19787
I think we can disable the live-version map since it has basically 2 purposes:
I think we can make it an option to have a no-op version map that simply doesn't do anything.
Even in the append-only use case, I was under the impression that versions were still necessary due to replication having a situation where a document could be sent more than once during network trouble, and
The version will always be loaded from the index in that case, so that should be fine. The only problem is if you are using deletes here. If you do that, we need to use the version map.
Okay, thanks for explaining!
One option would be to keep only the deletes in the map and therefore use less memory, or no memory at all in the append-only case. I think we can make this work: if realtime GET is disabled somehow, it should not make any difference.
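To make the "deletes only" idea concrete, here is a minimal Python sketch, not Elasticsearch code. The class and method names (`DeletesOnlyVersionMap`, `on_index`, `on_delete`) are hypothetical, and the sketch assumes that versions of live documents can always be looked up elsewhere, so it deliberately ignores documents that were indexed but not yet refreshed.

```python
from typing import Dict, Optional


class DeletesOnlyVersionMap:
    """Hypothetical version map that remembers only delete tombstones."""

    def __init__(self) -> None:
        self.tombstones: Dict[str, int] = {}

    def on_index(self, doc_id: str, version: int) -> None:
        # Index operations are not tracked; they only clear an older tombstone.
        self.tombstones.pop(doc_id, None)

    def on_delete(self, doc_id: str, version: int) -> None:
        # Deletes are the only operations that occupy memory here.
        self.tombstones[doc_id] = version

    def deleted_version(self, doc_id: str) -> Optional[int]:
        return self.tombstones.get(doc_id)

    def entry_count(self) -> int:
        return len(self.tombstones)


if __name__ == "__main__":
    vm = DeletesOnlyVersionMap()
    for i in range(1000):
        vm.on_index(f"doc-{i}", version=1)
    print("entries after append-only indexing:", vm.entry_count())
    vm.on_delete("doc-7", version=2)
    print("entries after one delete:", vm.entry_count())
```

In an append-only workload this map stays empty, which is the memory saving the comment is after; whether that is safe in practice depends on the assumptions stated above.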
Maybe we should simply remove real-time get? Is near-real-time get really not good enough? We default refresh to every 1s. Or users can use "wait for refresh" (#17986), coming in 5.0.0.
What about making it a per-index setting that defaults to disabled? I can think of some use cases for our plugins that make/made use of realtime get.
I guess what we all missed here is that documents that are not yet refreshed are held in the version map, since we can't load their versions from the index yet. I am sorry, but it won't be that simple. :(
Simple it won't be indeed. I wrote up the result of previous discussions in a new meta issue: #19813. I think we can close this one in favor of that, but I'll let @mikemccand make that call :)
OK good, I'll close this issue; thanks @bleskes.
Actually, I think we should have separate issues to track the individual improvements: getting ES back to the indexing performance of raw Lucene is going to be a big project, with many separate improvements. We can use the meta issue #19813 to track overall progress, but I think we should keep separate issues like this one and #19913 open to track progress of each small step.
Today we do a lot of accounting inside the engine to maintain locations of documents inside the transaction log. This is only needed to ensure we can return the document's source from the engine if it hasn't been refreshed. Aside from the added complexity of reading from the currently writing translog and maintaining pointers into the translog, this also caused inconsistencies, like different values of the `_ttl` field depending on whether it was read from the tlog or not. TermVectors are totally different if the document is fetched from the translog, since copy fields are ignored, etc. This change will simply call `refresh` if the document's latest version is not in the index. This streamlines the semantics of the `_get` API and allows for more optimizations inside the engine and on the transaction log. Note: `_refresh` is only called iff the requested document is not refreshed yet but has recently been updated or added. Relates to #19787
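The "refresh only if the document is not refreshed yet" behavior described above can be illustrated with a toy Python model. This is a sketch under stated assumptions, not the engine's implementation: `ToyEngine`, its `buffered` and `searchable` dictionaries, and `refresh_count` are all hypothetical names standing in for the version map, the refreshed index, and a refresh counter.

```python
from typing import Dict, Optional


class ToyEngine:
    """Toy model: a GET triggers a refresh only for not-yet-refreshed docs."""

    def __init__(self) -> None:
        self.buffered: Dict[str, str] = {}    # indexed, not yet refreshed
        self.searchable: Dict[str, str] = {}  # visible after a refresh
        self.refresh_count = 0

    def index(self, doc_id: str, source: str) -> None:
        self.buffered[doc_id] = source

    def refresh(self) -> None:
        # Make all buffered documents visible, then forget them.
        self.searchable.update(self.buffered)
        self.buffered.clear()
        self.refresh_count += 1

    def get(self, doc_id: str) -> Optional[str]:
        # Refresh only when the latest version is not in the index yet.
        if doc_id in self.buffered:
            self.refresh()
        return self.searchable.get(doc_id)


if __name__ == "__main__":
    engine = ToyEngine()
    engine.index("1", '{"fare": 12.5}')
    print(engine.get("1"), engine.refresh_count)  # first GET forces a refresh
    print(engine.get("1"), engine.refresh_count)  # second GET does not
```

The point of the design is that every read goes through the same refreshed view, so there is no second code path that reads from the transaction log.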
I am going to close this. The LiveVersionMap should have been optional since #27752. Thanks all!
I'm opening this to discuss possible options:
I've been scrutinizing ES indexing performance on the NYC taxi data set (1.2 B taxi rides, numerics heavy: http://www.nyc.gov/html/tlc/html/about/trip_record_data.shtml).
These documents are small (24 fields, though a bit sparse with ~23% cells missing) and are almost entirely numbers (indexed as points + doc values).
As a "ceiling" for indexing performance I also indexed the same data set using Lucene's "thin wrapper" demo server (http://github.com/mikemccand/luceneserver), indexing the same documents as efficiently as I know how (see `indexTaxis.py`).
The demo Lucene server has many differences vs. ES: it has no transaction log (does not periodically fsync), uses `addDocuments` not `updateDocument`, can index from a more compact documents source (190 GB CSV file, vs 512 GB json file for ES), does not add a costly `_uid` field (nor `_version`, `_type`), uses a streaming bulk API, etc. I disabled `_all` and `_source` in ES, but net/net ES is substantially slower than the demo Lucene server.
So, one big thing I noticed that is maybe a lowish hanging fruit is that ES loses a lot of its indexing buffer to `LiveVersionMap`: if I give ES 1 GB indexing buffer, and index into only 1 shard, and disable refresh, the version map is taking ~2/3 of that buffer, leaving only ~1/3 for Lucene's `IndexWriter`.
This also means ES is necessarily doing periodic refresh when I didn't ask it to.
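As a rough worked example of the split described above, the following Python sketch divides a 1 GB (here taken as 1024 MiB) indexing buffer using the approximate 2/3 share observed for the version map. The function name `split_indexing_buffer` is hypothetical, and the 2/3 figure is the estimate from this report, not a fixed property of ES.

```python
from typing import Tuple


def split_indexing_buffer(total_mib: int, vm_num: int, vm_den: int) -> Tuple[int, int]:
    """Return (version_map_mib, index_writer_mib) for a given version map share.

    The share is passed as an integer ratio vm_num/vm_den to avoid
    floating-point rounding.
    """
    version_map_mib = total_mib * vm_num // vm_den
    return version_map_mib, total_mib - version_map_mib


if __name__ == "__main__":
    vm_mib, writer_mib = split_indexing_buffer(1024, 2, 3)
    print(f"version map: ~{vm_mib} MiB, IndexWriter: ~{writer_mib} MiB")
```

Under these assumptions only about a third of the configured buffer is left for Lucene's own in-memory segments.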
This is quite frustrating because I don't need optimistic concurrency here, nor real-time gets, nor refreshes. However, I fear the version map might be required during recovery, to ensure when playing back indexing operations from the transaction log that they do not incorrectly overwrite newer indexing operations? But then, this use case is also append-only, so maybe during recovery we could safely skip that, if the user turns on this new setting.
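The recovery concern can be sketched in Python as follows. This is an illustrative model only, assuming each replayed operation carries a document id and a version; the function `replay` and the tuple layout are hypothetical and do not mirror the actual translog format.

```python
from typing import Dict, Iterable, Tuple

Op = Tuple[str, int, str]  # (doc_id, version, source), hypothetical layout


def replay(ops: Iterable[Op], check_versions: bool = True) -> Dict[str, Tuple[int, str]]:
    """Apply operations in order, optionally skipping stale or duplicate ones."""
    current: Dict[str, Tuple[int, str]] = {}
    for doc_id, version, source in ops:
        existing = current.get(doc_id)
        if check_versions and existing is not None and existing[0] >= version:
            continue  # an equal or newer version is already applied
        current[doc_id] = (version, source)
    return current


if __name__ == "__main__":
    # Updates to the same id arriving out of order during replay.
    updates = [("a", 2, "new"), ("a", 1, "old")]
    print(replay(updates, check_versions=True))   # keeps version 2
    print(replay(updates, check_versions=False))  # stale op wins

    # Append-only: every id appears once, so the check changes nothing.
    appends = [("a", 1, "x"), ("b", 1, "y")]
    print(replay(appends, True) == replay(appends, False))
```

When every id occurs exactly once, the version check never fires, which is why an append-only workload might be able to skip it; a document delivered more than once under the same id would break that assumption.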
The version map makes an entry in a `HashMap` for each document indexed, and the entry stores non-trivial information, creating at least 4 new objects, holding longs/ints, etc. If we can't make it turn-off-able maybe we should instead try to reduce its per-indexing-op overhead...