-
Notifications
You must be signed in to change notification settings - Fork 1.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Don't read posting lists from disk when mutating indices. #3695
Conversation
Adding index mutations can be performed without reading the posting lists in disk. This change modifies the indexing process to avoid reading any list from disk. It should result in performance improvements, specially when trying to modify big index posting lists.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 3 of 3 files at r1.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @martinmr)
posting/lists.go, line 247 at r1 (raw file):
// GetFromDelta gets the cached version of the list without reading from disk // and only applies the existing deltas. This is used in situations where the // posting list will only be modified and not read (e.g adding index mutations).
Add a comment that this is largely a copy of the logic from Get. So, keep this in-sync with above.
Actually, let's refactor this. Internally, call a get func:
posting/lists.go, line 260 at r1 (raw file):
// immutable layer. pl := &List{} pl.key = key
Put it in the constructor:
&List{
key: key,
mutationMap: ...
}
posting/oracle.go, line 83 at r1 (raw file):
// If there's a cache miss, the posting list stored on disk will not be read. // See documentation for LocalCache.GetFromDelta for more info. func (txn *Txn) GetFromDelta(key []byte) (*List, error) {
No need to expose this func.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: all files reviewed, 3 unresolved discussions (waiting on @manishrjain)
posting/lists.go, line 247 at r1 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Add a comment that this is largely a copy of the logic from Get. So, keep this in-sync with above.
Actually, let's refactor this. Internally, call a get func:
Done, refactored both methods into an internal method.
posting/lists.go, line 260 at r1 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
Put it in the constructor:
&List{ key: key, mutationMap: ... }
Done.
posting/oracle.go, line 83 at r1 (raw file):
Previously, manishrjain (Manish R Jain) wrote…
No need to expose this func.
Done.
Full test suite passes: https://teamcity.dgraph.io/viewLog.html?buildId=16792&buildTypeId=Dgraph_CiFullTestSuite |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 2 files reviewed, 4 unresolved discussions (waiting on @manishrjain and @martinmr)
posting/index.go, line 940 at r2 (raw file):
// Ensure that list is in the cache run by txn. Otherwise, nothing would // get updated. pl = txn.cache.Set(string(pl.key), pl)
I don't understand the logic behind setting the result back to pl.
Also it seems the code can be improved so that in builder.Run the posting list is NOT pre-loaded, but lazily loaded whenever it's needed later through txn.Get(key), which takes care of putting the posting list into the txn.cache as well.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 2 files reviewed, 4 unresolved discussions (waiting on @gitlw and @manishrjain)
posting/index.go, line 940 at r2 (raw file):
Previously, gitlw (Lucas Wang) wrote…
I don't understand the logic behind setting the result back to pl.
Also it seems the code can be improved so that in builder.Run the posting list is NOT pre-loaded, but lazily loaded whenever it's needed later through txn.Get(key), which takes care of putting the posting list into the txn.cache as well.
There shouldn't be any functional change. This is just to highlight that Set returns a value.
I have done the change you suggested. I need to check with Manish that the cache will work fine in the case of a rebuild.
Ran some benchmarks by letting the live loader run for 2.5 minutes. With the current changes:
With the current master:
The new version is a bit more than 4X faster. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 2 files reviewed, 4 unresolved discussions (waiting on @gitlw and @manishrjain)
posting/index.go, line 940 at r2 (raw file):
Previously, martinmr (Martin Martinez Rivera) wrote…
There shouldn't be any functional change. This is just to highlight that Set returns a value.
I have done the change you suggested. I need to check with Manish that the cache will work fine in the case of a rebuild.
Actually, I ended up reverting the changes because I need to check with Manish first whether the cache might grow without bounds and cause an OOM.
Also, I measured the performance without and with your suggestion and it's pretty much the same so I don't think it makes a huge difference.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewable status: 0 of 2 files reviewed, 4 unresolved discussions (waiting on @gitlw and @manishrjain)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 2 of 2 files at r4.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @gitlw)
To further #3695 , only read posting lists in certain cases. For e.g., when they are indexed, or when there's a delete operation or when it's a single UID predicate, etc. For the rest, don't read them. This further improves the performance of live data loading. Changes: * Avoid retrieving data keys if we don't need them to execute the mutation. * Fix a test failure by retrieving in case of DEL operation.
Adding index mutations can be performed without reading the posting
lists in disk. This change modifies the indexing process to avoid
reading any list from disk. It should result in performance
improvements, specially when trying to modify big index posting lists.
This change is