Improve AccountID string conversion caching #4181
Conversation
This looks good. I left a suggestion that might help with lock contention.
I'm assuming that the caching is purely to reduce execution time. To that end, I took timing measurements with three different sets of code:
For timing measurements I sorted all of the execution times into buckets. I also summed all of the execution times and computed an average at the very end. I also kept track of the longest execution time. Here are the results I got on my Mac mini:
Both this pull request and @seelabs' code came pretty close to one another. This pull request averaged 2 ns less than @seelabs' version. But the worst case conversion time was significantly shorter with @seelabs' code, presumably because it doesn't take any locks. I think the measurement results are worth considering while we're deciding whether to proceed with this pull request. Just in case anyone has questions about the timing measurement methodology, I've appended the code.
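The appended measurement code is not reproduced here. As a rough sketch only — assuming a simple bucketing scheme, a running total, and a tracked worst case, as described above, and not the code actually used — the instrumentation might look something like this:

```cpp
#include <array>
#include <atomic>
#include <chrono>
#include <cstddef>
#include <cstdint>

// Hypothetical sketch: accumulate per-call timings into coarse buckets,
// plus a running sum and the worst case seen. Bucket boundaries are
// placeholders; `convert` is any callable returning the converted string.
struct ConversionTimer
{
    std::array<std::atomic<std::uint64_t>, 6> buckets{};  // <25, <50, <100, <250, <1000, rest (ns)
    std::atomic<std::uint64_t> totalNs{0};
    std::atomic<std::uint64_t> samples{0};
    std::atomic<std::uint64_t> worstNs{0};

    template <class F>
    auto
    time(F&& convert)
    {
        auto const start = std::chrono::steady_clock::now();
        auto result = convert();
        auto const stop = std::chrono::steady_clock::now();

        auto const ns = static_cast<std::uint64_t>(
            std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
                .count());

        // Pick the first bucket whose upper bound exceeds the sample.
        std::size_t i = 0;
        for (auto const limit : {25u, 50u, 100u, 250u, 1000u})
        {
            if (ns < limit)
                break;
            ++i;
        }
        buckets[i].fetch_add(1, std::memory_order_relaxed);
        totalNs.fetch_add(ns, std::memory_order_relaxed);
        samples.fetch_add(1, std::memory_order_relaxed);

        // Track the longest conversion seen so far.
        for (auto prev = worstNs.load(std::memory_order_relaxed);
             ns > prev &&
             !worstNs.compare_exchange_weak(prev, ns, std::memory_order_relaxed);)
            ;
        return result;
    }
};
```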
Gah. There is, of course, a bug in my timing code. The good news is that it does not obviate the results; all of the results above are simply reported in a unit other than nanoseconds. Here's the fix in case anyone cares:
Having thought about this for a little while, I'm inclined toward adopting @seelabs' change rather than this one. @seelabs' change will improve both encoding and decoding of all base58 values, not just AccountIDs.
I hear what you're saying about @seelabs's change and agree that it makes the encoder better, which, of course, improves all base58 encodings. Taking a step back, I think we agree that converting an AccountID to its base58 form is expensive enough to be worth caching. With that in mind, I decided to tweak this a bit further to eliminate the need for, and overhead of, the unordered_map.
Love the new design! Indeed it is better than using an unordered_map.
@greg7mdp, @scottschurr, @seelabs: I took your feedback to heart and updated the code. I think it's dramatically simpler and cleaner now. I would appreciate your feedback. This PR is now on top of #4201, to leverage the new code introduced there.
Looks great and very simple indeed. I don't know enough about spinlocks to comment on using one instead of std::mutex.
Just a thought: maybe we never need to wait? We could just try to acquire the mutex (try_lock), and if that fails just do the conversion without touching the cache.
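A rough sketch of that idea — with `mutex_`, `cache_`, and `slowToBase58` as hypothetical stand-ins for whatever the real code names them — might look like this:

```cpp
#include <mutex>
#include <string>

// Hypothetical sketch of the try_lock idea: never block on the cache lock.
// `mutex_`, `cache_`, and `slowToBase58` are assumed names, not rippled's API.
std::string
toBase58Cached(AccountID const& id)
{
    if (std::unique_lock lock{mutex_, std::try_to_lock}; lock.owns_lock())
    {
        if (auto const* cached = cache_.lookup(id))  // assumed cache interface
            return *cached;

        auto s = slowToBase58(id);  // the uncached conversion
        cache_.insert(id, s);
        return s;
    }

    // Lock is contended: skip the cache entirely rather than waiting.
    return slowToBase58(id);
}
```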
I really like the changes that were made to this pull request. I especially like replacing the hash table with an array. Very clever. Given the changes I decided that it made sense to redo the timings I did previously. I used the same technique, but I included the fix so actual nanoseconds are reported. The initial numbers looked off, so I decided there must be something wrong with my measurement technique. Ah ha! The optimizer must be moving where the clock is read relative to the code being measured, so I added memory fences around the measured call. What I learned from this is that reading the clock with the fences in place adds measurable overhead (roughly 10 ns) to every sample. Timing of develop (1.9.1) with the fences in place:
Timing of Nik's latest code with fences in place:
Timing of Scott D's base58 work with the fences in place:
Meh. The cache does give the best result. But it's still only saving 4 ns over the non-cached code in 1.9.1. If we subtract out the (approximately) 10 ns added by the fences, then we'd be going from an average of 43 ns to an average of 39 ns. About a 10% improvement. I'm not convinced the cache is worth it for a reduction that small. It's certainly possible there's a problem with my measurement technique. Good measurements are hard. But I can't figure out what that problem might be. Thoughts?
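For reference, the fence placement described above could be sketched roughly as follows; `toBase58` exists in rippled, but the surrounding structure here is illustrative rather than the exact instrumentation used:

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>

// Hypothetical sketch: fences discourage the compiler from hoisting the clock
// reads across the call being measured (and they add a little overhead of
// their own, which shows up in the measured interval).
std::int64_t
timeOneConversion(AccountID const& accountID)
{
    auto const start = std::chrono::steady_clock::now();
    std::atomic_thread_fence(std::memory_order_seq_cst);

    auto const text = toBase58(accountID);  // the call being measured
    (void)text;                             // keep the result alive

    std::atomic_thread_fence(std::memory_order_seq_cst);
    auto const stop = std::chrono::steady_clock::now();

    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start)
        .count();
}
```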
Thanks @scottschurr, very useful. Some random thoughts I have.
I appreciate you redoing the timings, @scottschurr. There are a couple of things I'd like to point out. First, I suspect that we're already reaping benefits from the existing cache. An interesting approach would be to disable it entirely and run measurements then. Second, even if this change is, roughly, as fast as the existing mechanism, I still think it's a net gain: the code is simpler and avoids doing work at runtime, and that can only be a good thing. And I don't think that this is such a huge hot spot that it's worth squeezing every last drop of performance out of it.

With that said, why is the performance not significantly better? A couple of thoughts there. First, despite optimizing the cache itself, we can't just hand out a raw pointer into the cache without pinning the memory, so we still end up having to allocate a std::string to return to the caller. Second, I suspect the test scenario you're running is probably not very representative, despite being run against real data. I think that the biggest impact will likely be felt on servers that handle a large volume of RPC requests and have to do these conversions hundreds of times per RPC call (imagine, for example, responding to a query that touches hundreds of accounts).

Apropos of this, I think it's worth examining the size of the cache. Greg has suggested that the defined sizes may be too large, and perhaps that's true. If a cache that's half the size is, say, 90% as efficient, why go larger?

So to sum up my thoughts on this: I feel this commit actually cleans up the code a lot, and that alone may be sufficient to get it merged. We could make the cache a runtime configurable option so operators can choose whether to use it or not, but that may be overkill. Lastly, I'm not opposed to just removing the cache entirely, but I think that before we do that we ought to be very sure that performance doesn't suffer on the servers which actually stand to benefit the most from the cache; I think it's worth running this branch on s1 or s2 and capturing performance over a large number of queries and comparing results.
My guess is that the longest measured times can be attributed to at least a couple of sources:
I would expect thread swapping to have the largest impact of these two. So the longest measured value may effectively be noise. But it's hard to know for sure.
The test case is not really a test case as such. I'm installing the timer in the rippled source code and then running that instance of rippled against the main network. The AccountIDs being used should be representative of the AccountIDs seen by a normally running rippled instance. Similarly, the randomness of the access should be the same as what a regular rippled instance would see. I take over 1 million samples to try to average out the uses from test run to test run. But it's also true that the specifics of the test are far from identical between test cases and from run to run. @nbougalis noted:
Just to be clear, the timings I made of develop (1.9.1) do not benefit from any cache. The call being timed is to...
There's no AccountID caching in any of those calls.
That's a worthwhile point. Can we demonstrate the benefit? Could we, for example, install an instrumented 1.9.1 rippled and an instrumented version of the code-with-caching in two of our client handlers? I'd prefer to actually see the results of the measurements rather than to shrug and say we're measuring the wrong stuff. I'm not saying the code isn't clean. But if we're going to put in a cache it seems like we ought to demonstrate that we get significant benefit from it.
I think this should be doable. Can you link me to your code please?
@nbougalis wrote...
Sure thing. Also feel free to adjust the code in case you think there might be problems with the way it is doing the timing. Here's a branch that does timing for 1.9.1: https://github.com/scottschurr/rippled/tree/to-base-58-timing-1-9-1 Here's a branch that does identical timing for the most recent version of this Be aware that the timing results are written to standard out, not to the log. The log is not available at the point we are measuring the timing. So you may need to redirect the server's output to a file or the like. Thanks for doing the timing. I'm curious to see what the results tell us. |
So, I've been running this branch against a server that's been serving public RPC/WebSocket queries for a little over 2 weeks. Over that time, the cache has a hit ratio that's just shy of 89%, which is significant. About 25% of the calls were ≤ 50 ns; over 96% of the rest were ≤ 100 ns.
Caching the base58check encoded version of an `AccountID` has performance advantages, because of the computationally heavy cost associated with the conversion, which requires the application of SHA-256 twice. This commit makes the cache significantly more efficient in terms of memory used: it eliminates the map, using a vector with a size that is determined by the configured size of the node, and a hash function to directly map any given `AccountID` to a specific slot in the cache; the eviction policy is simple: in case of collision the existing entry is removed and replaced with the new data. Previously, use of the cache was optional and required additional effort by the programmer. Now the cache is automatic and does not require any additional work or information. The new cache also utilizes a 64-way spinlock, to help reduce any contention that the pressure on the cache would impose.
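As a sketch of the design this commit message describes — not the actual implementation — a direct-mapped cache striped over 64 spinlocks might look something like this, with `Key` standing in for `AccountID` and `Hasher` for whatever hash the real code uses:

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical sketch: one slot per hash value, 64-way spinlock striping,
// and overwrite-on-collision eviction. Types and names are illustrative.
template <class Key, class Hasher>
class DirectMappedCache
{
    struct Slot
    {
        Key key{};
        std::string text;
        bool valid = false;
    };

    std::vector<Slot> slots_;
    std::array<std::atomic<bool>, 64> locks_{};

    std::size_t
    index(Key const& k) const
    {
        return Hasher{}(k) % slots_.size();
    }

    struct SpinGuard
    {
        std::atomic<bool>& flag;
        explicit SpinGuard(std::atomic<bool>& f) : flag(f)
        {
            while (flag.exchange(true, std::memory_order_acquire))
                ;  // spin until the stripe's lock is free
        }
        ~SpinGuard()
        {
            flag.store(false, std::memory_order_release);
        }
    };

public:
    explicit DirectMappedCache(std::size_t size) : slots_(size)
    {
    }

    // Returns the cached text for k, computing and storing it on a miss.
    // A hash collision simply overwrites the previous occupant of the slot.
    template <class Compute>
    std::string
    get(Key const& k, Compute&& compute)
    {
        auto const i = index(k);
        auto& lock = locks_[i % locks_.size()];
        {
            SpinGuard guard{lock};
            if (slots_[i].valid && slots_[i].key == k)
                return slots_[i].text;
        }
        // Miss: do the expensive conversion (base58check includes a
        // double SHA-256 checksum) outside the lock, then publish it.
        std::string text = compute(k);
        {
            SpinGuard guard{lock};
            slots_[i] = Slot{k, text, true};
        }
        return text;
    }
};
```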
@nbougalis, thanks for making the measurements you did. A cache hit ratio of 89% is significant. What I've not yet seen from your timings is evidence of a speed up. For that evidence we'd need to see timings without the cache on the same machine and a comparable workload. If the timings with the cache are significantly faster than the timings without the cache then the complexity introduced by the cache is justified. Make sense?
Opening for review since this may be included in 1.10. e2eed96 (The contributing process requires approvals from at least two reviewers.)
(Ignore the merge conflicts mentioned by GitHub below.)
Although I'm not convinced (by the measurements taken) that this approach provides significant speedups, I don't think it does any harm. 👍 from me.
// The maximum size (in bytes) of an item, along with a null terminator
// and aligned at 8 bytes
std::size_t const size_ = 40;
Consider making this constexpr.
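For illustration, the suggestion would look roughly like this (note that a `constexpr` data member must also be `static`):

```cpp
// constexpr makes the value a guaranteed compile-time constant.
static constexpr std::size_t size_ = 40;
```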
@scottschurr, given that I'll be closing this PR soon (and it might not get looked at again after that), would it be worth opening an issue for this?
If not, we can leave as-is.
It's not worth an issue. We can leave it as is.
return *raSrcAccount;
}();
Nitpick: I'm not sure what the benefit of a lambda is in this case.
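For context, the snippet being discussed uses an immediately-invoked lambda so the resulting value can stay `const` even when computing it takes more than one statement. A minimal illustration of the idiom, with purely hypothetical names and surrounding logic:

```cpp
#include <optional>
#include <string>

// Illustrative only: the immediately-invoked lambda lets `account` be const
// even though its initialization needs a branch.
std::string
describe(std::optional<std::string> const& raSrcAccount)
{
    auto const account = [&]() {
        if (!raSrcAccount)
            return std::string{"<missing>"};
        return *raSrcAccount;
    }();
    return "source account: " + account;
}
```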
Thank you both for the reviews!
Caching the base58check encoded version of an `AccountID` has performance advantages, because of the computationally heavy cost associated with the conversion, which requires the application of SHA-256 twice.

This commit makes the cache significantly more efficient in terms of memory used: it maps an `AccountID` to a `char*` instead of an `std::string` (eliminating the need for dynamic memory allocation and the fixed overhead of `std::string`) and effectively eliminates runtime allocation and deallocation of memory by grabbing a single, large block of memory at initialization.

The eviction policy is also improved and ensures that the cache, once full, will always remain full. Although not currently coded, the structure of the cache is such that it could easily be used to evict the least-recently-used entry without additional overhead.
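A rough sketch of the single-block idea described here — names, sizes, and layout are illustrative, not the actual code — might look like this: one allocation at construction, sliced into fixed-size, null-terminated slots that the cache can hand out as `char*`:

```cpp
#include <cassert>
#include <cstring>
#include <memory>
#include <string>

// Hypothetical sketch: one upfront allocation, no per-entry allocation.
class SlotStorage
{
    static constexpr std::size_t slotSize = 40;  // max encoded text + NUL, 8-byte aligned
    std::size_t const count_;
    std::unique_ptr<char[]> block_;

public:
    explicit SlotStorage(std::size_t count)
        : count_(count), block_(std::make_unique<char[]>(count * slotSize))
    {
    }

    // Copy the encoded text into slot i and return a pointer into the block.
    char const*
    store(std::size_t i, std::string const& text)
    {
        assert(i < count_ && text.size() + 1 <= slotSize);
        char* p = block_.get() + i * slotSize;
        std::memcpy(p, text.c_str(), text.size() + 1);  // includes the terminator
        return p;
    }

    char const*
    get(std::size_t i) const
    {
        return block_.get() + i * slotSize;
    }
};
```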