[ML] Investigate removing CStringStore #2130
To investigate removing the `CStringStore`, I've kept track of the running string counts in the store. Results (format: string, max count):

Further notes:
Looking at this, I don't think the `CStringStore` is worth keeping.
I agree. Essentially we see two different kinds of strings here: field name strings and field value strings.
For field name strings we do have relatively high use counts, but for any realistic scenario there are at most a few 10s of distinct values of these per job, so this would amount to at most a few 1000s of strings overhead. One also has to offset against this the 16 bytes for the pointer and reference count, so if the strings are small we may actually end up worse off.

For the field value strings, I think the worst possible case will be a population model, where the overhead is much smaller per distinct field value. Still, I suspect that the win is likely to be negligible in this case and, as before, for short strings it may even end up costing us memory.

The other potential issue is it could slow down memory accounting. My inclination is to test this (by disabling this optimisation and checking the bucket processing time). If this looks good then I think we should go ahead and remove it. (Note we should also test some large-scale population jobs.)
Results for the following population analysis job on the same data (~34M lines of AWS cluster logs):
Field name strings (…)
Field value strings (…)
Regarding the memory usage optimisation: disabling it for the two jobs above doesn't show any difference. That shouldn't be surprising, because the store only contains about 100 strings, and summing their lengths can't take long.
Update regarding memory usage:

New job config: change …
New datafeed: only 1 day of data (Jan 2nd)

Now just removing the …
Also commenting out the …
Here's the testing code: #2650
Conclusion: @tveasey and I agree that these results indicate that `CStringStore` can be removed.
I dug a little more into why removing `CStringStore` slows down memory accounting.

We are effectively relying on caching the string memory usage calculation. This was actually the reason we introduced …

The one caveat is we need to audit to make sure we do have instances of …
PR: #2652
@jan-elastic can we close the issue now, since #2652 has been merged?
Done. Note that some potential leftover work is in this issue: #2665
The `CStringStore` class was introduced to support the "overlapping buckets" feature, because overlapping buckets created a need to remember field values for longer than a single bucket. Prior to this we just stored the field values in `std::string` objects that only existed while a particular bucket was being processed.

Over the years we've had to put effort into optimising the performance of `CStringStore`. Then the complexity introduced by that optimisation has led to ongoing issues like #2019.

But maybe now that we have deleted the code for the overlapping buckets functionality we can go back to just storing field values for the lifetime of the current bucket in simple `std::string` objects.

The first step is to audit the code to find out where `CStringStore` is used and whether there's an obvious reason why it cannot be removed. (We should also be able to dig out the commit where it was originally added from the legacy Prelert git history, which will show how things were done before that.)