Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ML] Investigate removing CStringStore #2130

Closed
droberts195 opened this issue Nov 11, 2021 · 13 comments
Closed

[ML] Investigate removing CStringStore #2130

droberts195 opened this issue Nov 11, 2021 · 13 comments
Assignees
Labels
Milestone

Comments

@droberts195
Copy link
Contributor

The CStringStore class was introduced to support the "overlapping buckets" feature, because overlapping buckets created a need to remember field values for longer than a single bucket. Prior to this we just stored the field values in std::string objects that only existed while a particular bucket was being processed.

Over the years we've had to put effort into optimising the performance of CStringStore. Then the complexity introduced by that optimisation has led to ongoing issues like #2019.

But maybe now that we have deleted the code for the overlapping buckets functionality we can go back to just storing field values for the lifetime of the current bucket in simple std::string objects.

The first step is to audit the code to find out where CStringStore is used and whether there's an obvious reason why it cannot be removed. (We should also be able to dig out the commit where it was originally added from the legacy Prelert git history, which will show how things were done before that.)

@jan-elastic
Copy link
Contributor

jan-elastic commented Apr 2, 2024

To investigate removing the CStringStore, I have run an anomaly detection job on an AWS cluster logs file with ~34M log lines, with the following detector config:

      {
        "detector_description": "Num logs",
        "function": "count",
        "by_field_name": "host.name"
      }

I've kept track of the running string counts in the store.

Results (format: string, max CStringStore's use_count of the string, no. of CStringStore::get calls for the string):

string     use_count  #calls(get)
--------------------------------
count             93       258766
host.name        170       465125
i-******           3       729 - 3847 (varying between the 95 different host names)

Further notes:

  • the use counts of count and host.name drop after the job has processed all log lines.

@jan-elastic
Copy link
Contributor

jan-elastic commented Apr 2, 2024

Looking at this, I don't think the CStringStore is very useful, so it can be removed.

@tveasey
Copy link
Contributor

tveasey commented Apr 2, 2024

I agree.

Essentially we see two different strings here:

  1. field name strings (count and host.name) and
  2. field value strings (i-******).

For field name strings we do have relatively high use counts, but for any realistic scenario there are at most a few 10s of distinct values of these per job. So this would amount to at most a few 1000s of strings overhead. Also one has to offset against this the 16 bytes for pointer and reference count, so if the strings are small we may actually end up worse off.

For the i-****** we save at most 2 strings per distinct values which is probably fairly typical. If we are using these for any partitioning field we typically have multiple KB per distinct value for the time series models so this will be negligible.

I think the worst possible case will be for a population model where the overhead is much smaller per distinct field value. Still I suspect that the win is likely to be negligible in this case and as before, for short strings, it may even end up costing us memory.

The other potentially issue is it could slow down memory accounting. CStoredStringPtr includes static constexpr bool dynamicSizeAlwaysZero() { return true; }. The effect is that any container of CStoredStringPtr objects will skip iterating the objects when computing memory. So removing CStringStore will slow down the memory calculation, but again the overall impact on performance will hopefully be negligible, and we will also get some gains for avoiding the indirection when dealing with strings.

My inclination is to test this (by disabling this optimisation and checking the bucket processing time). If this looks good then I think we should go ahead and remove it. (Note we should also test some large scale population jobs.)

@jan-elastic
Copy link
Contributor

jan-elastic commented Apr 3, 2024

Results for the following population analysis job on the same data (~34M lines of AWS cluster logs):

      "analysis_config": {
        "bucket_span": "15m",
        "detectors": [
          {
            "detector_description": "count over \"host.name\"",
            "function": "count",
            "over_field_name": "host.name",
            "detector_index": 0
          }
        ],
        "influencers": [
          "host.name",
          "consumer.id"
        ]
      }

Field name strings (count, host.name, consumer.id) have use counts of around 100.

Field value strings (i-******) mostly have low use count (between 2 and 6), but a small fraction has use count of around 100.

@jan-elastic
Copy link
Contributor

Regarding the memory usage optimization: disabling it for the two jobs above doesn't show any difference. That shouldn't be surprising, because the store only contains about 100 strings, and summing their lengths can't take long.

@jan-elastic
Copy link
Contributor

Update regarding memory usage:

New job config: change "over_field_name": "host.name" to "over_field_name": "event.original" (the actual log line, which is quite long and unique)

New datafeed: only 1 day of data (Jan, 2nd)

Now just removing the static constexpr bool dynamicSizeAlwaysZero() { return true; } increases the runtime from 10secs to almost 20mins, and the processing becomes incrementally slower as more data is processed.

@jan-elastic
Copy link
Contributor

Also commenting out the core::memory::dynamicSize(m_Strings) in CStringStore decreases the runtime back to the original 10secs.

@jan-elastic
Copy link
Contributor

Here's the testing code: #2650

@jan-elastic
Copy link
Contributor

Conclusion: @tveasey and I agree that these results indicate that CStringStore isn't necessary and we can proceed to remove it.

@tveasey
Copy link
Contributor

tveasey commented Apr 4, 2024

I dug a little more into why removing static constexpr bool dynamicSizeAlwaysZero() { return true; } alone was problematic.

We are effectively relying on caching the string memory usage calculation. This was actually the reason we introduced CResourceMonitor in the first place. However, CResourceMonitor calls CStringStore::memoryUsage freely to get the total memory when deciding if allocations are permitted and such. We increased this from fixed low cost to iterating every distinct string in the program for each call. This in turn is why the cost dropped back when we commented core::memory::dynamicSize(m_Strings) out. The two changes together should be mostly representative of the final performance impact.

The one caveat is we need to audit to make sure we do have instances of CStoredStringPtr accounted for in the classes which use them memoryUsage functions. For example, I found at least one case this was missing and we relied on CStringStore for memory usage, i.e. CDataGatherer::m_PartitionFieldValue.

@jan-elastic
Copy link
Contributor

PR: #2652

@valeriy42
Copy link
Contributor

@jan-elastic can we close the issue now, since #2652 has been merged?

@jan-elastic
Copy link
Contributor

Done. Note that some potential leftover work is in this issue: #2665

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

5 participants