[ML] Investigate removing CStringStore #2130

droberts195 · 2021-11-11T10:18:26Z

The CStringStore class was introduced to support the "overlapping buckets" feature, because overlapping buckets created a need to remember field values for longer than a single bucket. Prior to this we just stored the field values in std::string objects that only existed while a particular bucket was being processed.

Over the years we've had to put effort into optimising the performance of CStringStore. Then the complexity introduced by that optimisation has led to ongoing issues like #2019.

But maybe now that we have deleted the code for the overlapping buckets functionality we can go back to just storing field values for the lifetime of the current bucket in simple std::string objects.

The first step is to audit the code to find out where CStringStore is used and whether there's an obvious reason why it cannot be removed. (We should also be able to dig out the commit where it was originally added from the legacy Prelert git history, which will show how things were done before that.)

The text was updated successfully, but these errors were encountered:

jan-elastic · 2024-04-02T14:50:17Z

To investigate removing the CStringStore, I have run an anomaly detection job on an AWS cluster logs file with ~34M log lines, with the following detector config:

      {
        "detector_description": "Num logs",
        "function": "count",
        "by_field_name": "host.name"
      }

I've kept track of the running string counts in the store.

Results (format: string, max CStringStore's use_count of the string, no. of CStringStore::get calls for the string):

string     use_count  #calls(get)
--------------------------------
count             93       258766
host.name        170       465125
i-******           3       729 - 3847 (varying between the 95 different host names)

Further notes:

the use counts of count and host.name drop after the job has processed all log lines.

jan-elastic · 2024-04-02T14:51:00Z

Looking at this, I don't think the CStringStore is very useful, so it can be removed.

tveasey · 2024-04-02T15:24:31Z

I agree.

Essentially we see two different strings here:

field name strings (count and host.name) and
field value strings (i-******).

For field name strings we do have relatively high use counts, but for any realistic scenario there are at most a few 10s of distinct values of these per job. So this would amount to at most a few 1000s of strings overhead. Also one has to offset against this the 16 bytes for pointer and reference count, so if the strings are small we may actually end up worse off.

For the i-****** we save at most 2 strings per distinct values which is probably fairly typical. If we are using these for any partitioning field we typically have multiple KB per distinct value for the time series models so this will be negligible.

I think the worst possible case will be for a population model where the overhead is much smaller per distinct field value. Still I suspect that the win is likely to be negligible in this case and as before, for short strings, it may even end up costing us memory.

The other potentially issue is it could slow down memory accounting. CStoredStringPtr includes static constexpr bool dynamicSizeAlwaysZero() { return true; }. The effect is that any container of CStoredStringPtr objects will skip iterating the objects when computing memory. So removing CStringStore will slow down the memory calculation, but again the overall impact on performance will hopefully be negligible, and we will also get some gains for avoiding the indirection when dealing with strings.

My inclination is to test this (by disabling this optimisation and checking the bucket processing time). If this looks good then I think we should go ahead and remove it. (Note we should also test some large scale population jobs.)

jan-elastic · 2024-04-03T10:40:27Z

Results for the following population analysis job on the same data (~34M lines of AWS cluster logs):

      "analysis_config": {
        "bucket_span": "15m",
        "detectors": [
          {
            "detector_description": "count over \"host.name\"",
            "function": "count",
            "over_field_name": "host.name",
            "detector_index": 0
          }
        ],
        "influencers": [
          "host.name",
          "consumer.id"
        ]
      }

Field name strings (count, host.name, consumer.id) have use counts of around 100.

Field value strings (i-******) mostly have low use count (between 2 and 6), but a small fraction has use count of around 100.

jan-elastic · 2024-04-03T11:56:46Z

Regarding the memory usage optimization: disabling it for the two jobs above doesn't show any difference. That shouldn't be surprising, because the store only contains about 100 strings, and summing their lengths can't take long.

jan-elastic · 2024-04-03T13:19:57Z

Update regarding memory usage:

New job config: change "over_field_name": "host.name" to "over_field_name": "event.original" (the actual log line, which is quite long and unique)

New datafeed: only 1 day of data (Jan, 2nd)

Now just removing the static constexpr bool dynamicSizeAlwaysZero() { return true; } increases the runtime from 10secs to almost 20mins, and the processing becomes incrementally slower as more data is processed.

jan-elastic · 2024-04-03T13:40:33Z

Also commenting out the core::memory::dynamicSize(m_Strings) in CStringStore decreases the runtime back to the original 10secs.

jan-elastic · 2024-04-03T14:18:36Z

Here's the testing code: #2650

jan-elastic · 2024-04-04T08:39:38Z

Conclusion: @tveasey and I agree that these results indicate that CStringStore isn't necessary and we can proceed to remove it.

tveasey · 2024-04-04T11:17:23Z

I dug a little more into why removing static constexpr bool dynamicSizeAlwaysZero() { return true; } alone was problematic.

We are effectively relying on caching the string memory usage calculation. This was actually the reason we introduced CResourceMonitor in the first place. However, CResourceMonitor calls CStringStore::memoryUsage freely to get the total memory when deciding if allocations are permitted and such. We increased this from fixed low cost to iterating every distinct string in the program for each call. This in turn is why the cost dropped back when we commented core::memory::dynamicSize(m_Strings) out. The two changes together should be mostly representative of the final performance impact.

The one caveat is we need to audit to make sure we do have instances of CStoredStringPtr accounted for in the classes which use them memoryUsage functions. For example, I found at least one case this was missing and we relied on CStringStore for memory usage, i.e. CDataGatherer::m_PartitionFieldValue.

jan-elastic · 2024-04-16T07:31:42Z

PR: #2652

valeriy42 · 2024-07-23T07:48:46Z

@jan-elastic can we close the issue now, since #2652 has been merged?

jan-elastic · 2024-07-29T07:13:09Z

Done. Note that some potential leftover work is in this issue: #2665

droberts195 added the :ml label Nov 11, 2021

droberts195 mentioned this issue Nov 11, 2021

[ML] CStringStoreTest.testStringStore failure #2019

Open

jan-elastic self-assigned this Mar 25, 2024

sophiec20 added this to the 8.14 milestone Apr 24, 2024

jan-elastic mentioned this issue May 15, 2024

[ML] Investigate / fix degraded anomaly detection performance on specific QA test sets #2665

Open

jan-elastic closed this as completed Jul 29, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ML] Investigate removing CStringStore #2130

[ML] Investigate removing CStringStore #2130

droberts195 commented Nov 11, 2021

jan-elastic commented Apr 2, 2024 •

edited

Loading

jan-elastic commented Apr 2, 2024 •

edited

Loading

tveasey commented Apr 2, 2024 •

edited

Loading

jan-elastic commented Apr 3, 2024 •

edited

Loading

jan-elastic commented Apr 3, 2024

jan-elastic commented Apr 3, 2024

jan-elastic commented Apr 3, 2024

jan-elastic commented Apr 3, 2024

jan-elastic commented Apr 4, 2024

tveasey commented Apr 4, 2024 •

edited

Loading

jan-elastic commented Apr 16, 2024

valeriy42 commented Jul 23, 2024

jan-elastic commented Jul 29, 2024

[ML] Investigate removing CStringStore #2130

[ML] Investigate removing CStringStore #2130

Comments

droberts195 commented Nov 11, 2021

jan-elastic commented Apr 2, 2024 • edited Loading

jan-elastic commented Apr 2, 2024 • edited Loading

tveasey commented Apr 2, 2024 • edited Loading

jan-elastic commented Apr 3, 2024 • edited Loading

jan-elastic commented Apr 3, 2024

jan-elastic commented Apr 3, 2024

jan-elastic commented Apr 3, 2024

jan-elastic commented Apr 3, 2024

jan-elastic commented Apr 4, 2024

tveasey commented Apr 4, 2024 • edited Loading

jan-elastic commented Apr 16, 2024

valeriy42 commented Jul 23, 2024

jan-elastic commented Jul 29, 2024

jan-elastic commented Apr 2, 2024 •

edited

Loading

jan-elastic commented Apr 2, 2024 •

edited

Loading

tveasey commented Apr 2, 2024 •

edited

Loading

jan-elastic commented Apr 3, 2024 •

edited

Loading

tveasey commented Apr 4, 2024 •

edited

Loading