
Fix std::bad_alloc exception due to JIT reserving a huge buffer #10317

Merged: 6 commits merged into rapidsai:branch-22.04 from ttnghia:fix_jit on Feb 25, 2022

Conversation

@ttnghia (Contributor) commented Feb 17, 2022

In jit/cache.cpp, the program cache always reserves its internal std::unordered_map using a size taken from the environment variable LIBCUDF_KERNEL_CACHE_LIMIT_PER_PROCESS. If that environment variable is not set, a default value of std::numeric_limits<size_t>::max() is used. That default is enormous, so the reserve tries to allocate an impossibly large chunk of memory and crashes the process with std::bad_alloc.

This PR changes that default value from std::numeric_limits<size_t>::max() to 1024^2. It is essentially a reversal of PR #10312, but with the default set to 1024 instead of 100.

Note that 1024^2 is just an arbitrary number, not based on any specific calculation.
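As a rough sketch of the pattern this PR changes (the function and constant names below are made up for illustration, not libcudf's actual identifiers): the limit is read from LIBCUDF_KERNEL_CACHE_LIMIT_PER_PROCESS when it is set, and otherwise falls back to a bounded default instead of std::numeric_limits<size_t>::max().

#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper, not the actual libcudf code: pick the per-process
// kernel cache limit from the environment, falling back to a bounded default
// (1024^2 entries) rather than std::numeric_limits<std::size_t>::max().
std::size_t kernel_cache_limit()
{
  constexpr std::size_t default_limit = 1024 * 1024;
  char const* env = std::getenv("LIBCUDF_KERNEL_CACHE_LIMIT_PER_PROCESS");
  if (env == nullptr) { return default_limit; }
  return static_cast<std::size_t>(std::stoull(env));
}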

Closes #10312 and closes #9362.

@ttnghia ttnghia added labels: bug (Something isn't working), 3 - Ready for Review (Ready for review by team), Jitify, non-breaking (Non-breaking change) on Feb 17, 2022
@ttnghia ttnghia requested a review from a team as a code owner February 17, 2022 04:49
@ttnghia ttnghia self-assigned this Feb 17, 2022
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Feb 17, 2022

@harrism (Member) commented Feb 17, 2022

> Note that 1024 is just some random number, not based on any specific calculation.

Can you look into whether we can run all tests and query the actual cache size and then base the limit on that? Otherwise we may still thrash with 1024.

@ttnghia (Contributor, Author) commented Feb 17, 2022

> Can you look into whether we can run all tests and query the actual cache size and then base the limit on that? Otherwise we may still thrash with 1024.

I built and ran all the tests on my local machine: all passed. I'm not sure whether "thrashing" means something special here, or whether it requires a particular environment or parameters to occur?

@vyasr (Contributor) commented Feb 18, 2022

> Can you look into whether we can run all tests and query the actual cache size and then base the limit on that? Otherwise we may still thrash with 1024.

> I built and ran all the tests on my local machine: all passed. I'm not sure whether "thrashing" means something special here, or whether it requires a particular environment or parameters to occur?

Cache thrashing refers to a situation where the cache is too small to contain all the data that will be used within some predefined context. As an extreme example, if your script ran a hundred JIT binops and your cache size was 1, you would observe a cache miss every time, then recompile and repopulate the cache. This process is expensive at such high frequency and can cause code to choke. I assume the situation was quite bad to motivate #8132 in the first place.
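As a toy sketch of that effect (TinyCache below is a made-up stand-in, not jitify's ProgramCache): with capacity 1 and two kernels requested repeatedly, every lookup misses and pays the "recompile" cost, while capacity 2 misses only on the first round.

#include <cstddef>
#include <cstdio>
#include <string>
#include <unordered_map>

// Made-up bounded cache, only to illustrate thrashing: when full, it simply
// drops everything to make room for the next entry.
struct TinyCache {
  std::size_t capacity;
  std::unordered_map<std::string, int> entries;
  int misses;

  int get_or_compile(std::string const& key)
  {
    auto it = entries.find(key);
    if (it != entries.end()) { return it->second; }  // cache hit
    ++misses;                                        // miss: pretend to recompile
    if (entries.size() >= capacity) { entries.clear(); }
    entries.emplace(key, 0);
    return 0;
  }
};

int main()
{
  TinyCache cache{1, {}, 0};
  for (int round = 0; round < 10; ++round) {  // the same two kernels, over and over
    cache.get_or_compile("kernel_a");
    cache.get_or_compile("kernel_b");
  }
  std::printf("misses: %d\n", cache.misses);  // 20 with capacity 1, only 2 with capacity 2
}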

What @harrism is suggesting is to see how large the cache grows in practice when running all of the tests when the cache size is some very large number that still fits in memory. For example, if you set the cache size to a million, does it actually get filled, or does it reserve a bunch of unused space? I'm not sure if our JIT code makes it easy to check how big the ProgramCache gets at the end of a run; maybe you can query it in some sort of teardown step for the Google tests once all tests are completed, but I am not sure about that. Choosing too small of a number (and 1024 may be too small) could cause problems for production users.

@ttnghia (Contributor, Author) commented Feb 18, 2022

I just changed the default value to 1024^2 instead. When I ran the tests with that value, the peak (host) memory usage was 900MB, so such a default should be safe on most systems.

@trxcllnt (Contributor) commented:

900MB seems like a lot for a lib to idly consume, especially in a multi-process environment. Now that binops don't JIT, 1024 is probably fine.

@ttnghia (Contributor, Author) commented Feb 18, 2022

FYI, the CI tests seem to run with these values (https://github.com/rapidsai/ops/issues/1803), so I'm just going to follow them:

export LIBCUDF_KERNEL_CACHE_LIMIT_PER_PROCESS=10000
export LIBCUDF_KERNEL_CACHE_LIMIT_DISK=100000

@karthikeyann (Contributor) commented:

@ttnghia A suggestion on how to get the maximum cache size: replace the unordered_map with a class that contains the unordered_map and whose destructor prints the maximum cache size (this needs a patch to jitify2::ProgramCache to expose mem_cache_.size()), then run all CI tests with large LIBCUDF_KERNEL_CACHE_LIMIT_* values and gather the maximum number printed by the destructor on program exit.

Another question: are CI tests enough for determining the maximum size?

@ttnghia (Contributor, Author) commented Feb 24, 2022

> @ttnghia A suggestion on how to get the maximum cache size: replace the unordered_map with a class that contains the unordered_map and whose destructor prints the maximum cache size (this needs a patch to jitify2::ProgramCache to expose mem_cache_.size()), then run all CI tests with large LIBCUDF_KERNEL_CACHE_LIMIT_* values and gather the maximum number printed by the destructor on program exit.

That's a good idea. Here is the code I inserted:

  // Print the number of cached entries when the cache is destroyed.
  virtual ~LRUCache() {
    printf("Cache size: %d\n", (int)cache_.size());
    fflush(stdout);
  }

And the result I got (after running ROLLING_TEST):

Cache size: 8

Apparently, 10000 is more than enough (in this case) while still not costing much memory. I'm not sure how large the cache will grow in real "production" situations.

The current values (10k and 100k) should be good. If necessary, users in "production" can always specify their own values.

@karthikeyann (Contributor) left a review comment:

Nice. LGTM 👍

@ttnghia (Contributor, Author) commented Feb 25, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit df646b2 into rapidsai:branch-22.04 Feb 25, 2022
@ttnghia ttnghia deleted the fix_jit branch March 7, 2022 16:34