
Fix std::bad_alloc exception due to JIT reserving a huge buffer #10317

Merged: 6 commits merged into rapidsai:branch-22.04 from ttnghia:fix_jit on Feb 25, 2022

Conversation

@ttnghia (Contributor) commented Feb 17, 2022

In jit/cache.cpp, the program cache always reserves its internal std::unordered_map using a size taken from the environment variable LIBCUDF_KERNEL_CACHE_LIMIT_PER_PROCESS. If that environment variable is not set, a default value of std::numeric_limits<size_t>::max() is used. That default is enormous, so the reserve tries to allocate an impossibly large chunk of memory and crashes the process with std::bad_alloc.

This PR changes that default value from std::numeric_limits<size_t>::max() to 1024^2. It is essentially a reversal of PR #10312, but with the default set to 1024 instead of 100.

Note that 1024^2 is just an arbitrary number, not based on any specific calculation.
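As a rough sketch of the pattern this PR changes (the function and constant names below are made up for illustration, not libcudf's actual identifiers): the limit is read from LIBCUDF_KERNEL_CACHE_LIMIT_PER_PROCESS when it is set, and otherwise falls back to a bounded default instead of std::numeric_limits<size_t>::max().

#include <cstddef>
#include <cstdlib>
#include <string>

// Hypothetical helper, not the actual libcudf code: pick the per-process
// kernel cache limit from the environment, falling back to a bounded default
// (1024^2 entries) rather than std::numeric_limits<std::size_t>::max().
std::size_t kernel_cache_limit()
{
  constexpr std::size_t default_limit = 1024 * 1024;
  char const* env = std::getenv("LIBCUDF_KERNEL_CACHE_LIMIT_PER_PROCESS");
  if (env == nullptr) { return default_limit; }
  return static_cast<std::size_t>(std::stoull(env));
}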

Closes #10312 and closes #9362.

@ttnghia ttnghia added labels: bug (Something isn't working), 3 - Ready for Review (Ready for review by team), Jitify, non-breaking (Non-breaking change) on Feb 17, 2022
@ttnghia ttnghia requested a review from a team as a code owner February 17, 2022 04:49
@ttnghia ttnghia self-assigned this Feb 17, 2022
@github-actions github-actions bot added the libcudf Affects libcudf (C++/CUDA) code. label Feb 17, 2022

@harrism (Member) commented Feb 17, 2022

> Note that 1024 is just some random number, not based on any specific calculation.

Can you look into whether we can run all tests and query the actual cache size and then base the limit on that? Otherwise we may still thrash with 1024.

@ttnghia (Contributor, Author) commented Feb 17, 2022

> Can you look into whether we can run all tests and query the actual cache size and then base the limit on that? Otherwise we may still thrash with 1024.

I built and ran all the tests on my local machine: all passed. I'm not sure whether "thrashing" means something special here, or whether it requires a particular environment or parameters to occur?

@vyasr (Contributor) commented Feb 18, 2022

> Can you look into whether we can run all tests and query the actual cache size and then base the limit on that? Otherwise we may still thrash with 1024.

> I built and ran all the tests on my local machine: all passed. I'm not sure whether "thrashing" means something special here, or whether it requires a particular environment or parameters to occur?

Cache thrashing refers to a situation where the cache is too small to contain all the data that will be used within some predefined context. As an extreme example, if your script ran a hundred JIT binops and your cache size was 1, you would observe a cache miss every time, then recompile and repopulate the cache. This process is expensive at such high frequency and can cause code to choke. I assume the situation was quite bad to motivate #8132 in the first place.
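As a toy sketch of that effect (TinyCache below is a made-up stand-in, not jitify's ProgramCache): with capacity 1 and two kernels requested repeatedly, every lookup misses and pays the "recompile" cost, while capacity 2 misses only on the first round.

#include <cstddef>
#include <cstdio>
#include <string>
#include <unordered_map>

// Made-up bounded cache, only to illustrate thrashing: when full, it simply
// drops everything to make room for the next entry.
struct TinyCache {
  std::size_t capacity;
  std::unordered_map<std::string, int> entries;
  int misses;

  int get_or_compile(std::string const& key)
  {
    auto it = entries.find(key);
    if (it != entries.end()) { return it->second; }  // cache hit
    ++misses;                                        // miss: pretend to recompile
    if (entries.size() >= capacity) { entries.clear(); }
    entries.emplace(key, 0);
    return 0;
  }
};

int main()
{
  TinyCache cache{1, {}, 0};
  for (int round = 0; round < 10; ++round) {  // the same two kernels, over and over
    cache.get_or_compile("kernel_a");
    cache.get_or_compile("kernel_b");
  }
  std::printf("misses: %d\n", cache.misses);  // 20 with capacity 1, only 2 with capacity 2
}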

What @harrism is suggesting is to see how large the cache grows in practice when running all of the tests when the cache size is some very large number that still fits in memory. For example, if you set the cache size to a million, does it actually get filled, or does it reserve a bunch of unused space? I'm not sure if our JIT code makes it easy to check how big the ProgramCache gets at the end of a run; maybe you can query it in some sort of teardown step for the Google tests once all tests are completed, but I am not sure about that. Choosing too small of a number (and 1024 may be too small) could cause problems for production users.

@ttnghia (Contributor, Author) commented Feb 18, 2022

I just changed the default value to 1024^2 instead. When I ran the tests with that value, the peak (host) memory usage was 900MB, so such a default should be safe on most systems.

@trxcllnt (Contributor) commented:

900MB seems like a lot for a lib to idly consume, especially in a multi-process environment. Now that binops don't JIT, 1024 is probably fine.

@ttnghia (Contributor, Author) commented Feb 18, 2022

FYI, the CI tests seem to run with these values (https://github.com/rapidsai/ops/issues/1803), so I'm just going to follow them:

export LIBCUDF_KERNEL_CACHE_LIMIT_PER_PROCESS=10000
export LIBCUDF_KERNEL_CACHE_LIMIT_DISK=100000

@karthikeyann (Contributor) commented:

@ttnghia A suggestion on how to get the maximum cache size: replace the unordered_map with a class that contains the unordered_map and whose destructor prints the maximum cache size (this needs a patch to jitify2::ProgramCache to expose mem_cache_.size()), then run all CI tests with large LIBCUDF_KERNEL_CACHE_LIMIT_* values and gather the maximum number printed by the destructor on program exit.

Another question: are CI tests enough for determining the maximum size?

@ttnghia (Contributor, Author) commented Feb 24, 2022

> @ttnghia A suggestion on how to get the maximum cache size: replace the unordered_map with a class that contains the unordered_map and whose destructor prints the maximum cache size (this needs a patch to jitify2::ProgramCache to expose mem_cache_.size()), then run all CI tests with large LIBCUDF_KERNEL_CACHE_LIMIT_* values and gather the maximum number printed by the destructor on program exit.

That's a good idea. Here is the code I inserted:

  // Print the number of cached entries when the cache is destroyed.
  virtual ~LRUCache() {
    printf("Cache size: %d\n", (int)cache_.size());
    fflush(stdout);
  }

And the result I got (after running ROLLING_TEST):

Cache size: 8

Apparently, 10000 is more than enough (in this case) while still not costing much memory. I'm not sure how large the cache will grow in real "production" situations.

The current values (10k and 100k) should be good. If necessary, users in "production" can always specify their own values.

@karthikeyann (Contributor) left a review comment:

Nice. LGTM 👍

@ttnghia (Contributor, Author) commented Feb 25, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit df646b2 into rapidsai:branch-22.04 Feb 25, 2022
@ttnghia ttnghia deleted the fix_jit branch March 7, 2022 16:34