Fix `std::bad_alloc` exception due to JIT reserving a huge buffer #10317
Conversation
Can you look into whether we can run all tests and query the actual cache size, and then base the limit on that? Otherwise we may still thrash with 1024.
I built and ran all the tests on my local machine; all passed. I'm not sure whether "thrashing" means something specific here, or whether it requires some special environment/parameters to occur.
Cache thrashing refers to a situation where the cache is too small to contain all the data that will be used within some predefined context. As an extreme example, if your script ran a hundred JIT binops and your cache size was 1, you would observe a cache miss every time, then recompile and repopulate the cache. This process is expensive at such high frequency and can cause code to choke. I assume the situation was quite bad to motivate #8132 in the first place. What @harrism is suggesting is to see how large the cache grows in practice when running all of the tests with the cache size set to some very large number that still fits in memory. For example, if you set the cache size to a million, does it actually get filled, or does it reserve a bunch of unused space? I'm not sure if our JIT code makes it easy to check how big the ProgramCache gets at the end of a run; maybe you can query it in some sort of teardown step for the Google tests once all tests are completed, but I am not sure about that. Choosing too small a number (and 1024 may be too small) could cause problems for production users.
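
To make the thrashing failure mode concrete, here is a toy sketch (the cache type, keys, and eviction policy are invented for illustration and are not cudf's actual JIT machinery): with a limit of 1, every one of the 100 distinct binops misses on every pass and triggers a "recompile".

```cpp
#include <cstddef>
#include <iostream>
#include <string>
#include <unordered_map>

// Toy cache with a hard entry limit; evicts an arbitrary entry when full.
struct toy_program_cache {
  std::size_t limit;
  std::unordered_map<std::string, std::string> entries;
  std::size_t misses = 0;

  std::string const& get(std::string const& key)
  {
    auto it = entries.find(key);
    if (it == entries.end()) {
      ++misses;  // a miss stands in for an expensive recompilation
      if (entries.size() >= limit) { entries.erase(entries.begin()); }
      it = entries.emplace(key, "compiled:" + key).first;
    }
    return it->second;
  }
};

int main()
{
  toy_program_cache cache{/*limit=*/1};
  // 100 distinct binops cycled 10 times: with a cache size of 1,
  // every lookup misses, so all 1000 lookups "recompile".
  for (int pass = 0; pass < 10; ++pass) {
    for (int op = 0; op < 100; ++op) { cache.get("binop_" + std::to_string(op)); }
  }
  std::cout << "misses: " << cache.misses << " of 1000 lookups\n";
}
```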
I just changed the default value to `1024^2`.
900MB seems like a lot for a lib to idly consume, especially in a multi-process environment. Now that binops don't JIT, …
FYI, CI tests seem to run with these values (https://github.com/rapidsai/ops/issues/1803), so I'm just going to follow them:
…
@ttnghia Suggestion on how to get the maximum cache size: replace the configured limit with a very large value, run everything, and check how big the cache actually gets at the end. Another question: are the CI tests enough for determining the maximum size?
That's a good idea. Here is the code I inserted:
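
A minimal sketch of what that instrumentation could look like, assuming a simplified stand-in for the program cache in `jit/cache.cpp` (the class, method names, and reporting hook are assumptions, not cudf's actual internals):

```cpp
#include <iostream>
#include <string>
#include <unordered_map>

// Assumed stand-in for the per-process JIT program cache. Logging the
// map's size from the destructor reports the high-water mark exactly
// once, at process teardown, after all Google tests have finished.
class program_cache {
  std::unordered_map<std::string, std::string> programs_;

 public:
  void store(std::string key, std::string program)
  {
    programs_.emplace(std::move(key), std::move(program));
  }

  ~program_cache()
  {
    std::cerr << "JIT program cache final size: " << programs_.size() << '\n';
  }
};

int main()
{
  program_cache cache;
  cache.store("binop_add_int32", "/* compiled program */");
  cache.store("binop_mul_float64", "/* compiled program */");
}  // destructor prints: JIT program cache final size: 2
```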
And the result I got (after running all the tests):
…
Apparently, the cache never grows anywhere near that large. The current values (…) are more than enough.
Nice. LGTM 👍
@gpucibot merge
In file `jit/cache.cpp`, a program cache always internally reserves a `std::unordered_map` using a size set by the environment variable `LIBCUDF_KERNEL_CACHE_LIMIT_PER_PROCESS`. If that environment variable does not exist, a default value (`std::numeric_limits<size_t>::max()`) is used. Such a default value is huge, leading to an attempt to allocate an impossibly large chunk of memory, which crashes the system.

This PR changes that default value from `std::numeric_limits<size_t>::max()` to `1024^2`. This is essentially a reverse of PR #10312, but with the default value set to `1024^2` instead of `100`.

Note that `1024^2` is just an arbitrary number, not based on any specific calculation.

Closes #10312 and closes #9362.