-
-
Notifications
You must be signed in to change notification settings - Fork 5.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Make String more GC friendly #40840
Comments
As a reference. What I am aiming to achieve is to have things like https://discourse.julialang.org/t/the-state-of-dataframes-jl-h2o-benchmark/43081/21 working with just Julia Base as much as possible. |
If I understand the main issues, this would not fix it, since in a |
This is what I have imagined. However, I understand your point of not over-specializing code. Also saving memory will also help GC as it will occur less frequently. Let me give an example of the pattern in which I want to ensure GC does not get super slow:
(this is not an artificial example - something like this happens e.g. in when doing joins to create left to right table mapping which is needed only temporarily, but can be large) In such a scenario I would prefer to avoid high cost of GC due to the fact that we have 10^9 strings that potentially have to be marked (or at least to have to pay this cost very infrequently, and most of the time not have to pay it). |
A related question is: if we have a collection of
which to my understanding shows that every element of |
I am opening this issue as a follow-up to a discussion on Slack not to loose track of it.
Rationale: in data-science workflows it is very common to have very large tables that hold columns that consist of many unique strings (e.g. product ID that is non-numeric character sequence).
In such cases the current design of combination of GC and
String
type cause us to create a lot of small strings. The effect is described e.g. here h2oai/db-benchmark#210 (in these benchmarks the high-count string column is not taking part in any computations - it just sits there using-up memory and causing GC strain). The issue is especially apparent in multi-threading contexts (i.e. when the operation you want to do is parallelized and fast in general, but is paused by triggered GC collection cycles).I think - given we want Julia to be fast in data science workflows - this issue critically needs to be resolved (it is apparent in H2O benchmarks, but I get this problem constantly reported by users of DataFrames.jl).
As this issue touches deep Julia Base internals, I am probably not the best person to decide what should be done (as there are for sure many considerations that have to be made before making a decision), but once the decision on what to do is made I can help implementing the changes (unless of course core devs would be willing to do them). Here is a list of options I can see (some of them might immediately make no sense for Julia core devs - in such case please comment, but I do not want to limit myself at this stage of thinking about the issue):
String
type in GC (related to the above, but we might e.g. decide to always treatString
as very old; possibly this could be enabled/disabled by some run-time option)String
interning (thus fully disabling GC for them when interning is on) - this would have an additional benefit of faster comparisons at the expense of creation timeIn the mean time @quinnj is working on improving the handling of this issue on CSV.jl side (to avoid allocation of strings at all), but I think it is kind of a second-best and we should have a good solution in Julia Base.
The text was updated successfully, but these errors were encountered: