Make String more GC friendly #40840

bkamins · 2021-05-17T08:46:32Z

I am opening this issue as a follow-up to a discussion on Slack so that we do not lose track of it.

Rationale: in data-science workflows it is very common to have very large tables with columns that consist of many unique strings (e.g. a product ID that is a non-numeric character sequence).

In such cases the current design of the GC combined with the String type causes us to create a lot of small strings. The effect is described e.g. in h2oai/db-benchmark#210 (in these benchmarks the high-count string column does not take part in any computations - it just sits there using up memory and causing GC strain). The issue is especially apparent in multi-threaded contexts (i.e. when the operation you want to do is parallelized and fast in general, but is paused whenever a GC collection cycle is triggered).

I think - given we want Julia to be fast in data science workflows - this issue critically needs to be resolved (it is apparent in the H2O benchmarks, and users of DataFrames.jl constantly report this problem to me).

As this issue touches deep Julia Base internals, I am probably not the best person to decide what should be done (there are surely many considerations to weigh before making a decision), but once the decision on what to do is made I can help implement the changes (unless of course core devs would be willing to make them). Here is a list of options I can see (some of them might immediately make no sense to Julia core devs - in such a case please comment, but I do not want to limit myself at this stage of thinking about the issue):

improve the "generational" aspect of GC (related: The GC often doesn't act generational #40644)
have a special handling of String type in GC (related to the above, but we might e.g. decide to always treat String as very old; possibly this could be enabled/disabled by some run-time option)
have a run-time option to turn on/off String interning (thus fully disabling GC for them when interning is on) - this would have an additional benefit of faster comparisons at the expense of creation time
have a special representation of short strings that would be non-allocating (if you have very many strings most likely they are short)
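
To make the interning option concrete, here is a minimal user-level sketch. It assumes a hypothetical `StringPool` type and `intern!` function, neither of which exists in Julia Base, and it only deduplicates equal strings; it does not change how the GC treats them.

```julia
# Hypothetical user-level string pool (NOT a Julia Base API):
# equal strings are mapped to one shared, canonical String object.
struct StringPool
    table::Dict{String,String}
end
StringPool() = StringPool(Dict{String,String}())

# Return the pooled copy of `s`, adding it to the pool on first sight.
function intern!(pool::StringPool, s::AbstractString)
    key = String(s)                  # no copy if `s` is already a String
    return get!(pool.table, key, key)
end

pool = StringPool()
raw1 = string("prod-", 42)           # two separately allocated,
raw2 = string("prod-", 42)           # but equal, strings
a = intern!(pool, raw1)
b = intern!(pool, raw2)

# After interning, both names refer to the same underlying data,
# so equality could in principle be decided by comparing addresses.
println(pointer(a) == pointer(b))
println(length(pool.table))
```

The trade-off mentioned in the list is visible here: every creation pays for a hash lookup, while repeated values share one allocation.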

In the meantime @quinnj is working on improving the handling of this issue on the CSV.jl side (to avoid allocating strings at all), but I think it is kind of second-best and we should have a good solution in Julia Base.


bkamins · 2021-05-17T12:08:29Z

As a reference: what I am aiming to achieve is to have things like https://discourse.julialang.org/t/the-state-of-dataframes-jl-h2o-benchmark/43081/21 working with just Julia Base as much as possible.

JeffBezanson · 2021-05-17T18:36:08Z

> have a special representation of short strings that would be non-allocating

If I understand the main issues, this would not fix it, since in a Vector{String} the GC would still have to look at every element to see which ones are references. It would save memory but not marking time. I can imagine fancy things like keeping a flag in the Vector to track whether every element is short, but that is probably too special-case and we'd be better off doing general card-marking.
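
The distinction above, between elements stored inline and elements stored as references, can be probed with a rough sketch like the following. The helper `full_gc_seconds` and the sizes are made up for illustration, and the timings depend heavily on the machine and the Julia version, so no particular numbers are claimed.

```julia
# Hypothetical helper: time one full collection while `data` stays alive.
function full_gc_seconds(data)
    local t
    GC.@preserve data begin
        GC.gc(true)                  # warm-up collection
        t = @elapsed GC.gc(true)     # the collection we actually time
    end
    return t
end

n = 10^6
ints = collect(1:n)                  # Vector{Int}: elements stored inline
strs = string.(1:n)                  # Vector{String}: elements are references

# Int is a plain-bits type, String is not, so a Vector{String} holds
# pointers to separately allocated objects.
println("isbitstype(Int)    = ", isbitstype(Int))
println("isbitstype(String) = ", isbitstype(String))

println("full GC with Vector{Int} alive:    ", full_gc_seconds(ints), " s")
println("full GC with Vector{String} alive: ", full_gc_seconds(strs), " s")
```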

bkamins · 2021-05-17T20:08:07Z

> track whether every element is short

This is what I had imagined. However, I understand your point about not over-specializing code. Saving memory will also help the GC, as collections will occur less frequently.

Let me give an example of the pattern in which I want to ensure GC does not get super slow:

we have a large Vector{String}; for simplicity assume all its entries are unique; the vector has 10^9 elements; call this vector x
using this vector we create a large temporary object, e.g. d = Dict(x .=> axes(x, 1))
we do some operations using d but eventually we drop it
x is still alive, but d needs to be garbage collected as it is large and GC needs to reclaim the memory occupied by it

(this is not an artificial example - something like this happens e.g. when doing joins, to create a left-to-right table mapping that is needed only temporarily but can be large)
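
The steps above can be sketched as a small, scaled-down script. The function name `join_like_pattern` is made up for illustration, and the size is reduced from 10^9 to something that runs quickly; it only reproduces the allocation pattern and makes no claim about specific timings.

```julia
# Scaled-down sketch of the pattern: a long-lived Vector{String},
# a large temporary Dict built from it, and a full GC after the Dict is dropped.
function join_like_pattern(n::Integer)
    x = string.("id-", 1:n)          # long-lived vector of unique strings
    d = Dict(x .=> axes(x, 1))       # large temporary lookup table
    last_index = d[x[end]]           # "some operations using d"
    d = nothing                      # drop the temporary object
    gc_seconds = @elapsed GC.gc(true)  # x is still alive during this collection
    return x, last_index, gc_seconds
end

x, last_index, gc_seconds = join_like_pattern(10^5)
println("elements kept alive: ", length(x))
println("full GC after dropping d took ", gc_seconds, " s")
```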

In such a scenario I would prefer to avoid the high cost of GC caused by the fact that we have 10^9 strings that potentially have to be marked (or at least to pay this cost very infrequently, and most of the time not pay it at all).
Also - what I am advocating - is that if ensuring this means that the String type gets special treatment by the GC, the core Julia team could consider allowing this case to be handled specially. Thank you!

bkamins · 2021-06-03T07:42:48Z

A related question is: if we have a collection of Symbol - would it be possible to avoid a full GC scan of its entries (and likewise when the eltype of the collection is e.g. Union{Symbol, Missing})? The issue is that in such cases it is clear by design that marking is not required; however, we have (fresh Julia session):

julia> x = Symbol.(1:10^7);

julia> GC.gc(true);

julia> @time GC.gc(true);
  0.069542 seconds (99.98% gc time)

julia> @time GC.gc(true);
  0.071290 seconds (99.99% gc time)

which to my understanding shows that every element of x is investigated by GC.

nalimilan added the GC (Garbage collector), performance (Must go faster), and strings ("Strings!") labels May 17, 2021

bkamins mentioned this issue May 20, 2021

Improve count performance #40564

Merged

bkamins mentioned this issue Jun 4, 2021

Julia: use interned strings h2oai/db-benchmark#222

Merged

KristofferC mentioned this issue Jul 14, 2021

some optimizations for Array{Symbol} #41575

Merged
