
Iterating a SortedDict causes memory allocation #93

Closed · cberzan opened this issue May 12, 2015 · 17 comments

cberzan commented May 12, 2015

Is there a way to iterate through a SortedDict without causing memory allocation? (I'm using a SortedDict in an inner loop, and most of the time is spent in the next function used for iterating.)

Here is a small example:

```julia
using DataStructures

function bench()
    dict = SortedDict(["New York" => 1788, "Illinois" => 1818])

    clear_malloc_data()
    @time for trial in 1:10000000
        counter = 0
        for item in dict
            counter += 1  # dummy operation for example purposes only
        end
    end
end

bench()
```

I'm running Julia 0.3.4 with `--track-allocation=all`. I'm seeing allocations on the `for item in dict` line in my script, and in the `nexthelper` function in sortedDict.jl. It seems that 320 bytes are allocated at every iteration.

If you think this is doable and point me in the right direction, I will consider contributing a pull request :)

@kmsquire
Member

@cberzan, thanks for the issue. @StephenVavasis might have some ideas, since he wrote the code.

Iteration in Julia is done with `start()`, `next()`, and `done()`, so if you follow those functions in sortedDict.jl, you might find something.

I just took a quick look, and it looks like an `sdtoken` is constructed and discarded at every iteration (in the call from `next(::SDIterableTypesBase, state::SDIterationState)` to `nexthelper`). This could be rewritten so that the token is never created. Why don't you try that and see if it helps?

@StephenVavasis
Contributor

I'll admit that I don't have a clear understanding of Julia's memory allocation behavior. In C/C++, if a loop body allocates a stack variable, e.g.,

```c
for (int i = 0; i < n; ++i) {
    int j = i * 2;
    /* etc. */
}
```

then the code does not pay a price for allocating j on every iteration; the compiler reserves a spot on the stack for j once and for all at the beginning of the loop.

How about Julia? If a loop body creates a fixed-size variable (e.g. an Int or small immutable object), is there memory allocation on each iteration? If so, how does one reduce or minimize this overhead?

By the way, the reason for nexthelper is that the code supports a large number of looping constructs. Disjoint start/next/done routines for each possible loop construct would have resulted in a lot of repetitive code, so I tried to abstract out as much commonality as I could between the various start/next/done routines.

-- Steve Vavasis

@simonster
Member

@StephenVavasis This is true of Julia when variables are "bits types" (which you can check at the command line using `isbits`), but anything that isn't a bits type is allocated on the heap. Unfortunately, at least at present, immutables that contain references to non-bits types are not bits types. This includes both `SDIterationState` and `SDToken`.
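This distinction can be checked directly. The sketch below uses current Julia syntax (`struct` and `isbitstype` replaced this thread's `immutable` and `isbits(T)` in later versions); the type names are hypothetical stand-ins for the token layout described in this thread, not the package's actual definitions:

```julia
# Hypothetical stand-ins mirroring the layouts discussed above.
struct SemiTokenLike        # wraps only an Int -> a bits type
    address::Int
end

struct TokenLike            # holds a reference to a heap object -> not a bits type
    container::Vector{Int}  # stands in for the pointer to the SortedDict
    semitoken::SemiTokenLike
end

@assert isbitstype(Int)
@assert isbitstype(SemiTokenLike)   # only bits-type fields
@assert !isbitstype(TokenLike)      # the reference field disqualifies it
```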

@StephenVavasis
Contributor

@simonster, thanks for the explanation.

Just so everyone understands the issue: a 'token' is an immutable object that contains two fields: an integer (wrapped in another single-field immutable called a semitoken) and a pointer to the container.

Right now, I don't see how to fix the problem raised by @cberzan. The issue is that the iteration state somehow needs to remember the sorted container over which the iteration is taking place, with a pointer I suppose. If somehow I could split the iteration state into two pieces, one part that is not reallocated on each loop iteration and contains the pointer, and another part that contains the index information (Ints), then everything would be OK. But I don't see how to make this happen in the start/next/done framework of Julia: the state of the loop apparently needs to be recreated on each iteration, and there is no place to hide some permanent state information.

@cberzan, you could improve your performance by calling the lower-level non-exported routine `nextloc0` from balancedTree.jl instead of using the documented container loops, but this would make your code fragile, because I or someone else might rewrite these routines in a future release. Possibly I should expose these inner routines via a documented interface? Question for @simonster: is there any plan to update the Julia core so that fixed-size objects go on the stack instead of the heap? How far in the future is that?

-- Steve

@StephenVavasis
Contributor

By the way, does anyone know whether the built-in `Dict` suffers from this issue (that iteration with `start`/`next`/`done` requires heap allocation for the iteration state)? I found the relevant code in dict.jl difficult to understand because of calls to C routines and the use of `reinterpret`.

@cberzan
Author

cberzan commented May 12, 2015

It appears that iterating a built-in Dict also causes memory allocation.

```julia
function bench()
    dict = ["New York" => 1788, "Illinois" => 1818]

    clear_malloc_data()
    @time for trial in 1:10000000
        counter = 0
        for item in dict
            counter += 1
        end
    end
end

bench()
```

gives me `elapsed time: 1.57196075 seconds (960000000 bytes allocated, 18.72% gc time)`.

The `--track-allocation=all` results say the allocation happens on the first line of `skip_deleted` in dict.jl, which is `L = length(h.slots)`. But when I profile a call to `length(::Array{Uint8,1})` separately, I get 0 bytes allocated. So I suspect the results of `--track-allocation=all` are unreliable because of inlining...

@simonster
Member

@StephenVavasis Do you need to store the container in the state? The object being iterated over is passed as the first argument to next.

There are plans to allocate objects on the stack when references cannot escape to the heap (JuliaLang/julia#8134) but we're not there yet.

@cberzan For a `Dict{ASCIIString,Int}`, the iterator yields a `Tuple{ASCIIString,Int}`, which is not a bits type and has to be heap-allocated. But iterating a `Dict{Int,Int}` yields a `Tuple{Int,Int}`, which is a bits type, and indeed doesn't allocate.
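This claim about tuple types can be verified in one line each; the sketch below uses current syntax (on the Julia 0.3 of this thread the string type was `ASCIIString` and the check was `isbits` on the type):

```julia
# A tuple of bits types is itself a bits type; a tuple containing a
# String is not, so each yielded pair must be heap-allocated.
@assert isbitstype(Tuple{Int,Int})
@assert !isbitstype(Tuple{String,Int})
```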

@StephenVavasis
Contributor

I implemented this change in the branch SortedContainers. I did not yet create a pull request for this commit (there is a pull request for an earlier commit on this branch), but I did push the commit to my personal repository: https://github.com/StephenVavasis/DataStructures.jl, branch SortedContainers.

In more detail, the change implemented is as follows. `SCIterationState` now holds two integers (instead of two integers and a pointer). The various non-token iterations no longer construct tokens, only semitokens (Ints).

I saw a small improvement (0-10%) in my own timing tests. I am doubtful that these changes will help @cberzan's timing tests much, because of the structure of his tests (the keys are strings, and the iteration object is constructed many times), but if he wishes to try, he can carry out a `Pkg.clone(..)` and `Pkg.checkout(..)` from my repository.

I think these changes might lead to a more substantial improvement in a loop over a container with many entries (instead of just 2) and in which the keys and values are bits-types (rather than strings), but I don't have a test like that yet.

@StephenVavasis
Contributor

I ran a test with integer key/value pairs and with a large SortedDict using my latest code (that removes the unnecessary pointers from SCIterationState and nexthelper) versus an older version, and I observed a speedup of a factor of 3! So this is a very useful improvement. Thanks to @cberzan for pointing out the performance issue and to @simonster for explaining the issue. My code is below. I tested it with the statement:

```julia
@time sumsquares1.sumsquares(80000,10000)
```

```julia
module sumsquares1

import DataStructures
using DataStructures.SortedDict

function sumsquares(m,n)
    u = SortedDict(Dict{Int,Int}())
    for j = 1 : m
        u[j] = j*j
    end
    for trial = 1 : n
        sum1 = 0
        for item in u
            sum1 += item[2]
        end
        @assert sum1 == m * (m+1) * (2m+1) / 6
    end
end

end
```

Edit (kms): Formatting.

@cberzan
Author

cberzan commented May 15, 2015

@StephenVavasis Thanks; that was fast! Using your branch, I'm seeing a ~30% speedup in my code, and ~40% reduction in bytes allocated. (I meant my actual code, not the examples posted in this issue.)

Also thanks to @simonster for pointing out that some tuples are bits types and some are not. I will try to rewrite my code to use bits types so that iterating dicts can be done without heap allocation.
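A minimal sketch of that rewrite, in current Julia syntax (the function and variable names here are illustrative, not from cberzan's actual code): with Int keys and values, the pairs yielded by iteration are bits types, so the summing loop itself need not box anything. Exact allocation counts can be compared with `@allocated` and vary by Julia version.

```julia
# Sum the values of a Dict whose key/value pairs are bits types.
function total(d::Dict{Int,Int})
    s = 0
    for (k, v) in d   # yields Pair{Int,Int}, a bits type
        s += v
    end
    return s
end

d = Dict(1 => 1788, 2 => 1818)
@assert total(d) == 1788 + 1818
```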

@StephenVavasis
Contributor

Dear Readers,

Recall the issue with tokens in the SortedDict data structure: they are fixed-size immutable objects consisting of a wrapped Int and a pointer to a SortedDict, but they are nonetheless allocated on the heap and thus cause a significant loss of performance. I carried out more experimenting and discovered that if I define tokens to be tuples of (pointer, int) instead of a two-field immutable, this performance problem goes away. See the example code below. A few questions about this: (1) Is it a good idea? Does it lead to readable code? It would mean that the old token interface would have to be deprecated in favor of tokens-as-tuples. (2) Can anyone explain why tuples do not suffer from the same problem as immutables? I would like to understand the issue to make sure I don't accidentally step into another pitfall.

Thanks,
Steve

The following code creates a sorted dict in which the key i (an integer) is associated with the value i^2 (also an integer). Then it iterates over the sorted dict repeatedly and sums the squares using tokens explicitly (instead of a for-loop). This code uses the proposed new token interface in which a token is a tuple (pointer, IntSemiToken). Timing tests indicate that this code is much faster and allocates far less memory than the old token code, in which a token is a two-field immutable.

```julia
function ssq4(m,n)
    u = SortedDict(Dict{Int,Int}())
    for j = 1 : m
        u[j] = j*j
    end
    for trial = 1 : n
        sum1 = 0
        (u,s) = startof(u)
        while !isequal((u,s), pastendtoken(u))
            sum1 += deref_value((u,s))
            (u,s) = advance((u,s))
        end
        @assert sum1 == m * (m+1) * (2m+1) / 6
    end
end
```

Edit (kms): added backticks.

@StephenVavasis
Contributor

Sorry to follow up on my own post, but here is some additional weirdness with using tuples: if I give the tuple a name t in the above code, then all of a sudden the run-time jumps by 50% and the memory allocation skyrockets. Why? Here is the same code in which the tuple has a name:

```julia
function ssq5(m,n)
    u = SortedDict(Dict{Int,Int}())
    for j = 1 : m
        u[j] = j*j
    end
    for trial = 1 : n
        sum1 = 0
        t = startof(u)
        while !isequal(t, pastendtoken(u))
            sum1 += deref_value(t)
            t = advance(t)
        end
        @assert sum1 == m * (m+1) * (2m+1) / 6
    end
end
```

Edit (kms): added backticks.

@kmsquire
Member

@StephenVavasis, you should add triple backticks around code that you paste into GitHub, so that it is formatted properly, but also because unquoted macros will send a notification to users with the same name (in this case, the @assert organization).

(Bonus: If you use ```julia as the first line, you'll get syntax highlighting.)

@mauro3

mauro3 commented May 23, 2015

I can't claim that I understand the intricacies of the SortedDict. However, as @simonster suggested above, is there a reason to carry the dict around in the state? The other argument of the iterator functions should be used for that. For instance, the normal Dict iterator functions do not carry a reference to the dict in the state, as that reference is passed in the first argument:

```julia
start(t::Dict) = skip_deleted(t, 1)
done(t::Dict, i) = i > length(t.vals)
next(t::Dict, i) = ((t.keys[i],t.vals[i]), skip_deleted(t,i+1))
function skip_deleted(h::Dict, i)
    L = length(h.slots)
    while i<=L && !isslotfilled(h,i)
        i += 1
    end
    return i
end
```

Translated to your code this might look like:

```julia
start(m::SortedDict) = (nextloc0(m.bt,1), 2)
function next(m::SortedDict, state)
    dt, t, ni = nexthelper(m, state)
    (dt.k, dt.d), ni
end
done(m::SortedDict, state) = state[1] == state[2]
function nexthelper(m::SortedDict, state)
    sn = state[1]
    (sn < 3 || !(sn in m.bt.useddatacells)) && throw(BoundsError())
    m.bt.data[sn], sdtoken_construct(m,sn), (nextloc0(m.bt, sn), state[2])
end
```

Note how m is carried in the first argument, so it doesn't need to be carried in the state; thus the state becomes isbits and no allocations occur. The same goes for the specialized iterators; the trick there is to wrap the dict in a helper type. Here is how Dict does it:

```julia
immutable KeyIterator{T<:Associative}
    dict::T
end
immutable ValueIterator{T<:Associative}
    dict::T
end

start(v::Union(KeyIterator,ValueIterator)) = start(v.dict)
done(v::Union(KeyIterator,ValueIterator), state) = done(v.dict, state)

function next(v::KeyIterator, state)
    n = next(v.dict, state)
    n[1][1], n[2]
end

function next(v::ValueIterator, state)
    n = next(v.dict, state)
    n[1][2], n[2]
end
```

So it iterates through the dict just like the ordinary iterator, except that in next it only returns the key or value.

@StephenVavasis
Contributor

For ordinary for-loops, I have already fixed the start/done/next functions following @simonster's suggestion, so the 'token' overhead issue goes away. The issue remains, however, in situations in which a token is explicitly needed by the user of the package. When a token is explicitly needed, the question is whether the package should provide a token that indexes the item+container or just the item (in which case the user keeps track of the container). I have called the latter a 'semitoken' in my documentation.

In some earlier rounds of discussion with @kmsquire, we decided that providing an index to an item+container would be more user-friendly and elegant. It is, for example, the way that iterators work in C++ standard containers. Of course, we didn't know about this performance overhead issue when we made that decision.

So far I haven't heard from anyone who can explain why there is overhead with tuples if you bind a name to them (as opposed to binding names to their components), so I opened a discussion on that topic on the julia-dev newsgroup here:

https://groups.google.com/forum/#!topic/julia-dev/uQ1DsGGSrD0

I'm not sure how to proceed with the sorted-container package development: drop the token interface (because of the performance problems) and provide only a semitoken interface? Provide both interfaces at the cost of more code and documentation complexity? Provide only the token interface (for elegance) and hope that the performance issue is fixed in a future version of Julia?

@mauro3

mauro3 commented May 23, 2015

Sorry, I somehow missed the updates in your SortedContainers branch (the relevant bits are here: https://github.com/StephenVavasis/DataStructures.jl/blob/SortedContainers/src/containerloops.jl).

@StephenVavasis
Contributor

Closed via #787
