Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RobinDict #501

Merged
merged 58 commits into from
Jul 5, 2019
Merged

RobinDict #501

merged 58 commits into from
Jul 5, 2019

Conversation

eulerkochy
Copy link
Member

@eulerkochy eulerkochy commented May 14, 2019

I call this RobinDict, which implements Robinhood Hashing technique .

  • Write delete/erase operation for RobinDict
  • Add tests
  • Write docstrings
  • Benchmarking the code
  • Documentation

@oxinabox
Copy link
Member

oxinabox commented May 14, 2019

Some early points since i know this is not done yet.

  • Please use 4 spaces for intending. Not tabs.
  • IIRC: The AbstractDict interface requires get(dict, key, default) to be implemented, and if you do that (along with some other things you've already done) a bunch of things (like equality) will work for free.
  • for good style please always use return to return a value from a multiline function.

@eulerkochy
Copy link
Member Author

eulerkochy commented May 14, 2019

I didn't quite get the first point, shouldn't Tab size : 4 do the same thing?
PS: Look at the bottom right-hand corner of the attached screenshot
Screenshot (5)_LI

src/robin_dict.jl Outdated Show resolved Hide resolved
@eulerkochy
Copy link
Member Author

eulerkochy commented May 14, 2019 via email

src/robin_dict.jl Outdated Show resolved Hide resolved
src/robin_dict.jl Outdated Show resolved Hide resolved
src/robin_dict.jl Outdated Show resolved Hide resolved
src/robin_dict.jl Outdated Show resolved Hide resolved
src/robin_dict.jl Outdated Show resolved Hide resolved
src/robin_dict.jl Outdated Show resolved Hide resolved
src/robin_dict.jl Outdated Show resolved Hide resolved
src/robin_dict.jl Outdated Show resolved Hide resolved
src/robin_dict.jl Outdated Show resolved Hide resolved
src/robin_dict.jl Outdated Show resolved Hide resolved
src/robin_dict.jl Outdated Show resolved Hide resolved
@eulerkochy eulerkochy marked this pull request as ready for review May 30, 2019 13:39
@codecov
Copy link

codecov bot commented Jun 2, 2019

Codecov Report

Merging #501 into master will increase coverage by 4.27%.
The diff coverage is 99.08%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #501      +/-   ##
==========================================
+ Coverage   89.11%   93.39%   +4.27%     
==========================================
  Files          31       32       +1     
  Lines        2408     2529     +121     
==========================================
+ Hits         2146     2362     +216     
+ Misses        262      167      -95
Impacted Files Coverage Δ
src/DataStructures.jl 100% <ø> (ø) ⬆️
src/robin_dict.jl 99.08% <99.08%> (ø)
src/int_set.jl 100% <0%> (+0.79%) ⬆️
src/heaps/minmax_heap.jl 100% <0%> (+0.95%) ⬆️
src/multi_dict.jl 75% <0%> (+1.08%) ⬆️
src/mutable_list.jl 99.32% <0%> (+1.32%) ⬆️
src/list.jl 100% <0%> (+1.53%) ⬆️
src/disjoint_set.jl 98.07% <0%> (+1.85%) ⬆️
src/priorityqueue.jl 98.64% <0%> (+1.95%) ⬆️
... and 10 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3529577...d464edf. Read the comment docs.

@codecov
Copy link

codecov bot commented Jun 2, 2019

Codecov Report

Merging #501 into master will decrease coverage by 1.48%.
The diff coverage is 77.73%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #501      +/-   ##
==========================================
- Coverage    89.4%   87.91%   -1.49%     
==========================================
  Files          31       32       +1     
  Lines        2407     2690     +283     
==========================================
+ Hits         2152     2365     +213     
- Misses        255      325      +70
Impacted Files Coverage Δ
src/DataStructures.jl 100% <ø> (ø) ⬆️
src/robin_dict.jl 77.73% <77.73%> (ø)
src/container_loops.jl 51.21% <0%> (-4.07%) ⬇️
src/sorted_dict.jl 84.03% <0%> (-1.69%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 9c55bda...7cb18f2. Read the comment docs.

@eulerkochy
Copy link
Member Author

Ready for review @oxinabox

changes to insert

fix rh_insert, write rh_search, getindex
Fix erronous h.count
remove un-necessary print statement
minor corrections in code, add _tablesz for code readability

mistake in constructor with pairs
minor changes in rh_insert!, write out rh_delete!

correct a blunder in rehash!

minor change in sizehint!

yet another mistake in sizehint!
change in rh_delete!

minor changes in rh_delete! and pop!

fix a typo
introduce max_lf(for benchmarking purposes), write constructor for iterables
constructors for tuples and iterables updated to cover all test cases

remove un-necessary constant declaration

indentation changes
@eulerkochy
Copy link
Member Author

I thought I'll just post the benchmarking results here and keep updating it as I refactor the code.
Here's the result after adding rh_insert_for_rehash!:

julia> include("test/bench_robin_dict.jl")
.
Sample #1 Key => Integer , Size => 10^6 entries
.
        add_entries for RobinDict()
  847.246 ms (5914106 allocations: 196.91 MiB)
        add_entries for Dict()
  1.122 s (5985262 allocations: 156.50 MiB)
        add_entries for RobinDict{Int, Int}()
  182.534 ms (48 allocations: 106.67 MiB)
        add_entries for Dict{Int, Int}()
  112.181 ms (54 allocations: 65.17 MiB)
.
Sample #2 Key => Float32 , Size => 10^6 entries
.
        add_entries for RobinDict()
  821.209 ms (5914028 allocations: 196.91 MiB)
        add_entries for Dict()
  1.107 s (5986513 allocations: 156.52 MiB)
        add_entries for RobinDict{Float32, Float32}()
  149.796 ms (46 allocations: 64.00 MiB)
        add_entries for Dict{Float32, Float32}()
  99.266 ms (52 allocations: 34.50 MiB)
.
Sample #3 Key => String , Size => 10^6 entries
.
        add_entries for RobinDict()
  589.279 ms (1956745 allocations: 136.53 MiB)
        add_entries for Dict()
  568.355 ms (2562043 allocations: 104.26 MiB)
        add_entries for RobinDict{String, String}()
  365.851 ms (48 allocations: 106.67 MiB)
        add_entries for Dict{String, String}()
  439.566 ms (54 allocations: 65.17 MiB)
.
Sample #4 Key => Integer , Size => 10^7 entries
.
        add_entries for RobinDict()
  6.338 s (35658047 allocations: 970.77 MiB)
        add_entries for Dict()
  12.688 s (57823447 allocations: 1.39 GiB)
        add_entries for RobinDict{Int, Int}()
  1.656 s (54 allocations: 426.67 MiB)
        add_entries for Dict{Int, Int}()
  1.544 s (72 allocations: 541.17 MiB)
.
Sample #5 Key => Integer , Size => 10^5 entries
.
        add_entries for RobinDict()
  19.470 ms (444164 allocations: 13.45 MiB)
        add_entries for Dict()
  17.759 ms (396983 allocations: 11.73 MiB)
        add_entries for RobinDict{Int, Int}()
  8.130 ms (36 allocations: 6.67 MiB)
        add_entries for Dict{Int, Int}()
  5.527 ms (36 allocations: 5.67 MiB)
.
Sample #6 Key => Float32 , Size => 10^5 entries
.
        add_entries for RobinDict()
  19.748 ms (444207 allocations: 13.45 MiB)
        add_entries for Dict()
  17.693 ms (399228 allocations: 11.76 MiB)
        add_entries for RobinDict{Float32, Float32}()
  7.281 ms (34 allocations: 4.00 MiB)
        add_entries for Dict{Float32, Float32}()
  4.871 ms (34 allocations: 3.00 MiB)
.
Sample #7 Key => String , Size => 10^5 entries
.
        add_entries for RobinDict()
  22.418 ms (121883 allocations: 8.53 MiB)
        add_entries for Dict()
  18.473 ms (115882 allocations: 7.44 MiB)
        add_entries for RobinDict{String, String}()
  15.422 ms (36 allocations: 6.67 MiB)
        add_entries for Dict{String, String}()
  15.398 ms (36 allocations: 5.67 MiB)

Here are the plots, which can provide an insight into the working of RobinDict at load factor of 70%
dibs_1000K  @0 7 L F
lf_and_max_probe_1000K @0 7 L F

@vtjnash
Copy link
Contributor

vtjnash commented Jun 25, 2019

which is pretty bad with respect to memory consumption

FWIW, the growth factor use usually use for AbstractDict is:
rehash!(h, length(h) > 64000 ? sz*2 : sz*4)

@eulerkochy eulerkochy changed the title [WIP] Dict with Robinhood hashing technique RobinDict Jun 26, 2019
test/test_robin_dict.jl Outdated Show resolved Hide resolved
@eulerkochy
Copy link
Member Author

@chethega @vtjnash @oxinabox review ? I'll rebase after that

@eulerkochy
Copy link
Member Author

Bump

docs/src/robin_dict.md Outdated Show resolved Hide resolved
docs/src/robin_dict.md Outdated Show resolved Hide resolved
docs/src/robin_dict.md Show resolved Hide resolved
@eulerkochy
Copy link
Member Author

Bump

Copy link

@chethega chethega left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine by me. Good work!

src/robin_dict.jl Outdated Show resolved Hide resolved
function _setindex!(h::RobinDict{K,V}, key::K, v0) where {K, V}
v = convert(V, v0)
sz = length(h.keys)
(h.count > ROBIN_DICT_LOAD_FACTOR * sz) && rehash!(h, sz<<2)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Always resizing by a factor of 4 looks sketchy (leads to very low load factors and hence memory waste for long-lived dicts). RobinDict should now be much cheaper to rehash than Base dict.

Maybe just copy what Base does? (i.e. use threshold to decide whether to grow by factor 2 or 4)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did that on purpose as this change showed some improvement in benchmarks . In details described here

index = rh_search(h, key)

index > 0 && return h.vals[index]

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In theory, we could use the partial traversal from rh_search and continue insertion from there, instead of starting from desired_index again: rh_search aborts at the first entry that would be relocated on insertion.

Just leaving that here for posterity (can be separate PR to improve perf).

@oxinabox
Copy link
Member

oxinabox commented Jul 5, 2019

Given @chethega has approved, I am going to merge this.
Nice work @eulerkochy

@oxinabox oxinabox merged commit 664847e into JuliaCollections:master Jul 5, 2019
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants