Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Define a token API for Associative types #24454

Open
kmsquire opened this issue Nov 2, 2017 · 13 comments
Open

Define a token API for Associative types #24454

kmsquire opened this issue Nov 2, 2017 · 13 comments
Labels
collections Data structures holding multiple items, e.g. sets

Comments

@kmsquire
Copy link
Member

kmsquire commented Nov 2, 2017

(From a recent discourse discussion.)

Consider

d = Dict{String, Int}()
...
if haskey(d, key)
    d[key] += 1
end

In this code, the hash of the key is calculated 3 times. Ideally, this would happen only once.

Tokens provide one way to handle this. Two possible interfaces:

#######
# API 1
#######
d = Dict{String, Int}()
...
token = gettoken1(d, key) # returns a DictToken object
d[token] += 1             # O(1) get and O(1) set
#######
# API 2
#######
d = Dict{String, Int}()
...
token = keytoken(d, key)        # essentially the same as gettoken1 in API 1
value = gettoken2(d, token)
settoken!(d, token, value + 1)

(gettoken1 and gettoken2 are named as such to distinguish them here. The actual name for whatever API is chosen should be gettoken)

I have a slight preference for API 1. API 2 was proposed by @StefanKarpinski here. (@StefanKarpinski, please fix if incorrect!)

Related solutions/issues:

  1. using tokens in SortedDicts in DataStructures.jl, to allow O(1) access to a value that was looked up
  2. using get!(d, key, default) to store and return a default value, if the key is not in the dictionary
  3. using DefaultDicts in DataStructures.jl, for a similar purpose
  4. proposal to use ?= as a get-or-set operation (?= "get or set" operator #2108)
  5. proposal to use Refs to access elements (julep/rfc: proposal for unified Dict get/insert/has interface #12157)

Cc: @StephenVavasis @andyferris

@xiaodaigh
Copy link

Why not just

updateindex!(dict, key, function_to_apply_to_value)

the outcome of the above should be the same as

dict[key] = get(dict, key, default_value) |> function_to_apply_to_value

so it's in one line and faster?

@StephenVavasis
Copy link
Contributor

The "token" feature of SortedDict is meant to address two problems: multiple key lookups for an update operation (the topic of this issue) and also iterating over a sorted container in the sort-order of the keys. The token serves as the state-variable for this iteration. The updateindex! proposal in the above comment would solve the first problem for both Dict and SortedDict, but tokens or something similar would still be needed to solve the second problem for the sorted containers.

@andyferris
Copy link
Member

I favor API 2.

I've been hoping that we can copy the IndexStyle type of idea from arrays and have iteration and methods like map (of values) using tokens automatically, without knowing the details of the tokens of a given dictionary. (I'm hoping this will work together with the semantics of similar in #24390 to make functions like map (of values) really fast using simple, generic code).

I'd also like to point out that linear indexing is a kind-of token system for dense arrays.

@mbauman
Copy link
Member

mbauman commented Nov 3, 2017

Wouldn't that be API 1, then? That'd more transparently give you a "fast" key to use with indexing syntaxes. The downside with API 1 is that then dictionaries can no longer support all types in the language as keys — the DictToken type (or whatever) would be special. I think that's fine. It's also worth noting that it'd be easy to define a gettoken1 fallback for ::Associative — you just return the (likely slow) key itself by default.

@andyferris
Copy link
Member

andyferris commented Nov 3, 2017

I guess I was imagining methods of map that use gettoken and settoken! and a separate slower (default) method that uses getindex and setindex!.

Wrapping in a DictToken would be fine, also.

In any case, the tricky bit is when you do operations like map! you need to figure out if the tokens of the input and output match, and what to do when they don't, etc.

@JeffBezanson
Copy link
Member

I favor adding update!. It's simple and handles the most common case. Tokens are more complex since you have the problem of invalidating them if the Dict changes.

@JeffBezanson JeffBezanson added the collections Data structures holding multiple items, e.g. sets label Nov 3, 2017
@stevengj
Copy link
Member

stevengj commented Nov 3, 2017

API 1 seems a lot clearer to me.

I don't think it's correct that API 1 wouldn't be able to support all types in the language, because you would parameterize the token type on the dictionary type.

That is, for d::Dict{K,V}, the gettoken(d, k) function would return an AssociativeToken{Dict,K,V} instance.

If you then wanted a new dictionary d2 whose keys were AssociativeToken{Dict,K,V}, then d2 would be of type Dict{AssociativeToken{Dict,K,V}, V}. gettoken on d2 would return an Associative{Dict, AssociativeToken{Dict,K,V}, V} instance. And so on.

@stevengj
Copy link
Member

stevengj commented Nov 3, 2017

Note that you'd want the function to be the first argument of update! so that you could use it with do syntax. But how would you avoid the redundant hashing in the following pattern:

if haskey(dict, k)
    .... do something...
else
    ... do something else with dict[k] ....
end

?

@stevengj
Copy link
Member

stevengj commented Nov 3, 2017

Also, without tokens, how would you swap two elements, i.e.

d[k1], d[k2] = d[k2], d[k1]

without hashing the keys twice? The proposed update! would only give you access to a single element, no?

@JeffBezanson
Copy link
Member

update! could have an optional default value to use if the key doesn't exist yet.

API 1 is ambiguous for Dicts of Any, e.g. for memoization.

It's true, tokens are more powerful, but therefore more difficult to support correctly.

@andyferris
Copy link
Member

I'm still not convinced that update! is sufficient for the kinds of things I was envisaging, though it is certainly helpful.

One obvious but tricky thing is that map!(f, dict1, dict2) should be fast whenever dict1 and dict2 share compatible tokens (which will hopefully be likely after calling similar). Ideally, we can also implement map! with generic code like we do for AbstractArray which as I understand will use linear index "tokens" whenever it can.

@nalimilan
Copy link
Member

update! can be useful on its own, so we could start with that and see whether people still need a lower-level system based on tokens.

@chethega
Copy link
Contributor

chethega commented May 21, 2018

Regarding update!, my favorite API would be

merge!(combine, d::Dict{K,V}, kv::Pair{K,V})
_merge!(combine, d::Dict{K,V}, k::K,v::V)

The _merge! is necessary for types where the pair formation is expensive (e.g. causes an allocation).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
collections Data structures holding multiple items, e.g. sets
Projects
None yet
Development

No branches or pull requests

9 participants