Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Broadcasting dictionaries with unmatched keys gives confusing result #19

Closed
tkf opened this issue May 30, 2020 · 10 comments · Fixed by #22
Closed

Broadcasting dictionaries with unmatched keys gives confusing result #19

tkf opened this issue May 30, 2020 · 10 comments · Fixed by #22

Comments

@tkf
Copy link

tkf commented May 30, 2020

julia> d = dictionary(:a => 1, :b => 2)
2-element HashDictionary{Symbol,Int64}
 :a1
 :b2

julia> e = dictionary(:a => 3, :c => 4)
2-element HashDictionary{Symbol,Int64}
 :a3
 :c4

julia> d .+ e
2-element HashDictionary{Symbol,Int64}
 :a4
 :b6

Maybe it should do what mergewith does? However, each kind of join also makes sense. I think slightly more manual API for specifying the join would be nice JuliaLang/julia#25904 (comment)

@andyferris
Copy link
Owner

andyferris commented Jun 8, 2020

Ahh - yes the bounds checks aren't done yet, partly because I've been fighting with myself about the semantics of map and broadcast, and different equalities on AbstractIndices.

My current thinking #13 is roughly that map respect iteration and ensures the keys match (in iteration order) while in broadcast we can have the keys in any order but they must all be issetequal (or scalar). (At a later point I'd be interested in "Cartesian indexing for dictionaries" which affects broadcast but that's down the road).

The results in #13 shouldn't be spurious anymore. But the more flexible broadcasting is not implemented yet.

@tkf
Copy link
Author

tkf commented Jun 8, 2020

semantics of map and broadcast

Yeah, I've been thinking about this too. I think the tension seems to be map based on the iteration protocol vs map-as-a-functor (like Haskell's fmap).

But, going back to the original topic here, isn't requiring isequal or issetequal a bit too restrictive? Or rather, if you require this condition, why not just require ===? Otherwise, wouldn't it make sense to define map(f, dictionaries...) to be semantically equivalent to map(Base.splat(f), joined_dictionary) where joined_dictionary is either inner or outer join of dictionaries?

Also, the semantics of broadcasting is that it not only maps the elements but also it stretches the shape of the container to fit with the larger container. So, maybe outer join makes sense? Furthermore, if we use nothing as the structural missingness of the joined dictionary, it makes sense to use something as a DSL for switching to inner join:

f.(d1, d2)                       # => AbstractDictionary{I,Union{R,Nothing}}   outer join with `nothing`
something.(f.(d1, d2))           # => AbstractDictionary{I,R}                  inner join
something.(f.(d1, d2), missing)  # => AbstractDictionary{I,Union{R,Missing}}   outer join with `missing`

where R is the return type of f.

Here, something is just used as a "marker" in the Broadcasted tree to specify the join type. Semantically, I think it is equivalent to lift the inner call tree to Union{Some{T},Nothing} and then skipping if the end result is nothing (or wrapping getindex with try ... catch; continue; end).

It'd be useful for flexible default value handling:

something.(f.(something.(d1, default1), d2))

This would fill the value of d1 on-the-fly but removes from the output dictionary if a key exist in d1 but not in d2.

@tkf
Copy link
Author

tkf commented Jun 9, 2020

Actually, maybe it doesn't make sense for map to be a synonym of join since you can have different functions for that (e.g., by moving all the join functions in DataFrames.jl to DataAPI.jl). OTOH, I thought it'd be nice to define some dotcall-based rich DSL for joins of dictionaries.

@andyferris
Copy link
Owner

I think join is rather different and should be modelled by something else. These are also really only primary-key - primary-key joins that you're discussing, too. Doing various things with two dictionaries with the union, intersect, setdiff and symdiff of their keys should be possible at some point.

It seems the people in Base equate map with iterate so we need to support that.

Also, the semantics of broadcasting is that it not only maps the elements but also it stretches the shape of the container to fit with the larger container.

broadcast only "stretches" containers on "dimensions" that are missing. It does special things to match up various CartesianIndex{N} keys. I think we can generalize this neatly (I want multidimensional dictionaries, in both Cartesian product spaces and more general nested/direct-sum/discrimenated-union spaces). Also, it seems to me that because it is key based rather than iteration based we can be flexible of the ordering within a dimension.

OTOH, I thought it'd be nice to define some dotcall-based rich DSL for joins of dictionaries.

I think we could have dotcall do really useful things, but I'm thinking of the kinds of things people do with NDSparse and so-on.

@andyferris
Copy link
Owner

andyferris commented Jun 9, 2020

Doing various things with two dictionaries with the union, intersect, setdiff and symdiff of their keys should be possible at some point.

Probably a mergewith! style thing. A signature something like this:

merge(dict1, dict2; left = identity, right = identity, inner = (x, y) -> y)

So you have keys only on the left (dict1) and have their values pass through the left function, same for keys only in dict2 with values going through right and where a key i is shared we get the value with inner(dict1[i], dict2[i]).

PS - I don't really want to call it mergewithleftrightinner :trollface: On a serious note, keyword arguments are great for APIs in my opinion, but we have troubles passing through type-constructors-as-functions and so-on... :(

EDIT: PPS that merge function only gives a dictionary having the union of the keys; we need to be able to restrict these to left, right, inner, not-left, not-right, not-inner. Maybe just set the various functions to nothing since it's a decent, non-callable sentinel.

@tkf
Copy link
Author

tkf commented Jun 9, 2020

Doing various things with two dictionaries with the union, intersect, setdiff and symdiff of their keys should be possible at some point.

I think union and intersect are covered in the DSL. (Aren't they outer join and inner join? Or maybe I'm just noob about relational algebra?) With the dotcall, it's doable with an arbitrary number of dictionaries and they can sit in a nested call tree. Expressing setdiff and `require additional "makrer" functions, though.

It seems the people in Base equate map with iterate so we need to support that.

I believe there is some room for improvement and fmap view would provide a nice guideline to this. You might be interested in this discussion: JuliaLang/julia#36106

I want multidimensional dictionaries, in both Cartesian product spaces and more general nested/direct-sum/discrimenated-union spaces

This sounds very interesting! If you have this in mind already it makes total sense to avoid attaching additional semantics to dotcalls on dictionaries.

@tkf
Copy link
Author

tkf commented Jun 9, 2020

BTW, since you mentioned nested dictionary, did you have a look at awkward-array https://github.com/scikit-hep/awkward-array? It's a Python library for nested arrays. I think it's exploring an interesting direction for nested data structure. I only know it from this Strange Loop talk https://www.youtube.com/watch?v=2NxWpU7NArk though. I meant to play with it and see how it can be relevant to APIs in Julia but I haven't had a chance to do this yet. Well, I'm mentioning it here anyway since there are some chance that it interest you or you already know it and have some thoughts on it.

@andyferris
Copy link
Owner

did you have a look at awkward-array

Yes I did! I just recently watched the same talk - very interesting.

The python-esque way of having matrices = nested vectors is quite interesting for ragged arrays... together with "dot-getindex" then I can imagine using something like ragged_array.[:].[end] get the last element of each subvector and so-on. I'm not sure how to do an arbitrary ragged_array[:, f(end)] with our current lowering? (end is translated last_index(ragged_array, 2) and is independent of the first index - in the nested case it would be a function of each ragged_array[i]).

@andyferris
Copy link
Owner

andyferris commented Jun 9, 2020

I think union and intersect are covered in the DSL. (Aren't they outer join and inner join? Or maybe I'm just noob about relational algebra?) With the dotcall, it's doable with an arbitrary number of dictionaries and they can sit in a nested call tree. Expressing setdiff and `require additional "makrer" functions, though.

If your goal is to implement relational algebra, there are a few things to keep in mind.

  1. There are only inner joins in relational algebra.
  2. Dictionaries represent only two-column (key, value) relations that are functional relations
  3. The join keys aren't necessarily unique.
  4. You can join on two unique columns, which is the case you seem to be considering with map for dictionaries. Using map and/or broadcast for this is a natural idea (at least for inner joins). Currently merge seems to be somewhat close to an outer join between two tables with the same schema (a key and a value column) - concretely my suggestion is to generalise this.
  5. You can join a non-unique column with a unique column, which is called a foreign-key join, and Julia's non-scalar indexing models this perfectly (if that's confusing it's in my JuliaCon 2018 talk, and it's implemented in Indexing.jl). I suppose the "left join" equivalent is non scalar get rather than non-scalar getindex, if that makes sense. Should I add gets to Indexing.jl? There is still right and outer joins to consider here.
  6. We still need operations for joins not involving primary keys. For dictionaries we might want to join on the values, for example. Would we want a new generic function for this? Would you do this by "inverting" one or both dictionaries first and using one of the operations above? (Do we want an invert function for dictionaries?)

SQL has copped a lot of flack for distorting relational algebra with confusing stuff with NULL and the results of many operations, which are necessary because it doesn't allow nested containers. Take for example C# IEnumerable (LINQ) - instead of left joins we have what I call left group join, which I copied in SplitApplyCombine.jl. The reason it is like a left-join is this:

If there are no correlated elements in inner for a given element of outer, the sequence of matches for that element will be empty but will still appear in the results.

So instead of a null sentinel you get an empty set of matches.

Additionally, the functional programming crowd also noted there is an interesting analogue between left foreign key joins and optional types and outer foreign key joins and variant types, which you see being exploited in the wild in SQL databases all the time (can't find my original reference for learning this, sorry). If you like internet rants which lie somewhere between amusing and dead serious, I also suggest googling something called "The Third Manifesto and an interesting implementation in python called "Dee". I particularly like the way they encourage thinking in terms of nested tables, for example in grouping.

image

(Sorry - clearly I've been thinking about this stuff too much...)

@tkf
Copy link
Author

tkf commented Jun 10, 2020

I appreciate your comments!

The python-esque way of having matrices = nested vectors is quite interesting for ragged arrays

Yeah, I agree that Numpy using C contiguous array and already abstracting multi-dimensional arrays as nested vectors give a big advantage to awkward-array.

how to do an arbitrary ragged_array[:, f(end)]

There was a bit of discussion in JuliaLang/julia#35681 for problems with current lowering of end. I think a closure-based lowering can solve this JuliaLang/julia#35681 (comment). It would be lowered to getindex(ragged_array, :, LazyIndex((_begin, _end) -> f(_end))).

ragged_array.[:].[end] is a bit mind-bending though. I guess x.[i].[j].[k] would be first uncurried/flatten (as in converting i -> j -> k -> value to (i, j, k) -> value) before broadcasted and then curried back afterwards (at least semantically)? I'm guessing this would be required for x.[i].[j].[k] .+ y.[a].[b] to work.

4. You can join on two unique columns, which is the case you seem to be considering with map for dictionaries. Using map and/or broadcast for this is a natural idea (at least for inner joins).

I was also thinking it might be possible to extend this to outer join on primary key (since dictionary keys are unique). I thought it's possible to do this without going to leftgroupjoin since you know that each group is either empty or a singleton. So, Union{Some{T}, Nothing} can be used to represent the group. But I was thinking more like full outer join by default and something.(...) to get inner join. So, left and right join can be expressed with extra somethings

something.(tuple.(d1, something.(d2, Some(nothing))))  # left join
something.(tuple.(something.(d1, Some(nothing)), d2))  # right join

I guess I need to try implementing this to see if I can make a coherent system out of this idea, though. I'm hoping that I can try this after JuliaLang/julia#36085 with arbitrary indexable objects including Base.Dict and NamedTuple.

6. We still need operations for joins not involving primary keys.

Yeah, I was only thinking join on primary keys (dictionary keys) when I talk about broadcasting. For generic join API we need more flexible entry point.

Would we want a new generic function for this? Would you do this by "inverting" one or both dictionaries first and using one of the operations above?

Why not have some generic entry point that treats everything as just iterable (at least semantically) and use, e.g., pairs(dict) and left = last to specify joining on values?

SQL has copped a lot of flack for distorting relational algebra with confusing stuff with NULL and the results of many operations, which are necessary because it doesn't allow nested containers.

OK I used SQL a lot for inspiration mostly because that's what I understand (although somewhat barely) and the internet is full of information about it. But maybe I should've avoided that 😆

I actually started implementing sorting-based lazy inner/left/outer join framework that works on bare Julia data (kind of in the split of SplitApplyCombine.jl). The idea was to make it work with Transducers.jl and SplittablesBase.jl (so that it's parallelizable). It works well with inner join but for outer joins it wasn't clear what the best API was. I can make a "flat" iterator using NULL but it may not be so useful if you want to consume it on-the-fly (this is true also when using foreign key). Until now it didn't click why SplitApplyCombine.jl has leftgroupjoin but I think now I appreciate this direction because similar idea would solve this.

(Rather note to self, but, I think it'd be better to split the work between inner and outer reducing functions and re-launch inner reducing function for each group. This is what I do for GroupBy transducer but somehow it didn't occur to me until now that I should use the same trick!)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants