Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid overflow #862

Closed
wants to merge 1 commit into from
Closed

Avoid overflow #862

wants to merge 1 commit into from

Conversation

matthieugomez
Copy link
Contributor

Correct the following bug :

df = DataFrame(v1 = pool(1:1000), v2 = pool(fill(1, 1000))) 
groupby(df, [:v1, :v2])

(initially part of pull request #815)

@alyst
Copy link
Contributor

alyst commented Aug 22, 2015

That wouldn't fully resolve the overflow problem. If the frame is grouped by, say, 4 columns with >2^16 unique elements each (or 8 columns with >2^8 uniques), UInt64 would be overflown as well.
One possibility is to periodically call PooledDataArray(x) inside the columns loop to "compactify" the group indices. I've tried implementing something along these lines, but abandoned the idea, because it's just an ugly way to reinvent hashing.

An alternative solution would be to use Dict with DataFrameRow that already has contents-based hash() and isequal(). It's implemented in #850 for both grouping and joining. It would be nice, if somebody reviews it. :)

@matthieugomez
Copy link
Contributor Author

What is panda / dplyr / data table doing ? Surely they solved this.
Anyway maybe the best thing for now would be to return an error if this happens ? This fix should still cover most of the bad situations.

@alyst
Copy link
Contributor

alyst commented Aug 22, 2015

dplyr has a lot of C++/Rcpp code (however, it also offers more, e.g. joining on columns with different types), but AFAICS in the core it uses dplyr_hash_map, so it's closer to what #850 proposes. Don't know about panda.

PooledDataArray optimization in #850 is not a big deal, it's just about avoiding setindex!() operation that is very slow. As I state there, it's more about maintaining left frame rows order and not doing unnecessary indexing for certain joins.

@alyst
Copy link
Contributor

alyst commented Aug 22, 2015

Throwing something more informative than current InexactError (also, for some reason, it has no stacktrace) would be a nice quickfix.
The problem is that even with UInt64 conversion the overflow would happen quite often even for medium-sized frames and reasonable number of join/group columns. For the user the workaround would be to concatenate "on"/"by" columns and avoid using multicolumn joins/groups, but that is rather ugly.

@matthieugomez
Copy link
Contributor Author

Ok. Your fix looks better. Could you add the groupby test from my pull request in yours to make sure this problem is not rediscovered at some point?
Out of curiosity, could you also add a timing for simple grouping operations?

alyst added a commit to alyst/DataFrames.jl that referenced this pull request Aug 25, 2015
alyst added a commit to alyst/DataFrames.jl that referenced this pull request Aug 26, 2015
alyst added a commit to alyst/DataFrames.jl that referenced this pull request Aug 26, 2015
alyst added a commit to alyst/DataFrames.jl that referenced this pull request Aug 30, 2015
alyst added a commit to alyst/DataFrames.jl that referenced this pull request Aug 30, 2015
alyst added a commit to alyst/DataFrames.jl that referenced this pull request Oct 3, 2015
alyst added a commit to alyst/DataFrames.jl that referenced this pull request Oct 26, 2015
alyst added a commit to alyst/DataFrames.jl that referenced this pull request Nov 13, 2015
alyst added a commit to alyst/DataFrames.jl that referenced this pull request Nov 13, 2015
alyst added a commit to alyst/DataFrames.jl that referenced this pull request Apr 13, 2016
alyst added a commit to alyst/DataFrames.jl that referenced this pull request Aug 21, 2016
alyst added a commit to alyst/DataFrames.jl that referenced this pull request Aug 31, 2016
alyst added a commit to alyst/DataFrames.jl that referenced this pull request Sep 3, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants