Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FEA] Combine categories on merge #3496

Closed
jangorecki opened this issue Dec 1, 2019 · 10 comments · Fixed by #5023
Closed

[FEA] Combine categories on merge #3496

jangorecki opened this issue Dec 1, 2019 · 10 comments · Fixed by #5023
Assignees
Labels
feature request New feature or request Python Affects Python cuDF API.

Comments

@jangorecki
Copy link

I would expect to get same output for string and categorical columns used in join

import cudf as cu
df1 = cu.DataFrame({'id1':['a','b','c','b'], 'v1':[1,2,3,2]})
df2 = cu.DataFrame({'id1':['b','d'], 'v2':[2,4]})
ans1 = df1.merge(df2, on='id1')
df1['id1'] = df1['id1'].astype('category')
df2['id1'] = df2['id1'].astype('category')
ans2 = df1.merge(df2, on='id1')
ans1
#  id1  v1  v2
#0   b   2   2
#1   b   2   2
ans2
#  id1  v1  v2
#0   b   1   2
#1   d   2   4
#2   d   2   4
cu.__version__
#'0.10.0'
@jangorecki jangorecki added Needs Triage Need team to review and classify bug Something isn't working labels Dec 1, 2019
jangorecki added a commit to h2oai/db-benchmark that referenced this issue Dec 1, 2019
@kkraus14 kkraus14 added Python Affects Python cuDF API. and removed Needs Triage Need team to review and classify labels Jan 10, 2020
@kkraus14
Copy link
Collaborator

Able to repro on latest 0.12, investigating.

@kkraus14
Copy link
Collaborator

@brandon-b-miller I think your join implicit typecasting PR may take care of this? If not I think we should just do the minimal amount of work to throw an exception if the categories aren't synchronized for the time being and then wait for libcudf support for categorical columns.

@brandon-b-miller
Copy link
Contributor

@kkraus14 It doesn't yet, but shouldn't be hard to tweak so that it raises in this case, I agree it doesn't really make sense to merge two categorical columns that specify different categories.

@kkraus14
Copy link
Collaborator

I agree it doesn't really make sense to merge two categorical columns that specify different categories.

I actually think it does make sense to merge them, it's just that with categorical columns being implemented in libcudf we shouldn't spend the effort which will just get ripped out in the near future anyway.

@brandon-b-miller
Copy link
Contributor

I think there's only a way of making sense of it in strictly the unordered case. In the ordered case we'd have to somehow determine the ordering for the superset containing all the possible categories.

@kkraus14
Copy link
Collaborator

I think there's only a way of making sense of it in strictly the unordered case. In the ordered case we'd have to somehow determine the ordering for the superset containing all the possible categories.

Yea, good point. Pandas implicitly converts categoricals to object dtype when trying to run a merge against them so we have flexibility to implement our own logic here. Alternatively we do have the option of materializing the categorical encoded column to the dictionary's dtype, but that feels very unnecessary and expensive.

@kkraus14
Copy link
Collaborator

@brandon-b-miller is this still an issue after the porting?

@brandon-b-miller
Copy link
Contributor

This should raise for the any code that includes the implicit typecasting.

TypeError: Left and right categories must be the same.

We should be seeing that instead of incorrect results.

@jangorecki
Copy link
Author

Error would be definitely better than current expected results. Yet I think it make sense to keep this issue as FR to support categorical variables (when their categories does not match). Join might be actually faster, especially when there are very few categories and merging categories would be cheap.

@kkraus14 kkraus14 added feature request New feature or request and removed bug Something isn't working labels Mar 18, 2020
@kkraus14 kkraus14 changed the title [BUG] merge on categorical column incorrect [FEA] Combine categories on merge Mar 18, 2020
@kkraus14
Copy link
Collaborator

Error would be definitely better than current expected results. Yet I think it make sense to keep this issue as FR to support categorical variables (when their categories does not match). Join might be actually faster, especially when there are very few categories and merging categories would be cheap.

issue updated to reflect the feature request

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature request New feature or request Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants