unexpected results from aggregation query #848

Closed
jangorecki opened this issue Feb 2, 2019 · 5 comments
Labels: bug (Something isn't working), Python (Affects Python cuDF API.)

Comments

@jangorecki

jangorecki commented Feb 2, 2019

Sorry for the lack of a minimal reproducible example, but issues like this are quite time-consuming to track down. To produce the data, use R:

wget https://raw.githubusercontent.com/h2oai/db-benchmark/master/groupby-datagen.R
Rscript groupby-datagen.R 1e7 1e2 0 0

On the G1_1e7_1e2_0_0.csv file generated by that script, pandas returns the following:

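import pandas as pd
x = pd.read_csv('G1_1e7_1e2_0_0.csv')  # assumed load step; not shown in the original report
ans = x.groupby(['id6']).agg({'v1':'sum', 'v2':'sum', 'v3':'sum'})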
print(ans.shape, flush=True)
#(100000, 3)
print(ans.head(3), flush=True)
#      v1   v2         v3
#id6                     
#1    266  256  4503.0951
#2    334  316  5294.6590
#3    276  283  5166.5957

while cudf 0.5 installed from pip returns the following:

ans = x.groupby(['id6']).agg({'v1':'sum', 'v2':'sum', 'v3':'sum'})
print(ans.shape, flush=True)
#(99998, 4)
print(ans.head(3), flush=True)
#   id6 sum_v1 sum_v2   sum_v3
#0    1    809    801 13638.96
#1    2    334    316  5294.66
#2    3    276    283 5166.595

The shape of the answer already doesn't match. There are no NA/NaN values in that data.
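
One way to narrow down a shape mismatch like this is to compare the group keys of the two results directly. The sketch below assumes both results have been brought into pandas with `id6` as a regular column (for the pandas result that would mean `reset_index()`, for cudf `to_pandas()`). The helper name `diff_group_keys` and the toy frames are illustrative and are not the benchmark data.

```python
import pandas as pd

def diff_group_keys(expected, actual, key="id6"):
    """Return keys missing from `actual` and keys not present in `expected`."""
    expected_keys = set(expected[key])
    actual_keys = set(actual[key])
    missing = sorted(expected_keys - actual_keys)
    unexpected = sorted(actual_keys - expected_keys)
    return missing, unexpected

# Toy stand-ins for the two aggregation results (not the benchmark data).
expected = pd.DataFrame({"id6": [1, 2, 3, 4], "v1": [266, 334, 276, 301]})
actual = pd.DataFrame({"id6": [1, 2, 3], "v1": [809, 334, 276]})

missing, unexpected = diff_group_keys(expected, actual)
print("missing groups:", missing)
print("unexpected groups:", unexpected)
```

If groups are missing on one side while another group's sums are too large, that would suggest rows are being merged into the wrong group rather than dropped.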

jangorecki added the Needs Triage and bug labels on Feb 2, 2019
@kkraus14
Collaborator

I'd guess that this is a CSV reader issue. Have you checked that the pandas dataframe matches the cudf dataframe after the read_csv call?
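
A check along these lines could be sketched as follows. It assumes the cudf dataframe has already been converted with `to_pandas()`, so the comparison itself runs on two pandas DataFrames. The helper name `compare_frames`, the tolerance, and the toy frames are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

def compare_frames(left, right, rtol=1e-5):
    """Report per-column differences between two pandas DataFrames."""
    report = {}
    if left.shape != right.shape:
        report["shape"] = (left.shape, right.shape)
        return report
    for col in left.columns:
        if col not in right.columns:
            report[col] = "missing in right"
            continue
        a, b = left[col].to_numpy(), right[col].to_numpy()
        if np.issubdtype(a.dtype, np.number) and np.issubdtype(b.dtype, np.number):
            # Tolerant comparison so float32 vs float64 storage is not flagged.
            same = np.allclose(a, b, rtol=rtol)
        else:
            same = np.array_equal(a, b)
        if not same:
            report[col] = "values differ"
    return report

# Toy data shaped like the first rows quoted in this thread.
pdf = pd.DataFrame({"id6": [59276, 78315, 27300], "v3": [96.8126, 83.5654, 44.6464]})
as_float32 = pdf.astype({"v3": "float32"})      # same values, narrower dtype
corrupted = pdf.assign(id6=[59276, 78315, 0])   # one key read incorrectly

print(compare_frames(pdf, as_float32))
print(compare_frames(pdf, corrupted))
```

An empty report means the two reads agree within the tolerance; a non-empty one names the columns worth inspecting first.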

kkraus14 added the Python and cuIO labels and removed the Needs Triage label on Feb 13, 2019
@jangorecki
Author

jangorecki commented Feb 14, 2019

@kkraus14 possible, but from what I was able to check, it looks fine.
The most important check, counting the number of groups in id6, raised an error.
The difference in sum(v3) is probably because pandas stores the column as float64 while cudf stores it as float32.
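
The float32 explanation can be illustrated with a minimal NumPy sketch. It only shows how coarse float32 is at the magnitude of the total `v3` sum quoted in this thread; it does not reproduce how cudf accumulates or prints the value.

```python
import numpy as np

total = 506570021.4995007  # pandas float64 sum of v3 quoted in this thread

as_float32 = np.float32(total)
ulp = np.spacing(as_float32)  # gap to the next representable float32

print("float64 value:", total)
print("nearest float32:", float(as_float32))
print("float32 spacing at this magnitude:", float(ulp))
```

At this magnitude, neighbouring float32 values are tens of units apart, so a difference of that order between the two totals is consistent with the dtype alone.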

  • head of data
# pandas
>>> print(x[['id6','v1','v2','v3']].head())
     id6  v1  v2       v3
0  59276   1   1  96.8126
1  78315   4   1  83.5654
2  27300   4   5  44.6464
3  65416   2   3  29.9499
4  19046   4   3  51.4899

# cudf
>>> print(x[['id6','v1','v2','v3']].head())
    id6   v1   v2        v3
0 59276    1    1  96.81261
1 78315    4    1  83.56539
2 27300    4    5   44.6464
3 65416    2    3 29.949902
4 19046    4    3 51.489902
  • columns to aggregate
# pandas
>>> x['v1'].sum()
30001016
>>> x['v2'].sum()
29998058
>>> x['v3'].sum()
506570021.4995007

# cudf
>>> x['v1'].sum()
30001016
>>> x['v2'].sum()
29998058
>>> x['v3'].sum()
506570050.0
  • columns to group by
# pandas
>>> x['id6'].nunique()
100000

# cudf
>>> x['id6'].unique_count()
Traceback (most recent call last):
  File "/home/jan/git/db-benchmark/cudf/py-cudf/lib/python3.6/site-packages/numba/cuda/cudadrv/nvvm.py", line 111, in __new__
    inst.driver = open_cudalib('nvvm', ccc=True)
  File "/home/jan/git/db-benchmark/cudf/py-cudf/lib/python3.6/site-packages/numba/cuda/cudadrv/libs.py", line 48, in open_cudalib
    raise OSError('library %s not found' % lib)
OSError: library nvvm not found
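
The traceback says that Numba could not locate the NVVM library, which normally ships with the CUDA toolkit. A minimal sketch for checking whether the library is present, assuming the conventional toolkit layout `<cuda root>/nvvm/lib64/`, is shown below. The helper name `find_libnvvm` and the candidate roots are assumptions; how Numba is then pointed at the library depends on the Numba version (typically an environment variable such as `NUMBAPRO_NVVM` in older releases or `CUDA_HOME` in newer ones).

```python
import glob
import os

def find_libnvvm(cuda_roots):
    """Return paths of libnvvm shared libraries found under the given roots."""
    hits = []
    for root in cuda_roots:
        # Conventional toolkit layout: <cuda root>/nvvm/lib64/libnvvm.so*
        pattern = os.path.join(root, "nvvm", "lib64", "libnvvm.so*")
        hits.extend(sorted(glob.glob(pattern)))
    return hits

# Candidate roots are assumptions; adjust to the local installation.
candidates = [os.environ.get("CUDA_HOME", ""), "/usr/local/cuda"]
found = find_libnvvm([c for c in candidates if c])
print(found if found else "libnvvm not found under the candidate roots")
```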

@mjsamoht
Contributor

mjsamoht commented Apr 2, 2019

This doesn't look like a CSV reader issue. I'm removing the cuIO label, but let me know if someone from the cuIO team needs to take a look.

mjsamoht removed the cuIO label on Apr 2, 2019
randerzander self-assigned this on Apr 9, 2019
@VibhuJawa
Member

This seems to be resolved in 0.6.1 and 0.7. Reproduction gist:
https://gist.github.com/VibhuJawa/d3354dca525e0699d33aa0e286d2a717

@randerzander
Contributor

Closing per @VibhuJawa's tests. @jangorecki please re-open if there's still a problem for you.


5 participants