[BUG] tail method sometimes fail #2495

Closed
jangorecki opened this issue Aug 7, 2019 · 10 comments · Fixed by #2859
Labels: bug (Something isn't working), Python (Affects Python cuDF API.)


@jangorecki

jangorecki commented Aug 7, 2019

After running a query, I get a data frame, ans. The head method works fine on it, but the tail method fails. This happens rarely and depends strongly on the data. Using cuDF 0.8.0+0.g8fa7bd3.dirty.

>>> ans = x.groupby(['id1'],as_index=False).agg({'v1':'sum'}).reset_index(drop=True)
>>> print(ans.head(3), flush=True)
     id1        v1
0  id001  15006850
1  id002  14994166
>>> print(ans.tail(3), flush=True)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/dataframe.py", line 491, in __str__
    return self.to_string(nrows=nrows, ncols=ncols)
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/dataframe.py", line 480, in to_string
    cols[h] = self[h].values_to_string(nrows=nrows)
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/series.py", line 354, in values_to_string
    out = [str(v) for v in values]
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/series.py", line 354, in <listcomp>
    out = [str(v) for v in values]
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/series.py", line 302, in __getitem__
    return self._column.element_indexing(arg)
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/column.py", line 412, in element_indexing
    val = self.data[index]  # this can raise IndexError
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/buffer.py", line 149, in __getitem__
    return item.view(self.dtype)
AttributeError: 'NoneType' object has no attribute 'view'
>>> ans.dtypes
id1    object
v1      int64
dtype: object

I can provide a reproducible example, but it will not be minimal. The one provided in #2494 (comment) might work after changing K to 2.

@jangorecki added the Needs Triage (Need team to review and classify) and bug (Something isn't working) labels Aug 7, 2019
@kkraus14
Collaborator

@jangorecki this should be fixed in the latest nightlies. This was due to nulls being improperly handled as Python None objects as opposed to numpy scalars.
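
The failure pattern in the first traceback (calling .view on an element that turned out to be None) can be pictured with a small pure-Python sketch. This is not cuDF's code: FakeScalar, get_element_unsafe and get_element_safe are hypothetical names used only for illustration, and FakeScalar merely stands in for a NumPy scalar.

```python
class FakeScalar:
    """Hypothetical stand-in for a NumPy scalar; it only models .view()."""

    def __init__(self, value):
        self.value = value

    def view(self, dtype):
        # Reinterpret the stored value as the requested type
        return dtype(self.value)


def get_element_unsafe(data, index, dtype):
    # Same shape as the failing line: .view is called unconditionally
    item = data[index]
    return item.view(dtype)


def get_element_safe(data, index, dtype):
    # Check for a null element before calling any method on it
    item = data[index]
    if item is None:
        return None
    return item.view(dtype)


# A column whose middle element is a null represented as Python None
data = [FakeScalar(1), None, FakeScalar(3)]

try:
    get_element_unsafe(data, 1, int)
    error_message = ""
except AttributeError as exc:
    error_message = str(exc)

rendered = [str(get_element_safe(data, i, int)) for i in range(len(data))]
```

Here error_message holds the same AttributeError text as the traceback above, while the guarded version renders every element without raising.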

@jangorecki
Author

@kkraus14
I don't think the issue is fixed. In 0.8.0 it also raised a segfault.
After upgrading to 0.9.0 I have not hit the segfault so far, but printing the tail still raises an exception.
h2oai/db-benchmark#102

Traceback (most recent call last):
  File "./cudf/groupby-cudf.py", line 56, in <module>
    print(ans.tail(3), flush=True)
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/dataframe.py", line 553, in __str__
    return self.to_string()
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/dataframe.py", line 550, in to_string
    return self.__repr__()
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/dataframe.py", line 591, in __repr__
    output = self.get_renderable_dataframe()
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/dataframe.py", line 582, in get_renderable_dataframe
    output._cols[col].astype("str").str.fillna("null")
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/series.py", line 1383, in astype
    return self._copy_construct(data=self._column.astype(dtype, **kwargs))
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/columnops.py", line 137, in astype
    return self.as_string_column(dtype, **kwargs)
  File "/home/jan/anaconda3/envs/cudf/lib/python3.6/site-packages/cudf/dataframe/numerical.py", line 129, in as_string_column
    np.dtype(dev_array.dtype)
KeyError: dtype('O')

@kkraus14 kkraus14 reopened this Aug 26, 2019
@kkraus14 added the Python (Affects Python cuDF API.) label and removed the Needs Triage (Need team to review and classify) label Aug 26, 2019
@rgsl888prabhu rgsl888prabhu self-assigned this Sep 19, 2019
@rgsl888prabhu
Contributor

rgsl888prabhu commented Sep 19, 2019

@jangorecki Does it always fail on 0.9, or does it depend on the generated data?

@jangorecki
Author

@rgsl888prabhu It depends on the data: among 4 different cases of the cardinality factor ("K"), the issue manifests in only one. You can generate the exact data that causes the problem by following the initial instructions.

@rgsl888prabhu
Contributor

rgsl888prabhu commented Sep 20, 2019

@jangorecki I tried to reproduce this using 0.9, but I wasn't able to. If you have the .csv file that reproduces it, please share it. Meanwhile, I will try to figure out the issue and reproduce it on my end.

@jangorecki
Author

@rgsl888prabhu I have the csv, but it is 45 GB in size.
The csv was generated by a script, so it makes sense to run the script to produce the same csv rather than share a 45 GB file.

@rgsl888prabhu
Contributor

Do you remember the random seed that you set? I don't see it in the script.

@jangorecki
Author

There is a random seed set in the script:

wget https://raw.githubusercontent.com/h2oai/db-benchmark/master/groupby-datagen.R
Rscript groupby-datagen.R 1e9 2 0 0

@rgsl888prabhu
Contributor

Thank you @jangorecki, I am able to reproduce the scenario.

@rgsl888prabhu
Contributor

Simplified code to reproduce:

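import cudf
import numpy as np

# Two-row frame: one object (string) column and one integer column
id1 = cudf.Series(['a', 'b'], dtype=np.object)
v1 = cudf.Series([1, 2])
s = cudf.DataFrame()
s['id1'] = id1
s['v1'] = v1

# Request the last 3 rows of the 2-row frame and print them
print(s.tail(3))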
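
For reference, the reproducer asks for the last 3 rows of a 2-row frame. In pandas, tail(n) with n larger than the number of rows simply returns all rows. The plain-Python sketch below shows that expected behaviour on a list of rows; it is not cuDF's implementation and makes no claim about the root cause, and the tail helper and rows list are hypothetical names.

```python
def tail(rows, n=5):
    """Return the last n rows; all rows when n exceeds the row count."""
    if n < 0:
        raise ValueError("this sketch only handles n >= 0")
    if n == 0:
        # rows[-0:] would return everything, so handle zero explicitly
        return []
    # A negative slice start beyond the list length is clamped by Python
    return rows[-n:]


# Same shape as the reproducer: two rows, three requested
rows = [("a", 1), ("b", 2)]
last_three = tail(rows, 3)
last_one = tail(rows, 1)
```

With this data, last_three contains both rows and last_one contains only the final row.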
