[BUG] JSON reader fails to parse files with empty rows #5712
Comments
This issue has been labeled |
I can't reproduce this. However, with cudf 23.02.00 I have a similar issue where cudf fails to parse an empty JSON array:
import pandas as pd
import cudf
print(cudf.__version__)
print(pd.__version__)
try:
    cudf.read_json('[]', lines=False, engine='cudf')
except Exception as e:
    print(e)
try:
    cudf.read_json('[]', lines=False, engine='pandas')
except Exception as e:
    print(e)
# Pandas works fine
print(pd.read_json('[]', lines=False))
# Blank space works fine
print(cudf.read_json('[ ]', lines=False, engine='cudf'))
output:
Thank you @dagardner-nv for sharing this. This is an issue we had missed in the overall JSON prioritization, and it seems to impact our new nested JSON reader.
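For reference, the two inputs in the reproduction above (`'[]'` and `'[ ]'`) are the same JSON document, since whitespace between the brackets is insignificant. A minimal check using only Python's standard `json` module (not cudf) illustrates why a reader would be expected to treat them identically:

```python
import json

# Both strings encode an array with no elements; the inner
# whitespace in the second one is insignificant in JSON.
compact = json.loads('[]')
spaced = json.loads('[ ]')

same = compact == spaced == []
print(same)
```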
This error no longer happens in the new nested JSON reader. Question: is it possible to have an empty column with a non-empty index in a cudf DataFrame?
Yup, it's possible to have a non-empty index when no columns exist:
But if a column exists and a non-empty index exists, the column will be the length of the index:
In [6]: cudf.DataFrame(index=[1, 2, 3], columns=['a', 'b'])
Out[6]:
      a     b
1  <NA>  <NA>
2  <NA>  <NA>
3  <NA>  <NA>
Spark has a similar issue, but it is a bit different. If we are trying to read a JSON lines file with something like
in it, then we get back a TableWithMeta that is empty, but we have no idea how many rows there were in the original input. This is even more problematic if we have something like
where empty lines are supposed to be filtered out. All I need is the number of rows that were read.
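To make the request concrete, here is a minimal plain-Python sketch of a JSON Lines reader that reports the row count separately from the columns. It is only a model of the problem, not libcudf or the Spark plugin; the function name `read_json_lines` is hypothetical, and it assumes every non-blank line is a JSON object.

```python
import json

def read_json_lines(text):
    """Hypothetical helper: parse JSON Lines, return (columns, num_rows)."""
    columns = {}
    num_rows = 0
    for line in text.splitlines():
        if not line.strip():
            continue  # whitespace-only lines are filtered out, not counted
        record = json.loads(line)  # assumed to be a JSON object
        # A key seen for the first time gets nulls for all earlier rows.
        for key in record:
            if key not in columns:
                columns[key] = [None] * num_rows
        # Every known column gets one value (or null) for this row.
        for key, values in columns.items():
            values.append(record.get(key))
        num_rows += 1
    return columns, num_rows

# Three empty objects and one blank line: no columns, but three rows.
cols, n = read_json_lines('{}\n{}\n\n{}\n')
print(cols, n)
```

With input made only of empty objects the column mapping is empty, so the row count survives only because it is returned on its own.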
… API for missing information (#15307)
CUDF cannot create a table with rows and no columns, but that is exactly what we need to be able to read some JSON input. So this adds a new API that lets us work around this problem if we know how many rows to expect. This is not an ideal solution, so it is not a fix for #5712 generically. But it is a stopgap, especially for cases when we know how many rows to expect.
Authors: - Robert (Bobby) Evans (https://github.com/revans2)
Approvers: - Nghia Truong (https://github.com/ttnghia) - MithunR (https://github.com/mythrocks)
URL: #15307
I think all of the cases that previously caused problems now work as expected.
I'm going to close for now, but feel free to reopen if I missed something.
…lientSourceStage` stages (#1705)
* Add a new constructor argument to `HttpServerSourceStage` & `HttpClientSourceStage` called `payload_to_df_fn`, allowing users to specify a custom payload parser.
* Remove the workaround for rapidsai/cudf#5712; this bug is fixed in our current version of cudf.
* Relocate updated tests to `tests/stages`
Closes #1703
## By Submitting this PR I confirm:
- I am familiar with the [Contributing Guidelines](https://github.com/nv-morpheus/Morpheus/blob/main/docs/source/developer_guide/contributing.md).
- When the PR is ready for review, new or existing tests cover these changes.
- When the PR is ready for review, the documentation is up to date with these changes.
Authors: - David Gardner (https://github.com/dagardner-nv)
Approvers: - Michael Demoret (https://github.com/mdemoret-nv)
URL: #1705
This is still happening. I put a workaround into the Java API, but it is not a good solution to the problem.
@galipremsagar What information does the Cython/Python layer need from libcudf?
The input in the example above actually does not contain "empty rows". They are "empty JSON objects/arrays" instead. Empty rows should mean only whitespace, like
With recent changes in the JSON reader, I believe that this should be fixed (at least for the case of empty JSON objects). Please confirm again.
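The terminology distinction above can be sketched with a small classifier. This is plain Python using the standard `json` module, not the cudf reader, and `classify_line` is a hypothetical name used only for illustration:

```python
import json

def classify_line(line):
    """Hypothetical helper: label one line of JSON Lines input."""
    if not line.strip():
        return "empty row"        # whitespace only, no JSON value at all
    value = json.loads(line)
    if value == {}:
        return "empty object"     # a row that exists but has no fields
    if value == []:
        return "empty array"      # a row that exists but has no elements
    return "non-empty record"

for sample in ["   ", "{}", "[ ]", '{"a": 1}']:
    print(repr(sample), "->", classify_line(sample))
```

Only the first kind is a candidate for being filtered out; empty objects and arrays still count as rows.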
It's fixed for Spark because passing a schema with column pruning enabled will create null columns (whose size is the number of rows).
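As a rough illustration of why a schema sidesteps the problem, the plain-Python sketch below projects parsed records onto a requested schema. It only models the idea and is not the libcudf or Spark API; `apply_schema` and its parameters are assumptions made for this example.

```python
def apply_schema(records, schema):
    """Hypothetical helper: keep only schema columns, null-filling gaps."""
    # Each requested column gets exactly one entry per record,
    # so its length always equals the number of rows read.
    return {name: [rec.get(name) for rec in records] for name in schema}

# Three empty objects, one requested column: a null column of length 3.
table = apply_schema([{}, {}, {}], ["a"])
print(table)
```

Because every requested column is materialized, the row count can be recovered from any column's length even when the input objects are all empty.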
The following test fails in both cases:
With array rows, the reader creates a table with an int8 column instead of an empty table.
With object rows, a CUDA error happens.