[BUG] JSON reader fails to parse files with empty rows #5712

Open
vuule opened this issue Jul 17, 2020 · 12 comments
Labels
0 - Backlog In queue waiting for assignment bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS

Comments

@vuule
Contributor

vuule commented Jul 17, 2020

The following test fails in both cases:

@pytest.mark.parametrize(
    "buffer",
    [
        "[ ]\n[ ]",
        '{ }\n{ }',
    ],
)
def test_json_empty(buffer):
    cu_df = cudf.read_json(buffer, lines=True)
    pd_df = pd.read_json(buffer, lines=True)

    np.testing.assert_array_equal(pd_df.dtypes, cu_df.dtypes)

With array rows, the reader creates a table with an int8 column instead of an empty table.
With object rows, a CUDA error occurs.

@vuule vuule added bug Something isn't working Needs Triage Need team to review and classify cuIO cuIO issue labels Jul 17, 2020
@kkraus14 kkraus14 removed the Needs Triage Need team to review and classify label Aug 5, 2020
@github-actions

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

@GregoryKimball GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. and removed inactive-90d labels Apr 2, 2023
@dagardner-nv
Contributor

dagardner-nv commented Jun 2, 2023

I can't reproduce this. However, with cudf 23.02.00 I have a similar issue: cudf fails to parse an empty JSON array '[]' and raises an re.error exception regardless of the engine value, while an empty list with a blank space in the body, '[ ]', parses fine:

import pandas as pd
import cudf

print(cudf.__version__)
print(pd.__version__)

try:
    cudf.read_json('[]', lines=False, engine='cudf')
except Exception as e:
    print(e)

try:
    cudf.read_json('[]', lines=False, engine='pandas')
except Exception as e:
    print(e)

# Pandas works fine
print(pd.read_json('[]', lines=False))

# Blank space works fine
print(cudf.read_json('[ ]', lines=False, engine='cudf'))

output:

23.02.00
1.3.5
unterminated character set at position 31
/home/dagardner/work/conda/envs/morpheus/lib/python3.10/site-packages/cudf/io/json.py:121: UserWarning: Using CPU via Pandas to read JSON dataset, this may be GPU accelerated in the future
  warnings.warn(
unterminated character set at position 31
Empty DataFrame
Columns: []
Index: []
Empty DataFrame
Columns: []
Index: []

@GregoryKimball
Contributor

Thank you @dagardner-nv for sharing this. This is an issue we had missed in the overall JSON prioritization and it seems to impact our new nested JSON reader.

@karthikeyann
Contributor

This error no longer happens in the new nested JSON reader.
The only difference I see is the index: pandas returns Index: [0, 1], while cudf returns Index: [].

Question: is it possible to have empty column with non-empty indices in cudf Dataframe?

@vuule
Contributor Author

vuule commented Sep 18, 2023

Question: is it possible to have empty column with non-empty indices in cudf Dataframe?

@galipremsagar

@galipremsagar
Contributor

Question: is it possible to have empty column with non-empty indices in cudf Dataframe?

@galipremsagar

Yup, it's possible to have a non-empty index when no columns exist:

In [4]: cudf.DataFrame()
Out[4]: 
Empty DataFrame
Columns: []
Index: []

In [5]: cudf.DataFrame(index=[1, 2, 3])
Out[5]: 
Empty DataFrame
Columns: []
Index: [1, 2, 3]

But if a column exists alongside a non-empty index, the column will have the same length as the index:

In [6]: cudf.DataFrame(index=[1, 2, 3], columns=['a', 'b'])
Out[6]: 
      a     b
1  <NA>  <NA>
2  <NA>  <NA>
3  <NA>  <NA>

@revans2 revans2 added the Spark Functionality that helps Spark RAPIDS label Mar 14, 2024
@revans2
Contributor

revans2 commented Mar 14, 2024

Spark has a similar issue, but it is a bit different. If we are trying to read a JSON Lines file with something like

{}
{}

in it, then we get back a TableWithMeta that is empty, but we have no idea how many rows there were in the original input. This is even more problematic if we have something like

{}

{}

where empty lines are supposed to be filtered out. All I need is the number of rows that were read.
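
A minimal sketch of the row count described above, assuming plain Python and the standard json module rather than libcudf or the Spark plugin. The function name count_json_lines_rows is hypothetical. It treats whitespace-only lines as filtered out and counts every remaining line as one row, even when the object has no keys.

```python
import json


def count_json_lines_rows(text: str) -> int:
    """Count rows in JSON Lines text, skipping whitespace-only lines.

    Hypothetical helper for illustration; not part of cudf or Spark.
    """
    rows = 0
    for line in text.splitlines():
        if not line.strip():
            continue  # blank line: filtered out, not a row
        json.loads(line)  # raises ValueError if the line is not valid JSON
        rows += 1
    return rows


print(count_json_lines_rows("{}\n{}"))      # two empty objects -> 2
print(count_json_lines_rows("{}\n\n{}\n"))  # blank line is skipped -> 2
```

Both inputs yield two rows even though neither produces any columns, which is the information an empty table loses.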

rapids-bot bot pushed a commit that referenced this issue Mar 15, 2024
… API for missing information (#15307)

CUDF cannot create a table with rows and no columns, but that is exactly what we need in order to read some JSON input. This adds a new API that lets us work around the problem when we know how many rows to expect. It is not an ideal solution, so it is not a general fix for #5712, but it is a stopgap for cases where the expected row count is known.

Authors:
  - Robert (Bobby) Evans (https://github.com/revans2)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - MithunR (https://github.com/mythrocks)

URL: #15307
@vyasr
Contributor

vyasr commented May 16, 2024

I think all of the cases that previously caused problems now work as expected.

In [1]: import cudf
   ...: import pandas as pd
   ...: import numpy as np
   ...:
   ...: buffers = ["[ ]\n[ ]", '{ }\n{ }']
   ...: for buffer in buffers:
   ...:     cu_df = cudf.read_json(buffer, lines=True)
   ...:     pd_df = pd.read_json(buffer, lines=True)
   ...:     np.testing.assert_array_equal(pd_df.dtypes, cu_df.dtypes)
   ...:
   ...:
   ...: print(cudf.read_json('[]', lines=False, engine='cudf'))
   ...: print(cudf.read_json('[]', lines=False, engine='pandas'))
   ...: print(pd.read_json('[]', lines=False))
   ...: print(cudf.read_json('[ ]', lines=False, engine='cudf'))
/home/coder/cudf/python/cudf/cudf/utils/ioutils.py:1745: FutureWarning: Passing literal json to read_json is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
  warnings.warn(
<ipython-input-1-1c788c1a77cf>:8: FutureWarning: Passing literal json to 'read_json' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
  pd_df = pd.read_json(buffer, lines=True)
/home/coder/cudf/python/cudf/cudf/utils/ioutils.py:1733: FutureWarning: Passing literal json to read_json is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
  warnings.warn(
<ipython-input-1-1c788c1a77cf>:8: FutureWarning: Passing literal json to 'read_json' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
  pd_df = pd.read_json(buffer, lines=True)
/home/coder/cudf/python/cudf/cudf/utils/ioutils.py:1745: FutureWarning: Passing literal json to read_json is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
  warnings.warn(
Empty DataFrame
Columns: []
Index: []
/home/coder/cudf/python/cudf/cudf/io/json.py:108: UserWarning: Using CPU via Pandas to read JSON dataset, this may be GPU accelerated in the future
  warnings.warn(
/home/coder/cudf/python/cudf/cudf/io/json.py:130: FutureWarning: Passing literal json to 'read_json' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
  pd_value = pd.read_json(
Empty DataFrame
Columns: []
Index: []
<ipython-input-1-1c788c1a77cf>:14: FutureWarning: Passing literal json to 'read_json' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
  print(pd.read_json('[]', lines=False))
Empty DataFrame
Columns: []
Index: []
/home/coder/cudf/python/cudf/cudf/utils/ioutils.py:1745: FutureWarning: Passing literal json to read_json is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
  warnings.warn(
Empty DataFrame
Columns: []
Index: []

I'm going to close for now, but feel free to reopen if I missed something.

@vyasr vyasr closed this as completed May 16, 2024
rapids-bot bot pushed a commit to nv-morpheus/Morpheus that referenced this issue May 30, 2024
…lientSourceStage` stages (#1705)

* Add a new constructor argument to `HttpServerSourceStage` & `HttpClientSourceStage` called `payload_to_df_fn`, allowing users to specify a custom payload parser.
* Remove work-around for rapidsai/cudf#5712; this bug is fixed in our current version of cudf.
* Relocate updated tests to `tests/stages`

Closes #1703

## By Submitting this PR I confirm:
- I am familiar with the [Contributing Guidelines](https://github.com/nv-morpheus/Morpheus/blob/main/docs/source/developer_guide/contributing.md).
- When the PR is ready for review, new or existing tests cover these changes.
- When the PR is ready for review, the documentation is up to date with these changes.

Authors:
  - David Gardner (https://github.com/dagardner-nv)

Approvers:
  - Michael Demoret (https://github.com/mdemoret-nv)

URL: #1705
@revans2
Contributor

revans2 commented Sep 9, 2024

This is still happening. I put a workaround into the Java API, but it is not a good solution to the problem.

@revans2 revans2 reopened this Sep 9, 2024
@karthikeyann
Contributor

@galipremsagar What information does the Cython/Python layer need from libcudf read_json so that it can create index columns even when the result table has no columns?

@GregoryKimball GregoryKimball moved this to Needs owner in libcudf Nov 12, 2024
@ttnghia
Contributor

ttnghia commented Nov 15, 2024

The input in the example above actually does not contain "empty rows". They are "empty JSON objects/arrays" instead. Empty rows should mean only whitespace, like " "\n" ".

With recent changes in JSON reader, I believe that this should be fixed (at least for the case of empty JSON objects). Please confirm again.
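
To make the terminology concrete, here is a small sketch that labels each line of a JSON Lines buffer. It uses only Python's standard json module, and classify_line is a hypothetical helper, not anything from cudf.

```python
import json


def classify_line(line: str) -> str:
    """Label one JSON Lines line (hypothetical helper, illustration only)."""
    if not line.strip():
        return "empty row"  # only whitespace, no JSON value at all
    value = json.loads(line)
    if isinstance(value, dict) and not value:
        return "empty object"
    if isinstance(value, list) and not value:
        return "empty array"
    return "other value"


for line in ["{ }", "[ ]", " ", '{"a": 1}']:
    print(repr(line), "->", classify_line(line))
```

Under this reading, the buffers in the original test contain empty objects and empty arrays, not empty rows.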

@karthikeyann
Copy link
Contributor

It's fixed for Spark because passing a schema with prune columns creates null columns (sized to the number of rows).
For Python it's still an issue: the number of rows is not returned for an empty table.
One option is to return the number of rows as metadata, num_rows_per_source.
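
A rough sketch of that option, written as a toy pure-Python reader rather than libcudf code. Only the name num_rows_per_source comes from the comment above; ToyTable and toy_read_json_lines are hypothetical, and the sketch assumes every non-blank line is a JSON object.

```python
import json
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy stand-in for a reader result; not a libcudf type."""
    columns: dict = field(default_factory=dict)
    num_rows_per_source: list = field(default_factory=list)


def toy_read_json_lines(sources):
    """Parse JSON Lines buffers of objects, recording rows per source."""
    table = ToyTable()
    total = 0  # rows seen so far across all sources
    for text in sources:
        rows = [json.loads(line) for line in text.splitlines() if line.strip()]
        for row in rows:
            for key in row:
                # Backfill a newly seen column with None for earlier rows.
                table.columns.setdefault(key, [None] * total)
            for key, values in table.columns.items():
                values.append(row.get(key))
            total += 1
        table.num_rows_per_source.append(len(rows))
    return table


table = toy_read_json_lines(["{}\n{}", "{}\n\n{}\n{}"])
index = list(range(sum(table.num_rows_per_source)))
print(table.columns)              # {} : no columns were found
print(table.num_rows_per_source)  # [2, 3] : row counts survive anyway
print(index)                      # [0, 1, 2, 3, 4] : index the caller could rebuild
```

With the per-source counts carried as metadata, a caller can rebuild a non-empty index even when the column set is empty.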
