[BUG] JSON reader fails to parse files with empty rows #5712

vuule · 2020-07-17T00:59:06Z

The following test fails in both cases:

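import cudf
import numpy as np
import pandas as pd
import pytest


# Each buffer is a JSON Lines input whose rows are empty arrays or empty objects.
@pytest.mark.parametrize(
    "buffer",
    [
        "[ ]\n[ ]",
        "{ }\n{ }",
    ],
)
def test_json_empty(buffer):
    cu_df = cudf.read_json(buffer, lines=True)
    pd_df = pd.read_json(buffer, lines=True)

    # Both readers should produce the same (empty) set of column dtypes.
    np.testing.assert_array_equal(pd_df.dtypes, cu_df.dtypes)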

With array rows, the reader creates a table with an int8 column instead of an empty table.
With object rows, a CUDA error occurs.


github-actions · 2021-03-14T19:12:33Z

This issue has been labeled inactive-90d due to no recent activity in the past 90 days. Please close this issue if no further response or action is needed. Otherwise, please respond with a comment indicating any updates or changes to the original issue and/or confirm this issue still needs to be addressed.

dagardner-nv · 2023-06-02T19:53:31Z

I can't reproduce this. However, with cudf 23.02.00 I have a similar issue: cudf fails to parse an empty JSON array '[]' and raises an re.error exception regardless of the engine value, while an empty list with a blank space in the body, '[ ]', parses fine:

import pandas as pd
import cudf

print(cudf.__version__)
print(pd.__version__)

try:
    cudf.read_json('[]', lines=False, engine='cudf')
except Exception as e:
    print(e)

try:
    cudf.read_json('[]', lines=False, engine='pandas')
except Exception as e:
    print(e)

# Pandas works fine
print(pd.read_json('[]', lines=False))

# Blank space works fine
print(cudf.read_json('[ ]', lines=False, engine='cudf'))

output:

23.02.00
1.3.5
unterminated character set at position 31
/home/dagardner/work/conda/envs/morpheus/lib/python3.10/site-packages/cudf/io/json.py:121: UserWarning: Using CPU via Pandas to read JSON dataset, this may be GPU accelerated in the future
  warnings.warn(
unterminated character set at position 31
Empty DataFrame
Columns: []
Index: []
Empty DataFrame
Columns: []
Index: []

GregoryKimball · 2023-06-05T19:22:19Z

Thank you @dagardner-nv for sharing this. This is an issue we had missed in the overall JSON prioritization and it seems to impact our new nested JSON reader.

karthikeyann · 2023-09-12T11:50:15Z

This error does not happen in new nested JSON reader anymore.
The only difference I see is the index: pandas returns Index: [0, 1], while cudf returns Index: [].

Question: is it possible to have empty column with non-empty indices in cudf Dataframe?

vuule · 2023-09-18T16:20:35Z

Question: is it possible to have empty column with non-empty indices in cudf Dataframe?

@galipremsagar

galipremsagar · 2023-09-18T17:02:49Z

Question: is it possible to have empty column with non-empty indices in cudf Dataframe?

@galipremsagar

Yup, it's possible to have a non-empty index when no columns exist:

In [4]: cudf.DataFrame()
Out[4]: 
Empty DataFrame
Columns: []
Index: []

In [5]: cudf.DataFrame(index=[1, 2, 3])
Out[5]: 
Empty DataFrame
Columns: []
Index: [1, 2, 3]

But if a column exists and the index is non-empty, the column will have the same length as the index:

In [6]: cudf.DataFrame(index=[1, 2, 3], columns=['a', 'b'])
Out[6]: 
      a     b
1  <NA>  <NA>
2  <NA>  <NA>
3  <NA>  <NA>

revans2 · 2024-03-14T18:18:15Z

Spark has a similar, though slightly different, issue. If we try to read a JSON Lines file with something like

{}
{}

in it, then we get back a TableWithMeta that is empty, but we have no idea how many rows there were in the original input. This is even more problematic if we have something like

{}

{}

where empty lines are supposed to be filtered out. All I need is the number of rows that were read.
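To make the requirement above concrete, here is a minimal sketch in plain Python of the row count being asked for: the number of JSON Lines records with whitespace-only lines filtered out. The helper name `count_jsonl_rows` is hypothetical and is not part of cudf or the Spark plugin; it only illustrates the expected count.

```python
def count_jsonl_rows(text: str) -> int:
    """Count JSON Lines records, skipping whitespace-only lines.

    Hypothetical helper for illustration; not a cudf or Spark API.
    """
    # Split on "\n" only, so unusual separators inside JSON strings are left alone.
    return sum(1 for line in text.split("\n") if line.strip())


print(count_jsonl_rows("{}\n{}"))    # two records
print(count_jsonl_rows("{}\n\n{}"))  # the blank line is filtered out
```

Both inputs from the comment above yield the same count, which is exactly the information an empty, column-less table cannot carry on its own.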

… API for missing information (#15307)
CUDF cannot create a table with rows and no columns, but that is exactly what we need to be able to read some JSON input. So this adds a new API that lets us work around this problem if we know how many rows to expect. This is not an ideal solution, so it is not a fix for #5712 generically. But it is a stopgap, especially for cases when we know how many rows to expect.
Authors: - Robert (Bobby) Evans (https://github.com/revans2)
Approvers: - Nghia Truong (https://github.com/ttnghia) - MithunR (https://github.com/mythrocks)
URL: #15307

vyasr · 2024-05-16T19:20:58Z

I think all of the cases that previously caused problems now work as expected.

In [1]: import cudf
   ...: import pandas as pd
   ...: import numpy as np
   ...:
   ...: buffers = ["[ ]\n[ ]", '{ }\n{ }']
   ...: for buffer in buffers:
   ...:     cu_df = cudf.read_json(buffer, lines=True)
   ...:     pd_df = pd.read_json(buffer, lines=True)
   ...:     np.testing.assert_array_equal(pd_df.dtypes, cu_df.dtypes)
   ...:
   ...:
   ...: print(cudf.read_json('[]', lines=False, engine='cudf'))
   ...: print(cudf.read_json('[]', lines=False, engine='pandas'))
   ...: print(pd.read_json('[]', lines=False))
   ...: print(cudf.read_json('[ ]', lines=False, engine='cudf'))
/home/coder/cudf/python/cudf/cudf/utils/ioutils.py:1745: FutureWarning: Passing literal json to read_json is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
  warnings.warn(
<ipython-input-1-1c788c1a77cf>:8: FutureWarning: Passing literal json to 'read_json' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
  pd_df = pd.read_json(buffer, lines=True)
/home/coder/cudf/python/cudf/cudf/utils/ioutils.py:1733: FutureWarning: Passing literal json to read_json is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
  warnings.warn(
<ipython-input-1-1c788c1a77cf>:8: FutureWarning: Passing literal json to 'read_json' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
  pd_df = pd.read_json(buffer, lines=True)
/home/coder/cudf/python/cudf/cudf/utils/ioutils.py:1745: FutureWarning: Passing literal json to read_json is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
  warnings.warn(
Empty DataFrame
Columns: []
Index: []
/home/coder/cudf/python/cudf/cudf/io/json.py:108: UserWarning: Using CPU via Pandas to read JSON dataset, this may be GPU accelerated in the future
  warnings.warn(
/home/coder/cudf/python/cudf/cudf/io/json.py:130: FutureWarning: Passing literal json to 'read_json' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
  pd_value = pd.read_json(
Empty DataFrame
Columns: []
Index: []
<ipython-input-1-1c788c1a77cf>:14: FutureWarning: Passing literal json to 'read_json' is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
  print(pd.read_json('[]', lines=False))
Empty DataFrame
Columns: []
Index: []
/home/coder/cudf/python/cudf/cudf/utils/ioutils.py:1745: FutureWarning: Passing literal json to read_json is deprecated and will be removed in a future version. To read from a literal string, wrap it in a 'StringIO' object.
  warnings.warn(
Empty DataFrame
Columns: []
Index: []

I'm going to close for now, but feel free to reopen if I missed something.

…lientSourceStage` stages (#1705)
* Add a new constructor argument to `HttpServerSourceStage` & `HttpClientSourceStage` called `payload_to_df_fn`, allowing users to specify a custom payload parser.
* Remove work-around for rapidsai/cudf#5712 this bug is fixed in our current version of cudf.
* Relocate updated tests to `tests/stages`
Closes #1703
By Submitting this PR I confirm:
- I am familiar with the [Contributing Guidelines](https://github.com/nv-morpheus/Morpheus/blob/main/docs/source/developer_guide/contributing.md).
- When the PR is ready for review, new or existing tests cover these changes.
- When the PR is ready for review, the documentation is up to date with these changes.
Authors: - David Gardner (https://github.com/dagardner-nv)
Approvers: - Michael Demoret (https://github.com/mdemoret-nv)
URL: #1705

revans2 · 2024-09-09T19:35:04Z

This is still happening. I put a workaround into the Java API, but it is not a good solution to the problem.

karthikeyann · 2024-11-12T19:10:43Z

@galipremsagar What information does the Cython/Python layer need from libcudf read_json to ensure you can create index columns even if there are no columns in the result table?

ttnghia · 2024-11-15T00:04:38Z

The input in the example above actually does not contain "empty rows". They are "empty JSON objects/arrays" instead. Empty rows should mean only whitespace, like " "\n" ".

With recent changes in JSON reader, I believe that this should be fixed (at least for the case of empty JSON objects). Please confirm again.
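The distinction drawn above between empty rows and empty JSON objects/arrays can be sketched with the standard library alone. The function name `classify_line` and its labels are assumptions made for illustration; nothing here reflects how the cudf reader is implemented.

```python
import json


def classify_line(line: str) -> str:
    """Label one JSON Lines line (hypothetical helper, not a cudf API)."""
    if not line.strip():
        return "blank"          # an "empty row": whitespace only
    value = json.loads(line)
    if value == {}:
        return "empty_object"   # e.g. "{ }"
    if value == []:
        return "empty_array"    # e.g. "[ ]"
    return "record"


lines = '{ }\n \n[ ]\n{"a": 1}'.split("\n")
print([classify_line(line) for line in lines])
```

Under this reading, the inputs in the original report fall into the empty_object and empty_array categories, while a blank line is something a JSON Lines reader would normally skip rather than count as a row.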

karthikeyann · 2024-11-26T19:21:23Z

It's fixed for Spark because passing a schema with column pruning enabled creates null columns (sized to the number of rows).
For Python, it's still an issue: the number of rows is not returned for an empty table.
One option is to return the number of rows as metadata (num_rows_per_source).
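As a rough illustration of how per-source row counts could let a caller rebuild an index for a column-less result, here is a toy sketch in plain Python. Only the name num_rows_per_source comes from the comment above; its shape (one integer per input source), the `ReadResult` class, and both functions are assumptions and do not describe the libcudf or cudf API.

```python
from dataclasses import dataclass, field


@dataclass
class ReadResult:
    """Hypothetical stand-in for a reader result; not a cudf type."""
    columns: dict = field(default_factory=dict)
    num_rows_per_source: list = field(default_factory=list)


def read_jsonl_sources(sources):
    """Toy reader: counts non-blank lines per source and never builds columns."""
    counts = [
        sum(1 for line in src.split("\n") if line.strip())
        for src in sources
    ]
    return ReadResult(columns={}, num_rows_per_source=counts)


def default_index(result):
    """Build a 0..n-1 index from the metadata, even when there are no columns."""
    return list(range(sum(result.num_rows_per_source)))


result = read_jsonl_sources(["{ }\n{ }", "{ }"])
print(result.num_rows_per_source)
print(default_index(result))
```

With the counts available, the Python layer would not need any column to infer the row count, which is the gap described above for empty tables.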

vuule added bug Something isn't working Needs Triage Need team to review and classify cuIO cuIO issue labels Jul 17, 2020

kkraus14 removed the Needs Triage Need team to review and classify label Aug 5, 2020

github-actions bot added the inactive-90d label Mar 14, 2021

GregoryKimball added libcudf Affects libcudf (C++/CUDA) code. and removed inactive-90d labels Apr 2, 2023

GregoryKimball added the 0 - Backlog In queue waiting for assignment label Jun 5, 2023

GregoryKimball added this to the Nested JSON reader milestone Jun 5, 2023

This was referenced Jun 5, 2023

[BUG] JSON reader has no option to return the columns only for the requested schema #13473

Closed

[FEA] JSON reader improvements for Spark-RAPIDS #13525

Closed

revans2 added the Spark Functionality that helps Spark RAPIDS label Mar 14, 2024

revans2 mentioned this issue Mar 14, 2024

This fixes an NPE when trying to read empty JSON data by adding a new API for missing information #15307

Merged

3 tasks

vyasr closed this as completed May 16, 2024

dagardner-nv mentioned this issue May 16, 2024

Support passing a custom parser to HttpServerSourceStage and HttpClientSourceStage stages nv-morpheus/Morpheus#1705

Merged

revans2 reopened this Sep 9, 2024

GregoryKimball added this to libcudf Nov 12, 2024

GregoryKimball moved this to Needs owner in libcudf Nov 12, 2024

karthikeyann mentioned this issue Dec 2, 2024

Add num_rows to metadata in JSON reader #17480

Draft

3 tasks

GregoryKimball mentioned this issue Jan 10, 2025

[FEA] JSON reader performance projects #17718

Open
