Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Fix an issue with reading raw string in cudf.read_json #10924

Merged
merged 7 commits into from
May 23, 2022

Conversation

galipremsagar
Copy link
Contributor

Fixes issue described here: #10275 (comment)

This PR removes a false error that says path couldn't be resolved. But that isn't true incase of a json reader where the input can be a json string itself, hence to resolve this issue is_raw_text_like_input that indicates an IO reader(like read_json) is calling the utility function and the path need not be a valid one. In case of a true invalid path, either fsspec or libcudf throws a file not found error.

@galipremsagar galipremsagar added bug Something isn't working 3 - Ready for Review Ready for review by team Python Affects Python cuDF API. 4 - Needs cuDF (Python) Reviewer improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels May 23, 2022
@galipremsagar galipremsagar self-assigned this May 23, 2022
@galipremsagar galipremsagar requested a review from a team as a code owner May 23, 2022 12:39
@galipremsagar galipremsagar removed the improvement Improvement / enhancement to an existing function label May 23, 2022
@galipremsagar
Copy link
Contributor Author

rerun tests

python/cudf/cudf/io/orc.py Outdated Show resolved Hide resolved
python/cudf/cudf/utils/ioutils.py Show resolved Hide resolved
python/cudf/cudf/utils/ioutils.py Outdated Show resolved Hide resolved
python/cudf/cudf/utils/ioutils.py Outdated Show resolved Hide resolved
if _is_local_filesystem(fs):
# Doing this as `read_json` accepts a json string
# path_or_data need not be a filepath like string
if os.path.exists(paths[0]):
path_or_data = paths if len(paths) > 1 else paths[0]
if len(paths):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Did we lose the check for len(paths) == 0 above this intentionally? Even if _is_local_filesystem returns false, I think we still need paths to be nonempty.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yea, this was intentional because fsspec tends to not return any paths now. However, this comment lead me to think about the check when _is_local_filesystem is False i.e., in else block I was missing this check. Added it now.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, we should be erroring based on len(paths) == 0 only in the case when _is_local_filesystem returns False. In the True case we should check for paths being nonempty but only error if paths is nonempty and those paths don't exist. Empty paths does not constitute an error, which is the change in fsspec that is causing our tests to fail.

python/cudf/cudf/io/json.py Outdated Show resolved Hide resolved
python/cudf/cudf/io/json.py Outdated Show resolved Hide resolved
python/cudf/cudf/io/orc.py Outdated Show resolved Hide resolved
python/cudf/cudf/io/orc.py Outdated Show resolved Hide resolved
python/cudf/cudf/io/orc.py Outdated Show resolved Hide resolved
Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Couple questions, but nothing blocking (although I am curious about the os.path.exists->fs.exists change). Let's prioritize getting this merged ASAP to unblock CI. Thanks @galipremsagar!

python/cudf/cudf/io/orc.py Outdated Show resolved Hide resolved
if _is_local_filesystem(fs):
# Doing this as `read_json` accepts a json string
# path_or_data need not be a filepath like string
if os.path.exists(paths[0]):
path_or_data = paths if len(paths) > 1 else paths[0]
if len(paths):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, we should be erroring based on len(paths) == 0 only in the case when _is_local_filesystem returns False. In the True case we should check for paths being nonempty but only error if paths is nonempty and those paths don't exist. Empty paths does not constitute an error, which is the change in fsspec that is causing our tests to fail.

python/cudf/cudf/io/orc.py Show resolved Hide resolved
python/cudf/cudf/utils/ioutils.py Show resolved Hide resolved
@galipremsagar galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Needs cuDF (Python) Reviewer labels May 23, 2022
@codecov
Copy link

codecov bot commented May 23, 2022

Codecov Report

Merging #10924 (425f2c4) into branch-22.06 (54789ee) will increase coverage by 0.02%.
The diff coverage is 93.75%.

❗ Current head 425f2c4 differs from pull request most recent head 9241931. Consider uploading reports for the commit 9241931 to get more accurate results

@@               Coverage Diff                @@
##           branch-22.06   #10924      +/-   ##
================================================
+ Coverage         86.30%   86.32%   +0.02%     
================================================
  Files               144      144              
  Lines             22665    22668       +3     
================================================
+ Hits              19560    19569       +9     
+ Misses             3105     3099       -6     
Impacted Files Coverage Δ
python/cudf/cudf/utils/ioutils.py 79.47% <87.50%> (-0.13%) ⬇️
python/cudf/cudf/io/avro.py 78.57% <100.00%> (ø)
python/cudf/cudf/io/csv.py 91.80% <100.00%> (ø)
python/cudf/cudf/io/json.py 97.56% <100.00%> (ø)
python/cudf/cudf/io/orc.py 92.77% <100.00%> (ø)
python/cudf/cudf/io/parquet.py 90.83% <100.00%> (ø)
python/cudf/cudf/io/text.py 100.00% <100.00%> (ø)
python/cudf/cudf/core/dataframe.py 93.78% <0.00%> (+0.04%) ⬆️
python/cudf/cudf/core/column/string.py 88.78% <0.00%> (+0.12%) ⬆️
python/cudf/cudf/core/groupby/groupby.py 91.79% <0.00%> (+0.22%) ⬆️
... and 3 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6acf226...9241931. Read the comment docs.

@galipremsagar
Copy link
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 5067cc7 into rapidsai:branch-22.06 May 23, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants