avro reader integration tests #7156

Merged
merged 7 commits into from
Feb 11, 2021

cwharris
Contributor

@cwharris cwharris commented Jan 15, 2021

Added some avro reader integration tests for fastavro. These cover type detection, single-value parsing, and null value parsing, but do not cover parsing multiple values.
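
As a rough illustration of the single-value and null-value cases described above, the sketch below writes one value and one null with fastavro and reads them back. It is not the test code from this PR: the `roundtrip` helper, the `root` record name, and the `prop` field layout are assumptions made for the example, and the real tests would presumably read the buffer with `cudf.read_avro` and compare against an expected frame. The import is guarded so the snippet still runs where fastavro is not installed.

```python
import io

try:
    import fastavro  # third-party; without it the sketch degrades to a no-op
except ImportError:
    fastavro = None


def roundtrip(avro_type, value):
    """Write one value and one null with fastavro, then read both back.

    Hypothetical helper for illustration; returns None if fastavro is missing.
    """
    if fastavro is None:
        return None
    # A nullable Avro field is a union that includes "null".
    schema = fastavro.parse_schema(
        {
            "type": "record",
            "name": "root",
            "fields": [{"name": "prop", "type": ["null", avro_type]}],
        }
    )
    records = [{"prop": value}, {"prop": None}]
    buffer = io.BytesIO()
    fastavro.writer(buffer, schema, records)
    buffer.seek(0)
    return list(fastavro.reader(buffer))


result = roundtrip("long", 42)
```

Two records with a single field correspond to two rows and one column once loaded into a frame.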

@cwharris cwharris marked this pull request as ready for review January 27, 2021 05:37
@cwharris cwharris requested a review from a team as a code owner January 27, 2021 05:37
@cwharris
Contributor Author

rerun tests

@cwharris cwharris added 4 - Needs cuIO Reviewer cuIO cuIO issue Python Affects Python cuDF API. non-breaking Non-breaking change labels Jan 27, 2021
@codecov

codecov bot commented Jan 27, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-0.19@fc40c52).
The diff coverage is n/a.

@@              Coverage Diff               @@
##             branch-0.19    #7156   +/-   ##
==============================================
  Coverage               ?   82.22%           
==============================================
  Files                  ?      100           
  Lines                  ?    16969           
  Branches               ?        0           
==============================================
  Hits                   ?    13953           
  Misses                 ?     3016           
  Partials               ?        0           

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update fc40c52...af6a966.

@cwharris cwharris requested review from vuule and removed request for brandon-b-miller January 28, 2021 22:24
@vuule vuule added the improvement Improvement / enhancement to an existing function label Jan 28, 2021
Contributor

@vuule vuule left a comment

Good stuff. Got some questions/suggestions.

python/cudf/cudf/tests/test_avro.py
Comment on lines 130 to 133
records = [
{"prop": avro_val},
{"prop": None},
]

Contributor

is the dataframe shape (1,2)?

Contributor Author

expected and actual are the same shape. I don't know what shape that should be.

Contributor

Should we also have some tests with a large number of rows?

Contributor Author

We can test a large number of values. It would be nice to have a test data generator. I see we're generating random values for fuzz testing. Are we able to do that in a deterministic manner so it can also be used for unit tests?

Contributor

IIRC the data generator optionally takes a seed value, so that the output is deterministic for each seed. CC @galipremsagar for a pointer to the generator and a sample use.
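
To illustrate the point about seeding, here is a stand-in sketch that uses Python's standard `random` module rather than cuDF's own generator. `make_records` and its parameters are hypothetical names chosen for the example; the only claim is that a generator driven by a fixed seed yields the same records on every run, which is what makes it usable in unit tests.

```python
import random


def make_records(rows, seed, null_frequency=0.4):
    """Build `rows` records with a nullable int64 field, deterministically.

    Hypothetical helper; not the cuDF dataset generator.
    """
    rng = random.Random(seed)  # private RNG, so global random state is untouched
    records = []
    for _ in range(rows):
        if rng.random() < null_frequency:
            records.append({"prop": None})
        else:
            records.append({"prop": rng.randint(-(2**63), 2**63 - 1)})
    return records


first = make_records(rows=10, seed=2)
second = make_records(rows=10, seed=2)
```

Because both calls use the same seed, `first` and `second` hold identical records.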

Contributor

@galipremsagar galipremsagar Feb 4, 2021

Since we are discussing a large number of rows, I'd recommend staying under 30 rows so that the pytests don't slow down much, as that would slow down GPU CI too. If a bug only reproduces in a large-column scenario, then we can widen the test coverage for large columns; otherwise I think the fuzz tests should take care of testing large row counts. Here is how to use the dataset generator:

>>> import cudf
>>> from cudf.tests.dataset_generator import rand_dataframe
>>> rand_dataframe(dtypes_meta=[{"dtype": "int64", "null_frequency": 0.4, "cardinality": 10}], rows=100, seed=2)
pyarrow.Table
0: int64
>>> cudf.DataFrame.from_arrow(rand_dataframe(dtypes_meta=[{"dtype": "int64", "null_frequency": 0.4, "cardinality": 10}], rows=100, seed=2))
                       0
0   -1468954783236838137
1                   <NA>
2    2200161065918338095
3   -1193091257902529461
4   -5448271019629827509
..                   ...
95                  <NA>
96   2200161065918338095
97  -8745117541724490168
98                  <NA>
99  -4301277553722975852

[100 rows x 1 columns]

Alternatively, there is an existing API, widely used across our pytests, that also returns deterministic data for a given seed:
https://github.com/rapidsai/cudf/blob/branch-0.18/python/cudf/cudf/datasets.py#L60
This is much simpler to use and fits the use-case here.

Contributor

Should we instead change this test to use a list of values (with cudf_val of length 5 or 10) rather than a single value?
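
One way the suggestion could look, sketched in plain Python: build the schema and records from a list of values and append a trailing null. `build_case`, the `root` record name, and the `prop` field are hypothetical names for illustration and are not taken from the test file.

```python
def build_case(avro_type, values):
    """Return (schema, records, shape) for a nullable single-field record.

    Hypothetical helper; `shape` is (rows, columns) of the resulting frame.
    """
    schema = {
        "type": "record",
        "name": "root",
        # A union with "null" marks the field as nullable in Avro.
        "fields": [{"name": "prop", "type": ["null", avro_type]}],
    }
    records = [{"prop": value} for value in values]
    records.append({"prop": None})  # keep the null case covered
    shape = (len(records), len(schema["fields"]))
    return schema, records, shape


schema, records, shape = build_case("long", [1, 2, 3, 4, 5])
```

With five values plus one null, the expected frame has six rows and one column, which stays well under the row budget discussed above.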

@vuule vuule added the 0 - Waiting on Author Waiting for author to respond to review label Feb 2, 2021
@cwharris cwharris requested a review from vuule February 2, 2021 19:23
@vuule vuule removed the 0 - Waiting on Author Waiting for author to respond to review label Feb 2, 2021
@cwharris
Contributor Author

cwharris commented Feb 2, 2021

Looks like the PR is failing due to mypy style checks unrelated to these changes. Can we ignore that?

@galipremsagar
Contributor

> Looks like the PR is failing due to mypy style checks unrelated to these changes. Can we ignore that?

Fix incoming: #7279

Contributor

@vuule vuule left a comment

Requesting changes based on the two unresolved comments.

@kkraus14
Collaborator

kkraus14 commented Feb 9, 2021

rerun tests

@harrism harrism requested a review from vuule February 9, 2021 22:52
@vuule vuule added the 0 - Waiting on Author Waiting for author to respond to review label Feb 10, 2021
@cwharris cwharris removed the 0 - Waiting on Author Waiting for author to respond to review label Feb 10, 2021
@cwharris cwharris requested a review from vuule February 10, 2021 16:55
@vuule
Contributor

vuule commented Feb 10, 2021

@cwharris should this PR close #6802?

@kkraus14 kkraus14 changed the base branch from branch-0.18 to branch-0.19 February 10, 2021 19:15
@kkraus14
Collaborator

Retargeted to branch-0.19.

Contributor

@galipremsagar galipremsagar left a comment

Maybe it would be better to add a PR description?

@galipremsagar galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 4 - Needs cuDF (Python) Reviewer labels Feb 11, 2021
@vuule
Contributor

vuule commented Feb 11, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit aa72df7 into rapidsai:branch-0.19 Feb 11, 2021
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge cuIO cuIO issue improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.
4 participants