Implement DataFrame.eval using libcudf ASTs #8022

Merged
merged 69 commits into from
Apr 28, 2022

Contributor

@vyasr vyasr commented Apr 21, 2021

This PR exposes libcudf's expression parsing functionality in cudf and uses it to implement DataFrame.eval. The implementation is mostly feature-complete, but there are a few limitations relative to the pandas API and a couple of gotchas around type casting. The implementation is reasonably performant, improving upon an equivalent df.apply even accounting for JIT-compilation overhead. This implementation provides a stepping stone to leveraging libcudf's AST implementation for more complex tasks in cudf such as conditional joins.
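For readers unfamiliar with expression-based evaluation, here is a minimal sketch of the general idea: parse an expression string into a tree, then walk the tree and apply each operation column-wise. This is not the cudf or libcudf implementation. It uses only Python's standard `ast` module, plain lists stand in for columns, and the helper name `eval_columns` is hypothetical.

```python
import ast
import operator

# Binary operators understood by this toy evaluator (an illustrative subset).
_BINARY_OPS = {
    ast.Add: operator.add,
    ast.Sub: operator.sub,
    ast.Mult: operator.mul,
    ast.Div: operator.truediv,
}


def eval_columns(expr, columns):
    """Evaluate `expr` element-wise over `columns`, a dict of equal-length lists.

    Hypothetical helper for illustration only; not part of cudf.
    """
    tree = ast.parse(expr, mode="eval")
    n_rows = len(next(iter(columns.values())))

    def visit(node):
        # Binary operation: evaluate both operands, then combine row by row.
        if isinstance(node, ast.BinOp) and type(node.op) in _BINARY_OPS:
            op = _BINARY_OPS[type(node.op)]
            left, right = visit(node.left), visit(node.right)
            return [op(a, b) for a, b in zip(left, right)]
        # A bare name refers to a column.
        if isinstance(node, ast.Name):
            return columns[node.id]
        # A numeric literal is broadcast to the column length.
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return [node.value] * n_rows
        raise ValueError(f"unsupported expression element: {ast.dump(node)}")

    return visit(tree.body)


result = eval_columns("a + b * 2", {"a": [1, 2, 3], "b": [10, 20, 30]})
print(result)
```

A real implementation would translate the tree into device-side expression nodes rather than evaluating it in Python, but the traversal has the same shape.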

The most significant issue with the current implementation is the lack of casting between integral types, meaning that operations can only be performed between columns of the exact same dtype. For example, an operation between an int8 column and an int16 column fails. This is particularly problematic for constants, e.g. df.eval('x+1'). The best paths to improve this are at the C++ level of the expression evaluation, so I think we'll have to live with this limitation for now if we want to move forward.
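To make the limitation concrete, here is a toy model of strict same-dtype evaluation. Everything in it is hypothetical (`TypedColumn`, `strict_add`, `literal`, `cast`), and it assumes that an integer literal materializes as int64, which is an assumption made for illustration rather than something stated above. Casting the column up front is shown as one possible workaround, not something this PR prescribes.

```python
from dataclasses import dataclass


@dataclass
class TypedColumn:
    """Toy stand-in for a typed column (hypothetical, not a cudf class)."""
    dtype: str   # e.g. "int8", "int16", "int64"
    values: list


def strict_add(left, right):
    """Add two columns, refusing mixed dtypes (no implicit casting)."""
    if left.dtype != right.dtype:
        raise TypeError(f"dtype mismatch: {left.dtype} vs {right.dtype}")
    summed = [a + b for a, b in zip(left.values, right.values)]
    return TypedColumn(left.dtype, summed)


def literal(value, n_rows, dtype="int64"):
    # Assumption: an integer literal is materialized as int64.
    return TypedColumn(dtype, [value] * n_rows)


def cast(col, dtype):
    # Toy cast: relabels the dtype only; a real cast would convert the data.
    return TypedColumn(dtype, list(col.values))


x = TypedColumn("int8", [1, 2, 3])

# Analogue of 'x+1' with an int8 column: rejected because the dtypes differ.
try:
    strict_add(x, literal(1, 3))
    mixed_ok = True
except TypeError:
    mixed_ok = False

# Possible workaround: cast the column explicitly so both operands match.
widened = strict_add(cast(x, "int64"), literal(1, 3))
print(mixed_ok, widened)
```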

Resolves #9112

@vyasr vyasr added the feature request, Python, 0 - Blocked, and non-breaking labels Apr 21, 2021
@vyasr vyasr self-assigned this Apr 21, 2021
@github-actions github-actions bot added the conda and libcudf labels Apr 21, 2021
@codecov

codecov bot commented Apr 22, 2021

Codecov Report

Merging #8022 (11d81d6) into branch-22.06 (01d08af) will increase coverage by 0.07%.
The diff coverage is 93.77%.

❗ Current head 11d81d6 differs from pull request most recent head f009e5b. Consider uploading reports for the commit f009e5b to get more accurate results

@@               Coverage Diff                @@
##           branch-22.06    #8022      +/-   ##
================================================
+ Coverage         86.35%   86.42%   +0.07%     
================================================
  Files               142      143       +1     
  Lines             22335    22438     +103     
================================================
+ Hits              19287    19393     +106     
+ Misses             3048     3045       -3     
| Impacted Files | Coverage Δ |
|---|---|
| `python/cudf/cudf/_fuzz_testing/fuzzer.py` | 0.00% <0.00%> (ø) |
| `python/cudf/cudf/_lib/__init__.py` | 100.00% <ø> (ø) |
| `python/cudf/cudf/core/indexed_frame.py` | 91.70% <ø> (ø) |
| `python/cudf/cudf/testing/dataset_generator.py` | 73.25% <ø> (ø) |
| `python/cudf/cudf/testing/testing.py` | 81.69% <ø> (ø) |
| `python/cudf/cudf/utils/utils.py` | 90.35% <ø> (+0.06%) ⬆️ |
| `python/dask_cudf/dask_cudf/io/orc.py` | 91.04% <ø> (ø) |
| `python/cudf/cudf/core/column/numerical.py` | 96.17% <50.00%> (+0.29%) ⬆️ |
| `python/dask_cudf/dask_cudf/io/parquet.py` | 92.39% <81.08%> (-1.40%) ⬇️ |
| `python/cudf/cudf/core/_internals/expressions.py` | 92.85% <92.85%> (ø) |
| ... and 24 more | |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 8a4d1b2...f009e5b.

@vyasr vyasr force-pushed the feature/python_ast branch from 3ccbb4c to 368a4b5 Compare May 14, 2021 22:34
@github-actions github-actions bot removed the conda label May 14, 2021
@vyasr vyasr changed the title Add Python support for evaluation via AST Add Python support for evaluation via AST May 17, 2021
@vyasr vyasr changed the base branch from branch-21.06 to branch-21.08 May 27, 2021 00:03
@harrism
Member

harrism commented Jul 21, 2021

Moving to 21.10

@harrism harrism changed the base branch from branch-21.08 to branch-21.10 July 21, 2021 22:37
@vyasr vyasr force-pushed the feature/python_ast branch from 368a4b5 to 997a746 Compare August 25, 2021 17:55
@vyasr vyasr added this to the Conditional Joins milestone Aug 25, 2021
@vyasr vyasr force-pushed the feature/python_ast branch from eed923f to 6ff2d3c Compare September 9, 2021 23:18
@vyasr vyasr mentioned this pull request Oct 21, 2021
@vyasr vyasr force-pushed the feature/python_ast branch from 6ff2d3c to 44ef2a5 Compare April 11, 2022 23:39
@brandon-b-miller
Contributor

I have some thoughts about the mixed dtype issue, but they're probably a little out of scope of this PR. Let's follow up when you get a chance.

Contributor

@bdice bdice left a comment

I have several comments that I would like to see addressed. For now, I am requesting changes. If you would like to punt all these comments/fixes to a future PR, I will approve and let you do that separately so that this large chunk of changes doesn't get stale.

python/cudf/cudf/_lib/ast.pyx (outdated, resolved)
python/cudf/cudf/_lib/ast.pyx (outdated, resolved)
python/cudf/cudf/_lib/ast.pyx (outdated, resolved)
python/cudf/cudf/_lib/ast.pyx (outdated, resolved)
python/cudf/cudf/core/dataframe.py (outdated, resolved)
python/cudf/cudf/core/dataframe.py (resolved)
python/cudf/cudf/core/dataframe.py (outdated, resolved)
python/cudf/cudf/core/dataframe.py (outdated, resolved)
python/cudf/cudf/core/dataframe.py (outdated, resolved)
python/cudf/cudf/core/dataframe.py (outdated, resolved)
@vyasr vyasr requested review from bdice and brandon-b-miller April 26, 2022 15:47
Contributor

@bdice bdice left a comment

A few minor comments but this looks great!

python/cudf/cudf/_lib/cpp/expressions.pxd (outdated, resolved)
python/cudf/cudf/_lib/cpp/transform.pxd (outdated, resolved)
python/cudf/cudf/_lib/transform.pyx (outdated, resolved)
python/cudf/cudf/core/_internals/expressions.py (outdated, resolved)
python/cudf/cudf/core/dataframe.py (outdated, resolved)
python/cudf/cudf/core/dataframe.py (resolved)
python/cudf/cudf/core/dataframe.py (outdated, resolved)
python/cudf/cudf/tests/test_dataframe.py (outdated, resolved)
Contributor

@brandon-b-miller brandon-b-miller left a comment

I think all my comments were addressed here.

@vyasr
Contributor Author

vyasr commented Apr 28, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit a43fb9e into rapidsai:branch-22.06 Apr 28, 2022
@vyasr vyasr deleted the feature/python_ast branch June 30, 2022 00:29
Labels: 3 - Ready for Review (Ready for review by team), feature request (New feature or request), libcudf (Affects libcudf (C++/CUDA) code), non-breaking (Non-breaking change), Python (Affects Python cuDF API)
8 participants