Implement DataFrame.eval using libcudf ASTs #8022

vyasr · 2021-04-21T22:05:01Z

This PR exposes libcudf's expression parsing functionality in cudf and uses it to implement `DataFrame.eval`. The implementation is mostly feature-complete, but there are a few limitations relative to the pandas API and a couple of gotchas around type casting. The implementation is reasonably performant, improving upon an equivalent `df.apply` even accounting for JIT-compilation overhead. This implementation provides a stepping stone to leveraging libcudf's AST implementation for more complex tasks in cudf such as conditional joins.

The most significant issue with the current implementation is the lack of casting between integral types, meaning that operations can only be performed between columns of the exact same dtype. For example, operations between `int8` and `int16` columns would fail. This becomes particularly problematic for constants, e.g. `df.eval('x+1')`. The best paths to improve this are at the C++ level of the expression evaluation, so I think we'll have to live with this limitation for now if we want to move forward.
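To make the same-dtype restriction concrete, here is a minimal sketch in pure Python that uses only the standard `ast` module. It is not the code in this PR: `operand_dtypes`, `is_evaluable`, and `LITERAL_DTYPE` are hypothetical names, and the choice to type integer literals as `int64` is an assumption made purely for illustration.

```python
import ast

# Assumption for illustration only: integer literals are typed as int64.
LITERAL_DTYPE = "int64"


def operand_dtypes(expr, column_dtypes):
    """Collect the dtypes of every column reference and literal in expr."""
    dtypes = set()
    for node in ast.walk(ast.parse(expr, mode="eval")):
        if isinstance(node, ast.Name):
            # A bare name is treated as a column lookup.
            dtypes.add(column_dtypes[node.id])
        elif isinstance(node, ast.Constant):
            dtypes.add(LITERAL_DTYPE)
    return dtypes


def is_evaluable(expr, column_dtypes):
    """True when all operands share one dtype, i.e. no cast would be needed."""
    return len(operand_dtypes(expr, column_dtypes)) <= 1


dtypes = {"x": "int8", "y": "int16", "z": "int8"}
print(is_evaluable("x + z", dtypes))  # both operands are int8
print(is_evaluable("x + y", dtypes))  # int8 mixed with int16
print(is_evaluable("x + 1", dtypes))  # int8 column mixed with a literal
```

Under these assumptions, a user-side workaround would be to cast the columns to one common dtype before calling `eval`, at the cost of an extra copy.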

Resolves #9112

cpp/include/cudf/ast/detail/transform.cuh

codecov · 2021-04-22T01:22:58Z

Codecov Report

Merging #8022 (11d81d6) into branch-22.06 (01d08af) will increase coverage by 0.07%.
The diff coverage is 93.77%.

❗ Current head 11d81d6 differs from pull request most recent head f009e5b. Consider uploading reports for the commit f009e5b to get more accurate results

@@               Coverage Diff                @@
##           branch-22.06    #8022      +/-   ##
================================================
+ Coverage         86.35%   86.42%   +0.07%     
================================================
  Files               142      143       +1     
  Lines             22335    22438     +103     
================================================
+ Hits              19287    19393     +106     
+ Misses             3048     3045       -3

| Impacted Files | Coverage Δ | |
|---|---|---|
| python/cudf/cudf/_fuzz_testing/fuzzer.py | `0.00% <0.00%> (ø)` | |
| python/cudf/cudf/_lib/__init__.py | `100.00% <ø> (ø)` | |
| python/cudf/cudf/core/indexed_frame.py | `91.70% <ø> (ø)` | |
| python/cudf/cudf/testing/dataset_generator.py | `73.25% <ø> (ø)` | |
| python/cudf/cudf/testing/testing.py | `81.69% <ø> (ø)` | |
| python/cudf/cudf/utils/utils.py | `90.35% <ø> (+0.06%)` | ⬆️ |
| python/dask_cudf/dask_cudf/io/orc.py | `91.04% <ø> (ø)` | |
| python/cudf/cudf/core/column/numerical.py | `96.17% <50.00%> (+0.29%)` | ⬆️ |
| python/dask_cudf/dask_cudf/io/parquet.py | `92.39% <81.08%> (-1.40%)` | ⬇️ |
| python/cudf/cudf/core/_internals/expressions.py | `92.85% <92.85%> (ø)` | |

... and 24 more

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8a4d1b2...f009e5b.

harrism · 2021-07-21T22:36:45Z

Moving to 21.10

python/cudf/cudf/_lib/ast.pyx

python/cudf/cudf/_lib/transform.pyx

python/cudf/cudf/core/dataframe.py

brandon-b-miller · 2022-04-22T16:34:37Z

I have some thoughts about the mixed dtype issue, but they're probably a little out of scope of this PR. Let's follow up when you get a chance.

bdice

I have several comments that I would like to see addressed. For now, I am requesting changes. If you would like to punt all these comments/fixes to a future PR, I will approve and let you do that separately so that this large chunk of changes doesn't get stale.

python/cudf/cudf/_lib/ast.pyx

python/cudf/cudf/core/dataframe.py

…ately.

…to a pure Python module.

bdice

A few minor comments but this looks great!

python/cudf/cudf/_lib/cpp/expressions.pxd

python/cudf/cudf/_lib/cpp/transform.pxd

python/cudf/cudf/_lib/transform.pyx

python/cudf/cudf/core/_internals/expressions.py

python/cudf/cudf/core/dataframe.py

python/cudf/cudf/tests/test_dataframe.py

brandon-b-miller

I think all my comments were addressed here.

vyasr · 2022-04-28T15:04:08Z

@gpucibot merge

vyasr added the feature request, Python, 0 - Blocked, and non-breaking labels Apr 21, 2021

vyasr self-assigned this Apr 21, 2021

github-actions bot added the conda and libcudf labels Apr 21, 2021

jrhemstad reviewed Apr 21, 2021

cpp/include/cudf/ast/detail/transform.cuh Outdated

vyasr force-pushed the feature/python_ast branch from 3ccbb4c to 368a4b5 Compare May 14, 2021 22:34

github-actions bot removed the conda label May 14, 2021

vyasr changed the title ~~Add Python support for evaluation via AST~~ Add Python support for evaluation via AST May 17, 2021

vyasr changed the base branch from branch-21.06 to branch-21.08 May 27, 2021 00:03

harrism changed the base branch from branch-21.08 to branch-21.10 July 21, 2021 22:37

vyasr force-pushed the feature/python_ast branch from 368a4b5 to 997a746 Compare August 25, 2021 17:55

vyasr added this to the Conditional Joins milestone Aug 25, 2021

vyasr force-pushed the feature/python_ast branch from eed923f to 6ff2d3c Compare September 9, 2021 23:18

vyasr mentioned this pull request Oct 21, 2021

Remove dependency on six. #9495

Merged

vyasr force-pushed the feature/python_ast branch from 6ff2d3c to 44ef2a5 Compare April 11, 2022 23:39