Datafusion aggregate #471

Merged

@jdye64 jdye64 commented Apr 12, 2022

Adds support for Aggregate and Group By in queries.

This closes #464
This closes #472
This closes #478
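
As a rough illustration of the query shape this PR targets, the sketch below runs a GROUP BY aggregate. It uses Python's built-in `sqlite3` purely as a stand-in engine so that it runs without dask-sql installed; the table name `df`, its columns `a` and `b`, and the helper `run_group_by` are hypothetical. With dask-sql, the same SQL string would instead be passed to `Context.sql` after registering a dataframe with `Context.create_table`. Which aggregate functions the DataFusion branch supports at this commit is not stated in the PR text.

```python
import sqlite3

# Hypothetical sample data: column "a" is the grouping key, "b" the value.
ROWS = [(1, 10.0), (1, 20.0), (2, 5.0), (2, 7.0), (3, 1.0)]

# One output row per distinct value of "a", with two aggregates.
QUERY = """
    SELECT a, SUM(b) AS total, COUNT(*) AS n
    FROM df
    GROUP BY a
    ORDER BY a
"""


def run_group_by(rows):
    """Load rows into an in-memory table named df and run the aggregate query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE df (a INTEGER, b REAL)")
        conn.executemany("INSERT INTO df VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


if __name__ == "__main__":
    for row in run_group_by(ROWS):
        print(row)
```

Each result tuple holds the group key followed by the sum and the row count for that group.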

rjzamora and others added 29 commits March 25, 2022 11:48
* basic predicate-pushdown support

* remove explicit Dispatch class

* use _Frame.fillna

* cleanup comments

* test coverage

* improve test coverage

* add xfail test for dt accessor in predicate and fix test_show.py

* fix some naming issues

* add config and use assert_eq

* add logging events when predicate-pushdown bails

* move bail logic earlier in function

* address easier code review comments

* typo fix

* fix creation_info access bug

* convert any expression to DNF

* csv test coverage

* include IN coverage

* improve test rigor

* address code review

* skip parquet tests when deps are not installed

* fix bug

* add pyarrow dep to cluster workers

* roll back test skipping changes

Co-authored-by: Charles Blackmon-Luca <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
* Simplify gpuCI updating workflow

* Add check for cuML nightly version
…ontrib#447)

* Add handling for newer prompt-toolkit version

* Place compatibility code in _compat
* Update versions for java dependencies with cves

* Rerun tests
* Update versions for java dependencies with cves

* Rerun tests

* update jackson-databind dependency
* Disable SQL server functionality

* Update docs/source/server.rst

Co-authored-by: Ayush Dattagupta <[email protected]>

* Disable server at lowest possible level

* Skip all server tests

* Add tests to ensure server is disabled

* Fix CVE fix test

Co-authored-by: Ayush Dattagupta <[email protected]>
* Revert "Disable SQL server functionality (dask-contrib#448)"

This reverts commit 37a3a61.

* Bump httpclient version
* Unpin dask/distributed post release

* Remove dask/distributed version ceiling
* Add jsonschema to ci env

* Fix typo in config schema
…ask-contrib#365)

* Start moving tests to dd.assert_eq

* Use assert_eq in datetime filter test

* Resolve most resulting test failures

* Resolve remaining test failures

* Convert over tests

* Convert more tests

* Consolidate select limit cpu/gpu test

* Remove remaining assert_series_equal

* Remove explicit cudf imports from many tests

* Resolve rex test failures

* Remove some additional compute calls

* Consolidate sorting tests with getfixturevalue

* Fix failed join test

* Remove breakpoint

* Use custom assert_eq function for tests

* Resolve test failures / seg faults

* Remove unnecessary testing utils

* Resolve local test failures

* Generalize RAND test

* Avoid closing client if using independent cluster

* Fix failures on Windows

* Resolve black failures

* Make random test variables more clear
* Set max pin on antlr4-python-runtime due to incompatibilities with fugue_sql

* update comment on antlr max pin version
@charlesbluca

Is there any way we can get CI running for this currently so that the uncommented tests run?

@jdye64 jdye64 requested a review from charlesbluca April 18, 2022 20:46
@jdye64 jdye64 requested a review from charlesbluca April 20, 2022 14:08
@charlesbluca

rerun tests

@jdye64

jdye64 commented Apr 20, 2022

@charlesbluca how does one find the link for the gpuci runs?

@charlesbluca

rerun tests

@charlesbluca

rerun tests

@charlesbluca

rerun tests

@ayushdg ayushdg left a comment

Updates lgtm! We should be good to merge on my end.

The last thing I wanted to discuss is whether there's scope to release the Python GIL in situations where we aren't directly working with Python objects, such as parsing the SQL string and converting the AST into a DataFusion plan. If there is, it might be worth raising an issue and following up in a separate PR.

@charlesbluca charlesbluca merged commit afeee32 into dask-contrib:datafusion-sql-planner Apr 21, 2022
charlesbluca added a commit that referenced this pull request Sep 21, 2022
* First pass at datafusion parsing

* updates

* updates

* updates

* DaskSchema implementation for Python in Rust

* updated mappings so that Python types map to PyArrow types which is the type also used by Datafusion Statements which are logical plans

* Add ability to add columns to an existing DaskTable

* Add ability for tables to be added to the DaskSchema

* Completion of _get_ral() function in dask-sql. Still does not actually compute yet

* Finished converting base class and DaskRelDataType and DaskRelDataTypeField

* Can make a very simple pass of a projection on a TableScan operation query work now

* updates

* Allow for the rough registration of Schemas to the DaskSQLContext

* pytest test_context.py working/checkpoint

* all unit tests passing/checkpoint

* checkpoint

* Update on test_select.py

* Refactor setup.py

* Refactored Rust code to traverse the AST SQL parse tree

* Datafusion aggregate (#471)

* Add basic predicate-pushdown optimization (#433)

* basic predicate-pushdown support

* remove explicit Dispatch class

* use _Frame.fillna

* cleanup comments

* test coverage

* improve test coverage

* add xfail test for dt accessor in predicate and fix test_show.py

* fix some naming issues

* add config and use assert_eq

* add logging events when predicate-pushdown bails

* move bail logic earlier in function

* address easier code review comments

* typo fix

* fix creation_info access bug

* convert any expression to DNF

* csv test coverage

* include IN coverage

* improve test rigor

* address code review

* skip parquet tests when deps are not installed

* fix bug

* add pyarrow dep to cluster workers

* roll back test skipping changes

Co-authored-by: Charles Blackmon-Luca <[email protected]>

* Add workflow to keep datafusion dev branch up to date (#440)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Updates to dates and parsing dates like postgresql does

* Update gpuCI `RAPIDS_VER` to `22.06` (#434)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Bump black to 22.3.0 (#443)

* Check for ucx-py nightlies when updating gpuCI (#441)

* Simplify gpuCI updating workflow

* Add check for cuML nightly version

* Refactored to adjust for better type management

* Refactor schema and statements

* update types

* fix syntax issues and renamed function name calls

* Add handling for newer `prompt_toolkit` versions in cmd tests (#447)

* Add handling for newer prompt-toolkit version

* Place compatibility code in _compat

* Fix version for gha-find-replace (#446)

* Improved error handling and code clean up

* move pieces of logical.rs to separate files to ensure code readability

* left join working

* Update versions of Java dependencies (#445)

* Update versions for java dependencies with cves

* Rerun tests

* Update jackson databind version (#449)

* Update versions for java dependencies with cves

* Rerun tests

* update jackson-databind dependency

* Disable SQL server functionality (#448)

* Disable SQL server functionality

* Update docs/source/server.rst

Co-authored-by: Ayush Dattagupta <[email protected]>

* Disable server at lowest possible level

* Skip all server tests

* Add tests to ensure server is disabled

* Fix CVE fix test

Co-authored-by: Ayush Dattagupta <[email protected]>

* Update dask pinnings for release (#450)

* Add Java source code to source distribution (#451)

* Bump `httpclient` dependency (#453)

* Revert "Disable SQL server functionality (#448)"

This reverts commit 37a3a61fb13b0c56fcc10bf8ef01f4885a58dae8.

* Bump httpclient version

* Unpin Dask/distributed versions (#452)

* Unpin dask/distributed post release

* Remove dask/distributed version ceiling

* Add jsonschema to ci testing (#454)

* Add jsonschema to ci env

* Fix typo in config schema

* Switch tests from `pd.testing.assert_frame_equal` to `dd.assert_eq` (#365)

* Start moving tests to dd.assert_eq

* Use assert_eq in datetime filter test

* Resolve most resulting test failures

* Resolve remaining test failures

* Convert over tests

* Convert more tests

* Consolidate select limit cpu/gpu test

* Remove remaining assert_series_equal

* Remove explicit cudf imports from many tests

* Resolve rex test failures

* Remove some additional compute calls

* Consolidate sorting tests with getfixturevalue

* Fix failed join test

* Remove breakpoint

* Use custom assert_eq function for tests

* Resolve test failures / seg faults

* Remove unnecessary testing utils

* Resolve local test failures

* Generalize RAND test

* Avoid closing client if using independent cluster

* Fix failures on Windows

* Resolve black failures

* Make random test variables more clear

* First basic working checkpoint for group by

* Set max pin on antlr4-python-runtime  (#456)

* Set max pin on antlr4-python-runtime due to incompatibilities with fugue_sql

* update comment on antlr max pin version

* Updates to style

* stage pre-commit changes for upstream merge

* Fix black failures

* Updates to Rust formatting

* Fix rust lint and clippy

* Remove jar building step which is no longer needed

* Remove Java from github workflows matrix

* Removes jar and Java references from test.yml

* Update Release workflow to remove references to Java

* Update rust.yml to remove references from linux-build-lib

* Add pre-commit.sh file to provide pre-commit support for Rust in a convenient script

* Removed overlooked jdk references

* cargo clippy auto fixes

* Address all Rust clippy warnings

* Include setuptools-rust in conda build recipe

* Include setuptools-rust in conda build recipe, in host and run

* Adjustments for conda build, committing for others to help with error and see it occurring in CI

* Include sql.yaml in package files

* Include pyarrow in run section of conda build to ensure tests pass

* include setuptools-rust in host and run of conda since removing caused errors

* to_string() method had been removed in rust and not removed here, caused conda run_test.py to fail when this line was hit

* Replace commented out tests with pytest.skip and bump version of pyarrow to 7.0.0

* Fix setup.py syntax issue introduced on last commit by find/replace

* Rename Datafusion -> DataFusion and Apache DataFusion -> Arrow DataFusion

* Fix docs build environment

* Include Rust compiler in docs environment

* Bump Rust compiler version to 1.59

* Ok, well readthedocs didn't like that

* Store libdask_planner.so and retrieve it between github workflows

* Cache the Rust library binary

* Remove Cargo.lock from git

* Remove unused datafusion-expr crate

* Build datafusion at each test step instead of caching binaries

* Remove maven and jar cache steps from test-upstream.yaml

* Removed dangling 'build' workflow step reference

* Lowered PyArrow version to 6.0.1 since cudf has a hard requirement on that version for the version of cudf we are using

* Add Rust build step to test in dask cluster

* Install setuptools-rust for pip to use for bare requirements import

* Include pyarrow 6.0.1 via conda as a bare minimum dependency

* Remove cudf dependency for python 3.9 which is causing build issues on windows

* Address documentation from review

* Install Rust as readthedocs post_create_environment step

* Run rust install non-interactively

* Run rust install non-interactively

* Rust isn't available in PyPi so remove that dependency

* Append ~/.cargo/bin to the PATH

* Print out some environment information for debugging

* Print out some environment information for debugging

* More - Increase verbosity

* More - Increase verbosity

* More - Increase verbosity

* Switch RTD over to use Conda instead of Pip since having issues with Rust and pip

* Try to use mamba for building docs environment

* Partial review suggestion address, checking CI still works

* Skip mistakenly enabled tests

* Use DataFusion master branch, and fix syntax issues related to the version bump

* More updates after bumping DataFusion version to master

* Use actions-rs in github workflows debug flag for setup.py

* Remove setuptools-rust from conda

* Use re-exported Rust types for BuiltinScalarFunction

* Move python imports to TYPE_CHECKING section where applicable

* Address review concerns and remove pre-commit.sh file

* Pin to a specific github rev for DataFusion

Co-authored-by: Richard (Rick) Zamora <[email protected]>
Co-authored-by: Charles Blackmon-Luca <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Ayush Dattagupta <[email protected]>

* Bump DataFusion version (#494)

* bump DataFusion version

* remove unnecessary downcasts and use separate structs for TableSource and TableProvider

* Basic DataFusion Select Functionality (#489)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* Remove print statements

* Default to UTC if tz is None

* Delegate timezone handling to the arrow library

* Updates from review

Co-authored-by: Charles Blackmon-Luca <[email protected]>

* Allow for Cast parsing and logicalplan (#498)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* commit to share with colleague

* updates

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* formatting after upstream merge

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* checkpoint

* Refactor PyExpr by removing From trait, and using recursion to expand expression list for rex calls

* skip test that uses create statement for gpuci

* uncommented pytests

* uncommented pytests

* code cleanup for review

* code cleanup for review

* Enabled more pytest that work now

* Enabled more pytest that work now

* Disable 2 pytests that are causing gpuCI issues. They will be addressed in a follow-up PR

* Mark just the GPU tests as skipped

Co-authored-by: Charles Blackmon-Luca <[email protected]>

* Minor code cleanup in row_type() (#504)

* Minor code cleanup in row_type()

* remove unwrap

* Bump Rust version to 1.60 from 1.59 (#508)

* Improve code for getting column name from expression (#509)

* helper code for getting column name from expression

* Update dask_planner/src/expression.rs

Co-authored-by: Jeremy Dyer <[email protected]>

* Update dask_planner/src/expression.rs

Co-authored-by: Jeremy Dyer <[email protected]>

* fix build

* Improve error handling

Co-authored-by: Jeremy Dyer <[email protected]>

* Update exceptions that are thrown (#507)

* Update exceptions that are thrown

* Remove Java error regex formatting logic. Rust messages will be presented already formatted from Rust itself

* Removed lingering test that was still trying to test out Java specific error messages

* Update dask_planner/src/sql.rs

Co-authored-by: Andy Grove <[email protected]>

* clean up logical_relational_algebra function

Co-authored-by: Andy Grove <[email protected]>

* add support for expr_to_field for Expr::Sort expressions (#515)

* reduce crate dependencies (#516)

* Datafusion dsql explain (#511)

* Planner: Add explain logical plan bindings

* Planner: Add explain plan accessor to PyLogicalPlan

* Python: Add Explain plan plugin

* Python: Register explain plan plugin and add special check to directly return string results

* Planner: Update imports and accessor logic after merge with upstream

* Python: Add sql EXPLAIN tests

* Planner: Replace the pub use with use for LogicalPlan

* Port sort logic to the datafusion planner (#505)

* Implement PySort logical_plan

* Add a sort plan accessor to the logical plan

* Python: Update the sort plugin

* Python: Uncomment tests

* Planner: Update accessor pattern for concrete logical plan implementations

* Test: Address review comments

* add support for expr_to_field for Expr::Sort expressions

* Planner: Update sort expr utilities and import cleanup

* Python: Re-enable skipped sort tests

* Python: Handle case where orderby column name is an alias

* Apply suggestions from code review

Remove redundant unwrap + re-wrap

Co-authored-by: Andy Grove <[email protected]>

* Style: Fix formatting

* Planner: Remove public scope for LogicalPlan import

* Python: Add more complex sort tests with alias that error right now

* Python: Remove old commented code

Co-authored-by: Andy Grove <[email protected]>

* Add helper method to convert LogicalPlan to Python type (#522)

* Add helper method to convert LogicalPlan to Python type

* simplify more

* Support CASE WHEN and BETWEEN (#502)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Bump DataFusion version (#494)

* bump DataFusion version

* remove unnecessary downcasts and use separate structs for TableSource and TableProvider

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* commit to share with colleague

* updates

* checkpoint

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* formatting after upstream merge

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* tests update

* checkpoint

* checkpoint

* Refactor PyExpr by removing From trait, and using recursion to expand expression list for rex calls

* skip test that uses create statement for gpuci

* Basic DataFusion Select Functionality (#489)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* Remove print statements

* Default to UTC if tz is None

* Delegate timezone handling to the arrow library

* Updates from review

Co-authored-by: Charles Blackmon-Luca <[email protected]>

* updates for expression

* uncommented pytests

* uncommented pytests

* code cleanup for review

* code cleanup for review

* Enabled more pytest that work now

* Enabled more pytest that work now

* Output Expression as String when BinaryExpr does not contain a named alias

* Output Expression as String when BinaryExpr does not contain a named alias

* Disable 2 pytests that are causing gpuCI issues. They will be addressed in a follow-up PR

* Handle Between operation for case-when

* adjust timestamp casting

* Refactor projection _column_name() logic to the _column_name logic in expression.rs

* removed println! statements

* Updates from review

* refactor String::from() to .to_string()

* When no ELSE statement is present in CASE/WHEN statement default to None

* Remove println

* Re-enable rex test that previously could not be run

Co-authored-by: Charles Blackmon-Luca <[email protected]>
Co-authored-by: Andy Grove <[email protected]>

* Upgrade to DataFusion 8.0.0 (#533)

* upgrade to DataFusion 8.0.0

* fmt

* Enable passing tests (#539)

* Uncomment working pytests

* Enable PyTests that are now passing

* Datafusion crossjoin (#521)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Bump DataFusion version (#494)

* bump DataFusion version

* remove unnecessary downcasts and use separate structs for TableSource and TableProvider

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* commit to share with colleague

* updates

* checkpoint

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* formatting after upstream merge

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* tests update

* checkpoint

* checkpoint

* Refactor PyExpr by removing From trait, and using recursion to expand expression list for rex calls

* skip test that uses create statement for gpuci

* Basic DataFusion Select Functionality (#489)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* Remove print statements

* Default to UTC if tz is None

* Delegate timezone handling to the arrow library

* Updates from review

Co-authored-by: Charles Blackmon-Luca <[email protected]>

* updates for expression

* uncommented pytests

* uncommented pytests

* code cleanup for review

* code cleanup for review

* Enabled more pytest that work now

* Enabled more pytest that work now

* Output Expression as String when BinaryExpr does not contain a named alias

* Output Expression as String when BinaryExpr does not contain a named alias

* Disable 2 pytests that are causing gpuCI issues. They will be addressed in a follow-up PR

* Handle Between operation for case-when

* adjust timestamp casting

* Refactor projection _column_name() logic to the _column_name logic in expression.rs

* removed println! statements

* Updates from review

* refactor String::from() to .to_string()

* Fix mappings

* Add cross_join.py and cross_join.rs

* Add pytest for cross_join

* Address review comments

* Fix module import issue where typo was introduced

* manually supply test fixtures

* Remove request from test method signature

Co-authored-by: Charles Blackmon-Luca <[email protected]>
Co-authored-by: Andy Grove <[email protected]>

* Implement TryFrom for plans (#543)

* impl TryFrom for plans

* fix todo

* fix compilation error

* code cleanup

* revert change to error message

* simplify code

* cargo fmt

* Support for LIMIT clause with DataFusion (#529)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Bump DataFusion version (#494)

* bump DataFusion version

* remove unnecessary downcasts and use separate structs for TableSource and TableProvider

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* commit to share with colleague

* updates

* checkpoint

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* formatting after upstream merge

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* tests update

* checkpoint

* checkpoint

* Refactor PyExpr by removing From trait, and using recursion to expand expression list for rex calls

* skip test that uses create statement for gpuci

* Basic DataFusion Select Functionality (#489)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* Remove print statements

* Default to UTC if tz is None

* Delegate timezone handling to the arrow library

* Updates from review

Co-authored-by: Charles Blackmon-Luca <[email protected]>

* updates for expression

* uncommented pytests

* uncommented pytests

* code cleanup for review

* code cleanup for review

* Enabled more pytest that work now

* Enabled more pytest that work now

* Output Expression as String when BinaryExpr does not contain a named alias

* Output Expression as String when BinaryExpr does not contain a named alias

* Disable 2 pytests that are causing gpuCI issues. They will be addressed in a follow-up PR

* Handle Between operation for case-when

* adjust timestamp casting

* Refactor projection _column_name() logic to the _column_name logic in expression.rs

* removed println! statements

* Updates from review

* Add Offset and point to repo with offset in datafusion

* Introduce offset

* limit updates

* commit before upstream merge

* Code formatting

* update Cargo.toml to use Arrow-DataFusion version with LIMIT logic

* Bump DataFusion version to get changes around variant_name()

* Use map partitions for determining the offset

* Refactor offset partition func

* Update to use TryFrom logic

* Add cloudpickle to independent scheduler requirements

Co-authored-by: Charles Blackmon-Luca <[email protected]>
Co-authored-by: Andy Grove <[email protected]>

* Support Joins using DataFusion planner/parser (#512)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Bump DataFusion version (#494)

* bump DataFusion version

* remove unnecessary downcasts and use separate structs for TableSource and TableProvider

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* commit to share with colleague

* updates

* checkpoint

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* formatting after upstream merge

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* tests update

* checkpoint

* checkpoint

* Refactor PyExpr by removing From trait, and using recursion to expand expression list for rex calls

* skip test that uses create statement for gpuci

* Basic DataFusion Select Functionality (#489)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* Remove print statements

* Default to UTC if tz is None

* Delegate timezone handling to the arrow library

* Updates from review

Co-authored-by: Charles Blackmon-Luca <[email protected]>

* updates for expression

* uncommented pytests

* uncommented pytests

* code cleanup for review

* code cleanup for review

* Enabled more pytest that work now

* Enabled more pytest that work now

* Output Expression as String when BinaryExpr does not contain a named alias

* Output Expression as String when BinaryExpr does not contain a named alias

* Disable 2 pytests that are causing gpuCI issues. They will be addressed in a follow-up PR

* Handle Between operation for case-when

* adjust timestamp casting

* Refactor projection _column_name() logic to the _column_name logic in expression.rs

* removed println! statements

* introduce join getCondition() logic for retrieving the combining Rex logic for joining

* Updates from review

* Add Offset and point to repo with offset in datafusion

* Introduce offset

* limit updates

* commit before upstream merge

* Code formatting

* update Cargo.toml to use Arrow-DataFusion version with LIMIT logic

* Bump DataFusion version to get changes around variant_name()

* Use map partitions for determining the offset

* Merge with upstream

* Rename underlying DataContainer's DataFrame instance to match the column container names

* Adjust ColumnContainer mapping after join.py logic to ensure the backend mapping is reset

* Add enumerate to column_{i} generation string to ensure columns exist in both dataframes

* Adjust join schema logic to perform merge instead of join on rust side to avoid name collisions

* Handle DataFusion COUNT(UInt8(1)) as COUNT(*)

* commit before merge

* Update function for gathering index of an expression

* Update for review check

* Adjust RelDataType to retrieve fully qualified column names

* Adjust base.py to get fully qualified column name

* Enable passing pytests in test_join.py

* Adjust keys provided by getting backend column mapping name

* Adjust output_col to not use the backend_column name for special reserved exprs

* uncomment cross join pytest which works now

* Uncomment passing pytests in test_select.py

* Review updates

* Add back complex join case condition, not just cross join but 'complex' joins

Co-authored-by: Charles Blackmon-Luca <[email protected]>
Co-authored-by: Andy Grove <[email protected]>

* Datafusion is not (#557)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Bump DataFusion version (#494)

* bump DataFusion version

* remove unnecessary downcasts and use separate structs for TableSource and TableProvider

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* commit to share with colleague

* updates

* checkpoint

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* formatting after upstream merge

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* tests update

* checkpoint

* checkpoint

* Refactor PyExpr by removing From trait, and using recursion to expand expression list for rex calls

* skip test that uses create statement for gpuci

* Basic DataFusion Select Functionality (#489)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* Remove print statements

* Default to UTC if tz is None

* Delegate timezone handling to the arrow library

* Updates from review

Co-authored-by: Charles Blackmon-Luca <[email protected]>

* updates for expression

* uncommented pytests

* uncommented pytests

* code cleanup for review

* code cleanup for review

* Enabled more pytest that work now

* Enabled more pytest that work now

* Output Expression as String when BinaryExpr does not contain a named alias

* Output Expression as String when BinaryExpr does not contain a named alias

* Disable 2 pytests that are causing gpuCI issues. They will be addressed in a follow-up PR

* Handle Between operation for case-when

* adjust timestamp casting

* Refactor projection _column_name() logic to the _column_name logic in expression.rs

* removed println! statements

* introduce join getCondition() logic for retrieving the combining Rex logic for joining

* Updates from review

* Add Offset and point to repo with offset in datafusion

* Introduce offset

* limit updates

* commit before upstream merge

* Code formatting

* update Cargo.toml to use Arrow-DataFusion version with LIMIT logic

* Bump DataFusion version to get changes around variant_name()

* Use map partitions for determining the offset

* Merge with upstream

* Rename underlying DataContainer's DataFrame instance to match the column container names

* Adjust ColumnContainer mapping after join.py logic to ensure the backend mapping is reset

* Add enumerate to column_{i} generation string to ensure columns exist in both dataframes

* Adjust join schema logic to perform merge instead of join on rust side to avoid name collisions

* Handle DataFusion COUNT(UInt8(1)) as COUNT(*)

* commit before merge

* Update function for gathering index of an expression

* Update for review check

* Adjust RelDataType to retrieve fully qualified column names

* Adjust base.py to get fully qualified column name

* Enable passing pytests in test_join.py

* Adjust keys provided by getting backend column mapping name

* Adjust output_col to not use the backend_column name for special reserved exprs

* uncomment cross join pytest which works now

* Uncomment passing pytests in test_select.py

* Review updates

* Add back complex join case condition, not just cross join but 'complex' joins

* Add support for 'is not null' clause

Co-authored-by: Charles Blackmon-Luca <[email protected]>
Co-authored-by: Andy Grove <[email protected]>

* [REVIEW] Add support for `UNION` (#542)

Fixes: #470

This PR adds UNION support in dask-sql which uses datafusion's union logic.

* [REVIEW] Fix issue with duplicates in column renaming (#559)

* initial commit

* add union.rs

* fix

* debug

* updates

* handle multiple inputs

* xfail

* remove xfails

* style

* cleanup

* un-xfail

* address reviews

* fix projection issue

* address reviews

* enable tests (#560)

Looks like LIMIT has been ported, hence enabling more tests across different files.

* Add CODEOWNERS file (#562)

* Add CODEOWNERS file

* Can only specify users with write access

* Remove accidental dev commit

* Upgrade DataFusion version & support non-equijoin join conditions (#566)

* use latest datafusion

* fix regression

* remove DaskTableProvider

* simplify code

* Add ayushdg and galipremsagar to rust CODEOWNERS (#572)

* Enable DataFusion CBO and introduce DaskSqlOptimizer (#558)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Bump DataFusion version (#494)

* bump DataFusion version

* remove unnecessary downcasts and use separate structs for TableSource and TableProvider

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* commit to share with colleague

* updates

* checkpoint

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* formatting after upstream merge

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* tests update

* checkpoint

* checkpoint

* Refactor PyExpr by removing From trait, and using recursion to expand expression list for rex calls

* skip test that uses create statement for gpuci

* Basic DataFusion Select Functionality (#489)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* Remove print statements

* Default to UTC if tz is None

* Delegate timezone handling to the arrow library

* Updates from review

Co-authored-by: Charles Blackmon-Luca <[email protected]>

* updates for expression

* uncommented pytests

* uncommented pytests

* code cleanup for review

* code cleanup for review

* Enabled more pytest that work now

* Enabled more pytest that work now

* Output Expression as String when BinaryExpr does not contain a named alias

* Output Expression as String when BinaryExpr does not contain a named alias

* Disable 2 pytests that are causing gpuCI issues. They will be addressed in a follow-up PR

* Handle Between operation for case-when

* adjust timestamp casting

* Refactor projection _column_name() logic to the _column_name logic in expression.rs

* removed println! statements

* introduce join getCondition() logic for retrieving the combining Rex logic for joining

* Updates from review

* Add Offset and point to repo with offset in datafusion

* Introduce offset

* limit updates

* commit before upstream merge

* Code formatting

* update Cargo.toml to use Arrow-DataFusion version with LIMIT logic

* Bump DataFusion version to get changes around variant_name()

* Use map partitions for determining the offset

* Merge with upstream

* Rename underlying DataContainer's DataFrame instance to match the column container names

* Adjust ColumnContainer mapping after join.py logic to ensure the backend mapping is reset

* Add enumerate to column_{i} generation string to ensure columns exist in both dataframes

* Adjust join schema logic to perform merge instead of join on rust side to avoid name collisions

* Handle DataFusion COUNT(UInt8(1)) as COUNT(*)

* commit before merge

* Update function for gathering index of an expression

* Update for review check

* Adjust RelDataType to retrieve fully qualified column names

* Adjust base.py to get fully qualified column name

* Enable passing pytests in test_join.py

* Adjust keys provided by getting backend column mapping name

* Adjust output_col to not use the backend_column name for special reserved exprs

* uncomment cross join pytest which works now

* Uncomment passing pytests in test_select.py

* Review updates

* Add back complex join case condition, not just cross join but 'complex' joins

* Enable DataFusion CBO logic

* Disable EliminateFilter optimization rule

* updates

* Disable tests that hit CBO generated plan edge cases of yet to be implemented logic

* [REVIEW] - Modify sql.skip_optimize to use dask_config.get and remove used method parameter

* [REVIEW] - change name of configuration from skip_optimize to optimize

* [REVIEW] - Add OptimizeException catch and raise statements back

* Found issue where backend column names which are results of a single aggregate resulting column, COUNT(*) for example, need to get the first agg df column since names are not valid

* Remove SQL from OptimizationException

* skip tests that CBO plan reorganization causes missing features to be present

Co-authored-by: Charles Blackmon-Luca <[email protected]>
Co-authored-by: Andy Grove <[email protected]>

* Only use the specific DataFusion crates that we need (#568)

* use specific datafusion crates

* clean up imports

* Fix some clippy warnings (#574)

* Datafusion invalid projection (#571)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Bump DataFusion version (#494)

* bump DataFusion version

* remove unnecessary downcasts and use separate structs for TableSource and TableProvider

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* commit to share with colleague

* updates

* checkpoint

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* formatting after upstream merge

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* tests update

* checkpoint

* checkpoint

* Refactor PyExpr by removing From trait, and using recursion to expand expression list for rex calls

* skip test that uses create statement for gpuci

* Basic DataFusion Select Functionality (#489)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* Remove print statements

* Default to UTC if tz is None

* Delegate timezone handling to the arrow library

* Updates from review

Co-authored-by: Charles Blackmon-Luca <[email protected]>

* updates for expression

* uncommented pytests

* uncommented pytests

* code cleanup for review

* code cleanup for review

* Enabled more pytest that work now

* Enabled more pytest that work now

* Output Expression as String when BinaryExpr does not contain a named alias

* Output Expression as String when BinaryExpr does not contain a named alias

* Disable 2 pytests that are causing gpuCI issues. They will be addressed in a follow-up PR

* Handle Between operation for case-when

* adjust timestamp casting

* Refactor projection _column_name() logic to the _column_name logic in expression.rs

* removed println! statements

* introduce join getCondition() logic for retrieving the combining Rex logic for joining

* Updates from review

* Add Offset and point to repo with offset in datafusion

* Introduce offset

* limit updates

* commit before upstream merge

* Code formatting

* update Cargo.toml to use Arrow-DataFusion version with LIMIT logic

* Bump DataFusion version to get changes around variant_name()

* Use map partitions for determining the offset

* Merge with upstream

* Rename underlying DataContainer's DataFrame instance to match the column container names

* Adjust ColumnContainer mapping after join.py logic to ensure the backend mapping is reset

* Add enumerate to column_{i} generation string to ensure columns exist in both dataframes

* Adjust join schema logic to perform merge instead of join on rust side to avoid name collisions

* Handle DataFusion COUNT(UInt8(1)) as COUNT(*)

* commit before merge

* Update function for gathering index of an expression

* Update for review check

* Adjust RelDataType to retrieve fully qualified column names

* Adjust base.py to get fully qualified column name

* Enable passing pytests in test_join.py

* Adjust keys provided by getting backend column mapping name

* Adjust output_col to not use the backend_column name for special reserved exprs

* uncomment cross join pytest which works now

* Uncomment passing pytests in test_select.py

* Review updates

* Add back complex join case condition, not just cross join but 'complex' joins

* Enable DataFusion CBO logic

* Disable EliminateFilter optimization rule

* updates

* Disable tests that hit CBO generated plan edge cases of yet to be implemented logic

* [REVIEW] - Modify sql.skip_optimize to use dask_config.get and remove used method parameter

* [REVIEW] - change name of configuration from skip_optimize to optimize

* [REVIEW] - Add OptimizeException catch and raise statements back

* Found issue where backend column names which are results of a single aggregate resulting column, COUNT(*) for example, need to get the first agg df column since names are not valid

* Remove SQL from OptimizationException

* skip tests that CBO plan reorganization causes missing features to be present

* If TableScan contains projections, use those instead of all of the table columns to limit the columns read during table_scan

* [REVIEW] remove compute(), remove temp row_type variable

* [REVIEW] - Add test for projection pushdown

* [REVIEW] - Add some more parametrized test combinations

* [REVIEW] - Use iterator instead of for loop and simplify contains_projections

* [REVIEW] - merge upstream and adjust imports

* [REVIEW] - Rename pytest function and remove duplicate table creation

Co-authored-by: Charles Blackmon-Luca <[email protected]>
Co-authored-by: Andy Grove <[email protected]>

* Datafusion upstream merge (#576)

* Add basic predicate-pushdown optimization (#433)

* basic predicate-pushdown support

* remove explicit Dispatch class

* use _Frame.fillna

* cleanup comments

* test coverage

* improve test coverage

* add xfail test for dt accessor in predicate and fix test_show.py

* fix some naming issues

* add config and use assert_eq

* add logging events when predicate-pushdown bails

* move bail logic earlier in function

* address easier code review comments

* typo fix

* fix creation_info access bug

* convert any expression to DNF

* csv test coverage

* include IN coverage

* improve test rigor

* address code review

* skip parquet tests when deps are not installed

* fix bug

* add pyarrow dep to cluster workers

* roll back test skipping changes

Co-authored-by: Charles Blackmon-Luca <[email protected]>

* Add workflow to keep datafusion dev branch up to date (#440)

* Update gpuCI `RAPIDS_VER` to `22.06` (#434)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Bump black to 22.3.0 (#443)

* Check for ucx-py nightlies when updating gpuCI (#441)

* Simplify gpuCI updating workflow

* Add check for cuML nightly version

* Add handling for newer `prompt_toolkit` versions in cmd tests (#447)

* Add handling for newer prompt-toolkit version

* Place compatibility code in _compat

* Fix version for gha-find-replace (#446)

* Update versions of Java dependencies (#445)

* Update versions for java dependencies with cves

* Rerun tests

* Update jackson databind version (#449)

* Update versions for java dependencies with cves

* Rerun tests

* update jackson-databind dependency

* Disable SQL server functionality (#448)

* Disable SQL server functionality

* Update docs/source/server.rst

Co-authored-by: Ayush Dattagupta <[email protected]>

* Disable server at lowest possible level

* Skip all server tests

* Add tests to ensure server is disabled

* Fix CVE fix test

Co-authored-by: Ayush Dattagupta <[email protected]>

* Update dask pinnings for release (#450)

* Add Java source code to source distribution (#451)

* Bump `httpclient` dependency (#453)

* Revert "Disable SQL server functionality (#448)"

This reverts commit 37a3a61fb13b0c56fcc10bf8ef01f4885a58dae8.

* Bump httpclient version

* Unpin Dask/distributed versions (#452)

* Unpin dask/distributed post release

* Remove dask/distributed version ceiling

* Add jsonschema to ci testing (#454)

* Add jsonschema to ci env

* Fix typo in config schema

* Switch tests from `pd.testing.assert_frame_equal` to `dd.assert_eq` (#365)

* Start moving tests to dd.assert_eq

* Use assert_eq in datetime filter test

* Resolve most resulting test failures

* Resolve remaining test failures

* Convert over tests

* Convert more tests

* Consolidate select limit cpu/gpu test

* Remove remaining assert_series_equal

* Remove explicit cudf imports from many tests

* Resolve rex test failures

* Remove some additional compute calls

* Consolidate sorting tests with getfixturevalue

* Fix failed join test

* Remove breakpoint

* Use custom assert_eq function for tests

* Resolve test failures / seg faults

* Remove unnecessary testing utils

* Resolve local test failures

* Generalize RAND test

* Avoid closing client if using independent cluster

* Fix failures on Windows

* Resolve black failures

* Make random test variables more clear

* Set max pin on antlr4-python-runtime (#456)

* Set max pin on antlr4-python-runtime due to incompatibilities with fugue_sql

* update comment on antlr max pin version

* Move / minimize number of cudf / dask-cudf imports (#480)

* Move / minimize number of cudf / dask-cudf imports

* Add tests for GPU-related errors

* Fix unbound local error

* Fix ddf value error

* Use `map_partitions` to compute LIMIT / OFFSET (#517)

* Use map_partitions to compute limit / offset

* Use partition_info to extract partition_index

* Use `dev` images for independent cluster testing (#518)

* Switch to dask dev images

* Use mamba for conda installs in images

* Remove sleep call for installation

* Use timeout / until to wait for cluster to be initialized

* Add documentation for FugueSQL integrations (#523)

* Add documentation for FugueSQL integrations

* Minor nitpick around autodoc obj -> class

* Timestampdiff support (#495)

* added timestampdiff

* initial work for timestampdiff

* Added test cases for timestampdiff

* Update interval month dtype mapping

* Add datetimesubOperator

* Uncomment timestampdiff literal tests

* Update logic for handling interval_months for pandas/cudf series and scalars

* Add negative diff testcases, and gpu tests

* Update reinterpret and timedelta to explicitly cast to int64 instead of int

* Simplify cast_column_to_type mapping logic

* Add scalar handling to castOperation and reuse it for reinterpret

Co-authored-by: rajagurnath <[email protected]>

* Relax jsonschema testing dependency (#546)

* Update upstream testing workflows (#536)

* Use dask nightly conda packages for upstream testing

* Add independent cluster testing to nightly upstream CI [test-upstream]

* Remove unnecessary dask install [test-upstream]

* Remove strict channel policy to allow nightly dask installs

* Use nightly Dask packages in independent cluster test [test-upstream]

* Use channels argument to install Dask conda nightlies [test-upstream]

* Fix channel expression

* [test-upstream]

* Need to add mamba update command to get dask conda nightlies

* Use conda nightlies for dask-sql import test

* Add import test to upstream nightly tests

* [test-upstream]

* Make sure we have nightly Dask for import tests [test-upstream]

* Fix pyarrow / cloudpickle failures in cluster testing (#553)

* Explicitly install libstdcxx-ng in clusters

* Make pyarrow dependency consistent across testing

* Make libstdcxx-ng dep a min version

* Add cloudpickle to cluster dependencies

* cloudpickle must be in the scheduler environment

* Bump cloudpickle version

* Move cloudpickle install to workers

* Fix pyarrow constraint in cluster spec

* Use bash -l as default entrypoint for all jobs (#552)

* Constrain dask/distributed for release (#563)

* Unpin dask/distributed for development (#564)

* Unpin dask/distributed post release

* Remove dask/distributed version ceiling

* update dask-sphinx-theme (#567)

* Introduce subquery.py to handle subquery expressions

* update ordering

* Make sure scheduler has Dask nightlies in upstream cluster testing (#573)

* Make sure scheduler has Dask nightlies in upstream cluster testing

* empty commit to [test-upstream]

* Update gpuCI `RAPIDS_VER` to `22.08` (#565)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* updates

* Remove startswith function merged by mistake

* [REVIEW] - Remove instances meant for the currently removed timestampdiff

* Modify test environment pinnings to cover minimum versions (#555)

* Remove black/isort deps as we prefer pre-commit

* Unpin all non python/jdk dependencies

* Minor package corrections for py3.9 jdk11 env

* Set min version constraints for all non-testing dependencies

* Pin all non-test deps for 3.8 testing

* Bump sklearn min version to 1.0.0

* Bump pyarrow min version to 1.0.1

* Fix pip notation for fugue

* Use unpinned deps for cluster testing for now

* Add fugue deps to environments, bump pandas to 1.0.2

* Add back antlr4 version ceiling

* Explicitly mark all fugue dependencies

* Alter test_analyze to avoid rtol

* Bump pandas to 1.0.5 to fix upstream numpy issues

* Alter datetime casting util to dodge pandas casting failures

* Bump pandas to 1.1.0 for groupby dropna support

* Simplify string dtype check for get_supported_aggregations

* Add check_dtype=False back to test_group_by_nan

* Bump cluster to python 3.9

* Bump fastapi to 0.69.0, resolve remaining JDBC failures

* Typo - correct pandas version

* Generalize test_multi_case_when's dtype check

* Bump pandas to 1.1.1 to resolve flaky test failures

* Constrain mlflow for windows python 3.8 testing

* Selectors don't work for conda env files

* Problems seem to persist in 1.1.1, bump to 1.1.2

* Remove accidental debug changes

* [test-upstream]

* Use python 3.9 for upstream cluster testing [test-upstream]

* Updated missed pandas pinning

* Unconstrain mlflow to see if Windows failures persist

* Add min version for protobuf

* Bump pyarrow min version to allow for newer protobuf versions

* Don't move jar to local mvn repo (#579)

* Add tests for intersection

* Add tests for intersection

* Add another intersection test, even more simple but for testing raw intersection

* Use Timedelta when doing ReduceOperation(s) against datetime64 dtypes

* Cleanup

* Use an either/or strategy for converting to Timedelta objects

* Support more than 2 operands for Timedelta conversions

* fix merge issues, is_frame() function of call.py was removed accidentally before

* Remove pytest …
@jdye64 deleted the datafusion-aggregate branch January 30, 2023 22:21
Labels
datafusion Related to work in DataFusion
6 participants