Switch to Arrow DataFusion SQL parser (#788)
* First pass at datafusion parsing

* updates

* updates

* updates

* DaskSchema implementation for Python in Rust

* Update mappings so that Python types map to PyArrow types, which are also the types used by DataFusion statements (logical plans)

* Add ability to add columns to an existing DaskTable

* Add ability for tables to be added to the DaskSchema

* Completion of _get_ral() function in dask-sql. Still does not actually compute yet

* Finished converting base class and DaskRelDataType and DaskRelDataTypeField

* Can make a very simple pass of a projection on a TableScan operation query work now

* updates

* Allow for the rough registration of Schemas to the DaskSQLContext

* pytest test_context.py working/checkpoint

* all unit tests passing/checkpoint

* checkpoint

* Update on test_select.py

* Refactor setup.py

* Refactored Rust code to traverse the AST SQL parse tree

* Datafusion aggregate (#471)

* Add basic predicate-pushdown optimization (#433)

* basic predicate-pushdown support

* remove explicit Dispatch class

* use _Frame.fillna

* cleanup comments

* test coverage

* improve test coverage

* add xfail test for dt accessor in predicate and fix test_show.py

* fix some naming issues

* add config and use assert_eq

* add logging events when predicate-pushdown bails

* move bail logic earlier in function

* address easier code review comments

* typo fix

* fix creation_info access bug

* convert any expression to DNF

* csv test coverage

* include IN coverage

* improve test rigor

* address code review

* skip parquet tests when deps are not installed

* fix bug

* add pyarrow dep to cluster workers

* roll back test skipping changes

Co-authored-by: Charles Blackmon-Luca <[email protected]>
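
Note on the "convert any expression to DNF" step in the predicate-pushdown work above: a minimal Python sketch of what normalizing a filter to disjunctive normal form can look like. The nested-tuple input format and the `to_dnf` name are assumptions made for illustration, not the code in this commit. The output uses the list-of-conjunctions layout that Dask and PyArrow accept for parquet `filters`.

```python
from itertools import product


def to_dnf(expr):
    """Return expr as a list of conjunctions (an OR of ANDs).

    expr is either a leaf predicate tuple such as ("x", ">", 1), or a
    nested tuple ("and", lhs, rhs) / ("or", lhs, rhs). This input format
    is hypothetical; NOT is deliberately left out of the sketch.
    """
    op = expr[0]
    if op == "or":
        # OR: the alternatives of both sides are simply concatenated.
        return to_dnf(expr[1]) + to_dnf(expr[2])
    if op == "and":
        # AND: distribute over every pair of alternatives.
        return [lhs + rhs for lhs, rhs in product(to_dnf(expr[1]), to_dnf(expr[2]))]
    # Leaf predicate: a single conjunction holding a single term.
    return [[expr]]


# x > 1 AND (y == 2 OR y == 3)
filters = to_dnf(("and", ("x", ">", 1), ("or", ("y", "==", 2), ("y", "==", 3))))
```

Each inner list is one AND group, and the outer list ORs the groups together, so `filters` here holds two groups that both contain the `x > 1` term.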

* Add workflow to keep datafusion dev branch up to date (#440)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Updates to dates and parsing dates like postgresql does

* Update gpuCI `RAPIDS_VER` to `22.06` (#434)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Bump black to 22.3.0 (#443)

* Check for ucx-py nightlies when updating gpuCI (#441)

* Simplify gpuCI updating workflow

* Add check for cuML nightly version

* Refactored to adjust for better type management

* Refactor schema and statements

* update types

* fix syntax issues and renamed function name calls

* Add handling for newer `prompt_toolkit` versions in cmd tests (#447)

* Add handling for newer prompt-toolkit version

* Place compatibility code in _compat

* Fix version for gha-find-replace (#446)

* Improved error handling and code clean up

* move pieces of logical.rs to separate files to ensure code readability

* left join working

* Update versions of Java dependencies (#445)

* Update versions for java dependencies with cves

* Rerun tests

* Update jackson databind version (#449)

* Update versions for java dependencies with cves

* Rerun tests

* update jackson-databind dependency

* Disable SQL server functionality (#448)

* Disable SQL server functionality

* Update docs/source/server.rst

Co-authored-by: Ayush Dattagupta <[email protected]>

* Disable server at lowest possible level

* Skip all server tests

* Add tests to ensure server is disabled

* Fix CVE fix test

Co-authored-by: Ayush Dattagupta <[email protected]>

* Update dask pinnings for release (#450)

* Add Java source code to source distribution (#451)

* Bump `httpclient` dependency (#453)

* Revert "Disable SQL server functionality (#448)"

This reverts commit 37a3a61fb13b0c56fcc10bf8ef01f4885a58dae8.

* Bump httpclient version

* Unpin Dask/distributed versions (#452)

* Unpin dask/distributed post release

* Remove dask/distributed version ceiling

* Add jsonschema to ci testing (#454)

* Add jsonschema to ci env

* Fix typo in config schema

* Switch tests from `pd.testing.assert_frame_equal` to `dd.assert_eq` (#365)

* Start moving tests to dd.assert_eq

* Use assert_eq in datetime filter test

* Resolve most resulting test failures

* Resolve remaining test failures

* Convert over tests

* Convert more tests

* Consolidate select limit cpu/gpu test

* Remove remaining assert_series_equal

* Remove explicit cudf imports from many tests

* Resolve rex test failures

* Remove some additional compute calls

* Consolidate sorting tests with getfixturevalue

* Fix failed join test

* Remove breakpoint

* Use custom assert_eq function for tests

* Resolve test failures / seg faults

* Remove unnecessary testing utils

* Resolve local test failures

* Generalize RAND test

* Avoid closing client if using independent cluster

* Fix failures on Windows

* Resolve black failures

* Make random test variables more clear

* First basic working checkpoint for group by

* Set max pin on antlr4-python-runtime (#456)

* Set max pin on antlr4-python-runtime due to incompatibilities with fugue_sql

* update comment on antlr max pin version

* Updates to style

* stage pre-commit changes for upstream merge

* Fix black failures

* Updates to Rust formatting

* Fix rust lint and clippy

* Remove jar building step which is no longer needed

* Remove Java from github workflows matrix

* Removes jar and Java references from test.yml

* Update Release workflow to remove references to Java

* Update rust.yml to remove references from linux-build-lib

* Add pre-commit.sh file to provide pre-commit support for Rust in a convenient script

* Removed overlooked jdk references

* cargo clippy auto fixes

* Address all Rust clippy warnings

* Include setuptools-rust in conda build recipe

* Include setuptools-rust in conda build recipe, in host and run

* Adjustments for conda build, committing for others to help with error and see it occurring in CI

* Include sql.yaml in package files

* Include pyarrow in run section of conda build to ensure tests pass

* include setuptools-rust in host and run of conda since removing caused errors

* Remove call to to_string(), which had been removed on the Rust side but not here; it caused conda run_test.py to fail when this line was hit

* Replace commented out tests with pytest.skip and bump version of pyarrow to 7.0.0

* Fix setup.py syntax issue introduced on last commit by find/replace

* Rename Datafusion -> DataFusion and Apache DataFusion -> Arrow DataFusion

* Fix docs build environment

* Include Rust compiler in docs environment

* Bump Rust compiler version to 1.59

* Ok, well readthedocs didn't like that

* Store libdask_planner.so and retrieve it between github workflows

* Cache the Rust library binary

* Remove Cargo.lock from git

* Remove unused datafusion-expr crate

* Build datafusion at each test step instead of caching binaries

* Remove maven and jar cache steps from test-upstream.yaml

* Removed dangling 'build' workflow step reference

* Lowered PyArrow version to 6.0.1 since cudf has a hard requirement on that version for the version of cudf we are using

* Add Rust build step to test in dask cluster

* Install setuptools-rust for pip to use for bare requirements import

* Include pyarrow 6.0.1 via conda as a bare minimum dependency

* Remove cudf dependency for python 3.9 which is causing build issues on windows

* Address documentation from review

* Install Rust as readthedocs post_create_environment step

* Run rust install non-interactively

* Run rust install non-interactively

* Rust isn't available in PyPi so remove that dependency

* Append ~/.cargo/bin to the PATH

* Print out some environment information for debugging

* Print out some environment information for debugging

* More - Increase verbosity

* More - Increase verbosity

* More - Increase verbosity

* Switch RTD over to use Conda instead of Pip since having issues with Rust and pip

* Try to use mamba for building docs environment

* Partial review suggestion address, checking CI still works

* Skip mistakenly enabled tests

* Use DataFusion master branch, and fix syntax issues related to the version bump

* More updates after bumping DataFusion version to master

* Use actions-rs in github workflows debug flag for setup.py

* Remove setuptools-rust from conda

* Use re-exported Rust types for BuiltinScalarFunction

* Move python imports to TYPE_CHECKING section where applicable

* Address review concerns and remove pre-commit.sh file

* Pin to a specific github rev for DataFusion

Co-authored-by: Richard (Rick) Zamora <[email protected]>
Co-authored-by: Charles Blackmon-Luca <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Ayush Dattagupta <[email protected]>

* Bump DataFusion version (#494)

* bump DataFusion version

* remove unnecessary downcasts and use separate structs for TableSource and TableProvider

* Basic DataFusion Select Functionality (#489)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* Remove print statements

* Default to UTC if tz is None

* Delegate timezone handling to the arrow library

* Updates from review

Co-authored-by: Charles Blackmon-Luca <[email protected]>
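
Note on the SqlTypeName items above ("use str values instead of Rust Enums", "expose fromString method", "accept INT as INTEGER"): the sketch below shows the same idea in pure Python. It is an illustration only. The real enum lives in the Rust planner and is exposed through bindings; the member list, the `_ALIASES` table, and the dtype mapping here are hypothetical.

```python
from enum import Enum

# Hypothetical alias table: alternate SQL spellings -> canonical member names.
_ALIASES = {"INT": "INTEGER"}


class SqlTypeName(Enum):
    BIGINT = "BIGINT"
    BOOLEAN = "BOOLEAN"
    INTEGER = "INTEGER"
    VARCHAR = "VARCHAR"

    @classmethod
    def from_string(cls, name: str) -> "SqlTypeName":
        # Normalize case and whitespace, resolve aliases, then look up by name.
        key = name.strip().upper()
        return cls[_ALIASES.get(key, key)]


# Python Enum members are hashable, so they can be used as dict keys
# in a type-mapping table (the dtype strings are illustrative).
DTYPES = {SqlTypeName.INTEGER: "int32", SqlTypeName.BIGINT: "int64"}
```

Parsing strings into enum members at the boundary keeps later type checks as identity comparisons rather than string comparisons. An unknown name raises `KeyError` in this sketch.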

* Allow for Cast parsing and logicalplan (#498)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* commit to share with colleague

* updates

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* formatting after upstream merge

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* checkpoint

* Refactor PyExpr by removing From trait, and using recursion to expand expression list for rex calls

* skip test that uses create statement for gpuci

* uncommented pytests

* uncommented pytests

* code cleanup for review

* code cleanup for review

* Enabled more pytest that work now

* Enabled more pytest that work now

* Disable 2 pytests that are causing gpuCI issues. They will be addressed in a follow-up PR

* Mark just the GPU tests as skipped

Co-authored-by: Charles Blackmon-Luca <[email protected]>

* Minor code cleanup in row_type() (#504)

* Minor code cleanup in row_type()

* remove unwrap

* Bump Rust version to 1.60 from 1.59 (#508)

* Improve code for getting column name from expression (#509)

* helper code for getting column name from expression

* Update dask_planner/src/expression.rs

Co-authored-by: Jeremy Dyer <[email protected]>

* Update dask_planner/src/expression.rs

Co-authored-by: Jeremy Dyer <[email protected]>

* fix build

* Improve error handling

Co-authored-by: Jeremy Dyer <[email protected]>

* Update exceptions that are thrown (#507)

* Update exceptions that are thrown

* Remove Java error regex formatting logic. Rust messages will be presented already formatted from Rust itself

* Removed lingering test that was still trying to test out Java specific error messages

* Update dask_planner/src/sql.rs

Co-authored-by: Andy Grove <[email protected]>

* clean up logical_relational_algebra function

Co-authored-by: Andy Grove <[email protected]>

* add support for expr_to_field for Expr::Sort expressions (#515)

* reduce crate dependencies (#516)

* Datafusion dsql explain (#511)

* Planner: Add explain logical plan bindings

* Planner: Add explain plan accessor to PyLogicalPlan

* Python: Add Explain plan plugin

* Python: Register explain plan plugin and add special check to directly return string results

* Planner: Update imports and accessor logic after merge with upstream

* Python: Add sql EXPLAIN tests

* Planner: Replace the pub use with use for LogicalPlan

* Port sort logic to the datafusion planner (#505)

* Implement PySort logical_plan

* Add a sort plan accessor to the logical plan

* Python: Update the sort plugin

* Python: Uncomment tests

* Planner: Update accessor pattern for concrete logical plan implementations

* Test: Address review comments

* add support for expr_to_field for Expr::Sort expressions

* Planner: Update sort expr utilities and import cleanup

* Python: Re-enable skipped sort tests

* Python: Handle case where orderby column name is an alias

* Apply suggestions from code review

Remove redundant unwrap + re-wrap

Co-authored-by: Andy Grove <[email protected]>

* Style: Fix formatting

* Planner: Remove public scope for LogicalPlan import

* Python: Add more complex sort tests with alias that error right now

* Python: Remove old commented code

Co-authored-by: Andy Grove <[email protected]>

* Add helper method to convert LogicalPlan to Python type (#522)

* Add helper method to convert LogicalPlan to Python type

* simplify more

* Support CASE WHEN and BETWEEN (#502)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Bump DataFusion version (#494)

* bump DataFusion version

* remove unnecessary downcasts and use separate structs for TableSource and TableProvider

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* commit to share with colleague

* updates

* checkpoint

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* formatting after upstream merge

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* tests update

* checkpoint

* checkpoint

* Refactor PyExpr by removing From trait, and using recursion to expand expression list for rex calls

* skip test that uses create statement for gpuci

* Basic DataFusion Select Functionality (#489)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* Remove print statements

* Default to UTC if tz is None

* Delegate timezone handling to the arrow library

* Updates from review

Co-authored-by: Charles Blackmon-Luca <[email protected]>

* updates for expression

* uncommented pytests

* uncommented pytests

* code cleanup for review

* code cleanup for review

* Enabled more pytest that work now

* Enabled more pytest that work now

* Output Expression as String when BinaryExpr does not contain a named alias

* Output Expression as String when BinaryExpr does not contain a named alias

* Disable 2 pytests that are causing gpuCI issues. They will be addressed in a follow-up PR

* Handle Between operation for case-when

* adjust timestamp casting

* Refactor projection _column_name() logic to the _column_name logic in expression.rs

* removed println! statements

* Updates from review

* refactor String::from() to .to_string()

* When no ELSE statement is present in CASE/WHEN statement default to None

* Remove println

* Re-enable rex test that previously could not be run

Co-authored-by: Charles Blackmon-Luca <[email protected]>
Co-authored-by: Andy Grove <[email protected]>

* Upgrade to DataFusion 8.0.0 (#533)

* upgrade to DataFusion 8.0.0

* fmt

* Enable passing tests (#539)

* Uncomment working pytests

* Enable PyTests that are now passing

* Datafusion crossjoin (#521)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Bump DataFusion version (#494)

* bump DataFusion version

* remove unnecessary downcasts and use separate structs for TableSource and TableProvider

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* commit to share with colleague

* updates

* checkpoint

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* formatting after upstream merge

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* tests update

* checkpoint

* checkpoint

* Refactor PyExpr by removing From trait, and using recursion to expand expression list for rex calls

* skip test that uses create statement for gpuci

* Basic DataFusion Select Functionality (#489)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* Remove print statements

* Default to UTC if tz is None

* Delegate timezone handling to the arrow library

* Updates from review

Co-authored-by: Charles Blackmon-Luca <[email protected]>

* updates for expression

* uncommented pytests

* uncommented pytests

* code cleanup for review

* code cleanup for review

* Enabled more pytest that work now

* Enabled more pytest that work now

* Output Expression as String when BinaryExpr does not contain a named alias

* Output Expression as String when BinaryExpr does not contain a named alias

* Disable 2 pytests that are causing gpuCI issues. They will be addressed in a follow-up PR

* Handle Between operation for case-when

* adjust timestamp casting

* Refactor projection _column_name() logic to the _column_name logic in expression.rs

* removed println! statements

* Updates from review

* refactor String::from() to .to_string()

* Fix mappings

* Add cross_join.py and cross_join.rs

* Add pytest for cross_join

* Address review comments

* Fix module import issue where typo was introduced

* manually supply test fixtures

* Remove request from test method signature

Co-authored-by: Charles Blackmon-Luca <[email protected]>
Co-authored-by: Andy Grove <[email protected]>

* Implement TryFrom for plans (#543)

* impl TryFrom for plans

* fix todo

* fix compilation error

* code cleanup

* revert change to error message

* simplify code

* cargo fmt

* Support for LIMIT clause with DataFusion (#529)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Bump DataFusion version (#494)

* bump DataFusion version

* remove unnecessary downcasts and use separate structs for TableSource and TableProvider

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* commit to share with colleague

* updates

* checkpoint

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* formatting after upstream merge

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* tests update

* checkpoint

* checkpoint

* Refactor PyExpr by removing From trait, and using recursion to expand expression list for rex calls

* skip test that uses create statement for gpuci

* Basic DataFusion Select Functionality (#489)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* Remove print statements

* Default to UTC if tz is None

* Delegate timezone handling to the arrow library

* Updates from review

Co-authored-by: Charles Blackmon-Luca <[email protected]>

* updates for expression

* uncommented pytests

* uncommented pytests

* code cleanup for review

* code cleanup for review

* Enabled more pytest that work now

* Enabled more pytest that work now

* Output Expression as String when BinaryExpr does not contain a named alias

* Output Expression as String when BinaryExpr does not contain a named alias

* Disable 2 pytests that are causing gpuCI issues. They will be addressed in a follow-up PR

* Handle Between operation for case-when

* adjust timestamp casting

* Refactor projection _column_name() logic to the _column_name logic in expression.rs

* removed println! statements

* Updates from review

* Add Offset and point to repo with offset in datafusion

* Introduce offset

* limit updates

* commit before upstream merge

* Code formatting

* update Cargo.toml to use Arrow-DataFusion version with LIMIT logic

* Bump DataFusion version to get changes around variant_name()

* Use map partitions for determining the offset

* Refactor offset partition func

* Update to use TryFrom logic

* Add cloudpickle to independent scheduler requirements

Co-authored-by: Charles Blackmon-Luca <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
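
Note on "Use map partitions for determining the offset" in the LIMIT work above: applying OFFSET and LIMIT to a partitioned frame comes down to working out, for each partition, which local rows fall inside the global window. The helper below is a plain-Python sketch of that arithmetic under assumed names (`partition_slices`, `partition_lengths`); it is not the function added in this commit and does not call Dask.

```python
def partition_slices(partition_lengths, offset, limit):
    """For each partition, return the local (start, stop) row slice to keep.

    The global window is rows [offset, offset + limit). Partitions that lie
    entirely outside the window get an empty slice such as (0, 0).
    """
    slices = []
    seen = 0  # number of rows in all earlier partitions
    end = offset + limit
    for n in partition_lengths:
        # Shift the global window into this partition's local coordinates,
        # then clamp it to the partition's bounds [0, n].
        start = min(max(offset - seen, 0), n)
        stop = min(max(end - seen, 0), n)
        slices.append((start, stop))
        seen += n
    return slices


# Three partitions of 3 rows each, OFFSET 2 LIMIT 4 -> global rows 2..5.
slices = partition_slices([3, 3, 3], offset=2, limit=4)
```

A per-partition function can then apply its own slice independently, which is what makes the operation a natural fit for a map-over-partitions call once the partition lengths are known.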

* Support Joins using DataFusion planner/parser (#512)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Bump DataFusion version (#494)

* bump DataFusion version

* remove unnecessary downcasts and use separate structs for TableSource and TableProvider

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* commit to share with colleague

* updates

* checkpoint

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* formatting after upstream merge

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* tests update

* checkpoint

* checkpoint

* Refactor PyExpr by removing From trait, and using recursion to expand expression list for rex calls

* skip test that uses create statement for gpuci

* Basic DataFusion Select Functionality (#489)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* Remove print statements

* Default to UTC if tz is None

* Delegate timezone handling to the arrow library

* Updates from review

Co-authored-by: Charles Blackmon-Luca <[email protected]>

* updates for expression

* uncommented pytests

* uncommented pytests

* code cleanup for review

* code cleanup for review

* Enabled more pytest that work now

* Enabled more pytest that work now

* Output Expression as String when BinaryExpr does not contain a named alias

* Output Expression as String when BinaryExpr does not contain a named alias

* Disable 2 pytests that are causing gpuCI issues. They will be addressed in a follow-up PR

* Handle Between operation for case-when

* adjust timestamp casting

* Refactor projection _column_name() logic to the _column_name logic in expression.rs

* removed println! statements

* introduce join getCondition() logic for retrieving the combining Rex logic for joining

* Updates from review

* Add Offset and point to repo with offset in datafusion

* Introduce offset

* limit updates

* commit before upstream merge

* Code formatting

* update Cargo.toml to use Arrow-DataFusion version with LIMIT logic

* Bump DataFusion version to get changes around variant_name()

* Use map partitions for determining the offset

* Merge with upstream

* Rename underlying DataContainer's DataFrame instance to match the column container names

* Adjust ColumnContainer mapping after join.py logic to ensure the backend mapping is reset

* Add enumerate to column_{i} generation string to ensure columns exist in both dataframes

* Adjust join schema logic to perform merge instead of join on rust side to avoid name collisions

* Handle DataFusion COUNT(UInt8(1)) as COUNT(*)

* commit before merge

* Update function for gathering index of an expression

* Update for review check

* Adjust RelDataType to retrieve fully qualified column names

* Adjust base.py to get fully qualified column name

* Enable passing pytests in test_join.py

* Adjust keys provided by getting backend column mapping name

* Adjust output_col to not use the backend_column name for special reserved exprs

* uncomment cross join pytest which works now

* Uncomment passing pytests in test_select.py

* Review updates

* Add back complex join case condition, not just cross join but 'complex' joins

Co-authored-by: Charles Blackmon-Luca <[email protected]>
Co-authored-by: Andy Grove <[email protected]>

* Datafusion is not (#557)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Bump DataFusion version (#494)

* bump DataFusion version

* remove unnecessary downcasts and use separate structs for TableSource and TableProvider

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* commit to share with colleague

* updates

* checkpoint

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* formatting after upstream merge

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* tests update

* checkpoint

* checkpoint

* Refactor PyExpr by removing From trait, and using recursion to expand expression list for rex calls

* skip test that uses create statement for gpuci

* Basic DataFusion Select Functionality (#489)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* Remove print statements

* Default to UTC if tz is None

* Delegate timezone handling to the arrow library

* Updates from review

Co-authored-by: Charles Blackmon-Luca <[email protected]>

* updates for expression

* uncommented pytests

* uncommented pytests

* code cleanup for review

* code cleanup for review

* Enabled more pytest that work now

* Enabled more pytest that work now

* Output Expression as String when BinaryExpr does not contain a named alias

* Output Expression as String when BinaryExpr does not contain a named alias

* Disable 2 pytests that are causing gpuCI issues. They will be addressed in a follow-up PR

* Handle Between operation for case-when

* adjust timestamp casting

* Refactor projection _column_name() logic to the _column_name logic in expression.rs

* removed println! statements

* introduce join getCondition() logic for retrieving the combining Rex logic for joining

* Updates from review

* Add Offset and point to repo with offset in datafusion

* Introduce offset

* limit updates

* commit before upstream merge

* Code formatting

* update Cargo.toml to use Arrow-DataFusion version with LIMIT logic

* Bump DataFusion version to get changes around variant_name()

* Use map partitions for determining the offset

* Merge with upstream

* Rename underlying DataContainer's DataFrame instance to match the column container names

* Adjust ColumnContainer mapping after join.py logic to ensure the backend mapping is reset

* Add enumerate to column_{i} generation string to ensure columns exist in both dataframes

* Adjust join schema logic to perform merge instead of join on rust side to avoid name collisions

* Handle DataFusion COUNT(UInt8(1)) as COUNT(*)

* commit before merge

* Update function for gathering index of an expression

* Update for review check

* Adjust RelDataType to retrieve fully qualified column names

* Adjust base.py to get fully qualified column name

* Enable passing pytests in test_join.py

* Adjust keys provided by getting backend column mapping name

* Adjust output_col to not use the backend_column name for special reserved exprs

* uncomment cross join pytest which works now

* Uncomment passing pytests in test_select.py

* Review updates

* Add back complex join case condition, not just cross join but 'complex' joins

* Add support for 'is not null' clause

Co-authored-by: Charles Blackmon-Luca <[email protected]>
Co-authored-by: Andy Grove <[email protected]>

* [REVIEW] Add support for `UNION` (#542)

Fixes: #470

This PR adds UNION support in dask-sql which uses datafusion's union logic.

* [REVIEW] Fix issue with duplicates in column renaming (#559)

* initial commit

* add union.rs

* fix

* debug

* updates

* handle multiple inputs

* xfail

* remove xfails

* style

* cleanup

* un-xfail

* address reviews

* fix projection issue

* address reviews

* enable tests (#560)

Looks like LIMIT has been ported, hence enabling more tests across different files.

* Add CODEOWNERS file (#562)

* Add CODEOWNERS file

* Can only specify users with write access

* Remove accidental dev commit

* Upgrade DataFusion version & support non-equijoin join conditions (#566)

* use latest datafusion

* fix regression

* remove DaskTableProvider

* simplify code

* Add ayushdg and galipremsagar to rust CODEOWNERS (#572)

* Enable DataFusion CBO and introduce DaskSqlOptimizer (#558)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Bump DataFusion version (#494)

* bump DataFusion version

* remove unnecessary downcasts and use separate structs for TableSource and TableProvider

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* commit to share with colleague

* updates

* checkpoint

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* formatting after upstream merge

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* tests update

* checkpoint

* checkpoint

* Refactor PyExpr by removing From trait, and using recursion to expand expression list for rex calls

* skip test that uses create statement for gpuci

* Basic DataFusion Select Functionality (#489)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* Remove print statements

* Default to UTC if tz is None

* Delegate timezone handling to the arrow library

* Updates from review

Co-authored-by: Charles Blackmon-Luca <[email protected]>

* updates for expression

* uncommented pytests

* uncommented pytests

* code cleanup for review

* code cleanup for review

* Enabled more pytest that work now

* Enabled more pytest that work now

* Output Expression as String when BinaryExpr does not contain a named alias

* Output Expression as String when BinaryExpr does not contain a named alias

* Disable 2 pytests that are causing gpuCI issues. They will be addressed in a follow-up PR

* Handle Between operation for case-when

* adjust timestamp casting

* Refactor projection _column_name() logic to the _column_name logic in expression.rs

* removed println! statements

* introduce join getCondition() logic for retrieving the combining Rex logic for joining

* Updates from review

* Add Offset and point to repo with offset in datafusion

* Introduce offset

* limit updates

* commit before upstream merge

* Code formatting

* update Cargo.toml to use Arrow-DataFusion version with LIMIT logic

* Bump DataFusion version to get changes around variant_name()

* Use map partitions for determining the offset

* Merge with upstream

* Rename underlying DataContainer's DataFrame instance to match the column container names

* Adjust ColumnContainer mapping after join.py logic to ensure the backend mapping is reset

* Add enumerate to column_{i} generation string to ensure columns exist in both dataframes

* Adjust join schema logic to perform merge instead of join on rust side to avoid name collisions

* Handle DataFusion COUNT(UInt8(1)) as COUNT(*)

* commit before merge

* Update function for gathering index of an expression

* Update for review check

* Adjust RelDataType to retrieve fully qualified column names

* Adjust base.py to get fully qualified column name

* Enable passing pytests in test_join.py

* Adjust keys provided by getting backend column mapping name

* Adjust output_col to not use the backend_column name for special reserved exprs

* uncomment cross join pytest which works now

* Uncomment passing pytests in test_select.py

* Review updates

* Add back complex join case condition, not just cross join but 'complex' joins

* Enable DataFusion CBO logic

* Disable EliminateFilter optimization rule

* updates

* Disable tests that hit CBO generated plan edge cases of yet to be implemented logic

* [REVIEW] - Modify sql.skip_optimize to use dask_config.get and remove used method parameter

* [REVIEW] - change name of configuration from skip_optimize to optimize

* [REVIEW] - Add OptimizeException catch and raise statements back

* Found issue where backend column names which are results of a single aggregate resulting column, COUNT(*) for example, need to get the first agg df column since names are not valid

* Remove SQL from OptimizationException

* skip tests that CBO plan reorganization causes missing features to be present

Co-authored-by: Charles Blackmon-Luca <[email protected]>
Co-authored-by: Andy Grove <[email protected]>

* Only use the specific DataFusion crates that we need (#568)

* use specific datafusion crates

* clean up imports

* Fix some clippy warnings (#574)

* Datafusion invalid projection (#571)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Bump DataFusion version (#494)

* bump DataFusion version

* remove unnecessary downcasts and use separate structs for TableSource and TableProvider

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* commit to share with colleague

* updates

* checkpoint

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* formatting after upstream merge

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* tests update

* checkpoint

* checkpoint

* Refactor PyExpr by removing From trait, and using recursion to expand expression list for rex calls

* skip test that uses create statement for gpuci

* Basic DataFusion Select Functionality (#489)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Add workflow to keep datafusion dev branch up to date (#440)

* Include setuptools-rust in conda build recipe, in host and run

* Remove PyArrow dependency

* rebase with datafusion-sql-planner

* refactor changes that were inadvertent during rebase

* timestamp with local time zone

* Include RelDataType work

* Include RelDataType work

* Introduced SqlTypeName Enum in Rust and mappings for Python

* impl PyExpr.getIndex()

* add getRowType() for logical.rs

* Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes

* use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict

* linter changes, why did that work on my local pre-commit??

* linter changes, why did that work on my local pre-commit??

* Convert final strs to SqlTypeName Enum

* removed a few print statements

* Temporarily disable conda run_test.py script since it uses features not yet implemented

* expose fromString method for SqlTypeName to use Enums instead of strings for type checking

* expanded SqlTypeName from_string() support

* accept INT as INTEGER

* Remove print statements

* Default to UTC if tz is None

* Delegate timezone handling to the arrow library

* Updates from review

Co-authored-by: Charles Blackmon-Luca <[email protected]>

* updates for expression

* uncommented pytests

* uncommented pytests

* code cleanup for review

* code cleanup for review

* Enabled more pytest that work now

* Enabled more pytest that work now

* Output Expression as String when BinaryExpr does not contain a named alias

* Output Expression as String when BinaryExpr does not contain a named alias

* Disable 2 pytests that are causing gpuCI issues. They will be addressed in a follow-up PR

* Handle Between operation for case-when

* adjust timestamp casting

* Refactor projection _column_name() logic to the _column_name logic in expression.rs

* removed println! statements

* introduce join getCondition() logic for retrieving the combining Rex logic for joining

* Updates from review

* Add Offset and point to repo with offset in datafusion

* Introduce offset

* limit updates

* commit before upstream merge

* Code formatting

* update Cargo.toml to use Arrow-DataFusion version with LIMIT logic

* Bump DataFusion version to get changes around variant_name()

* Use map partitions for determining the offset

* Merge with upstream

* Rename underlying DataContainer's DataFrame instance to match the column container names

* Adjust ColumnContainer mapping after join.py logic to ensure the backend mapping is reset

* Add enumerate to column_{i} generation string to ensure columns exist in both dataframes

* Adjust join schema logic to perform merge instead of join on rust side to avoid name collisions

* Handle DataFusion COUNT(UInt8(1)) as COUNT(*)

* commit before merge

* Update function for gathering index of an expression

* Update for review check

* Adjust RelDataType to retrieve fully qualified column names

* Adjust base.py to get fully qualified column name

* Enable passing pytests in test_join.py

* Adjust keys provided by getting backend column mapping name

* Adjust output_col to not use the backend_column name for special reserved exprs

* uncomment cross join pytest which works now

* Uncomment passing pytests in test_select.py

* Review updates

* Add back complex join case condition, not just cross join but 'complex' joins

* Enable DataFusion CBO logic

* Disable EliminateFilter optimization rule

* updates

* Disable tests that hit CBO generated plan edge cases of yet to be implemented logic

* [REVIEW] - Modify sql.skip_optimize to use dask_config.get and remove used method parameter

* [REVIEW] - change name of configuration from skip_optimize to optimize

* [REVIEW] - Add OptimizeException catch and raise statements back

* Found issue where backend column names which are results of a single aggregate resulting column, COUNT(*) for example, need to get the first agg df column since names are not valid

* Remove SQL from OptimizationException

* skip tests that CBO plan reorganization causes missing features to be present

* If TableScan contains projections, use those instead of all of the TableColumns to limit the columns read during table_scan

* [REVIEW] remove compute(), remove temp row_type variable

* [REVIEW] - Add test for projection pushdown

* [REVIEW] - Add some more parametrized test combinations

* [REVIEW] - Use iterator instead of for loop and simplify contains_projections

* [REVIEW] - merge upstream and adjust imports

* [REVIEW] - Rename pytest function and remove duplicate table creation

Co-authored-by: Charles Blackmon-Luca <[email protected]>
Co-authored-by: Andy Grove <[email protected]>

* Datafusion upstream merge (#576)

* Add basic predicate-pushdown optimization (#433)

* basic predicate-pushdown support

* remove explicit Dispatch class

* use _Frame.fillna

* cleanup comments

* test coverage

* improve test coverage

* add xfail test for dt accessor in predicate and fix test_show.py

* fix some naming issues

* add config and use assert_eq

* add logging events when predicate-pushdown bails

* move bail logic earlier in function

* address easier code review comments

* typo fix

* fix creation_info access bug

* convert any expression to DNF

* csv test coverage

* include IN coverage

* improve test rigor

* address code review

* skip parquet tests when deps are not installed

* fix bug

* add pyarrow dep to cluster workers

* roll back test skipping changes

Co-authored-by: Charles Blackmon-Luca <[email protected]>

* Add workflow to keep datafusion dev branch up to date (#440)

* Update gpuCI `RAPIDS_VER` to `22.06` (#434)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Bump black to 22.3.0 (#443)

* Check for ucx-py nightlies when updating gpuCI (#441)

* Simplify gpuCI updating workflow

* Add check for cuML nightly version

* Add handling for newer `prompt_toolkit` versions in cmd tests (#447)

* Add handling for newer prompt-toolkit version

* Place compatibility code in _compat

* Fix version for gha-find-replace (#446)

* Update versions of Java dependencies (#445)

* Update versions for java dependencies with cves

* Rerun tests

* Update jackson databind version (#449)

* Update versions for java dependencies with cves

* Rerun tests

* update jackson-databind dependency

* Disable SQL server functionality (#448)

* Disable SQL server functionality

* Update docs/source/server.rst

Co-authored-by: Ayush Dattagupta <[email protected]>

* Disable server at lowest possible level

* Skip all server tests

* Add tests to ensure server is disabled

* Fix CVE fix test

Co-authored-by: Ayush Dattagupta <[email protected]>

* Update dask pinnings for release (#450)

* Add Java source code to source distribution (#451)

* Bump `httpclient` dependency (#453)

* Revert "Disable SQL server functionality (#448)"

This reverts commit 37a3a61fb13b0c56fcc10bf8ef01f4885a58dae8.

* Bump httpclient version

* Unpin Dask/distributed versions (#452)

* Unpin dask/distributed post release

* Remove dask/distributed version ceiling

* Add jsonschema to ci testing (#454)

* Add jsonschema to ci env

* Fix typo in config schema

* Switch tests from `pd.testing.assert_frame_equal` to `dd.assert_eq` (#365)

* Start moving tests to dd.assert_eq

* Use assert_eq in datetime filter test

* Resolve most resulting test failures

* Resolve remaining test failures

* Convert over tests

* Convert more tests

* Consolidate select limit cpu/gpu test

* Remove remaining assert_series_equal

* Remove explicit cudf imports from many tests

* Resolve rex test failures

* Remove some additional compute calls

* Consolidate sorting tests with getfixturevalue

* Fix failed join test

* Remove breakpoint

* Use custom assert_eq function for tests

* Resolve test failures / seg faults

* Remove unnecessary testing utils

* Resolve local test failures

* Generalize RAND test

* Avoid closing client if using independent cluster

* Fix failures on Windows

* Resolve black failures

* Make random test variables more clear

* Set max pin on antlr4-python-runtime (#456)

* Set max pin on antlr4-python-runtime due to incompatibilities with fugue_sql

* update comment on antlr max pin version

* Move / minimize number of cudf / dask-cudf imports (#480)

* Move / minimize number of cudf / dask-cudf imports

* Add tests for GPU-related errors

* Fix unbound local error

* Fix ddf value error

* Use `map_partitions` to compute LIMIT / OFFSET (#517)

* Use map_partitions to compute limit / offset

* Use partition_info to extract partition_index

* Use `dev` images for independent cluster testing (#518)

* Switch to dask dev images

* Use mamba for conda installs in images

* Remove sleep call for installation

* Use timeout / until to wait for cluster to be initialized

* Add documentation for FugueSQL integrations (#523)

* Add documentation for FugueSQL integrations

* Minor nitpick around autodoc obj -> class

* Timestampdiff support (#495)

* added timestampdiff

* initial work for timestampdiff

* Added test cases for timestampdiff

* Update interval month dtype mapping

* Add datetimesubOperator

* Uncomment timestampdiff literal tests

* Update logic for handling interval_months for pandas/cudf series and scalars

* Add negative diff testcases, and gpu tests

* Update reinterpret and timedelta to explicitly cast to int64 instead of int

* Simplify cast_column_to_type mapping logic

* Add scalar handling to castOperation and reuse it for reinterpret

Co-authored-by: rajagurnath <[email protected]>

* Relax jsonschema testing dependency (#546)

* Update upstream testing workflows (#536)

* Use dask nightly conda packages for upstream testing

* Add independent cluster testing to nightly upstream CI [test-upstream]

* Remove unnecessary dask install [test-upstream]

* Remove strict channel policy to allow nightly dask installs

* Use nightly Dask packages in independent cluster test [test-upstream]

* Use channels argument to install Dask conda nightlies [test-upstream]

* Fix channel expression

* [test-upstream]

* Need to add mamba update command to get dask conda nightlies

* Use conda nightlies for dask-sql import test

* Add import test to upstream nightly tests

* [test-upstream]

* Make sure we have nightly Dask for import tests [test-upstream]

* Fix pyarrow / cloudpickle failures in cluster testing (#553)

* Explicitly install libstdcxx-ng in clusters

* Make pyarrow dependency consistent across testing

* Make libstdcxx-ng dep a min version

* Add cloudpickle to cluster dependencies

* cloudpickle must be in the scheduler environment

* Bump cloudpickle version

* Move cloudpickle install to workers

* Fix pyarrow constraint in cluster spec

* Use bash -l as default entrypoint for all jobs (#552)

* Constrain dask/distributed for release (#563)

* Unpin dask/distributed for development (#564)

* Unpin dask/distributed post release

* Remove dask/distributed version ceiling

* update dask-sphinx-theme (#567)

* Introduce subquery.py to handle subquery expressions

* update ordering

* Make sure scheduler has Dask nightlies in upstream cluster testing (#573)

* Make sure scheduler has Dask nightlies in upstream cluster testing

* empty commit to [test-upstream]

* Update gpuCI `RAPIDS_VER` to `22.08` (#565)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* updates

* Remove startswith function merged by mistake

* [REVIEW] - Remove instance that are meant for the currently removed timestampdiff

* Modify test environment pinnings to cover minimum versions (#555)

* Remove black/isort deps as we prefer pre-commit

* Unpin all non python/jdk dependencies

* Minor package corrections for py3.9 jdk11 env

* Set min version constraints for all non-testing dependencies

* Pin all non-test deps for 3.8 testing

* Bump sklearn min version to 1.0.0

* Bump pyarrow min version to 1.0.1

* Fix pip notation for fugue

* Use unpinned deps for cluster testing for now

* Add fugue deps to environments, bump pandas to 1.0.2

* Add back antlr4 version ceiling

* Explicitly mark all fugue dependencies

* Alter test_analyze to avoid rtol

* Bump pandas to 1.0.5 to fix upstream numpy issues

* Alter datetime casting util to dodge pandas casting failures

* Bump pandas to 1.1.0 for groupby dropna support

* Simplify string dtype check for get_supported_aggregations

* Add check_dtype=False back to test_group_by_nan

* Bump cluster to python 3.9

* Bump fastapi to 0.69.0, resolve remaining JDBC failures

* Typo - correct pandas version

* Generalize test_multi_case_when's dtype check

* Bump pandas to 1.1.1 to resolve flaky test failures

* Constrain mlflow for windows python 3.8 testing

* Selectors don't work for conda env files

* Problems seem to persist in 1.1.1, bump to 1.1.2

* Remove accidental debug changes

* [test-upstream]

* Use python 3.9 for upstream cluster testing [test-upstream]

* Updated missed pandas pinning

* Unconstrain mlflow to see if Windows failures persist

* Add min version for protobuf

* Bump pyarrow min version to allow for newer protobuf versions

* Don't move jar to local mvn repo (#579)

* Add tests for intersection

* Add tests for intersection

* Add another intersection test, even more simple but for testing raw intersection

* Use Timedelta when doing ReduceOperation(s) against datetime64 dtypes

* Cleanup

* Use an either/or strategy for converting to Timedelta objects

* Support more than 2 operands for Timedelta conversions

* fix merge issues, is_frame() function of call.py was removed accidentally before

* Remove pytest …
16 people authored Sep 21, 2022
1 parent 442b871 commit 28910e0
Showing 232 changed files with 11,738 additions and 5,753 deletions.
5 changes: 5 additions & 0 deletions .github/CODEOWNERS
@@ -0,0 +1,5 @@
# global codeowners
* @ayushdg @charlesbluca @galipremsagar

# rust codeowners
dask_planner/ @ayushdg @galipremsagar @jdye64
17 changes: 17 additions & 0 deletions .github/actions/setup-builder/action.yaml
@@ -0,0 +1,17 @@
name: Prepare Rust Builder
description: 'Prepare Rust Build Environment'
inputs:
  rust-version:
    description: 'version of rust to install (e.g. stable)'
    required: true
    default: 'stable'
runs:
  using: "composite"
  steps:
    - name: Setup Rust toolchain
      shell: bash
      run: |
        echo "Installing ${{ inputs.rust-version }}"
        rustup toolchain install ${{ inputs.rust-version }}
        rustup default ${{ inputs.rust-version }}
        rustup component add rustfmt
15 changes: 13 additions & 2 deletions .github/workflows/conda.yml
@@ -3,7 +3,18 @@ on:
push:
branches:
- main
- datafusion-sql-planner
pull_request:
paths:
- setup.py
- dask_planner/Cargo.toml
- dask_planner/Cargo.lock
- dask_planner/pyproject.toml
- dask_planner/rust-toolchain.toml
- continuous_integration/recipe/**
- .github/workflows/conda.yml
schedule:
- cron: '0 0 * * 0'

# When this workflow is queued, automatically cancel any previous running
# or pending jobs from the same branch
@@ -49,12 +60,12 @@ jobs:
- name: Upload conda package
if: |
github.event_name == 'push'
&& github.ref == 'refs/heads/main'
&& github.repository == 'dask-contrib/dask-sql'
env:
ANACONDA_API_TOKEN: ${{ secrets.DASK_CONDA_TOKEN }}
LABEL: ${{ github.ref == 'refs/heads/datafusion-sql-planner' && 'dev_datafusion' || 'dev' }}
run: |
# install anaconda for upload
mamba install anaconda-client
anaconda upload --label dev noarch/*.tar.bz2
anaconda upload --label $LABEL linux-64/*.tar.bz2
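The `LABEL` value in this hunk relies on the `cond && 'a' || 'b'` idiom. GitHub Actions expressions have no ternary operator, but `&&` and `||` return one of their operands, so the expression yields `'dev_datafusion'` on the `datafusion-sql-planner` branch and `'dev'` otherwise. The idiom only works because the middle operand is a non-empty (truthy) string. Below is a minimal shell sketch of the same decision; the function name `pick_label` is hypothetical and does not appear in the workflow.

```shell
# Hypothetical helper mirroring the LABEL expression in conda.yml:
#   github.ref == 'refs/heads/datafusion-sql-planner' && 'dev_datafusion' || 'dev'
pick_label() {
  local ref="$1"   # full git ref, e.g. refs/heads/main
  if [ "$ref" = "refs/heads/datafusion-sql-planner" ]; then
    echo "dev_datafusion"
  else
    echo "dev"
  fi
}

pick_label "refs/heads/main"                     # prints: dev
pick_label "refs/heads/datafusion-sql-planner"   # prints: dev_datafusion
```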
72 changes: 72 additions & 0 deletions .github/workflows/rust.yml
@@ -0,0 +1,72 @@
name: Rust

on:
  # always trigger on PR
  push:
  pull_request:
  # manual trigger
  # https://docs.github.com/en/actions/managing-workflow-runs/manually-running-a-workflow
  workflow_dispatch:

jobs:
  # Check crate compiles
  linux-build-lib:
    name: cargo check
    runs-on: ubuntu-latest
    container:
      image: amd64/rust
      env:
        # Disable full debug symbol generation to speed up CI build and keep memory down
        # "1" means line tables only, which is useful for panic tracebacks.
        RUSTFLAGS: "-C debuginfo=1"
    steps:
      - uses: actions/checkout@v3
      - name: Cache Cargo
        uses: actions/cache@v3
        with:
          # these represent dependencies downloaded by cargo
          # and thus do not depend on the OS, arch nor rust version.
          path: /github/home/.cargo
          key: cargo-cache-
      - name: Setup Rust toolchain
        uses: ./.github/actions/setup-builder
        with:
          rust-version: stable
      - name: Check workspace in debug mode
        run: |
          cd dask_planner
          cargo check
      - name: Check workspace in release mode
        run: |
          cd dask_planner
          cargo check --release
  # test the crate
  linux-test:
    name: cargo test (amd64)
    needs: [linux-build-lib]
    runs-on: ubuntu-latest
    container:
      image: amd64/rust
      env:
        # Disable full debug symbol generation to speed up CI build and keep memory down
        # "1" means line tables only, which is useful for panic tracebacks.
        RUSTFLAGS: "-C debuginfo=1"
    steps:
      - uses: actions/checkout@v3
        with:
          submodules: true
      - name: Cache Cargo
        uses: actions/cache@v3
        with:
          path: /github/home/.cargo
          # this key equals the ones on `linux-build-lib` for re-use
          key: cargo-cache-
      - name: Setup Rust toolchain
        uses: ./.github/actions/setup-builder
        with:
          rust-version: stable
      - name: Run tests
        run: |
          cd dask_planner
          cargo test
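The two jobs in this workflow boil down to three cargo invocations run from `dask_planner` with `RUSTFLAGS="-C debuginfo=1"`. A rough sketch of running the same sequence locally follows. The function name `run_rust_ci` and its dry-run argument are assumptions made for illustration and are not part of the repository. It also assumes a Rust toolchain is already installed and that it is called from the repository root.

```shell
# Hypothetical helper: run the cargo commands from rust.yml in order.
# Pass "echo" as the first argument to print the commands instead of running them.
run_rust_ci() {
  local runner="${1:-}"
  (
    # Same setting as the workflow: line tables only, to keep builds light.
    export RUSTFLAGS="-C debuginfo=1"
    $runner cd dask_planner &&
      $runner cargo check &&
      $runner cargo check --release &&
      $runner cargo test
  )
}

# Dry run: list the steps without needing cargo or the repository checkout.
run_rust_ci echo
```

The subshell keeps the `cd` and the exported `RUSTFLAGS` from leaking into the calling shell, and the `&&` chain stops at the first failing step, much as a failed step stops a CI job.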
98 changes: 31 additions & 67 deletions .github/workflows/test-upstream.yml
@@ -10,58 +10,24 @@ defaults:
shell: bash -l {0}

jobs:
build:
# This build step should be similar to the deploy build, to make sure we actually test
# the future deployable
name: Build the jar on ubuntu
runs-on: ubuntu-latest
if: github.repository == 'dask-contrib/dask-sql'
steps:
- uses: actions/checkout@v2
- name: Cache local Maven repository
uses: actions/cache@v2
with:
path: ~/.m2/repository
key: ${{ runner.os }}-maven-v1-jdk11-${{ hashFiles('**/pom.xml') }}
- name: Set up Python
uses: conda-incubator/setup-miniconda@v2
with:
miniforge-variant: Mambaforge
use-mamba: true
python-version: "3.8"
channel-priority: strict
activate-environment: dask-sql
environment-file: continuous_integration/environment-3.8-jdk11-dev.yaml
- name: Install dependencies and build the jar
run: |
python setup.py build_ext
- name: Upload the jar
uses: actions/upload-artifact@v1
with:
name: jar
path: dask_sql/jar/DaskSQL.jar

test-dev:
name: "Test upstream dev (${{ matrix.os }}, java: ${{ matrix.java }}, python: ${{ matrix.python }})"
needs: build
name: "Test upstream dev (${{ matrix.os }}, python: ${{ matrix.python }})"
runs-on: ${{ matrix.os }}
if: github.repository == 'dask-contrib/dask-sql'
env:
CONDA_FILE: continuous_integration/environment-${{ matrix.python }}-jdk${{ matrix.java }}-dev.yaml
CONDA_FILE: continuous_integration/environment-${{ matrix.python }}-dev.yaml
defaults:
run:
shell: bash -l {0}
strategy:
fail-fast: false
matrix:
java: [8, 11]
os: [ubuntu-latest, windows-latest]
python: ["3.8", "3.9", "3.10"]
steps:
- uses: actions/checkout@v2
with:
fetch-depth: 0 # Fetch all history for all branches and tags.
- name: Cache local Maven repository
uses: actions/cache@v2
with:
path: ~/.m2/repository
key: ${{ runner.os }}-maven-v1-jdk${{ matrix.java }}-${{ hashFiles('**/pom.xml') }}
- name: Set up Python
uses: conda-incubator/setup-miniconda@v2
with:
@@ -72,21 +38,21 @@ jobs:
channels: dask/label/dev,conda-forge,nodefaults
activate-environment: dask-sql
environment-file: ${{ env.CONDA_FILE }}
- name: Download the pre-build jar
uses: actions/download-artifact@v1
- name: Setup Rust Toolchain
uses: actions-rs/toolchain@v1
id: rust-toolchain
with:
name: jar
path: dask_sql/jar/
toolchain: stable
override: true
- name: Build the Rust DataFusion bindings
run: |
python setup.py build install
- name: Install hive testing dependencies for Linux
if: matrix.os == 'ubuntu-latest'
run: |
mamba install -c conda-forge sasl>=0.3.1
docker pull bde2020/hive:2.3.2-postgresql-metastore
docker pull bde2020/hive-metastore-postgresql:2.3.0
- name: Set proper JAVA_HOME for Windows
if: matrix.os == 'windows-latest'
run: |
echo "JAVA_HOME=${{ env.CONDA }}\envs\dask-sql\Library" >> $GITHUB_ENV
- name: Install upstream dev Dask / dask-ml
run: |
mamba update dask
@@ -101,11 +67,6 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Cache local Maven repository
uses: actions/cache@v2
with:
path: ~/.m2/repository
key: ${{ runner.os }}-maven-v1-jdk11-${{ hashFiles('**/pom.xml') }}
- name: Set up Python
uses: conda-incubator/setup-miniconda@v2
with:
@@ -115,12 +76,16 @@ jobs:
channel-priority: strict
channels: dask/label/dev,conda-forge,nodefaults
activate-environment: dask-sql
environment-file: continuous_integration/environment-3.9-jdk11-dev.yaml
- name: Download the pre-build jar
uses: actions/download-artifact@v1
environment-file: continuous_integration/environment-3.9-dev.yaml
- name: Setup Rust Toolchain
uses: actions-rs/toolchain@v1
id: rust-toolchain
with:
name: jar
path: dask_sql/jar/
toolchain: stable
override: true
- name: Build the Rust DataFusion bindings
run: |
python setup.py build install
- name: Install cluster dependencies
run: |
mamba install python-blosc lz4 -c conda-forge
@@ -151,23 +116,22 @@ jobs:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Cache local Maven repository
uses: actions/cache@v2
with:
path: ~/.m2/repository
key: ${{ runner.os }}-maven-v1-jdk11-${{ hashFiles('**/pom.xml') }}
- name: Set up Python
uses: conda-incubator/setup-miniconda@v2
with:
python-version: "3.8"
mamba-version: "*"
channels: dask/label/dev,conda-forge,nodefaults
channel-priority: strict
- name: Download the pre-build jar
uses: actions/download-artifact@v1
- name: Setup Rust Toolchain
uses: actions-rs/toolchain@v1
id: rust-toolchain
with:
name: jar
path: dask_sql/jar/
toolchain: stable
override: true
- name: Build the Rust DataFusion bindings
run: |
python setup.py build install
- name: Install upstream dev Dask / dask-ml
if: needs.detect-ci-trigger.outputs.triggered == 'true'
run: |