Relax `jsonschema` testing dependency #546

charlesbluca · 2022-05-20T15:31:08Z

This matches Dask's jsonschema specification for testing and unblocks creating a cuDF / dask-sql dev environment, which was blocked by rapidsai/cudf#10769.

codecov-commenter · 2022-05-20T15:46:17Z

Codecov Report

Merging #546 (9a18c78) into main (cb3d903) will increase coverage by 0.14%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##             main     #546      +/-   ##
==========================================
+ Coverage   82.71%   82.85%   +0.14%     
==========================================
  Files          69       69              
  Lines        3500     3500              
  Branches      707      707              
==========================================
+ Hits         2895     2900       +5     
+ Misses        516      507       -9     
- Partials       89       93       +4

| Impacted Files | Coverage Δ | |
|---|---|---|
| dask_sql/_version.py | `34.00% <0.00%> (+1.44%)` | ⬆️ |
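
As a quick sanity check on the report above, the coverage percentages can be reproduced from the hit and line counts. This is a minimal sketch: the helper name `coverage_pct` is hypothetical, and truncating (rather than rounding) to two decimals is an assumption inferred from the reported 82.85%, not taken from Codecov's documentation.

```python
def coverage_pct(hits: int, lines: int) -> float:
    """Return hits/lines as a percentage, truncated to two decimals.

    Truncation is an assumption made to match the figures in the report;
    integer arithmetic avoids floating-point surprises before the final divide.
    """
    return (hits * 10000 // lines) / 100


# Counts copied from the Coverage Diff table (main vs. #546).
base = coverage_pct(2895, 3500)
head = coverage_pct(2900, 3500)
delta = round(head - base, 2)

# Hits, misses, and partials should add up to the total line count.
base_total = 2895 + 516 + 89
head_total = 2900 + 507 + 93

print(base, head, delta, base_total, head_total)
```

The line count is unchanged between the two commits, so the +0.14% comes entirely from five additional hits, consistent with the "diff coverage is n/a" note.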


* Add basic predicate-pushdown optimization (#433) * basic predicate-pushdown support * remove explict Dispatch class * use _Frame.fillna * cleanup comments * test coverage * improve test coverage * add xfail test for dt accessor in predicate and fix test_show.py * fix some naming issues * add config and use assert_eq * add logging events when predicate-pushdown bails * move bail logic earlier in function * address easier code review comments * typo fix * fix creation_info access bug * convert any expression to DNF * csv test coverage * include IN coverage * improve test rigor * address code review * skip parquet tests when deps are not installed * fix bug * add pyarrow dep to cluster workers * roll back test skipping changes Co-authored-by: Charles Blackmon-Luca <[email protected]> * Add workflow to keep datafusion dev branch up to date (#440) * Update gpuCI `RAPIDS_VER` to `22.06` (#434) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * Bump black to 22.3.0 (#443) * Check for ucx-py nightlies when updating gpuCI (#441) * Simplify gpuCI updating workflow * Add check for cuML nightly version * Add handling for newer `prompt_toolkit` versions in cmd tests (#447) * Add handling for newer prompt-toolkit version * Place compatibility code in _compat * Fix version for gha-find-replace (#446) * Update versions of Java dependencies (#445) * Update versions for java dependencies with cves * Rerun tests * Update jackson databind version (#449) * Update versions for java dependencies with cves * Rerun tests * update jackson-databind dependency * Disable SQL server functionality (#448) * Disable SQL server functionality * Update docs/source/server.rst Co-authored-by: Ayush Dattagupta <[email protected]> * Disable server at lowest possible level * Skip all server tests * Add tests to ensure server is disabled * Fix CVE fix test Co-authored-by: Ayush Dattagupta <[email protected]> * Update dask pinnings for release (#450) * Add Java 
source code to source distribution (#451) * Bump `httpclient` dependency (#453) * Revert "Disable SQL server functionality (#448)" This reverts commit 37a3a61. * Bump httpclient version * Unpin Dask/distributed versions (#452) * Unpin dask/distributed post release * Remove dask/distributed version ceiling * Add jsonschema to ci testing (#454) * Add jsonschema to ci env * Fix typo in config schema * Switch tests from `pd.testing.assert_frame_equal` to `dd.assert_eq` (#365) * Start moving tests to dd.assert_eq * Use assert_eq in datetime filter test * Resolve most resulting test failures * Resolve remaining test failures * Convert over tests * Convert more tests * Consolidate select limit cpu/gpu test * Remove remaining assert_series_equal * Remove explicit cudf imports from many tests * Resolve rex test failures * Remove some additional compute calls * Consolidate sorting tests with getfixturevalue * Fix failed join test * Remove breakpoint * Use custom assert_eq function for tests * Resolve test failures / seg faults * Remove unnecessary testing utils * Resolve local test failures * Generalize RAND test * Avoid closing client if using independent cluster * Fix failures on Windows * Resolve black failures * Make random test variables more clear * Set max pin on antlr4-python-runtime (#456) * Set max pin on antlr4-python-runtime due to incompatibilities with fugue_sql * update comment on antlr max pin version * Move / minimize number of cudf / dask-cudf imports (#480) * Move / minimize number of cudf / dask-cudf imports * Add tests for GPU-related errors * Fix unbound local error * Fix ddf value error * Use `map_partitions` to compute LIMIT / OFFSET (#517) * Use map_partitions to compute limit / offset * Use partition_info to extract partition_index * Use `dev` images for independent cluster testing (#518) * Switch to dask dev images * Use mamba for conda installs in images * Remove sleep call for installation * Use timeout / until to wait for cluster to be 
initialized * Add documentation for FugueSQL integrations (#523) * Add documentation for FugueSQL integrations * Minor nitpick around autodoc obj -> class * Timestampdiff support (#495) * added timestampdiff * initial work for timestampdiff * Added test cases for timestampdiff * Update interval month dtype mapping * Add datetimesubOperator * Uncomment timestampdiff literal tests * Update logic for handling interval_months for pandas/cudf series and scalars * Add negative diff testcases, and gpu tests * Update reinterpret and timedelta to explicitly cast to int64 instead of int * Simplify cast_column_to_type mapping logic * Add scalar handling to castOperation and reuse it for reinterpret Co-authored-by: rajagurnath <[email protected]> * Relax jsonschema testing dependency (#546) * Update upstream testing workflows (#536) * Use dask nightly conda packages for upstream testing * Add independent cluster testing to nightly upstream CI [test-upstream] * Remove unnecessary dask install [test-upstream] * Remove strict channel policy to allow nightly dask installs * Use nightly Dask packages in independent cluster test [test-upstream] * Use channels argument to install Dask conda nightlies [test-upstream] * Fix channel expression * [test-upstream] * Need to add mamba update command to get dask conda nightlies * Use conda nightlies for dask-sql import test * Add import test to upstream nightly tests * [test-upstream] * Make sure we have nightly Dask for import tests [test-upstream] * Fix pyarrow / cloudpickle failures in cluster testing (#553) * Explicitly install libstdcxx-ng in clusters * Make pyarrow dependency consistent across testing * Make libstdcxx-ng dep a min version * Add cloudpickle to cluster dependencies * cloudpickle must be in the scheduler environment * Bump cloudpickle version * Move cloudpickle install to workers * Fix pyarrow constraint in cluster spec * Use bash -l as default entrypoint for all jobs (#552) * Constrain dask/distributed for release (#563) 
* Unpin dask/distributed for development (#564) * Unpin dask/distributed post release * Remove dask/distributed version ceiling * update dask-sphinx-theme (#567) * Introduce subquery.py to handle subquery expressions * update ordering * Make sure scheduler has Dask nightlies in upstream cluster testing (#573) * Make sure scheduler has Dask nightlies in upstream cluster testing * empty commit to [test-upstream] * Update gpuCI `RAPIDS_VER` to `22.08` (#565) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * updates * Remove startswith function merged by mistake * [REVIEW] - Remove instance that are meant for the currently removed timestampdiff * Modify test environment pinnings to cover minimum versions (#555) * Remove black/isort deps as we prefer pre-commit * Unpin all non python/jdk dependencies * Minor package corrections for py3.9 jdk11 env * Set min version constraints for all non-testing dependencies * Pin all non-test deps for 3.8 testing * Bump sklearn min version to 1.0.0 * Bump pyarrow min version to 1.0.1 * Fix pip notation for fugue * Use unpinned deps for cluster testing for now * Add fugue deps to environments, bump pandas to 1.0.2 * Add back antlr4 version ceiling * Explicitly mark all fugue dependencies * Alter test_analyze to avoid rtol * Bump pandas to 1.0.5 to fix upstream numpy issues * Alter datetime casting util to dodge panda casting failures * Bump pandas to 1.1.0 for groupby dropna support * Simplify string dtype check for get_supported_aggregations * Add check_dtype=False back to test_group_by_nan * Bump cluster to python 3.9 * Bump fastapi to 0.69.0, resolve remaining JDBC failures * Typo - correct pandas version * Generalize test_multi_case_when's dtype check * Bump pandas to 1.1.1 to resolve flaky test failures * Constrain mlflow for windows python 3.8 testing * Selectors don't work for conda env files * Problems seem to persist in 1.1.1, bump to 1.1.2 * Remove accidental debug changes * 
[test-upstream] * Use python 3.9 for upstream cluster testing [test-upstream] * Updated missed pandas pinning * Unconstrain mlflow to see if Windows failures persist * Add min version for protobuf * Bump pyarrow min version to allow for newer protobuf versions * Don't move jar to local mvn repo (#579) * Add tests for intersection * Add tests for intersection * Add another intersection test, even more simple but for testing raw intersection * Use Timedelta when doing ReduceOperation(s) against datetime64 dtypes * Cleanup * Use an either/or strategy for converting to Timedelta objects * Support more than 2 operands for Timedelta conversions * fix merge issues, is_frame() function of call.py was removed accidentally before * Remove pytest that was testing Calcite exception messages. Calcite is no longer used so no need for this test * comment out gpu tests, will be enabled in datafusion-filter PR * Don't check dtype for failing test Co-authored-by: Richard (Rick) Zamora <[email protected]> Co-authored-by: Charles Blackmon-Luca <[email protected]> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Ayush Dattagupta <[email protected]> Co-authored-by: rajagurnath <[email protected]> Co-authored-by: Sarah Charlotte Johnson <[email protected]> Co-authored-by: ksonj <[email protected]>

* Add basic predicate-pushdown optimization (#433) * basic predicate-pushdown support * remove explict Dispatch class * use _Frame.fillna * cleanup comments * test coverage * improve test coverage * add xfail test for dt accessor in predicate and fix test_show.py * fix some naming issues * add config and use assert_eq * add logging events when predicate-pushdown bails * move bail logic earlier in function * address easier code review comments * typo fix * fix creation_info access bug * convert any expression to DNF * csv test coverage * include IN coverage * improve test rigor * address code review * skip parquet tests when deps are not installed * fix bug * add pyarrow dep to cluster workers * roll back test skipping changes Co-authored-by: Charles Blackmon-Luca <[email protected]> * Add workflow to keep datafusion dev branch up to date (#440) * Update gpuCI `RAPIDS_VER` to `22.06` (#434) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * Bump black to 22.3.0 (#443) * Check for ucx-py nightlies when updating gpuCI (#441) * Simplify gpuCI updating workflow * Add check for cuML nightly version * Add handling for newer `prompt_toolkit` versions in cmd tests (#447) * Add handling for newer prompt-toolkit version * Place compatibility code in _compat * Fix version for gha-find-replace (#446) * Update versions of Java dependencies (#445) * Update versions for java dependencies with cves * Rerun tests * Update jackson databind version (#449) * Update versions for java dependencies with cves * Rerun tests * update jackson-databind dependency * Disable SQL server functionality (#448) * Disable SQL server functionality * Update docs/source/server.rst Co-authored-by: Ayush Dattagupta <[email protected]> * Disable server at lowest possible level * Skip all server tests * Add tests to ensure server is disabled * Fix CVE fix test Co-authored-by: Ayush Dattagupta <[email protected]> * Update dask pinnings for release (#450) * Add Java 
source code to source distribution (#451) * Bump `httpclient` dependency (#453) * Revert "Disable SQL server functionality (#448)" This reverts commit 37a3a61. * Bump httpclient version * Unpin Dask/distributed versions (#452) * Unpin dask/distributed post release * Remove dask/distributed version ceiling * Add jsonschema to ci testing (#454) * Add jsonschema to ci env * Fix typo in config schema * Switch tests from `pd.testing.assert_frame_equal` to `dd.assert_eq` (#365) * Start moving tests to dd.assert_eq * Use assert_eq in datetime filter test * Resolve most resulting test failures * Resolve remaining test failures * Convert over tests * Convert more tests * Consolidate select limit cpu/gpu test * Remove remaining assert_series_equal * Remove explicit cudf imports from many tests * Resolve rex test failures * Remove some additional compute calls * Consolidate sorting tests with getfixturevalue * Fix failed join test * Remove breakpoint * Use custom assert_eq function for tests * Resolve test failures / seg faults * Remove unnecessary testing utils * Resolve local test failures * Generalize RAND test * Avoid closing client if using independent cluster * Fix failures on Windows * Resolve black failures * Make random test variables more clear * Set max pin on antlr4-python-runtime (#456) * Set max pin on antlr4-python-runtime due to incompatibilities with fugue_sql * update comment on antlr max pin version * Move / minimize number of cudf / dask-cudf imports (#480) * Move / minimize number of cudf / dask-cudf imports * Add tests for GPU-related errors * Fix unbound local error * Fix ddf value error * Use `map_partitions` to compute LIMIT / OFFSET (#517) * Use map_partitions to compute limit / offset * Use partition_info to extract partition_index * Use `dev` images for independent cluster testing (#518) * Switch to dask dev images * Use mamba for conda installs in images * Remove sleep call for installation * Use timeout / until to wait for cluster to be 
initialized * Add documentation for FugueSQL integrations (#523) * Add documentation for FugueSQL integrations * Minor nitpick around autodoc obj -> class * Timestampdiff support (#495) * added timestampdiff * initial work for timestampdiff * Added test cases for timestampdiff * Update interval month dtype mapping * Add datetimesubOperator * Uncomment timestampdiff literal tests * Update logic for handling interval_months for pandas/cudf series and scalars * Add negative diff testcases, and gpu tests * Update reinterpret and timedelta to explicitly cast to int64 instead of int * Simplify cast_column_to_type mapping logic * Add scalar handling to castOperation and reuse it for reinterpret Co-authored-by: rajagurnath <[email protected]> * Relax jsonschema testing dependency (#546) * Update upstream testing workflows (#536) * Use dask nightly conda packages for upstream testing * Add independent cluster testing to nightly upstream CI [test-upstream] * Remove unnecessary dask install [test-upstream] * Remove strict channel policy to allow nightly dask installs * Use nightly Dask packages in independent cluster test [test-upstream] * Use channels argument to install Dask conda nightlies [test-upstream] * Fix channel expression * [test-upstream] * Need to add mamba update command to get dask conda nightlies * Use conda nightlies for dask-sql import test * Add import test to upstream nightly tests * [test-upstream] * Make sure we have nightly Dask for import tests [test-upstream] * Fix pyarrow / cloudpickle failures in cluster testing (#553) * Explicitly install libstdcxx-ng in clusters * Make pyarrow dependency consistent across testing * Make libstdcxx-ng dep a min version * Add cloudpickle to cluster dependencies * cloudpickle must be in the scheduler environment * Bump cloudpickle version * Move cloudpickle install to workers * Fix pyarrow constraint in cluster spec * Use bash -l as default entrypoint for all jobs (#552) * Constrain dask/distributed for release (#563) 
* Unpin dask/distributed for development (#564) * Unpin dask/distributed post release * Remove dask/distributed version ceiling * update dask-sphinx-theme (#567) * Make sure scheduler has Dask nightlies in upstream cluster testing (#573) * Make sure scheduler has Dask nightlies in upstream cluster testing * empty commit to [test-upstream] * Update gpuCI `RAPIDS_VER` to `22.08` (#565) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * Modify test environment pinnings to cover minimum versions (#555) * Remove black/isort deps as we prefer pre-commit * Unpin all non python/jdk dependencies * Minor package corrections for py3.9 jdk11 env * Set min version constraints for all non-testing dependencies * Pin all non-test deps for 3.8 testing * Bump sklearn min version to 1.0.0 * Bump pyarrow min version to 1.0.1 * Fix pip notation for fugue * Use unpinned deps for cluster testing for now * Add fugue deps to environments, bump pandas to 1.0.2 * Add back antlr4 version ceiling * Explicitly mark all fugue dependencies * Alter test_analyze to avoid rtol * Bump pandas to 1.0.5 to fix upstream numpy issues * Alter datetime casting util to dodge panda casting failures * Bump pandas to 1.1.0 for groupby dropna support * Simplify string dtype check for get_supported_aggregations * Add check_dtype=False back to test_group_by_nan * Bump cluster to python 3.9 * Bump fastapi to 0.69.0, resolve remaining JDBC failures * Typo - correct pandas version * Generalize test_multi_case_when's dtype check * Bump pandas to 1.1.1 to resolve flaky test failures * Constrain mlflow for windows python 3.8 testing * Selectors don't work for conda env files * Problems seem to persist in 1.1.1, bump to 1.1.2 * Remove accidental debug changes * [test-upstream] * Use python 3.9 for upstream cluster testing [test-upstream] * Updated missed pandas pinning * Unconstrain mlflow to see if Windows failures persist * Add min version for protobuf * Bump pyarrow min 
version to allow for newer protobuf versions * Don't move jar to local mvn repo (#579) * Add max version constraint for `fugue` (#639) * Remove antlr4-python3-runtime constraint from 3.9+ test envs * Revert "Remove antlr4-python3-runtime constraint from 3.9+ test envs" This reverts commit ef30656. * Add max version constraint for fugue in 3.9+ envs * Constrain Fugue in remaining env/setup files * Clarify fugue constraint comments * Add pinning back to python 3.8 jdk11 tests * More reversions to python 3.8 jdk11 testing env * Add environment file & documentation for GPU tests (#633) * Add gpuCI environment file * Add documentation for GPU tests / environment * Add GPU testing to docs page * Validate UDF metadata (#641) * initial * improvements * bugfixes * Move UDF validation to registration, cache relevant info Co-authored-by: Charles Blackmon-Luca <[email protected]> * Set Dask-sql as the default Fugue Dask engine when installed (#640) * Set Dask-sql as the default Fugue Dask engine when installed * Set Dask-sql as the default Fugue Dask engine when installed * Add max version constraint for `fugue` (#639) * Remove antlr4-python3-runtime constraint from 3.9+ test envs * Revert "Remove antlr4-python3-runtime constraint from 3.9+ test envs" This reverts commit ef30656. 
* Add max version constraint for fugue in 3.9+ envs * Constrain Fugue in remaining env/setup files * Clarify fugue constraint comments * Add pinning back to python 3.8 jdk11 tests * More reversions to python 3.8 jdk11 testing env * update * update * update * fix tests * update tests * update a few things * update * fix * conda install fugue in testing envs * Remove diff from features notebook * Alter documentation to mention automatic registration of execution engine * Expand FugueSQL notebook * Don't manually close client in simple statement test Co-authored-by: Charles Blackmon-Luca <[email protected]> * Add Rust setup to upstream testing workflow * Resolve style failures * Bump fugue version in CI envs * Add back scalar case for cast operation * Resolve UDF failures * Resolve UDF failures for windows * Remove calcite-specific reinterpret Co-authored-by: Richard (Rick) Zamora <[email protected]> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Ayush Dattagupta <[email protected]> Co-authored-by: rajagurnath <[email protected]> Co-authored-by: Sarah Charlotte Johnson <[email protected]> Co-authored-by: ksonj <[email protected]> Co-authored-by: brandon-b-miller <[email protected]> Co-authored-by: Han Wang <[email protected]>

* First pass at datafusion parsing * updates * updates * updates * DaskSchema implementation for Python in Rust * updated mappings so that Python types map to PyArrow types which is the type also used by Datafusion Statements which are logical plans * Add ability to add columns to an existing DaskTable * Add ability to tables to be added to the DaskSchema * Completion of _get_ral() function in dask-sql. Still does not actually compute yet * Finished converting base class and DaskRelDataType and DaskRelDataTypeField * Can make a very simple pass of a projection on a TableScan operation query work now * updates * Allow for the rough registration of Schemas to the DaskSQLContext * pytest test_context.py working/checkpoint * all unit tests passing/checkpoint * checkpoint * Update on test_select.py * Refactor setup.py * Refactored Rust code to traverse the AST SQL parse tree * Datafusion aggregate (#471) * Add basic predicate-pushdown optimization (#433) * basic predicate-pushdown support * remove explict Dispatch class * use _Frame.fillna * cleanup comments * test coverage * improve test coverage * add xfail test for dt accessor in predicate and fix test_show.py * fix some naming issues * add config and use assert_eq * add logging events when predicate-pushdown bails * move bail logic earlier in function * address easier code review comments * typo fix * fix creation_info access bug * convert any expression to DNF * csv test coverage * include IN coverage * improve test rigor * address code review * skip parquet tests when deps are not installed * fix bug * add pyarrow dep to cluster workers * roll back test skipping changes Co-authored-by: Charles Blackmon-Luca <[email protected]> * Add workflow to keep datafusion dev branch up to date (#440) * Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral * Updates for test_filter * more of test_filter.py working with the exception of some date pytests * Updates to dates and parsing dates like postgresql does * 
Update gpuCI `RAPIDS_VER` to `22.06` (#434) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * Bump black to 22.3.0 (#443) * Check for ucx-py nightlies when updating gpuCI (#441) * Simplify gpuCI updating workflow * Add check for cuML nightly version * Refactored to adjust for better type management * Refactor schema and statements * update types * fix syntax issues and renamed function name calls * Add handling for newer `prompt_toolkit` versions in cmd tests (#447) * Add handling for newer prompt-toolkit version * Place compatibility code in _compat * Fix version for gha-find-replace (#446) * Improved error handling and code clean up * move pieces of logical.rs to seperated files to ensure code readability * left join working * Update versions of Java dependencies (#445) * Update versions for java dependencies with cves * Rerun tests * Update jackson databind version (#449) * Update versions for java dependencies with cves * Rerun tests * update jackson-databind dependency * Disable SQL server functionality (#448) * Disable SQL server functionality * Update docs/source/server.rst Co-authored-by: Ayush Dattagupta <[email protected]> * Disable server at lowest possible level * Skip all server tests * Add tests to ensure server is disabled * Fix CVE fix test Co-authored-by: Ayush Dattagupta <[email protected]> * Update dask pinnings for release (#450) * Add Java source code to source distribution (#451) * Bump `httpclient` dependency (#453) * Revert "Disable SQL server functionality (#448)" This reverts commit 37a3a61fb13b0c56fcc10bf8ef01f4885a58dae8. 
* Bump httpclient version * Unpin Dask/distributed versions (#452) * Unpin dask/distributed post release * Remove dask/distributed version ceiling * Add jsonschema to ci testing (#454) * Add jsonschema to ci env * Fix typo in config schema * Switch tests from `pd.testing.assert_frame_equal` to `dd.assert_eq` (#365) * Start moving tests to dd.assert_eq * Use assert_eq in datetime filter test * Resolve most resulting test failures * Resolve remaining test failures * Convert over tests * Convert more tests * Consolidate select limit cpu/gpu test * Remove remaining assert_series_equal * Remove explicit cudf imports from many tests * Resolve rex test failures * Remove some additional compute calls * Consolidate sorting tests with getfixturevalue * Fix failed join test * Remove breakpoint * Use custom assert_eq function for tests * Resolve test failures / seg faults * Remove unnecessary testing utils * Resolve local test failures * Generalize RAND test * Avoid closing client if using independent cluster * Fix failures on Windows * Resolve black failures * Make random test variables more clear * First basic working checkpoint for group by * Set max pin on antlr4-python-runtime (#456) * Set max pin on antlr4-python-runtime due to incompatibilities with fugue_sql * update comment on antlr max pin version * Updates to style * stage pre-commit changes for upstream merge * Fix black failures * Updates to Rust formatting * Fix rust lint and clippy * Remove jar building step which is no longer needed * Remove Java from github workflows matrix * Removes jar and Java references from test.yml * Update Release workflow to remove references to Java * Update rust.yml to remove references from linux-build-lib * Add pre-commit.sh file to provide pre-commit support for Rust in a convenient script * Removed overlooked jdk references * cargo clippy auto fixes * Address all Rust clippy warnings * Include setuptools-rust in conda build recipie * Include setuptools-rust in conda build 
recipie, in host and run * Adjustments for conda build, committing for others to help with error and see it occurring in CI * Include sql.yaml in package files * Include pyarrow in run section of conda build to ensure tests pass * include setuptools-rust in host and run of conda since removing caused errors * to_string() method had been removed in rust and not removed here, caused conda run_test.py to fail when this line was hit * Replace commented out tests with pytest.skip and bump version of pyarrow to 7.0.0 * Fix setup.py syntax issue introduced on last commit by find/replace * Rename Datafusion -> DataFusion and Apache DataFusion -> Arrow DataFusion * Fix docs build environment * Include Rust compiler in docs environment * Bump Rust compiler version to 1.59 * Ok, well readthedocs didn't like that * Store libdask_planner.so and retrieve it between github workflows * Cache the Rust library binary * Remove Cargo.lock from git * Remove unused datafusion-expr crate * Build datafusion at each test step instead of caching binaries * Remove maven and jar cache steps from test-upstream.yaml * Removed dangling 'build' workflow step reference * Lowered PyArrow version to 6.0.1 since cudf has a hard requirement on that version for the version of cudf we are using * Add Rust build step to test in dask cluster * Install setuptools-rust for pip to use for bare requirements import * Include pyarrow 6.0.1 via conda as a bare minimum dependency * Remove cudf dependency for python 3.9 which is causing build issues on windows * Address documentation from review * Install Rust as readthedocs post_create_environment step * Run rust install non-interactively * Run rust install non-interactively * Rust isn't available in PyPi so remove that dependency * Append ~/.cargo/bin to the PATH * Print out some environment information for debugging * Print out some environment information for debugging * More - Increase verbosity * More - Increase verbosity * More - Increase verbosity * Switch 
RTD over to use Conda instead of Pip since having issues with Rust and pip * Try to use mamba for building docs environment * Partial review suggestion address, checking CI still works * Skip mistakenly enabled tests * Use DataFusion master branch, and fix syntax issues related to the version bump * More updates after bumping DataFusion version to master * Use actions-rs in github workflows debug flag for setup.py * Remove setuptools-rust from conda * Use re-exported Rust types for BuiltinScalarFunction * Move python imports to TYPE_CHECKING section where applicable * Address review concerns and remove pre-commit.sh file * Pin to a specific github rev for DataFusion Co-authored-by: Richard (Rick) Zamora <[email protected]> Co-authored-by: Charles Blackmon-Luca <[email protected]> Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> Co-authored-by: Ayush Dattagupta <[email protected]> * Bump DataFusion version (#494) * bump DataFusion version * remove unnecessary downcasts and use separate structs for TableSource and TableProvider * Basic DataFusion Select Functionality (#489) * Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral * Updates for test_filter * more of test_filter.py working with the exception of some date pytests * Add workflow to keep datafusion dev branch up to date (#440) * Include setuptools-rust in conda build recipie, in host and run * Remove PyArrow dependency * rebase with datafusion-sql-planner * refactor changes that were inadvertent during rebase * timestamp with loglca time zone * Include RelDataType work * Include RelDataType work * Introduced SqlTypeName Enum in Rust and mappings for Python * impl PyExpr.getIndex() * add getRowType() for logical.rs * Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes * use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict * linter changes, why did that work on my local pre-commit?? 
* linter changes, why did that work on my local pre-commit?? * Convert final strs to SqlTypeName Enum * removed a few print statements * Temporarily disable conda run_test.py script since it uses features not yet implemented * expose fromString method for SqlTypeName to use Enums instead of strings for type checking * expanded SqlTypeName from_string() support * accept INT as INTEGER * Remove print statements * Default to UTC if tz is None * Delegate timezone handling to the arrow library * Updates from review Co-authored-by: Charles Blackmon-Luca <[email protected]> * Allow for Cast parsing and logicalplan (#498) * Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral * Updates for test_filter * more of test_filter.py working with the exception of some date pytests * Add workflow to keep datafusion dev branch up to date (#440) * Include setuptools-rust in conda build recipie, in host and run * Remove PyArrow dependency * rebase with datafusion-sql-planner * refactor changes that were inadvertent during rebase * timestamp with loglca time zone * Include RelDataType work * Include RelDataType work * Introduced SqlTypeName Enum in Rust and mappings for Python * impl PyExpr.getIndex() * add getRowType() for logical.rs * Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes * use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict * linter changes, why did that work on my local pre-commit?? * linter changes, why did that work on my local pre-commit?? 
* Convert final strs to SqlTypeName Enum * removed a few print statements * commit to share with colleague * updates * Temporarily disable conda run_test.py script since it uses features not yet implemented * formatting after upstream merge * expose fromString method for SqlTypeName to use Enums instead of strings for type checking * expanded SqlTypeName from_string() support * accept INT as INTEGER * checkpoint * Refactor PyExpr by removing From trait, and using recursion to expand expression list for rex calls * skip test that uses create statement for gpuci * uncommented pytests * uncommented pytests * code cleanup for review * code cleanup for review * Enabled more pytest that work now * Enabled more pytest that work now * Disable 2 pytest that are causing gpuCI issues. They will be address in a follow up PR * Mark just the GPU tests as skipped Co-authored-by: Charles Blackmon-Luca <[email protected]> * Minor code cleanup in row_type() (#504) * Minor code cleanup in row_type() * remove unwrap * Bump Rust version to 1.60 from 1.59 (#508) * Improve code for getting column name from expression (#509) * helper code for getting column name from expression * Update dask_planner/src/expression.rs Co-authored-by: Jeremy Dyer <[email protected]> * Update dask_planner/src/expression.rs Co-authored-by: Jeremy Dyer <[email protected]> * fix build * Improve error handling Co-authored-by: Jeremy Dyer <[email protected]> * Update exceptions that are thrown (#507) * Update exceptions that are thrown * Remove Java error regex formatting logic. 
Rust messages will be presented already formatted from Rust itself * Removed lingering test that was still trying to test out Java specific error messages * Update dask_planner/src/sql.rs Co-authored-by: Andy Grove <[email protected]> * clean up logical_relational_algebra function Co-authored-by: Andy Grove <[email protected]> * add support for expr_to_field for Expr::Sort expressions (#515) * reduce crate dependencies (#516) * Datafusion dsql explain (#511) * Planner: Add explain logical plan bindings * Planner: Add explain plan accessor to PyLogicalPlan * Python: Add Explain plan plugin * Python: Register explain plan plugin and add special check to directly return string results * Planner: Update imports and accessor logic after merge with upstream * Python: Add sql EXPLAIN tests * Planner: Replace the pub use with use for LogicalPlan * Port sort logic to the datafusion planner (#505) * Implement PySort logical_plan * Add a sort plan accessor to the logical plan * Python: Update the sort plugin * Python: Uncomment tests * PLanner: Update accessor pattern for concrete logical plan implementations * Test: Address review comments * add support for expr_to_field for Expr::Sort expressions * Planner: Update sort expr utilities and import cleanup * Python: Re-enable skipped sort tests * Python: Handle case where orderby column name is an alias * Apply suggestions from code review Remove redundant unwrap + re-wrap Co-authored-by: Andy Grove <[email protected]> * Style: Fix formatting * Planner: Remove public scope for LogicalPlan import * Python: Add more complex sort tests with alias that error right now * Python: Remove old commented code Co-authored-by: Andy Grove <[email protected]> * Add helper method to convert LogicalPlan to Python type (#522) * Add helper method to convert LogicalPlan to Python type * simplify more * Support CASE WHEN and BETWEEN (#502) * Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral * Updates for test_filter * more of 
test_filter.py working with the exception of some date pytests * Add workflow to keep datafusion dev branch up to date (#440) * Include setuptools-rust in conda build recipie, in host and run * Remove PyArrow dependency * rebase with datafusion-sql-planner * refactor changes that were inadvertent during rebase * timestamp with loglca time zone * Bump DataFusion version (#494) * bump DataFusion version * remove unnecessary downcasts and use separate structs for TableSource and TableProvider * Include RelDataType work * Include RelDataType work * Introduced SqlTypeName Enum in Rust and mappings for Python * impl PyExpr.getIndex() * add getRowType() for logical.rs * Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes * use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict * linter changes, why did that work on my local pre-commit?? * linter changes, why did that work on my local pre-commit?? * Convert final strs to SqlTypeName Enum * removed a few print statements * commit to share with colleague * updates * checkpoint * Temporarily disable conda run_test.py script since it uses features not yet implemented * formatting after upstream merge * expose fromString method for SqlTypeName to use Enums instead of strings for type checking * expanded SqlTypeName from_string() support * accept INT as INTEGER * tests update * checkpoint * checkpoint * Refactor PyExpr by removing From trait, and using recursion to expand expression list for rex calls * skip test that uses create statement for gpuci * Basic DataFusion Select Functionality (#489) * Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral * Updates for test_filter * more of test_filter.py working with the exception of some date pytests * Add workflow to keep datafusion dev branch up to date (#440) * Include setuptools-rust in conda build recipie, in host and run * Remove PyArrow dependency * rebase with datafusion-sql-planner * refactor 
changes that were inadvertent during rebase * timestamp with loglca time zone * Include RelDataType work * Include RelDataType work * Introduced SqlTypeName Enum in Rust and mappings for Python * impl PyExpr.getIndex() * add getRowType() for logical.rs * Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes * use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict * linter changes, why did that work on my local pre-commit?? * linter changes, why did that work on my local pre-commit?? * Convert final strs to SqlTypeName Enum * removed a few print statements * Temporarily disable conda run_test.py script since it uses features not yet implemented * expose fromString method for SqlTypeName to use Enums instead of strings for type checking * expanded SqlTypeName from_string() support * accept INT as INTEGER * Remove print statements * Default to UTC if tz is None * Delegate timezone handling to the arrow library * Updates from review Co-authored-by: Charles Blackmon-Luca <[email protected]> * updates for expression * uncommented pytests * uncommented pytests * code cleanup for review * code cleanup for review * Enabled more pytest that work now * Enabled more pytest that work now * Output Expression as String when BinaryExpr does not contain a named alias * Output Expression as String when BinaryExpr does not contain a named alias * Disable 2 pytest that are causing gpuCI issues. They will be address in a follow up PR * Handle Between operation for case-when * adjust timestamp casting * Refactor projection _column_name() logic to the _column_name logic in expression.rs * removed println! 
statements * Updates from review * refactor String::from() to .to_string() * When no ELSE statement is present in CASE/WHEN statement default to None * Remove println * Re-enable rex test that previously could not be run Co-authored-by: Charles Blackmon-Luca <[email protected]> Co-authored-by: Andy Grove <[email protected]> * Upgrade to DataFusion 8.0.0 (#533) * upgrade to DataFusion 8.0.0 * fmt * Enable passing tests (#539) * Uncomment working pytests * Enable PyTests that are now passing * Datafusion crossjoin (#521) * Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral * Updates for test_filter * more of test_filter.py working with the exception of some date pytests * Add workflow to keep datafusion dev branch up to date (#440) * Include setuptools-rust in conda build recipe, in host and run * Remove PyArrow dependency * rebase with datafusion-sql-planner * refactor changes that were inadvertent during rebase * timestamp with local time zone * Bump DataFusion version (#494) * bump DataFusion version * remove unnecessary downcasts and use separate structs for TableSource and TableProvider * Include RelDataType work * Include RelDataType work * Introduced SqlTypeName Enum in Rust and mappings for Python * impl PyExpr.getIndex() * add getRowType() for logical.rs * Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes * use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict * linter changes, why did that work on my local pre-commit?? * linter changes, why did that work on my local pre-commit?? 
* Convert final strs to SqlTypeName Enum * removed a few print statements * commit to share with colleague * updates * checkpoint * Temporarily disable conda run_test.py script since it uses features not yet implemented * formatting after upstream merge * expose fromString method for SqlTypeName to use Enums instead of strings for type checking * expanded SqlTypeName from_string() support * accept INT as INTEGER * tests update * checkpoint * checkpoint * Refactor PyExpr by removing From trait, and using recursion to expand expression list for rex calls * skip test that uses create statement for gpuci * Basic DataFusion Select Functionality (#489) * Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral * Updates for test_filter * more of test_filter.py working with the exception of some date pytests * Add workflow to keep datafusion dev branch up to date (#440) * Include setuptools-rust in conda build recipie, in host and run * Remove PyArrow dependency * rebase with datafusion-sql-planner * refactor changes that were inadvertent during rebase * timestamp with loglca time zone * Include RelDataType work * Include RelDataType work * Introduced SqlTypeName Enum in Rust and mappings for Python * impl PyExpr.getIndex() * add getRowType() for logical.rs * Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes * use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict * linter changes, why did that work on my local pre-commit?? * linter changes, why did that work on my local pre-commit?? 
* Convert final strs to SqlTypeName Enum * removed a few print statements * Temporarily disable conda run_test.py script since it uses features not yet implemented * expose fromString method for SqlTypeName to use Enums instead of strings for type checking * expanded SqlTypeName from_string() support * accept INT as INTEGER * Remove print statements * Default to UTC if tz is None * Delegate timezone handling to the arrow library * Updates from review Co-authored-by: Charles Blackmon-Luca <[email protected]> * updates for expression * uncommented pytests * uncommented pytests * code cleanup for review * code cleanup for review * Enabled more pytest that work now * Enabled more pytest that work now * Output Expression as String when BinaryExpr does not contain a named alias * Output Expression as String when BinaryExpr does not contain a named alias * Disable 2 pytest that are causing gpuCI issues. They will be address in a follow up PR * Handle Between operation for case-when * adjust timestamp casting * Refactor projection _column_name() logic to the _column_name logic in expression.rs * removed println! 
statements * Updates from review * refactor String::from() to .to_string() * Fix mappings * Add cross_join.py and cross_join.rs * Add pytest for cross_join * Address review comments * Fix module import issue where typo was introduced * manually supply test fixtures * Remove request from test method signature Co-authored-by: Charles Blackmon-Luca <[email protected]> Co-authored-by: Andy Grove <[email protected]> * Implement TryFrom for plans (#543) * impl TryFrom for plans * fix todo * fix compilation error * code cleanup * revert change to error message * simplify code * cargo fmt * Support for LIMIT clause with DataFusion (#529) * Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral * Updates for test_filter * more of test_filter.py working with the exception of some date pytests * Add workflow to keep datafusion dev branch up to date (#440) * Include setuptools-rust in conda build recipie, in host and run * Remove PyArrow dependency * rebase with datafusion-sql-planner * refactor changes that were inadvertent during rebase * timestamp with loglca time zone * Bump DataFusion version (#494) * bump DataFusion version * remove unnecessary downcasts and use separate structs for TableSource and TableProvider * Include RelDataType work * Include RelDataType work * Introduced SqlTypeName Enum in Rust and mappings for Python * impl PyExpr.getIndex() * add getRowType() for logical.rs * Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes * use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict * linter changes, why did that work on my local pre-commit?? * linter changes, why did that work on my local pre-commit?? 
* Convert final strs to SqlTypeName Enum * removed a few print statements * commit to share with colleague * updates * checkpoint * Temporarily disable conda run_test.py script since it uses features not yet implemented * formatting after upstream merge * expose fromString method for SqlTypeName to use Enums instead of strings for type checking * expanded SqlTypeName from_string() support * accept INT as INTEGER * tests update * checkpoint * checkpoint * Refactor PyExpr by removing From trait, and using recursion to expand expression list for rex calls * skip test that uses create statement for gpuci * Basic DataFusion Select Functionality (#489) * Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral * Updates for test_filter * more of test_filter.py working with the exception of some date pytests * Add workflow to keep datafusion dev branch up to date (#440) * Include setuptools-rust in conda build recipie, in host and run * Remove PyArrow dependency * rebase with datafusion-sql-planner * refactor changes that were inadvertent during rebase * timestamp with loglca time zone * Include RelDataType work * Include RelDataType work * Introduced SqlTypeName Enum in Rust and mappings for Python * impl PyExpr.getIndex() * add getRowType() for logical.rs * Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes * use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict * linter changes, why did that work on my local pre-commit?? * linter changes, why did that work on my local pre-commit?? 
* Convert final strs to SqlTypeName Enum * removed a few print statements * Temporarily disable conda run_test.py script since it uses features not yet implemented * expose fromString method for SqlTypeName to use Enums instead of strings for type checking * expanded SqlTypeName from_string() support * accept INT as INTEGER * Remove print statements * Default to UTC if tz is None * Delegate timezone handling to the arrow library * Updates from review Co-authored-by: Charles Blackmon-Luca <[email protected]> * updates for expression * uncommented pytests * uncommented pytests * code cleanup for review * code cleanup for review * Enabled more pytest that work now * Enabled more pytest that work now * Output Expression as String when BinaryExpr does not contain a named alias * Output Expression as String when BinaryExpr does not contain a named alias * Disable 2 pytest that are causing gpuCI issues. They will be address in a follow up PR * Handle Between operation for case-when * adjust timestamp casting * Refactor projection _column_name() logic to the _column_name logic in expression.rs * removed println! 
statements * Updates from review * Add Offset and point to repo with offset in datafusion * Introduce offset * limit updates * commit before upstream merge * Code formatting * update Cargo.toml to use Arrow-DataFusion version with LIMIT logic * Bump DataFusion version to get changes around variant_name() * Use map partitions for determining the offset * Refactor offset partition func * Update to use TryFrom logic * Add cloudpickle to independent scheduler requirements Co-authored-by: Charles Blackmon-Luca <[email protected]> Co-authored-by: Andy Grove <[email protected]> * Support Joins using DataFusion planner/parser (#512) * Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral * Updates for test_filter * more of test_filter.py working with the exception of some date pytests * Add workflow to keep datafusion dev branch up to date (#440) * Include setuptools-rust in conda build recipie, in host and run * Remove PyArrow dependency * rebase with datafusion-sql-planner * refactor changes that were inadvertent during rebase * timestamp with loglca time zone * Bump DataFusion version (#494) * bump DataFusion version * remove unnecessary downcasts and use separate structs for TableSource and TableProvider * Include RelDataType work * Include RelDataType work * Introduced SqlTypeName Enum in Rust and mappings for Python * impl PyExpr.getIndex() * add getRowType() for logical.rs * Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes * use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict * linter changes, why did that work on my local pre-commit?? * linter changes, why did that work on my local pre-commit?? 
* Convert final strs to SqlTypeName Enum * removed a few print statements * commit to share with colleague * updates * checkpoint * Temporarily disable conda run_test.py script since it uses features not yet implemented * formatting after upstream merge * expose fromString method for SqlTypeName to use Enums instead of strings for type checking * expanded SqlTypeName from_string() support * accept INT as INTEGER * tests update * checkpoint * checkpoint * Refactor PyExpr by removing From trait, and using recursion to expand expression list for rex calls * skip test that uses create statement for gpuci * Basic DataFusion Select Functionality (#489) * Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral * Updates for test_filter * more of test_filter.py working with the exception of some date pytests * Add workflow to keep datafusion dev branch up to date (#440) * Include setuptools-rust in conda build recipie, in host and run * Remove PyArrow dependency * rebase with datafusion-sql-planner * refactor changes that were inadvertent during rebase * timestamp with loglca time zone * Include RelDataType work * Include RelDataType work * Introduced SqlTypeName Enum in Rust and mappings for Python * impl PyExpr.getIndex() * add getRowType() for logical.rs * Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes * use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict * linter changes, why did that work on my local pre-commit?? * linter changes, why did that work on my local pre-commit?? 
* Convert final strs to SqlTypeName Enum * removed a few print statements * Temporarily disable conda run_test.py script since it uses features not yet implemented * expose fromString method for SqlTypeName to use Enums instead of strings for type checking * expanded SqlTypeName from_string() support * accept INT as INTEGER * Remove print statements * Default to UTC if tz is None * Delegate timezone handling to the arrow library * Updates from review Co-authored-by: Charles Blackmon-Luca <[email protected]> * updates for expression * uncommented pytests * uncommented pytests * code cleanup for review * code cleanup for review * Enabled more pytest that work now * Enabled more pytest that work now * Output Expression as String when BinaryExpr does not contain a named alias * Output Expression as String when BinaryExpr does not contain a named alias * Disable 2 pytest that are causing gpuCI issues. They will be address in a follow up PR * Handle Between operation for case-when * adjust timestamp casting * Refactor projection _column_name() logic to the _column_name logic in expression.rs * removed println! 
statements * introduce join getCondition() logic for retrieving the combining Rex logic for joining * Updates from review * Add Offset and point to repo with offset in datafusion * Introduce offset * limit updates * commit before upstream merge * Code formatting * update Cargo.toml to use Arrow-DataFusion version with LIMIT logic * Bump DataFusion version to get changes around variant_name() * Use map partitions for determining the offset * Merge with upstream * Rename underlying DataContainer's DataFrame instance to match the column container names * Adjust ColumnContainer mapping after join.py logic to ensure the backend mapping is reset * Add enumerate to column_{i} generation string to ensure columns exist in both dataframes * Adjust join schema logic to perform merge instead of join on rust side to avoid name collisions * Handle DataFusion COUNT(UInt8(1)) as COUNT(*) * commit before merge * Update function for gathering index of an expression * Update for review check * Adjust RelDataType to retrieve fully qualified column names * Adjust base.py to get fully qualified column name * Enable passing pytests in test_join.py * Adjust keys provided by getting backend column mapping name * Adjust output_col to not use the backend_column name for special reserved exprs * uncomment cross join pytest which works now * Uncomment passing pytests in test_select.py * Review updates * Add back complex join case condition, not just cross join but 'complex' joins Co-authored-by: Charles Blackmon-Luca <[email protected]> Co-authored-by: Andy Grove <[email protected]> * Datafusion is not (#557) * Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral * Updates for test_filter * more of test_filter.py working with the exception of some date pytests * Add workflow to keep datafusion dev branch up to date (#440) * Include setuptools-rust in conda build recipe, in host and run * Remove PyArrow dependency * rebase with datafusion-sql-planner * refactor changes that were 
inadvertent during rebase * timestamp with loglca time zone * Bump DataFusion version (#494) * bump DataFusion version * remove unnecessary downcasts and use separate structs for TableSource and TableProvider * Include RelDataType work * Include RelDataType work * Introduced SqlTypeName Enum in Rust and mappings for Python * impl PyExpr.getIndex() * add getRowType() for logical.rs * Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes * use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict * linter changes, why did that work on my local pre-commit?? * linter changes, why did that work on my local pre-commit?? * Convert final strs to SqlTypeName Enum * removed a few print statements * commit to share with colleague * updates * checkpoint * Temporarily disable conda run_test.py script since it uses features not yet implemented * formatting after upstream merge * expose fromString method for SqlTypeName to use Enums instead of strings for type checking * expanded SqlTypeName from_string() support * accept INT as INTEGER * tests update * checkpoint * checkpoint * Refactor PyExpr by removing From trait, and using recursion to expand expression list for rex calls * skip test that uses create statement for gpuci * Basic DataFusion Select Functionality (#489) * Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral * Updates for test_filter * more of test_filter.py working with the exception of some date pytests * Add workflow to keep datafusion dev branch up to date (#440) * Include setuptools-rust in conda build recipie, in host and run * Remove PyArrow dependency * rebase with datafusion-sql-planner * refactor changes that were inadvertent during rebase * timestamp with loglca time zone * Include RelDataType work * Include RelDataType work * Introduced SqlTypeName Enum in Rust and mappings for Python * impl PyExpr.getIndex() * add getRowType() for logical.rs * Introduce DaskTypeMap for storing 
correlating SqlTypeName and DataTypes * use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict * linter changes, why did that work on my local pre-commit?? * linter changes, why did that work on my local pre-commit?? * Convert final strs to SqlTypeName Enum * removed a few print statements * Temporarily disable conda run_test.py script since it uses features not yet implemented * expose fromString method for SqlTypeName to use Enums instead of strings for type checking * expanded SqlTypeName from_string() support * accept INT as INTEGER * Remove print statements * Default to UTC if tz is None * Delegate timezone handling to the arrow library * Updates from review Co-authored-by: Charles Blackmon-Luca <[email protected]> * updates for expression * uncommented pytests * uncommented pytests * code cleanup for review * code cleanup for review * Enabled more pytest that work now * Enabled more pytest that work now * Output Expression as String when BinaryExpr does not contain a named alias * Output Expression as String when BinaryExpr does not contain a named alias * Disable 2 pytest that are causing gpuCI issues. They will be address in a follow up PR * Handle Between operation for case-when * adjust timestamp casting * Refactor projection _column_name() logic to the _column_name logic in expression.rs * removed println! 
statements * introduce join getCondition() logic for retrieving the combining Rex logic for joining * Updates from review * Add Offset and point to repo with offset in datafusion * Introduce offset * limit updates * commit before upstream merge * Code formatting * update Cargo.toml to use Arrow-DataFusion version with LIMIT logic * Bump DataFusion version to get changes around variant_name() * Use map partitions for determining the offset * Merge with upstream * Rename underlying DataContainer's DataFrame instance to match the column container names * Adjust ColumnContainer mapping after join.py logic to ensure the backend mapping is reset * Add enumerate to column_{i} generation string to ensure columns exist in both dataframes * Adjust join schema logic to perform merge instead of join on rust side to avoid name collisions * Handle DataFusion COUNT(UInt8(1)) as COUNT(*) * commit before merge * Update function for gathering index of an expression * Update for review check * Adjust RelDataType to retrieve fully qualified column names * Adjust base.py to get fully qualified column name * Enable passing pytests in test_join.py * Adjust keys provided by getting backend column mapping name * Adjust output_col to not use the backend_column name for special reserved exprs * uncomment cross join pytest which works now * Uncomment passing pytests in test_select.py * Review updates * Add back complex join case condition, not just cross join but 'complex' joins * Add support for 'is not null' clause Co-authored-by: Charles Blackmon-Luca <[email protected]> Co-authored-by: Andy Grove <[email protected]> * [REVIEW] Add support for `UNION` (#542) Fixes: #470 This PR adds UNION support in dask-sql which uses datafusion's union logic. 
* [REVIEW] Fix issue with duplicates in column renaming (#559) * initial commit * add union.rs * fix * debug * updates * handle multiple inputs * xfail * remove xfails * style * cleanup * un-xfail * address reviews * fix projection issue * address reviews * enable tests (#560) Looks like LIMIT has been ported, hence enabling more tests across different files. * Add CODEOWNERS file (#562) * Add CODEOWNERS file * Can only specify users with write access * Remove accidental dev commit * Upgrade DataFusion version & support non-equijoin join conditions (#566) * use latest datafusion * fix regression * remove DaskTableProvider * simplify code * Add ayushdg and galipremsagar to rust CODEOWNERS (#572) * Enable DataFusion CBO and introduce DaskSqlOptimizer (#558) * Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral * Updates for test_filter * more of test_filter.py working with the exception of some date pytests * Add workflow to keep datafusion dev branch up to date (#440) * Include setuptools-rust in conda build recipe, in host and run * Remove PyArrow dependency * rebase with datafusion-sql-planner * refactor changes that were inadvertent during rebase * timestamp with local time zone * Bump DataFusion version (#494) * bump DataFusion version * remove unnecessary downcasts and use separate structs for TableSource and TableProvider * Include RelDataType work * Include RelDataType work * Introduced SqlTypeName Enum in Rust and mappings for Python * impl PyExpr.getIndex() * add getRowType() for logical.rs * Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes * use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict * linter changes, why did that work on my local pre-commit?? * linter changes, why did that work on my local pre-commit?? 
* Convert final strs to SqlTypeName Enum * removed a few print statements * commit to share with colleague * updates * checkpoint * Temporarily disable conda run_test.py script since it uses features not yet implemented * formatting after upstream merge * expose fromString method for SqlTypeName to use Enums instead of strings for type checking * expanded SqlTypeName from_string() support * accept INT as INTEGER * tests update * checkpoint * checkpoint * Refactor PyExpr by removing From trait, and using recursion to expand expression list for rex calls * skip test that uses create statement for gpuci * Basic DataFusion Select Functionality (#489) * Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral * Updates for test_filter * more of test_filter.py working with the exception of some date pytests * Add workflow to keep datafusion dev branch up to date (#440) * Include setuptools-rust in conda build recipie, in host and run * Remove PyArrow dependency * rebase with datafusion-sql-planner * refactor changes that were inadvertent during rebase * timestamp with loglca time zone * Include RelDataType work * Include RelDataType work * Introduced SqlTypeName Enum in Rust and mappings for Python * impl PyExpr.getIndex() * add getRowType() for logical.rs * Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes * use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict * linter changes, why did that work on my local pre-commit?? * linter changes, why did that work on my local pre-commit?? 
* Convert final strs to SqlTypeName Enum * removed a few print statements * Temporarily disable conda run_test.py script since it uses features not yet implemented * expose fromString method for SqlTypeName to use Enums instead of strings for type checking * expanded SqlTypeName from_string() support * accept INT as INTEGER * Remove print statements * Default to UTC if tz is None * Delegate timezone handling to the arrow library * Updates from review Co-authored-by: Charles Blackmon-Luca <[email protected]> * updates for expression * uncommented pytests * uncommented pytests * code cleanup for review * code cleanup for review * Enabled more pytest that work now * Enabled more pytest that work now * Output Expression as String when BinaryExpr does not contain a named alias * Output Expression as String when BinaryExpr does not contain a named alias * Disable 2 pytest that are causing gpuCI issues. They will be address in a follow up PR * Handle Between operation for case-when * adjust timestamp casting * Refactor projection _column_name() logic to the _column_name logic in expression.rs * removed println! 
statements * introduce join getCondition() logic for retrieving the combining Rex logic for joining * Updates from review * Add Offset and point to repo with offset in datafusion * Introduce offset * limit updates * commit before upstream merge * Code formatting * update Cargo.toml to use Arrow-DataFusion version with LIMIT logic * Bump DataFusion version to get changes around variant_name() * Use map partitions for determining the offset * Merge with upstream * Rename underlying DataContainer's DataFrame instance to match the column container names * Adjust ColumnContainer mapping after join.py logic to entire the bakend mapping is reset * Add enumerate to column_{i} generation string to ensure columns exist in both dataframes * Adjust join schema logic to perform merge instead of join on rust side to avoid name collisions * Handle DataFusion COUNT(UInt8(1)) as COUNT(*) * commit before merge * Update function for gathering index of a expression * Update for review check * Adjust RelDataType to retrieve fully qualified column names * Adjust base.py to get fully qualified column name * Enable passing pytests in test_join.py * Adjust keys provided by getting backend column mapping name * Adjust output_col to not use the backend_column name for special reserved exprs * uncomment cross join pytest which works now * Uncomment passing pytests in test_select.py * Review updates * Add back complex join case condition, not just cross join but 'complex' joins * Enable DataFusion CBO logic * Disable EliminateFilter optimization rule * updates * Disable tests that hit CBO generated plan edge cases of yet to be implemented logic * [REVIEW] - Modifiy sql.skip_optimize to use dask_config.get and remove used method parameter * [REVIEW] - change name of configuration from skip_optimize to optimize * [REVIEW] - Add OptimizeException catch and raise statements back * Found issue where backend column names which are results of a single aggregate resulting column, COUNT(*) for example, 
need to get the first agg df column since names are not valid * Remove SQL from OptimizationException * skip tests that CBO plan reorganization causes missing features to be present Co-authored-by: Charles Blackmon-Luca <[email protected]> Co-authored-by: Andy Grove <[email protected]> * Only use the specific DataFusion crates that we need (#568) * use specific datafusion crates * clean up imports * Fix some clippy warnings (#574) * Datafusion invalid projection (#571) * Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral * Updates for test_filter * more of test_filter.py working with the exception of some date pytests * Add workflow to keep datafusion dev branch up to date (#440) * Include setuptools-rust in conda build recipie, in host and run * Remove PyArrow dependency * rebase with datafusion-sql-planner * refactor changes that were inadvertent during rebase * timestamp with loglca time zone * Bump DataFusion version (#494) * bump DataFusion version * remove unnecessary downcasts and use separate structs for TableSource and TableProvider * Include RelDataType work * Include RelDataType work * Introduced SqlTypeName Enum in Rust and mappings for Python * impl PyExpr.getIndex() * add getRowType() for logical.rs * Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes * use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict * linter changes, why did that work on my local pre-commit?? * linter changes, why did that work on my local pre-commit?? 
* Convert final strs to SqlTypeName Enum * removed a few print statements * commit to share with colleague * updates * checkpoint * Temporarily disable conda run_test.py script since it uses features not yet implemented * formatting after upstream merge * expose fromString method for SqlTypeName to use Enums instead of strings for type checking * expanded SqlTypeName from_string() support * accept INT as INTEGER * tests update * checkpoint * checkpoint * Refactor PyExpr by removing From trait, and using recursion to expand expression list for rex calls * skip test that uses create statement for gpuci * Basic DataFusion Select Functionality (#489) * Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral * Updates for test_filter * more of test_filter.py working with the exception of some date pytests * Add workflow to keep datafusion dev branch up to date (#440) * Include setuptools-rust in conda build recipie, in host and run * Remove PyArrow dependency * rebase with datafusion-sql-planner * refactor changes that were inadvertent during rebase * timestamp with loglca time zone * Include RelDataType work * Include RelDataType work * Introduced SqlTypeName Enum in Rust and mappings for Python * impl PyExpr.getIndex() * add getRowType() for logical.rs * Introduce DaskTypeMap for storing correlating SqlTypeName and DataTypes * use str values instead of Rust Enums, Python is unable to Hash the Rust Enums if used in a dict * linter changes, why did that work on my local pre-commit?? * linter changes, why did that work on my local pre-commit?? 
* Convert final strs to SqlTypeName Enum * removed a few print statements * Temporarily disable conda run_test.py script since it uses features not yet implemented * expose fromString method for SqlTypeName to use Enums instead of strings for type checking * expanded SqlTypeName from_string() support * accept INT as INTEGER * Remove print statements * Default to UTC if tz is None * Delegate timezone handling to the arrow library * Updates from review Co-authored-by: Charles Blackmon-Luca <[email protected]> * updates for expression * uncommented pytests * uncommented pytests * code cleanup for review * code cleanup for review * Enabled more pytest that work now * Enabled more pytest that work now * Output Expression as String when BinaryExpr does not contain a named alias * Output Expression as String when BinaryExpr does not contain a named alias * Disable 2 pytest that are causing gpuCI issues. They will be address in a follow up PR * Handle Between operation for case-when * adjust timestamp casting * Refactor projection _column_name() logic to the _column_name logic in expression.rs * removed println! 
statements * introduce join getCondition() logic for retrieving the combining Rex logic for joining * Updates from review * Add Offset and point to repo with offset in datafusion * Introduce offset * limit updates * commit before upstream merge * Code formatting * update Cargo.toml to use Arrow-DataFusion version with LIMIT logic * Bump DataFusion version to get changes around variant_name() * Use map partitions for determining the offset * Merge with upstream * Rename underlying DataContainer's DataFrame instance to match the column container names * Adjust ColumnContainer mapping after join.py logic to entire the bakend mapping is reset * Add enumerate to column_{i} generation string to ensure columns exist in both dataframes * Adjust join schema logic to perform merge instead of join on rust side to avoid name collisions * Handle DataFusion COUNT(UInt8(1)) as COUNT(*) * commit before merge * Update function for gathering index of a expression * Update for review check * Adjust RelDataType to retrieve fully qualified column names * Adjust base.py to get fully qualified column name * Enable passing pytests in test_join.py * Adjust keys provided by getting backend column mapping name * Adjust output_col to not use the backend_column name for special reserved exprs * uncomment cross join pytest which works now * Uncomment passing pytests in test_select.py * Review updates * Add back complex join case condition, not just cross join but 'complex' joins * Enable DataFusion CBO logic * Disable EliminateFilter optimization rule * updates * Disable tests that hit CBO generated plan edge cases of yet to be implemented logic * [REVIEW] - Modifiy sql.skip_optimize to use dask_config.get and remove used method parameter * [REVIEW] - change name of configuration from skip_optimize to optimize * [REVIEW] - Add OptimizeException catch and raise statements back * Found issue where backend column names which are results of a single aggregate resulting column, COUNT(*) for example, 
need to get the first agg df column since names are not valid * Remove SQL from OptimizationException * skip tests that CBO plan reorganization causes missing features to be present * If TableScan contains projections use those instead of all of the TableColums for limiting columns read during table_scan * [REVIEW] remove compute(), remove temp row_type variable * [REVIEW] - Add test for projection pushdown * [REVIEW] - Add some more parametrized test combinations * [REVIEW] - Use iterator instead of for loop and simplify contains_projections * [REVIEW] - merge upstream and adjust imports * [REVIEW] - Rename pytest function and remove duplicate table creation Co-authored-by: Charles Blackmon-Luca <[email protected]> Co-authored-by: Andy Grove <[email protected]> * Datafusion upstream merge (#576) * Add basic predicate-pushdown optimization (#433) * basic predicate-pushdown support * remove explict Dispatch class * use _Frame.fillna * cleanup comments * test coverage * improve test coverage * add xfail test for dt accessor in predicate and fix test_show.py * fix some naming issues * add config and use assert_eq * add logging events when predicate-pushdown bails * move bail logic earlier in function * address easier code review comments * typo fix * fix creation_info access bug * convert any expression to DNF * csv test coverage * include IN coverage * improve test rigor * address code review * skip parquet tests when deps are not installed * fix bug * add pyarrow dep to cluster workers * roll back test skipping changes Co-authored-by: Charles Blackmon-Luca <[email protected]> * Add workflow to keep datafusion dev branch up to date (#440) * Update gpuCI `RAPIDS_VER` to `22.06` (#434) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * Bump black to 22.3.0 (#443) * Check for ucx-py nightlies when updating gpuCI (#441) * Simplify gpuCI updating workflow * Add check for cuML nightly version * Add handling for newer 
`prompt_toolkit` versions in cmd tests (#447) * Add handling for newer prompt-toolkit version * Place compatibility code in _compat * Fix version for gha-find-replace (#446) * Update versions of Java dependencies (#445) * Update versions for java dependencies with cves * Rerun tests * Update jackson databind version (#449) * Update versions for java dependencies with cves * Rerun tests * update jackson-databind dependency * Disable SQL server functionality (#448) * Disable SQL server functionality * Update docs/source/server.rst Co-authored-by: Ayush Dattagupta <[email protected]> * Disable server at lowest possible level * Skip all server tests * Add tests to ensure server is disabled * Fix CVE fix test Co-authored-by: Ayush Dattagupta <[email protected]> * Update dask pinnings for release (#450) * Add Java source code to source distribution (#451) * Bump `httpclient` dependency (#453) * Revert "Disable SQL server functionality (#448)" This reverts commit 37a3a61fb13b0c56fcc10bf8ef01f4885a58dae8. 
* Bump httpclient version * Unpin Dask/distributed versions (#452) * Unpin dask/distributed post release * Remove dask/distributed version ceiling * Add jsonschema to ci testing (#454) * Add jsonschema to ci env * Fix typo in config schema * Switch tests from `pd.testing.assert_frame_equal` to `dd.assert_eq` (#365) * Start moving tests to dd.assert_eq * Use assert_eq in datetime filter test * Resolve most resulting test failures * Resolve remaining test failures * Convert over tests * Convert more tests * Consolidate select limit cpu/gpu test * Remove remaining assert_series_equal * Remove explicit cudf imports from many tests * Resolve rex test failures * Remove some additional compute calls * Consolidate sorting tests with getfixturevalue * Fix failed join test * Remove breakpoint * Use custom assert_eq function for tests * Resolve test failures / seg faults * Remove unnecessary testing utils * Resolve local test failures * Generalize RAND test * Avoid closing client if using independent cluster * Fix failures on Windows * Resolve black failures * Make random test variables more clear * Set max pin on antlr4-python-runtime (#456) * Set max pin on antlr4-python-runtime due to incompatibilities with fugue_sql * update comment on antlr max pin version * Move / minimize number of cudf / dask-cudf imports (#480) * Move / minimize number of cudf / dask-cudf imports * Add tests for GPU-related errors * Fix unbound local error * Fix ddf value error * Use `map_partitions` to compute LIMIT / OFFSET (#517) * Use map_partitions to compute limit / offset * Use partition_info to extract partition_index * Use `dev` images for independent cluster testing (#518) * Switch to dask dev images * Use mamba for conda installs in images * Remove sleep call for installation * Use timeout / until to wait for cluster to be initialized * Add documentation for FugueSQL integrations (#523) * Add documentation for FugueSQL integrations * Minor nitpick around autodoc obj -> class * 
Timestampdiff support (#495) * added timestampdiff * initial work for timestampdiff * Added test cases for timestampdiff * Update interval month dtype mapping * Add datetimesubOperator * Uncomment timestampdiff literal tests * Update logic for handling interval_months for pandas/cudf series and scalars * Add negative diff testcases, and gpu tests * Update reinterpret and timedelta to explicitly cast to int64 instead of int * Simplify cast_column_to_type mapping logic * Add scalar handling to castOperation and reuse it for reinterpret Co-authored-by: rajagurnath <[email protected]> * Relax jsonschema testing dependency (#546) * Update upstream testing workflows (#536) * Use dask nightly conda packages for upstream testing * Add independent cluster testing to nightly upstream CI [test-upstream] * Remove unnecessary dask install [test-upstream] * Remove strict channel policy to allow nightly dask installs * Use nightly Dask packages in independent cluster test [test-upstream] * Use channels argument to install Dask conda nightlies [test-upstream] * Fix channel expression * [test-upstream] * Need to add mamba update command to get dask conda nightlies * Use conda nightlies for dask-sql import test * Add import test to upstream nightly tests * [test-upstream] * Make sure we have nightly Dask for import tests [test-upstream] * Fix pyarrow / cloudpickle failures in cluster testing (#553) * Explicitly install libstdcxx-ng in clusters * Make pyarrow dependency consistent across testing * Make libstdcxx-ng dep a min version * Add cloudpickle to cluster dependencies * cloudpickle must be in the scheduler environment * Bump cloudpickle version * Move cloudpickle install to workers * Fix pyarrow constraint in cluster spec * Use bash -l as default entrypoint for all jobs (#552) * Constrain dask/distributed for release (#563) * Unpin dask/distributed for development (#564) * Unpin dask/distributed post release * Remove dask/distributed version ceiling * update dask-sphinx-theme 
(#567) * Introduce subquery.py to handle subquery expressions * update ordering * Make sure scheduler has Dask nightlies in upstream cluster testing (#573) * Make sure scheduler has Dask nightlies in upstream cluster testing * empty commit to [test-upstream] * Update gpuCI `RAPIDS_VER` to `22.08` (#565) Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> * updates * Remove startswith function merged by mistake * [REVIEW] - Remove instance that are meant for the currently removed timestampdiff * Modify test environment pinnings to cover minimum versions (#555) * Remove black/isort deps as we prefer pre-commit * Unpin all non python/jdk dependencies * Minor package corrections for py3.9 jdk11 env * Set min version constraints for all non-testing dependencies * Pin all non-test deps for 3.8 testing * Bump sklearn min version to 1.0.0 * Bump pyarrow min version to 1.0.1 * Fix pip notation for fugue * Use unpinned deps for cluster testing for now * Add fugue deps to environments, bump pandas to 1.0.2 * Add back antlr4 version ceiling * Explicitly mark all fugue dependencies * Alter test_analyze to avoid rtol * Bump pandas to 1.0.5 to fix upstream numpy issues * Alter datetime casting util to dodge panda casting failures * Bump pandas to 1.1.0 for groupby dropna support * Simplify string dtype check for get_supported_aggregations * Add check_dtype=False back to test_group_by_nan * Bump cluster to python 3.9 * Bump fastapi to 0.69.0, resolve remaining JDBC failures * Typo - correct pandas version * Generalize test_multi_case_when's dtype check * Bump pandas to 1.1.1 to resolve flaky test failures * Constrain mlflow for windows python 3.8 testing * Selectors don't work for conda env files * Problems seem to persist in 1.1.1, bump to 1.1.2 * Remove accidental debug changes * [test-upstream] * Use python 3.9 for upstream cluster testing [test-upstream] * Updated missed pandas pinning * Unconstrain mlflow to see if Windows failures 
persist * Add min version for protobuf * Bump pyarrow min version to allow for newer protobuf versions * Don't move jar to local mvn repo (#579) * Add tests for intersection * Add tests for intersection * Add another intersection test, even more simple but for testing raw intersection * Use Timedelta when doing ReduceOperation(s) against datetime64 dtypes * Cleanup * Use an either/or strategy for converting to Timedelta objects * Support more than 2 operands for Timedelta conversions * fix merge issues, is_frame() function of call.py was removed accidentally before * Remove pytest …

Commit 9a18c78: Relax `jsonschema` testing dependency

ayushdg approved these changes May 20, 2022

charlesbluca merged commit 8ec3ed5 into dask-contrib:main May 20, 2022

charlesbluca deleted the relax-jsonschema branch July 19, 2022 10:38
