Datafusion aggregate (#471)
* Add basic predicate-pushdown optimization (#433)

* basic predicate-pushdown support

* remove explicit Dispatch class

* use _Frame.fillna

* cleanup comments

* test coverage

* improve test coverage

* add xfail test for dt accessor in predicate and fix test_show.py

* fix some naming issues

* add config and use assert_eq

* add logging events when predicate-pushdown bails

* move bail logic earlier in function

* address easier code review comments

* typo fix

* fix creation_info access bug

* convert any expression to DNF (see the DNF filters sketch after the commit message)

* csv test coverage

* include IN coverage

* improve test rigor

* address code review

* skip parquet tests when deps are not installed

* fix bug

* add pyarrow dep to cluster workers

* roll back test skipping changes

Co-authored-by: Charles Blackmon-Luca <[email protected]>

* Add workflow to keep datafusion dev branch up to date (#440)

* Condition for BinaryExpr, filter, input_ref, rexcall, and rexliteral

* Updates for test_filter

* more of test_filter.py working with the exception of some date pytests

* Updates to dates and parsing dates like postgresql does

* Update gpuCI `RAPIDS_VER` to `22.06` (#434)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Bump black to 22.3.0 (#443)

* Check for ucx-py nightlies when updating gpuCI (#441)

* Simplify gpuCI updating workflow

* Add check for cuML nightly version

* Refactored to adjust for better type management

* Refactor schema and statements

* update types

* fix syntax issues and renamed function calls

* Add handling for newer `prompt_toolkit` versions in cmd tests (#447)

* Add handling for newer prompt-toolkit version

* Place compatibility code in _compat

* Fix version for gha-find-replace (#446)

* Improved error handling and code clean up

* move pieces of logical.rs to separate files to ensure code readability

* left join working

* Update versions of Java dependencies (#445)

* Update versions for java dependencies with cves

* Rerun tests

* Update jackson databind version (#449)

* Update versions for java dependencies with cves

* Rerun tests

* update jackson-databind dependency

* Disable SQL server functionality (#448)

* Disable SQL server functionality

* Update docs/source/server.rst

Co-authored-by: Ayush Dattagupta <[email protected]>

* Disable server at lowest possible level

* Skip all server tests

* Add tests to ensure server is disabled

* Fix CVE fix test

Co-authored-by: Ayush Dattagupta <[email protected]>

* Update dask pinnings for release (#450)

* Add Java source code to source distribution (#451)

* Bump `httpclient` dependency (#453)

* Revert "Disable SQL server functionality (#448)"

This reverts commit 37a3a61.

* Bump httpclient version

* Unpin Dask/distributed versions (#452)

* Unpin dask/distributed post release

* Remove dask/distributed version ceiling

* Add jsonschema to ci testing (#454)

* Add jsonschema to ci env

* Fix typo in config schema

* Switch tests from `pd.testing.assert_frame_equal` to `dd.assert_eq` (#365) (see the assert_eq sketch after the commit message)

* Start moving tests to dd.assert_eq

* Use assert_eq in datetime filter test

* Resolve most resulting test failures

* Resolve remaining test failures

* Convert over tests

* Convert more tests

* Consolidate select limit cpu/gpu test

* Remove remaining assert_series_equal

* Remove explicit cudf imports from many tests

* Resolve rex test failures

* Remove some additional compute calls

* Consolidate sorting tests with getfixturevalue

* Fix failed join test

* Remove breakpoint

* Use custom assert_eq function for tests

* Resolve test failures / seg faults

* Remove unnecessary testing utils

* Resolve local test failures

* Generalize RAND test

* Avoid closing client if using independent cluster

* Fix failures on Windows

* Resolve black failures

* Make random test variables more clear

* First basic working checkpoint for group by

* Set max pin on antlr4-python-runtime (#456)

* Set max pin on antlr4-python-runtime due to incompatibilities with fugue_sql

* update comment on antlr max pin version

* Updates to style

* stage pre-commit changes for upstream merge

* Fix black failures

* Updates to Rust formatting

* Fix rust lint and clippy

* Remove jar building step which is no longer needed

* Remove Java from github workflows matrix

* Removes jar and Java references from test.yml

* Update Release workflow to remove references to Java

* Update rust.yml to remove references from linux-build-lib

* Add pre-commit.sh file to provide pre-commit support for Rust in a convenient script

* Removed overlooked jdk references

* cargo clippy auto fixes

* Address all Rust clippy warnings

* Include setuptools-rust in conda build recipe

* Include setuptools-rust in conda build recipe, in host and run

* Adjustments for conda build; committing so others can help with the error and see it occurring in CI

* Include sql.yaml in package files

* Include pyarrow in run section of conda build to ensure tests pass

* include setuptools-rust in host and run of conda since removing it caused errors

* to_string() method had been removed in Rust but not removed here, which caused conda run_test.py to fail when this line was hit

* Replace commented out tests with pytest.skip and bump version of pyarrow to 7.0.0

* Fix setup.py syntax issue introduced on last commit by find/replace

* Rename Datafusion -> DataFusion and Apache DataFusion -> Arrow DataFusion

* Fix docs build environment

* Include Rust compiler in docs environment

* Bump Rust compiler version to 1.59

* Ok, well readthedocs didn't like that

* Store libdask_planner.so and retrieve it between github workflows

* Cache the Rust library binary

* Remove Cargo.lock from git

* Remove unused datafusion-expr crate

* Build datafusion at each test step instead of caching binaries

* Remove maven and jar cache steps from test-upstream.yaml

* Removed dangling 'build' workflow step reference

* Lowered PyArrow version to 6.0.1 since the version of cudf we are using has a hard requirement on it

* Add Rust build step to test in dask cluster

* Install setuptools-rust so pip can use it for the bare-requirements import test

* Include pyarrow 6.0.1 via conda as a bare minimum dependency

* Remove cudf dependency for python 3.9 which is causing build issues on windows

* Address documentation from review

* Install Rust as readthedocs post_create_environment step

* Run rust install non-interactively

* Run rust install non-interactively

* Rust isn't available on PyPI so remove that dependency

* Append ~/.cargo/bin to the PATH

* Print out some environment information for debugging

* Print out some environment information for debugging

* More - Increase verbosity

* More - Increase verbosity

* More - Increase verbosity

* Switch RTD over to use Conda instead of Pip since we were having issues with Rust and pip

* Try to use mamba for building docs environment

* Partially address review suggestions, checking CI still works

* Skip mistakenly enabled tests

* Use DataFusion master branch, and fix syntax issues related to the version bump

* More updates after bumping DataFusion version to master

* Use actions-rs in github workflows and a debug flag for setup.py

* Remove setuptools-rust from conda

* Use re-exported Rust types for BuiltinScalarFunction

* Move python imports to TYPE_CHECKING section where applicable

* Address review concerns and remove pre-commit.sh file

* Pin to a specific github rev for DataFusion

Co-authored-by: Richard (Rick) Zamora <[email protected]>
Co-authored-by: Charles Blackmon-Luca <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Ayush Dattagupta <[email protected]>
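
The predicate-pushdown work above rewrites filter expressions into disjunctive normal form (DNF) so they can be handed to the parquet reader. A minimal sketch of the idea, using the standard `filters=` argument of `dask.dataframe.read_parquet` and a made-up dataset path (this is not the dask-sql implementation):

```python
import dask.dataframe as dd

# The predicate (x > 5 AND y == "a") OR x < 0 expressed in DNF:
# a list of AND-groups, each group a list of (column, op, value) tuples.
dnf_filters = [
    [("x", ">", 5), ("y", "==", "a")],
    [("x", "<", 0)],
]

# Row groups that cannot satisfy the predicate are skipped at read time.
df = dd.read_parquet("data/*.parquet", filters=dnf_filters)
```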
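The switch from `pd.testing.assert_frame_equal` to `dd.assert_eq` described above boils down to comparisons like the following sketch (the frames are made up; `assert_eq` accepts both lazy dask collections and concrete pandas objects):

```python
import pandas as pd
import dask.dataframe as dd
from dask.dataframe.utils import assert_eq

expected = pd.DataFrame({"a": [1, 2, 3], "b": ["x", "y", "z"]})
result = dd.from_pandas(expected, npartitions=2)

# Computes the dask graph and compares the result against the pandas frame.
assert_eq(result, expected)
```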
5 people authored Apr 21, 2022
1 parent 55bf4c2 commit afeee32
Showing 103 changed files with 3,703 additions and 4,882 deletions.
2 changes: 2 additions & 0 deletions .github/docker-compose.yaml
@@ -11,5 +11,7 @@ services:
container_name: dask-worker
image: daskdev/dask:latest
command: dask-worker dask-scheduler:8786
environment:
EXTRA_CONDA_PACKAGES: "pyarrow>=4.0.0" # required for parquet IO
volumes:
- /tmp:/tmp
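
The `EXTRA_CONDA_PACKAGES` entry above puts pyarrow on the worker container because parquet IO runs on the workers, not just the client. A hedged example of the kind of call the cluster tests depend on (the scheduler address matches the compose file; the parquet path is hypothetical):

```python
import dask.dataframe as dd
from dask.distributed import Client

client = Client("dask-scheduler:8786")        # service defined in the compose file
df = dd.read_parquet("/tmp/example.parquet")  # needs pyarrow importable on each worker
print(df.head())
```
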
7 changes: 1 addition & 6 deletions .github/workflows/release.yml
@@ -16,11 +16,6 @@ jobs:
if: github.repository == 'dask-contrib/dask-sql'
steps:
- uses: actions/checkout@v2
- name: Cache local Maven repository
uses: actions/cache@v2
with:
path: ~/.m2/repository
key: ${{ runner.os }}-maven-v1-jdk11-${{ hashFiles('**/pom.xml') }}
- name: Set up Python
uses: conda-incubator/setup-miniconda@v2
with:
@@ -29,7 +24,7 @@
python-version: "3.8"
channel-priority: strict
activate-environment: dask-sql
environment-file: continuous_integration/environment-3.8-jdk11-dev.yaml
environment-file: continuous_integration/environment-3.8-dev.yaml
- name: Install dependencies
run: |
pip install setuptools wheel twine
54 changes: 2 additions & 52 deletions .github/workflows/test-upstream.yml
@@ -5,64 +5,23 @@ on:
workflow_dispatch: # allows you to trigger the workflow run manually

jobs:
build:
# This build step should be similar to the deploy build, to make sure we actually test
# the future deployable
name: Build the jar on ubuntu
runs-on: ubuntu-latest
if: github.repository == 'dask-contrib/dask-sql'
defaults:
run:
shell: bash -l {0}
steps:
- uses: actions/checkout@v2
- name: Cache local Maven repository
uses: actions/cache@v2
with:
path: ~/.m2/repository
key: ${{ runner.os }}-maven-v1-jdk11-${{ hashFiles('**/pom.xml') }}
- name: Set up Python
uses: conda-incubator/setup-miniconda@v2
with:
miniforge-variant: Mambaforge
use-mamba: true
python-version: "3.8"
channel-priority: strict
activate-environment: dask-sql
environment-file: continuous_integration/environment-3.8-jdk11-dev.yaml
- name: Install dependencies and build the jar
run: |
python setup.py build_ext
- name: Upload the jar
uses: actions/upload-artifact@v1
with:
name: jar
path: dask_sql/jar/DaskSQL.jar

test-dev:
name: "Test upstream dev (${{ matrix.os }}, java: ${{ matrix.java }}, python: ${{ matrix.python }})"
needs: build
name: "Test upstream dev (${{ matrix.os }}, python: ${{ matrix.python }})"
runs-on: ${{ matrix.os }}
env:
CONDA_FILE: continuous_integration/environment-${{ matrix.python }}-jdk${{ matrix.java }}-dev.yaml
CONDA_FILE: continuous_integration/environment-${{ matrix.python }}-dev.yaml
defaults:
run:
shell: bash -l {0}
strategy:
fail-fast: false
matrix:
java: [8, 11]
os: [ubuntu-latest, windows-latest]
python: ["3.8", "3.9", "3.10"]
steps:
- uses: actions/checkout@v2
with:
fetch-depth: 0 # Fetch all history for all branches and tags.
- name: Cache local Maven repository
uses: actions/cache@v2
with:
path: ~/.m2/repository
key: ${{ runner.os }}-maven-v1-jdk${{ matrix.java }}-${{ hashFiles('**/pom.xml') }}
- name: Set up Python
uses: conda-incubator/setup-miniconda@v2
with:
@@ -72,21 +31,12 @@ jobs:
channel-priority: strict
activate-environment: dask-sql
environment-file: ${{ env.CONDA_FILE }}
- name: Download the pre-build jar
uses: actions/download-artifact@v1
with:
name: jar
path: dask_sql/jar/
- name: Install hive testing dependencies for Linux
if: matrix.os == 'ubuntu-latest'
run: |
mamba install -c conda-forge sasl>=0.3.1
docker pull bde2020/hive:2.3.2-postgresql-metastore
docker pull bde2020/hive-metastore-postgresql:2.3.0
- name: Set proper JAVA_HOME for Windows
if: matrix.os == 'windows-latest'
run: |
echo "JAVA_HOME=${{ env.CONDA }}\envs\dask-sql\Library" >> $GITHUB_ENV
- name: Install upstream dev Dask / dask-ml
run: |
python -m pip install --no-deps git+https://github.com/dask/dask
100 changes: 28 additions & 72 deletions .github/workflows/test.yml
@@ -1,7 +1,7 @@
---
# Test the main branch and every pull request by
# 1. building the jar on ubuntu
# 2. testing code (using the build jar) on ubuntu and windows, with different java versions
# 1. build dask_planner (Arrow DataFusion Rust bindings) on ubuntu
# 2. testing code (using the build DataFusion bindings) on ubuntu and windows
name: Test Python package
on:
push:
@@ -36,55 +36,20 @@ jobs:
with:
keyword: "[test-upstream]"

build:
# This build step should be similar to the deploy build, to make sure we actually test
# the future deployable
name: Build the jar on ubuntu
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Cache local Maven repository
uses: actions/cache@v2
with:
path: ~/.m2/repository
key: ${{ runner.os }}-maven-v1-jdk11-${{ hashFiles('**/pom.xml') }}
- name: Set up Python
uses: conda-incubator/setup-miniconda@v2
with:
miniforge-variant: Mambaforge
use-mamba: true
python-version: "3.8"
channel-priority: strict
activate-environment: dask-sql
environment-file: continuous_integration/environment-3.8-jdk11-dev.yaml
- name: Build the jar
run: |
python setup.py build_ext
- name: Upload the jar
uses: actions/upload-artifact@v1
with:
name: jar
path: dask_sql/jar/DaskSQL.jar

test:
name: "Test (${{ matrix.os }}, java: ${{ matrix.java }}, python: ${{ matrix.python }})"
needs: [detect-ci-trigger, build]
name: "Build & Test (${{ matrix.os }}, python: ${{ matrix.python }}, Rust: ${{ matrix.toolchain }})"
needs: [detect-ci-trigger]
runs-on: ${{ matrix.os }}
env:
CONDA_FILE: continuous_integration/environment-${{ matrix.python }}-jdk${{ matrix.java }}-dev.yaml
CONDA_FILE: continuous_integration/environment-${{ matrix.python }}-dev.yaml
strategy:
fail-fast: false
matrix:
java: [8, 11]
os: [ubuntu-latest, windows-latest]
python: ["3.8", "3.9", "3.10"]
toolchain: [stable]
steps:
- uses: actions/checkout@v2
- name: Cache local Maven repository
uses: actions/cache@v2
with:
path: ~/.m2/repository
key: ${{ runner.os }}-maven-v1-jdk${{ matrix.java }}-${{ hashFiles('**/pom.xml') }}
- name: Set up Python
uses: conda-incubator/setup-miniconda@v2
with:
@@ -94,21 +59,21 @@
channel-priority: strict
activate-environment: dask-sql
environment-file: ${{ env.CONDA_FILE }}
- name: Download the pre-build jar
uses: actions/download-artifact@v1
- name: Setup Rust Toolchain
uses: actions-rs/toolchain@v1
id: rust-toolchain
with:
name: jar
path: dask_sql/jar/
toolchain: stable
override: true
- name: Build the Rust DataFusion bindings
run: |
python setup.py build install
- name: Install hive testing dependencies for Linux
if: matrix.os == 'ubuntu-latest'
run: |
mamba install -c conda-forge sasl>=0.3.1
docker pull bde2020/hive:2.3.2-postgresql-metastore
docker pull bde2020/hive-metastore-postgresql:2.3.0
- name: Set proper JAVA_HOME for Windows
if: matrix.os == 'windows-latest'
run: |
echo "JAVA_HOME=${{ env.CONDA }}\envs\dask-sql\Library" >> $GITHUB_ENV
- name: Optionally install upstream dev Dask / dask-ml
if: needs.detect-ci-trigger.outputs.triggered == 'true'
run: |
Expand All @@ -133,15 +98,10 @@ jobs:

cluster:
name: "Test in a dask cluster"
needs: [detect-ci-trigger, build]
needs: [detect-ci-trigger]
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Cache local Maven repository
uses: actions/cache@v2
with:
path: ~/.m2/repository
key: ${{ runner.os }}-maven-v1-jdk11-${{ hashFiles('**/pom.xml') }}
- name: Set up Python
uses: conda-incubator/setup-miniconda@v2
with:
@@ -150,12 +110,16 @@
python-version: "3.8"
channel-priority: strict
activate-environment: dask-sql
environment-file: continuous_integration/environment-3.8-jdk11-dev.yaml
- name: Download the pre-build jar
uses: actions/download-artifact@v1
with:
name: jar
path: dask_sql/jar/
environment-file: continuous_integration/environment-3.8-dev.yaml
- name: Setup Rust Toolchain
uses: actions-rs/toolchain@v1
id: rust-toolchain
with:
toolchain: stable
override: true
- name: Build the Rust DataFusion bindings
run: |
python setup.py build install
- name: Install dependencies
run: |
mamba install python-blosc lz4 -c conda-forge
@@ -184,29 +148,21 @@
import:
name: "Test importing with bare requirements"
needs: [detect-ci-trigger, build]
needs: [detect-ci-trigger]
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Cache local Maven repository
uses: actions/cache@v2
with:
path: ~/.m2/repository
key: ${{ runner.os }}-maven-v1-jdk11-${{ hashFiles('**/pom.xml') }}
- name: Set up Python
uses: conda-incubator/setup-miniconda@v2
with:
python-version: "3.8"
mamba-version: "*"
channels: conda-forge,defaults
channel-priority: strict
- name: Download the pre-build jar
uses: actions/download-artifact@v1
with:
name: jar
path: dask_sql/jar/
- name: Install dependencies and nothing else
run: |
conda install setuptools-rust
conda install pyarrow>=4.0.0
pip install -e .
which python
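
The new "Build the Rust DataFusion bindings" steps install a stable Rust toolchain and then run `python setup.py build install`. A minimal sketch of the setuptools-rust wiring such a setup.py relies on (the extension name, crate path, and PyO3 binding are assumptions, not the project's actual configuration):

```python
from setuptools import setup
from setuptools_rust import Binding, RustExtension

setup(
    name="dask-sql",
    rust_extensions=[
        RustExtension(
            "dask_planner.rust",             # import path of the compiled module (assumed)
            path="dask_planner/Cargo.toml",  # Rust crate to build (assumed)
            binding=Binding.PyO3,            # assuming PyO3 bindings
        )
    ],
    zip_safe=False,  # compiled extensions cannot be loaded from a zip
)
```
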
59 changes: 45 additions & 14 deletions .github/workflows/update-gpuci.yml
@@ -13,39 +13,70 @@ jobs:
steps:
- uses: actions/checkout@v2

- name: Parse current axis YAML
uses: the-coding-turtle/[email protected]
with:
file: continuous_integration/gpuci/axis.yaml

- name: Get latest cuDF nightly version
id: latest_version
id: cudf_latest
uses: jacobtomlinson/[email protected]
with:
org: "rapidsai-nightly"
package: "cudf"
version_system: "CalVer"

- name: Strip git tags from versions
- name: Get latest cuML nightly version
id: cuml_latest
uses: jacobtomlinson/[email protected]
with:
org: "rapidsai-nightly"
package: "cuml"
version_system: "CalVer"

- name: Get latest UCX-Py nightly version
id: ucx_py_latest
uses: jacobtomlinson/[email protected]
with:
org: "rapidsai-nightly"
package: "ucx-py"
version_system: "CalVer"

- name: Get old RAPIDS / UCX-Py versions
env:
FULL_RAPIDS_VER: ${{ steps.latest_version.outputs.version }}
run: echo "RAPIDS_VER=${FULL_RAPIDS_VER::-10}" >> $GITHUB_ENV
FULL_CUDF_VER: ${{ steps.cudf_latest.outputs.version }}
FULL_CUML_VER: ${{ steps.cuml_latest.outputs.version }}
FULL_UCX_PY_VER: ${{ steps.ucx_py_latest.outputs.version }}
run: |
echo RAPIDS_VER=$RAPIDS_VER_0 >> $GITHUB_ENV
echo UCX_PY_VER=$(curl -sL https://version.gpuci.io/rapids/$RAPIDS_VER_0) >> $GITHUB_ENV
echo NEW_CUDF_VER=${FULL_CUDF_VER::-10} >> $GITHUB_ENV
echo NEW_CUML_VER=${FULL_CUML_VER::-10} >> $GITHUB_ENV
echo NEW_UCX_PY_VER=${FULL_UCX_PY_VER::-10} >> $GITHUB_ENV
- name: Find and Replace Release
uses: jacobtomlinson/gha-find-replace@0.1.4
- name: Update RAPIDS version
uses: jacobtomlinson/gha-find-replace@v2
with:
include: 'continuous_integration\/gpuci\/axis\.yaml'
find: "RAPIDS_VER:\n- .*"
replace: |-
RAPIDS_VER:
- "${{ env.RAPIDS_VER }}"
find: "${{ env.RAPIDS_VER }}"
replace: "${{ env.NEW_CUDF_VER }}"
regex: false

- name: Create Pull Request
uses: peter-evans/create-pull-request@v3
# make sure ucx-py nightlies are available and that cuDF/cuML nightly versions match up
if: |
env.UCX_PY_VER != env.NEW_UCX_PY_VER &&
env.NEW_CUDF_VER == env.NEW_CUML_VER
with:
token: ${{ secrets.GITHUB_TOKEN }}
draft: true
commit-message: "Update gpuCI `RAPIDS_VER` to `${{ env.RAPIDS_VER }}`"
title: "Update gpuCI `RAPIDS_VER` to `${{ env.RAPIDS_VER }}`"
commit-message: "Update gpuCI `RAPIDS_VER` to `${{ env.NEW_CUDF_VER }}`"
title: "Update gpuCI `RAPIDS_VER` to `${{ env.NEW_CUDF_VER }}`"
team-reviewers: "dask/gpu"
author: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
branch: "upgrade-gpuci-rapids"
body: |
A new cuDF nightly version has been detected.
New cuDF and ucx-py nightly versions have been detected.
Updated `axis.yaml` to use `${{ env.RAPIDS_VER }}`.
Updated `axis.yaml` to use `${{ env.NEW_CUDF_VER }}`.
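
The `${FULL_CUDF_VER::-10}` expansions above drop the last ten characters of the nightly version string to recover the plain `YY.MM` release used as `RAPIDS_VER`. The same operation in Python, with a made-up nightly version for illustration:

```python
# Hypothetical rapidsai-nightly version string; only the length of the suffix matters.
full_cudf_ver = "22.06.00a220421"
new_cudf_ver = full_cudf_ver[:-10]  # drop the trailing nightly suffix -> "22.06"
print(new_cudf_ver)
```
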
1 change: 1 addition & 0 deletions .gitignore
@@ -60,3 +60,4 @@ dask_sql/jar
dask-worker-space/
node_modules/
docs/source/_build/
dask_planner/Cargo.lock