Skip to content

Commit

Permalink
Datafusion upstream merge (#576)
Browse files Browse the repository at this point in the history
* Add basic predicate-pushdown optimization (#433)

* basic predicate-pushdown support

* remove explict Dispatch class

* use _Frame.fillna

* cleanup comments

* test coverage

* improve test coverage

* add xfail test for dt accessor in predicate and fix test_show.py

* fix some naming issues

* add config and use assert_eq

* add logging events when predicate-pushdown bails

* move bail logic earlier in function

* address easier code review comments

* typo fix

* fix creation_info access bug

* convert any expression to DNF

* csv test coverage

* include IN coverage

* improve test rigor

* address code review

* skip parquet tests when deps are not installed

* fix bug

* add pyarrow dep to cluster workers

* roll back test skipping changes

Co-authored-by: Charles Blackmon-Luca <[email protected]>

* Add workflow to keep datafusion dev branch up to date (#440)

* Update gpuCI `RAPIDS_VER` to `22.06` (#434)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Bump black to 22.3.0 (#443)

* Check for ucx-py nightlies when updating gpuCI (#441)

* Simplify gpuCI updating workflow

* Add check for cuML nightly version

* Add handling for newer `prompt_toolkit` versions in cmd tests (#447)

* Add handling for newer prompt-toolkit version

* Place compatibility code in _compat

* Fix version for gha-find-replace (#446)

* Update versions of Java dependencies (#445)

* Update versions for java dependencies with cves

* Rerun tests

* Update jackson databind version (#449)

* Update versions for java dependencies with cves

* Rerun tests

* update jackson-databind dependency

* Disable SQL server functionality (#448)

* Disable SQL server functionality

* Update docs/source/server.rst

Co-authored-by: Ayush Dattagupta <[email protected]>

* Disable server at lowest possible level

* Skip all server tests

* Add tests to ensure server is disabled

* Fix CVE fix test

Co-authored-by: Ayush Dattagupta <[email protected]>

* Update dask pinnings for release (#450)

* Add Java source code to source distribution (#451)

* Bump `httpclient` dependency (#453)

* Revert "Disable SQL server functionality (#448)"

This reverts commit 37a3a61.

* Bump httpclient version

* Unpin Dask/distributed versions (#452)

* Unpin dask/distributed post release

* Remove dask/distributed version ceiling

* Add jsonschema to ci testing (#454)

* Add jsonschema to ci env

* Fix typo in config schema

* Switch tests from `pd.testing.assert_frame_equal` to `dd.assert_eq` (#365)

* Start moving tests to dd.assert_eq

* Use assert_eq in datetime filter test

* Resolve most resulting test failures

* Resolve remaining test failures

* Convert over tests

* Convert more tests

* Consolidate select limit cpu/gpu test

* Remove remaining assert_series_equal

* Remove explicit cudf imports from many tests

* Resolve rex test failures

* Remove some additional compute calls

* Consolidate sorting tests with getfixturevalue

* Fix failed join test

* Remove breakpoint

* Use custom assert_eq function for tests

* Resolve test failures / seg faults

* Remove unnecessary testing utils

* Resolve local test failures

* Generalize RAND test

* Avoid closing client if using independent cluster

* Fix failures on Windows

* Resolve black failures

* Make random test variables more clear

* Set max pin on antlr4-python-runtime  (#456)

* Set max pin on antlr4-python-runtime due to incompatibilities with fugue_sql

* update comment on antlr max pin version

* Move / minimize number of cudf / dask-cudf imports (#480)

* Move / minimize number of cudf / dask-cudf imports

* Add tests for GPU-related errors

* Fix unbound local error

* Fix ddf value error

* Use `map_partitions` to compute LIMIT / OFFSET (#517)

* Use map_partitions to compute limit / offset

* Use partition_info to extract partition_index

* Use `dev` images for independent cluster testing (#518)

* Switch to dask dev images

* Use mamba for conda installs in images

* Remove sleep call for installation

* Use timeout / until to wait for cluster to be initialized

* Add documentation for FugueSQL integrations (#523)

* Add documentation for FugueSQL integrations

* Minor nitpick around autodoc obj -> class

* Timestampdiff support (#495)

* added timestampdiff

* initial work for timestampdiff

* Added test cases for timestampdiff

* Update interval month dtype mapping

* Add datetimesubOperator

* Uncomment timestampdiff literal tests

* Update logic for handling interval_months for pandas/cudf series and scalars

* Add negative diff testcases, and gpu tests

* Update reinterpret and timedelta to explicitly cast to int64 instead of int

* Simplify cast_column_to_type mapping logic

* Add scalar handling to castOperation and reuse it for reinterpret

Co-authored-by: rajagurnath <[email protected]>

* Relax jsonschema testing dependency (#546)

* Update upstream testing workflows (#536)

* Use dask nightly conda packages for upstream testing

* Add independent cluster testing to nightly upstream CI [test-upstream]

* Remove unnecessary dask install [test-upstream]

* Remove strict channel policy to allow nightly dask installs

* Use nightly Dask packages in independent cluster test [test-upstream]

* Use channels argument to install Dask conda nightlies [test-upstream]

* Fix channel expression

* [test-upstream]

* Need to add mamba update command to get dask conda nightlies

* Use conda nightlies for dask-sql import test

* Add import test to upstream nightly tests

* [test-upstream]

* Make sure we have nightly Dask for import tests [test-upstream]

* Fix pyarrow / cloudpickle failures in cluster testing (#553)

* Explicitly install libstdcxx-ng in clusters

* Make pyarrow dependency consistent across testing

* Make libstdcxx-ng dep a min version

* Add cloudpickle to cluster dependencies

* cloudpickle must be in the scheduler environment

* Bump cloudpickle version

* Move cloudpickle install to workers

* Fix pyarrow constraint in cluster spec

* Use bash -l as default entrypoint for all jobs (#552)

* Constrain dask/distributed for release (#563)

* Unpin dask/distributed for development (#564)

* Unpin dask/distributed post release

* Remove dask/distributed version ceiling

* update dask-sphinx-theme (#567)

* Introduce subquery.py to handle subquery expressions

* update ordering

* Make sure scheduler has Dask nightlies in upstream cluster testing (#573)

* Make sure scheduler has Dask nightlies in upstream cluster testing

* empty commit to [test-upstream]

* Update gpuCI `RAPIDS_VER` to `22.08` (#565)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* updates

* Remove startswith function merged by mistake

* [REVIEW] - Remove instance that are meant for the currently removed timestampdiff

* Modify test environment pinnings to cover minimum versions (#555)

* Remove black/isort deps as we prefer pre-commit

* Unpin all non python/jdk dependencies

* Minor package corrections for py3.9 jdk11 env

* Set min version constraints for all non-testing dependencies

* Pin all non-test deps for 3.8 testing

* Bump sklearn min version to 1.0.0

* Bump pyarrow min version to 1.0.1

* Fix pip notation for fugue

* Use unpinned deps for cluster testing for now

* Add fugue deps to environments, bump pandas to 1.0.2

* Add back antlr4 version ceiling

* Explicitly mark all fugue dependencies

* Alter test_analyze to avoid rtol

* Bump pandas to 1.0.5 to fix upstream numpy issues

* Alter datetime casting util to dodge panda casting failures

* Bump pandas to 1.1.0 for groupby dropna support

* Simplify string dtype check for get_supported_aggregations

* Add check_dtype=False back to test_group_by_nan

* Bump cluster to python 3.9

* Bump fastapi to 0.69.0, resolve remaining JDBC failures

* Typo - correct pandas version

* Generalize test_multi_case_when's dtype check

* Bump pandas to 1.1.1 to resolve flaky test failures

* Constrain mlflow for windows python 3.8 testing

* Selectors don't work for conda env files

* Problems seem to persist in 1.1.1, bump to 1.1.2

* Remove accidental debug changes

* [test-upstream]

* Use python 3.9 for upstream cluster testing [test-upstream]

* Updated missed pandas pinning

* Unconstrain mlflow to see if Windows failures persist

* Add min version for protobuf

* Bump pyarrow min version to allow for newer protobuf versions

* Don't move jar to local mvn repo (#579)

* Add tests for intersection

* Add tests for intersection

* Add another intersection test, even more simple but for testing raw intersection

* Use Timedelta when doing ReduceOperation(s) against datetime64 dtypes

* Cleanup

* Use an either/or strategy for converting to Timedelta objects

* Support more than 2 operands for Timedelta conversions

* fix merge issues, is_frame() function of call.py was removed accidentally before

* Remove pytest that was testing Calcite exception messages. Calcite is no longer used so no need for this test

* comment out gpu tests, will be enabled in datafusion-filter PR

* Don't check dtype for failing test

Co-authored-by: Richard (Rick) Zamora <[email protected]>
Co-authored-by: Charles Blackmon-Luca <[email protected]>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Ayush Dattagupta <[email protected]>
Co-authored-by: rajagurnath <[email protected]>
Co-authored-by: Sarah Charlotte Johnson <[email protected]>
Co-authored-by: ksonj <[email protected]>
  • Loading branch information
8 people authored Jun 17, 2022
1 parent 230d726 commit 453249e
Show file tree
Hide file tree
Showing 42 changed files with 571 additions and 333 deletions.
21 changes: 21 additions & 0 deletions .github/cluster-upstream.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,21 @@
# Docker-compose setup used during tests
version: '3'
services:
dask-scheduler:
container_name: dask-scheduler
image: daskdev/dask:dev-py3.9
command: dask-scheduler
environment:
USE_MAMBA: "true"
EXTRA_CONDA_PACKAGES: "dask/label/dev::dask cloudpickle>=2.1.0"
ports:
- "8786:8786"
dask-worker:
container_name: dask-worker
image: daskdev/dask:dev-py3.9
command: dask-worker dask-scheduler:8786
environment:
USE_MAMBA: "true"
EXTRA_CONDA_PACKAGES: "dask/label/dev::dask cloudpickle>=2.1.0 pyarrow>=3.0.0 libstdcxx-ng>=12.1.0"
volumes:
- /tmp:/tmp
6 changes: 3 additions & 3 deletions .github/docker-compose.yaml → .github/cluster.yml
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,7 @@ version: '3'
services:
dask-scheduler:
container_name: dask-scheduler
image: daskdev/dask:dev
image: daskdev/dask:dev-py3.9
command: dask-scheduler
environment:
USE_MAMBA: "true"
Expand All @@ -12,10 +12,10 @@ services:
- "8786:8786"
dask-worker:
container_name: dask-worker
image: daskdev/dask:dev
image: daskdev/dask:dev-py3.9
command: dask-worker dask-scheduler:8786
environment:
USE_MAMBA: "true"
EXTRA_CONDA_PACKAGES: "pyarrow>=4.0.0" # required for parquet IO
EXTRA_CONDA_PACKAGES: "cloudpickle>=2.1.0 pyarrow>=3.0.0 libstdcxx-ng>=12.1.0"
volumes:
- /tmp:/tmp
30 changes: 30 additions & 0 deletions .github/workflows/datafusion-sync.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,30 @@
name: Keep datafusion branch up to date
on:
push:
branches:
- main

# When this workflow is queued, automatically cancel any previous running
# or pending jobs
concurrency:
group: datafusion-sync
cancel-in-progress: true

jobs:
sync-branches:
runs-on: ubuntu-latest
if: github.repository == 'dask-contrib/dask-sql'
steps:
- name: Checkout
uses: actions/checkout@v2
- name: Set up Node
uses: actions/setup-node@v2
with:
node-version: 12
- name: Opening pull request
id: pull
uses: tretuna/[email protected]
with:
GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
FROM_BRANCH: main
TO_BRANCH: datafusion-sql-planner
107 changes: 100 additions & 7 deletions .github/workflows/test-upstream.yml
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,11 @@ on:
- cron: "0 0 * * *" # Daily “At 00:00” UTC
workflow_dispatch: # allows you to trigger the workflow run manually

# Required shell entrypoint to have properly activated conda environments
defaults:
run:
shell: bash -l {0}

jobs:
test-dev:
name: "Test upstream dev (${{ matrix.os }}, python: ${{ matrix.python }})"
Expand All @@ -29,6 +34,7 @@ jobs:
use-mamba: true
python-version: ${{ matrix.python }}
channel-priority: strict
channels: dask/label/dev,conda-forge,nodefaults
activate-environment: dask-sql
environment-file: ${{ env.CONDA_FILE }}
- name: Install hive testing dependencies for Linux
Expand All @@ -39,23 +45,110 @@ jobs:
docker pull bde2020/hive-metastore-postgresql:2.3.0
- name: Install upstream dev Dask / dask-ml
run: |
python -m pip install --no-deps git+https://github.com/dask/dask
python -m pip install --no-deps git+https://github.com/dask/distributed
mamba update dask
python -m pip install --no-deps git+https://github.com/dask/dask-ml
- name: Test with pytest
run: |
pytest --junitxml=junit/test-results.xml --cov-report=xml -n auto tests --dist loadfile
cluster-dev:
name: "Test upstream dev in a dask cluster"
needs: build
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Cache local Maven repository
uses: actions/cache@v2
with:
path: ~/.m2/repository
key: ${{ runner.os }}-maven-v1-jdk11-${{ hashFiles('**/pom.xml') }}
- name: Set up Python
uses: conda-incubator/setup-miniconda@v2
with:
miniforge-variant: Mambaforge
use-mamba: true
python-version: "3.9"
channel-priority: strict
channels: dask/label/dev,conda-forge,nodefaults
activate-environment: dask-sql
environment-file: continuous_integration/environment-3.9-jdk11-dev.yaml
- name: Download the pre-build jar
uses: actions/download-artifact@v1
with:
name: jar
path: dask_sql/jar/
- name: Install cluster dependencies
run: |
mamba install python-blosc lz4 -c conda-forge
which python
pip list
mamba list
- name: Install upstream dev dask-ml
run: |
mamba update dask
python -m pip install --no-deps git+https://github.com/dask/dask-ml
- name: run a dask cluster
run: |
docker-compose -f .github/cluster-upstream.yml up -d
# periodically ping logs until a connection has been established; assume failure after 2 minutes
timeout 2m bash -c 'until docker logs dask-worker 2>&1 | grep -q "Starting established connection"; do sleep 1; done'
docker logs dask-scheduler
docker logs dask-worker
- name: Test with pytest while running an independent dask cluster
run: |
DASK_SQL_TEST_SCHEDULER="tcp://127.0.0.1:8786" pytest --junitxml=junit/test-cluster-results.xml --cov-report=xml -n auto tests --dist loadfile
import-dev:
name: "Test importing with bare requirements and upstream dev"
needs: build
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v2
- name: Cache local Maven repository
uses: actions/cache@v2
with:
path: ~/.m2/repository
key: ${{ runner.os }}-maven-v1-jdk11-${{ hashFiles('**/pom.xml') }}
- name: Set up Python
uses: conda-incubator/setup-miniconda@v2
with:
python-version: "3.8"
mamba-version: "*"
channels: dask/label/dev,conda-forge,nodefaults
channel-priority: strict
- name: Download the pre-build jar
uses: actions/download-artifact@v1
with:
name: jar
path: dask_sql/jar/
- name: Install upstream dev Dask / dask-ml
if: needs.detect-ci-trigger.outputs.triggered == 'true'
run: |
mamba update dask
python -m pip install --no-deps git+https://github.com/dask/dask-ml
- name: Install dependencies and nothing else
run: |
pip install -e .
which python
pip list
mamba list
- name: Try to import dask-sql
run: |
python -c "import dask_sql; print('ok')"
report-failures:
name: Open issue for upstream dev failures
needs: test-dev
needs: [test-dev, cluster-dev]
if: |
always()
&& needs.test-dev.result == 'failure'
&& (
needs.test-dev.result == 'failure' || needs.cluster-dev.result == 'failure'
)
runs-on: ubuntu-latest
defaults:
run:
shell: bash
steps:
- uses: actions/checkout@v2
- name: Report failures
Expand Down
34 changes: 17 additions & 17 deletions .github/workflows/test.yml
Original file line number Diff line number Diff line change
Expand Up @@ -57,6 +57,7 @@ jobs:
use-mamba: true
python-version: ${{ matrix.python }}
channel-priority: strict
channels: ${{ needs.detect-ci-trigger.outputs.triggered == 'true' && 'dask/label/dev,conda-forge,nodefaults' || 'conda-forge,nodefaults' }}
activate-environment: dask-sql
environment-file: ${{ env.CONDA_FILE }}
- name: Setup Rust Toolchain
Expand All @@ -77,8 +78,7 @@ jobs:
- name: Optionally install upstream dev Dask / dask-ml
if: needs.detect-ci-trigger.outputs.triggered == 'true'
run: |
python -m pip install --no-deps git+https://github.com/dask/dask
python -m pip install --no-deps git+https://github.com/dask/distributed
mamba update dask
python -m pip install --no-deps git+https://github.com/dask/dask-ml
- name: Test with pytest
run: |
Expand Down Expand Up @@ -107,10 +107,11 @@ jobs:
with:
miniforge-variant: Mambaforge
use-mamba: true
python-version: "3.8"
python-version: "3.9"
channel-priority: strict
channels: ${{ needs.detect-ci-trigger.outputs.triggered == 'true' && 'dask/label/dev,conda-forge,nodefaults' || 'conda-forge,nodefaults' }}
activate-environment: dask-sql
environment-file: continuous_integration/environment-3.8-dev.yaml
environment-file: continuous_integration/environment-3.9-dev.yaml
- name: Setup Rust Toolchain
uses: actions-rs/toolchain@v1
id: rust-toolchain
Expand All @@ -127,18 +128,23 @@ jobs:
which python
pip list
mamba list
- name: Optionally install upstream dev Dask / dask-ml
- name: Optionally install upstream dev dask-ml
if: needs.detect-ci-trigger.outputs.triggered == 'true'
run: |
python -m pip install --no-deps git+https://github.com/dask/dask
python -m pip install --no-deps git+https://github.com/dask/distributed
mamba update dask
python -m pip install --no-deps git+https://github.com/dask/dask-ml
- name: run a dask cluster
env:
UPSTREAM: ${{ needs.detect-ci-trigger.outputs.triggered }}
run: |
docker-compose -f .github/docker-compose.yaml up -d
if [[ $UPSTREAM == "true" ]]; then
docker-compose -f .github/cluster-upstream.yml up -d
else
docker-compose -f .github/cluster.yml up -d
fi
# Wait for installation
sleep 40
# periodically ping logs until a connection has been established; assume failure after 2 minutes
timeout 2m bash -c 'until docker logs dask-worker 2>&1 | grep -q "Starting established connection"; do sleep 1; done'
docker logs dask-scheduler
docker logs dask-worker
Expand All @@ -157,7 +163,7 @@ jobs:
with:
python-version: "3.8"
mamba-version: "*"
channels: conda-forge,defaults
channels: ${{ needs.detect-ci-trigger.outputs.triggered == 'true' && 'dask/label/dev,conda-forge,nodefaults' || 'conda-forge,nodefaults' }}
channel-priority: strict
- name: Install dependencies and nothing else
run: |
Expand All @@ -167,12 +173,6 @@ jobs:
which python
pip list
mamba list
- name: Optionally install upstream dev Dask / dask-ml
if: needs.detect-ci-trigger.outputs.triggered == 'true'
run: |
python -m pip install --no-deps git+https://github.com/dask/dask
python -m pip install --no-deps git+https://github.com/dask/distributed
python -m pip install --no-deps git+https://github.com/dask/dask-ml
- name: Try to import dask-sql
run: |
python -c "import dask_sql; print('ok')"
57 changes: 29 additions & 28 deletions continuous_integration/environment-3.10-dev.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -3,40 +3,41 @@ channels:
- conda-forge
- nodefaults
dependencies:
- adagio>=0.2.3
- antlr4-python3-runtime>=4.9.2, <4.10.0 # Remove max pin after qpd(fugue dependency) updates their conda recipe
- black=22.3.0
- ciso8601>=2.2.0
- dask-ml>=2022.1.22
- dask>=2022.3.0
- fastapi>=0.61.1
- fs>=2.4.11
- fastapi>=0.69.0
- intake>=0.6.0
- isort=5.7.0
- jsonschema>=4.4.0
- lightgbm>=3.2.1
- mlflow>=1.19.0
- mock>=4.0.3
- nest-asyncio>=1.4.3
- pandas>=1.0.0 # below 1.0, there were no nullable ext. types
- pip=20.2.4
- pre-commit>=2.11.1
- prompt_toolkit>=3.0.8
- psycopg2>=2.9.1
- pygments>=2.7.1
- pyhive>=0.6.4
- pytest-cov>=2.10.1
- jsonschema
- lightgbm
- maturin>=0.12.8
- mlflow
- mock
- nest-asyncio
- pandas>=1.1.2
- pre-commit
- prompt_toolkit
- psycopg2
- pyarrow>=3.0.0
- pygments
- pyhive
- pytest-cov
- pytest-xdist
- pytest>=6.0.1
- pytest
- python=3.10
- scikit-learn>=0.24.2
- sphinx>=3.2.1
- tpot>=0.11.7
- triad>=0.5.4
- rust>=1.60.0
- scikit-learn>=1.0.0
- setuptools-rust>=1.1.2
- sphinx
- tpot
- tzlocal>=2.1
- uvicorn>=0.11.3
- maturin>=0.12.8
- setuptools-rust>=1.1.2
- rust>=1.60.0
# fugue dependencies; remove when we conda install fugue
- adagio
- antlr4-python3-runtime<4.10
- ciso8601
- fs
- pip
- qpd
- triad
- pip:
- fugue[sql]>=0.5.3
Loading

0 comments on commit 453249e

Please sign in to comment.