feat: Add ManagedTableDataset for managed Delta Lake tables in Databricks #127

Closed
dannyrfar wants to merge 78 commits

Conversation

dannyrfar
Contributor

@dannyrfar dannyrfar commented Mar 14, 2023

Description

This is the first of a few PRs adding Databricks functionality to Kedro datasets. It includes the ManagedTableDataset, which lets users interface with managed Delta tables in Databricks or locally in PySpark.

Development notes

The changes add one new dataset, databricks.ManagedTableDataSet, which lets users interface with managed Delta tables inside Databricks.

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

@dannyrfar
Contributor Author

Hey @noklam, I just saw your comment in the other PR. I did see those two datasets; this one is more focused on Databricks Unity Catalog tables. The SparkDataSet and DeltaTableDataSet are for interfacing with files directly. Both can be used on Databricks, but they are intended for different purposes.
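
The distinction drawn in the comment above can be illustrated with a small sketch. This is not code from the PR: `TableRef` and `full_name` are hypothetical names, chosen only to show that a managed table is addressed by name (in Unity Catalog, a three-level `catalog.schema.table` name) rather than by a file path, which is how the file-based datasets are addressed.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TableRef:
    """Hypothetical name-based reference to a managed table (illustration only)."""

    table: str
    database: str = "default"
    catalog: Optional[str] = None  # omitted when no catalog is specified

    def full_name(self) -> str:
        # Join only the parts that are set, so a missing catalog gives
        # a two-level "database.table" name.
        parts = [self.catalog, self.database, self.table]
        return ".".join(part for part in parts if part)


# A managed table is identified by name ...
managed = TableRef(table="trips", database="analytics", catalog="main")
# ... whereas a file-based dataset would be identified by a path
# (this path is made up for the example).
file_based_path = "/dbfs/data/trips.delta"

print(managed.full_name())
```

In this sketch, leaving out the catalog yields `analytics.trips`, which is the shape of name a user might pass when working locally in PySpark without Unity Catalog.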

@dannyrfar dannyrfar force-pushed the main branch 3 times, most recently from 7121847 to 9b43324 Compare March 21, 2023 19:11
dannyrfar and others added 26 commits March 21, 2023 15:16
…atasets allows users to interface with Unity catalog tables in Databricks to both read and write.

Signed-off-by: Danny Farah <[email protected]>
…org#99)

* Add non-spark related test changes
Replace kedro.pipeline.Pipeline with
kedro.pipeline.modular_pipeline.pipeline factory.
This is for symmetry with changes made to the main kedro library.

Signed-off-by: Adam Farley <[email protected]>

Signed-off-by: Danny Farah <[email protected]>
* fix links

* fix dill links

Signed-off-by: Danny Farah <[email protected]>
* Fix docs formatting and phrasing for some datasets

Signed-off-by: Deepyaman Datta <[email protected]>

* Manually fix files not resolved with patch command

Signed-off-by: Deepyaman Datta <[email protected]>

* Apply fix from kedro-org#98

Signed-off-by: Deepyaman Datta <[email protected]>

---------

Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
* bump version and update release notes

* fix pylint errors

Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
* Prefix Docker plugin name with "Kedro-" in usage message

Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
…o-org#56)

* Keep Kedro-Docker plugin docstring from appearing in `kedro -h`

Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: wmoreiraa <[email protected]>

Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
* [kedro-docker] Layers size optimization (kedro-org#92)

* [kedro-docker] Layers size optimization

Signed-off-by: Mariusz Strzelecki <[email protected]>

* Adjust test requirements

Signed-off-by: Mariusz Strzelecki <[email protected]>

* Skip coverage check on tests dir (some do not execute on Windows)

Signed-off-by: Mariusz Strzelecki <[email protected]>

* Update .coveragerc with the setup

Signed-off-by: Mariusz Strzelecki <[email protected]>

* Fix bandit so it does not scan kedro-datasets

Signed-off-by: Mariusz Strzelecki <[email protected]>

* Fixed existence test

Signed-off-by: Mariusz Strzelecki <[email protected]>

* Check why dir is not created

Signed-off-by: Mariusz Strzelecki <[email protected]>

* Kedro starters are fixed now

Signed-off-by: Mariusz Strzelecki <[email protected]>

* Increased no-output-timeout for long spark image build

Signed-off-by: Mariusz Strzelecki <[email protected]>

* Spark image optimized

Signed-off-by: Mariusz Strzelecki <[email protected]>

* Linting

Signed-off-by: Mariusz Strzelecki <[email protected]>

* Switch to slim image always

Signed-off-by: Mariusz Strzelecki <[email protected]>

* Trigger build

Signed-off-by: Mariusz Strzelecki <[email protected]>

* Use textwrap.dedent for nicer indentation

Signed-off-by: Mariusz Strzelecki <[email protected]>

* Revert "Use textwrap.dedent for nicer indentation"

This reverts commit 3a1e3f8.

Signed-off-by: Mariusz Strzelecki <[email protected]>

* Revert "Revert "Use textwrap.dedent for nicer indentation""

This reverts commit d322d35.

Signed-off-by: Mariusz Strzelecki <[email protected]>

* Make tests read more lines (to skip all deprecation warnings)

Signed-off-by: Mariusz Strzelecki <[email protected]>

Signed-off-by: Mariusz Strzelecki <[email protected]>
Signed-off-by: Mariusz Strzelecki <[email protected]>
Signed-off-by: Yassine Alouini <[email protected]>

* Release Kedro-Docker 0.3.1 (kedro-org#94)

* Add release notes for kedro-docker 0.3.1

Signed-off-by: Jannic Holzer <[email protected]>

* Update version in kedro_docker module

Signed-off-by: Jannic Holzer <[email protected]>

Signed-off-by: Jannic Holzer <[email protected]>
Signed-off-by: Yassine Alouini <[email protected]>

* Bump version and update release notes (kedro-org#96)

Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: Yassine Alouini <[email protected]>

* Make the SQLQueryDataSet compatible with mssql.

Signed-off-by: Yassine Alouini <[email protected]>

* Add one test + update RELEASE.md.

Signed-off-by: Yassine Alouini <[email protected]>

* Add missing pyodbc for tests.

Signed-off-by: Yassine Alouini <[email protected]>

* Mock connection as well.

Signed-off-by: Yassine Alouini <[email protected]>

* Add more dates parsing for mssql backend (thanks to [email protected])

Signed-off-by: Yassine Alouini <[email protected]>

* Fix an error in docstring of MetricsDataSet (kedro-org#98)

Signed-off-by: Yassine Alouini <[email protected]>

* Bump relax pyarrow version to work the same way as Pandas (kedro-org#100)

* Bump relax pyarrow version to work the same way as Pandas

We only use PyArrow for `pandas.ParquetDataSet` as such I suggest we keep our versions pinned to the same range as [Pandas does](https://github.com/pandas-dev/pandas/blob/96fc51f5ec678394373e2c779ccff37ddb966e75/pyproject.toml#L100) for the same reason.

As such I suggest we remove the upper bound as we have users requesting later versions in [support channels](https://kedro-org.slack.com/archives/C03RKP2LW64/p1674040509133529)

* Updated release notes

Signed-off-by: Yassine Alouini <[email protected]>

* Add missing type in catalog example.

Signed-off-by: Yassine Alouini <[email protected]>

* Add one more unit tests for adapt_mssql.

Signed-off-by: Yassine Alouini <[email protected]>

* [FIX] Add missing mocker from date test.

Signed-off-by: Yassine Alouini <[email protected]>

* [TEST] Add a wrong input test.

Signed-off-by: Yassine Alouini <[email protected]>

* Add pyodbc dependency.

Signed-off-by: Yassine Alouini <[email protected]>

* [FIX] Remove dict() in tests.

Signed-off-by: Yassine Alouini <[email protected]>

* Change check to check on plugin name (kedro-org#103)

Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: Yassine Alouini <[email protected]>

* Set coverage in pyproject.toml (kedro-org#105)

Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: Yassine Alouini <[email protected]>

* Move coverage settings to pyproject.toml (kedro-org#106)

Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: Yassine Alouini <[email protected]>

* Replace kedro.pipeline with modular_pipeline.pipeline factory (kedro-org#99)

* Add non-spark related test changes
Replace kedro.pipeline.Pipeline with
kedro.pipeline.modular_pipeline.pipeline factory.
This is for symmetry with changes made to the main kedro library.

Signed-off-by: Adam Farley <[email protected]>

Signed-off-by: Yassine Alouini <[email protected]>

* Fix outdated links in Kedro Datasets (kedro-org#111)

* fix links

* fix dill links

Signed-off-by: Yassine Alouini <[email protected]>

* Fix docs formatting and phrasing for some datasets (kedro-org#107)

* Fix docs formatting and phrasing for some datasets

Signed-off-by: Deepyaman Datta <[email protected]>

* Manually fix files not resolved with patch command

Signed-off-by: Deepyaman Datta <[email protected]>

* Apply fix from kedro-org#98

Signed-off-by: Deepyaman Datta <[email protected]>

---------

Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Yassine Alouini <[email protected]>

* Release `kedro-datasets` `version 1.0.2` (kedro-org#112)

* bump version and update release notes

* fix pylint errors

Signed-off-by: Yassine Alouini <[email protected]>

* Bump pytest to 7.2 (kedro-org#113)

Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: Yassine Alouini <[email protected]>

* Prefix Docker plugin name with "Kedro-" in usage message (kedro-org#57)

* Prefix Docker plugin name with "Kedro-" in usage message

Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Yassine Alouini <[email protected]>

* Keep Kedro-Docker plugin docstring from appearing in `kedro -h` (kedro-org#56)

* Keep Kedro-Docker plugin docstring from appearing in `kedro -h`

Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Yassine Alouini <[email protected]>

* [kedro-datasets ] Add `Polars.CSVDataSet` (kedro-org#95)

Signed-off-by: wmoreiraa <[email protected]>

Signed-off-by: Yassine Alouini <[email protected]>

* Remove deprecated `test_requires` from `setup.py` in Kedro-Docker (kedro-org#54)

Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Yassine Alouini <[email protected]>

* [FIX] Fix ds to data_set.

Signed-off-by: Yassine Alouini <[email protected]>

---------

Signed-off-by: Mariusz Strzelecki <[email protected]>
Signed-off-by: Mariusz Strzelecki <[email protected]>
Signed-off-by: Yassine Alouini <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>
Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Co-authored-by: Mariusz Strzelecki <[email protected]>
Co-authored-by: Jannic <[email protected]>
Co-authored-by: Merel Theisen <[email protected]>
Co-authored-by: OKA Naoya <[email protected]>
Co-authored-by: Joel <[email protected]>
Co-authored-by: adamfrly <[email protected]>
Co-authored-by: Sajid Alam <[email protected]>
Co-authored-by: Deepyaman Datta <[email protected]>
Co-authored-by: Walber Moreira <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
… file path (kedro-org#114)

* Add databricks deployment check and automatic DBFS path addition

Signed-off-by: Jannic Holzer <[email protected]>

* Add newline at end of file

Signed-off-by: Jannic Holzer <[email protected]>

* Remove spurious 'not'

Signed-off-by: Jannic Holzer <[email protected]>

* Move dbfs utility functions from SparkDataSet

Signed-off-by: Jannic Holzer <[email protected]>

* Add edge case logic to _build_dbfs_path

Signed-off-by: Jannic Holzer <[email protected]>

* Add test for dbfs path construction

Signed-off-by: Jannic Holzer <[email protected]>

* Linting

Signed-off-by: Jannic Holzer <[email protected]>

* Remove spurious print statement :)

Signed-off-by: Jannic Holzer <[email protected]>

* Add pylint disable too-many-public-methods

Signed-off-by: Jannic Holzer <[email protected]>

* Move tests into single method to appease linter

Signed-off-by: Jannic Holzer <[email protected]>

* Modify prefix check to /dbfs/

Signed-off-by: Jannic Holzer <[email protected]>

* Modify prefix check to /dbfs/

Signed-off-by: Jannic Holzer <[email protected]>

* Make warning message clearer

Signed-off-by: Jannic Holzer <[email protected]>

* Add release note

Signed-off-by: Jannic Holzer <[email protected]>

* Fix linting

Signed-off-by: Jannic Holzer <[email protected]>

* Update warning message

Signed-off-by: Jannic Holzer <[email protected]>

* Modify log warning level to error

Signed-off-by: Jannic Holzer <[email protected]>

* Modify message back to warning, refer to undefined behaviour

Signed-off-by: Jannic Holzer <[email protected]>

* Modify required prefix to /dbfs/

Signed-off-by: Jannic Holzer <[email protected]>

* Modify doc string

Signed-off-by: Jannic Holzer <[email protected]>

* Modify warning message

Signed-off-by: Jannic Holzer <[email protected]>

* Split tests and add filepath to warning

Signed-off-by: Jannic Holzer <[email protected]>

* Modify f string in logging call

Signed-off-by: Jannic Holzer <[email protected]>

* Fix tests

Signed-off-by: Jannic Holzer <[email protected]>

* Lint

Signed-off-by: Jannic Holzer <[email protected]>

---------

Signed-off-by: Jannic Holzer <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
* Add Snowpark datasets

Signed-off-by: Vladimir Filimonov <[email protected]>
Signed-off-by: heber-urdaneta <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
* bump version and update release notes

* fix pylint errors

Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
astrojuanlu and others added 12 commits May 3, 2023 12:50
* Migrate kedro-airflow to static metadata

See kedro-org/kedro#2334.

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Add explicit PEP 518 build requirements for kedro-datasets

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Typos

Co-authored-by: Merel Theisen <[email protected]>

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Remove dangling reference to requirements.txt

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Add release notes

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

---------

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
* Migrate kedro-telemetry to static metadata

See kedro-org/kedro#2334.

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Add release notes

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

---------

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
* Add unit test + lint test on GA

* trigger GA - will revert

Signed-off-by: Ankita Katiyar <[email protected]>

* Fix lint

Signed-off-by: Ankita Katiyar <[email protected]>

* Add end to end tests

* Add cache key

Signed-off-by: Ankita Katiyar <[email protected]>

* Add cache action

Signed-off-by: Ankita Katiyar <[email protected]>

* Rename workflow files

Signed-off-by: Ankita Katiyar <[email protected]>

* Lint + add comment + default bash

Signed-off-by: Ankita Katiyar <[email protected]>

* Add windows test

Signed-off-by: Ankita Katiyar <[email protected]>

* Update workflow name + revert changes to READMEs

Signed-off-by: Ankita Katiyar <[email protected]>

* Add kedro-telemetry/RELEASE.md to trufflehog ignore

Signed-off-by: Ankita Katiyar <[email protected]>

* Add pytables to test_requirements remove from workflow

Signed-off-by: Ankita Katiyar <[email protected]>

* Revert "Add pytables to test_requirements remove from workflow"

This reverts commit 8203daa.

* Separate pip freeze step

Signed-off-by: Ankita Katiyar <[email protected]>

---------

Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
* Migrate kedro-docker to static metadata

See kedro-org/kedro#2334.

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Address packaging warning

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Fix tests

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Actually install current plugin with dependencies

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Add release notes

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

---------

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
Currently, opening Gitpod installs Python 3.11, which breaks everything because we don't support it yet. This PR introduces a simple .gitpod.yml to get it started.

Signed-off-by: Danny Farah <[email protected]>
* Update APIDataSet

Signed-off-by: Nok Chan <[email protected]>

* Sync ParquetDataSet

Signed-off-by: Nok Chan <[email protected]>

* Sync Test

Signed-off-by: Nok Chan <[email protected]>

* Linting

Signed-off-by: Nok Chan <[email protected]>

* Revert Unnecessary ParquetDataSet Changes

Signed-off-by: Nok Chan <[email protected]>

* Sync release notes

Signed-off-by: Nok Chan <[email protected]>

---------

Signed-off-by: Nok Chan <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
…edro-org#182)

* bump tables version and remove step in workflow

Signed-off-by: Ankita Katiyar <[email protected]>

* revert version for linux

Signed-off-by: Ankita Katiyar <[email protected]>

* change version to 3.7

Signed-off-by: Ankita Katiyar <[email protected]>

* remove extra line

Signed-off-by: Ankita Katiyar <[email protected]>

---------

Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
* Create validate-pr-title.yaml

* ci: add `ready_for_review` to the PR type triggers

* Update validate-pr-title.yaml

* revert: drop the `ready_for_review` type from list

* ci: restrict the set of scopes to the plugin names

Signed-off-by: Danny Farah <[email protected]>
)

* refactor TensorFlowModelDataset to Set

For consistency with all other kedro-datasets, DataSet should be camel case. This will be reverted in 0.19.0.

Signed-off-by: BrianCechmanek <[email protected]>

* Introducing .gitpod.yml to kedro-plugins (kedro-org#185)

Currently, opening Gitpod installs Python 3.11, which breaks everything because we don't support it yet. This PR introduces a simple .gitpod.yml to get it started.

Signed-off-by: BrianCechmanek <[email protected]>

* sync APIDataSet  from kedro's `develop` (kedro-org#184)

* Update APIDataSet

Signed-off-by: Nok Chan <[email protected]>

* Sync ParquetDataSet

Signed-off-by: Nok Chan <[email protected]>

* Sync Test

Signed-off-by: Nok Chan <[email protected]>

* Linting

Signed-off-by: Nok Chan <[email protected]>

* Revert Unnecessary ParquetDataSet Changes

Signed-off-by: Nok Chan <[email protected]>

* Sync release notes

Signed-off-by: Nok Chan <[email protected]>

---------

Signed-off-by: Nok Chan <[email protected]>
Signed-off-by: BrianCechmanek <[email protected]>

* [kedro-datasets] Bump version of `tables` in `test_requirements.txt`  (kedro-org#182)

* bump tables version and remove step in workflow

Signed-off-by: Ankita Katiyar <[email protected]>

* revert version for linux

Signed-off-by: Ankita Katiyar <[email protected]>

* change version to 3.7

Signed-off-by: Ankita Katiyar <[email protected]>

* remove extra line

Signed-off-by: Ankita Katiyar <[email protected]>

---------

Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: BrianCechmanek <[email protected]>

* refactor tensorflowModelDataset casing in datasets setup.py

Signed-off-by: BrianCechmanek <[email protected]>

* add tensorflowmodeldataset bugfix to release.md

Signed-off-by: BrianCechmanek <[email protected]>

* Update all the doc reference with TensorFlowModelDataSet

Signed-off-by: Nok <[email protected]>

---------

Signed-off-by: BrianCechmanek <[email protected]>
Signed-off-by: Nok Chan <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Nok <[email protected]>
Co-authored-by: Nok Lam Chan <[email protected]>
Co-authored-by: Ankita Katiyar <[email protected]>
Co-authored-by: Nok <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
@jmholzer jmholzer changed the title First release of databricks.ManagedTableDataset Add ManagedTableDataset for managed Delta Lake tables in Databricks May 4, 2023
@jmholzer jmholzer changed the title Add ManagedTableDataset for managed Delta Lake tables in Databricks feat: Add ManagedTableDataset for managed Delta Lake tables in Databricks May 4, 2023
@jmholzer
Contributor

jmholzer commented May 5, 2023

I made a few changes:

  • I added tests to reach 100% coverage in managed_table_dataset.
  • I removed the functions for automatically getting and updating the version cache (and the version cache itself) as all of these were unused (and untested). They are not necessary for the dataset to function, and in other datasets we do not use this approach. They also introduced unnecessary dependencies. If we decide we want the functionality they intended (?) to offer, we can always add this in a future PR.
  • I removed a walrus operator, as we need to support Python 3.7 :).
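
For context on the last point: assignment expressions (the walrus operator, `:=`) were added in Python 3.8, so code that must run on 3.7 has to assign in a separate statement. The sketch below shows that kind of rewrite on a made-up helper; `parse_version` and the `_v<N>` naming scheme are assumptions for illustration and are not taken from the PR.

```python
import re
from typing import Optional

# Hypothetical pattern: a trailing version suffix such as "_v3".
_VERSION_SUFFIX = re.compile(r"_v(\d+)$")


def parse_version(table_name: str) -> Optional[int]:
    """Return the trailing version number of a table name, if there is one."""
    # Python 3.8+ would allow:
    #     if (match := _VERSION_SUFFIX.search(table_name)) is not None:
    # On Python 3.7 the assignment has to be its own statement.
    match = _VERSION_SUFFIX.search(table_name)
    if match is not None:
        return int(match.group(1))
    return None
```

The two forms behave the same; the separate assignment only costs one extra line.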

@jmholzer
Contributor

Closing this in favour of #206, which has a clean commit history, has signed commits and is based on the latest commit in main.
