feat: Add ManagedTableDataset for managed Delta Lake tables in Databricks #127
Hey @noklam, I just saw your comment in the other PR. I did see those two datasets; this one is more focused on Databricks Unity Catalog tables. The SparkDataSet and DeltaTableDataSet are for interfacing with files directly. Both can be used on Databricks, but they are intended for different purposes.
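To make the distinction concrete, here is a minimal sketch of the two access styles in plain PySpark. It is not the code in this PR: the helper names (`qualified_table_name`, `read_delta_files`, `read_catalog_table`) are hypothetical, and it assumes an existing `SparkSession` with Delta Lake available is passed in as `spark`.

```python
def qualified_table_name(catalog, database, table):
    """Build a three-level name such as 'main.sales.orders' (hypothetical helper)."""
    parts = (catalog, database, table)
    if not all(parts):
        raise ValueError("catalog, database and table must all be non-empty")
    return ".".join(parts)


def read_delta_files(spark, path):
    """File-based access: point Spark at a storage location that holds Delta files."""
    return spark.read.format("delta").load(path)


def read_catalog_table(spark, catalog, database, table):
    """Table-based access: let the metastore resolve a table by name, with no path."""
    return spark.table(qualified_table_name(catalog, database, table))
```

With the file-based style the caller owns the storage path; with the table-based style the caller only knows the table name and the catalog decides where the data lives.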
Force-pushed from 7121847 to 9b43324.
…atasets allows users to interface with Unity catalog tables in Databricks to both read and write. Signed-off-by: Danny Farah <[email protected]>
…org#99) * Add non-spark related test changes Replace kedro.pipeline.Pipeline with kedro.pipeline.modular_pipeline.pipeline factory. This is for symmetry with changes made to the main kedro library. Signed-off-by: Adam Farley <[email protected]> Signed-off-by: Danny Farah <[email protected]>
* fix links * fix dill links Signed-off-by: Danny Farah <[email protected]>
* Fix docs formatting and phrasing for some datasets Signed-off-by: Deepyaman Datta <[email protected]> * Manually fix files not resolved with patch command Signed-off-by: Deepyaman Datta <[email protected]> * Apply fix from kedro-org#98 Signed-off-by: Deepyaman Datta <[email protected]> --------- Signed-off-by: Deepyaman Datta <[email protected]> Signed-off-by: Danny Farah <[email protected]>
* bump version and update release notes * fix pylint errors Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Merel Theisen <[email protected]> Signed-off-by: Danny Farah <[email protected]>
* Prefix Docker plugin name with "Kedro-" in usage message Signed-off-by: Deepyaman Datta <[email protected]> Signed-off-by: Danny Farah <[email protected]>
…o-org#56) * Keep Kedro-Docker plugin docstring from appearing in `kedro -h` Signed-off-by: Deepyaman Datta <[email protected]> Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: wmoreiraa <[email protected]> Signed-off-by: Danny Farah <[email protected]>
…dro-org#54) Signed-off-by: Deepyaman Datta <[email protected]> Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]> Signed-off-by: Danny Farah <[email protected]>
…ro-org#118) Signed-off-by: Danny Farah <[email protected]>
* [kedro-docker] Layers size optimization (kedro-org#92) * [kedro-docker] Layers size optimization Signed-off-by: Mariusz Strzelecki <[email protected]> * Adjust test requirements Signed-off-by: Mariusz Strzelecki <[email protected]> * Skip coverage check on tests dir (some do not execute on Windows) Signed-off-by: Mariusz Strzelecki <[email protected]> * Update .coveragerc with the setup Signed-off-by: Mariusz Strzelecki <[email protected]> * Fix bandit so it does not scan kedro-datasets Signed-off-by: Mariusz Strzelecki <[email protected]> * Fixed existence test Signed-off-by: Mariusz Strzelecki <[email protected]> * Check why dir is not created Signed-off-by: Mariusz Strzelecki <[email protected]> * Kedro starters are fixed now Signed-off-by: Mariusz Strzelecki <[email protected]> * Increased no-output-timeout for long spark image build Signed-off-by: Mariusz Strzelecki <[email protected]> * Spark image optimized Signed-off-by: Mariusz Strzelecki <[email protected]> * Linting Signed-off-by: Mariusz Strzelecki <[email protected]> * Switch to slim image always Signed-off-by: Mariusz Strzelecki <[email protected]> * Trigger build Signed-off-by: Mariusz Strzelecki <[email protected]> * Use textwrap.dedent for nicer indentation Signed-off-by: Mariusz Strzelecki <[email protected]> * Revert "Use textwrap.dedent for nicer indentation" This reverts commit 3a1e3f8. Signed-off-by: Mariusz Strzelecki <[email protected]> * Revert "Revert "Use textwrap.dedent for nicer indentation"" This reverts commit d322d35. 
Signed-off-by: Mariusz Strzelecki <[email protected]> * Make tests read more lines (to skip all deprecation warnings) Signed-off-by: Mariusz Strzelecki <[email protected]> Signed-off-by: Mariusz Strzelecki <[email protected]> Signed-off-by: Mariusz Strzelecki <[email protected]> Signed-off-by: Yassine Alouini <[email protected]> * Release Kedro-Docker 0.3.1 (kedro-org#94) * Add release notes for kedro-docker 0.3.1 Signed-off-by: Jannic Holzer <[email protected]> * Update version in kedro_docker module Signed-off-by: Jannic Holzer <[email protected]> Signed-off-by: Jannic Holzer <[email protected]> Signed-off-by: Yassine Alouini <[email protected]> * Bump version and update release notes (kedro-org#96) Signed-off-by: Merel Theisen <[email protected]> Signed-off-by: Yassine Alouini <[email protected]> * Make the SQLQueryDataSet compatible with mssql. Signed-off-by: Yassine Alouini <[email protected]> * Add one test + update RELEASE.md. Signed-off-by: Yassine Alouini <[email protected]> * Add missing pyodbc for tests. Signed-off-by: Yassine Alouini <[email protected]> * Mock connection as well. Signed-off-by: Yassine Alouini <[email protected]> * Add more dates parsing for mssql backend (thanks to [email protected]) Signed-off-by: Yassine Alouini <[email protected]> * Fix an error in docstring of MetricsDataSet (kedro-org#98) Signed-off-by: Yassine Alouini <[email protected]> * Bump relax pyarrow version to work the same way as Pandas (kedro-org#100) * Bump relax pyarrow version to work the same way as Pandas We only use PyArrow for `pandas.ParquetDataSet` as such I suggest we keep our versions pinned to the same range as [Pandas does](https://github.com/pandas-dev/pandas/blob/96fc51f5ec678394373e2c779ccff37ddb966e75/pyproject.toml#L100) for the same reason. 
As such I suggest we remove the upper bound as we have users requesting later versions in [support channels](https://kedro-org.slack.com/archives/C03RKP2LW64/p1674040509133529) * Updated release notes Signed-off-by: Yassine Alouini <[email protected]> * Add missing type in catalog example. Signed-off-by: Yassine Alouini <[email protected]> * Add one more unit tests for adapt_mssql. Signed-off-by: Yassine Alouini <[email protected]> * [FIX] Add missing mocker from date test. Signed-off-by: Yassine Alouini <[email protected]> * [TEST] Add a wrong input test. Signed-off-by: Yassine Alouini <[email protected]> * Add pyodbc dependency. Signed-off-by: Yassine Alouini <[email protected]> * [FIX] Remove dict() in tests. Signed-off-by: Yassine Alouini <[email protected]> * Change check to check on plugin name (kedro-org#103) Signed-off-by: Merel Theisen <[email protected]> Signed-off-by: Yassine Alouini <[email protected]> * Set coverage in pyproject.toml (kedro-org#105) Signed-off-by: Merel Theisen <[email protected]> Signed-off-by: Yassine Alouini <[email protected]> * Move coverage settings to pyproject.toml (kedro-org#106) Signed-off-by: Merel Theisen <[email protected]> Signed-off-by: Yassine Alouini <[email protected]> * Replace kedro.pipeline with modular_pipeline.pipeline factory (kedro-org#99) * Add non-spark related test changes Replace kedro.pipeline.Pipeline with kedro.pipeline.modular_pipeline.pipeline factory. This is for symmetry with changes made to the main kedro library. 
Signed-off-by: Adam Farley <[email protected]> Signed-off-by: Yassine Alouini <[email protected]> * Fix outdated links in Kedro Datasets (kedro-org#111) * fix links * fix dill links Signed-off-by: Yassine Alouini <[email protected]> * Fix docs formatting and phrasing for some datasets (kedro-org#107) * Fix docs formatting and phrasing for some datasets Signed-off-by: Deepyaman Datta <[email protected]> * Manually fix files not resolved with patch command Signed-off-by: Deepyaman Datta <[email protected]> * Apply fix from kedro-org#98 Signed-off-by: Deepyaman Datta <[email protected]> --------- Signed-off-by: Deepyaman Datta <[email protected]> Signed-off-by: Yassine Alouini <[email protected]> * Release `kedro-datasets` `version 1.0.2` (kedro-org#112) * bump version and update release notes * fix pylint errors Signed-off-by: Yassine Alouini <[email protected]> * Bump pytest to 7.2 (kedro-org#113) Signed-off-by: Merel Theisen <[email protected]> Signed-off-by: Yassine Alouini <[email protected]> * Prefix Docker plugin name with "Kedro-" in usage message (kedro-org#57) * Prefix Docker plugin name with "Kedro-" in usage message Signed-off-by: Deepyaman Datta <[email protected]> Signed-off-by: Yassine Alouini <[email protected]> * Keep Kedro-Docker plugin docstring from appearing in `kedro -h` (kedro-org#56) * Keep Kedro-Docker plugin docstring from appearing in `kedro -h` Signed-off-by: Deepyaman Datta <[email protected]> Signed-off-by: Yassine Alouini <[email protected]> * [kedro-datasets ] Add `Polars.CSVDataSet` (kedro-org#95) Signed-off-by: wmoreiraa <[email protected]> Signed-off-by: Yassine Alouini <[email protected]> * Remove deprecated `test_requires` from `setup.py` in Kedro-Docker (kedro-org#54) Signed-off-by: Deepyaman Datta <[email protected]> Signed-off-by: Yassine Alouini <[email protected]> * [FIX] Fix ds to data_set. 
Signed-off-by: Yassine Alouini <[email protected]> --------- Signed-off-by: Mariusz Strzelecki <[email protected]> Signed-off-by: Mariusz Strzelecki <[email protected]> Signed-off-by: Yassine Alouini <[email protected]> Signed-off-by: Jannic Holzer <[email protected]> Signed-off-by: Merel Theisen <[email protected]> Signed-off-by: Deepyaman Datta <[email protected]> Co-authored-by: Mariusz Strzelecki <[email protected]> Co-authored-by: Jannic <[email protected]> Co-authored-by: Merel Theisen <[email protected]> Co-authored-by: OKA Naoya <[email protected]> Co-authored-by: Joel <[email protected]> Co-authored-by: adamfrly <[email protected]> Co-authored-by: Sajid Alam <[email protected]> Co-authored-by: Deepyaman Datta <[email protected]> Co-authored-by: Walber Moreira <[email protected]> Signed-off-by: Danny Farah <[email protected]>
… file path (kedro-org#114) * Add databricks deployment check and automatic DBFS path addition Signed-off-by: Jannic Holzer <[email protected]> * Add newline at end of file Signed-off-by: Jannic Holzer <[email protected]> * Remove spurious 'not' Signed-off-by: Jannic Holzer <[email protected]> * Move dbfs utility functions from SparkDataSet Signed-off-by: Jannic Holzer <[email protected]> * Add edge case logic to _build_dbfs_path Signed-off-by: Jannic Holzer <[email protected]> * Add test for dbfs path construction Signed-off-by: Jannic Holzer <[email protected]> * Linting Signed-off-by: Jannic Holzer <[email protected]> * Remove spurious print statement :) Signed-off-by: Jannic Holzer <[email protected]> * Add pylint disable too-many-public-methods Signed-off-by: Jannic Holzer <[email protected]> * Move tests into single method to appease linter Signed-off-by: Jannic Holzer <[email protected]> * Modify prefix check to /dbfs/ Signed-off-by: Jannic Holzer <[email protected]> * Modify prefix check to /dbfs/ Signed-off-by: Jannic Holzer <[email protected]> * Make warning message clearer Signed-off-by: Jannic Holzer <[email protected]> * Add release note Signed-off-by: Jannic Holzer <[email protected]> * Fix linting Signed-off-by: Jannic Holzer <[email protected]> * Update warning message Signed-off-by: Jannic Holzer <[email protected]> * Modify log warning level to error Signed-off-by: Jannic Holzer <[email protected]> * Modify message back to warning, refer to undefined behaviour Signed-off-by: Jannic Holzer <[email protected]> * Modify required prefix to /dbfs/ Signed-off-by: Jannic Holzer <[email protected]> * Modify doc string Signed-off-by: Jannic Holzer <[email protected]> * Modify warning message Signed-off-by: Jannic Holzer <[email protected]> * Split tests and add filepath to warning Signed-off-by: Jannic Holzer <[email protected]> * Modify f string in logging call Signed-off-by: Jannic Holzer <[email protected]> * Fix tests Signed-off-by: Jannic Holzer 
<[email protected]> * Lint Signed-off-by: Jannic Holzer <[email protected]> --------- Signed-off-by: Jannic Holzer <[email protected]> Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
* Add Snowpark datasets Signed-off-by: Vladimir Filimonov <[email protected]> Signed-off-by: heber-urdaneta <[email protected]> Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
* bump version and update release notes * fix pylint errors Signed-off-by: Danny Farah <[email protected]>
) Signed-off-by: Merel Theisen <[email protected]> Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Merel Theisen <[email protected]> Signed-off-by: Danny Farah <[email protected]>
* Migrate kedro-airflow to static metadata See kedro-org/kedro#2334. Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> * Add explicit PEP 518 build requirements for kedro-datasets Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> * Typos Co-authored-by: Merel Theisen <[email protected]> Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> * Remove dangling reference to requirements.txt Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> * Add release notes Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> --------- Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> Signed-off-by: Danny Farah <[email protected]>
* Migrate kedro-telemetry to static metadata See kedro-org/kedro#2334. Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> * Add release notes Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> --------- Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> Signed-off-by: Danny Farah <[email protected]>
* Add unit test + lint test on GA * trigger GA - will revert Signed-off-by: Ankita Katiyar <[email protected]> * Fix lint Signed-off-by: Ankita Katiyar <[email protected]> * Add end to end tests * Add cache key Signed-off-by: Ankita Katiyar <[email protected]> * Add cache action Signed-off-by: Ankita Katiyar <[email protected]> * Rename workflow files Signed-off-by: Ankita Katiyar <[email protected]> * Lint + add comment + default bash Signed-off-by: Ankita Katiyar <[email protected]> * Add windows test Signed-off-by: Ankita Katiyar <[email protected]> * Update workflow name + revert changes to READMEs Signed-off-by: Ankita Katiyar <[email protected]> * Add kedro-telemetry/RELEASE.md to trufflehog ignore Signed-off-by: Ankita Katiyar <[email protected]> * Add pytables to test_requirements remove from workflow Signed-off-by: Ankita Katiyar <[email protected]> * Revert "Add pytables to test_requirements remove from workflow" This reverts commit 8203daa. * Separate pip freeze step Signed-off-by: Ankita Katiyar <[email protected]> --------- Signed-off-by: Ankita Katiyar <[email protected]> Signed-off-by: Danny Farah <[email protected]>
* Migrate kedro-docker to static metadata See kedro-org/kedro#2334. Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> * Address packaging warning Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> * Fix tests Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> * Actually install current plugin with dependencies Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> * Add release notes Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> --------- Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> Signed-off-by: Danny Farah <[email protected]>
Currently, opening Gitpod installs Python 3.11, which breaks everything because we don't support it yet. This PR introduces a simple .gitpod.yml to get started. Signed-off-by: Danny Farah <[email protected]>
* Update APIDataSet Signed-off-by: Nok Chan <[email protected]> * Sync ParquetDataSet Signed-off-by: Nok Chan <[email protected]> * Sync Test Signed-off-by: Nok Chan <[email protected]> * Linting Signed-off-by: Nok Chan <[email protected]> * Revert Unnecessary ParquetDataSet Changes Signed-off-by: Nok Chan <[email protected]> * Sync release notes Signed-off-by: Nok Chan <[email protected]> --------- Signed-off-by: Nok Chan <[email protected]> Signed-off-by: Danny Farah <[email protected]>
…edro-org#182) * bump tables version and remove step in workflow Signed-off-by: Ankita Katiyar <[email protected]> * revert version for linux Signed-off-by: Ankita Katiyar <[email protected]> * change version to 3.7 Signed-off-by: Ankita Katiyar <[email protected]> * remove extra line Signed-off-by: Ankita Katiyar <[email protected]> --------- Signed-off-by: Ankita Katiyar <[email protected]> Signed-off-by: Danny Farah <[email protected]>
* Create validate-pr-title.yaml * ci: add `ready_for_review` to the PR type triggers * Update validate-pr-title.yaml * revert: drop the `ready_for_review` type from list * ci: restrict the set of scopes to the plugin names Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Merel Theisen <[email protected]> Signed-off-by: Danny Farah <[email protected]>
) * refactor TensorFlowModelDataset to Set matching consistency of all other kedro-datasets, DataSet should be camelcase. will be reverted in 0.19.0 Signed-off-by: BrianCechmanek <[email protected]> * Introdcuing .gitpod.yml to kedro-plugins (kedro-org#185) Currently opening gitpod will installed a Python 3.11 which breaks everything because we don't support it set. This PR introduce a simple .gitpod.yml to get it started. Signed-off-by: BrianCechmanek <[email protected]> * sync APIDataSet from kedro's `develop` (kedro-org#184) * Update APIDataSet Signed-off-by: Nok Chan <[email protected]> * Sync ParquetDataSet Signed-off-by: Nok Chan <[email protected]> * Sync Test Signed-off-by: Nok Chan <[email protected]> * Linting Signed-off-by: Nok Chan <[email protected]> * Revert Unnecessary ParquetDataSet Changes Signed-off-by: Nok Chan <[email protected]> * Sync release notes Signed-off-by: Nok Chan <[email protected]> --------- Signed-off-by: Nok Chan <[email protected]> Signed-off-by: BrianCechmanek <[email protected]> * [kedro-datasets] Bump version of `tables` in `test_requirements.txt` (kedro-org#182) * bump tables version and remove step in workflow Signed-off-by: Ankita Katiyar <[email protected]> * revert version for linux Signed-off-by: Ankita Katiyar <[email protected]> * change version to 3.7 Signed-off-by: Ankita Katiyar <[email protected]> * remove extra line Signed-off-by: Ankita Katiyar <[email protected]> --------- Signed-off-by: Ankita Katiyar <[email protected]> Signed-off-by: BrianCechmanek <[email protected]> * refactor tensorflowModelDataset casing in datasets setup.py Signed-off-by: BrianCechmanek <[email protected]> * add tensorflowmodeldataset bugfix to release.md Signed-off-by: BrianCechmanek <[email protected]> * Update all the doc reference with TensorFlowModelDataSet Signed-off-by: Nok <[email protected]> --------- Signed-off-by: BrianCechmanek <[email protected]> Signed-off-by: Nok Chan <[email protected]> Signed-off-by: Ankita Katiyar 
<[email protected]> Signed-off-by: Nok <[email protected]> Co-authored-by: Nok Lam Chan <[email protected]> Co-authored-by: Ankita Katiyar <[email protected]> Co-authored-by: Nok <[email protected]> Signed-off-by: Danny Farah <[email protected]>
Co-authored-by: Jannic <[email protected]> Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>
I made a few changes:
Signed-off-by: Jannic Holzer <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>
Closing this in favour of #206, which has a clean commit history, has signed commits and is based on the latest commit in |
Description
This is the first of a few PRs that add Databricks functionality to Kedro datasets. It includes the ManagedTableDataset, which allows users to interface with managed Delta tables in Databricks or locally in PySpark.
Development notes
Changes include a new dataset, databricks.ManagedTableDataSet, which allows users to read from and write to managed Delta tables inside Databricks.
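As a rough illustration of what writing to a managed Delta table involves, the sketch below wraps PySpark's `DataFrameWriter.saveAsTable`. It is not the dataset's implementation: `save_managed_table` and `ALLOWED_MODES` are hypothetical names, and the set of modes is an assumption made only for this example.

```python
# Assumed set of modes for this sketch; the dataset in the PR may differ.
ALLOWED_MODES = ("overwrite", "append")


def save_managed_table(df, table_name, mode="overwrite"):
    """Write a Spark DataFrame to a managed Delta table identified by name.

    `df` is expected to be a pyspark.sql.DataFrame. No storage path is given,
    so the metastore chooses where the table's files are stored.
    """
    if mode not in ALLOWED_MODES:
        raise ValueError(f"mode must be one of {ALLOWED_MODES}, got {mode!r}")
    df.write.format("delta").mode(mode).saveAsTable(table_name)
```

Validating the mode before touching Spark keeps a typo in configuration from surfacing later as a less readable Spark error.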
Checklist
RELEASE.md file