feat(datasets): add `SparkStreamingDataSet` #198

tingtingQB · 2023-05-02T09:48:56Z

Description

This PR is a continuation of the closed PR #168.
It adds a spark_stream_dataset to the Kedro datasets plugin.

Development notes

One Spark dataset and its unit tests

Checklist

Opened this PR as a 'Draft Pull Request' if it is work-in-progress
Updated the documentation to reflect the code changes
Added a description of this change in the relevant RELEASE.md file
Added tests to cover my changes

…org#161) * Include missing requirements files in sdist Fix kedro-orggh-86. Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> * Migrate most project metadata to `pyproject.toml` See kedro-org/kedro#2334. Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> * Move requirements to `pyproject.toml` Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> --------- Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> Signed-off-by: Tingting_Wan <[email protected]>

* Migrate kedro-airflow to static metadata See kedro-org/kedro#2334. Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> * Add explicit PEP 518 build requirements for kedro-datasets Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> * Typos Co-authored-by: Merel Theisen <[email protected]> Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> * Remove dangling reference to requirements.txt Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> * Add release notes Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> --------- Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> Signed-off-by: Tingting_Wan <[email protected]>

* Migrate kedro-telemetry to static metadata See kedro-org/kedro#2334. Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> * Add release notes Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> --------- Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> Signed-off-by: Tingting_Wan <[email protected]>

* Add unit test + lint test on GA * trigger GA - will revert Signed-off-by: Ankita Katiyar <[email protected]> * Fix lint Signed-off-by: Ankita Katiyar <[email protected]> * Add end to end tests * Add cache key Signed-off-by: Ankita Katiyar <[email protected]> * Add cache action Signed-off-by: Ankita Katiyar <[email protected]> * Rename workflow files Signed-off-by: Ankita Katiyar <[email protected]> * Lint + add comment + default bash Signed-off-by: Ankita Katiyar <[email protected]> * Add windows test Signed-off-by: Ankita Katiyar <[email protected]> * Update workflow name + revert changes to READMEs Signed-off-by: Ankita Katiyar <[email protected]> * Add kedro-telemetry/RELEASE.md to trufflehog ignore Signed-off-by: Ankita Katiyar <[email protected]> * Add pytables to test_requirements remove from workflow Signed-off-by: Ankita Katiyar <[email protected]> * Revert "Add pytables to test_requirements remove from workflow" This reverts commit 8203daa. * Separate pip freeze step Signed-off-by: Ankita Katiyar <[email protected]> --------- Signed-off-by: Ankita Katiyar <[email protected]> Signed-off-by: Tingting_Wan <[email protected]>

* Migrate kedro-docker to static metadata See kedro-org/kedro#2334. Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> * Address packaging warning Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> * Fix tests Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> * Actually install current plugin with dependencies Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> * Add release notes Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> --------- Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> Signed-off-by: Tingting_Wan <[email protected]>

* Update APIDataSet Signed-off-by: Nok Chan <[email protected]> * Sync ParquetDataSet Signed-off-by: Nok Chan <[email protected]> * Sync Test Signed-off-by: Nok Chan <[email protected]> * Linting Signed-off-by: Nok Chan <[email protected]> * Revert Unnecessary ParquetDataSet Changes Signed-off-by: Nok Chan <[email protected]> * Sync release notes Signed-off-by: Nok Chan <[email protected]> --------- Signed-off-by: Nok Chan <[email protected]> Signed-off-by: Tingting_Wan <[email protected]>

deepyaman

LGTM overall, minor comments, and one design question. I haven't read the docs, but I assume Jo/others sufficiently reviewed them. :) I can take a look later today on the train perhaps, but no need to wait.

deepyaman · 2023-05-29T11:33:16Z

kedro-datasets/kedro_datasets/spark/spark_streaming_dataset.py

+        if self._schema is not None:
+            if isinstance(self._schema, dict):
+                self._schema = SparkDataSet._load_schema_from_file(self._schema)


@noklam is it OK from a design perspective that SparkStreamingDataSet uses a private method of SparkDataSet? Feels a bit off to me, but perhaps no clear issues if their requirements are the same.
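One way to avoid reaching into another class's private method is to move the schema loading into a shared, public helper that both datasets call. The sketch below is only an illustration of that idea: `load_schema_from_file`, `resolve_schema`, `BatchDataSet` and `StreamingDataSet` are hypothetical names, and the helper returns the parsed JSON as a plain dict rather than a Spark schema object.

```python
import json
from pathlib import Path
from typing import Any, Dict, Optional


def load_schema_from_file(schema: Dict[str, Any]) -> Dict[str, Any]:
    """Hypothetical public helper: read a JSON schema from schema["filepath"]."""
    filepath = schema.get("filepath")
    if not filepath:
        raise ValueError("Schema load argument does not specify a 'filepath'.")
    # Stand-in for the real loading logic: parse the JSON file into a dict.
    return json.loads(Path(filepath).read_text(encoding="utf-8"))


def resolve_schema(schema: Optional[Any]) -> Optional[Any]:
    """Load the schema from file when given as a dict, else pass it through."""
    if isinstance(schema, dict):
        return load_schema_from_file(schema)
    return schema


class BatchDataSet:
    """Stand-in for the batch dataset; it no longer owns the loader."""

    def __init__(self, schema: Optional[Any] = None) -> None:
        self.schema = resolve_schema(schema)


class StreamingDataSet:
    """Stand-in for the streaming dataset; uses the same public helper."""

    def __init__(self, schema: Optional[Any] = None) -> None:
        self.schema = resolve_schema(schema)
```

With this layout neither class depends on the other's internals, at the cost of adding one more public function to maintain.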

deepyaman · 2023-05-29T11:39:14Z

kedro-datasets/kedro_datasets/spark/spark_streaming_dataset.py

+        # Handle schema load argument
+        self._schema = self._load_args.pop("schema", None)
+        if self._schema is not None:
+            if isinstance(self._schema, dict):


Nit: I wonder why this is a nested `if` statement rather than a single condition using `and`. But it's the same on SparkDataSet, so I guess it's consistent. 🤷 Quite possible I did it when adding schema handling.
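For reference, the nested form and the `and` form behave identically here. A minimal sketch, using a hypothetical `loader` callable in place of the real schema loader:

```python
from typing import Any, Callable


def resolve_nested(schema: Any, loader: Callable[[dict], Any]) -> Any:
    # Nested form, mirroring the shape of the diff above.
    if schema is not None:
        if isinstance(schema, dict):
            schema = loader(schema)
    return schema


def resolve_with_and(schema: Any, loader: Callable[[dict], Any]) -> Any:
    # Same logic with the two checks combined.
    if schema is not None and isinstance(schema, dict):
        schema = loader(schema)
    return schema


def resolve_isinstance_only(schema: Any, loader: Callable[[dict], Any]) -> Any:
    # isinstance(None, dict) is False, so the None check is redundant.
    if isinstance(schema, dict):
        schema = loader(schema)
    return schema
```

All three return the input unchanged for `None` or a non-dict value and call the loader only for a dict.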

noklam · 2023-06-13

I missed these comments. Maybe it should just inherit from the SparkDataSet class? See #135. In general, I think we need to look at all of SparkDataSet; much of it is odd, but it's quite tricky to remove the code.

The path handling is particularly confusing because it's unique to Spark. @deepyaman
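A rough sketch of the inheritance option, with stand-in classes rather than the real ones: `SparkLikeDataSet` and `StreamingVariant` are hypothetical, and the schema loader is a placeholder that does no file I/O.

```python
from typing import Any, Dict, Optional


class SparkLikeDataSet:
    """Stand-in for a batch dataset that owns the schema handling."""

    def __init__(self, filepath: str, load_args: Optional[Dict[str, Any]] = None) -> None:
        self._filepath = filepath
        self._load_args = dict(load_args or {})
        # Handle schema load argument once, in the base class.
        self._schema = self._load_args.pop("schema", None)
        if isinstance(self._schema, dict):
            self._schema = self._load_schema_from_file(self._schema)

    @staticmethod
    def _load_schema_from_file(schema: Dict[str, Any]) -> Dict[str, Any]:
        # Placeholder: the real method would read and parse the file.
        return {"loaded_from": schema["filepath"]}


class StreamingVariant(SparkLikeDataSet):
    """Stand-in streaming dataset: inherits __init__ and the schema logic."""

    def describe(self) -> Dict[str, Any]:
        return {
            "filepath": self._filepath,
            "schema": self._schema,
            "streaming": True,
        }
```

The subclass gets the schema handling for free, but it also inherits everything else from the parent, including the Spark-specific path handling mentioned above, which may be why the question was left open.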

deepyaman · 2023-05-29T11:40:53Z

kedro-datasets/kedro_datasets/spark/spark_streaming_dataset.py

+        if self._schema is not None:
+            if isinstance(self._schema, dict):
+                self._schema = SparkDataSet._load_schema_from_file(self._schema)


Fine with requiring schema; schema concept is more critical in streaming anyway.

kedro-datasets/setup.py

kedro-datasets/tests/spark/test_spark_streaming_dataset.py

kedro-datasets/kedro_datasets/spark/spark_streaming_dataset.py

* fixkedro- docker e2e test Signed-off-by: Nok Chan <[email protected]> * fix: add timeout to request to satisfy bandit lint --------- Signed-off-by: Nok Chan <[email protected]> Co-authored-by: Deepyaman Datta <[email protected]> Signed-off-by: Tom Kurian <[email protected]>

* ci: install the plugin alongside test requirements * ci: install the plugin alongside test requirements * Update kedro-airflow.yml * Update kedro-datasets.yml * Update kedro-docker.yml * Update kedro-telemetry.yml * Update kedro-airflow.yml * Update kedro-datasets.yml * Update kedro-airflow.yml * Update kedro-docker.yml * Update kedro-telemetry.yml * ci(telemetry): update isort config to correct sort * Don't use profile ¯\_(ツ)_/¯ Signed-off-by: Deepyaman Datta <[email protected]> * chore(datasets): remove empty `tool.black` section * chore(docker): remove empty `tool.black` section --------- Signed-off-by: Deepyaman Datta <[email protected]> Signed-off-by: Tom Kurian <[email protected]>

…ro-org#203) * Create check-release.yml * change from test pypi to pypi * split into jobs and move version logic into script * update github actions output * lint * changes based on review * changes based on review * fix script to not append continuously * change pypi api token logic Signed-off-by: Tom Kurian <[email protected]>

* ci: don't run checks on both `push`/`pull_request` * ci: don't run checks on both `push`/`pull_request` * ci: don't run checks on both `push`/`pull_request` * ci: don't run checks on both `push`/`pull_request` Signed-off-by: Tom Kurian <[email protected]>

…tests checked. (kedro-org#215) * Create merge-gatekeeper.yml * Update .github/workflows/merge-gatekeeper.yml --------- Co-authored-by: Sajid Alam <[email protected]> Signed-off-by: Tom Kurian <[email protected]>

* remove circleci setup files and utils * remove circleci configs in kedro-telemetry * remove redundant .github in kedro-telemetry * Delete continue_config.yml * Update check-release.yml * lint * increase timeout to 40 mins for docker e2e tests Signed-off-by: Tom Kurian <[email protected]>

* [FEAT] add save method to APIDataset Signed-off-by: jmcdonnell <[email protected]> * [ENH] create save_args parameter for api_dataset Signed-off-by: jmcdonnell <[email protected]> * [ENH] add tests for socket + http errors Signed-off-by: <[email protected]> Signed-off-by: jmcdonnell <[email protected]> * [ENH] check save data is json Signed-off-by: <[email protected]> Signed-off-by: jmcdonnell <[email protected]> * [FIX] clean code Signed-off-by: jmcdonnell <[email protected]> * [ENH] handle different data types Signed-off-by: jmcdonnell <[email protected]> * [FIX] test coverage for exceptions Signed-off-by: jmcdonnell <[email protected]> * [ENH] add examples in APIDataSet docstring Signed-off-by: jmcdonnell <[email protected]> * sync APIDataSet from kedro's `develop` (kedro-org#184) * Update APIDataSet Signed-off-by: Nok Chan <[email protected]> * Sync ParquetDataSet Signed-off-by: Nok Chan <[email protected]> * Sync Test Signed-off-by: Nok Chan <[email protected]> * Linting Signed-off-by: Nok Chan <[email protected]> * Revert Unnecessary ParquetDataSet Changes Signed-off-by: Nok Chan <[email protected]> * Sync release notes Signed-off-by: Nok Chan <[email protected]> --------- Signed-off-by: Nok Chan <[email protected]> Signed-off-by: jmcdonnell <[email protected]> * [FIX] remove support for delete method Signed-off-by: jmcdonnell <[email protected]> * [FIX] lint files Signed-off-by: jmcdonnell <[email protected]> * [FIX] fix conflicts Signed-off-by: jmcdonnell <[email protected]> * [FIX] remove fail save test Signed-off-by: jmcdonnell <[email protected]> * [ENH] review suggestions Signed-off-by: jmcdonnell <[email protected]> * [ENH] fix tests Signed-off-by: jmcdonnell <[email protected]> * [FIX] reorder arguments Signed-off-by: jmcdonnell <[email protected]> --------- Signed-off-by: jmcdonnell <[email protected]> Signed-off-by: <[email protected]> Signed-off-by: Nok Chan <[email protected]> Co-authored-by: jmcdonnell <[email protected]> Co-authored-by: Nok 
Lam Chan <[email protected]> Signed-off-by: Tom Kurian <[email protected]>

…g#212) * ci: Automatically extract release notes Signed-off-by: Ankita Katiyar <[email protected]> * fix lint Signed-off-by: Ankita Katiyar <[email protected]> * Raise exceptions Signed-off-by: Ankita Katiyar <[email protected]> * Lint Signed-off-by: Ankita Katiyar <[email protected]> * Lint Signed-off-by: Ankita Katiyar <[email protected]> --------- Signed-off-by: Ankita Katiyar <[email protected]> Signed-off-by: Tom Kurian <[email protected]>

…icks (kedro-org#206) * committing first version of UnityTableCatalog with unit tests. This datasets allows users to interface with Unity catalog tables in Databricks to both read and write. Signed-off-by: Danny Farah <[email protected]> Signed-off-by: Jannic Holzer <[email protected]> * renaming dataset Signed-off-by: Danny Farah <[email protected]> Signed-off-by: Jannic Holzer <[email protected]> * adding mlflow connectors Signed-off-by: Danny Farah <[email protected]> Signed-off-by: Jannic Holzer <[email protected]> * fixing mlflow imports Signed-off-by: Danny Farah <[email protected]> Signed-off-by: Jannic Holzer <[email protected]> * cleaned up mlflow for initial release Signed-off-by: Danny Farah <[email protected]> Signed-off-by: Jannic Holzer <[email protected]> * cleaned up mlflow references from setup.py for initial release Signed-off-by: Danny Farah <[email protected]> Signed-off-by: Jannic Holzer <[email protected]> * fixed deps in setup.py Signed-off-by: Danny Farah <[email protected]> Signed-off-by: Jannic Holzer <[email protected]> * adding comments before intiial PR Signed-off-by: Danny Farah <[email protected]> Signed-off-by: Jannic Holzer <[email protected]> * moved validation to dataclass Signed-off-by: Danny Farah <[email protected]> Signed-off-by: Jannic Holzer <[email protected]> * bug fix in type of partition column and cleanup Signed-off-by: Danny Farah <[email protected]> Signed-off-by: Jannic Holzer <[email protected]> * updated docstring for ManagedTableDataSet Signed-off-by: Danny Farah <[email protected]> Signed-off-by: Jannic Holzer <[email protected]> * added backticks to catalog Signed-off-by: Danny Farah <[email protected]> Signed-off-by: Jannic Holzer <[email protected]> * fixing regex to allow hyphens Signed-off-by: Danny Farah <[email protected]> Signed-off-by: Jannic Holzer <[email protected]> * Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py Co-authored-by: Jannic <[email protected]> Signed-off-by: 
Jannic Holzer <[email protected]> * Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py Co-authored-by: Jannic <[email protected]> Signed-off-by: Danny Farah <[email protected]> Signed-off-by: Jannic Holzer <[email protected]> * Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py Co-authored-by: Jannic <[email protected]> Signed-off-by: Danny Farah <[email protected]> Signed-off-by: Jannic Holzer <[email protected]> * Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py Co-authored-by: Jannic <[email protected]> Signed-off-by: Danny Farah <[email protected]> Signed-off-by: Jannic Holzer <[email protected]> * Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py Co-authored-by: Jannic <[email protected]> Signed-off-by: Danny Farah <[email protected]> Signed-off-by: Jannic Holzer <[email protected]> * Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py Co-authored-by: Jannic <[email protected]> Signed-off-by: Danny Farah <[email protected]> Signed-off-by: Jannic Holzer <[email protected]> * Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py Co-authored-by: Jannic <[email protected]> Signed-off-by: Danny Farah <[email protected]> Signed-off-by: Jannic Holzer <[email protected]> * Update kedro-datasets/test_requirements.txt Co-authored-by: Jannic <[email protected]> Signed-off-by: Danny Farah <[email protected]> Signed-off-by: Jannic Holzer <[email protected]> * Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py Co-authored-by: Jannic <[email protected]> Signed-off-by: Danny Farah <[email protected]> Signed-off-by: Jannic Holzer <[email protected]> * Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py Co-authored-by: Jannic <[email protected]> Signed-off-by: Danny Farah <[email protected]> Signed-off-by: Jannic Holzer <[email protected]> * Update 
kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py Co-authored-by: Jannic <[email protected]> Signed-off-by: Danny Farah <[email protected]> Signed-off-by: Jannic Holzer <[email protected]> * Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py Co-authored-by: Jannic <[email protected]> Signed-off-by: Danny Farah <[email protected]> Signed-off-by: Jannic Holzer <[email protected]> * adding backticks to catalog Signed-off-by: Danny Farah <[email protected]> Signed-off-by: Jannic Holzer <[email protected]> * Require pandas < 2.0 for compatibility with spark < 3.4 Signed-off-by: Jannic Holzer <[email protected]> * Replace use of walrus operator Signed-off-by: Jannic Holzer <[email protected]> * Add test coverage for validation methods Signed-off-by: Jannic Holzer <[email protected]> * Remove unused versioning functions Signed-off-by: Jannic Holzer <[email protected]> * Fix exception catching for invalid schema, add test for invalid schema Signed-off-by: Jannic Holzer <[email protected]> * Add pylint ignore Signed-off-by: Jannic Holzer <[email protected]> * Add tests/databricks to ignore for no-spark tests Signed-off-by: Jannic Holzer <[email protected]> * Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py Co-authored-by: Nok Lam Chan <[email protected]> * Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py Co-authored-by: Nok Lam Chan <[email protected]> * Remove spurious mlflow test dependency Signed-off-by: Jannic Holzer <[email protected]> * Add explicit check for database existence Signed-off-by: Jannic Holzer <[email protected]> * Remove character limit for table names Signed-off-by: Jannic Holzer <[email protected]> * Refactor validation steps in ManagedTable Signed-off-by: Jannic Holzer <[email protected]> * Remove spurious checks for table and schema name existence Signed-off-by: Jannic Holzer <[email protected]> --------- Signed-off-by: Danny Farah <[email protected]> Signed-off-by: 
Jannic Holzer <[email protected]> Co-authored-by: Danny Farah <[email protected]> Co-authored-by: Danny Farah <[email protected]> Co-authored-by: Nok Lam Chan <[email protected]> Signed-off-by: Tom Kurian <[email protected]>

* Update APIDataset docs and refactor * Acknowledge community contributor * Fix more broken doc Signed-off-by: Nok Chan <[email protected]> * Lint Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> * Fix release notes of upcoming kedro-datasets --------- Signed-off-by: Nok Chan <[email protected]> Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> Co-authored-by: Juan Luis Cano Rodríguez <[email protected]> Co-authored-by: Jannic <[email protected]> Signed-off-by: Tom Kurian <[email protected]>

* Modify release version and RELEASE.md Signed-off-by: Jannic Holzer <[email protected]> * Add proper name for ManagedTableDataSet Signed-off-by: Jannic Holzer <[email protected]> * Update kedro-datasets/RELEASE.md Co-authored-by: Juan Luis Cano Rodríguez <[email protected]> * Revert lost semicolon for release 1.2.0 Signed-off-by: Jannic Holzer <[email protected]> --------- Signed-off-by: Jannic Holzer <[email protected]> Co-authored-by: Juan Luis Cano Rodríguez <[email protected]> Signed-off-by: Tom Kurian <[email protected]>

* Fix APIDataSet docstring Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> * Add release notes Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> * Separate [docs] extras from [all] in kedro-datasets Fix kedro-orggh-143. Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> --------- Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> Signed-off-by: Tom Kurian <[email protected]>

noklam

This is huge 🔥 , thank you both @tingtingQB and @kuriantom369. I'm gonna release it in 1.3.1 if nothing fails in CI.

idanov · 2023-05-31T08:06:12Z

kedro-datasets/kedro_datasets/spark/README.md

@@ -0,0 +1,44 @@
+# Spark Streaming


Really like how concise this file is!

idanov · 2023-05-31T08:09:56Z

kedro-datasets/kedro_datasets/spark/spark_streaming_dataset.py

+    `YAML API <https://kedro.readthedocs.io/en/stable/data/\
+    data_catalog.html#use-the-data-catalog-with-the-yaml-api>`_:
+    .. code-block:: yaml
+        raw.new_inventory:


Suggested change

raw.new_inventory:

raw.new_inventory:

idanov · 2023-05-31T08:10:55Z

kedro-datasets/kedro_datasets/spark/spark_streaming_dataset.py

+    data_catalog.html#use-the-data-catalog-with-the-yaml-api>`_:
+    .. code-block:: yaml
+        raw.new_inventory:
+        type: streaming.extras.datasets.spark_streaming_dataset.SparkStreamingDataSet


Suggested change

- type: streaming.extras.datasets.spark_streaming_dataset.SparkStreamingDataSet
+ type: spark.SparkStreamingDataSet
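The short form works only if the catalog expands a `type` string by trying a set of package prefixes. The sketch below illustrates that idea under stated assumptions: the prefix list is assumed for illustration and not taken from Kedro's source, and `candidate_class_paths` is a hypothetical helper.

```python
from typing import List, Sequence

# Assumed prefixes, for illustration only; the empty prefix means
# "treat the type as a fully qualified import path".
ASSUMED_PREFIXES: Sequence[str] = (
    "kedro.io.",
    "kedro_datasets.",
    "kedro.extras.datasets.",
    "",
)


def candidate_class_paths(
    type_name: str, prefixes: Sequence[str] = ASSUMED_PREFIXES
) -> List[str]:
    """Return the import paths a resolver would try, in order."""
    return [f"{prefix}{type_name}" for prefix in prefixes]


short = candidate_class_paths("spark.SparkStreamingDataSet")
long = candidate_class_paths(
    "streaming.extras.datasets.spark_streaming_dataset.SparkStreamingDataSet"
)
```

Under these assumptions the short name yields `kedro_datasets.spark.SparkStreamingDataSet` among its candidates, while the long name only makes sense as a project-local path that users of the plugin would not have.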

astrojuanlu · 2023-05-31T13:29:15Z

Also, the docstring has some problems:

sphinx.errors.SphinxWarning: /home/docs/checkouts/readthedocs.org/user_builds/kedro/envs/2485/lib/python3.8/site-packages/kedro_datasets/spark/spark_streaming_dataset.py:docstring of kedro_datasets.spark.spark_streaming_dataset.SparkStreamingDataSet:5:Unexpected indentation.

https://readthedocs.org/projects/kedro/builds/20874133/

Will open a PR to fix this along with @idanov's comments.
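A likely cause of the "Unexpected indentation" warning is that the `.. code-block:: yaml` directive in the docstring directly follows a paragraph line, with no blank line before the directive or before its indented body. Below is a minimal sketch of a layout that reStructuredText accepts. The class is hypothetical, and the `filepath` and `file_format` keys are assumed for illustration only.

```python
import inspect
from typing import Tuple


class ExampleStreamingDataSet:
    """Hypothetical dataset used only to show docstring layout.

    Example usage for the YAML API:

    .. code-block:: yaml

        raw.new_inventory:
          type: spark.SparkStreamingDataSet
          filepath: data/01_raw/stream/inventory/
          file_format: json
    """


def lines_after_directive(
    doc: str, directive: str = ".. code-block:: yaml"
) -> Tuple[str, str]:
    """Return the two lines following the directive in a cleaned docstring."""
    lines = inspect.cleandoc(doc).splitlines()
    index = lines.index(directive)
    return lines[index + 1], lines[index + 2]


blank, first_body_line = lines_after_directive(ExampleStreamingDataSet.__doc__)
```

The directive is preceded and followed by a blank line, and the YAML body is indented relative to the directive, with `type` nested under the dataset name.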

astrojuanlu and others added 30 commits May 1, 2023 19:13

Fix links on GitHub issue templates (kedro-org#150)

46bb394

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> Signed-off-by: Tingting_Wan <[email protected]>

add spark_stream_dataset.py

c9421ae

Signed-off-by: Tingting_Wan <[email protected]>

restructure the strean dataset to align with the other spark dataset

4b387ff

Signed-off-by: Tingting_Wan <[email protected]>

adding README.md for specification

39ad9fd

Signed-off-by: Tingting_Wan <[email protected]>

Update kedro-datasets/kedro_datasets/spark/spark_stream_dataset.py

69eb8be

Co-authored-by: Nok Lam Chan <[email protected]> Signed-off-by: Tingting_Wan <[email protected]>

rename the dataset

3106068

Signed-off-by: Tingting_Wan <[email protected]>

resolve comments

b8141a7

Signed-off-by: Tingting_Wan <[email protected]>

fix format and pylint

738625e

Signed-off-by: Tingting_Wan <[email protected]>

Update kedro-datasets/kedro_datasets/spark/README.md

a54cc67

Co-authored-by: Deepyaman Datta <[email protected]> Signed-off-by: Tingting_Wan <[email protected]>

add unit tests and SparkStreamingDataset in init.py

b924ad6

Signed-off-by: Tingting_Wan <[email protected]>

add unit tests

743b823

Signed-off-by: Tingting_Wan <[email protected]>

update test_save

3bb3717

Signed-off-by: Tingting_Wan <[email protected]>

Upgrade Polars (kedro-org#171)

ae3bc87

* Upgrade Polars Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> * Update Polars to 0.17.x --------- Signed-off-by: Juan Luis Cano Rodríguez <[email protected]> Signed-off-by: Tingting_Wan <[email protected]>

if release is failed, it return exit code and fail the CI (kedro-org#158

eb634a1

) Signed-off-by: Tingting_Wan <[email protected]>

Introdcuing .gitpod.yml to kedro-plugins (kedro-org#185)

7f4527d

Currently opening gitpod will installed a Python 3.11 which breaks everything because we don't support it set. This PR introduce a simple .gitpod.yml to get it started. Signed-off-by: Tingting_Wan <[email protected]>

formatting

11c3888

Signed-off-by: Tingting_Wan <[email protected]>

formatting

634d884

Signed-off-by: Tingting_Wan <[email protected]>

formatting

9e8f55c

Signed-off-by: Tingting_Wan <[email protected]>

formatting

dbdf19c

Signed-off-by: Tingting_Wan <[email protected]>

Merge remote-tracking branch 'origin/add-stream-datasets' into add-st…

4e49fd9

…ream-datasets # Conflicts: # .github/workflows/check-plugin.yml # kedro-datasets/tests/api/test_api_dataset.py

add spark_stream_dataset.py

1a7a477

Signed-off-by: Tingting_Wan <[email protected]>

restructure the strean dataset to align with the other spark dataset

e877944

Signed-off-by: Tingting_Wan <[email protected]>

adding README.md for specification

09e9cf2

Signed-off-by: Tingting_Wan <[email protected]>

Update kedro-datasets/kedro_datasets/spark/spark_stream_dataset.py

2e30ec0

Co-authored-by: Nok Lam Chan <[email protected]> Signed-off-by: Tingting_Wan <[email protected]>

revert the changes on CI

b94f211

Signed-off-by: Nok Chan <[email protected]>

deepyaman approved these changes May 29, 2023

deepyaman changed the title ~~feat(datasets): Add stream datasets and unit tests~~ feat(datasets): add SparkStreamingDataSet May 30, 2023

noklam and others added 18 commits May 30, 2023 09:51

build: Relax Kedro bound for kedro-datasets (kedro-org#140)

3fdb71c

* Less strict pin on Kedro for datasets Signed-off-by: Merel Theisen <[email protected]> Signed-off-by: Tom Kurian <[email protected]>

chore: delete extra space ending check-release.yml (kedro-org#210)

148b464

Signed-off-by: Tom Kurian <[email protected]>

feat: Add metadata attribute to datasets (kedro-org#189)

870e623

* Add metadata attribute to all datasets Signed-off-by: Ahdra Merali <[email protected]> Signed-off-by: Tom Kurian <[email protected]>

Update kedro-datasets/tests/spark/test_spark_streaming_dataset.py

64446dc

Co-authored-by: Deepyaman Datta <[email protected]> Signed-off-by: Tom Kurian <[email protected]>

Update kedro-datasets/kedro_datasets/spark/spark_streaming_dataset.py

497001d

Co-authored-by: Deepyaman Datta <[email protected]> Signed-off-by: Tom Kurian <[email protected]>

Update kedro-datasets/setup.py

7f25f3c

Co-authored-by: Deepyaman Datta <[email protected]> Signed-off-by: Tom Kurian <[email protected]>

kuriantom369 force-pushed the add-stream-datasets branch from 4468c76 to 7f25f3c Compare May 30, 2023 08:52

Merge branch 'main' into add-stream-datasets

bd88b99

noklam mentioned this pull request May 30, 2023

build(datasets): release 1.4.0 #222

Merged

5 tasks

noklam self-requested a review May 30, 2023 16:22

noklam approved these changes May 30, 2023

fix linting issue

c094db1

Signed-off-by: Tom Kurian <[email protected]>

noklam merged commit 2acb007 into kedro-org:main May 31, 2023

idanov reviewed May 31, 2023
