
feat: Add metadata attribute to datasets #189

Merged
merged 22 commits into main from add-metadata-attribute
May 22, 2023
Conversation

AhdraMeraliQB
Contributor

@AhdraMeraliQB AhdraMeraliQB commented Apr 25, 2023

Description

Resolves kedro-org/kedro#2440

This is connected to kedro-org/kedro#2537, which adds the metadata attribute to the datasets within kedro.io (MemoryDataSet, LambdaDataSet, PartitionedDataSet)

Also addresses some changes made in #184

Development notes

The metadata attribute is defined solely within self.metadata. Some datasets make use of a _describe method to return a dictionary of the dataset's attributes. I have not included the metadata in these methods, as in some instances it would necessitate defining it twice, and I find the use redundant - however, I would like to hear the reviewers' opinions on this matter.

Metadata is accessible through dataset_name.metadata. Depending on the implementation this is at times inconsistent within the dataset, but it remains consistent across all datasets.

These changes have been tested manually, using both the Python API and through catalog.yml with hooks. The hook I used to access the datasets' metadata is as follows:

from kedro.framework.hooks import hook_impl
from typing import Any, Dict
from kedro.io import DataCatalog, MemoryDataSet

class MetadataHooks:
    @hook_impl
    def after_catalog_created(self,
        catalog: DataCatalog,
        conf_catalog: Dict[str, Any],
        conf_creds: Dict[str, Any],
        feed_dict: Dict[str, Any],
        save_version: str,
        load_versions: Dict[str, str],
    ):
        for k, v in catalog.datasets.__dict__.items():
            print(f"{k} metadata:\n{v.metadata}")

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

Ahdra Merali added 3 commits April 24, 2023 11:15
Signed-off-by: Ahdra Merali <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
Ahdra Merali added 2 commits April 25, 2023 15:10
Signed-off-by: Ahdra Merali <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
@AhdraMeraliQB AhdraMeraliQB marked this pull request as ready for review April 25, 2023 14:42
Signed-off-by: Ahdra Merali <[email protected]>
@noklam
Contributor

noklam commented Apr 25, 2023

Just thinking off the top of my head, do we need to change this in all the datasets? How was layer being used before without being an argument in DataSet? I think the answer is yes.

It would also be great to think about how metadata could be accessed via hooks, i.e. how will viz use this new field - the after_catalog_created hook, I guess?

@AhdraMeraliQB
Contributor Author

AhdraMeraliQB commented Apr 25, 2023

> How was layer being used before without being an argument in DataSet? I think the answer is yes.

layer is currently consumed by the DataCatalog, and the general consensus is that this is not ideal. Previously it was defined on every dataset, as in this PR.

> It would also be great to think about how metadata could be accessed via hooks, i.e. how will viz use this new field - the after_catalog_created hook, I guess?

metadata is accessible in the dummy hook implementation that I used to test the datasets (see the PR description); as for the specifics of how viz (and other plugins) would consume and use it, I am not sure.

@@ -76,6 +74,8 @@ def __init__(
credentials: Allows specifying secrets in credentials.yml.
Expected format is ``('login', 'password')`` if given as a tuple or list.
An ``AuthBase`` instance can be provided for more complex cases.
metadata: Any arbitrary user metadata.
Member


I think it would be good to mention here that Kedro doesn't do anything with this metadata, but that it can be consumed by plugins or directly by the user.

Contributor


> How was layer being used before without being an argument in DataSet? I think the answer is yes.
>
> layer is currently consumed by the DataCatalog, and the general consensus is that this is not ideal. Previously it was defined on every dataset, as in this PR.
>
> It would also be great to think about how metadata could be accessed via hooks, i.e. how will viz use this new field - the after_catalog_created hook, I guess?
>
> metadata is accessible in the dummy hook implementation that I used to test the datasets (see the PR description); as for the specifics of how viz (and other plugins) would consume and use it, I am not sure.

Hey @AhdraMeraliQB -- I think this is very important, because currently we get layer information when we access context.catalog -- would the metadata information also be available that way?

Contributor


We would need to change how it's consumed in kedro-viz so that instead of looking at catalog.layers we look at get_dataset(dataset_name).metadata["kedro-viz"]["layer"] (where get_dataset is defined in CatalogRepository).
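A minimal sketch of what that lookup could look like, using stand-in classes rather than the real kedro-viz CatalogRepository (DummyDataset, DummyCatalog and get_layer are illustrative assumptions, not actual viz code):

```python
from typing import Any, Dict, Optional

class DummyDataset:
    """Stand-in for a Kedro dataset carrying the new metadata attribute."""
    def __init__(self, metadata: Optional[Dict[str, Any]] = None):
        self.metadata = metadata

class DummyCatalog:
    """Stand-in for the catalog; _get_dataset mirrors the method discussed in this thread."""
    def __init__(self, datasets: Dict[str, DummyDataset]):
        self._datasets = datasets

    def _get_dataset(self, name: str) -> DummyDataset:
        return self._datasets[name]

def get_layer(catalog: DummyCatalog, dataset_name: str) -> Optional[str]:
    """Read the layer from metadata instead of the old catalog.layers."""
    metadata = catalog._get_dataset(dataset_name).metadata or {}
    return metadata.get("kedro-viz", {}).get("layer")

catalog = DummyCatalog({
    "companies": DummyDataset({"kedro-viz": {"layer": "raw"}}),
    "model": DummyDataset(None),  # no metadata declared at all
})

print(get_layer(catalog, "companies"))  # raw
print(get_layer(catalog, "model"))     # None
```

Datasets that declare no metadata simply yield None, so consumers degrade gracefully.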

class SnowparkTableDataSet(AbstractDataSet):
class SnowparkTableDataSet(
AbstractDataSet
): # pylint:disable=too-many-instance-attributes
Member


Maybe we should just ignore this on the repo level, it's not really a relevant check for the datasets.

@noklam
Contributor

noklam commented Apr 26, 2023

@AhdraMeraliQB @merelcht Not sure if we have a ticket already, but we should have docs explaining how this metadata should be consumed, i.e. via the hook.

The implementation in the description is hacky, as we are accessing the internal dict in a weird way; I think this is related to the discussion we had about datasets and frozen datasets.

class MetadataHooks:
    @hook_impl
    def after_catalog_created(self,
        catalog: DataCatalog,
        conf_catalog: Dict[str, Any],
        conf_creds: Dict[str, Any],
        feed_dict: Dict[str, Any],
        save_version: str,
        load_versions: Dict[str, str],
    ):
        for k, v in catalog.datasets.__dict__.items():
            print(f"{k} metadata:\n{v.metadata}")

@noklam
Contributor

noklam commented Apr 26, 2023

I've changed my mind. Originally I thought 3 was better, but then it would always load the full catalog - in the case of viz, you only want to load the data associated with the target pipeline. I just wrote down my reasoning here; I think 1 & 2 are equally bad, but we don't have a better option, so I am going with 1.

  1. catalog.datasets.__dict__ as @AhdraMeraliQB did, which uses a private variable
  2. catalog._data_sets - still uses an internal variable
  3. conf_catalog - this doesn't use any internal variable, but it may contain the _ entries?
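To make the trade-off concrete, here is a hedged sketch of the three routes on stand-in objects (DummyDataset, _FrozenDatasets and DummyCatalog are illustrative, not the real DataCatalog internals):

```python
from typing import Any, Dict, Optional

class DummyDataset:
    def __init__(self, metadata: Optional[Dict[str, Any]] = None):
        self.metadata = metadata

class _FrozenDatasets:
    """Mimics the attribute-access container behind catalog.datasets."""
    def __init__(self, datasets: Dict[str, DummyDataset]):
        self.__dict__.update(datasets)

class DummyCatalog:
    def __init__(self, datasets: Dict[str, DummyDataset]):
        self._data_sets = datasets                 # route 2: internal mapping
        self.datasets = _FrozenDatasets(datasets)  # route 1: frozen container

conf_catalog = {  # route 3: the raw config, no internals involved
    "cars": {"type": "pandas.CSVDataSet", "metadata": {"owner": "team-a"}},
}
catalog = DummyCatalog({"cars": DummyDataset({"owner": "team-a"})})

# 1. private __dict__ of the frozen datasets container
via_dunder = {k: v.metadata for k, v in catalog.datasets.__dict__.items()}
# 2. the catalog's internal dict, still a private variable
via_internal = {k: v.metadata for k, v in catalog._data_sets.items()}
# 3. the raw configuration: public, but only sees declared entries
via_conf = {k: v.get("metadata") for k, v in conf_catalog.items()}

assert via_dunder == via_internal == via_conf == {"cars": {"owner": "team-a"}}
```

All three agree for explicitly declared datasets; the differences are in who touches private state and whether pattern-matched datasets are visible.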

Ahdra Merali added 2 commits April 28, 2023 14:09
Signed-off-by: Ahdra Merali <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
@AhdraMeraliQB AhdraMeraliQB changed the title Add metadata attribute to datasets feat: Add metadata attribute to datasets Apr 28, 2023
Signed-off-by: Ahdra Merali <[email protected]>
Ahdra Merali added 2 commits May 8, 2023 14:08
@antonymilne
Contributor

@noklam you're right that going through catalog.list() would only show the datasets that are explicitly defined in catalog.yml and would not cover patterned ones. If you want all the datasets actually used in a pipeline, then you should use pipelines.data_sets instead. Either way, catalog._get_dataset(dataset_name) will work for both explicitly defined and pattern-matched datasets.

@antonymilne
Contributor

antonymilne commented May 15, 2023

As for the major change here: I think I may disagree with @merelcht here and wonder whether we should actually use metadata rather than _metadata. It looks inconsistent with the other attributes, but since metadata will by definition not be used anywhere in Kedro, we don't really need to mark it as protected. It's a dataset implementation attribute and not part of AbstractDataSet, so we don't currently have any sort of interface associated with exposing _metadata. Hence there's currently no way for consumers (e.g. plugins) to use metadata through a public interface, which is not very encouraging for people who want to actually use it.

In theory we could have a @property for metadata to expose the underlying _metadata attribute, but I don't see any particular need for that when we could just make metadata public in the first place.
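As a sketch of that option (simplified stand-ins, not the real AbstractDataSet), the property would expose the protected attribute through a public, and incidentally read-only, interface:

```python
from abc import ABC
from typing import Any, Dict, Optional

class AbstractDataSetSketch(ABC):
    """Simplified stand-in for Kedro's AbstractDataSet."""
    _metadata: Optional[Dict[str, Any]] = None

    @property
    def metadata(self) -> Optional[Dict[str, Any]]:
        # Public, read-only view over the protected attribute
        return self._metadata

class CSVDataSetSketch(AbstractDataSetSketch):
    def __init__(self, filepath: str, metadata: Optional[Dict[str, Any]] = None):
        self._filepath = filepath
        self._metadata = metadata

ds = CSVDataSetSketch("cars.csv", metadata={"owner": "team-a"})
print(ds.metadata)  # {'owner': 'team-a'}

try:
    ds.metadata = {"owner": "someone-else"}  # no setter, so this fails
except AttributeError:
    print("metadata is read-only after instantiation")
```

This is the read-only behaviour mentioned later in the thread: without a setter, assignment after instantiation raises AttributeError.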

Curious what @idanov thinks here...

Aside from that, this and the other PR LGTM.

> The metadata attribute is defined solely within self.metadata. Some datasets make use of a _describe method to return a dictionary of the dataset's attributes. I have not included the metadata in these methods, as in some instances it would necessitate defining it twice, and I find the use redundant - however, I would like to hear the reviewers' opinions on this matter.

Fine by me not to put it in _describe, but I don't understand what you mean by defining it twice here - can you give an example?

> Metadata is accessible through dataset_name.metadata. Depending on the implementation this is at times inconsistent within the dataset, but it remains consistent across all datasets.

Also curious what you mean by this. Where is the inconsistency?

@AhdraMeraliQB
Contributor Author

@antonymilne

> As for the major change here: I think I may disagree with @merelcht here and wonder whether we should actually use metadata rather than _metadata. It looks inconsistent with the other attributes, but since metadata will by definition not be used anywhere in Kedro, we don't really need to mark it as protected.

I did a little digging and it looks like the _preview function added by Viz is the same, in that it's not used by Kedro itself. For this reason, I'd argue to keep _metadata and maintain consistency.

However I also agree that marking it as protected implies that accessing it is somewhat hacky. In this case I'd actually vouch for introducing a metadata property that accesses self._metadata.

> Fine by me not to put it in _describe, but I don't understand what you mean by defining it twice here - can you give an example?

This really only came up in APIDataSet - the dataset has an instance variable _request_args, which stores most of the arguments provided to the class. It's this instance variable that is then accessed by _describe(), so to include the metadata in _describe() it would need to be stored in two separate places - self._metadata and self._request_args.

> Also curious what you mean by this. Where is the inconsistency?

Another example born from APIDataSet: it mostly doesn't make use of instance variables, instead storing everything within the one variable self._request_args. This implementation ignores that and always defines metadata as an instance variable for every dataset, regardless of any variance in the dataset implementations.

@antonymilne
Contributor

> I did a little digging and it looks like the _preview function added by Viz is the same, in that it's not used by Kedro itself. For this reason, I'd argue to keep _metadata and maintain consistency.
>
> However I also agree that marking it as protected implies that accessing it is somewhat hacky. In this case I'd actually vouch for introducing a metadata property that accesses self._metadata.

I don't think the comparison with _preview is quite accurate here for various reasons, but I do agree that it looks a little odd to have one public attribute when everything else is protected.

Having a metadata property is OK, but I think it should probably live in AbstractDataSet. It's maybe a bit weird then that the _metadata attribute is defined in implementations but not in the abstract dataset, but I guess it's consistent with how e.g. the public AbstractDataSet.exists function wraps the implementation's _exists function (a template method pattern I'm also not a fan of, but that's for another time).

So yeah, if you and others think the property is a good idea then it's fine by me 👍 It does at least make metadata read-only after dataset instantiation, which is probably a good thing, and also provides a public interface for users.

Thanks for explaining about APIDataSet. I think what you've done here is good.

@AhdraMeraliQB
Contributor Author

AhdraMeraliQB commented May 17, 2023

I agree that the metadata property should live in the AbstractDataSet - I'll make those changes in #2537. I suppose I'll have to open a separate PR to pass them through to the datasets, which can then be merged in after the changes to Kedro are released.

@merelcht do you have any thoughts on this?

@AhdraMeraliQB AhdraMeraliQB marked this pull request as draft May 17, 2023 14:54
@merelcht
Member

> I agree that the metadata property should live in the AbstractDataSet - I'll make those changes in #2537. I suppose I'll have to open a separate PR to pass them through to the datasets, which can then be merged in after the changes to Kedro are released.
>
> @merelcht do you have any thoughts on this?

I think the comments and observations @antonymilne made are very valid. I hadn't properly thought it through, so I'd prefer to just make metadata public and not add a property. If we add the property, it would be a breaking change to remove it again in the future, and we don't know yet how this metadata feature is going to be used, so I'd prefer not to tie ourselves to this additional property.

@AhdraMeraliQB
Contributor Author

> If we add the property, it would be a breaking change to remove it again in the future, and we don't know yet how this metadata feature is going to be used, so I'd prefer not to tie ourselves to this additional property.

Wouldn't it be breaking either way?

@AhdraMeraliQB AhdraMeraliQB marked this pull request as ready for review May 18, 2023 22:11
@noklam noklam self-requested a review May 22, 2023 09:31
Contributor

@noklam noklam left a comment


The discussion has been mainly about how we should implement this new attribute. There are 3 options discussed in this PR, and we are going with 2:

  1. self._metadata
  2. self.metadata as an instance attribute
  3. self.metadata (as a property)

If we want to change 2/3 in the future, it will be a breaking change since they are both public members. I do think this should go into the abstract class; in future, if a new dataset PR comes in and doesn't have the metadata field, we will reject it. For this reason we should define it explicitly in the contract (the abstract class).

  1. seems to address the point about the contract better, but it actually doesn't enforce what we want. What we really want is to enforce that metadata is part of the signature of the class constructor. I think the right way to do this is to enforce it in the abstract class constructor, i.e. super().__init__(xxx, xxx)
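A hedged sketch of that last point - enforcing metadata through the abstract base constructor (class names here are illustrative, not the actual Kedro implementation):

```python
from abc import ABC
from typing import Any, Dict, Optional

class AbstractDataSetSketch(ABC):
    """Stand-in base class: the constructor owns the metadata contract."""
    def __init__(self, metadata: Optional[Dict[str, Any]] = None):
        self.metadata = metadata

class MyDataSetSketch(AbstractDataSetSketch):
    def __init__(self, filepath: str, metadata: Optional[Dict[str, Any]] = None):
        # Every subclass routes metadata through the base constructor,
        # so the attribute exists uniformly on all datasets.
        super().__init__(metadata=metadata)
        self._filepath = filepath

ds = MyDataSetSketch("cars.csv", metadata={"kedro-viz": {"layer": "raw"}})
print(ds.metadata)  # {'kedro-viz': {'layer': 'raw'}}
```

A dataset PR that forgets to call super().__init__ would then surface immediately in review or tests, rather than silently lacking the attribute.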

Follow up actions:

  • We should add example hooks for how to consume this new metadata. Viz will likely be the first consumer.
  • Remove the layer attribute in 0.19.0 in favour of the metadata attribute

@antonymilne
Contributor

@noklam agree with everything you say here, but I believe that adding metadata to the AbstractDataSet constructor was difficult for some reason. I don't know the details (I think @merelcht does), but it maybe comes down to the things discussed here and the related issue about where AbstractDataSet lives: kedro-org/kedro#1076 (comment).

> If we want to change 2/3 in the future, it will be a breaking change since they are both public members.

Agree, although a breaking change to a dataset is less awkward to deal with than one to framework.

Contributor

@antonymilne antonymilne left a comment


🌟

@merelcht merelcht enabled auto-merge (squash) May 22, 2023 14:33
@merelcht merelcht merged commit de8b833 into main May 22, 2023
@merelcht merelcht deleted the add-metadata-attribute branch May 22, 2023 14:43
kuriantom369 pushed a commit to tingtingQB/kedro-plugins that referenced this pull request May 30, 2023
* Add metadata attribute to all datasets

Signed-off-by: Ahdra Merali <[email protected]>
Signed-off-by: Tom Kurian <[email protected]>
noklam added a commit that referenced this pull request May 31, 2023
* Fix links on GitHub issue templates (#150)

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>
Signed-off-by: Tingting_Wan <[email protected]>

* add spark_stream_dataset.py

Signed-off-by: Tingting_Wan <[email protected]>

* Migrate most of `kedro-datasets` metadata to `pyproject.toml` (#161)

* Include missing requirements files in sdist

Fix gh-86.

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Migrate most project metadata to `pyproject.toml`

See kedro-org/kedro#2334.

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Move requirements to `pyproject.toml`

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

---------

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>
Signed-off-by: Tingting_Wan <[email protected]>

* restructure the stream dataset to align with the other spark datasets

Signed-off-by: Tingting_Wan <[email protected]>

* adding README.md for specification

Signed-off-by: Tingting_Wan <[email protected]>

* Update kedro-datasets/kedro_datasets/spark/spark_stream_dataset.py

Co-authored-by: Nok Lam Chan <[email protected]>
Signed-off-by: Tingting_Wan <[email protected]>

* rename the dataset

Signed-off-by: Tingting_Wan <[email protected]>

* resolve comments

Signed-off-by: Tingting_Wan <[email protected]>

* fix format and pylint

Signed-off-by: Tingting_Wan <[email protected]>

* Update kedro-datasets/kedro_datasets/spark/README.md

Co-authored-by: Deepyaman Datta <[email protected]>
Signed-off-by: Tingting_Wan <[email protected]>

* add unit tests and SparkStreamingDataset in init.py

Signed-off-by: Tingting_Wan <[email protected]>

* add unit tests

Signed-off-by: Tingting_Wan <[email protected]>

* update test_save

Signed-off-by: Tingting_Wan <[email protected]>

* Upgrade Polars (#171)

* Upgrade Polars

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Update Polars to 0.17.x

---------

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>
Signed-off-by: Tingting_Wan <[email protected]>

* if release is failed, it return exit code and fail the CI (#158)

Signed-off-by: Tingting_Wan <[email protected]>

* Migrate `kedro-airflow` to static metadata (#172)

* Migrate kedro-airflow to static metadata

See kedro-org/kedro#2334.

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Add explicit PEP 518 build requirements for kedro-datasets

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Typos

Co-authored-by: Merel Theisen <[email protected]>

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Remove dangling reference to requirements.txt

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Add release notes

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

---------

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>
Signed-off-by: Tingting_Wan <[email protected]>

* Migrate `kedro-telemetry` to static metadata (#174)

* Migrate kedro-telemetry to static metadata

See kedro-org/kedro#2334.

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Add release notes

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

---------

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>
Signed-off-by: Tingting_Wan <[email protected]>

* ci: port lint, unit test, and e2e tests to Actions (#155)

* Add unit test + lint test on GA

* trigger GA - will revert

Signed-off-by: Ankita Katiyar <[email protected]>

* Fix lint

Signed-off-by: Ankita Katiyar <[email protected]>

* Add end to end tests

* Add cache key

Signed-off-by: Ankita Katiyar <[email protected]>

* Add cache action

Signed-off-by: Ankita Katiyar <[email protected]>

* Rename workflow files

Signed-off-by: Ankita Katiyar <[email protected]>

* Lint + add comment + default bash

Signed-off-by: Ankita Katiyar <[email protected]>

* Add windows test

Signed-off-by: Ankita Katiyar <[email protected]>

* Update workflow name + revert changes to READMEs

Signed-off-by: Ankita Katiyar <[email protected]>

* Add kedro-telemetry/RELEASE.md to trufflehog ignore

Signed-off-by: Ankita Katiyar <[email protected]>

* Add pytables to test_requirements remove from workflow

Signed-off-by: Ankita Katiyar <[email protected]>

* Revert "Add pytables to test_requirements remove from workflow"

This reverts commit 8203daa.

* Separate pip freeze step

Signed-off-by: Ankita Katiyar <[email protected]>

---------

Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Tingting_Wan <[email protected]>

* Migrate `kedro-docker` to static metadata (#173)

* Migrate kedro-docker to static metadata

See kedro-org/kedro#2334.

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Address packaging warning

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Fix tests

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Actually install current plugin with dependencies

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Add release notes

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

---------

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>
Signed-off-by: Tingting_Wan <[email protected]>

* Introducing .gitpod.yml to kedro-plugins (#185)

Currently opening Gitpod will install Python 3.11, which breaks everything because we don't support it yet. This PR introduces a simple .gitpod.yml to get started.

Signed-off-by: Tingting_Wan <[email protected]>

* sync APIDataSet  from kedro's `develop` (#184)

* Update APIDataSet

Signed-off-by: Nok Chan <[email protected]>

* Sync ParquetDataSet

Signed-off-by: Nok Chan <[email protected]>

* Sync Test

Signed-off-by: Nok Chan <[email protected]>

* Linting

Signed-off-by: Nok Chan <[email protected]>

* Revert Unnecessary ParquetDataSet Changes

Signed-off-by: Nok Chan <[email protected]>

* Sync release notes

Signed-off-by: Nok Chan <[email protected]>

---------

Signed-off-by: Nok Chan <[email protected]>
Signed-off-by: Tingting_Wan <[email protected]>

* formatting

Signed-off-by: Tingting_Wan <[email protected]>

* formatting

Signed-off-by: Tingting_Wan <[email protected]>

* formatting

Signed-off-by: Tingting_Wan <[email protected]>

* formatting

Signed-off-by: Tingting_Wan <[email protected]>

* add spark_stream_dataset.py

Signed-off-by: Tingting_Wan <[email protected]>

* restructure the stream dataset to align with the other spark datasets

Signed-off-by: Tingting_Wan <[email protected]>

* adding README.md for specification

Signed-off-by: Tingting_Wan <[email protected]>

* Update kedro-datasets/kedro_datasets/spark/spark_stream_dataset.py

Co-authored-by: Nok Lam Chan <[email protected]>
Signed-off-by: Tingting_Wan <[email protected]>

* rename the dataset

Signed-off-by: Tingting_Wan <[email protected]>

* resolve comments

Signed-off-by: Tingting_Wan <[email protected]>

* fix format and pylint

Signed-off-by: Tingting_Wan <[email protected]>

* Update kedro-datasets/kedro_datasets/spark/README.md

Co-authored-by: Deepyaman Datta <[email protected]>
Signed-off-by: Tingting_Wan <[email protected]>

* add unit tests and SparkStreamingDataset in init.py

Signed-off-by: Tingting_Wan <[email protected]>

* add unit tests

Signed-off-by: Tingting_Wan <[email protected]>

* update test_save

Signed-off-by: Tingting_Wan <[email protected]>

* formatting

Signed-off-by: Tingting_Wan <[email protected]>

* formatting

Signed-off-by: Tingting_Wan <[email protected]>

* formatting

Signed-off-by: Tingting_Wan <[email protected]>

* formatting

Signed-off-by: Tingting_Wan <[email protected]>

* lint

Signed-off-by: Tingting_Wan <[email protected]>

* lint

Signed-off-by: Tingting_Wan <[email protected]>

* lint

Signed-off-by: Tingting_Wan <[email protected]>

* update test cases

Signed-off-by: Tingting_Wan <[email protected]>

* add negative test

Signed-off-by: Tingting_Wan <[email protected]>

* remove code snippets for testing

Signed-off-by: Tingting_Wan <[email protected]>

* lint

Signed-off-by: Tingting_Wan <[email protected]>

* update tests

Signed-off-by: Tingting_Wan <[email protected]>

* update test and remove redundancy

Signed-off-by: Tingting_Wan <[email protected]>

* linting

Signed-off-by: Tingting_Wan <[email protected]>

* refactor file format

Signed-off-by: Tom Kurian <[email protected]>

* fix read me file

Signed-off-by: Tom Kurian <[email protected]>

* docs: Add community contributions (#199)

* Add community contributions

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Use newer link to docs

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

---------

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* adding test for raise error

Signed-off-by: Tingting_Wan <[email protected]>

* update test and remove redundancy

Signed-off-by: Tingting_Wan <[email protected]>
Signed-off-by: Tom Kurian <[email protected]>

* linting

Signed-off-by: Tingting_Wan <[email protected]>
Signed-off-by: Tom Kurian <[email protected]>

* refactor file format

Signed-off-by: Tom Kurian <[email protected]>

* fix read me file

Signed-off-by: Tom Kurian <[email protected]>

* adding test for raise error

Signed-off-by: Tingting_Wan <[email protected]>
Signed-off-by: Tom Kurian <[email protected]>

* fix readme file

Signed-off-by: Tom Kurian <[email protected]>

* fix readme

Signed-off-by: Tom Kurian <[email protected]>

* fix conflicts

Signed-off-by: Tom Kurian <[email protected]>

* fix ci errors

Signed-off-by: Tom Kurian <[email protected]>

* fix lint issue

Signed-off-by: Tom Kurian <[email protected]>

* update class documentation

Signed-off-by: Tom Kurian <[email protected]>

* add additional test cases

Signed-off-by: Tom Kurian <[email protected]>

* add s3 read test cases

Signed-off-by: Tom Kurian <[email protected]>

* add s3 read test cases

Signed-off-by: Tom Kurian <[email protected]>

* add s3 read test case

Signed-off-by: Tom Kurian <[email protected]>

* test s3 read

Signed-off-by: Tom Kurian <[email protected]>

* remove redundant test cases

Signed-off-by: Tom Kurian <[email protected]>

* fix streaming dataset configurations

Signed-off-by: Tom Kurian <[email protected]>

* update streaming datasets doc

Signed-off-by: Tingting_Wan <[email protected]>

* resolve comments re documentation

Signed-off-by: Tingting_Wan <[email protected]>

* bugfix lint

Signed-off-by: Tingting_Wan <[email protected]>

* update link

Signed-off-by: Tingting_Wan <[email protected]>

* revert the changes on CI

Signed-off-by: Nok Chan <[email protected]>

* test(docker): remove outdated logging-related step (#207)

* fix kedro-docker e2e test

Signed-off-by: Nok Chan <[email protected]>

* fix: add timeout to request to satisfy bandit lint

---------

Signed-off-by: Nok Chan <[email protected]>
Co-authored-by: Deepyaman Datta <[email protected]>
Signed-off-by: Tom Kurian <[email protected]>

* ci: ensure plugin requirements get installed in CI (#208)

* ci: install the plugin alongside test requirements

* ci: install the plugin alongside test requirements

* Update kedro-airflow.yml

* Update kedro-datasets.yml

* Update kedro-docker.yml

* Update kedro-telemetry.yml

* Update kedro-airflow.yml

* Update kedro-datasets.yml

* Update kedro-airflow.yml

* Update kedro-docker.yml

* Update kedro-telemetry.yml

* ci(telemetry): update isort config to correct sort

* Don't use profile ¯\_(ツ)_/¯

Signed-off-by: Deepyaman Datta <[email protected]>

* chore(datasets): remove empty `tool.black` section

* chore(docker): remove empty `tool.black` section

---------

Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Tom Kurian <[email protected]>

* ci: Migrate the release workflow from CircleCI to GitHub Actions (#203)

* Create check-release.yml

* change from test pypi to pypi

* split into jobs and move version logic into script

* update github actions output

* lint

* changes based on review

* changes based on review

* fix script to not append continuously

* change pypi api token logic

Signed-off-by: Tom Kurian <[email protected]>

* build: Relax Kedro bound for `kedro-datasets` (#140)

* Less strict pin on Kedro for datasets

Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: Tom Kurian <[email protected]>

* ci: don't run checks on both `push`/`pull_request` (#192)

* ci: don't run checks on both `push`/`pull_request`

* ci: don't run checks on both `push`/`pull_request`

* ci: don't run checks on both `push`/`pull_request`

* ci: don't run checks on both `push`/`pull_request`

Signed-off-by: Tom Kurian <[email protected]>

* chore: delete extra space ending check-release.yml (#210)

Signed-off-by: Tom Kurian <[email protected]>

* ci: Create merge-gatekeeper.yml to make sure PR only merged when all tests checked. (#215)

* Create merge-gatekeeper.yml

* Update .github/workflows/merge-gatekeeper.yml

---------

Co-authored-by: Sajid Alam <[email protected]>
Signed-off-by: Tom Kurian <[email protected]>

* ci: Remove the CircleCI setup (#209)

* remove circleci setup files and utils

* remove circleci configs in kedro-telemetry

* remove redundant .github in kedro-telemetry

* Delete continue_config.yml

* Update check-release.yml

* lint

* increase timeout to 40 mins for docker e2e tests

Signed-off-by: Tom Kurian <[email protected]>

* feat: Dataset API add `save` method (#180)

* [FEAT] add save method to APIDataset

Signed-off-by: jmcdonnell <[email protected]>

* [ENH] create save_args parameter for api_dataset

Signed-off-by: jmcdonnell <[email protected]>

* [ENH] add tests for socket + http errors

Signed-off-by: <[email protected]>
Signed-off-by: jmcdonnell <[email protected]>

* [ENH] check save data is json

Signed-off-by: <[email protected]>
Signed-off-by: jmcdonnell <[email protected]>

* [FIX] clean code

Signed-off-by: jmcdonnell <[email protected]>

* [ENH] handle different data types

Signed-off-by: jmcdonnell <[email protected]>

* [FIX] test coverage for exceptions

Signed-off-by: jmcdonnell <[email protected]>

* [ENH] add examples in APIDataSet docstring

Signed-off-by: jmcdonnell <[email protected]>

* sync APIDataSet  from kedro's `develop` (#184)

* Update APIDataSet

Signed-off-by: Nok Chan <[email protected]>

* Sync ParquetDataSet

Signed-off-by: Nok Chan <[email protected]>

* Sync Test

Signed-off-by: Nok Chan <[email protected]>

* Linting

Signed-off-by: Nok Chan <[email protected]>

* Revert Unnecessary ParquetDataSet Changes

Signed-off-by: Nok Chan <[email protected]>

* Sync release notes

Signed-off-by: Nok Chan <[email protected]>

---------

Signed-off-by: Nok Chan <[email protected]>
Signed-off-by: jmcdonnell <[email protected]>

* [FIX] remove support for delete method

Signed-off-by: jmcdonnell <[email protected]>

* [FIX] lint files

Signed-off-by: jmcdonnell <[email protected]>

* [FIX] fix conflicts

Signed-off-by: jmcdonnell <[email protected]>

* [FIX] remove fail save test

Signed-off-by: jmcdonnell <[email protected]>

* [ENH] review suggestions

Signed-off-by: jmcdonnell <[email protected]>

* [ENH] fix tests

Signed-off-by: jmcdonnell <[email protected]>

* [FIX] reorder arguments

Signed-off-by: jmcdonnell <[email protected]>

---------

Signed-off-by: jmcdonnell <[email protected]>
Signed-off-by: <[email protected]>
Signed-off-by: Nok Chan <[email protected]>
Co-authored-by: jmcdonnell <[email protected]>
Co-authored-by: Nok Lam Chan <[email protected]>
Signed-off-by: Tom Kurian <[email protected]>

* ci: Automatically extract release notes for GitHub Releases (#212)

* ci: Automatically extract release notes

Signed-off-by: Ankita Katiyar <[email protected]>

* fix lint

Signed-off-by: Ankita Katiyar <[email protected]>

* Raise exceptions

Signed-off-by: Ankita Katiyar <[email protected]>

* Lint

Signed-off-by: Ankita Katiyar <[email protected]>

* Lint

Signed-off-by: Ankita Katiyar <[email protected]>

---------

Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Tom Kurian <[email protected]>

* feat: Add metadata attribute to datasets (#189)

* Add metadata attribute to all datasets

Signed-off-by: Ahdra Merali <[email protected]>
Signed-off-by: Tom Kurian <[email protected]>

* feat: Add ManagedTableDataset for managed Delta Lake tables in Databricks (#206)

* committing first version of UnityTableCatalog with unit tests. This dataset allows users to interface with Unity Catalog tables in Databricks for both reading and writing.

Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>

* renaming dataset

Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>

* adding mlflow connectors

Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>

* fixing mlflow imports

Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>

* cleaned up mlflow for initial release

Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>

* cleaned up mlflow references from setup.py for initial release

Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>

* fixed deps in setup.py

Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>

* adding comments before initial PR

Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>

* moved validation to dataclass

Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>

* bug fix in type of partition column and cleanup

Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>

* updated docstring for ManagedTableDataSet

Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>

* added backticks to catalog

Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>

* fixing regex to allow hyphens

Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>

* Update kedro-datasets/test_requirements.txt

Co-authored-by: Jannic <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Jannic <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>

* adding backticks to catalog

Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>

* Require pandas < 2.0 for compatibility with spark < 3.4

Signed-off-by: Jannic Holzer <[email protected]>

* Replace use of walrus operator

Signed-off-by: Jannic Holzer <[email protected]>

* Add test coverage for validation methods

Signed-off-by: Jannic Holzer <[email protected]>

* Remove unused versioning functions

Signed-off-by: Jannic Holzer <[email protected]>

* Fix exception catching for invalid schema, add test for invalid schema

Signed-off-by: Jannic Holzer <[email protected]>

* Add pylint ignore

Signed-off-by: Jannic Holzer <[email protected]>

* Add tests/databricks to ignore for no-spark tests

Signed-off-by: Jannic Holzer <[email protected]>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Nok Lam Chan <[email protected]>

* Update kedro-datasets/kedro_datasets/databricks/managed_table_dataset.py

Co-authored-by: Nok Lam Chan <[email protected]>

* Remove spurious mlflow test dependency

Signed-off-by: Jannic Holzer <[email protected]>

* Add explicit check for database existence

Signed-off-by: Jannic Holzer <[email protected]>

* Remove character limit for table names

Signed-off-by: Jannic Holzer <[email protected]>

* Refactor validation steps in ManagedTable

Signed-off-by: Jannic Holzer <[email protected]>

* Remove spurious checks for table and schema name existence

Signed-off-by: Jannic Holzer <[email protected]>

---------

Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>
Co-authored-by: Danny Farah <[email protected]>
Co-authored-by: Danny Farah <[email protected]>
Co-authored-by: Nok Lam Chan <[email protected]>
Signed-off-by: Tom Kurian <[email protected]>

* docs: Update APIDataset docs and refactor (#217)

* Update APIDataset docs and refactor

* Acknowledge community contributor

* Fix more broken doc

Signed-off-by: Nok Chan <[email protected]>

* Lint

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Fix release notes of upcoming kedro-datasets

---------

Signed-off-by: Nok Chan <[email protected]>
Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>
Co-authored-by: Juan Luis Cano Rodríguez <[email protected]>
Co-authored-by: Jannic <[email protected]>
Signed-off-by: Tom Kurian <[email protected]>

* feat: Release `kedro-datasets` version `1.3.0` (#219)

* Modify release version and RELEASE.md

Signed-off-by: Jannic Holzer <[email protected]>

* Add proper name for ManagedTableDataSet

Signed-off-by: Jannic Holzer <[email protected]>

* Update kedro-datasets/RELEASE.md

Co-authored-by: Juan Luis Cano Rodríguez <[email protected]>

* Revert lost semicolon for release 1.2.0

Signed-off-by: Jannic Holzer <[email protected]>

---------

Signed-off-by: Jannic Holzer <[email protected]>
Co-authored-by: Juan Luis Cano Rodríguez <[email protected]>
Signed-off-by: Tom Kurian <[email protected]>

* docs: Fix APIDataSet docstring (#220)

* Fix APIDataSet docstring

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Add release notes

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

* Separate [docs] extras from [all] in kedro-datasets

Fix gh-143.

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>

---------

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>
Signed-off-by: Tom Kurian <[email protected]>

* Update kedro-datasets/tests/spark/test_spark_streaming_dataset.py

Co-authored-by: Deepyaman Datta <[email protected]>
Signed-off-by: Tom Kurian <[email protected]>

* Update kedro-datasets/kedro_datasets/spark/spark_streaming_dataset.py

Co-authored-by: Deepyaman Datta <[email protected]>
Signed-off-by: Tom Kurian <[email protected]>

* Update kedro-datasets/setup.py

Co-authored-by: Deepyaman Datta <[email protected]>
Signed-off-by: Tom Kurian <[email protected]>

* fix linting issue

Signed-off-by: Tom Kurian <[email protected]>

---------

Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>
Signed-off-by: Tingting_Wan <[email protected]>
Signed-off-by: Juan Luis Cano Rodríguez <[email protected]>
Signed-off-by: Ankita Katiyar <[email protected]>
Signed-off-by: Nok Chan <[email protected]>
Signed-off-by: Tom Kurian <[email protected]>
Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: Merel Theisen <[email protected]>
Signed-off-by: jmcdonnell <[email protected]>
Signed-off-by: <[email protected]>
Signed-off-by: Ahdra Merali <[email protected]>
Signed-off-by: Danny Farah <[email protected]>
Signed-off-by: Jannic Holzer <[email protected]>
Co-authored-by: Juan Luis Cano Rodríguez <[email protected]>
Co-authored-by: Tingting Wan <[email protected]>
Co-authored-by: Nok Lam Chan <[email protected]>
Co-authored-by: Deepyaman Datta <[email protected]>
Co-authored-by: Nok Lam Chan <[email protected]>
Co-authored-by: Ankita Katiyar <[email protected]>
Co-authored-by: Juan Luis Cano Rodríguez <[email protected]>
Co-authored-by: Tom Kurian <[email protected]>
Co-authored-by: Sajid Alam <[email protected]>
Co-authored-by: Merel Theisen <[email protected]>
Co-authored-by: McDonnellJoseph <[email protected]>
Co-authored-by: jmcdonnell <[email protected]>
Co-authored-by: Ahdra Merali <[email protected]>
Co-authored-by: Jannic <[email protected]>
Co-authored-by: Danny Farah <[email protected]>
Co-authored-by: Danny Farah <[email protected]>
Co-authored-by: kuriantom369 <[email protected]>
Successfully merging this pull request may close these issues.

Enable adding new attributes to datasets