
feat(datasets): add dataset to load/save with Ibis #560

Merged · 19 commits merged into main from feat/datasets/ibis-table-dataset on Apr 10, 2024

Conversation

@deepyaman (Member) commented Feb 21, 2024

Description

Officially add the Ibis TableDataset introduced in https://kedro.org/blog/building-scalable-data-pipelines-with-kedro-and-ibis.

Development notes

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

@deepyaman force-pushed the feat/datasets/ibis-table-dataset branch from fd523cb to a52246b on February 21, 2024 17:28
@astrojuanlu (Member)

🔥

@deepyaman force-pushed the feat/datasets/ibis-table-dataset branch 3 times, most recently from c76fd22 to 98db223, on March 4, 2024 05:46
@deepyaman marked this pull request as ready for review March 4, 2024 22:20
@deepyaman enabled auto-merge (squash) March 4, 2024 22:20
@deepyaman force-pushed the feat/datasets/ibis-table-dataset branch from 6f05651 to 0c0bc6b on March 4, 2024 22:26
@astrojuanlu (Member)

Before I start reviewing: I'm fine pushing this to kedro-datasets for now, but I want to make it clear that at some point I'd like to seriously discuss the idea of breaking up kedro-datasets into domain-specific subpackages (#535 (comment)), given that our monorepo approach is actually obstructing discoverability (#401).

@datajoely (Contributor)

I would go further, @astrojuanlu, and say this should be in a kedro-ibis package from day 1 - especially since (1) we already know today that it's going to end up there, and (2) the dependency conflicts with the rest of the datasets are going to be numerous.

@datajoely (Contributor)

But also, thank you for the push; this is great!

@@ -49,6 +49,28 @@ huggingface-hfdataset = ["datasets", "huggingface_hub"]
huggingface-hftransformerpipelinedataset = ["transformers"]
huggingface = ["kedro-datasets[huggingface-hfdataset,huggingface-hftransformerpipelinedataset]"]

ibis-bigquery = ["ibis-framework[bigquery]"]
Contributor:

I wonder if we could add a CI script to keep this bit in sync.
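(For context on the hunk above: each ibis-<backend> extra forwards to the matching ibis-framework extra, so a user would presumably install a backend with something like pip install "kedro-datasets[ibis-bigquery]"; keeping that list of extras aligned with the backends Ibis actually supports is what such a CI script would check.)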

@deepyaman (Member Author)

> Before I start reviewing: I'm fine pushing this to kedro-datasets for now, but I want to make it clear that at some point I'd like to seriously discuss the idea of breaking up kedro-datasets into domain-specific subpackages (#535 (comment)), given that our monorepo approach is actually obstructing discoverability (#401).

@astrojuanlu Intuitively, what would domain-specific subpackages look like? kedro-datasets-spark, kedro-datasets-tensorflow, kedro-datasets-ibis or something? (Maybe just kedro-spark, kedro-tensorflow, kedro-ibis.) I can see this making sense, but it's a much broader effort (with its own complications, like figuring out the extent to which datasets need to be compatible with each other), and I'd be supportive of doing it down the road.

> I would go further, @astrojuanlu, and say this should be in a kedro-ibis package from day 1 - especially since (1) we already know today that it's going to end up there, and (2) the dependency conflicts with the rest of the datasets are going to be numerous.

  1. Do we know it's going to end up in a separate package? At this point, ibis.TableDataset is the only concrete contribution.
  2. I don't believe so; ibis-framework itself doesn't have such tight dependencies, and then you choose the backend you'd like as an extra. Ibis is pretty good about dependencies—better than the majority of other libraries—in order to be able to resolve across every backend (except Flink 😞 but that's Flink's fault), so I don't see the introduction of this causing resolution issues.

For both @astrojuanlu and @datajoely (and anybody else on the TSC): If we do go down the route of kedro-ibis as a separate package, would it fall under kedro-org?

@datajoely (Contributor)

I think kedro-plugins -> kedro-ibis or kedro-datasets-ibis

@noklam (Contributor) commented Mar 19, 2024

I get some weird error when I try to run the test manually; CI seems to be fine with it. Maybe it's the Ibis example dataset fetching from GCS.

[screenshot of the error omitted]

Is there a way to make this test run without network access? It seems like it is fetching data from a bucket, but I'm not sure about that. After re-running, the CI also failed.


.. code-block:: pycon

>>> from kedro_datasets.ibis import TableDataset
Contributor:

Would it be possible to add a local example here with some dummy dataframe / DuckDB? I know there is a blog post, but just looking at the docs/docstring, I think it's not that easy to understand how a user should use this.
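A hedged sketch of what such a local example might look like, assuming the table_name / connection / save_args constructor arguments discussed elsewhere in this thread (the names, the file.db path, and the column values are illustrative):

>>> import ibis
>>> from kedro_datasets.ibis import TableDataset
>>> data = ibis.memtable({"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]})
>>> dataset = TableDataset(
...     table_name="test",
...     connection={"backend": "duckdb", "database": "file.db"},
...     save_args={"materialized": "table"},
... )
>>> dataset.save(data)
>>> reloaded = dataset.load()
>>> reloaded.execute()  # materialize the lazy Ibis table as a pandas DataFrame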

Member:

Ideally we'd have both a yaml and python example.
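For instance, a hedged sketch of the YAML side, mirroring the Python arguments above (the entry name, path, and backend options are illustrative, not taken from the merged docs):

cars:
  type: ibis.TableDataset
  filepath: data/01_raw/cars.csv
  file_format: csv
  table_name: cars
  connection:
    backend: duckdb
    database: company.db
  save_args:
    materialized: table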

@inigohidalgo:

Just FYI, when connecting with Ibis there are two major "paradigms": either a file-based connection or a DB connection. From my limited knowledge, I think the configuration arguments are quite different.

For reference, I am currently connecting to mssql dbs like this:

conn_table = TableDataset(
    connection={
        "host": "xxx.sql.azuresynapse.net",
        "database": "xxx",
        "query": {"driver": "ODBC Driver 17 for SQL Server"},
        "backend": "mssql",
    },
    table_name="xxx",
    load_args={"schema": "xxx"},
)

@inigohidalgo:

The documentation here should probably clarify what the different arguments map to in the Ibis connection object.

In the case of a filepath being provided, it will map to the backend method read_{file_format}, e.g. https://ibis-project.org/backends/duckdb#ibis.backends.duckdb.Backend.read_parquet

Otherwise, it will get the table from ibis.{backend}.Backend.table, where the Backend object is obtained through ibis.{backend}.connect, e.g. https://ibis-project.org/backends/mssql
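In other words, at the Ibis level the two paths correspond roughly to the following (a sketch; the DuckDB backend and the paths are illustrative):

import ibis

con = ibis.duckdb.connect("company.db")     # what the dataset builds from `connection`
t1 = con.read_parquet("data/cars.parquet")  # filepath case: read_{file_format}
t2 = con.table("cars")                      # table case: Backend.table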

@deepyaman (Member Author):

Sorry, I forgot to add an example, it seems. 😅

@deepyaman (Member Author):

> The documentation here should probably clarify what the different arguments map to in the Ibis connection object.

Added much more detail on this!

if self._filepath is not None:
    reader = getattr(self.connection, f"read_{self._file_format}")
    return reader(self._filepath, self._table_name, **self._load_args)
else:
    return self.connection.table(self._table_name)

@inigohidalgo:

Currently, when loading tables through the mssql backend, I am doing conn.table(table_name="XXX", schema="A").

An easy solution would be to add the load args into the conn.table call and specify the schema as a load_arg, but IMO this is suboptimal, as the load and save schema should usually (in my experience; my assertion could be wrong) be the same.

Comment:

@gforsyth not sure if you have any insight here wrt the upcoming changes in ibis around schema and db hierarchy

@gforsyth:

So starting in 9.0, schema will be deprecated as a kwarg. We will be using the word "database" to refer to a collection of tables, and the word "catalog" to refer to a collection of "databases".

For specifying only the database, you would do:

conn.table(table_name="XX", database="A")

if the database is located in a catalog (like, say dbo or sys) you would do either of:

conn.table(table_name="XX", database="dbo.A")

or

conn.table(table_name="XX", database=("dbo", "A"))

(schema will still work in 9.0, it will just raise a FutureWarning)

@property
def connection(self) -> BaseBackend:
    cls = type(self)
    key = tuple(sorted(self._connection_config.items()))
@inigohidalgo commented Mar 22, 2024:

I get an unhashable type: 'dict' error when using this connection method:

TableDataset(
    connection={
        "host": "xxx.sql.azuresynapse.net",
        "database": "xxx",
        "query": {"driver": "ODBC Driver 17 for SQL Server"},
        "backend": "mssql",
    },
    table_name="xxx",
    load_args={"schema": "xxx"},
)

Manually dropping the query kwarg before the tuple call works as a temporary workaround to get a working connection, but it's obviously not ideal.

@deepyaman (Member Author):

@inigohidalgo Good catch! I've done some digging, and opted to use json.dumps(self._connection_config, sort_keys=True) instead. I think this should handle the case you brought up.

@deepyaman (Member Author):

Ugh, that doesn't work for values that aren't JSON-serializable...

@deepyaman (Member Author):

OK, fixed and added some test cases!
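For readers following along, a minimal sketch of one way to build a hashable cache key from a nested connection config (illustrative only; not necessarily the exact fix that was merged):

def _connection_key(config: dict) -> tuple:
    """Recursively freeze a (possibly nested) config dict into a hashable tuple."""

    def _freeze(value):
        if isinstance(value, dict):
            return tuple(sorted((k, _freeze(v)) for k, v in value.items()))
        if isinstance(value, (list, set)):
            return tuple(_freeze(v) for v in value)
        return value

    return _freeze(config)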

@deepyaman force-pushed the feat/datasets/ibis-table-dataset branch from c085312 to bbec217 on April 8, 2024 16:34
@deepyaman (Member Author)

> I get some weird error when I try to run the test manually; CI seems to be fine with it. Maybe it's the Ibis example dataset fetching from GCS.
>
> Is there a way to make this test run without network access? It seems like it is fetching data from a bucket, but I'm not sure about that. After re-running, the CI also failed.

@noklam I've updated it to use our canonical [[1, 2], [4, 5], [5, 6]] example. Still using DuckDB, since it's probably worth highlighting DB interactions (including the default Ibis backend).

@merelcht (Member) left a comment:

I haven't tried to use the dataset yet, but the code looks all good to me! Thanks for adding the detailed examples and docstrings, they're great! ⭐

connection: Configuration for connecting to an Ibis backend.
load_args: Additional arguments passed to the Ibis backend's
    `read_{file_format}` method.
save_args: Additional arguments passed to the Ibis backend's
    `create_{materialized}` method.
Member:

Any docs we can link to that specify which values a user can supply to materialized?

@deepyaman (Member Author):

I'm not aware of anywhere it's documented centrally (i.e. not on a backend-specific basis) in Ibis. @lostmygithubaccount any chance you know?

@lostmygithubaccount:

We do not; adding API docs for this alongside the other read_* and to_* methods is on my TODO list.

@deepyaman (Member Author):

Sounds good! @merelcht if it's OK then, we'll improve the documentation on the Ibis side, and then whenever that happens, we can make the dataset docs reference that.
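In the meantime, Ibis backends expose create_table and create_view methods, and materialized presumably selects among these create_* calls (an assumption, not confirmed in this thread). A hedged sketch of the mapping, using DuckDB:

import ibis

con = ibis.duckdb.connect()  # in-memory DuckDB
t = ibis.memtable({"a": [1, 2]})

con.create_table("t_as_table", t, overwrite=True)  # save_args: {"materialized": "table"}
con.create_view("t_as_view", t, overwrite=True)    # save_args: {"materialized": "view"}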

@deepyaman merged commit cf283b2 into main Apr 10, 2024
14 checks passed
@deepyaman deleted the feat/datasets/ibis-table-dataset branch April 10, 2024 09:43
@astrojuanlu (Member)

So happy to see this merged 🔥 let's make some noise!

tgoelles pushed a commit to tgoelles/kedro-plugins that referenced this pull request Jun 6, 2024
* feat(datasets): add dataset to load/save with Ibis

Signed-off-by: Deepyaman Datta <[email protected]>

* build(datasets): fix typos in definition of extras

Signed-off-by: Deepyaman Datta <[email protected]>

* build(datasets): include Ibis backend requirements

Signed-off-by: Deepyaman Datta <[email protected]>

* test(datasets): implement save and reload for Ibis

Signed-off-by: Deepyaman Datta <[email protected]>

* test(datasets): check `ibis.TableDataset.exists()`

Signed-off-by: Deepyaman Datta <[email protected]>

* test(datasets): test extra load and save args work

Signed-off-by: Deepyaman Datta <[email protected]>

* test(datasets): verify config and materializations

Signed-off-by: Deepyaman Datta <[email protected]>

---------

Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: tgoelles <[email protected]>