
feat(datasets): add dataset to load/save with Ibis #560

Merged · 19 commits merged into main from feat/datasets/ibis-table-dataset on Apr 10, 2024

Conversation

@deepyaman (Member) commented Feb 21, 2024

Description

Officially add the Ibis TableDataset introduced in https://kedro.org/blog/building-scalable-data-pipelines-with-kedro-and-ibis.

Development notes

Checklist

  • Opened this PR as a 'Draft Pull Request' if it is work-in-progress
  • Updated the documentation to reflect the code changes
  • Added a description of this change in the relevant RELEASE.md file
  • Added tests to cover my changes

@deepyaman force-pushed the feat/datasets/ibis-table-dataset branch from fd523cb to a52246b on February 21, 2024 17:28
@astrojuanlu (Member)

🔥

@deepyaman force-pushed the feat/datasets/ibis-table-dataset branch 3 times, most recently from c76fd22 to 98db223, on March 4, 2024 05:46
@deepyaman marked this pull request as ready for review March 4, 2024 22:20
@deepyaman enabled auto-merge (squash) March 4, 2024 22:20
@deepyaman force-pushed the feat/datasets/ibis-table-dataset branch from 6f05651 to 0c0bc6b on March 4, 2024 22:26
@astrojuanlu (Member)

Before I start reviewing: I'm fine pushing this to kedro-datasets for now, but I want to make it clear that at some point I'd like to seriously discuss the idea of breaking up kedro-datasets into domain-specific subpackages (#535 (comment)), given that our monorepo approach is actually obstructing discoverability (#401).

@datajoely (Contributor)

I would go further, @astrojuanlu, and say this should be in a kedro-ibis package from day 1 - especially since (1) we already know today that it's going to end up there, and (2) the dependency conflicts with the rest of the datasets are going to be numerous.

@datajoely (Contributor)

But also, thank you for the push; this is great!

@@ -49,6 +49,28 @@ huggingface-hfdataset = ["datasets", "huggingface_hub"]
huggingface-hftransformerpipelinedataset = ["transformers"]
huggingface = ["kedro-datasets[huggingface-hfdataset,huggingface-hftransformerpipelinedataset]"]

ibis-bigquery = ["ibis-framework[bigquery]"]
Contributor:

I wonder if we could add a CI script to keep this bit in sync.
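(For context on the hunk above: each ibis-<backend> extra forwards to the matching ibis-framework extra, so a user would presumably install a backend with something like pip install "kedro-datasets[ibis-bigquery]"; keeping that list of extras aligned with the backends Ibis actually supports is what such a CI script would check.)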

@deepyaman (Member Author)

> Before I start reviewing: I'm fine pushing this to kedro-datasets for now, but I want to make it clear that at some point I'd like to seriously discuss the idea of breaking up kedro-datasets into domain-specific subpackages (#535 (comment)), given that our monorepo approach is actually obstructing discoverability (#401).

@astrojuanlu Intuitively, what would domain-specific subpackages look like? kedro-datasets-spark, kedro-datasets-tensorflow, kedro-datasets-ibis or something? (Maybe just kedro-spark, kedro-tensorflow, kedro-ibis.) I can see this making sense, but it's a much broader effort (with its own complications, like figuring out the extent to which datasets need to be compatible with each other), and I'd be supportive of doing it down the road.

> I would go further, @astrojuanlu, and say this should be in a kedro-ibis package from day 1 - especially since (1) we already know today that it's going to end up there, and (2) the dependency conflicts with the rest of the datasets are going to be numerous.

  1. Do we know it's going to end up in a separate package? At this point, ibis.TableDataset is the only concrete contribution.
  2. I don't believe so; ibis-framework itself doesn't have such tight dependencies, and then you choose the backend you'd like as an extra. Ibis is pretty good about dependencies—better than the majority of other libraries—in order to be able to resolve across every backend (except Flink 😞 but that's Flink's fault), so I don't see the introduction of this causing resolution issues.

For both @astrojuanlu and @datajoely (and anybody else on the TSC): If we do go down the route of kedro-ibis as a separate package, would it fall under kedro-org?

@datajoely (Contributor)

I think kedro-plugins -> kedro-ibis or kedro-datasets-ibis

@noklam (Contributor) commented Mar 19, 2024

I get some weird error when I try to run the test manually; CI seems to be fine with it. Maybe it's the Ibis example dataset fetching from GCS.

[screenshot of the error omitted]

Is there a way to make this test run without network access? It seems like it is fetching data from a bucket, but I'm not sure about that. After re-running, the CI also failed.


.. code-block:: pycon

>>> from kedro_datasets.ibis import TableDataset
Contributor:

Would it be possible to add a local example here with some dummy dataframe / DuckDB? I know there is a blog post, but just looking at the docs/docstring, I think it's not that easy to understand how a user should use this.
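A hedged sketch of what such a local example might look like, assuming the table_name / connection / save_args constructor arguments discussed elsewhere in this thread (the names, the file.db path, and the column values are illustrative):

>>> import ibis
>>> from kedro_datasets.ibis import TableDataset
>>> data = ibis.memtable({"col1": [1, 2], "col2": [4, 5], "col3": [5, 6]})
>>> dataset = TableDataset(
...     table_name="test",
...     connection={"backend": "duckdb", "database": "file.db"},
...     save_args={"materialized": "table"},
... )
>>> dataset.save(data)
>>> reloaded = dataset.load()
>>> reloaded.execute()  # materialize the lazy Ibis table as a pandas DataFrame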

Member:

Ideally we'd have both a yaml and python example.
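For instance, a hedged sketch of the YAML side, mirroring the Python arguments above (the entry name, path, and backend options are illustrative, not taken from the merged docs):

cars:
  type: ibis.TableDataset
  filepath: data/01_raw/cars.csv
  file_format: csv
  table_name: cars
  connection:
    backend: duckdb
    database: company.db
  save_args:
    materialized: table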

@inigohidalgo:

Just FYI, when connecting with Ibis there are two major "paradigms": either a file-based connection or a DB connection. From my limited knowledge, I think the configuration arguments are quite different.

For reference, I am currently connecting to mssql dbs like this:

conn_table = TableDataset(
    connection={
        "host": "xxx.sql.azuresynapse.net",
        "database": "xxx",
        "query": {"driver": "ODBC Driver 17 for SQL Server"},
        "backend": "mssql",
    },
    table_name="xxx",
    load_args={"schema": "xxx"},
)

@inigohidalgo:

The documentation here should probably clarify what the different arguments map to in the Ibis connection object.

In the case of a filepath being provided, it will map to the backend method read_{file_format}, e.g. https://ibis-project.org/backends/duckdb#ibis.backends.duckdb.Backend.read_parquet

Otherwise, it will get the table from ibis.{backend}.Backend.table, where the Backend object is obtained through ibis.{backend}.connect, e.g. https://ibis-project.org/backends/mssql
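In other words, at the Ibis level the two paths correspond roughly to the following (a sketch; the DuckDB backend and the paths are illustrative):

import ibis

con = ibis.duckdb.connect("company.db")     # what the dataset builds from `connection`
t1 = con.read_parquet("data/cars.parquet")  # filepath case: read_{file_format}
t2 = con.table("cars")                      # table case: Backend.table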

@deepyaman (Member Author):

Sorry, I forgot to add an example, it seems. 😅

@deepyaman (Member Author):

> The documentation here should probably clarify what the different arguments map to in the Ibis connection object.

Added much more detail on this!

if self._filepath is not None:
    reader = getattr(self.connection, f"read_{self._file_format}")
    return reader(self._filepath, self._table_name, **self._load_args)
else:
    return self.connection.table(self._table_name)

@inigohidalgo:

Currently, when loading tables through the mssql backend, I am doing conn.table(table_name="XXX", schema="A").

An easy solution would be to add the load args into the conn.table call and specify the schema as a load_arg, but IMO this is suboptimal, as the load and save schema should usually (in my experience; my assertion could be wrong) be the same.

Comment:

@gforsyth not sure if you have any insight here wrt the upcoming changes in ibis around schema and db hierarchy

@gforsyth:

So starting in 9.0, schema will be deprecated as a kwarg. We will be using the word "database" to refer to a collection of tables, and the word "catalog" to refer to a collection of "databases".

For specifying only the database, you would do:

conn.table(table_name="XX", database="A")

if the database is located in a catalog (like, say dbo or sys) you would do either of:

conn.table(table_name="XX", database="dbo.A")

or

conn.table(table_name="XX", database=("dbo", "A"))

(schema will still work in 9.0, it will just raise a FutureWarning)

@property
def connection(self) -> BaseBackend:
    cls = type(self)
    key = tuple(sorted(self._connection_config.items()))
@inigohidalgo commented Mar 22, 2024:

I get an unhashable type: 'dict' error when using this connection method:

TableDataset(
    connection={
        "host": "xxx.sql.azuresynapse.net",
        "database": "xxx",
        "query": {"driver": "ODBC Driver 17 for SQL Server"},
        "backend": "mssql",
    },
    table_name="xxx",
    load_args={"schema": "xxx"},
)

Manually dropping the query kwarg before the tuple call works as a temporary workaround to get a working connection, but it's obviously not ideal.

@deepyaman (Member Author):

@inigohidalgo Good catch! I've done some digging, and opted to use json.dumps(self._connection_config, sort_keys=True) instead. I think this should handle the case you brought up.

@deepyaman (Member Author):

Ugh, that doesn't work for values that aren't JSON-serializable...

@deepyaman (Member Author):

OK, fixed and added some test cases!
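For readers following along, a minimal sketch of one way to build a hashable cache key from a nested connection config (illustrative only; not necessarily the exact fix that was merged):

def _connection_key(config: dict) -> tuple:
    """Recursively freeze a (possibly nested) config dict into a hashable tuple."""

    def _freeze(value):
        if isinstance(value, dict):
            return tuple(sorted((k, _freeze(v)) for k, v in value.items()))
        if isinstance(value, (list, set)):
            return tuple(_freeze(v) for v in value)
        return value

    return _freeze(config)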

@deepyaman force-pushed the feat/datasets/ibis-table-dataset branch from c085312 to bbec217 on April 8, 2024 16:34
@deepyaman (Member Author)

> I get some weird error when I try to run the test manually; CI seems to be fine with it. Maybe it's the Ibis example dataset fetching from GCS.
>
> Is there a way to make this test run without network access? It seems like it is fetching data from a bucket, but I'm not sure about that. After re-running, the CI also failed.

@noklam I've updated it to use our canonical [[1, 2], [4, 5], [5, 6]] example. Still using DuckDB, since it's probably worth highlighting DB interactions (including the default Ibis backend).

@merelcht (Member) left a comment:

I haven't tried to use the dataset yet, but the code looks all good to me! Thanks for adding the detailed examples and docstrings, they're great! ⭐

connection: Configuration for connecting to an Ibis backend.
load_args: Additional arguments passed to the Ibis backend's
    `read_{file_format}` method.
save_args: Additional arguments passed to the Ibis backend's
    `create_{materialized}` method.
Member:

Any docs we can link to that specify which values a user can supply to materialized?

@deepyaman (Member Author):

I'm not aware of anywhere it's documented centrally (i.e. not on a backend-specific basis) in Ibis. @lostmygithubaccount any chance you know?

@lostmygithubaccount:

We do not; adding API docs for this alongside the other read_* and to_* methods is on my TODO list.

@deepyaman (Member Author):

Sounds good! @merelcht if it's OK then, we'll improve the documentation on the Ibis side, and then whenever that happens, we can make the dataset docs reference that.
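In the meantime, Ibis backends expose create_table and create_view methods, and materialized presumably selects among these create_* calls (an assumption, not confirmed in this thread). A hedged sketch of the mapping, using DuckDB:

import ibis

con = ibis.duckdb.connect()  # in-memory DuckDB
t = ibis.memtable({"a": [1, 2]})

con.create_table("t_as_table", t, overwrite=True)  # save_args: {"materialized": "table"}
con.create_view("t_as_view", t, overwrite=True)    # save_args: {"materialized": "view"}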

@deepyaman merged commit cf283b2 into main Apr 10, 2024
14 checks passed
@deepyaman deleted the feat/datasets/ibis-table-dataset branch April 10, 2024 09:43
@astrojuanlu (Member)

So happy to see this merged 🔥 let's make some noise!

tgoelles pushed a commit to tgoelles/kedro-plugins that referenced this pull request Jun 6, 2024
* feat(datasets): add dataset to load/save with Ibis

Signed-off-by: Deepyaman Datta <[email protected]>

* build(datasets): fix typos in definition of extras

Signed-off-by: Deepyaman Datta <[email protected]>

* build(datasets): include Ibis backend requirements

Signed-off-by: Deepyaman Datta <[email protected]>

* test(datasets): implement save and reload for Ibis

Signed-off-by: Deepyaman Datta <[email protected]>

* test(datasets): check `ibis.TableDataset.exists()`

Signed-off-by: Deepyaman Datta <[email protected]>

* test(datasets): test extra load and save args work

Signed-off-by: Deepyaman Datta <[email protected]>

* test(datasets): verify config and materializations

Signed-off-by: Deepyaman Datta <[email protected]>

---------

Signed-off-by: Deepyaman Datta <[email protected]>
Signed-off-by: tgoelles <[email protected]>