Apply suggestions from code review
Co-authored-by: Alena Astrakhantseva <[email protected]>
sh-rp and AstrakhantsevaAA committed Nov 25, 2024
1 parent 9c2d16d commit 3d952fe
Showing 4 changed files with 31 additions and 30 deletions.
10 changes: 6 additions & 4 deletions docs/website/docs/general-usage/dataset-access/dataset.md
@@ -1,5 +1,5 @@
---
title: Accessing Loaded Data in Python
title: Accessing loaded data in Python
description: Conveniently accessing the data loaded to any destination in python
keywords: [destination, schema, data, access, retrieval]
---
@@ -158,7 +158,7 @@ arrow_table = items_relation.select("col1", "col2").limit(50).arrow()

## Supported destinations

All SQL and filesystem destinations supported by `dlt` can utilize this data access interface. For filesystem destinations, `dlt` [uses **DuckDB** under the hood](./sql-client.md#the-filesystem-sql-client) to create views from Parquet or JSONL files dynamically. This allows you to query data stored in files using the same interface as you would with SQL databases. If you plan on accessing data in buckets or the filesystem a lot this way, it is advised to load data as parquet instead of jsonl, as **DuckDB** is able to only load the parts of the data actually needed for the query to work.
All SQL and filesystem destinations supported by `dlt` can utilize this data access interface. For filesystem destinations, `dlt` [uses **DuckDB** under the hood](./sql-client.md#the-filesystem-sql-client) to create views from Parquet or JSONL files dynamically. This allows you to query data stored in files using the same interface as you would with SQL databases. If you plan on accessing data in buckets or the filesystem a lot this way, it is advised to load data as Parquet instead of JSONL, as **DuckDB** is able to only load the parts of the data actually needed for the query to work.
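
For example, you can switch a filesystem pipeline to Parquet by passing `loader_file_format` to `run` (a minimal sketch; the pipeline name, dataset name, and sample rows are placeholders, and the bucket location comes from your configuration):

```py
import dlt

pipeline = dlt.pipeline(
    pipeline_name="my_fs_pipeline",
    destination="filesystem",  # bucket_url and credentials are read from your dlt config
    dataset_name="my_dataset",
)

# loader_file_format="parquet" stores the loaded data as Parquet instead of JSONL
pipeline.run([{"id": 1}, {"id": 2}], table_name="items", loader_file_format="parquet")
```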

## Examples

@@ -206,12 +206,14 @@ custom_relation = dataset("SELECT * FROM items JOIN other_items ON items.id = ot
arrow_table = custom_relation.arrow()
```

**Note:** When using custom SQL queries with `dataset()`, methods like `limit` and `select` won't work. Include any filtering or column selection directly in your SQL query.
:::note
When using custom SQL queries with `dataset()`, methods like `limit` and `select` won't work. Include any filtering or column selection directly in your SQL query.
:::
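
For example, a possible sketch (the table and column names are hypothetical):

```py
# filtering, column selection, and limits all go into the SQL text itself
filtered_relation = dataset("SELECT id, name FROM items WHERE id > 100 LIMIT 50")
arrow_table = filtered_relation.arrow()
```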


### Loading a `ReadableRelation` into a pipeline table

Since the iter_arrow and iter_df methods are generators that iterate over the full ReadableRelation in chunks, you can use them as a resource for another (or even the same) dlt pipeline:
Since the `iter_arrow` and `iter_df` methods are generators that iterate over the full `ReadableRelation` in chunks, you can use them as a resource for another (or even the same) `dlt` pipeline:

```py
# Create a readable relation with a limit of 1m rows
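# NOTE: the lines below are an illustrative sketch; `dataset` comes from the earlier examples and all names are placeholders
limited_items_relation = dataset.items.limit(1_000_000)

# the generator returned by iter_arrow feeds the data into another pipeline chunk by chunk
other_pipeline = dlt.pipeline(pipeline_name="other_pipeline", destination="duckdb")
other_pipeline.run(limited_items_relation.iter_arrow(chunk_size=10_000), table_name="limited_items")
```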
@@ -11,23 +11,22 @@ Ibis is a powerful portable Python dataframe library. Learn more about what it i
`dlt` provides an easy way to hand over your loaded dataset to an Ibis backend connection.

:::tip
Not all destinations supported by `dlt` have an equivalent Ibis backend. Natively supported destinations include DuckDB (including Motherduck), Postgres, Redshift, Snowflake, Clickhouse, MSSQL (including Synapse), and BigQuery. The filesystem destination is supported via the [Filesystem SQL client](./sql-client#the-filesystem-sql-client); please install the duckdb backend for ibis to use it. Mutating data with ibis on the filesystem will not result in any actual changes to the persisted files.
Not all destinations supported by `dlt` have an equivalent Ibis backend. Natively supported destinations include DuckDB (including Motherduck), Postgres, Redshift, Snowflake, Clickhouse, MSSQL (including Synapse), and BigQuery. The filesystem destination is supported via the [Filesystem SQL client](./sql-client#the-filesystem-sql-client); please install the DuckDB backend for Ibis to use it. Mutating data with Ibis on the filesystem will not result in any actual changes to the persisted files.
:::

## Prerequisites

To use the Ibis backend, you will need to have the `ibis-framework` package with the correct ibis extra installed. The following example will install the duckdb backend:
To use the Ibis backend, you will need to have the `ibis-framework` package with the correct Ibis extra installed. The following example will install the DuckDB backend:

```sh
pip install ibis-framework[duckdb]
```

## Get an ibis connection from your dataset
## Get an Ibis connection from your dataset

dlt datasets have a helper method to return an ibis connection to the destination they live on. The returned object is a native ibis connection to the destination, which you can use to read and even transform data. Please consult the [ibis documentation](https://ibis-project.org/docs/backends/) to learn more about what you can do with ibis.
`dlt` datasets have a helper method to return an Ibis connection to the destination they live on. The returned object is a native Ibis connection to the destination, which you can use to read and even transform data. Please consult the [Ibis documentation](https://ibis-project.org/docs/backends/) to learn more about what you can do with Ibis.

```py

# get the dataset from the pipeline
dataset = pipeline._dataset()
dataset_name = pipeline.dataset_name
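
# NOTE: illustrative continuation of the snippet
# get the native Ibis connection from the dataset
ibis_connection = dataset.ibis()

# list the tables in the dataset; Ibis treats the dataset name as the database
print(ibis_connection.list_tables(database=dataset_name))
```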
30 changes: 15 additions & 15 deletions docs/website/docs/general-usage/dataset-access/sql-client.md
@@ -7,16 +7,16 @@ keywords: [data, dataset, sql]
# The SQL client

:::note
This page contains technical details about the implementation of the SQL client as well as information on how to use low-level APIs. If you simply want to query your data, it's advised to read the pages in this section on accessing data via dlt datasets, streamlit, or ibis.
This page contains technical details about the implementation of the SQL client as well as information on how to use low-level APIs. If you simply want to query your data, it's advised to read the pages in this section on accessing data via `dlt` datasets, Streamlit, or Ibis.
:::

Most dlt destinations use an implementation of the SqlClientBase class to connect to the physical destination to which your data is loaded. DDL statements, data insert or update commands, as well as SQL merge and replace queries, are executed via a connection on this client. It also is used for reading data for the [streamlit app](./streamlit.md) and [data access via dlt datasets](./dataset.md).
Most `dlt` destinations use an implementation of the `SqlClientBase` class to connect to the physical destination to which your data is loaded. DDL statements, data insert or update commands, as well as SQL merge and replace queries, are executed via a connection on this client. It also is used for reading data for the [Streamlit app](./streamlit.md) and [data access via `dlt` datasets](./dataset.md).

All SQL destinations make use of an SQL client; additionally, the filesystem has a special implementation of the SQL client which you can read about below.
All SQL destinations make use of an SQL client; additionally, the filesystem has a special implementation of the SQL client which you can read about [below](#the-filesystem-sql-client).

## Executing a query on the SQL client

You can access the SQL client of your destination via the sql_client method on your pipeline. The code below shows how to use the SQL client to execute a query.
You can access the SQL client of your destination via the `sql_client` method on your pipeline. The code below shows how to use the SQL client to execute a query.

```py
pipeline = dlt.pipeline(destination="bigquery", dataset_name="crm")
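
# NOTE: the rest of this snippet is an illustrative sketch; the table and filter value are hypothetical
with pipeline.sql_client() as client:
    # execute_query returns a cursor that is closed when the with block exits
    with client.execute_query("SELECT * FROM customers WHERE id = %s", 10) as cursor:
        # fetch all rows of the result as a list of tuples
        print(cursor.fetchall())
```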
@@ -31,9 +31,9 @@ with pipeline.sql_client() as client:

## Retrieving the data in different formats

The cursor returned by execute_query has several methods for retrieving the data. The supported formats are Python tuples, pandas DataFrame, and Arrow table.
The cursor returned by `execute_query` has several methods for retrieving the data. The supported formats are Python tuples, Pandas DataFrame, and Arrow table.

The code below shows how to retrieve the data as a pandas DataFrame and then manipulate it in memory:
The code below shows how to retrieve the data as a Pandas DataFrame and then manipulate it in memory:

```py
pipeline = dlt.pipeline(...)
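
# NOTE: illustrative sketch; the queried table and columns are hypothetical
with pipeline.sql_client() as client:
    with client.execute_query("SELECT reactions__laugh, reactions__hooray FROM issues") as cursor:
        # df() returns the query result as a DataFrame
        reactions = cursor.df()
```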
@@ -48,17 +48,17 @@ counts = reactions.sum(0).sort_values(0, ascending=False)

## Supported methods on the cursor

- `fetchall()`: returns all rows as a list of tuples
- `fetchone()`: returns a single row as a tuple
- `fetchmany(size=None)`: returns a number of rows as a list of tuples; if no size is provided, all rows are returned
- `df(chunk_size=None, **kwargs)`: returns the data as a pandas DataFrame; if chunk_size is provided, the data is retrieved in chunks of the given size
- `arrow(chunk_size=None, **kwargs)`: returns the data as an Arrow table; if chunk_size is provided, the data is retrieved in chunks of the given size
- `iter_fetch(chunk_size: int)`: iterates over the data in chunks of the given size as lists of tuples
- `iter_df(chunk_size: int)`: iterates over the data in chunks of the given size as pandas DataFrames
- `iter_arrow(chunk_size: int)`: iterates over the data in chunks of the given size as Arrow tables
- `fetchall()`: returns all rows as a list of tuples;
- `fetchone()`: returns a single row as a tuple;
- `fetchmany(size=None)`: returns a number of rows as a list of tuples; if no size is provided, all rows are returned;
- `df(chunk_size=None, **kwargs)`: returns the data as a Pandas DataFrame; if `chunk_size` is provided, the data is retrieved in chunks of the given size;
- `arrow(chunk_size=None, **kwargs)`: returns the data as an Arrow table; if `chunk_size` is provided, the data is retrieved in chunks of the given size;
- `iter_fetch(chunk_size: int)`: iterates over the data in chunks of the given size as lists of tuples;
- `iter_df(chunk_size: int)`: iterates over the data in chunks of the given size as Pandas DataFrames;
- `iter_arrow(chunk_size: int)`: iterates over the data in chunks of the given size as Arrow tables.

:::info
Which retrieval method you should use very much depends on your use case and the destination you are using. Some drivers for our destinations provided by their vendors natively support Arrow or pandas DataFrames; in these cases, we will use that interface. If they do not, `dlt` will convert lists of tuples into these formats.
Which retrieval method you should use very much depends on your use case and the destination you are using. Some drivers for our destinations provided by their vendors natively support Arrow or Pandas DataFrames; in these cases, we will use that interface. If they do not, `dlt` will convert lists of tuples into these formats.
:::
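
As an illustration, a hedged sketch of chunked retrieval (the table name and chunk size are placeholders):

```py
with pipeline.sql_client() as client:
    with client.execute_query("SELECT * FROM items") as cursor:
        # stream the result 10,000 rows at a time as DataFrames
        for chunk in cursor.iter_df(chunk_size=10_000):
            print(chunk.shape)
```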

## The filesystem SQL client
12 changes: 6 additions & 6 deletions docs/website/docs/general-usage/dataset-access/streamlit.md
@@ -1,5 +1,5 @@
---
title: Viewing your data with streamlit
title: Viewing your data with Streamlit
description: Viewing your data with streamlit
keywords: [data, dataset, streamlit]
---
@@ -9,12 +9,12 @@ keywords: [data, dataset, streamlit]
Once you have run a pipeline locally, you can launch a web app that displays the loaded data. For this to work, you will need to have the `streamlit` package installed.

:::tip
The streamlit app does not work with all destinations supported by `dlt`. Only destinations that provide a SQL client will work. The filesystem destination has support via the [Filesystem SQL client](./sql-client#the-filesystem-sql-client) and will work in most cases. Vector databases generally are unsupported.
The Streamlit app does not work with all destinations supported by `dlt`. Only destinations that provide a SQL client will work. The filesystem destination has support via the [Filesystem SQL client](./sql-client#the-filesystem-sql-client) and will work in most cases. Vector databases generally are unsupported.
:::

## Prerequisites

To install streamlit, run the following command:
To install Streamlit, run the following command:

```sh
pip install streamlit
@@ -35,11 +35,11 @@ Use the pipeline name you defined in your Python code with the `pipeline_name` a
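
For example, assuming your pipeline is named `my_pipeline`:

```sh
dlt pipeline my_pipeline show
```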

You can now inspect the schema and your data. Use the left sidebar to switch between:

* Exploring your data (default)
* Information about your loads
* Exploring your data (default);
* Information about your loads.


## Further reading

If you are running dlt in Python interactively or in a notebook, read the [Accessing your data with Python](./dataset.md) guide.
If you are running `dlt` in Python interactively or in a notebook, read the [Accessing loaded data in Python](./dataset.md) guide.
