Commit

add duckdb support (#1398)
ahuang11 authored Sep 25, 2024
1 parent 10d53b4 commit c689c60
Showing 18 changed files with 678 additions and 817 deletions.
Binary file modified doc/assets/diagram.png
1,174 changes: 367 additions & 807 deletions doc/assets/diagram.svg
19 changes: 19 additions & 0 deletions doc/index.md
Expand Up @@ -101,6 +101,7 @@ alt: Works with GeoPandas
align: center
---
:::
:::{tab-item} Polars
```python
import polars
Expand All @@ -116,6 +117,24 @@ align: center
---
:::
:::{tab-item} DuckDB
```python
import duckdb
import hvplot.duckdb
from bokeh.sampledata.autompg import autompg_clean as df
df_duckdb = duckdb.from_df(df)
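# DuckDBPyRelation has no pandas-style groupby, so aggregate with the relational API
table = df_duckdb.aggregate('origin, mfr, avg(mpg) AS mpg', 'origin, mfr').order('mpg DESC').limit(5)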
table.hvplot.barh('mfr', 'mpg', by='origin', stacked=True)
```
```{image} ./_static/home/pandas.gif
---
alt: Works with DuckDB
align: center
---
```

:::
:::{tab-item} Intake
```python
import hvplot.intake
Expand Down
117 changes: 108 additions & 9 deletions doc/user_guide/Integrations.ipynb
Expand Up @@ -254,19 +254,13 @@
},
{
"cell_type": "markdown",
"id": "a46e377e-729a-4f99-b5d3-83b0736cb8a3",
"id": "7474a792-2cfd-4139-a1cd-872f913fa07b",
"metadata": {},
"source": [
":::{note}\n",
"Added in version `0.9.0`.\n",
":::"
]
},
{
"cell_type": "markdown",
"id": "7474a792-2cfd-4139-a1cd-872f913fa07b",
"metadata": {},
"source": [
":::\n",
"\n",
":::{important}\n",
"While other data sources like `Pandas` or `Dask` have built-in support in HoloViews, as of version 1.17.1 this is not yet the case for `Polars`. You can track this [issue](https://github.com/holoviz/holoviews/issues/5939) to follow the evolution of this feature in HoloViews. Internally hvPlot simply selects the columns that contribute to the plot and casts them to a Pandas object using Polars' `.to_pandas()` method.\n",
":::"
Expand Down Expand Up @@ -327,6 +321,111 @@
"df_polars['A'].hvplot.line(height=150)"
]
},
{
"cell_type": "markdown",
"id": "efc2f45e",
"metadata": {},
"source": [
"#### DuckDB"
]
},
{
"cell_type": "markdown",
"id": "db91860c",
"metadata": {},
"source": [
":::{note}\n",
"Added in version `0.11.0`.\n",
":::"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0d6460d0",
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
"\n",
"df_pandas = pd.DataFrame(np.random.randn(1000, 4), columns=list('ABCD')).cumsum()\n",
"df_pandas.head(2)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "21638d45",
"metadata": {},
"outputs": [],
"source": [
"import hvplot.duckdb # noqa \n",
"import duckdb\n",
"\n",
"connection = duckdb.connect(':memory:')\n",
"relation = duckdb.from_df(df_pandas, connection=connection)\n",
"relation.to_view(\"example_view\");"
]
},
{
"cell_type": "markdown",
"id": "40b56f16",
"metadata": {},
"source": [
"`.hvplot()` supports [DuckDB](https://duckdb.org/docs/api/python/overview.html) `DuckDBPyRelation` and `DuckDBPyConnection` objects."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f588e3fe",
"metadata": {},
"outputs": [],
"source": [
"relation.hvplot.line(y=['A', 'B', 'C', 'D'], height=150)"
]
},
{
"cell_type": "markdown",
"id": "68a47856",
"metadata": {},
"source": [
"Plotting from a `DuckDBPyRelation` is somewhat more efficient because hvPlot subsets the columns within DuckDB before the data is converted to a `pd.DataFrame`.\n",
"\n",
"It is therefore preferable, when possible, to use `connection.sql()`, which returns a `DuckDBPyRelation`, rather than `connection.execute()`, which returns a `DuckDBPyConnection`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "214c60ee",
"metadata": {},
"outputs": [],
"source": [
"sql_expr = \"SELECT * FROM example_view WHERE A > 0 AND B > 0\"\n",
"connection.sql(sql_expr).hvplot.line(y=['A', 'B'], hover_cols=[\"C\"], height=150) # subsets A, B, C"
]
},
{
"cell_type": "markdown",
"id": "2a2f61d4",
"metadata": {},
"source": [
"Alternatively, you can directly subset the desired columns in the SQL expression."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "5ce25c3d",
"metadata": {},
"outputs": [],
"source": [
"sql_expr = \"SELECT A, B, C FROM example_view WHERE A > 0 AND B > 0\"\n",
"connection.execute(sql_expr).hvplot.line(y=['A', 'B'], hover_cols=[\"C\"], height=150)"
]
},
{
"cell_type": "markdown",
"id": "25a6e724-6a84-4bff-9108-ac71dcfa9116",
Expand Down
1 change: 1 addition & 0 deletions doc/user_guide/Introduction.ipynb
Expand Up @@ -15,6 +15,7 @@
"\n",
"* [Pandas](https://pandas.pydata.org): DataFrame, Series (columnar/tabular data)\n",
"* [Rapids cuDF](https://docs.rapids.ai/api/cudf/stable/): GPU DataFrame, Series (columnar/tabular data)\n",
"* [DuckDB](https://www.duckdb.org/): DuckDB is a fast in-process analytical database\n",
"* [Polars](https://www.pola.rs/): Polars is a fast DataFrame library/in-memory query engine (columnar/tabular data)\n",
"* [Dask](https://www.dask.org): DataFrame, Series (distributed/out of core arrays and columnar data)\n",
"* [XArray](https://xarray.pydata.org): Dataset, DataArray (labelled multidimensional arrays)\n",
Expand Down
1 change: 1 addition & 0 deletions envs/py3.10-tests.yaml
Expand Up @@ -21,6 +21,7 @@ dependencies:
- dask
- dask>=2021.3.0
- datashader>=0.6.5
- duckdb
- fiona
- fugue
- fugue-sql-antlr>=0.2.0
Expand Down
1 change: 1 addition & 0 deletions envs/py3.11-docs.yaml
Expand Up @@ -20,6 +20,7 @@ dependencies:
- colorcet>=2
- dask>=2021.3.0
- datashader>=0.6.5
- duckdb
- fiona
- fugue
- fugue-sql-antlr>=0.2.0
Expand Down
1 change: 1 addition & 0 deletions envs/py3.11-tests.yaml
Expand Up @@ -21,6 +21,7 @@ dependencies:
- dask
- dask>=2021.3.0
- datashader>=0.6.5
- duckdb
- fiona
- fugue
- fugue-sql-antlr>=0.2.0
Expand Down
1 change: 1 addition & 0 deletions envs/py3.12-tests.yaml
Expand Up @@ -21,6 +21,7 @@ dependencies:
- dask
- dask>=2021.3.0
- datashader>=0.6.5
- duckdb
- fiona
- fugue
- fugue-sql-antlr>=0.2.0
Expand Down
1 change: 1 addition & 0 deletions envs/py3.9-tests.yaml
Expand Up @@ -20,6 +20,7 @@ dependencies:
- dask
- dask>=2021.3.0
- datashader>=0.6.5
- duckdb
- fiona
- fugue
- fugue-sql-antlr>=0.2.0
Expand Down
4 changes: 4 additions & 0 deletions hvplot/converter.py
Expand Up @@ -55,6 +55,7 @@
    is_tabular,
    is_series,
    is_dask,
    is_duckdb,
    is_intake,
    is_cudf,
    is_streamz,
Expand Down Expand Up @@ -1094,6 +1095,9 @@ def _process_data(
        elif is_dask(data):
            datatype = 'dask'
            self.data = data.persist() if persist else data
        elif is_duckdb(data):
            datatype = 'duckdb'
            self.data = data
        elif is_cudf(data):
            datatype = 'cudf'
            self.data = data
Expand Down
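
The `is_duckdb` helper imported above is not shown in this part of the diff. A minimal sketch of how such a check could look, assuming it only needs to recognise the two DuckDB classes patched in `hvplot/duckdb.py`. The name `is_duckdb_like` and the `sys.modules` lookup are illustrative assumptions, not the helper's actual implementation:

```python
import sys


def is_duckdb_like(data):
    """Hypothetical check: is `data` a DuckDB relation or connection?"""
    # If duckdb was never imported, `data` cannot be a DuckDB object,
    # and we avoid importing an optional dependency just to test for it.
    duckdb = sys.modules.get('duckdb')
    if duckdb is None:
        return False
    return isinstance(data, (duckdb.DuckDBPyRelation, duckdb.DuckDBPyConnection))


# Plain Python objects are never reported as DuckDB data.
not_duckdb = is_duckdb_like([1, 2, 3])
```

Looking the module up in `sys.modules` keeps the check cheap for users who do not have DuckDB installed.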
27 changes: 27 additions & 0 deletions hvplot/duckdb.py
@@ -0,0 +1,27 @@
"""Adds the `.hvplot` method to duckdb.DuckDBPyRelation and duckdb.DuckDBPyConnection"""


def patch(name='hvplot', interactive='interactive', extension='bokeh', logo=False):
    from hvplot.plotting.core import hvPlotTabularDuckDB
    from . import post_patch, _module_extensions

    if 'hvplot.duckdb' not in _module_extensions:
        try:
            import duckdb
        except ImportError:
            raise ImportError(
                'Could not patch plotting API onto DuckDB. DuckDB could not be imported.'
            )

        # Patching for DuckDBPyRelation and DuckDBPyConnection
        _patch_duckdb_plot = lambda self: hvPlotTabularDuckDB(self)  # noqa: E731
        _patch_duckdb_plot.__doc__ = hvPlotTabularDuckDB.__call__.__doc__
        plot_prop_duckdb = property(_patch_duckdb_plot)
        setattr(duckdb.DuckDBPyRelation, name, plot_prop_duckdb)
        setattr(duckdb.DuckDBPyConnection, name, plot_prop_duckdb)
        _module_extensions.add('hvplot.duckdb')

    post_patch(extension, logo)


patch()
7 changes: 6 additions & 1 deletion hvplot/plotting/__init__.py
@@ -1,5 +1,5 @@
import holoviews as hv
from ..util import with_hv_extension, is_polars
from ..util import with_hv_extension, is_duckdb, is_polars

from .core import hvPlot, hvPlotTabular # noqa

Expand Down Expand Up @@ -34,6 +34,11 @@ def plot(data, kind, **kwargs):
        from .core import hvPlotTabularPolars

        return hvPlotTabularPolars(data)(kind=kind, **no_none_kwargs)

    elif is_duckdb(data):
        from .core import hvPlotTabularDuckDB

        return hvPlotTabularDuckDB(data)(kind=kind, **no_none_kwargs)
    return hvPlotTabular(data)(kind=kind, **no_none_kwargs)


Expand Down
83 changes: 83 additions & 0 deletions hvplot/plotting/core.py
Expand Up @@ -1864,6 +1864,89 @@ def labels(self, x=None, y=None, text=None, **kwds):
        return self(x, y, text=text, kind='labels', **kwds)


class hvPlotTabularDuckDB(hvPlotTabular):
    def _get_converter(self, x=None, y=None, kind=None, **kwds):
        import duckdb
        from duckdb.typing import (
            BIGINT,
            FLOAT,
            DOUBLE,
            INTEGER,
            SMALLINT,
            TINYINT,
            UBIGINT,
            UINTEGER,
            USMALLINT,
            UTINYINT,
            HUGEINT,
        )

        params = dict(self._metadata, **kwds)
        x = x or params.pop('x', None)
        y = y or params.pop('y', None)
        kind = kind or params.pop('kind', None)

        # Handle DuckDB Relation and Connection objects
        if isinstance(self._data, (duckdb.DuckDBPyConnection, duckdb.DuckDBPyRelation)):
            if isinstance(self._data, duckdb.DuckDBPyConnection):
                data = self._data.df()
            else:
                data = self._data

            if params.get('hover_cols') != 'all':
                data_columns = data.columns
                possible_columns = [
                    [v] if isinstance(v, str) else v
                    for v in params.values()
                    if isinstance(v, (str, list))
                ]

                columns = (set(data_columns) & set(itertools.chain(*possible_columns))) or {
                    data_columns[0]
                }
                if y is None:
                    # When y is not specified HoloViewsConverter finds all the numeric
                    # columns and use them as y values (see _process_chart_y). We need
                    # to include these columns too.

                    if isinstance(data, duckdb.DuckDBPyRelation):
                        numeric_columns = data.select_types(
                            [
                                BIGINT,
                                FLOAT,
                                DOUBLE,
                                INTEGER,
                                SMALLINT,
                                TINYINT,
                                UBIGINT,
                                UINTEGER,
                                USMALLINT,
                                UTINYINT,
                                HUGEINT,
                            ]
                        ).columns
                    else:
                        numeric_columns = data.select_dtypes(include='number').columns
                    columns |= set(numeric_columns)
                xs = x if is_list_like(x) else (x,)
                ys = y if is_list_like(y) else (y,)
                columns |= {*xs, *ys}
                columns.discard(None)

                if isinstance(data, duckdb.DuckDBPyRelation):
                    columns = sorted(columns, key=lambda c: data_columns.index(c))
                    data = data.select(*columns).to_df()
                else:
                    columns = sorted(columns, key=lambda c: data.columns.get_loc(c))
                    data = data[list(columns)]
        else:
            raise ValueError(
                'Only duckdb.DuckDBPyConnection and duckdb.DuckDBPyRelation are supported'
            )

        return HoloViewsConverter(data, x, y, kind=kind, **params)


class hvPlotTabularPolars(hvPlotTabular):
    def _get_converter(self, x=None, y=None, kind=None, **kwds):
        import polars as pl
Expand Down
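
The column-subsetting logic in `hvPlotTabularDuckDB._get_converter` can be followed with plain Python lists. The sketch below mirrors the set arithmetic from the diff without DuckDB or pandas. The function name `columns_to_select` and the explicit `numeric_columns` argument are assumptions made for illustration; in the diff the numeric columns come from `select_types` or `select_dtypes`:

```python
import itertools


def columns_to_select(data_columns, numeric_columns, x=None, y=None, **params):
    """Return the columns a plot needs, in their original order."""
    # Any string or list-of-strings keyword value may name a column
    possible = [
        [v] if isinstance(v, str) else v
        for v in params.values()
        if isinstance(v, (str, list))
    ]
    # Keep only real column names; fall back to the first column if none match
    columns = (set(data_columns) & set(itertools.chain(*possible))) or {data_columns[0]}
    if y is None:
        # Without an explicit y, every numeric column is a candidate y value
        columns |= set(numeric_columns)
    xs = x if isinstance(x, (list, tuple)) else (x,)
    ys = y if isinstance(y, (list, tuple)) else (y,)
    columns |= {*xs, *ys}
    columns.discard(None)
    return sorted(columns, key=data_columns.index)


# Mirrors the notebook call: line(y=['A', 'B'], hover_cols=['C'], height=150)
selected = columns_to_select(
    ['A', 'B', 'C', 'D'], ['A', 'B', 'C', 'D'],
    y=['A', 'B'], hover_cols=['C'], height=150,
)
```

Here `selected` holds `A`, `B` and `C`, which matches the "subsets A, B, C" comment in the notebook cell; `D` is never pulled out of the relation.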