Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add documentation for FugueSQL integrations #523

Merged
merged 2 commits into from
May 16, 2022
Merged
Show file tree
Hide file tree
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
36 changes: 20 additions & 16 deletions dask_sql/integrations/fugue.py
Original file line number Diff line number Diff line change
Expand Up @@ -73,42 +73,46 @@ def fsql_dask(
register: bool = False,
fugue_conf: Any = None,
) -> Dict[str, dd.DataFrame]:
"""Fugue SQL utility function that can consume Context directly. Fugue SQL is a language
"""FugueSQL utility function that can consume Context directly. FugueSQL is a language
extending standard SQL. It makes SQL eligible to describe end to end workflows. It also
enables you to invoke python extensions in the SQL like language.

For more, please read
`Fugue SQl Tutorial <https://fugue-tutorials.readthedocs.io/en/latest/tutorials/fugue_sql/index.html/>`_
`FugueSQL Tutorial <https://fugue-tutorials.readthedocs.io/en/latest/tutorials/fugue_sql/index.html/>`_

Args:
sql: (:obj:`str`): Fugue SQL statement
sql (:obj:`str`): Fugue SQL statement
ctx (:class:`dask_sql.Context`): The context to operate on, defaults to None
register (:obj:`bool`): Whether to register named steps back to the context
(if provided), defaults to False
fugue_conf (:obj:`Any`): a dictionary like object containing Fugue specific configs

Example:
.. code-block:: python
# schema: *
def median(df:pd.DataFrame) -> pd.DataFrame:

# define a custom prepartition function for FugueSQL
def median(df: pd.DataFrame) -> pd.DataFrame:
df["y"] = df["y"].median()
return df.head(1)

# Create a context with tables df1, df2
# create a context with some tables
c = Context()
...
result = fsql_dask('''
j = SELECT df1.*, df2.x
FROM df1 INNER JOIN df2 ON df1.key = df2.key
PERSIST # using persist because j will be used twice
TAKE 5 ROWS PREPARTITION BY x PRESORT key
PRINT
TRANSFORM j PREPARTITION BY x USING median
PRINT
''', c, register=True)

# run a FugueSQL query using the context as input
query = '''
j = SELECT df1.*, df2.x
FROM df1 INNER JOIN df2 ON df1.key = df2.key
PERSIST
TAKE 5 ROWS PREPARTITION BY x PRESORT key
PRINT
TRANSFORM j PREPARTITION BY x USING median
PRINT
'''
result = fsql_dask(query, c, register=True)

assert "j" in result
assert "j" in c.tables

"""
_global, _local = get_caller_global_local_vars()

Expand Down
5 changes: 4 additions & 1 deletion docs/source/api.rst
Original file line number Diff line number Diff line change
Expand Up @@ -11,4 +11,7 @@ API Documentation

.. autofunction:: dask_sql.cmd_loop

.. autofunction:: dask_sql.integrations.fugue.fsql
.. autoclass:: dask_sql.integrations.fugue.DaskSQLExecutionEngine
:members:

.. autofunction:: dask_sql.integrations.fugue.fsql_dask
44 changes: 44 additions & 0 deletions docs/source/fugue.rst
Original file line number Diff line number Diff line change
@@ -0,0 +1,44 @@
FugueSQL Integrations
=====================

`FugueSQL <https://fugue-tutorials.readthedocs.io/tutorials/fugue_sql/index.html>`_ is a related project that aims to provide a unified SQL interface for a variety of different computing frameworks, including Dask.
While it offers a SQL engine with a larger set of supported commands, this comes at the cost of slower performance when using Dask in comparison to dask-sql.
In order to offer a "best of both worlds" solution, dask-sql includes several options to integrate with FugueSQL, using its faster implementation of SQL commands when possible and falling back on FugueSQL when necessary.

dask-sql as a FugueSQL engine
-----------------------------

FugueSQL users unfamiliar with dask-sql can take advantage of its functionality with minimal code changes by passing :obj:`dask_sql.integrations.fugue.DaskSQLExecutionEngine` into the ``FugueSQLWorkflow`` being used to execute commands.
For more information and sample usage, see `Fugue — dask-sql as a FugueSQL engine <https://fugue-tutorials.readthedocs.io/tutorials/integrations/dasksql.html>`_.

Using FugueSQL on an existing ``Context``
-----------------------------------------

dask-sql users attempting to expand their SQL querying options for an existing ``Context`` can use :func:`dask_sql.integrations.fugue.fsql_dask`, which executes the provided query using FugueSQL, using the tables within the provided context as input.
The results of this query can then optionally be registered to the context:

.. code-block:: python

# define a custom prepartition function for FugueSQL
def median(df: pd.DataFrame) -> pd.DataFrame:
df["y"] = df["y"].median()
return df.head(1)

# create a context with some tables
c = Context()
...

# run a FugueSQL query using the context as input
query = """
j = SELECT df1.*, df2.x
FROM df1 INNER JOIN df2 ON df1.key = df2.key
PERSIST
TAKE 5 ROWS PREPARTITION BY x PRESORT key
PRINT
TRANSFORM j PREPARTITION BY x USING median
PRINT
"""
result = fsql_dask(query, c, register=True) # results aren't registered by default

assert "j" in result # returns a dict of resulting tables
assert "j" in c.tables # results are also registered to the context
1 change: 1 addition & 0 deletions docs/source/index.rst
Original file line number Diff line number Diff line change
Expand Up @@ -98,6 +98,7 @@ For this example, we use some data loaded from disk and query it with a SQL comm
api
server
cmd
fugue
how_does_it_work
configuration

Expand Down