Docs: Document creating new extension APIs (#11425)
* Docs: Document creating new extension APIs

* fix

* Add clarification about extension APIs. Thanks @ozankabak

* Apply suggestions from code review

Co-authored-by: Mehmet Ozan Kabak <[email protected]>

* Add a paragraph on datafusion-contrib

* prettier

---------

Co-authored-by: Mehmet Ozan Kabak <[email protected]>
alamb and ozankabak authored Jul 16, 2024
1 parent 2837e02 commit 1331288
Showing 2 changed files with 75 additions and 1 deletion.
2 changes: 1 addition & 1 deletion datafusion/core/src/lib.rs
@@ -174,7 +174,7 @@
//!
//! DataFusion is designed to be highly extensible, so you can
//! start with a working, full featured engine, and then
-//! specialize any behavior for their usecase. For example,
+//! specialize any behavior for your usecase. For example,
//! some projects may add custom [`ExecutionPlan`] operators, or create their own
//! query language that directly creates [`LogicalPlan`] rather than using the
//! built in SQL planner, [`SqlToRel`].
74 changes: 74 additions & 0 deletions docs/source/contributor-guide/architecture.md
@@ -25,3 +25,77 @@ possible. You can find the most up to date version in the [source code].

[crates.io documentation]: https://docs.rs/datafusion/latest/datafusion/index.html#architecture
[source code]: https://github.com/apache/datafusion/blob/main/datafusion/core/src/lib.rs

## Forks vs Extension APIs

DataFusion is a fast-moving project, which results in frequent internal changes.
This benefits DataFusion by allowing it to evolve and respond quickly to
requests, but also means that maintaining a fork with major modifications
sometimes requires non-trivial work.

The public API (what is accessible if you use the DataFusion releases from
crates.io) is typically much more stable (though it does change from release to
release as well).

Thus, rather than forks, we recommend using one of the many extension APIs (such
as `TableProvider`, `OptimizerRule`, or `ExecutionPlan`) to customize
DataFusion. If you cannot do what you want with the existing APIs, we would
welcome you working with us to add new APIs to enable your use case, as
described in "Creating new Extension APIs" below.
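
To illustrate the general idea of customizing through an extension point rather than a fork, here is a minimal, self-contained sketch. It does not use DataFusion: `RowSource`, `Engine`, `register_table`, `sum_column`, and `Counter` are hypothetical names invented for this example, and the real `TableProvider` trait is considerably richer. The point is only that the engine depends on a trait, so a downstream crate can plug in new behavior without touching engine internals.

```rust
use std::collections::HashMap;

/// Hypothetical extension point (NOT DataFusion's `TableProvider`).
trait RowSource {
    /// Column names exposed by this source.
    fn columns(&self) -> Vec<String>;
    /// Produce all rows; each row holds one value per column.
    fn scan(&self) -> Vec<Vec<i64>>;
}

/// Hypothetical engine that only knows about the trait, not concrete sources.
#[derive(Default)]
struct Engine {
    tables: HashMap<String, Box<dyn RowSource>>,
}

impl Engine {
    /// Register a user-supplied source under a table name.
    fn register_table(&mut self, name: &str, source: Box<dyn RowSource>) {
        self.tables.insert(name.to_string(), source);
    }

    /// Sum one column of a registered table.
    /// Returns `None` if the table or the column is unknown.
    fn sum_column(&self, table: &str, column: &str) -> Option<i64> {
        let source = self.tables.get(table)?;
        let idx = source.columns().iter().position(|c| c == column)?;
        Some(source.scan().iter().map(|row| row[idx]).sum())
    }
}

/// A source defined "downstream", outside the engine.
struct Counter {
    n: i64,
}

impl RowSource for Counter {
    fn columns(&self) -> Vec<String> {
        vec!["value".to_string()]
    }

    /// Emits the rows 1, 2, ..., n in a single column.
    fn scan(&self) -> Vec<Vec<i64>> {
        (1..=self.n).map(|v| vec![v]).collect()
    }
}

fn main() {
    let mut engine = Engine::default();
    engine.register_table("counter", Box::new(Counter { n: 4 }));
    println!("{:?}", engine.sum_column("counter", "value"));
}
```

Because `Counter` only implements the trait, internal changes to `Engine` do not require changes to it as long as the trait itself stays stable, which is the property that makes extension APIs cheaper to maintain than a fork.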

## `datafusion-contrib`

While DataFusion comes with enough features "out of the box" to quickly start
with a working system, it can't include every useful feature (e.g.
`TableProvider`s for all data formats). The [`datafusion-contrib`] project
contains a collection of community maintained extensions that are not part of
the core DataFusion project, and not under Apache Software Foundation governance,
but may be useful to others in the community. If you are interested in adding a
feature to DataFusion, a new extension in `datafusion-contrib` is likely a good
place to start. Please [contact] us via GitHub issue, Slack, or Discord and
we'll gladly set up a new repository for your extension.

[`datafusion-contrib`]: https://github.com/datafusion-contrib
[contact]: ../contributor-guide/communication.md

## Creating new Extension APIs

DataFusion aims to be a general-purpose query engine, and thus the core crates
contain features that are useful for a wide range of use cases. Use case specific
functionality (such as very specific time series or stream processing features)
is typically implemented using the extension APIs.

If you have a use case that is not covered by the existing APIs, we would love to
work with you to design a new general purpose API. There are often others who are
interested in similar extensions, and the act of defining the API often improves
the code overall for everyone.

Extension APIs that provide "safe" default behaviors are more likely to be
suitable for inclusion in DataFusion, while APIs that require major changes to
built-in operators are less likely. For example, it might make less sense
to add an API to support a stream processing feature if that would result in
slower performance for built-in operators. It may still make sense to add
extension APIs for such features, but leave implementation of such operators in
downstream projects.
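
One way to picture a "safe" default is a trait method with a default implementation, so existing operators keep their behavior unless they opt in. The following is a minimal sketch under that assumption. It does not use DataFusion, and `Operator`, `supports_early_emit`, `Filter`, and `RunningSum` are hypothetical names chosen for illustration, not actual DataFusion APIs.

```rust
/// Hypothetical extension trait (names are illustrative, not DataFusion APIs).
trait Operator {
    fn name(&self) -> &str;

    /// Process one batch of values.
    fn execute(&self, input: Vec<i64>) -> Vec<i64>;

    /// New extension hook with a "safe" default: operators that do not
    /// override it report that they cannot emit partial results early.
    fn supports_early_emit(&self) -> bool {
        false
    }
}

/// Stands in for a built-in operator; it is untouched by the new hook.
struct Filter {
    min: i64,
}

impl Operator for Filter {
    fn name(&self) -> &str {
        "Filter"
    }

    /// Keep only the values that are at least `min`.
    fn execute(&self, input: Vec<i64>) -> Vec<i64> {
        input.into_iter().filter(|v| *v >= self.min).collect()
    }
}

/// Stands in for a downstream streaming operator that opts in to the hook.
struct RunningSum;

impl Operator for RunningSum {
    fn name(&self) -> &str {
        "RunningSum"
    }

    /// Emit the running (prefix) sum of the input values.
    fn execute(&self, input: Vec<i64>) -> Vec<i64> {
        input
            .iter()
            .scan(0i64, |acc, v| {
                *acc += v;
                Some(*acc)
            })
            .collect()
    }

    fn supports_early_emit(&self) -> bool {
        true
    }
}

fn main() {
    let ops: Vec<Box<dyn Operator>> =
        vec![Box::new(Filter { min: 2 }), Box::new(RunningSum)];
    for op in &ops {
        println!("{} early_emit={}", op.name(), op.supports_early_emit());
    }
}
```

In this sketch the built-in `Filter` needs no code change when the hook is added, while the specialized `RunningSum` lives downstream and overrides the default.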

The process to create a new extension API is typically:

- Look for an existing issue describing what you want to do, and file one if it
  doesn't yet exist.
- Discuss what the API would look like. Feel free to ask contributors (via `@`
  mentions) for feedback (you can find such people by looking at the most
  recently changed PRs and issues).
- Prototype the new API, typically by adding an example (in
  `datafusion-examples` or refactoring existing code) to show how it would work.
- Create a PR with the new API, and work with the community to get it merged.

Some benefits of using an example-based approach are:

- Any future API change must keep your example compiling and running, which
  guards against regressions in functionality.
- If the APIs do change, the example serves as a blueprint for the changes
  needed in your code (just look at what changed in your example).

An example of this process was [creating a SQL Extension Planning API].

[creating a sql extension planning api]: https://github.com/apache/datafusion/issues/11207
