Skip to content

Commit

Permalink
Minor: port some content to the docs (#5684)
Browse files Browse the repository at this point in the history
* Minor: port some content to the docs

* prettier
  • Loading branch information
alamb authored Mar 22, 2023
1 parent 92c985e commit af97ac8
Showing 1 changed file with 41 additions and 10 deletions.
51 changes: 41 additions & 10 deletions docs/source/user-guide/introduction.md
Original file line number Diff line number Diff line change
Expand Up @@ -19,21 +19,52 @@

# Introduction

DataFusion is an extensible query execution framework, written in
Rust, that uses [Apache Arrow](https://arrow.apache.org) as its
DataFusion is a very fast, extensible query engine for building high-quality data-centric systems in
[Rust](http://rustlang.org), using the [Apache Arrow](https://arrow.apache.org)
in-memory format.

DataFusion supports SQL and a DataFrame API for building logical query
plans, an extensive query optimizer, and a multi-threaded parallel
execution execution engine for processing partitioned data sources
such as CSV and Parquet files extremely quickly.
DataFusion offers SQL and Dataframe APIs, excellent [performance](https://benchmark.clickhouse.com/), built-in support for CSV, Parquet, JSON, and Avro, extensive customization, and a great community.

## Features

- Feature-rich [SQL support](https://arrow.apache.org/datafusion/user-guide/sql/index.html) and [DataFrame API](https://arrow.apache.org/datafusion/user-guide/dataframe.html)
- Blazingly fast, vectorized, multi-threaded, streaming execution engine.
- Native support for Parquet, CSV, JSON, and Avro file formats. Support
for custom file formats and non file datasources via the `TableProvider` trait.
- Many extension points: user defined scalar/aggregate/window functions, DataSources, SQL,
other query languages, custom plan and execution nodes, optimizer passes, and more.
- Streaming, asynchronous IO directly from popular object stores, including AWS S3,
Azure Blob Storage, and Google Cloud Storage. Other storage systems are supported via the
`ObjectStore` trait.
- [Excellent Documentation](https://docs.rs/datafusion/latest) and a
[welcoming community](https://arrow.apache.org/datafusion/contributor-guide/communication.html).
- A state of the art query optimizer with projection and filter pushdown, sort aware optimizations,
automatic join reordering, expression coercion, and more.
- Permissive Apache 2.0 License, Apache Software Foundation governance
- Written in [Rust](https://www.rust-lang.org/), a modern system language with development
productivity similar to Java or Golang, the performance of C++, and
[loved by programmers everywhere](https://insights.stackoverflow.com/survey/2021#technology-most-loved-dreaded-and-wanted).
- Support for [Substrait](https://substrait.io/) for query plan serialization, making it easier to integrate DataFusion
with other projects, and to pass plans across language boundaries.

## Use Cases

DataFusion is used to create modern, fast and efficient data
pipelines, ETL processes, and database systems, which need the
performance of Rust and Apache Arrow and want to provide their users
the convenience of an SQL interface or a DataFrame API.
DataFusion can be used without modification as an embedded SQL
engine or can be customized and used as a foundation for
building new systems. Here are some examples of systems built using DataFusion:

- Specialized Analytical Database systems such as [CeresDB] and more general Apache Spark like system such a [Ballista].
- New query language engines such as [prql-query] and accelerators such as [VegaFusion]
- Research platform for new Database Systems, such as [Flock]
- SQL support to another library, such as [dask sql]
- Streaming data platforms such as [Synnada]
- Tools for reading / sorting / transcoding Parquet, CSV, AVRO, and JSON files such as [qv]
- A faster Spark runtime replacement [Blaze]

By using DataFusion, the projects are freed to focus on their specific
features, and avoid reimplementing general (but still necessary)
features such as an expression representation, standard optimizations,
execution plans, file format support, etc.

## Why DataFusion?

Expand Down

0 comments on commit af97ac8

Please sign in to comment.