Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Library Guide: Add Using the DataFrame API #8319

Merged
merged 3 commits into from
Nov 28, 2023
Merged

Conversation

Veeupup
Copy link
Contributor

@Veeupup Veeupup commented Nov 25, 2023

Signed-off-by: veeupup [email protected]

Which issue does this PR close?

Closes #7305

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@Veeupup Veeupup requested a review from andygrove November 28, 2023 11:29
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @Veeupup -- this is a great addition. I left a few suggestions, but I also think we could make those changes as a follow on PR.


You can also serialize `DataFrame` to a file. For now, `Datafusion` supports write `DataFrame` to `csv`, `json` and `parquet`.

Before writing to a file, it will call collect to calculate all the results of the DataFrame and then write to file.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think the DataFrame API calls collect -- instead I think it uses the streaming APIs

Suggested change
Before writing to a file, it will call collect to calculate all the results of the DataFrame and then write to file.
When writing a file, DataFusion will execute the DataFrame and stream the results to a file.


## Transform between LogicalPlan and DataFrame

As it is showed above, `DataFrame` is just a very thin wrapper of `LogicalPlan`, so you can easily go back and forth between them.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
As it is showed above, `DataFrame` is just a very thin wrapper of `LogicalPlan`, so you can easily go back and forth between them.
As shown above, `DataFrame` is just a very thin wrapper of `LogicalPlan`, so you can easily go back and forth between them.

Coming Soon
## What is a DataFrame

`DataFrame` is a basic concept in `datafusion` and is only a thin wrapper over LogicalPlan.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
`DataFrame` is a basic concept in `datafusion` and is only a thin wrapper over LogicalPlan.
`DataFrame` in `DataFrame` is modeled after the Pandas DataFrame interface, and is a thin wrapper over LogicalPlan that adds functionality for building and executing those plans.

}
```

For both `DataFrame` and `LogicalPlan`, you can build the query manually, such as:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
For both `DataFrame` and `LogicalPlan`, you can build the query manually, such as:
You can build up `DataFrame`s using its methods, similarly to building `LogicalPlan`s using `LogicalPlanBuilder`:

```rust
let df = ctx.table("users").await?;

let new_df = df.select(vec![col("id"), col("bank_account")])?
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
let new_df = df.select(vec![col("id"), col("bank_account")])?
// Create a new DataFrame sorted by `id`, `bank_account`
let new_df = df.select(vec![col("id"), col("bank_account")])?


You can manually call the `DataFrame` API or automatically generate a `DataFrame` through the SQL query planner just like:

use `sql` to construct `DataFrame`:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
use `sql` to construct `DataFrame`:
For example, to use `sql` to construct `DataFrame`:

let dataframe = ctx.sql("SELECT * FROM users;").await?;
```

construct `DataFrame` manually
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
construct `DataFrame` manually
To construct `DataFrame` using the API:


## Collect / Streaming Exec

When you have a `DataFrame`, you may want to access the results of the internal `LogicalPlan`. You can do this by using `collect` to retrieve all outputs at once, or `streaming_exec` to obtain a `SendableRecordBatchStream`.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
When you have a `DataFrame`, you may want to access the results of the internal `LogicalPlan`. You can do this by using `collect` to retrieve all outputs at once, or `streaming_exec` to obtain a `SendableRecordBatchStream`.
DataFusion `DataFrame`s are "lazy", meaning they do not do any processing until they are executed, which allows for additional optimizations.
When you have a `DataFrame`, you can run it in one of three ways:
1. `collect` which executes the query and buffers all the output into a `Vec<RecordBatch>`
2. `streaming_exec`, which begins executions and returns a `SendableRecordBatchStream` which incrementally computes output on each call to `next()`
3. `cache` which executes the query and buffers the output into a new in memory DataFrame.

let batches = df.collect().await?;
```

You can also use stream output to iterate the `RecordBatch`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
You can also use stream output to iterate the `RecordBatch`
You can also use stream output to incrementally generate output one `RecordBatch` at a time


Before writing to a file, it will call collect to calculate all the results of the DataFrame and then write to file.

For example, if you write it to a csv_file
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
For example, if you write it to a csv_file
For example, to write a csv_file

Signed-off-by: veeupup <[email protected]>
@Veeupup
Copy link
Contributor Author

Veeupup commented Nov 28, 2023

Thank you for your detailed comments! Very helpful! @alamb

I have changed its content as your comments : )

@alamb alamb added documentation Improvements or additions to documentation devrel labels Nov 28, 2023
@alamb alamb merged commit f1dbb2d into apache:main Nov 28, 2023
4 checks passed
@alamb
Copy link
Contributor

alamb commented Nov 28, 2023

Thanks again @Veeupup

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
documentation Improvements or additions to documentation
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Library Guide: Add Using the DataFrame API
4 participants