v2: Implement transforms against DataFusion DataFrame, drop custom UDFs #525

Merged 44 commits on Oct 30, 2024


jonmmease
Collaborator

This PR is a big step toward the Rust goals for VegaFusion 2.0. It removes the vegafusion-dataframe and vegafusion-sql crates and re-implements the data transforms against the standard DataFusion DataFrame struct. This removes several stages of indirection and complexity.

This temporarily removes support for executing queries with DuckDB. The plan for reintroducing support for external SQL engines is to use the datafusion-federation crate to translate query plans into external SQL queries during evaluation. This will let us support external SQL queries while still implementing transforms against the DataFusion DataFrame.
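To illustrate the general idea of translating a query plan into a SQL string, here is a minimal sketch. It does not use the datafusion-federation API or DataFusion's own plan types; the `Plan` enum and `to_sql` function are hypothetical names invented for this example, and identifiers and predicates are passed through without quoting or validation.

```rust
/// Hypothetical miniature query plan. This is NOT DataFusion's LogicalPlan
/// and NOT part of the datafusion-federation crate.
#[derive(Debug, Clone)]
enum Plan {
    /// Read every column of a named table.
    Scan { table: String },
    /// Keep rows that satisfy a raw SQL predicate.
    Filter { input: Box<Plan>, predicate: String },
    /// Keep only the listed columns.
    Project { input: Box<Plan>, columns: Vec<String> },
    /// Keep at most `n` rows.
    Limit { input: Box<Plan>, n: usize },
}

/// Render a plan as SQL by wrapping each step's input in a subquery.
/// Assumption: names and predicates are trusted, so nothing is escaped.
fn to_sql(plan: &Plan) -> String {
    match plan {
        Plan::Scan { table } => format!("SELECT * FROM {}", table),
        Plan::Filter { input, predicate } => {
            format!("SELECT * FROM ({}) AS t WHERE {}", to_sql(input), predicate)
        }
        Plan::Project { input, columns } => {
            format!("SELECT {} FROM ({}) AS t", columns.join(", "), to_sql(input))
        }
        Plan::Limit { input, n } => {
            format!("SELECT * FROM ({}) AS t LIMIT {}", to_sql(input), n)
        }
    }
}
```

The point of the design is that transforms only ever build a plan; whether the plan runs locally or is rendered to SQL for an external engine is decided later, at evaluation time.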

Along the way, I audited all of the custom DataFusion UDFs that were contained in the vegafusion-datafusion-udf crate and replaced as many as possible with standard DataFusion features. In particular, DataFusion now has very complete support for timestamps with timezones, and timezone conversion, so nearly all of the custom timestamp UDFs are now expressible with standard DataFusion expressions. This reduces our complexity and will make it easier to support external SQL execution in the future.

Another change is that timestamps are now represented internally as timestamps with timezones (rather than as naive timestamps that are assumed to be in UTC). Naive timestamps are given a timezone on import. Date columns are no longer converted to timestamps on import, and are handled throughout transforms as appropriate.
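As a rough illustration of what "giving a naive timestamp a timezone" means, here is a standard-library-only sketch. It is not the code from this PR: `ZonedTimestamp`, `localize`, `with_offset`, and `wall_clock_millis` are hypothetical names, and a fixed UTC offset stands in for a real time zone (so no daylight-saving rules). Which zone gets assigned on import is not stated above, so the sketch takes it as a parameter.

```rust
/// Hypothetical timezone-aware timestamp: an absolute instant plus the
/// zone it is displayed in. Assumption: a fixed UTC offset replaces a
/// real time zone, so daylight-saving transitions are ignored.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct ZonedTimestamp {
    /// Milliseconds since the Unix epoch, in UTC.
    utc_millis: i64,
    /// Offset of the zone from UTC in seconds (east is positive).
    offset_secs: i32,
}

/// Attach a zone to a naive wall-clock reading on import.
/// `naive_millis` is the wall-clock time encoded as if it were UTC.
fn localize(naive_millis: i64, offset_secs: i32) -> ZonedTimestamp {
    ZonedTimestamp {
        // A wall clock that is ahead of UTC maps to an earlier UTC instant.
        utc_millis: naive_millis - i64::from(offset_secs) * 1000,
        offset_secs,
    }
}

/// Re-express the same instant in another zone; the instant is unchanged.
fn with_offset(ts: ZonedTimestamp, offset_secs: i32) -> ZonedTimestamp {
    ZonedTimestamp { offset_secs, ..ts }
}

/// Wall-clock reading in the timestamp's own zone, encoded as if UTC.
fn wall_clock_millis(ts: ZonedTimestamp) -> i64 {
    ts.utc_millis + i64::from(ts.offset_secs) * 1000
}
```

Storing the instant and the zone separately is what makes timezone conversion cheap: converting changes only the zone, while comparisons and arithmetic keep operating on the UTC instant.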

}

impl VegaFusionDataset {
    pub fn fingerprint(&self) -> String {
        match self {
            VegaFusionDataset::Table { hash, .. } => hash.to_string(),
            VegaFusionDataset::DataFrame(df) => df.fingerprint().to_string(),
Collaborator Author

Keeping this VegaFusionDataset wrapper to hold the fingerprint, and to leave open the possibility of adding other plan types (e.g. DataFusion or Substrait plans).
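A minimal sketch of the wrapper idea, using only the standard library: one enum whose variants can each report a fingerprint, so new plan kinds can be added as variants later. This is not the VegaFusion implementation; `Dataset`, `PlanStub`, `hash_of`, and `from_rows` are hypothetical names, and `DefaultHasher` is an assumed stand-in for whatever hashing the real code uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Hypothetical stand-in for a query plan that has not been executed.
#[derive(Debug, Clone, Hash)]
struct PlanStub {
    sql: String,
}

/// Hypothetical wrapper: every variant can report a fingerprint.
#[derive(Debug, Clone)]
enum Dataset {
    /// Materialized rows with a hash computed once, at construction.
    Table { rows: Vec<Vec<i64>>, hash: u64 },
    /// An unexecuted plan, fingerprinted from its description.
    Plan(PlanStub),
}

/// Hash any hashable value. Assumption: DefaultHasher is good enough for a
/// sketch; its output is not guaranteed stable across Rust releases.
fn hash_of<T: Hash>(value: &T) -> u64 {
    let mut hasher = DefaultHasher::new();
    value.hash(&mut hasher);
    hasher.finish()
}

impl Dataset {
    /// Build a table variant, hashing the rows up front.
    fn from_rows(rows: Vec<Vec<i64>>) -> Self {
        let hash = hash_of(&rows);
        Dataset::Table { rows, hash }
    }

    /// Fingerprint used as a cache key: the stored hash for tables,
    /// a hash of the plan description for plans.
    fn fingerprint(&self) -> String {
        match self {
            Dataset::Table { hash, .. } => hash.to_string(),
            Dataset::Plan(plan) => hash_of(plan).to_string(),
        }
    }
}
```

Callers only depend on `fingerprint()`, so adding a variant for another plan representation would not change any code that uses the fingerprint as a cache key.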

@jonmmease jonmmease merged commit 8363810 into v2 Oct 30, 2024
19 checks passed
jonmmease added a commit that referenced this pull request Nov 16, 2024
…Fs (#525)

* wip refactor to use DataFusion's DataFrame

* Add aggregate support

* Port additional transforms

* Port additional transforms

* Port window transform

* Port fold transform

* Port impute transform

* Port pivot transform

* port timeunit

* get schema from first batch

* Don't require metadata match

* start porting stack

* finish stack transform port

* Use object-store with DataFusion to load from http

* wip time functions

* parse %Y-%m-%d in UTC like the browser

* Update timeunit transform to use datafusion operations

* remove unused UDFs

* json fallback to reqwest

* Fix timezone parsing

* Fix selection_test

* all custom spec tests passing

* get image_comparison tests passing

* Get all vegafusion-runtime tests passing

* fix

* fix

* remove more udfs

* remove vegafusion-datafusion-udfs, vegafusion-dataframe, and vegafusion-sql crates

* fix tests

* clippy fix

* format

* warnings / format

* python test updates

* Update to datafusion main

* fmt

* re-enable format millis test, fix substr args

* Support Utf8View in json writer

* fix remaining python tests

* fmt

* clippy fix

* fmt

* work around wasm-pack error

* add call to update-pkg.js

* remove some stale comments