Blog post about datafusion 16 release #4804

Closed
Tracked by #4776 ...
alamb opened this issue Jan 3, 2023 · 9 comments · Fixed by apache/arrow-site#294

Comments

@alamb
Contributor

alamb commented Jan 3, 2023

As part of the 16.0.0 release I would like to write a blog post about DataFusion on https://arrow.apache.org/blog/ (source at https://github.com/apache/arrow-site).

I am thinking about a basic theme like "DataFusion is leading the charge to bring advanced OLAP technology everywhere".

I would like to highlight the theme summarized by Andy Pavlo in https://ottertune.com/blog/2022-databases-retrospective/:

> The long-term trend to watch is the proliferation of frameworks like Velox, DataFusion, and Polars. Along with projects like Substrait, the commoditization of these query execution components means that all OLAP DBMSs will be roughly equivalent in the next five years. Instead of building a new DBMS entirely from scratch or hard forking an existing system (e.g., how Firebolt forked Clickhouse), people are better off using an extensible framework like Velox. This means that every DBMS will have the same vectorized execution capabilities that were unique to Snowflake ten years ago. And since in the cloud, the storage layer is the same for everyone (e.g., Amazon controls EBS/S3), the critical differentiator between DBMS offerings will be things that are difficult to quantify, like UI/UX stuff and query optimization.

Some supporting evidence:

  • Several new databases built on DataFusion (synnada.ai, GreptimeDB, probably others)
  • GA of InfluxDB IOx

New features:

  • Advanced window functions (like unbounded windows)
  • Join support (TODO gather more details)
  • Optimizer advancements

Future directions:
1. Improved grouping / sorting performance
2. RLE (Run End Encoding) support
etc.

Here is the most recent blog post about DataFusion I know about: https://arrow.apache.org/blog/2022/10/25/datafusion-13.0.0 -- source at https://github.com/apache/arrow-site/blob/master/_posts/2022-10-25-datafusion-13.0.0.md

Please leave comments with your suggestions / ideas!

@wjones127
Member

Is Substrait worth mentioning as a future direction?

@andygrove
Member

> Is Substrait worth mentioning as a future direction?

Came here to say the same thing. I think we should definitely talk about Substrait. I can help with some content around that.

@andygrove
Member

I would like to also talk about the Python bindings, even though they are released separately. Maybe an update on TPC-H support as well.

@alamb
Contributor Author

alamb commented Jan 4, 2023

Great -- I'll try and get a draft up in the next few days if no one else beats me to it, and we can work on the content there

@alamb
Contributor Author

alamb commented Jan 4, 2023

> I would like to also talk about the Python bindings, even though they are released separately. Maybe an update on TPC-H support as well.

I would love to try and help / encourage a community there as it seems a great way to get more people using DataFusion

@ozankabak
Contributor

Sounds great. Another point to touch on could be the recently-improved streaming/incremental execution capabilities. We (the Synnada team) will be continuing to focus on this area in the near future, so even more improvements are coming!

@alamb
Contributor Author

alamb commented Jan 5, 2023

Also add something about benchmarking from @rdettai: https://www.cloudfuse.io/dashboards/standalone-engines

@alamb
Contributor Author

alamb commented Jan 7, 2023

Here is a draft post we can use to collaborate on: apache/arrow-site#294

@alamb
Contributor Author

alamb commented Jan 19, 2023

Rendered site: https://arrow.apache.org/blog/2023/01/19/datafusion-16.0.0/
