
Improve performance and extensibility for Doris-Spark-Connector #2106

Closed
wuyunfeng opened this issue Oct 30, 2019 · 1 comment

wuyunfeng commented Oct 30, 2019

Is your feature request related to a problem? Please describe.
Previously, Doris-Spark-Connector used a custom Thrift binary protocol for data exchange between Apache Spark and Apache Doris (incubating). This design approach introduces performance and extensibility problems:

  1. A low compression ratio on large datasets increases network bandwidth utilization, which also brings performance side effects.

  2. Higher complexity for other compute services to integrate with. Doris-Spark-Connector does not serve only Apache Spark; exposing a pruned-filter-scan ability over Doris data to other compute services is our original intention.

Previous PR:
#1525

Describe the solution you'd like
Introduce Arrow IPC serialization in Doris-Spark-Connector to resolve the problems above.

Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication.
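As a rough sketch of the proposed approach, the Java snippet below deserializes an Arrow IPC stream on the connector side using Arrow's Java library. The class name ArrowBatchReader, and the assumption that a BE scan response payload arrives as raw IPC bytes, are illustrative only, not the final connector API.

```java
import java.io.ByteArrayInputStream;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamReader;

public class ArrowBatchReader {
    /**
     * Deserialize one Arrow IPC stream (e.g. the payload of a scan
     * response) and walk its record batches.
     */
    public static void readBatches(byte[] ipcStreamBytes) throws Exception {
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             ArrowStreamReader reader = new ArrowStreamReader(
                     new ByteArrayInputStream(ipcStreamBytes), allocator)) {
            VectorSchemaRoot root = reader.getVectorSchemaRoot();
            while (reader.loadNextBatch()) {  // one record batch per iteration
                int rows = root.getRowCount();
                // Columnar access: each field is a vector; the connector
                // would convert these into Spark rows / columnar batches.
                System.out.println("batch with " + rows + " rows, schema: "
                        + root.getSchema());
            }
        }
    }
}
```

Each loadNextBatch() call refills the same VectorSchemaRoot in place, so the connector can translate whole columnar batches at once instead of decoding one row at a time as with the Thrift protocol.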

Apache Arrow has the features below:

  • Fast

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single instruction, multiple data) operations included in modern processors, for native vectorized optimization of analytical data processing. Columnar layout is optimized for data locality for better performance on modern hardware like CPUs and GPUs.

The Arrow memory format supports zero-copy reads for lightning-fast data access without serialization overhead (a producer-side sketch of the IPC stream format follows this list).

  • Flexible

Arrow acts as a new high-performance interface between various systems. It is also focused on supporting a wide variety of industry-standard programming languages. C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust implementations are in progress and more languages are welcome.

  • Standard

Apache Arrow is backed by key developers of 13 major open source projects, including Calcite, Cassandra, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, and Storm, making it the de facto standard for columnar in-memory analytics.
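To make the zero-copy point under "Fast" concrete, here is a minimal producer-side sketch that serializes a single int32 column into the Arrow IPC stream format with the same Java library; the class name ArrowBatchWriter and the column name k1 are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.Collections;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

public class ArrowBatchWriter {
    /** Serialize one int32 column to the Arrow IPC stream format. */
    public static byte[] writeBatch(int[] values) throws Exception {
        Schema schema = new Schema(Collections.singletonList(
                Field.nullable("k1", new ArrowType.Int(32, true))));
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
            IntVector vector = (IntVector) root.getVector("k1");
            vector.allocateNew(values.length);
            for (int i = 0; i < values.length; i++) {
                vector.setSafe(i, values[i]);
            }
            root.setRowCount(values.length);

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (ArrowStreamWriter writer =
                         new ArrowStreamWriter(root, null, out)) {
                writer.start();
                writer.writeBatch();  // column buffers written without row-by-row encoding
                writer.end();
            }
            return out.toByteArray();
        }
    }
}
```

The bytes produced here are exactly what the reader sketch above consumes: the column buffers are written to the stream as-is, which is where the savings over row-by-row Thrift encoding come from.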
