Is your feature request related to a problem? Please describe.
Previously, Doris-Spark-Connector used a custom Thrift binary protocol for data exchange between Apache Spark and Apache Doris (Incubating). This design introduces performance and scalability problems:
Low compression ratio on large datasets increases network bandwidth utilization, which in turn hurts performance.
Higher complexity for other compute services to integrate. Doris-Spark-Connector does not serve only Apache Spark; exposing a pruned-filter-scan ability to other compute services is our original intention.
Describe the solution you'd like
Introduce Arrow IPC serialization in Doris-Spark-Connector to resolve the problems above.
Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized, language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries, zero-copy streaming messaging, and interprocess communication.
Apache Arrow has the features below:
Fast
Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single instruction, multiple data) operations included in modern processors, for native vectorized optimization of analytical data processing. Columnar layout is optimized for data locality for better performance on modern hardware like CPUs and GPUs.
The Arrow memory format supports zero-copy reads for lightning-fast data access without serialization overhead.
Flexible
Arrow acts as a new high-performance interface between various systems. It is also focused on supporting a wide variety of industry-standard programming languages. C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust implementations are in progress and more languages are welcome.
Standard
Apache Arrow is backed by key developers of 13 major open source projects, including Calcite, Cassandra, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, and Storm, making it the de facto standard for columnar in-memory analytics.
Previous PR:
#1525