
Improve performance and extensibility for Doris-Spark-Connector #2106

Closed
wuyunfeng opened this issue Oct 30, 2019 · 1 comment

wuyunfeng commented Oct 30, 2019

Is your feature request related to a problem? Please describe.
Previously, Doris-Spark-Connector used a custom Thrift binary protocol for data exchange between Apache Spark and Apache Doris (incubating). This design approach introduces performance and extensibility problems:

  1. A low compression ratio on large datasets increases network bandwidth utilization, which also brings performance side effects.

  2. Higher complexity for other compute services to integrate with. Doris-Spark-Connector does not serve only Apache Spark; exposing a pruned-filter-scan ability over Doris data to other compute services is our original intention.

Previous PR:
#1525

Describe the solution you'd like
Introduce Arrow IPC serialization in Doris-Spark-Connector to resolve the problems above.

Apache Arrow is a cross-language development platform for in-memory data. It specifies a standardized language-independent columnar memory format for flat and hierarchical data, organized for efficient analytic operations on modern hardware. It also provides computational libraries and zero-copy streaming messaging and interprocess communication.
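As a rough sketch of the proposed approach, the Java snippet below deserializes an Arrow IPC stream on the connector side using Arrow's Java library. The class name ArrowBatchReader, and the assumption that a BE scan response payload arrives as raw IPC bytes, are illustrative only, not the final connector API.

```java
import java.io.ByteArrayInputStream;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamReader;

public class ArrowBatchReader {
    /**
     * Deserialize one Arrow IPC stream (e.g. the payload of a scan
     * response) and walk its record batches.
     */
    public static void readBatches(byte[] ipcStreamBytes) throws Exception {
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             ArrowStreamReader reader = new ArrowStreamReader(
                     new ByteArrayInputStream(ipcStreamBytes), allocator)) {
            VectorSchemaRoot root = reader.getVectorSchemaRoot();
            while (reader.loadNextBatch()) {  // one record batch per iteration
                int rows = root.getRowCount();
                // Columnar access: each field is a vector; the connector
                // would convert these into Spark rows / columnar batches.
                System.out.println("batch with " + rows + " rows, schema: "
                        + root.getSchema());
            }
        }
    }
}
```

Each loadNextBatch() call refills the same VectorSchemaRoot in place, so the connector can translate whole columnar batches at once instead of decoding one row at a time as with the Thrift protocol.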

Apache Arrow has the features below:

  • Fast

Apache Arrow™ enables execution engines to take advantage of the latest SIMD (Single instruction, multiple data) operations included in modern processors, for native vectorized optimization of analytical data processing. Columnar layout is optimized for data locality for better performance on modern hardware like CPUs and GPUs.

The Arrow memory format supports zero-copy reads for lightning-fast data access without serialization overhead (a producer-side sketch of the IPC stream format follows this list).

  • Flexible

Arrow acts as a new high-performance interface between various systems. It is also focused on supporting a wide variety of industry-standard programming languages. C, C++, C#, Go, Java, JavaScript, MATLAB, Python, R, Ruby, and Rust implementations are in progress and more languages are welcome.

  • Standard

Apache Arrow is backed by key developers of 13 major open source projects, including Calcite, Cassandra, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, and Storm, making it the de facto standard for columnar in-memory analytics.
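To make the zero-copy point under "Fast" concrete, here is a minimal producer-side sketch that serializes a single int32 column into the Arrow IPC stream format with the same Java library; the class name ArrowBatchWriter and the column name k1 are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.util.Collections;

import org.apache.arrow.memory.BufferAllocator;
import org.apache.arrow.memory.RootAllocator;
import org.apache.arrow.vector.IntVector;
import org.apache.arrow.vector.VectorSchemaRoot;
import org.apache.arrow.vector.ipc.ArrowStreamWriter;
import org.apache.arrow.vector.types.pojo.ArrowType;
import org.apache.arrow.vector.types.pojo.Field;
import org.apache.arrow.vector.types.pojo.Schema;

public class ArrowBatchWriter {
    /** Serialize one int32 column to the Arrow IPC stream format. */
    public static byte[] writeBatch(int[] values) throws Exception {
        Schema schema = new Schema(Collections.singletonList(
                Field.nullable("k1", new ArrowType.Int(32, true))));
        try (BufferAllocator allocator = new RootAllocator(Long.MAX_VALUE);
             VectorSchemaRoot root = VectorSchemaRoot.create(schema, allocator)) {
            IntVector vector = (IntVector) root.getVector("k1");
            vector.allocateNew(values.length);
            for (int i = 0; i < values.length; i++) {
                vector.setSafe(i, values[i]);
            }
            root.setRowCount(values.length);

            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (ArrowStreamWriter writer =
                         new ArrowStreamWriter(root, null, out)) {
                writer.start();
                writer.writeBatch();  // column buffers written without row-by-row encoding
                writer.end();
            }
            return out.toByteArray();
        }
    }
}
```

The bytes produced here are exactly what the reader sketch above consumes: the column buffers are written to the stream as-is, which is where the savings over row-by-row Thrift encoding come from.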
