This repository has been archived by the owner on Mar 27, 2024. It is now read-only.

21 Dec 03:38

HongW2019

v1.5.0

66660a0

v1.5.0 Latest

Latest

OAP 1.5.0 released with three major components: Gazelle, OAP MLlib and CloudTik. Here are the major features/improvements in OAP 1.5.0.

CloudTik

Optimize CloudTik Spark runtime and achieve 1.4X performance speedup. The optimizations to Spark include runtime filter, top-N computation optimization, sized based join reorder, distinct before intersecting, flatten scaler subquery and flatten single row aggregate subquery and removing joins for subqueries of “IN” filtering
Support Kubernetes and integrate with AWS EKS and GCP GKE
Design and implement auto scaling up and scaling down mechanism
Support a useful network topology with all nodes with only private IP by utilizing VPC peering, which provides a practical and isolated way for running private clusters without head public IP
Enhance CloudTik API for in-cluster usage. This new in-cluster (prefixed with This) API can be used in jobs for querying or operating the cluster for desired states. This API is pretty useful in ML/DL job integrations
Implement the experimental Spark ML/DL runtime. Horovod on Spark as distributed training, Hyperopt as super parameter tuning, MLflow as training and model management, work together with various deep learning frameworks including TensorFlow, PyTorch, MXNet. Spark ML/DL runtime supports running distributed training upon cloud storages (S3, Azure Data Lake Gen2, GCS)
Dozens of improvements on usability, features and stabilities

Gazelle

Reach 1.57X overall performance speedup vs. vanilla Spark 3.2.1 on TPC-DS with 5TB dataset on ICX clusters
Reach 1.77X overall performance speedup vs. vanilla Spark 3.2.1 on TPC-H with 5TB dataset on ICX clusters
Support more columnar expressions
Support dynamic merge file partitions
Error handling on Codegen and now Gazelle is more robust
Fix memory leak issues
Improve native Parquet read/write performance
Improve operator performance on Sort/HashAgg/HashJoin
Achieve performance goal on customer workloads
Add Spark 3.2.2 support; now Gazelle can support Spark 3.1.1, Spark 3.1.2, Spark 3.1.3, Spark 3.2.1 and Spark 3.2.2

OAP MLlib

Reach over 11X performance speedup vs. vanilla Spark with PCA, Linear and Ridge Regression on ICX clusters

Changelog

Gazelle Plugin

Features


#931	Reuse partition vectors for arrow scan
#955	implement missing expressions
#1120	Support aggregation window functions with order by
#1135	Supports Spark 3.2.2 shims
#1114	Remove tmp directory after application exits
#862	implement row_number window function
#1007	Document how to test columnar UDF
#942	Use hash aggregate for string type input

Performance


#1144	Optimize cast WSCG performance

Bugs Fixed


#1170	Segfault on data source v2
#1164	Limit the column num in WSCG
#1166	Peers' values should be considered in window function for CURRENT ROW in range mode
#1149	Vulnerability issues
#1112	Validate Error： “Invalid: Length spanned by binary offsets (21) larger than values array (size 20)”
#1103	wrong hashagg results
#929	Failed to add user extension while using gazelle
#1100	Wildcard in json path is not supported
#1079	Like function gets wrong result when default escape char is contained
#1046	Fall back to use row-based operators, error is makeStructField is unable to parse from conv
#1053	Exception when there is function expression in pos or len of substring
#1024	ShortType is not supported in ColumnarLiteral
#1034	Exception when there is unix_timestamp in CaseWhen
#1032	Missing WSCG check for ExistenceJoin
#1027	partition by literal in window function
#1019	Support more date formats for from_unixtime & unix_timestamp
#999	The performance of using ColumnarSort operator to sort string type is significantly lower than that of native spark Sortexec
#984	concat_ws
#958	JVM/Native R2C and CoalesceBatcth process time inaccuracy
#979	Failed to find column while reading parquet with case insensitive

PRs


#1192	[NSE-1191] fix AQE exchange reuse in Spark3.2
#1180	[NSE-1193] fix jni unload
#1175	[NSE-1171] Support merge parquet schema and read missing schema
#1178	[NSE-1161][FOLLOWUP] Remove extra compression type check
#1162	[NSE-1161] Support read-write parquet conversion to read-write arrow
#1014	[NSE-956] allow to write parquet with compression
#1176	bump h2/pgsql version
#1173	[NSE-1171] Throw RuntimeException when reading duplicate fields in case-insensitive mode
#1172	[NSE-1170] Setting correct row number in batch scan w/ partition columns
#1169	[NSE-1161] Format sql config string key
#1167	[NSE-1166] Cover peers' values in sum window function in range mode
#1165	[NSE-1164] Limit the max column num in WSCG
#1160	[NSE-1149] upgrade guava to 30.1.1
#1158	[NSE-1149] upgrade guava to 30.1.1
#1152	[NSE-1149] upgrade guava to 24.1.1
#1153	[NSE-1149] upgrade pgsql to 42.3.3
#1150	[NSE-1149] Remove log4j in shims module
#1146	[NSE-1135] Introduce shim layer for supporting spark 3.2.2
#1145	[NSE-1144] Optimize cast wscg performance
#1136	Remove project from wscg when it's the child of window
#1122	[NSE-1120] Support sum window function with order by statement
#1131	[NSE-1114] Remove temp directory without FileUtils.forceDeleteOnExit
#1129	[NSE-1127] Use larger buffer for hash agg
#1130	[NSE-610] fix hashjoin build time metric
#1126	[NSE-1125] Add status check for hashing GetOrInsert
#1056	[NSE-955] Support window function lag
#1123	[NSE-1118] fix codegen on TPCDS Q88
#1119	[NSE-1118] adding more checks for SMJ codegen
#1058	[NSE-981] Add a test suite for projection codegen
#1117	[NSE-1116] Disable columnar url_decoder
#1113	[NSE-1112] Fix Arrow array meta data validating issue when writing parquet files
#1039	[NSE-1019] fix codegen for all expressions
#1115	[NSE-1114] Remove tmp directory after application exits
#1111	remove debug log
#1098	[NSE-1108] allow to use different cases in column names
[#1082](h...

Assets 3

07 Jul 06:50

HongW2019

v1.4.0

2659106

v1.4.0

Overview

OAP 1.4.0 released with three major components: Gazelle, OAP MLlib and CloudTik (a new addition). In this release, 59 features/improvements were committed to Gazelle; OAP MLlib is paused to focus more on oneDAL; CloudTik is a cloud scale platform for distributed analytics and AI on public cloud providers including AWS, Azure, GCP, and so on. CloudTik enables users or enterprises to easily create and manage analytics and AI platform on public clouds with out-of-box optimized functionalities and performance, and to go quickly to focus on running the business workloads in minutes or hours instead of spending months to construct and optimize the platform.

Here are the major features/improvements in OAP 1.4.0.

Gazelle

Reach 1.6X overall performance vs. vanilla Spark on TPC-DS 103 queries with 5TB dataset on ICX clusters
Reach 1.8X overall performance vs. vanilla Spark on TPC-H 22 queries with 5TB dataset on ICX clusters
Implement split by reducer by column and allocate large block of memory for all reducer to optimize shuffle
Optimize Columnar2Row and Row2Columnar performance
Add support for 10+ expressions
Pack the classes into one single jar
Bugfixes on WholeStage Codegen on unsupported pattern, sort spill, etc

CloudTik

Scalable, robust, powerful and unified control plane for Cloud cluster and runtime management
Support 3 major public Cloud providers AWS, GCP and Azure with managed cloud storage and on-premise mode
Support of Cloud workspace to manage shared Cloud resources including VPC and subnets, cloud storage, firewalls, identities/roles and so on
Out-of-box runtimes including Spark, Presto/Trino, HDFS, Metastore, Kafka, Zookeeper and Ganglia
Integrate with OAP (Gazelle and MLlib) and various tools to run benchmarks on CloudTik

OAP MLlib

Reach over 11X performance vs. vanilla Spark with PCA, Linear and Ridge Regression on ICX clusters

Changelog

Gazelle

Features


#781	Add spark eventlog analyzer for advanced analyzing
#927	Column2Row further enhancement
#913	Add Hadoop 3.3 profile to pom.xml
#869	implement first agg function
#926	Support UDF URLDecoder
#856	[SHUFFLE] manually split of Variable length buffer (String likely)
#886	Add pmod function support
#855	[SHUFFLE] HugePage support in shuffle
#872	implement replace function
#867	Add substring_index function support
#818	Support length, char_length, locate, regexp_extract
#864	Enable native parquet write by default
#828	CoalesceBatches native implementation
#800	Combine datasource and columnar core jar

Performance


#848	Optimize Columnar2Row performance
#943	Optimize Row2Columnar performance
#854	Enable skipping columnarWSCG for queries with small shuffle size
#857	[SHUFFLE] split by reducer by column

Bugs Fixed


#827	Github action is broken
#987	TPC-H q7, q8, q9 run failed when using String for Date
#892	Q47 and q57 failed on ubuntu 20.04 OS without open-jdk.
#784	Improve Sort Spill
#788	Spark UT of "randomSplit on reordered partitions" encountered "Invalid: Map array child array should have no nulls" issue
#821	Improve Wholestage Codegen check
#831	Support more expression types in getting attribute
#876	Write arrow hang with OutputWriter.path
#891	Spark executor lost while DatasetFileWriter failed with speculation
#909	"INSERT OVERWRITE x SELECT /+ REPARTITION(2) / * FROM y LIMIT 2" drains 4 rows into table x using Arrow write extension
#889	Failed to write with ParquetFileFormat while using ArrowWriteExtension
#910	TPCDS failed, segfault caused by PR903
#852	Unit test fix for NSE-843
#843	ArrowDataSouce: Arrow dataset inspect() is called every time a file is read

PRs


#1005	[NSE-800] Fix an assembly warning
#1002	[NSE-800] Pack the classes into one single jar
#988	[NSE-987] fix string date
#977	[NSE-126] set default codegen opt to O1
#975	[NSE-927] Add macro AVX512BW check for different CPU architecture
#962	[NSE-359] disable unit tests on spark32 package
#966	[NSE-913] Add support for Hadoop 3.3.1 when packaging
#936	[NSE-943] Optimize IsNULL() function for Row2Columnar
#937	[NSE-927] Implement AVX512 optimization selection in Runtime and merge two C2R code files into one.
#951	[DNM] update sparklog
#938	[NSE-581] implement rlike/regexp_like
#946	[DNM] update on sparklog script
#939	[NSE-581] adding ShortType/FloatType in ColumnarLiteral
#934	[NSE-927] Extract and inline functions for native ColumnartoRow
#933	[NSE-581] Improve GetArrayItem(Split()) performance
#922	[NSE-912] Remove extra handleSafe costs
#925	[NSE-926] Support a UDF: URLDecoder
#924	[NSE-927] Enable AVX512 in Binary length calculation for native ColumnartoRow
#918	[NSE-856] Optimize of string/binary split
#908	[NSE-848] Optimize performance for Column2Row
#900	[NSE-869] Add 'first' agg function support
#917	[NSE-886] Add pmod expression support
#916	[NSE-909] fix slow test
#915	[NSE-857] Further optimizations of validity buffer split
#912	[NSE-909] "INSERT OVERWRITE x SELECT /+ REPARTITION(2) / * FROM y L…
#896	[NSE-889] Failed to write with ParquetFileFormat while using ArrowWriteExtension
#911	[NSE-910] fix bug of PR903
#901	[NSE-891] Spark executor lost while D...

Assets 3

11 Apr 10:10

HongW2019

v1.3.1

e472131

v1.3.1

Overview

OAP 1.3.1 is a maintenance release and contains two major components: Gazelle and OAP MLlib. In this release, 51 issues/improvements were committed.
Here are the major features/improvements in OAP 1.3.1.

Gazelle (Native SQL Engine)

Reach 1.5X overall performance vs. vanilla Spark on TPC-DS 103 queries with 5TB dataset on ICX clusters
Reach 1.7X overall performance vs. vanilla Spark on TPC-H 22 queries with 3TB dataset on ICX clusters
Support Spark-3.1.1, Spark-3.1.2, Spark-3.1.3, Spark-3.2.0 and Spark-3.2.1
Support rand expression and complex types for ColumnarSortExec
Refactor on shuffled hash join/hash agg
Bug fix for SMJ and memory allocation in row to columnar, etc

OAP MLlib

Reach over 12X performance vs. vanilla Spark using PCA, Linear and Ridge Regression on ICX clusters
Support Spark-3.1.1, Spark-3.1.2, Spark-3.1.3, Spark-3.2.0, Spark-3.2.1 and CDH Spark
Bump Intel oneAPI Base Toolkit to 2022.1.2

Changelog

Gazelle Plugin

Features


#710	Add rand expression support
#745	improve codegen check
#761	Update the document to reflect the changes in build and deployment
#635	Document the incompatibility with Spark on Expressions
#702	Print output datatype for columnar shuffle on WebUI
#712	[Nested type] Optimize Array split and support nested Array
#732	[Nested type] Support Struct and Map nested types in Shuffle
#759	Add spark 3.1.2 & 3.1.3 as supported versions for 3.1.1 shim layer

Performance


#610	refactor on shuffled hash join/hash agg

Bugs Fixed


#755	GetAttrFromExpr unsupported issue when run TPCDS Q57
#764	add java.version to clarify jdk version
#774	Fix runtime issues on spark 3.2
#778	Failed to find include file while running code gen
#725	gazelle failed to run with spark local
#746	Improve memory allocation on native row to column operator
#770	There are cast exception and null pointer expection in spark-3.2
#772	ColumnarBatchScan name missing in UI for Spark321
#740	Handle exceptions like std::out_of_range in casting string to numeric types in WSCG
#727	Create table failed with TPCH partiton dataset
#719	Wrong result on TPC-DS Q38, Q87
#705	Two unit tests failed on master branch

PRs


#834	[NSE-746]Fix memory allocation in row to columnar
#809	[NSE-746]Fix memory allocation in row to columnar
#817	[NSE-761] Update document to reflect spark 3.2.x support
#805	[NSE-772] Code refactor for ColumnarBatchScan
#802	[NSE-794] Fix count() with decimal value
#779	[NSE-778] Failed to find include file while running code gen
#798	[NSE-795] Fix a consecutive SMJ issue in wscg
#799	[NSE-791] fix xchg reuse in Spark321
#773	[NSE-770] [NSE-774] Fix runtime issues on spark 3.2
#787	[NSE-774] Fallback broadcast exchange for DPP to reuse
#763	[NSE-762] Add complex types support for ColumnarSortExec
#783	[NSE-782] prepare 1.3.1 release
#777	[NSE-732]Adding new config to enable/disable complex data type support
#776	[NSE-770] [NSE-774] Fix runtime issues on spark 3.2
#765	[NSE-764] declare java.version for maven
#767	[NSE-610] fix unit tests on SHJ
#760	[NSE-759] Add spark 3.1.2 & 3.1.3 as supported versions for 3.1.1 shim layer
#757	[NSE-746]Fix memory allocation in row to columnar
#724	[NSE-725] change the code style for ExecutorManger
#751	[NSE-745] Improve codegen check for expression
#742	[NSE-359] [NSE-273] Introduce shim layer to fix compatibility issues for gazelle on spark 3.1 & 3.2
#754	[NSE-755] Quick fix for ConverterUtils.getAttrFromExpr for TPCDS queries
#749	[NSE-732] Support Map complex type in Shuffle
#738	[NSE-610] hashjoin opt1
#733	[NSE-732] Support Struct complex type in Shuffle
#744	[NSE-740] fix codegen with out_of_range check
#743	[NSE-740] Catch out_of_range exception in casting string to numeric types in wscg
#735	[NSE-610] hashagg opt#2
#707	[NSE-710] Add rand expression support
#734	[NSE-727] Create table failed with TPCH partiton dataset, patch 2
#715	[NSE-610] hashagg opt#1
#731	[NSE-727] Create table failed with TPCH partiton dataset
#713	[NSE-712] Optimize Array split and support nested Array
#721	[NSE-719][backport]fix null check in SMJ
#720	[NSE-719] fix null check in SMJ
#718	Following NSE-702, fix for AQE enabled case
#691	[NSE-687]Try to upgrade log4j
#703	[NSE-702] Print output datatype for columnar shuffle on WebUI
#706	[NSE-705] Fallback R2C on unsupported cases
#657	[NSE-635] Add document to clarify incompatibility issues in expressions
#623	[NSE-602] Fix Array type shuffle split segmentation fault
#693	[NSE-692] JoinBenchmark is broken

OAP MLlib

Features


#189	Intel-MLlib not support spark-3.2.1 version
#186	[Core] Support CDH versions
#187	Intel-MLlib not support spark-3.1.3 version.
[#180](https://github.c...

Assets 3

14 Jan 14:19

zhixingheyi-tian

v1.3.0

4d7567a

v1.3.0

Overview

OAP 1.3.0 contains two major components: Gazelle and OAP MLlib. In this release, 74 issues/improvements were committed.
Here are the major features/improvements in OAP 1.3.0.

Gazelle (Native SQL Engine):

Further optimization to gain 1.5X overall performance on TPC-DS 103 queries;
Further optimization to gain 1.7X overall performance on TPC-H 22 queries;
Add ORC Read support;
Add Native Row to Column support;
Add 10+ expressions support;
Complete function & performance evaluation on GCP Dataproc platform;
Bugfixes on Columnar WholeStage Codegen, Columnar Sort operators,etc.

OAP MLlib:

Support Covariance & Low Order Moments algorithms optimization for CPU;
Gain over 8x training performance for Covariance algorithm.
Add support for multiple Spark versions (Spark 3.1.1 and Spark 3.2).

Changelog

Gazelle Plugin

Features


#550	[ORC] Support ORC Format Reading
#188	Support Dockerfile
#574	implement native LocalLimit/GlobalLimit
#684	BufferedOutputStream causes massive futex system calls
#465	Provide option to rely on JVM GC to release Arrow buffers in Java
#681	Enable gazelle to support two math expressions: ceil & floor
#651	Set Hadoop 3.2 as default in pom.xml
#126	speed up codegen
#596	[ORC] Verify whether ORC file format supported complex data types in gazelle
#581	implement regex/trim/split expr
#473	Optimize the ArrowColumnarToRow performance
#647	Leverage buffered write in shuffle
#674	Add translate expression support
#675	Add instr expression support
#645	Add support to cast data in bool type to bigint type or string type
#463	version bump on 1.3
#583	implement get_json_object
#640	Disable compression for tiny payloads in shuffle
#631	Do not write schema in shuffle writting
#609	Implement date related expression like to_date, date_sub
#629	Improve codegen failure handling
#612	Add metric "prepare time" for shuffle writer
#576	columnar FROM_UNIXTIME
#589	[ORC] Add TPCDS and TPCH UTs for ORC Format Reading
#537	Increase partition number adaptively for large SHJ stages
#580	document how to create metadata for data source V1 based testing
#555	support batch size > 32k
#561	document the code generation behavior on driver, suggest to deploy driver on powerful server
#523	Support ArrayType in ArrowColumnarToRow operator
#542	Add rule to propagate local window for rank + filter pattern
#21	JNI: Unexpected behavior when executing codes after calling JNIEnv::ThrowNew
#512	Add strategy to force use of SHJ
#518	Arrow buffer cleanup: Support both manual release and auto release as a hybrid mode
#525	Support AQE in columnWriter
#516	Support External Sort in sort kernel
#503	能提供下官网性能测试的详细配置吗？
#501	Remove ArrowRecordBatchBuilder and its usages
#461	Support ArrayType in Gazelle
#479	Optimize sort materialization
#449	Refactor sort codegen kernel
#667	1.3 RC release
#352	Map/Array/Struct type support for Parquet reading in Arrow Data Source

Bugs Fixed


#660	support string builder in window output
#636	Remove log4j 1.2 Support for security issue
#540	reuse subquery in TPC-DS Q14a
#687	log4j 1.2.17 in spark-core
#617	Exceptions handling for stoi, stol, stof, stod in whole stage code gen
#653	Handle special cases for get_json_object in WSCG
#650	Scala test ArrowColumnarBatchSerializerSuite is failing
#642	Fail to cast unresolved reference to attribute reference
#599	data source unit tests are broken
#604	Sort with special projection key broken
#627	adding security instructions
#615	An excpetion in trying to cast attribute in getResultAttrFromExpr of ConverterUtils
#588	preallocated memory for shuffle split
#606	NullpointerException getting map values from ArrowWritableColumnVector
#569	CPU overhead on fine grain / concurrent off-heap acquire operations
#553	Support casting string type to types like int, bigint, float, double
#514	Fix the core dump issue in Q93 when enable columnar2row
#532	Fix the failed UTs in ArrowEvalPythonExecSuite when enable ArrowColumnarToRow
#534	Columnar SHJ: Error if probing with empty record batch
#529	Wrong build side may be chosen for SemiJoin when forcing use of SHJ
#504	Fix non-decimal window function unit test failures
#493	Three unit tests newly failed on master branch

PRs


#690	[NSE-667] backport patches to 1.3 branch
#688	[NSE-687]remove exclude log4j when running ut
#686	[NSE-400] Fix the bug for negative decimal data
#685	[NSE-684] BufferedOutputStream causes massive futex system calls
#680	[NSE-667] backport patches to 1.3 branch
#683	[NSE-400] fix leakage in row to column operator
#637	[NSE-400] Native Arrow Row to columnar support
#648	[NSE-647] Leverage buffered write in shuffle
#682	[NSE-681] Add floor & ceil expression support
#672	[NSE-674] Add translate expression support
#676	[NSE-675] Add instr expression support
#652	[NSE-651]Use Hadoop 3.2 as default hadoop.version
#666	[NSE-667] backport patches to 1.3 branch
#644	[NSE-645] Add support to cast bool type to bigint type & string type
[#659](oap-project/gazelle_plugin#65...

Assets 3

22 Dec 09:01

zhixingheyi-tian

v1.2.1

e96bb70

v1.2.1

OAP 1.2.1 is a bug fix release to remove log4j 2 dependency, without new features and code changes.

OAP 1.2.1 has been updated to remove Log4j version 2.13.3 and may not include the latest functional and security updates. OAP 1.3 is targeted to be released in January 2022 and will include additional functional and/or security updates. Customers should update to the latest version as it becomes available.

Assets 4

03 Sep 10:00

zhixingheyi-tian

v1.2.0

710e38b

v1.2.0

OAP 1.2.0 is the second release we reorganized the source code and transit to the dedicated oap-project organization (https://github.com/oap-project). In this release, 36 issues/improvements were committed. We also completed first round Cloud integration evaluation for Native SQL Engine, OAP MLlib and SQL Data Source Cache on different Cloud platforms including AWS EMR, GCP Dataproc and AWS EKS. The performance data is not promising and 5 Cloud integration functionality and performance issues postponed. In addition, we released two other dedicated Conda packages for EMR and Dataproc Cloud integration.

Here are the major features/improvements in OAP 1.2.0:

SQL Data Source Cache adds K8S support with Unix domain socket deployment limitation left.
Native SQL Engine further optimization to gain 25% overall performance on TPC-DS 103 queries; adds RDD Cache support; adds Spill & UDF Support; completes Column to Row optimization feature; and further enhances the stability of performance.
OAP MLlib further enhances the stability of performance of KMeans, PCA & ALS; supports Linear regression & Bayes algorithm optimization for CPU and gains over 10x training performance for Linear regression algorithm; and completes functionality of KMeans & PCA algorithm support for GPU.
PMem Shuffle adds Remote PMem pool feature (experimental).

Gazelle Plugin

Features


#394	Support ColumnarArrowEvalPython operator
#368	Encountered Hadoop version (3.2.1) conflict issue on AWS EMR-6.3.0
#375	Implement a series of datetime functions
#183	Add Date/Timestamp type support
#362	make arrow-unsafe allocator as the default
#343	configurable codegen opt level
#333	Arrow Data Source: CSV format support fix
#223	Add Parquet write support to Arrow data source
#320	Add build option to enable unsafe Arrow allocator
#337	UDF: Add test case for validating basic row-based udf
#326	Update Scala unit test to spark-3.1.1

Performance


#400	Optimize ColumnarToRow Operator in NSE.
#411	enable ccache on C++ code compiling

Bugs Fixed


#358	Running TPC DS all queries with native-sql-engine for 10 rounds will have performance degradation problems in the last few rounds
#481	JVM heap memory leak on memory leak tracker facilities
#436	Fix for Arrow Data Source test suite
#317	persistent memory cache issue
#382	Hadoop version conflict when supporting to use gazelle_plugin on Google Cloud Dataproc
#384	ColumnarBatchScanExec reading parquet failed on java.lang.IllegalArgumentException: not all nodes and buffers were consumed
#370	Failed to get time zone: NoSuchElementException: None.get
#360	Cannot compile master branch.
#341	build failed on v2 with -Phadoop-3.2

PRs


#489	[NSE-481] JVM heap memory leak on memory leak tracker facilities (Arrow Allocator)
#486	[NSE-475] restore coalescebatches operator before window
#482	[NSE-481] JVM heap memory leak on memory leak tracker facilities
#470	[NSE-469] Lazy Read: Iterator objects are not correctly released
#464	[NSE-460] fix decimal partial sum in 1.2 branch
#439	[NSE-433]Support pre-built Jemalloc
#453	[NSE-254] remove arrow-data-source-common from jar with dependency
#452	[NSE-254]Fix redundant arrow library issue.
#432	[NSE-429] TPC-DS Q14a/b get slowed down within setting spark.oap.sql.columnar.sortmergejoin.lazyread=true
#426	[NSE-207] Fix aggregate and refresh UT test script
#442	[NSE-254]Issue0410 jar size
#441	[NSE-254]Issue0410 jar size
#440	[NSE-254]Solve the redundant arrow library issue
#437	[NSE-436] Fix for Arrow Data Source test suite
#387	[NSE-383] Release SMJ input data immediately after being used
#423	[NSE-417] fix sort spill on inplsace sort
#416	[NSE-207] fix left/right outer join in SMJ
#422	[NSE-421]Disable the wholestagecodegen feature for the ArrowColumnarToRow operator
#369	[NSE-417] Sort spill support framework
#401	[NSE-400] Optimize ColumnarToRow Operator in NSE.
#413	[NSE-411] adding ccache support
#393	[NSE-207] fix scala unit tests
#407	[NSE-403]Add Dataproc integration section to README
#406	[NSE-404]Modify repo name in documents
#402	[NSE-368]Update emr-6.3.0 support
#395	[NSE-394]Support ColumnarArrowEvalPython operator
#346	[NSE-317]fix columnar cache
#392	[NSE-382]Support GCP Dataproc 2.0
#388	[NSE-382]Fix Hadoop version issue
#385	[NSE-384] "Select count(*)" without group by results in error: java.lang.IllegalArgumentException: not all nodes and buffers were consumed
#374	[NSE-207] fix left anti join and support filter wo/ project
#376	[NSE-375] Implement a series of datetime functions
#373	[NSE-183] fix timestamp in native side
#356	[NSE-207] fix issues found in scala unit tests
#371	[NSE-370] Failed to get time zone: NoSuchElementException: None.get
#347	[NSE-183] Add Date/Timestamp type support
#363	[NSE-362] use arrow-unsafe allocator by default
#361	[NSE-273] Spark shim layer infrastructure
#364	[NSE-360] fix ut compile and travis test
#264	[NSE-207] fix issues found from join unit tests
#344	[NSE-343]allow to config codegen opt level
#342	[NSE-341] fix maven build failure
#324	[NSE-223] Add Parquet write support to Arrow data source
#321	[NSE-320] Add build option to enable unsafe Arrow allocator
#299	[NSE-207] fix unsuppored types in aggregate
#338	[NSE-337] UDF: Add test case for validating basic row-based udf
#336	[NSE-333] Arrow Data Source: CSV format support fix
#327	[NSE-326] update scala unit tests to spark-3.1.1

OAP MLlib

Features


#110	Update isOAPEnabled for Kmeans, PCA & ALS
[#108](https://github.com/oap-project/oap-m...

Assets 4

11 Jun 07:03

zhixingheyi-tian

v1.1.1-spark-3.1.1

b692488

v1.1.1-spark-3.1.1

OAP 1.1.1 is a maintenance released based on OAP 1.1.0. In this release, the major work is supporting Spark 3.1.1, 24 issues/improvements were committed.

Release 1.1.1

Native SQL Engine

Features


#304	Upgrade to Arrow 4.0.0
#285	ColumnarWindow: Support Date/Timestamp input in MAX/MIN
#297	Disable incremental compiler in CI
#245	Support columnar rdd cache
#276	Add option to switch Hadoop version
#274	Comment to trigger tpc-h RAM test
#256	CI: do not run ram report for each PR

Bugs Fixed


#325	java.util.ConcurrentModificationException: mutation occurred during iteration
#329	numPartitions are not the same
#318	fix Spark 311 on data source v2
#311	Build reports errors
#302	test on v2 failed due to an exception
#257	different version of slf4j-log4j
#293	Fix BHJ loss if key = 0
#248	arrow dependency must put after arrow installation

PRs


#332	[NSE-325] fix incremental compile issue with 4.5.x scala-maven-plugin
#335	[NSE-329] fix out partitioning in BHJ and SHJ
#328	[NSE-318]check schema before reuse exchange
#307	[NSE-304] Upgrade to Arrow 4.0.0
#312	[NSE-311] Build reports errors
#272	[NSE-273] support spark311
#303	[NSE-302] fix v2 test
#306	[NSE-304] Upgrade to Arrow 4.0.0: Change basic GHA TPC-H test target …
#286	[NSE-285] ColumnarWindow: Support Date input in MAX/MIN
#298	[NSE-297] Disable incremental compiler in GHA CI
#291	[NSE-257] fix multiple slf4j bindings
#294	[NSE-293] fix unsafemap with key = '0'
#233	[NSE-207] fix issues found from aggregate unit tests
#246	[NSE-245]Adding columnar RDD cache support
#289	[NSE-206]Update installation guide and configuration guide.
#277	[NSE-276] Add option to switch Hadoop version
#275	[NSE-274] Comment to trigger tpc-h RAM test
#271	[NSE-196] clean up configs in unit tests
#258	[NSE-257] fix different versions of slf4j-log4j12
#259	[NSE-248] fix arrow dependency order
#249	[NSE-241] fix hashagg result length
#255	[NSE-256] do not run ram report test on each PR

SQL Data Source Cache

Features


#118	port to Spark 3.1.1

Bugs Fixed


#121	OAP Index creation stuck issue

PRs


#132	Fix SampleBasedStatisticsSuite UnitTest case
#122	[ sql-ds-cache-121] Fix Index stuck issues
#119	[SQL-DS-CACHE-118][POAE7-1130] port sql-ds-cache to Spark3.1.1

OAP MLlib

Features


#26	[PIP] Support Spark 3.0.1 / 3.0.2 and upcoming 3.1.1

PRs


#39	[ML-26] Build for different spark version by -Pprofile

PMem Spill

Features


#34	Support vanilla spark 3.1.1

PRs


#41	[PMEM-SPILL-34][POAE7-1119]Port RDD cache to Spark 3.1.1 as separate module

PMem Common

Features


#10	add -mclflushopt flag to enable clflushopt for gcc
#8	use clflushopt instead of clflush

PRs


#11	[PMEM-COMMON-10][POAE7-1010]Add -mclflushopt flag to enable clflushop…
#9	[PMEM-COMMON-8][POAE7-896]use clflush optimize version for clflush

PMem Shuffle

Features


#15	Doesn't work with Spark3.1.1

PRs


#16	[pmem-shuffle-15] Make pmem-shuffle support Spark3.1.1

Remote Shuffle

Features


#18	upgrade to Spark-3.1.1
#11	Support DAOS Object Async API

PRs


#19	[REMOTE-SHUFFLE-18] upgrade to Spark-3.1.1
#14	[REMOTE-SHUFFLE-11] Support DAOS Object Async API

Assets 4

30 Apr 08:55

zhixingheyi-tian

v1.1.0-spark-3.0.0

c945f88

v1.1.0-spark-3.0.0

OAP 1.1.0 is the first release after OAP repository transition to dedicated oap-project github organization. In this release, 85 issues/improvements were committed, and we also provide website documentation.
Here are the major features/improvements in OAP 1.1.0:

OAP MLlib supports PCA & ALS algorithms based on the latest oneCCL version and gains 2x-3x training performance and good stability.
Native SQL Engine can gain over 30% overall performance on TPC-DS 103 queries benchmark with partition tables (tpcds-kit 2.4), and supports DecimalType.
SQL Data Source Cache improves the Plasma Backend stability.
PMem Spill adds FSDAX mode support.
Remote Shuffle adds DAOS support. (experimental)

Native SQL Engine

Features


#261	ArrowDataSource: Add S3 Support
#239	Adopt ARROW-7011
#62	Support Arrow's Build from Source and Package dependency library in the jar
#145	Support decimal in columnar window
#31	Decimal data type support
#128	Support Decimal in Aggregate
#130	Support decimal in project
#134	Update input metrics during reading
#120	Columnar window: Reduce peak memory usage and fix performance issues
#108	Add end-to-end test suite against TPC-DS
#68	Adaptive compression select in Shuffle.
#97	optimize null check in codegen sort
#29	Support mutiple-key sort without codegen
#75	Support HashAggregate in ColumnarWSCG
#73	improve columnar SMJ
#51	Decimal fallback
#38	Supporting expression as join keys in columnar SMJ
#27	Support REUSE exchange when DPP enabled
#17	ColumnarWSCG further optimization

Performance


#194	Arrow Parameters Update when compiling Arrow
#136	upgrade to arrow 3.0
#103	reduce codegen in multiple-key sort
#90	Refine HashAggregate to do everything in CPP

Bugs Fixed


#278	fix arrow dep in 1.1 branch
#265	TPC-DS Q67 failed with memmove exception in native split code.
#280	CMake version check
#241	TPC-DS q67 failed for XXH3_hashLong_64b_withSecret.constprop.0+0x180
#262	q18 has different digits compared with vanilla spark
#196	clean up options for native sql engine
#224	update 3rd party libs
#227	fix vulnerabilities from klockwork
#237	Add ARROW_CSV=ON to default C++ build commands
#229	Fix the deprecated code warning in shuffle_split_test
#119	consolidate batch size
#217	TPC-H query20 result not correct when use decimal dataset
#211	IndexOutOfBoundsException during running TPC-DS Q2
#167	Cannot successfully run q.14a.sql and q14b.sql when using double format for TPC-DS workload.
#191	libarrow.so and libgandiva.so not copy into the tmp directory
#179	Unable to find Arrow headers during build
#153	Fix incorrect queries after enabled Decimal
#173	fix the incorrect result of q69
#48	unit tests for c++ are broken
#101	ColumnarWindow: Remove obsolete debug code
#100	Incorrect result in Q45 w/ v2 bhj threshold is 10MB sf500
#81	Some ArrowVectorWriter implementations doesn't implement setNulls method
#82	Incorrect result in TPCDS Q72 SF1536
#70	Duplicate IsNull check in codegen sort
#64	Memleak in sort when SMJ is disabled
#58	Issues when running tpcds with DPP enabled and AQE disabled
#52	memory leakage in columnar SMJ
#53	Q24a/Q24b SHJ tail task took about 50 secs in SF1500
#42	reduce columnar sort memory footprint
#40	columnar sort codegen fallback to executor side
#1	columnar whole stage codegen failed due to empty results
#23	TPC-DS Q8 failed due to unsupported operation in columnar sortmergejoin
#22	TPC-DS Q95 failed due in columnar wscg
#4	columnar BHJ failed on new memory pool
#5	columnar BHJ failed on partitioned table with prefercolumnar=false

PRs


#288	[NSE-119] clean up on comments
#282	[NSE-280]fix cmake version check
#281	[NSE-280] bump cmake to 3.16
#279	[NSE-278]fix arrow dep in 1.1 branch
#268	[NSE-186] backport to 1.1 branch
#266	[NSE-265] Reserve enough memory before UnsafeAppend in builder
#270	[NSE-261] ArrowDataSource: Add S3 Support
#263	[NSE-262] fix remainer loss in decimal divide
#215	[NSE-196] clean up native sql options
#231	[NSE-176]Arrow install order issue
#242	[NSE-224] update third party code
#240	[NSE-239] Adopt ARROW-7011
#238	[NSE-237] Add ARROW_CSV=ON to default C++ build commands
#230	[NSE-229] Fix the deprecated code warning in shuffle_split_test
#225	[NSE-227]fix issues from codescan
#219	[NSE-217] fix missing decimal check
#212	[NSE-211] IndexOutOfBoundsException during running TPC-DS Q2
#187	[NSE-185] Avoid unnecessary copying when simply projecting on fields
#195	[NSE-194]Turn on several Arrow parameters
#189	[NSE-153] Following NSE-153, optimize fallback conditions for columnar window
#192	[NSE-191]Fix issue0191 for .so file copy to tmp.
#181	[NSE-179]Fix arrow include directory not include when using ARROW_ROOT
[#175](https://github...

Assets 7

Releases: oap-project/oap-tools

v1.5.0

CloudTik

Gazelle

OAP MLlib

Changelog

Gazelle Plugin

Features

Performance

Bugs Fixed

PRs

v1.4.0

Overview

Gazelle

CloudTik

OAP MLlib

Changelog

Gazelle

Features

Performance

Bugs Fixed

PRs

v1.3.1

Overview

Gazelle (Native SQL Engine)

OAP MLlib

Changelog

Gazelle Plugin

Features

Performance

Bugs Fixed

PRs

OAP MLlib

Features

v1.3.0

Overview

Changelog

Gazelle Plugin

Features

Bugs Fixed

PRs

v1.2.1

OAP 1.2.1 is a bug fix release to remove log4j 2 dependency, without new features and code changes.

v1.2.0

Gazelle Plugin

Features

Performance

Bugs Fixed

PRs

OAP MLlib

Features

v1.1.1-spark-3.1.1

Release 1.1.1

Native SQL Engine

Features

Bugs Fixed

PRs

SQL Data Source Cache​

Features

Bugs Fixed

PRs

OAP MLlib

Features

PRs

PMem Spill

Features

PRs

PMem Common

Features

PRs

PMem Shuffle

Features

PRs

Remote Shuffle

Features

PRs

v1.1.0-spark-3.0.0

Native SQL Engine

Features

Performance

SQL Data Source Cache