Overview

OAP 1.3.0 contains two major components: Gazelle and OAP MLlib. In this release, 74 issues/improvements were committed.
Here are the major features/improvements in OAP 1.3.0.

Gazelle (Native SQL Engine):

Further optimization to gain 1.5X overall performance on TPC-DS 103 queries;
Further optimization to gain 1.7X overall performance on TPC-H 22 queries;
Add ORC Read support;
Add Native Row to Column support;
Add 10+ expressions support;
Complete function & performance evaluation on GCP Dataproc platform;
Bug fixes for Columnar WholeStage Codegen, Columnar Sort operators, etc.

OAP MLlib:

Support Covariance & Low Order Moments algorithms optimization for CPU;
Gain over 8x training performance for the Covariance algorithm;
Add support for multiple Spark versions (Spark 3.1.1 and Spark 3.2).

Changelog

Gazelle Plugin

Features


#550	[ORC] Support ORC Format Reading
#188	Support Dockerfile
#574	implement native LocalLimit/GlobalLimit
#684	BufferedOutputStream causes massive futex system calls
#465	Provide option to rely on JVM GC to release Arrow buffers in Java
#681	Enable gazelle to support two math expressions: ceil & floor
#651	Set Hadoop 3.2 as default in pom.xml
#126	speed up codegen
#596	[ORC] Verify whether ORC file format supported complex data types in gazelle
#581	implement regex/trim/split expr
#473	Optimize the ArrowColumnarToRow performance
#647	Leverage buffered write in shuffle
#674	Add translate expression support
#675	Add instr expression support
#645	Add support to cast data in bool type to bigint type or string type
#463	version bump on 1.3
#583	implement get_json_object
#640	Disable compression for tiny payloads in shuffle
#631	Do not write schema in shuffle writing
#609	Implement date related expression like to_date, date_sub
#629	Improve codegen failure handling
#612	Add metric "prepare time" for shuffle writer
#576	columnar FROM_UNIXTIME
#589	[ORC] Add TPCDS and TPCH UTs for ORC Format Reading
#537	Increase partition number adaptively for large SHJ stages
#580	document how to create metadata for data source V1 based testing
#555	support batch size > 32k
#561	document the code generation behavior on driver, suggest to deploy driver on powerful server
#523	Support ArrayType in ArrowColumnarToRow operator
#542	Add rule to propagate local window for rank + filter pattern
#21	JNI: Unexpected behavior when executing codes after calling JNIEnv::ThrowNew
#512	Add strategy to force use of SHJ
#518	Arrow buffer cleanup: Support both manual release and auto release as a hybrid mode
#525	Support AQE in columnWriter
#516	Support External Sort in sort kernel
#503	Could you provide the detailed configuration used for the performance tests on the official website?
#501	Remove ArrowRecordBatchBuilder and its usages
#461	Support ArrayType in Gazelle
#479	Optimize sort materialization
#449	Refactor sort codegen kernel
#667	1.3 RC release
#352	Map/Array/Struct type support for Parquet reading in Arrow Data Source

Bugs Fixed


#660	support string builder in window output
#636	Remove log4j 1.2 Support for security issue
#540	reuse subquery in TPC-DS Q14a
#687	log4j 1.2.17 in spark-core
#617	Exceptions handling for stoi, stol, stof, stod in whole stage code gen
#653	Handle special cases for get_json_object in WSCG
#650	Scala test ArrowColumnarBatchSerializerSuite is failing
#642	Fail to cast unresolved reference to attribute reference
#599	data source unit tests are broken
#604	Sort with special projection key broken
#627	adding security instructions
#615	An exception in trying to cast attribute in getResultAttrFromExpr of ConverterUtils
#588	preallocated memory for shuffle split
#606	NullpointerException getting map values from ArrowWritableColumnVector
#569	CPU overhead on fine grain / concurrent off-heap acquire operations
#553	Support casting string type to types like int, bigint, float, double
#514	Fix the core dump issue in Q93 when enable columnar2row
#532	Fix the failed UTs in ArrowEvalPythonExecSuite when enable ArrowColumnarToRow
#534	Columnar SHJ: Error if probing with empty record batch
#529	Wrong build side may be chosen for SemiJoin when forcing use of SHJ
#504	Fix non-decimal window function unit test failures
#493	Three unit tests newly failed on master branch

PRs


#690	[NSE-667] backport patches to 1.3 branch
#688	[NSE-687]remove exclude log4j when running ut
#686	[NSE-400] Fix the bug for negative decimal data
#685	[NSE-684] BufferedOutputStream causes massive futex system calls
#680	[NSE-667] backport patches to 1.3 branch
#683	[NSE-400] fix leakage in row to column operator
#637	[NSE-400] Native Arrow Row to columnar support
#648	[NSE-647] Leverage buffered write in shuffle
#682	[NSE-681] Add floor & ceil expression support
#672	[NSE-674] Add translate expression support
#676	[NSE-675] Add instr expression support
#652	[NSE-651]Use Hadoop 3.2 as default hadoop.version
#666	[NSE-667] backport patches to 1.3 branch
#644	[NSE-645] Add support to cast bool type to bigint type & string type
#659	[NSE-650] Scala test ArrowColumnarBatchSerializerSuite is failing
#649	[NSE-660] fix window builder with string
#655	[NSE-617] Handle exception in cast expression from string to numeric types in WSCG
#654	[NSE-653] Add validity checking for get_json_object in WSCG
#641	[NSE-640] Disable compression for tiny payloads in shuffle
#646	[NSE-636]Remove log4j1 related unit tests
#488	[NSE-463] version bump to 1.3.0-SNAPSHOT
#639	[NSE-126] improve codegen with pre-compiled header
#638	[NSE-642] Correctly get ResultAttrFromExpr for sql with 'case when IN/AND/OR'
#632	[NSE-631] Do not write schema in shuffle writing
#633	[NSE-601] Fix an issue in the case of group by coalesce
#630	[NSE-629] improve codegen failure handling
#622	[NSE-609] Complete to_date expression support
#628	[NSE-627] Doc: adding security readme
#624	[NSE-609] Add support for date_sub expression
#619	[NSE-583] impl get_json_object in wscg
#614	[NSE-576] Support from_unixtime expression in the case that 'yyyyMMdd' format is required
#616	[NSE-615] Add tackling for ColumnarEqualTo type in getResultAttrFromExpr of ConverterUtils
#613	[NSE-612] Add metric "prepare time" for shuffle writer
#608	[NSE-602] don't enable columnar shuffle on unsupported data types
#601	[NSE-604] fix sort w/ proj keys
#607	[NSE-606] NullpointerException getting map values from ArrowWritableC…
#584	[NSE-583] implement get_json_object
#595	[NSE-576] fix from_unixtime
#582	[NSE-581]impl regexp_replace
#594	[NSE-588] config the pre-allocated memory for shuffle's splitter
#600	[NSE-599] fix datasource unit tests
#597	[NSE-596] Add complex data types validation for ORC file format in gazelle
#590	[NSE-569] CPU overhead on fine grain / concurrent off-heap acquire operations
#586	[NSE-581] Add trim, left trim, right trim support in expression
#578	[NSE-589] Add TPCDS and TPCH suite for Orc fileformat
#538	[NSE-537] Increase partition number adaptively for large SHJ stages
#587	[NSE-580] update doc on data source(DS V1/V2 usage)
#575	[NSE-574]implement columnar limit
#556	[NSE-555] using 32bit selection vector
#577	[NSE-576] implement columnar from_unixtime
#572	[NSE-561] refine docs on sample configurations and code generation behavior
#552	[NSE-553] Complete the support to cast string type to types like int, bigint, float, double
#543	[NSE-540] enable reuse subquery
#554	[NSE-207] change the fallback condition for Columnar Like
#559	[NSE-352] Map/Array/Struct type support for Parquet reading in ArrowData Source
#551	[NSE-550] Support ORC Format Reading in Gazelle
#545	[NSE-542] Add rule to propagate local window for rank + filter pattern
#541	[NSE-207] improve the fix for join optimization
#495	[NSE-207] Fix NaN in Max and Min
#533	[NSE-532] Fix the failed UTs in ArrowEvalPythonExecSuite when enable ArrowColumnarToRow
#536	[NSE-207] Ignore tests causing test stop
#535	[NSE-534] Columnar SHJ: Error if probing with empty record batch
#531	[NSE-21] JNI: Unexpected behavior when executing codes after calling JNIEnv::ThrowNew
#530	[NSE-529] Wrong build side may be chosen for SemiJoin when forcing use of SHJ
#524	[NSE-523] Support ArrayType in ArrowColumnarToRow optimization
#513	[NSE-512] Add strategy to force use of SHJ
#519	[NSE-518] Arrow buffer cleanup: Support both manual release and auto …
#526	[NSE-525]Support AQE for ColumnarWriter
#517	[NSE-516]Support ExternalSorter to control memory footage
#515	[NSE-514] Fix the core dump issue in Q93 with V2 test
#509	Update README.md for performance result.
#511	[NSE-207] fix full UT test
#502	[NSE-501] Remove ArrowRecordBatchBuilder and its usages
#507	Previous PR removed this UT, fix here
#496	[NSE-461]columnar shuffle support for ArrayType
#480	[NSE-479] optimize sort materialization
#474	[NSE-473]Optimize ArrowColumnarToRow performance
#505	[NSE-504] Fix non-decimal window function unit test
#497	[NSE-493] Three unit tests newly failed on master branch (Python UDF Unit Tests)
#466	[NSE-465] POC release memory using GC
#462	[NSE-461][WIP] Support ArrayType in ArrowWritableColumnVector and ColumarPandasUDF
#450	[NSE-449] Refactor codegen sort kernel
#471	[NSE-207] Enabling UT report
#445	[NSE-444]Support ArrowColumnarToRowExec when the root plan is ColumnarToRowExec
#447	[NSE-207] Fix date and timestamp functions

OAP MLlib

Features


#158	[GPU] Add convertToSyclHomogen for row merged table for kmeans and pca
#149	[GPU] Add check-gpu utility
#140	[Core] Refactor and support multiple Spark versions in single JAR
#137	[Core] Multiple improvements for build & deploy and integrate oneAPI 2021.4
#133	[Correlation] Add Correlation algorithm
#125	[GPU] Update for Kmeans and PCA

Bugs Fixed


#161	[SDLe][Snyk] Log4j 1.2.* issues brought from Spark when scanning 3rd-party components for vulnerabilities
#155	[POM] Update scala version to 2.12.15
#135	[Core] Fix ccl::gather and Add ccl::gatherv

PRs


#162	[ML-161] Excluding log4j 1.x dependency from Spark core to avoid log4…
#159	[GPU] Add convertToSyclHomogen for row merged table for kmeans and pca
#157	[ML-155] [POM] Update scala version to 2.12.15
#150	[ML-149][GPU] Add check-gpu utility
#144	[ML-151] enable Summarizer with OAP
#141	[Core] Refactor and support multiple Spark versions in single JAR
#139	[ML-137] [Core] Multiple improvements for build & deploy and integrate oneAPI 2021.4
#127	[ML-133][Correlation] Add Correlation algorithm
#126	[ML-125][GPU] Update for Kmeans and PCA
