This repository was archived by the owner on Mar 27, 2024. It is now read-only.

v1.3.0

@zhixingheyi-tian zhixingheyi-tian released this 14 Jan 14:19
· 4 commits to branch-1.3 since this release
4d7567a

Overview

OAP 1.3.0 contains two major components: Gazelle and OAP MLlib. A total of 74 issues and improvements were committed in this release.
The major features and improvements in OAP 1.3.0 are listed below.

Gazelle (Native SQL Engine):

  • Further optimizations yield a 1.5X overall performance gain across the 103 TPC-DS queries;
  • Further optimizations yield a 1.7X overall performance gain across the 22 TPC-H queries;
  • Add ORC read support;
  • Add native row-to-column support;
  • Add support for 10+ expressions;
  • Complete functional and performance evaluation on the GCP Dataproc platform;
  • Bug fixes for Columnar WholeStage Codegen, Columnar Sort operators, etc.

OAP MLlib:

  • Add CPU optimizations for the Covariance and Low Order Moments algorithms;
  • Gain over 8x training performance for the Covariance algorithm;
  • Add support for multiple Spark versions (Spark 3.1.1 and Spark 3.2).

Changelog

Gazelle Plugin

Features

#550 [ORC] Support ORC Format Reading
#188 Support Dockerfile
#574 implement native LocalLimit/GlobalLimit
#684 BufferedOutputStream causes massive futex system calls
#465 Provide option to rely on JVM GC to release Arrow buffers in Java
#681 Enable gazelle to support two math expressions: ceil & floor
#651 Set Hadoop 3.2 as default in pom.xml
#126 speed up codegen
#596 [ORC] Verify whether ORC file format supported complex data types in gazelle
#581 implement regex/trim/split expr
#473 Optimize the ArrowColumnarToRow performance
#647 Leverage buffered write in shuffle
#674 Add translate expression support
#675 Add instr expression support
#645 Add support to cast data in bool type to bigint type or string type
#463 version bump on 1.3
#583 implement get_json_object
#640 Disable compression for tiny payloads in shuffle
#631 Do not write schema in shuffle writing
#609 Implement date related expression like to_date, date_sub
#629 Improve codegen failure handling
#612 Add metric "prepare time" for shuffle writer
#576 columnar FROM_UNIXTIME
#589 [ORC] Add TPCDS and TPCH UTs for ORC Format Reading
#537 Increase partition number adaptively for large SHJ stages
#580 document how to create metadata for data source V1 based testing
#555 support batch size > 32k
#561 document the code generation behavior on driver, suggest to deploy driver on powerful server
#523 Support ArrayType in ArrowColumnarToRow operator
#542 Add rule to propagate local window for rank + filter pattern
#21 JNI: Unexpected behavior when executing codes after calling JNIEnv::ThrowNew
#512 Add strategy to force use of SHJ
#518 Arrow buffer cleanup: Support both manual release and auto release as a hybrid mode
#525 Support AQE in columnWriter
#516 Support External Sort in sort kernel
#503 Could you provide the detailed configuration used for the performance tests on the official website?
#501 Remove ArrowRecordBatchBuilder and its usages
#461 Support ArrayType in Gazelle
#479 Optimize sort materialization
#449 Refactor sort codegen kernel
#667 1.3 RC release
#352 Map/Array/Struct type support for Parquet reading in Arrow Data Source
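
For readers unfamiliar with the two string expressions added in #674 and #675, the sketch below models the behavior that Spark SQL documents for `translate` and `instr`. It is plain Python written for illustration only: it is not Gazelle's native implementation, it ignores NULL handling, and the function names simply mirror the SQL expression names.

```python
def instr(s: str, substr: str) -> int:
    """1-based position of the first occurrence of substr in s, or 0 if absent."""
    # str.find returns -1 when substr is missing, which maps to 0 here.
    return s.find(substr) + 1


def translate(s: str, matching: str, replace: str) -> str:
    """Replace each character of `matching` found in s with the character at
    the same position in `replace`; characters of `matching` that have no
    counterpart in `replace` are deleted from the result."""
    table = {}
    for i, ch in enumerate(matching):
        # Assumption: the first occurrence of a repeated character in
        # `matching` decides its mapping.
        if ord(ch) not in table:
            table[ord(ch)] = replace[i] if i < len(replace) else None
    return s.translate(table)


print(instr("SparkSQL", "SQL"))          # 6
print(instr("SparkSQL", "Hive"))         # 0
print(translate("AaBbCc", "abc", "123")) # A1B2C3
print(translate("AaBbCc", "abc", "1"))   # A1BC
```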

Bugs Fixed

#660 support string builder in window output
#636 Remove log4j 1.2 Support for security issue
#540 reuse subquery in TPC-DS Q14a
#687 log4j 1.2.17 in spark-core
#617 Exceptions handling for stoi, stol, stof, stod in whole stage code gen
#653 Handle special cases for get_json_object in WSCG
#650 Scala test ArrowColumnarBatchSerializerSuite is failing
#642 Fail to cast unresolved reference to attribute reference
#599 data source unit tests are broken
#604 Sort with special projection key broken
#627 adding security instructions
#615 An exception in trying to cast attribute in getResultAttrFromExpr of ConverterUtils
#588 preallocated memory for shuffle split
#606 NullpointerException getting map values from ArrowWritableColumnVector
#569 CPU overhead on fine grain / concurrent off-heap acquire operations
#553 Support casting string type to types like int, bigint, float, double
#514 Fix the core dump issue in Q93 when enable columnar2row
#532 Fix the failed UTs in ArrowEvalPythonExecSuite when enable ArrowColumnarToRow
#534 Columnar SHJ: Error if probing with empty record batch
#529 Wrong build side may be chosen for SemiJoin when forcing use of SHJ
#504 Fix non-decimal window function unit test failures
#493 Three unit tests newly failed on master branch

PRs

#690 [NSE-667] backport patches to 1.3 branch
#688 [NSE-687]remove exclude log4j when running ut
#686 [NSE-400] Fix the bug for negative decimal data
#685 [NSE-684] BufferedOutputStream causes massive futex system calls
#680 [NSE-667] backport patches to 1.3 branch
#683 [NSE-400] fix leakage in row to column operator
#637 [NSE-400] Native Arrow Row to columnar support
#648 [NSE-647] Leverage buffered write in shuffle
#682 [NSE-681] Add floor & ceil expression support
#672 [NSE-674] Add translate expression support
#676 [NSE-675] Add instr expression support
#652 [NSE-651]Use Hadoop 3.2 as default hadoop.version
#666 [NSE-667] backport patches to 1.3 branch
#644 [NSE-645] Add support to cast bool type to bigint type & string type
#659 [NSE-650] Scala test ArrowColumnarBatchSerializerSuite is failing
#649 [NSE-660] fix window builder with string
#655 [NSE-617] Handle exception in cast expression from string to numeric types in WSCG
#654 [NSE-653] Add validity checking for get_json_object in WSCG
#641 [NSE-640] Disable compression for tiny payloads in shuffle
#646 [NSE-636]Remove log4j1 related unit tests
#488 [NSE-463] version bump to 1.3.0-SNAPSHOT
#639 [NSE-126] improve codegen with pre-compiled header
#638 [NSE-642] Correctly get ResultAttrFromExpr for sql with 'case when IN/AND/OR'
#632 [NSE-631] Do not write schema in shuffle writing
#633 [NSE-601] Fix an issue in the case of group by coalesce
#630 [NSE-629] improve codegen failure handling
#622 [NSE-609] Complete to_date expression support
#628 [NSE-627] Doc: adding security readme
#624 [NSE-609] Add support for date_sub expression
#619 [NSE-583] impl get_json_object in wscg
#614 [NSE-576] Support from_unixtime expression in the case that 'yyyyMMdd' format is required
#616 [NSE-615] Add tackling for ColumnarEqualTo type in getResultAttrFromExpr of ConverterUtils
#613 [NSE-612] Add metric "prepare time" for shuffle writer
#608 [NSE-602] don't enable columnar shuffle on unsupported data types
#601 [NSE-604] fix sort w/ proj keys
#607 [NSE-606] NullpointerException getting map values from ArrowWritableC…
#584 [NSE-583] implement get_json_object
#595 [NSE-576] fix from_unixtime
#582 [NSE-581]impl regexp_replace
#594 [NSE-588] config the pre-allocated memory for shuffle's splitter
#600 [NSE-599] fix datasource unit tests
#597 [NSE-596] Add complex data types validation for ORC file format in gazelle
#590 [NSE-569] CPU overhead on fine grain / concurrent off-heap acquire operations
#586 [NSE-581] Add trim, left trim, right trim support in expression
#578 [NSE-589] Add TPCDS and TPCH suite for Orc fileformat
#538 [NSE-537] Increase partition number adaptively for large SHJ stages
#587 [NSE-580] update doc on data source(DS V1/V2 usage)
#575 [NSE-574]implement columnar limit
#556 [NSE-555] using 32bit selection vector
#577 [NSE-576] implement columnar from_unixtime
#572 [NSE-561] refine docs on sample configurations and code generation behavior
#552 [NSE-553] Complete the support to cast string type to types like int, bigint, float, double
#543 [NSE-540] enable reuse subquery
#554 [NSE-207] change the fallback condition for Columnar Like
#559 [NSE-352] Map/Array/Struct type support for Parquet reading in ArrowData Source
#551 [NSE-550] Support ORC Format Reading in Gazelle
#545 [NSE-542] Add rule to propagate local window for rank + filter pattern
#541 [NSE-207] improve the fix for join optimization
#495 [NSE-207] Fix NaN in Max and Min
#533 [NSE-532] Fix the failed UTs in ArrowEvalPythonExecSuite when enable ArrowColumnarToRow
#536 [NSE-207] Ignore tests causing test stop
#535 [NSE-534] Columnar SHJ: Error if probing with empty record batch
#531 [NSE-21] JNI: Unexpected behavior when executing codes after calling JNIEnv::ThrowNew
#530 [NSE-529] Wrong build side may be chosen for SemiJoin when forcing use of SHJ
#524 [NSE-523] Support ArrayType in ArrowColumnarToRow optimization
#513 [NSE-512] Add strategy to force use of SHJ
#519 [NSE-518] Arrow buffer cleanup: Support both manual release and auto …
#526 [NSE-525]Support AQE for ColumnarWriter
#517 [NSE-516]Support ExternalSorter to control memory footage
#515 [NSE-514] Fix the core dump issue in Q93 with V2 test
#509 Update README.md for performance result.
#511 [NSE-207] fix full UT test
#502 [NSE-501] Remove ArrowRecordBatchBuilder and its usages
#507 Previous PR removed this UT, fix here
#496 [NSE-461]columnar shuffle support for ArrayType
#480 [NSE-479] optimize sort materialization
#474 [NSE-473]Optimize ArrowColumnarToRow performance
#505 [NSE-504] Fix non-decimal window function unit test
#497 [NSE-493] Three unit tests newly failed on master branch (Python UDF Unit Tests)
#466 [NSE-465] POC release memory using GC
#462 [NSE-461][WIP] Support ArrayType in ArrowWritableColumnVector and ColumarPandasUDF
#450 [NSE-449] Refactor codegen sort kernel
#471 [NSE-207] Enabling UT report
#445 [NSE-444]Support ArrowColumnarToRowExec when the root plan is ColumnarToRowExec
#447 [NSE-207] Fix date and timestamp functions
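
Most PR titles in the list above carry an `[NSE-<n>]` tag that appears to name the issue the PR addresses (for example, PR #685 is tagged `[NSE-684]`, matching issue #684). A minimal sketch of how these entries could be grouped by issue follows. The helper name `group_prs_by_issue` and both regular expressions are illustrative assumptions, not part of any OAP tooling; the pattern also accepts the `[ML-<n>]` tag used in the OAP MLlib PR titles.

```python
import re
from collections import defaultdict

# Matches the "[NSE-<n>]" / "[ML-<n>]" tag seen in this changelog's PR titles.
TAG = re.compile(r"\[(NSE|ML)-(\d+)\]")
# A changelog entry has the shape "#<number> <title>".
ENTRY = re.compile(r"^#(\d+)\s+(.*)$")


def group_prs_by_issue(lines):
    """Return {issue_number: [pr_number, ...]}; untagged PRs go under None."""
    groups = defaultdict(list)
    for line in lines:
        entry = ENTRY.match(line.strip())
        if entry is None:
            continue  # skip headings and blank lines
        pr_number, title = int(entry.group(1)), entry.group(2)
        tag = TAG.search(title)
        issue = int(tag.group(2)) if tag else None
        groups[issue].append(pr_number)
    return dict(groups)


sample = [
    "#685 [NSE-684] BufferedOutputStream causes massive futex system calls",
    "#584 [NSE-583] implement get_json_object",
    "#619 [NSE-583] impl get_json_object in wscg",
    "#509 Update README.md for performance result.",
]
print(group_prs_by_issue(sample))
# {684: [685], 583: [584, 619], None: [509]}
```

Grouping this way makes it easy to see that a single issue, such as #583, can be closed by several PRs, and that a few PRs (such as #509) carry no issue tag at all.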

OAP MLlib

Features

#158 [GPU] Add convertToSyclHomogen for row merged table for kmeans and pca
#149 [GPU] Add check-gpu utility
#140 [Core] Refactor and support multiple Spark versions in single JAR
#137 [Core] Multiple improvements for build & deploy and integrate oneAPI 2021.4
#133 [Correlation] Add Correlation algorithm
#125 [GPU] Update for Kmeans and PCA

Bugs Fixed

#161 [SDLe][Snyk] Log4j 1.2.* issues brought from Spark when scanning 3rd-party components for vulnerabilities
#155 [POM] Update scala version to 2.12.15
#135 [Core] Fix ccl::gather and Add ccl::gatherv

PRs

#162 [ML-161] Excluding log4j 1.x dependency from Spark core to avoid log4…
#159 [GPU] Add convertToSyclHomogen for row merged table for kmeans and pca
#157 [ML-155] [POM] Update scala version to 2.12.15
#150 [ML-149][GPU] Add check-gpu utility
#144 [ML-151] enable Summarizer with OAP
#141 [Core] Refactor and support multiple Spark versions in single JAR
#139 [ML-137] [Core] Multiple improvements for build & deploy and integrate oneAPI 2021.4
#127 [ML-133][Correlation] Add Correlation algorithm
#126 [ML-125][GPU] Update for Kmeans and PCA