This repository was archived by the owner on Mar 27, 2024. It is now read-only.
v1.3.0
Overview
OAP 1.3.0 contains two major components: Gazelle and OAP MLlib. In this release, 74 issues/improvements were committed.
Here are the major features/improvements in OAP 1.3.0.
Gazelle (Native SQL Engine):
- Further optimizations yielding a 1.5X overall performance gain across the 103 TPC-DS queries;
- Further optimizations yielding a 1.7X overall performance gain across the 22 TPC-H queries;
- Add ORC read support;
- Add native Row to Column support;
- Add support for 10+ expressions;
- Complete functional and performance evaluation on the GCP Dataproc platform;
- Bug fixes for Columnar WholeStage Codegen, Columnar Sort operators, etc.
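Several of the expressions added in this release (ceil, floor, instr, translate) follow Spark SQL semantics. As a rough illustration of what those expressions compute, here is a minimal plain-Python reference model. It is a sketch based on the documented Spark SQL behaviour, not Gazelle's native implementation, and the function names `instr` and `translate` below are illustrative helpers rather than part of any Gazelle API.

```python
import math


def instr(s: str, substr: str) -> int:
    """Return the 1-based index of the first occurrence of substr in s, or 0 if absent."""
    # str.find returns -1 when substr is missing, so adding 1 yields 0.
    return s.find(substr) + 1


def translate(s: str, matching: str, replace: str) -> str:
    """Replace each character of `matching` found in s with the character at the
    same position in `replace`; characters with no counterpart are removed."""
    table = {}
    for i, ch in enumerate(matching):
        # Keep the first mapping if a character is repeated in `matching`.
        table.setdefault(ord(ch), replace[i] if i < len(replace) else None)
    return s.translate(table)


if __name__ == "__main__":
    print(instr("SparkSQL", "SQL"))
    print(translate("AaBbCc", "abc", "123"))
    # ceil and floor round towards positive and negative infinity respectively.
    print(math.ceil(-0.1), math.floor(-0.1))
```

A model like this can be handy when comparing results between vanilla Spark and a columnar engine during validation, since both should agree on the same inputs.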
OAP MLlib:
- Add CPU-optimized Covariance and Low Order Moments algorithms;
- Gain over 8x training performance for the Covariance algorithm;
- Add support for multiple Spark versions (Spark 3.1.1 and Spark 3.2).
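For readers unfamiliar with the two algorithms named above, the following plain-Python sketch shows what a covariance matrix and a set of low-order moments are. It assumes the sample (n - 1) denominator and a small, illustrative subset of moments (min, max, sum, mean, variance). It is not the OAP MLlib or oneDAL implementation, and the function names are hypothetical.

```python
from typing import Dict, List, Sequence


def covariance_matrix(rows: Sequence[Sequence[float]]) -> List[List[float]]:
    """Sample covariance matrix of the columns of `rows` (n - 1 denominator)."""
    n = len(rows)
    if n < 2:
        raise ValueError("need at least two rows")
    d = len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(d)]
    cov = [[0.0] * d for _ in range(d)]
    for row in rows:
        # Deviation of this row from the column means.
        dev = [row[j] - means[j] for j in range(d)]
        for a in range(d):
            for b in range(d):
                cov[a][b] += dev[a] * dev[b]
    return [[value / (n - 1) for value in line] for line in cov]


def low_order_moments(column: Sequence[float]) -> Dict[str, float]:
    """A few low-order statistics of a single column."""
    n = len(column)
    if n < 2:
        raise ValueError("need at least two values")
    total = sum(column)
    mean = total / n
    variance = sum((x - mean) ** 2 for x in column) / (n - 1)
    return {
        "min": min(column),
        "max": max(column),
        "sum": total,
        "mean": mean,
        "variance": variance,
    }


if __name__ == "__main__":
    data = [[1.0, 2.0], [2.0, 4.0], [3.0, 6.0]]
    print(covariance_matrix(data))
    print(low_order_moments([row[0] for row in data]))
```

The nested loops make the d x d accumulation explicit. An optimized implementation would typically express the same computation as matrix products over partitioned data.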
Changelog
Gazelle Plugin
Features
#550 | [ORC] Support ORC Format Reading |
#188 | Support Dockerfile |
#574 | implement native LocalLimit/GlobalLimit |
#684 | BufferedOutputStream causes massive futex system calls |
#465 | Provide option to rely on JVM GC to release Arrow buffers in Java |
#681 | Enable gazelle to support two math expressions: ceil & floor |
#651 | Set Hadoop 3.2 as default in pom.xml |
#126 | speed up codegen |
#596 | [ORC] Verify whether ORC file format supported complex data types in gazelle |
#581 | implement regex/trim/split expr |
#473 | Optimize the ArrowColumnarToRow performance |
#647 | Leverage buffered write in shuffle |
#674 | Add translate expression support |
#675 | Add instr expression support |
#645 | Add support to cast data in bool type to bigint type or string type |
#463 | version bump on 1.3 |
#583 | implement get_json_object |
#640 | Disable compression for tiny payloads in shuffle |
#631 | Do not write schema in shuffle writing |
#609 | Implement date related expression like to_date, date_sub |
#629 | Improve codegen failure handling |
#612 | Add metric "prepare time" for shuffle writer |
#576 | columnar FROM_UNIXTIME |
#589 | [ORC] Add TPCDS and TPCH UTs for ORC Format Reading |
#537 | Increase partition number adaptively for large SHJ stages |
#580 | document how to create metadata for data source V1 based testing |
#555 | support batch size > 32k |
#561 | document the code generation behavior on driver, suggest to deploy driver on powerful server |
#523 | Support ArrayType in ArrowColumnarToRow operator |
#542 | Add rule to propagate local window for rank + filter pattern |
#21 | JNI: Unexpected behavior when executing codes after calling JNIEnv::ThrowNew |
#512 | Add strategy to force use of SHJ |
#518 | Arrow buffer cleanup: Support both manual release and auto release as a hybrid mode |
#525 | Support AQE in columnWriter |
#516 | Support External Sort in sort kernel |
#503 | Could you provide the detailed configuration used for the performance tests on the official website? |
#501 | Remove ArrowRecordBatchBuilder and its usages |
#461 | Support ArrayType in Gazelle |
#479 | Optimize sort materialization |
#449 | Refactor sort codegen kernel |
#667 | 1.3 RC release |
#352 | Map/Array/Struct type support for Parquet reading in Arrow Data Source |
Bugs Fixed
#660 | support string builder in window output |
#636 | Remove log4j 1.2 Support for security issue |
#540 | reuse subquery in TPC-DS Q14a |
#687 | log4j 1.2.17 in spark-core |
#617 | Exceptions handling for stoi, stol, stof, stod in whole stage code gen |
#653 | Handle special cases for get_json_object in WSCG |
#650 | Scala test ArrowColumnarBatchSerializerSuite is failing |
#642 | Fail to cast unresolved reference to attribute reference |
#599 | data source unit tests are broken |
#604 | Sort with special projection key broken |
#627 | adding security instructions |
#615 | An exception in trying to cast attribute in getResultAttrFromExpr of ConverterUtils |
#588 | preallocated memory for shuffle split |
#606 | NullpointerException getting map values from ArrowWritableColumnVector |
#569 | CPU overhead on fine grain / concurrent off-heap acquire operations |
#553 | Support casting string type to types like int, bigint, float, double |
#514 | Fix the core dump issue in Q93 when enable columnar2row |
#532 | Fix the failed UTs in ArrowEvalPythonExecSuite when enable ArrowColumnarToRow |
#534 | Columnar SHJ: Error if probing with empty record batch |
#529 | Wrong build side may be chosen for SemiJoin when forcing use of SHJ |
#504 | Fix non-decimal window function unit test failures |
#493 | Three unit tests newly failed on master branch |
PRs
#690 | [NSE-667] backport patches to 1.3 branch |
#688 | [NSE-687]remove exclude log4j when running ut |
#686 | [NSE-400] Fix the bug for negative decimal data |
#685 | [NSE-684] BufferedOutputStream causes massive futex system calls |
#680 | [NSE-667] backport patches to 1.3 branch |
#683 | [NSE-400] fix leakage in row to column operator |
#637 | [NSE-400] Native Arrow Row to columnar support |
#648 | [NSE-647] Leverage buffered write in shuffle |
#682 | [NSE-681] Add floor & ceil expression support |
#672 | [NSE-674] Add translate expression support |
#676 | [NSE-675] Add instr expression support |
#652 | [NSE-651]Use Hadoop 3.2 as default hadoop.version |
#666 | [NSE-667] backport patches to 1.3 branch |
#644 | [NSE-645] Add support to cast bool type to bigint type & string type |
#659 | [NSE-650] Scala test ArrowColumnarBatchSerializerSuite is failing |
#649 | [NSE-660] fix window builder with string |
#655 | [NSE-617] Handle exception in cast expression from string to numeric types in WSCG |
#654 | [NSE-653] Add validity checking for get_json_object in WSCG |
#641 | [NSE-640] Disable compression for tiny payloads in shuffle |
#646 | [NSE-636]Remove log4j1 related unit tests |
#488 | [NSE-463] version bump to 1.3.0-SNAPSHOT |
#639 | [NSE-126] improve codegen with pre-compiled header |
#638 | [NSE-642] Correctly get ResultAttrFromExpr for sql with 'case when IN/AND/OR' |
#632 | [NSE-631] Do not write schema in shuffle writing |
#633 | [NSE-601] Fix an issue in the case of group by coalesce |
#630 | [NSE-629] improve codegen failure handling |
#622 | [NSE-609] Complete to_date expression support |
#628 | [NSE-627] Doc: adding security readme |
#624 | [NSE-609] Add support for date_sub expression |
#619 | [NSE-583] impl get_json_object in wscg |
#614 | [NSE-576] Support from_unixtime expression in the case that 'yyyyMMdd' format is required |
#616 | [NSE-615] Add tackling for ColumnarEqualTo type in getResultAttrFromExpr of ConverterUtils |
#613 | [NSE-612] Add metric "prepare time" for shuffle writer |
#608 | [NSE-602] don't enable columnar shuffle on unsupported data types |
#601 | [NSE-604] fix sort w/ proj keys |
#607 | [NSE-606] NullpointerException getting map values from ArrowWritableC… |
#584 | [NSE-583] implement get_json_object |
#595 | [NSE-576] fix from_unixtime |
#582 | [NSE-581]impl regexp_replace |
#594 | [NSE-588] config the pre-allocated memory for shuffle's splitter |
#600 | [NSE-599] fix datasource unit tests |
#597 | [NSE-596] Add complex data types validation for ORC file format in gazelle |
#590 | [NSE-569] CPU overhead on fine grain / concurrent off-heap acquire operations |
#586 | [NSE-581] Add trim, left trim, right trim support in expression |
#578 | [NSE-589] Add TPCDS and TPCH suite for Orc fileformat |
#538 | [NSE-537] Increase partition number adaptively for large SHJ stages |
#587 | [NSE-580] update doc on data source(DS V1/V2 usage) |
#575 | [NSE-574]implement columnar limit |
#556 | [NSE-555] using 32bit selection vector |
#577 | [NSE-576] implement columnar from_unixtime |
#572 | [NSE-561] refine docs on sample configurations and code generation behavior |
#552 | [NSE-553] Complete the support to cast string type to types like int, bigint, float, double |
#543 | [NSE-540] enable reuse subquery |
#554 | [NSE-207] change the fallback condition for Columnar Like |
#559 | [NSE-352] Map/Array/Struct type support for Parquet reading in ArrowData Source |
#551 | [NSE-550] Support ORC Format Reading in Gazelle |
#545 | [NSE-542] Add rule to propagate local window for rank + filter pattern |
#541 | [NSE-207] improve the fix for join optimization |
#495 | [NSE-207] Fix NaN in Max and Min |
#533 | [NSE-532] Fix the failed UTs in ArrowEvalPythonExecSuite when enable ArrowColumnarToRow |
#536 | [NSE-207] Ignore tests causing test stop |
#535 | [NSE-534] Columnar SHJ: Error if probing with empty record batch |
#531 | [NSE-21] JNI: Unexpected behavior when executing codes after calling JNIEnv::ThrowNew |
#530 | [NSE-529] Wrong build side may be chosen for SemiJoin when forcing use of SHJ |
#524 | [NSE-523] Support ArrayType in ArrowColumnarToRow optimization |
#513 | [NSE-512] Add strategy to force use of SHJ |
#519 | [NSE-518] Arrow buffer cleanup: Support both manual release and auto … |
#526 | [NSE-525]Support AQE for ColumnarWriter |
#517 | [NSE-516]Support ExternalSorter to control memory footage |
#515 | [NSE-514] Fix the core dump issue in Q93 with V2 test |
#509 | Update README.md for performance result. |
#511 | [NSE-207] fix full UT test |
#502 | [NSE-501] Remove ArrowRecordBatchBuilder and its usages |
#507 | Previous PR removed this UT, fix here |
#496 | [NSE-461]columnar shuffle support for ArrayType |
#480 | [NSE-479] optimize sort materialization |
#474 | [NSE-473]Optimize ArrowColumnarToRow performance |
#505 | [NSE-504] Fix non-decimal window function unit test |
#497 | [NSE-493] Three unit tests newly failed on master branch (Python UDF Unit Tests) |
#466 | [NSE-465] POC release memory using GC |
#462 | [NSE-461][WIP] Support ArrayType in ArrowWritableColumnVector and ColumarPandasUDF |
#450 | [NSE-449] Refactor codegen sort kernel |
#471 | [NSE-207] Enabling UT report |
#445 | [NSE-444]Support ArrowColumnarToRowExec when the root plan is ColumnarToRowExec |
#447 | [NSE-207] Fix date and timestamp functions |
OAP MLlib
Features
#158 | [GPU] Add convertToSyclHomogen for row merged table for kmeans and pca |
#149 | [GPU] Add check-gpu utility |
#140 | [Core] Refactor and support multiple Spark versions in single JAR |
#137 | [Core] Multiple improvements for build & deploy and integrate oneAPI 2021.4 |
#133 | [Correlation] Add Correlation algorithm |
#125 | [GPU] Update for Kmeans and PCA |
Bugs Fixed
#161 | [SDLe][Snyk] Log4j 1.2.* issues brought from Spark when scanning 3rd-party components for vulnerabilities |
#155 | [POM] Update scala version to 2.12.15 |
#135 | [Core] Fix ccl::gather and Add ccl::gatherv |
PRs
#162 | [ML-161] Excluding log4j 1.x dependency from Spark core to avoid log4… |
#159 | [GPU] Add convertToSyclHomogen for row merged table for kmeans and pca |
#157 | [ML-155] [POM] Update scala version to 2.12.15 |
#150 | [ML-149][GPU] Add check-gpu utility |
#144 | [ML-151] enable Summarizer with OAP |
#141 | [Core] Refactor and support multiple Spark versions in single JAR |
#139 | [ML-137] [Core] Multiple improvements for build & deploy and integrate oneAPI 2021.4 |
#127 | [ML-133][Correlation] Add Correlation algorithm |
#126 | [ML-125][GPU] Update for Kmeans and PCA |