ShifuML · Xin999 · May 12, 2017 · May 15, 2017 · May 15, 2017 · May 15, 2017
diff --git a/.gitignore b/.gitignore
@@ -4,4 +4,5 @@ target
 .classpath
 .project
 .settings
+.DS_Store
 test-output
diff --git a/CHANGES.txt b/CHANGES.txt
@@ -1,5 +1,5 @@
 /**
- * Copyright [2012-2014] eBay Software Foundation
+ * Copyright [2012-2014] PayPal Software Foundation
  *  
  * Licensed under the Apache License, Version 2.0 (the "License");
  * you may not use this file except in compliance with the License.
@@ -16,6 +16,242 @@
 
 Shifu Change Log
 
+Changes for Shifu-0.10.5
+    * Optimize IndependentTreeModel by decreasing model memory to 70% and CPU time to 90%
+    * Upgrade Guagua to 0.7.0 to fix a bug on empty gzip files in one worker
+
+Changes for Shifu-0.10.4
+    * Optimize IndependentTreeModel by splitting regression and classification.
+    * Add new version of fast correlation computing.
+
+Changes for Shifu-0.10.3
+    * Fix GBT SLA Categorical Feature Rebin Delimiter Issue: change delimiter to '@^'
+
+Changes for Shifu-0.10.2
+    * Fix GBT SLA issue: pre-parse double types only once.
+
+Changes for Shifu-0.10.1
+    * Fix a major bug in 'baggingWithReplacement':
+        https://github.com/ShifuML/shifu/issues/335
+
+Changes for Shifu-0.10.0
+    * Tree Ensemble Model Improvement
+        a) Speed GBT Training
+            https://github.com/ShifuML/shifu/issues/252
+        b) Auto Skip Features with only One Bin 
+            https://github.com/ShifuML/shifu/issues/276
+        c) Convert GBT Regression Score To Probability
+            https://github.com/ShifuML/shifu/issues/254
+        d) Add Early Stop Feature for GBT
+            https://github.com/ShifuML/shifu/issues/230
+        e) By Default Disable Tmp Model Output in NN and GBT, RF
+            https://github.com/ShifuML/shifu/issues/231
+        f) GBT & RF PMML Support
+            https://github.com/ShifuML/shifu/issues/232
+        g) Grid Search: Compute validation error on latest 10 or 20 iterations 
+            https://github.com/ShifuML/shifu/issues/233
+        h) Missing Value Processing in Tree Model
+            https://github.com/ShifuML/shifu/issues/239
+        i) Make Tree Model Without Dependency 
+            https://github.com/ShifuML/shifu/issues/253
+        j) Compress Tree Model by Gzip to Save Size
+            https://github.com/ShifuML/shifu/issues/272
+    * Train Step Improvement
+        a) Sampling Logic Change in Training
+            https://github.com/ShifuML/shifu/issues/310
+        b) Add Stratified Sampling in Training Step
+            https://github.com/ShifuML/shifu/issues/311
+        c) Add Cross Validation in Train Step
+            https://github.com/ShifuML/shifu/issues/312
+        d) Guagua Job Failed Improvement
+            https://github.com/ShifuML/shifu/issues/237
+        e) Disable Tmp Model Output in NN and GBT, RF
+            https://github.com/ShifuML/shifu/issues/231
+        f) Support Redo Training Without Weight after Weighted Norm
+            https://github.com/ShifuML/shifu/issues/315
+    * VarSel Step Improvement
+        a) Refine VarSel Configurations
+            https://github.com/ShifuML/shifu/issues/262
+        b) Enable Multiple Threading in Sensitivity Analysis
+            https://github.com/ShifuML/shifu/issues/213
+        c) Add Feature Importance for Tree Model VarSelect
+            https://github.com/ShifuML/shifu/issues/218
+    * Stats Step Improvement
+        a) Add More Stats in Stats Step
+            https://github.com/ShifuML/shifu/issues/313
+        b) Change Distinct Count Computing from Init to Stats
+            https://github.com/ShifuML/shifu/issues/314
+        c) Bugs & Others
+            1) Default meta/categorical file support
+            2) Bug in stats on bad feature type
+            3) Add more stats on each MR job, like the number of filtered records.
+    * Others
+        a) Eval Step Improvement
+            https://github.com/ShifuML/shifu/issues/150
+        b) CSV Format File Support
+            https://github.com/ShifuML/shifu/issues/258
+        c) Combo Model Training (Beta)
+            https://github.com/ShifuML/shifu/issues/316
+
+Changes for Shifu-0.9.0
+    * Random Forest Enhancement
+        a) RF & GBDT Sort Categorical Features
+            https://github.com/ShifuML/shifu/issues/203
+        b) RF & GBDT Categorical Variables Unsorted Supported
+            https://github.com/ShifuML/shifu/issues/202
+    * Gradient Boosted Trees Enhancement
+        a) Master Fail Over
+            https://github.com/ShifuML/shifu/issues/227
+        b) GBT Support Continuous Model Training
+            https://github.com/ShifuML/shifu/issues/222
+        c) RF & GBDT Sort Categorical Features
+            https://github.com/ShifuML/shifu/issues/203
+        d) RF & GBDT Categorical Variables Unsorted Supported
+            https://github.com/ShifuML/shifu/issues/202
+    * Grid Search Support
+        a) NN Grid Search
+            https://github.com/ShifuML/shifu/issues/214
+        b) RF & GBDT Grid Search
+            https://github.com/ShifuML/shifu/issues/213
+    * Random Search Support
+        https://github.com/ShifuML/shifu/issues/234
+    * Multiple Classification Enhancement
+        a) Add Random Forest Multiple Classification
+            https://github.com/ShifuML/shifu/issues/235
+        b) Add OneVSAll Multiple Classification for NN, RF and GBDT
+            https://github.com/ShifuML/shifu/issues/209
+    * Dynamic Binning Support
+        https://github.com/ShifuML/shifu/issues/236
+    * Others
+        a) https://github.com/ShifuML/shifu/issues/195
+        b) https://github.com/ShifuML/shifu/issues/229
+
+Changes for Shifu-0.2.8
+    * Random Forest Support
+        a) https://github.com/ShifuML/shifu/issues/123
+        b) https://github.com/ShifuML/shifu/issues/122
+    * Gradient Boosted Trees Support
+        a) https://github.com/ShifuML/shifu/issues/124
+        b) https://github.com/ShifuML/shifu/issues/122
+    * Feature Importance in 'posttrain' Step
+        https://github.com/ShifuML/shifu/issues/180
+    * PSI Feature in 'stats' Step
+        https://github.com/ShifuML/shifu/issues/196
+    * Correlation Between Features in 'norm' Step
+        https://github.com/ShifuML/shifu/issues/146
+    * Others
+        a) https://github.com/ShifuML/shifu/issues/190
+        b) https://github.com/ShifuML/shifu/issues/181
+        c) https://github.com/ShifuML/shifu/issues/179
+        d) https://github.com/ShifuML/shifu/issues/178
+
+Changes for Shifu-0.2.7
+    * Sampling Function Improvement
+        a) https://github.com/ShifuML/shifu/issues/93
+        b) https://github.com/ShifuML/shifu/issues/140
+    * Binning Improvement
+        a) https://github.com/ShifuML/shifu/issues/148
+        b) https://github.com/ShifuML/shifu/issues/157
+    * Stats Step Improvement
+        a) https://github.com/ShifuML/shifu/issues/155
+        b) https://github.com/ShifuML/shifu/issues/137
+        c) https://github.com/ShifuML/shifu/issues/75
+    * Norm Step Improvement
+        a) https://github.com/ShifuML/shifu/issues/103
+        b) https://github.com/ShifuML/shifu/issues/120
+        c) https://github.com/ShifuML/shifu/issues/131
+        d) https://github.com/ShifuML/shifu/issues/142
+    * Train Step Improvement
+        a) https://github.com/ShifuML/shifu/issues/66
+        b) https://github.com/ShifuML/shifu/issues/159
+        c) https://github.com/ShifuML/shifu/issues/166
+        d) https://github.com/ShifuML/shifu/issues/106
+    * Variable Selection Step Improvement
+        a) https://github.com/ShifuML/shifu/issues/57
+        b) https://github.com/ShifuML/shifu/issues/102
+    * Distributed LR Algorithm Improvement (Experimental)
+        a) https://github.com/ShifuML/shifu/issues/56
+    * Multiple classes NN Algorithm Improvement (Experimental)
+        a) https://github.com/ShifuML/shifu/issues/149
+    * Pig on Tez Support
+
+Changes for Shifu-0.2.6
+    * https://github.com/ShifuML/shifu/issues/133: Add skewness and kurtosis stats
+    * https://github.com/ShifuML/shifu/issues/134: Add CSV ColumnConfig Format for ColumnConfig.json
+    * https://github.com/ShifuML/shifu/issues/117: Add AUC Computation on Eval Step
+    * https://github.com/ShifuML/shifu/issues/118: Add Shortcut Commands: 'norm', 'varsel'
+    * https://github.com/ShifuML/shifu/issues/127: Support HDP 2.6.0.2.2.4.2-2
+    * https://github.com/ShifuML/shifu/issues/83: Add Distinct Count Statistics
+    * https://github.com/ShifuML/shifu/issues/82: Auto-detect Variable Type
+
+Changes for Shifu-0.2.5
+    * https://github.com/ShifuML/shifu/issues/97: Upgrade Guagua to latest version 0.7.0.
+        a) New features included in Guagua 0.6.0 to continuously improve the performance of Shifu:
+            1) 'out-of-core' list to support worker to scale out from memory to disk.
+            2) Netty-based coordinators to decrease dependency on zookeeper and improve iteration communication performance.
+            3) Embedded zookeeper server supported not only in client as a thread, but also in master node as a process.
+        b) One important feature included in Guagua 0.7.0 to accelerate training in Shifu:
+            1) Partial-complete feature: in each iteration the master waits for only a subset of workers to complete
+               and ignores the results of straggler workers.
+    * https://github.com/ShifuML/shifu/issues/105: SPDT stats performance improvement.
+        a) 'binningAlgorithm=SPDTI' (default value) in ModelConfig.json#stats is to improve scalability for big data. 
+            This solution is based on SPDT binning algorithm and called SPDT-Improvement(SPDTI).
+        b) Using SPDTI, stats finishes in 20 minutes with 20 million records and 1600 variables, and in 30 minutes
+            with 100 million records and 1600 variables.
+    * https://github.com/ShifuML/shifu/issues/59: Shifu eval confusion and performance improvement.
+        a) With 20 million records and 1600 variables, the eval step finishes in 13 minutes, compared with 20 minutes in
+            Shifu 0.2.4.
+    * https://github.com/ShifuML/shifu/issues/64: Set the Hadoop parallel number automatically.
+        a) As the input data set grows, users no longer need to set 'hadoopParallelNumber' in shifuconfig.
+        b) This value is now tuned automatically by Shifu.
+    * Binning improvement
+        a) https://github.com/ShifuML/shifu/issues/77: Add missing value count as a bin.
+        b) https://github.com/ShifuML/shifu/issues/79: Add weights to binning.
+        c) https://github.com/ShifuML/shifu/issues/80: Weights binning KS/IV/WoE computing.
+    * https://github.com/ShifuML/shifu/issues/72: Support WoE transformation when doing normalization
+    * Training step improvement
+        a) https://github.com/ShifuML/shifu/issues/95: NN doesn't support 0 hidden layer.
+        b) https://github.com/ShifuML/shifu/issues/76: Add convergence parameter to Shifu d-train.
+        c) https://github.com/ShifuML/shifu/issues/84: Add local disk support to scale in-memory data set.
+        d) https://github.com/ShifuML/shifu/issues/60: Continuous model training.
+        e) https://github.com/ShifuML/shifu/issues/85: Add 'epochsPerIteration' parameter in NNWorker.
+    * Bug fix:
+        a) https://github.com/ShifuML/shifu/issues/98
+        b) https://github.com/ShifuML/shifu/issues/92
+        c) https://github.com/ShifuML/shifu/issues/70
+        d) https://github.com/ShifuML/shifu/issues/69
+        e) https://github.com/ShifuML/shifu/issues/67
+
+Changes for Shifu-0.2.4
+    * https://github.com/ShifuML/shifu/issues/20: Work flow change.
+        a) Old: new -> init -> stats -> varselect -> normalize -> train -> eval
+        b) New: new -> init -> stats -> normalize -> varselect -> train -> eval
+        c) To redo variable selection after a model is built, the new work flow does not need the normalize step again; run the training step directly after variable selection.
+    * https://github.com/ShifuML/shifu/issues/49: Add distributed sensitivity analysis variable selection.
+        a) 'varSelect.wrapperEnabled=true' and 'wrapperBy=SE' in ModelConfig.json#varSelect part to enable sensitivity variable selection.
+        b) 'wrapperRatio' in ModelConfig.json#varSelect part is a percent to set how many variables will be removed.
+        c) To continue variable selection by sensitivity method, run 'shifu varselect' again. 
+        d) With 20 million records and 1600 variables, the run takes 70 minutes (45 minutes for 200 epochs of training and 25 minutes for sensitivity variable selection).
+    * https://github.com/ShifuML/shifu/issues/38: Improve scalability in stats step.
+        a) 'binningAlgorithm=SPDT' (default value) in ModelConfig.json#stats computes variable statistics in a way that improves scalability for big data.
+            Using SPDT, with 20 million records and 1600 variables, variable selection finishes in 50 minutes.
+        b) 'binningAlgorithm=MunroPat' in ModelConfig.json#stats is another approach to computing variable statistics that improves scalability for big data.
+    * https://github.com/ShifuML/shifu/issues/58: Improve scalability in eval step for HDFS mode.
+        a) With 20 million records and 1600 variables, the eval step finishes in 20 minutes with only 1GB of driver memory.
+    * https://github.com/ShifuML/shifu/issues/61: Embedded zookeeper server support.
+        a) There is no longer any need to set zookeeper servers, since the embedded zookeeper server is used when training models.
+        b) For big data training, an independent zookeeper cluster is strongly recommended.
+        c) Upgrade Guagua to 0.5.0 to get support from Guagua for this feature.
+    * Add PMML standard model converter.
+        a) To convert .nn files into pmml, run "shifu export -t pmml" or just "shifu export" (pmml is the default).
+           All generated pmml files will be under <Model-Directory>/pmmls/
+    * Bug fix:
+        a) https://github.com/ShifuML/shifu/issues/45
+        b) https://github.com/ShifuML/shifu/issues/51
+        c) https://github.com/ShifuML/shifu/issues/39
+        d) https://github.com/ShifuML/shifu/issues/40
+        e) https://github.com/ShifuML/shifu/issues/45
+
 Changes for Shifu-0.2.0
     * Make Shifu to support Hadoop-2.0 (add -Phdp-yarn when building)
     * Show mapper progress in JobTracker and show progress in CLI when using distribute training 
@@ -54,7 +290,6 @@ Changes for Shifu-0.0.4
     * TA457512 - Fix the bug: the delimiter of evaluation data doesn't take effect in AKKA mode
     * TA458788 - Fix the bug: Meta validation fails to report error when - "NumHiddenNodes" : [ "a", 45 ]
     * TA459375 - Write in-place QuickSort to replace Collections.sort() for memory consumption
-
 
 Changes for Shifu-0.0.3
     * TA446629 - Fix the bug: when there is am empty file, shifu in akka mode will be stucked
@@ -79,8 +314,8 @@ Changes for Shifu-0.0.2
     * DE29230 - Fix the bugs if the training data path is HDFS globe path
     * DE29231 - User only need put the configuration in local file system
     * US201443 - PathFinder refactor, split Manager class into several processes
-	* US207747 - Add option in ModelConfig for job queue name
-	* US177973 - Update code license and test data license 
+    * US207747 - Add option in ModelConfig for job queue name
+    * US177973 - Update code license and test data license 
     * Don't copy data and purify data when run `shifu init`
     * Add more comments
 

diff --git a/README.md b/README.md
@@ -1,30 +1,58 @@
+[<img src="images/logo/shifu.png" alt="Shifu" align="left">](http://shifu.ml)<div align="right"><div>[![Build Status](https://travis-ci.org/ShifuML/shifu.svg)](https://travis-ci.org/ShifuML/shifu)</div><div>[![Maven Central](https://maven-badges.herokuapp.com/maven-central/ml.shifu/shifu/badge.svg)](https://maven-badges.herokuapp.com/maven-central/ml.shifu/shifu)</div></div>
 
+#
 
-## Getting Started
+## Download
+Please [download](https://github.com/ShifuML/shifu/wiki/shifu-0.10.5-hdp-yarn.tar.gz) the latest Shifu [here](https://github.com/ShifuML/shifu/wiki/shifu-0.10.5-hdp-yarn.tar.gz).
 
-Please visit [shifu.ml](http://shifu.ml) for download infomation, installation instructions, and tutorials.
+## Getting Started
+After downloading Shifu, build your first model by following the Shifu [tutorial](https://github.com/ShifuML/shifu/wiki/Tutorial---Build-Your-First-ML-Model). More details about Shifu can be found in our [wiki pages](https://github.com/ShifuML/shifu/wiki).
 
 ## What is Shifu?
 Shifu is an open-source, end-to-end machine learning and data mining framework built on top of Hadoop. Shifu is designed for data scientists, simplifying the life-cycle of building machine learning models. While originally built for fraud modeling, Shifu is generalized for many other modeling domains.
 
+One of Shifu's strengths is its end-to-end machine learning modeling pipeline. With configuration settings alone, a whole machine learning pipeline can be built, making models much easier to develop and push to production. The pipeline defined in Shifu is shown below:
+
+![Shifu Pipeline](https://raw.githubusercontent.com/wiki/ShifuML/shifu/images/new-shifu-pipeline.png)
+
 Shifu provides a simple command-line interface for each step of the model building process, including
 
 * Statistic calculation & variable selection to determine the most predictive variables in your data
-* Variable normalization
-* Distributed neural network model training
+* [Variable normalization](https://github.com/ShifuML/shifu/wiki/Variable%20Transform%20in%20Shifu)
+* [Distributed variable selection based on sensitivity analysis](https://github.com/ShifuML/shifu/wiki/Variable%20Selection%20in%20Shifu)
+* [Distributed neural network model training](https://github.com/ShifuML/shifu/wiki/Distributed%20Neural%20Network%20Training%20in%20Shifu)
+* [Distributed tree ensemble model training](https://github.com/ShifuML/shifu/wiki/Distributed%20Tree%20Ensemble%20Model%20Training%20in%20Shifu)
 * Post training analysis & model evaluation
 
-Shifu’s fast Hadoop-based, distributed neural network training can reduce model training time from days to hours on 500GB data sets. Shifu integrates with Pig workflows on Hadoop, and Shifu-trained models can be integrated into production code with a simple Java API. Shifu leverages Pig, Akka, Encog and other open source projects.
+Shifu’s fast Hadoop-based, distributed neural network / logistic regression / gradient boosted trees training can reduce model training time from days to hours on terabyte-scale data sets. Shifu integrates with Pig workflows on Hadoop, and Shifu-trained models can be integrated into production code with a simple Java API. Shifu leverages Pig, Akka, Encog and other open source projects.
+
+[Guagua](https://github.com/ShifuML/guagua), an in-memory iterative computing framework on Hadoop YARN, is developed as a sub-project of Shifu to accelerate training.
+
+More details about Shifu can be found in our [wiki pages](https://github.com/ShifuML/shifu/wiki).
+
+## Conference
+
+* [QCON Shanghai 2015](http://2015.qconshanghai.com/presentation/2827) [Slides](http://www.slideshare.net/pengshanzhang/large-scale-machine-learning-at-pay-pal-risk)
+
+* [BDTC Beijing 2016](http://bdtc2016.hadooper.cn/dct/page/70107)
+
+* [Strata Beijing 2017](https://strata.oreilly.com.cn/strata-cn/public/schedule/detail/59593?locale=en)
 
 ## Contributors
 
- - Zhanghao Hu
- - Grahame Jastrebski
- - Lavar Li
- - Mark Liu
- - David Zhang
- - Xin Zhong
+ - Zhanghao Hu (zhanhu@paypal.com)
+ - Grahame Jastrebski (gjastrebski@paypal.com)
+ - Lavar Li (lulli@paypal.com)
+ - Mark Liu (yliu15@paypal.com)
+ - David Zhang (pengzhang@paypal.com)
+ - Xin Zhong (xinzhong@paypal.com)
+ - Simon Zhang (jzhang13@paypal.com)
+ - Sharma Nitin (nsharma1@paypal.com)
+
+## Google Group
+
+Please join the [Shifu group](https://groups.google.com/forum/#!forum/shifuml) for questions, bug reports, or anything else.
 
 ## Copyright and License
 
-Copyright 2012-2014, eBay Software Foundation under [the Apache License](LICENSE.txt).
+Copyright 2012-2017, PayPal Software Foundation under the [Apache License](LICENSE.txt).