AdaDelta #447

Open: wants to merge 1,162 commits into base: develop

Commits
8fcb3ea
fix the name match problem when runPSI
huzza May 12, 2017
4a60780
small encoding issue
pengshanzhang May 15, 2017
beb7ae5
Merge branch 'release'
pengshanzhang May 15, 2017
c70346c
small change
pengshanzhang May 15, 2017
c4225ce
Merge branch 'develop' of https://github.com/ShifuML/shifu into develop
pengshanzhang May 15, 2017
d548478
add feature importance file roll up logic
pengshanzhang May 18, 2017
4fdc85c
User can do rebin in stats step; And user can repeat rebin if the res…
huzza May 18, 2017
3c8970a
Merge branch 'develop' of https://github.com/ShifuML/shifu into develop
huzza May 18, 2017
4c8b251
Merge pull request #364 from huzza/develop
huzza May 18, 2017
3e5ee1a
1. fix a rollup bug 2. change default memory to 1700 M to avoid OOM e…
pengshanzhang May 19, 2017
d7970cd
Merge branch 'develop' of https://github.com/ShifuML/shifu into develop
pengshanzhang May 19, 2017
0199885
1. increase worker timeout 2. change trees list to CopyOnWriteArrayLi…
pengshanzhang May 19, 2017
0c391b3
fix a bug in setting IV value - weighted iv is miss-used as iv
huzza May 22, 2017
38c05b4
prepare for 0.10.2 release
pengshanzhang May 24, 2017
4e34554
Merge branch 'develop' into release
pengshanzhang May 24, 2017
9894df0
change to 0.10.2
pengshanzhang May 24, 2017
7001f73
Merge branch 'release'
pengshanzhang May 24, 2017
1c2efc3
flatten categorical variables when generating GBT tree to avoid too l…
huzza May 30, 2017
ed370fb
Merge branch 'develop' of https://github.com/ShifuML/shifu into develop
huzza May 30, 2017
3d06c70
Merge pull request #365 from huzza/develop
huzza May 30, 2017
168a19d
trim categorical value whose length is larger than 16k
huzza May 30, 2017
76e50aa
Merge pull request #366 from huzza/develop
huzza May 30, 2017
d654795
lease NS check in NN models scoring
huzza May 31, 2017
c065e6c
fix a bug in exporting pmml - it should check columnNum instead of co…
huzza May 31, 2017
b48ad2b
Merge pull request #367 from huzza/develop
huzza May 31, 2017
18c3105
add eval score and shifu gbt engine score test
pengshanzhang Jun 1, 2017
070522d
Merge branch 'develop' of https://github.com/ShifuML/shifu into develop
pengshanzhang Jun 1, 2017
5278ce7
change categorical group variable delimiter from ^ to @^
huzza Jun 1, 2017
933c037
Merge branch 'develop' of https://github.com/ShifuML/shifu into develop
huzza Jun 1, 2017
2928885
Merge pull request #368 from huzza/develop
huzza Jun 1, 2017
c1faf16
if user set memory in command line for correlation computing, autamat…
pengshanzhang Jun 1, 2017
bd00ec9
Merge pull request #6 from ShifuML/develop
m4rkl1u Jun 1, 2017
5289255
change for 0.10.3 release
pengshanzhang Jun 2, 2017
3fb7a44
Merge branch 'develop' into release
pengshanzhang Jun 2, 2017
3b0b56e
change to 0.10.3 for release
pengshanzhang Jun 2, 2017
70d6651
Merge branch 'release'
pengshanzhang Jun 2, 2017
3fbb51f
refactor compute(Map) to call compute(double[])
pengshanzhang Jun 2, 2017
85e4fcd
fix a bug on shifu parameter parsing
pengshanzhang Jun 5, 2017
321ce91
recover gbt to only support variance impurity type
pengshanzhang Jun 5, 2017
4b71354
fix a bug on Shifu parameter parsing
pengshanzhang Jun 5, 2017
aca6484
fix a bug on Shifu parameter parsing
pengshanzhang Jun 5, 2017
a77d68a
refactor code to avoid to new double array if it's a regression model
huzza Jun 6, 2017
2c029ef
merge code and optimize IndependentTreeModel
huzza Jun 6, 2017
d55d396
Merge pull request #370 from huzza/develop
huzza Jun 6, 2017
e3feb75
fix a rebin bug on no missing binPosRate
pengshanzhang Jun 7, 2017
2e4beaf
prepare for 0.10.4 release
pengshanzhang Jun 8, 2017
125d0fb
add change logs
pengshanzhang Jun 8, 2017
e17ae9b
Merge branch 'develop' into release
pengshanzhang Jun 8, 2017
10ba66d
change version to 0.10.4
pengshanzhang Jun 8, 2017
2c65ca2
fix a bug on CorrelationMapper.java
pengshanzhang Jun 8, 2017
ea28f65
Merge branch 'release'
pengshanzhang Jun 8, 2017
a29d437
change version to 0.11.0-SNAPSHOT
pengshanzhang Jun 8, 2017
8d0cd2c
fix a issue on empty records without stats in UpdateBinningInfoReducer
pengshanzhang Jun 12, 2017
5e2b337
add auto-detect RunMode: if it is in Hadoop cluster, a distributed ex…
pengshanzhang Jun 13, 2017
88e5b1e
fix a bug on local eval
pengshanzhang Jun 14, 2017
f2caec2
some logic changed for correlation computing
pengshanzhang Jun 15, 2017
a89ce36
1. add eval score benchmark column setting 2. fix a bug on scorer Bas…
pengshanzhang Jun 16, 2017
c924f44
a bug on stats young heap size setting, should be less than half of t…
pengshanzhang Jun 16, 2017
214ad2c
fix the bug in binning and data filter after support namespace
huzza Jun 17, 2017
6ac18b3
Merge branch 'develop' of https://github.com/ShifuML/shifu into develop
huzza Jun 17, 2017
3e04d86
merge changes
huzza Jun 17, 2017
8756dec
Merge pull request #377 from huzza/develop
huzza Jun 17, 2017
b0ce584
add regression test bash scripts
pengshanzhang Jun 19, 2017
c8695cd
small changes
pengshanzhang Jun 23, 2017
b334399
Merge branch 'develop' of https://github.com/ShifuML/shifu into develop
pengshanzhang Jun 23, 2017
d2bcdaa
avoid column num to array index mapping when prediction
huzza Jun 23, 2017
80dae80
Merge branch 'develop' of https://github.com/ShifuML/shifu into develop
huzza Jun 23, 2017
6e436d3
save memory by remove some unused fields: (One gbt memory dropped 30%…
pengshanzhang Jun 23, 2017
a843ec7
make sure the continous training is not affected after tree execution…
huzza Jun 23, 2017
ff15a69
follow convention
huzza Jun 25, 2017
9234ec6
Merge branch 'develop' of https://github.com/ShifuML/shifu into develop
huzza Jun 25, 2017
57aeba2
Merge pull request #378 from huzza/develop
huzza Jun 25, 2017
72d8b98
set FeatureType to byte to save memory
pengshanzhang Jun 26, 2017
29d8115
Merge branch 'develop' of https://github.com/ShifuML/shifu into develop
pengshanzhang Jun 26, 2017
dd6f9d9
change double gain to float to save memory
pengshanzhang Jun 26, 2017
4468763
add new categorical missing value norm type, it is compatible by mean…
pengshanzhang Jun 27, 2017
30b15ec
fix an encoding issue
pengshanzhang Jun 27, 2017
553111d
change to new guagua version to fix one issue in guagua to avoid empt…
pengshanzhang Jun 29, 2017
27416e0
no need check finalselect in norm cleaned data
pengshanzhang Jun 29, 2017
367e397
Merge branch 'develop' of https://github.com/ShifuML/shifu into develop
huzza Jun 29, 2017
bfe2333
fix NormalizedUDFTest
huzza Jun 30, 2017
410360d
1. by default remove namespace in column name when loading 2. by defa…
pengshanzhang Jun 30, 2017
c084868
Merge pull request #380 from huzza/develop
huzza Jun 30, 2017
b43a453
fix a bug on introducing new
pengshanzhang Jun 30, 2017
0a0a794
Merge branch 'develop' of https://github.com/ShifuML/shifu into develop
pengshanzhang Jun 30, 2017
6415a68
change guagua to 0.7.7
pengshanzhang Jun 30, 2017
623815b
change to release 0.10.5 version
pengshanzhang Jun 30, 2017
53b3f7b
Merge branch 'release'
pengshanzhang Jun 29, 2017
ef058d4
fix a bug on clean categorical data: should be raw string value for c…
pengshanzhang Jul 2, 2017
ce346e4
fix a bug on categorical feature clean: should be raw value instead …
zhangpengshan Jul 2, 2017
ad6d168
add new wiki image
pengshanzhang Jul 2, 2017
2ea56bf
add NN first layer output option for high level features
pengshanzhang Jul 4, 2017
7ea8072
add 25 50 75 percentile in columnstats
MiniZhuwei Jul 4, 2017
30093d8
Impl of calculate 25 50 75 in UpdateBinningInfoReducer
MiniZhuwei Jul 4, 2017
d36283e
format code
MiniZhuwei Jul 4, 2017
d05bfb1
fix equalpopulation binning bug
MiniZhuwei Jun 27, 2017
ffa13d5
improve time performance of equalpopulation binning to avoid mapred t…
MiniZhuwei Jun 27, 2017
e29b6dd
Update EqualPopulationBinning.java
MiniZhuwei Jul 4, 2017
1537fc6
add set percentile to in stats worker to write percentile back to loc…
MiniZhuwei Jul 5, 2017
ff3be54
improve time performance of equalpopulation binning to avoid mapred t…
MiniZhuwei Jul 5, 2017
f344ea7
improve dropout
pengshanzhang Jul 5, 2017
623feed
Merge pull request #379 from MiniZhuwei/develop
zhangpengshan Jul 6, 2017
1eea543
set simple column name in pmml generation
pengshanzhang Jul 6, 2017
0070541
Merge branch 'develop' of https://github.com/ShifuML/shifu into develop
pengshanzhang Jul 6, 2017
e5011a9
Merge branch 'develop-dropout' into develop
pengshanzhang Jul 6, 2017
62c956a
remove unused code
MiniZhuwei Jul 6, 2017
e3a2d5d
back to origin locate bin boundary condition
MiniZhuwei Jul 7, 2017
172f8dc
Merge pull request #383 from MiniZhuwei/develop
zhangpengshan Jul 7, 2017
9c80876
fix an issue on tree model feature importance computing
pengshanzhang Jul 7, 2017
7ddba3e
Merge branch 'develop' of https://github.com/ShifuML/shifu into develop
pengshanzhang Jul 7, 2017
b7511ad
fix a bug on dropout computing in back step of BP
pengshanzhang Jul 7, 2017
bebd973
fix a bug on ELM gradients to 0 setting: it is set to last layer whil…
pengshanzhang Jul 7, 2017
72095cb
make namespace support to stict model. only shifu.namespace.strict.mo…
huzza Jul 7, 2017
257459f
Merge branch 'develop' of https://github.com/ShifuML/shifu into develop
huzza Jul 7, 2017
41aa19e
fix a bug on L1 regularization update
pengshanzhang Jul 10, 2017
3527412
Merge branch 'develop' of https://github.com/ShifuML/shifu into develop
huzza Jul 10, 2017
79ef496
append raw data in eval norm output
pengshanzhang Jul 11, 2017
2a0be3d
fix a bug on csv no header path for eval norm
pengshanzhang Jul 11, 2017
2c58ef7
fix a bug on varsel: if no column is selected(no finalSelect=true), t…
pengshanzhang Jul 12, 2017
48084f7
trim categorical value to make 1.0 same as 1
huzza Jul 12, 2017
8cd6628
Merge pull request #394 from huzza/develop
zhangpengshan Jul 12, 2017
0cefee1
format change
pengshanzhang Jul 12, 2017
21059c7
Merge branch 'develop' of https://github.com/ShifuML/shifu into develop
pengshanzhang Jul 12, 2017
f041b1f
fix a inconsistent for category 1 and 1.0 issue
pengshanzhang Jul 12, 2017
222291f
fix the bug in NormalizeUDF, when tag is null, it will throw NP
huzza Jul 13, 2017
1ffacf8
Merge pull request #397 from huzza/develop
huzza Jul 13, 2017
8d2b125
1. remove 1.0 and 1 categorical issue fix because of finding some cor…
pengshanzhang Jul 17, 2017
2def25e
Merge branch 'develop' of https://github.com/ShifuML/shifu into develop
pengshanzhang Jul 17, 2017
1eb2f56
add invalid tag counter
pengshanzhang Jul 17, 2017
a74bf63
Merge pull request #8 from ShifuML/develop
m4rkl1u Jul 17, 2017
bb231ec
fix a bug on conflict to write model output in final iteration and fi…
pengshanzhang Jul 17, 2017
72cb777
1. first init for IndependentNNModel 2. make max category size can be…
pengshanzhang Jul 18, 2017
35ccefd
fix bug: the columnconfig didnt sync to HDFS before PSI calculation
Jul 18, 2017
b6559ff
up
Jul 18, 2017
8df0225
Merge pull request #399 from m4rkl1u/develop
zhangpengshan Jul 18, 2017
05b5645
independent nn model code push
pengshanzhang Jul 18, 2017
9adcf19
remove unimported classes
Jul 18, 2017
c60a737
push
Jul 18, 2017
1649115
add new class
Jul 18, 2017
bfb0372
first good impl for private binary nn model: output to another folder
pengshanzhang Jul 19, 2017
8813116
quick fix
pengshanzhang Jul 19, 2017
665c776
fix issue on new binary nn model
pengshanzhang Jul 20, 2017
08e1abb
Add NSColumn.java toString for error info
pengshanzhang Jul 20, 2017
789111b
enable eval on k-fold and grid search models, grid search models whic…
pengshanzhang Jul 21, 2017
8b8175c
1. change network to a list in IndependentNNModel.java 2. add more kf…
pengshanzhang Jul 21, 2017
182a1cf
first commit for hybrid column
pengshanzhang Jul 24, 2017
c34dadd
add hybrid threshold setting
pengshanzhang Jul 24, 2017
c232687
make hybrid supported in IndependentNNModel.java
pengshanzhang Jul 25, 2017
fd82e23
first commit to unified bagging model
pengshanzhang Jul 25, 2017
8f3230b
add unified nn bagging model support
pengshanzhang Aug 2, 2017
4e592a1
add new one bagging pmml model parameter in CLI
pengshanzhang Aug 3, 2017
628785d
change readme.doc
zhangpengshan Aug 4, 2017
68656d3
make seed can be config for bagging/RF bagging
wuhaifengdhu Aug 6, 2017
3ed6c20
add gbt export to one bagging model support, trees in IndependentTree…
pengshanzhang Aug 7, 2017
dc92da0
fix issue on dt tmp model saving
pengshanzhang Aug 7, 2017
c2d6d9e
add eval perf and eval score running in parallel if multiple evalsets
Aug 7, 2017
6b4a33e
fix a issue on se varselect for candidate list 2. a issue on export b…
pengshanzhang Aug 8, 2017
1631906
escape the delimiter in eval step
huzza Aug 8, 2017
d9ed4ca
Merge branch 'develop' of https://github.com/ShifuML/shifu into develop
huzza Aug 8, 2017
700999f
Merge pull request #418 from huzza/develop
huzza Aug 8, 2017
8214014
Merge pull request #417 from wuhaifengdhu/baggingSeed-revise
zhangpengshan Aug 16, 2017
35fc3b8
1. fix OOM for create too many ColumnConfig instances in eval (eval m…
pengshanzhang Aug 17, 2017
37cf805
Merge branch 'develop' of https://github.com/ShifuML/shifu into develop
pengshanzhang Aug 17, 2017
7cbd41a
change loss fucntion
Aug 17, 2017
feff4f5
Merge branch 'develop' of https://github.com/prcjsczllc/shifu into de…
Aug 17, 2017
da83b72
overwrite the logloss function
Aug 17, 2017
22454b3
add Candidate flag
huzza Aug 17, 2017
55e5721
Merge branch 'develop' of https://github.com/ShifuML/shifu into develop
huzza Aug 17, 2017
69faf40
update the logloss function based on GBM whitepaper
Aug 18, 2017
b367f03
Merge pull request #420 from huzza/develop
huzza Aug 22, 2017
85de67b
hidden parameter baggingSampleSeed in default config file
wuhaifengdhu Aug 27, 2017
25ba45a
Merge pull request #422 from wuhaifengdhu/randomSeed
zhangpengshan Aug 28, 2017
fb8ff9b
Merge pull request #419 from prcjsczllc/develop
zhangpengshan Aug 28, 2017
aae5a27
add new feature for segments expansions in feature level
pengshanzhang Aug 30, 2017
0e1838e
Merge branch 'develop' of https://github.com/ShifuML/shifu into develop
pengshanzhang Aug 30, 2017
827056f
update latest download info
zhangpengshan Aug 31, 2017
561bf82
add more information
zhangpengshan Aug 31, 2017
21fbd73
add pipeline
zhangpengshan Aug 31, 2017
3f6d2a7
add more conferences info
zhangpengshan Aug 31, 2017
3313365
Update README.md
zhangpengshan Aug 31, 2017
02779db
Update README.md
zhangpengshan Aug 31, 2017
ea41308
Update README.md
zhangpengshan Aug 31, 2017
b7a16c2
Update README.md
zhangpengshan Aug 31, 2017
fbc19a4
remove logs for travis building
pengshanzhang Aug 31, 2017
0674ea0
Merge branch 'develop' of https://github.com/ShifuML/shifu into develop
pengshanzhang Aug 31, 2017
7e86240
fix ut issue in travis
pengshanzhang Aug 31, 2017
2fcb165
Update README.md
zhangpengshan Aug 31, 2017
6d142fe
Update README.md
zhangpengshan Aug 31, 2017
d1566c4
Update README.md
zhangpengshan Aug 31, 2017
89ca0da
Update README.md
zhangpengshan Aug 31, 2017
83446a4
Update README.md
zhangpengshan Aug 31, 2017
c8fc078
Update README.md
zhangpengshan Aug 31, 2017
4efd840
log refine
pengshanzhang Aug 31, 2017
f19469c
Merge branch 'develop' of https://github.com/ShifuML/shifu into develop
pengshanzhang Aug 31, 2017
a6e2ad2
add Candidate columnFlag
huzza Aug 31, 2017
2f99e32
Merge branch 'develop' of https://github.com/huzza/shifu into develop
huzza Aug 31, 2017
23167f3
fix issue on filter UDF
pengshanzhang Sep 1, 2017
6fef09e
fix on log removing
pengshanzhang Sep 1, 2017
81082c6
fix bug on expanding segment stats
pengshanzhang Sep 1, 2017
a96d25b
Update README.md
zhangpengshan Sep 4, 2017
4b5d02e
Update README.md
zhangpengshan Sep 4, 2017
a10f423
Update README.md
zhangpengshan Sep 4, 2017
e6db55e
Update README.md
zhangpengshan Sep 4, 2017
83e63a4
Update README.md
zhangpengshan Sep 4, 2017
887a659
Update README.md
zhangpengshan Sep 4, 2017
e8ca465
fix issue on variable update
pengshanzhang Sep 4, 2017
4919c13
refine DataPurifier.java
pengshanzhang Sep 5, 2017
c6fcb4d
fix ut
pengshanzhang Sep 5, 2017
8fb8577
Merge pull request #9 from ShifuML/develop
m4rkl1u Sep 5, 2017
cdddc70
add filtering logic for PSI scripts
Sep 5, 2017
37def1b
Merge pull request #428 from m4rkl1u/develop
zhangpengshan Sep 5, 2017
62ffe48
fix a bug on parallel eval race condition issue
pengshanzhang Sep 6, 2017
01b3faf
add feature to read correlation from output and no need to rerun it a…
pengshanzhang Sep 6, 2017
3a1e6cc
merge from shifu (#1)
junshiguo Sep 7, 2017
d959850
enable grid search file config
Sep 7, 2017
799eecd
Merge branch 'develop' of https://github.com/junshiguo/shifu into dev…
Sep 7, 2017
a7ad056
Merge pull request #2 from ShifuML/develop
junshiguo Sep 7, 2017
ce70b20
Merge pull request #3 from junshiguo/develop
junshiguo Sep 7, 2017
0d3ad79
add gridConfigFile in meta; bug fix
Sep 7, 2017
3a06335
update based on comments
Sep 11, 2017
a58dce7
remove <> to avoid javadoc compile error
Sep 11, 2017
99f179a
enable gridConfigFile in HDFS mode
Sep 11, 2017
fd36960
add check threshold logic to file config
Sep 11, 2017
09b2f50
add more check when parsing grid config file; add null check when upl…
Sep 12, 2017
7fdf2b7
remove unnecessary imports
Sep 12, 2017
8290bc7
use == for enum equal check
Sep 14, 2017
d6629ca
Merge pull request #430 from junshiguo/juguo-develop
zhangpengshan Sep 14, 2017
a0961ca
remove some debug logs and change some typos
pengshanzhang Sep 14, 2017
f2284aa
Merge branch 'develop' of https://github.com/ShifuML/shifu into develop
pengshanzhang Sep 14, 2017
aa648ef
a bug on typo of ONEVSREST
pengshanzhang Sep 14, 2017
4b4d8f5
Merge pull request #4 from ShifuML/develop
junshiguo Sep 14, 2017
93110ad
split param number type into integer and number
Sep 14, 2017
4fe7c72
validate train params; update based on PR comments
Sep 15, 2017
e2b9297
remove unused imports
Sep 15, 2017
1a5401a
fix typo in test
Sep 15, 2017
0288b33
loose type check
Sep 15, 2017
6c2a537
support one-hot encoding in Shifu
huzza Sep 18, 2017
d84f5bb
make sure column number is in featureSet
huzza Sep 18, 2017
3590da5
Merge pull request #432 from junshiguo/juguo-develop
zhangpengshan Sep 19, 2017
6fd73ee
up
pengshanzhang Sep 19, 2017
a6e6572
Merge branch 'develop' of https://github.com/ShifuML/shifu into develop
pengshanzhang Sep 19, 2017
63d4852
add new cross-entropy loss function (log loss)
pengshanzhang Sep 22, 2017
eca64e1
use new isGoodCandidate function to support candidate variabels
huzza Sep 25, 2017
9564ca1
merge code
huzza Sep 25, 2017
30f5afc
Merge pull request #444 from huzza/develop
huzza Sep 26, 2017
dbc97d1
add auto-detect if shuffle or not which is helped to determine a bett…
pengshanzhang Sep 27, 2017
12e3f86
Merge branch 'develop' of https://github.com/ShifuML/shifu into develop
pengshanzhang Sep 27, 2017
3c4a3ba
AdaDelta
Oct 16, 2017
5d81ed2
AdaGrad
Oct 16, 2017
d97d474
RMSProp & refine interface/abstract class
Oct 17, 2017
1 change: 1 addition & 0 deletions .gitignore
@@ -4,4 +4,5 @@ target
.classpath
.project
.settings
.DS_Store
test-output
243 changes: 239 additions & 4 deletions CHANGES.txt
@@ -1,5 +1,5 @@
/**
-* Copyright [2012-2014] eBay Software Foundation
+* Copyright [2012-2014] PayPal Software Foundation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
@@ -16,6 +16,242 @@

Shifu Change Log

Changes for Shifu-0.10.5
* Optimize IndependentTreeModel by decreasing model memory to 70% and CPU time to 90%
* Upgrade guagua to 0.7.0 to fix a bug on empty gzip files in one worker

Changes for Shifu-0.10.4
* Optimize IndependentTreeModel by splitting regression and classification;
* Add new version of fast correlation computing.

Changes for Shifu-0.10.3
* Fix GBT SLA Categorical Feature Rebin Delimiter Issue: change delimiter to '@^'

Changes for Shifu-0.10.2
* Fix GBT SLA issue: pre-parse double types for only once.

Changes for Shifu-0.10.1
* Fix one big bug on 'baggingWithReplacement':
https://github.com/ShifuML/shifu/issues/335

Changes for Shifu-0.10.0
* Tree Ensemble Model Improvement
a) Speed GBT Training
https://github.com/ShifuML/shifu/issues/252
b) Auto Skip Features with only One Bin
https://github.com/ShifuML/shifu/issues/276
c) Cover GBT Regression Score To Probability
https://github.com/ShifuML/shifu/issues/254
d) Add Early Stop Feature for GBT
https://github.com/ShifuML/shifu/issues/230
e) By Default Disable Tmp Model Output in NN and GBT, RF
https://github.com/ShifuML/shifu/issues/231
f) GBT & RF PMML Support
https://github.com/ShifuML/shifu/issues/232
g) Grid Search: Compute validation error on latest 10 or 20 iterations
https://github.com/ShifuML/shifu/issues/233
h) Missing Value Processing in Tree Model
https://github.com/ShifuML/shifu/issues/239
i) Make Tree Model Without Dependency
https://github.com/ShifuML/shifu/issues/253
j) Compress Tree Model by Gzip to Save Size
https://github.com/ShifuML/shifu/issues/272
* Train Step Improvement
a) Sampling Logic Change in Training
https://github.com/ShifuML/shifu/issues/310
b) Add Stratified Sampling in Training Step
https://github.com/ShifuML/shifu/issues/311
c) Add Cross Validation in Train Step
https://github.com/ShifuML/shifu/issues/312
d) Guagua Job Failed Improvement
https://github.com/ShifuML/shifu/issues/237
e) Disable Tmp Model Output in NN and GBT, RF
https://github.com/ShifuML/shifu/issues/231
f) Support Redo Training Without Weight after Weighted Norm
https://github.com/ShifuML/shifu/issues/315
* VarSel Step Improvement
a) Refine VarSel Configurations
https://github.com/ShifuML/shifu/issues/262
b) Enable Multiple Threading in Sensitivity Analysis
https://github.com/ShifuML/shifu/issues/213
c) Add Feature Importance for Tree Model VarSelect
https://github.com/ShifuML/shifu/issues/218
* Stats Step Improvement
a) Add More Stats in Stats Step
https://github.com/ShifuML/shifu/issues/313
b) Change Distinct Count Computing from Init to Stats
https://github.com/ShifuML/shifu/issues/314
c) Bugs & Others
1) Default meta/categorical file support
2) Bug in stats on bad feature type
3) Add more stats on each MR job like number of filter records.
* Others
a) Eval Step Improvement
https://github.com/ShifuML/shifu/issues/150
b) CSV Format File Support
https://github.com/ShifuML/shifu/issues/258
c) Combo Model Training (Beta)
https://github.com/ShifuML/shifu/issues/316

Changes for Shifu-0.9.0
* Random Forest Enhancement
a) RF & GBDT Sort Categorical Features
https://github.com/ShifuML/shifu/issues/203
b) RF & GBDT Categorical Variables Unsorted Supported
https://github.com/ShifuML/shifu/issues/202
* Gradient Boosted Trees Enhancement
a) Master Fail Over
https://github.com/ShifuML/shifu/issues/227
b) GBT Support Continuous Model Training
https://github.com/ShifuML/shifu/issues/222
c) RF & GBDT Sort Categorical Features
https://github.com/ShifuML/shifu/issues/203
d) RF & GBDT Categorical Variables Unsorted Supported
https://github.com/ShifuML/shifu/issues/202
* Grid Search Support
a) NN Grid Search
https://github.com/ShifuML/shifu/issues/214
b) RF & GBDT Grid Search
https://github.com/ShifuML/shifu/issues/213
* Random Search Support
https://github.com/ShifuML/shifu/issues/234
* Multiple Classification Enhancement
a) Add Random Forest Multiple Classification
https://github.com/ShifuML/shifu/issues/235
b) Add OneVSAll Multiple Classification for NN, RF and GBDT
https://github.com/ShifuML/shifu/issues/209
* Dynamic Binning Support
https://github.com/ShifuML/shifu/issues/236
* Others
a) https://github.com/ShifuML/shifu/issues/195
b) https://github.com/ShifuML/shifu/issues/229

Changes for Shifu-0.2.8
* Random Forest Support
a) https://github.com/ShifuML/shifu/issues/123
b) https://github.com/ShifuML/shifu/issues/122
* Gradient Boosted Trees Support
a) https://github.com/ShifuML/shifu/issues/124
b) https://github.com/ShifuML/shifu/issues/122
* Feature Importance in 'posttrain' Step
https://github.com/ShifuML/shifu/issues/180
* PSI Feature in 'stats' Step
https://github.com/ShifuML/shifu/issues/196
* Correlation Between Features in 'norm' Step
https://github.com/ShifuML/shifu/issues/146
* Others
a) https://github.com/ShifuML/shifu/issues/190
b) https://github.com/ShifuML/shifu/issues/181
c) https://github.com/ShifuML/shifu/issues/179
d) https://github.com/ShifuML/shifu/issues/178

Changes for Shifu-0.2.7
* Sampling Function Improvement
a) https://github.com/ShifuML/shifu/issues/93
b) https://github.com/ShifuML/shifu/issues/140
* Binning Improvement
a) https://github.com/ShifuML/shifu/issues/148
b) https://github.com/ShifuML/shifu/issues/157
* Stats Step Improvement
a) https://github.com/ShifuML/shifu/issues/155
b) https://github.com/ShifuML/shifu/issues/137
c) https://github.com/ShifuML/shifu/issues/75
* Norm Step Improvement
a) https://github.com/ShifuML/shifu/issues/103
b) https://github.com/ShifuML/shifu/issues/120
c) https://github.com/ShifuML/shifu/issues/131
d) https://github.com/ShifuML/shifu/issues/142
* Train Step Improvement
a) https://github.com/ShifuML/shifu/issues/66
b) https://github.com/ShifuML/shifu/issues/159
c) https://github.com/ShifuML/shifu/issues/166
d) https://github.com/ShifuML/shifu/issues/106
* Variable Selection Step Improvement
a) https://github.com/ShifuML/shifu/issues/57
b) https://github.com/ShifuML/shifu/issues/102
* Distributed LR Algorithm Improvement (Experimental)
a) https://github.com/ShifuML/shifu/issues/56
* Multiple classes NN Algorithm Improvement (Experimental)
a) https://github.com/ShifuML/shifu/issues/149
* Pig on Tez Support

Changes for Shifu-0.2.6
* https://github.com/ShifuML/shifu/issues/133: Add skewness and kurtosis stats
* https://github.com/ShifuML/shifu/issues/134: Add CSV ColumnConfig Format for ColumnConfig.json
* https://github.com/ShifuML/shifu/issues/117: Add AUC Computation on Eval Step
* https://github.com/ShifuML/shifu/issues/118: Add Shortcut Commands: 'norm', 'varsel'
* https://github.com/ShifuML/shifu/issues/127: Support HDP 2.6.0.2.2.4.2-2
* https://github.com/ShifuML/shifu/issues/83: Add Distinct Count Statistics
* https://github.com/ShifuML/shifu/issues/82: Auto-detect Variable Type

Changes for Shifu-0.2.5
* https://github.com/ShifuML/shifu/issues/97: Upgrade Guagua to latest version 0.7.0.
a) New features included in Guagua 0.6.0 to continuously improve performance of Shifu:
1) 'out-of-core' list to support worker to scale out from memory to disk.
2) Netty-based coordinators to decrease dependency on zookeeper and improve iteration communication performance.
3) Embedded zookeeper server supported not only in client as a thread, but also in master node as a process.
b) One important feature included in Guagua 0.7.0 to accelerate training in Shifu:
1) Partial-complete feature: in each iteration the master waits for only part of the workers to complete and
ignores straggler worker results.
* https://github.com/ShifuML/shifu/issues/105: SPDT stats performance improvement.
a) 'binningAlgorithm=SPDTI' (default value) in ModelConfig.json#stats is to improve scalability for big data.
This solution is based on SPDT binning algorithm and called SPDT-Improvement(SPDTI).
b) Using SPDTI, with 20 million of records and 1600 variables, 20 minutes to finish stats. With 100 million of
records and 1600 variables, 30 minutes to finish stats.
* https://github.com/ShifuML/shifu/issues/59: Shifu eval confusion and performance improvement.
a) With 20 million of records and 1600 variables, 13 minutes to finish eval step compared with 20 minutes in
Shifu 0.2.4.
* https://github.com/ShifuML/shifu/issues/64: Set the Hadoop parallel number automatically.
a) As the input data set grows, the user no longer needs to set 'hadoopParallelNumber' in shifuconfig.
b) This value is tuned automatically in new Shifu.
* Binning improvement
a) https://github.com/ShifuML/shifu/issues/77: Add missing value count as a bin.
b) https://github.com/ShifuML/shifu/issues/79: Add weights to binning.
c) https://github.com/ShifuML/shifu/issues/80: Weights binning KS/IV/WoE computing.
* https://github.com/ShifuML/shifu/issues/72: Support WoE transformation when doing normalization
* Training step improvement
a) https://github.com/ShifuML/shifu/issues/95: NN doesn't support 0 hidden layer.
b) https://github.com/ShifuML/shifu/issues/76: Add convergence parameter to Shifu d-train.
c) https://github.com/ShifuML/shifu/issues/84: Add local disk support to scale in-memory data set.
d) https://github.com/ShifuML/shifu/issues/60: Continuous model training.
e) https://github.com/ShifuML/shifu/issues/85: Add 'epochsPerIteration' parameter in NNWorker.
* Bug fix:
a) https://github.com/ShifuML/shifu/issues/98
b) https://github.com/ShifuML/shifu/issues/92
c) https://github.com/ShifuML/shifu/issues/70
d) https://github.com/ShifuML/shifu/issues/69
e) https://github.com/ShifuML/shifu/issues/67

Changes for Shifu-0.2.4
* https://github.com/ShifuML/shifu/issues/20: Work flow change.
a) Old: new -> init -> stats -> varselect -> normalize -> train -> eval
b) New: new -> init -> stats -> normalize -> varselect -> train -> eval
c) To run variable selection again after a model is trained, the current work flow does not need the normalize step; run the training step directly after variable selection.
* https://github.com/ShifuML/shifu/issues/49: Add distributed sensitivity analysis variable selection.
a) 'varSelect.wrapperEnabled=true' and 'wrapperBy=SE' in ModelConfig.json#varSelect part to enable sensitivity variable selection.
b) 'wrapperRatio' in ModelConfig.json#varSelect part is a percent to set how many variables will be removed.
c) To continue variable selection by sensitivity method, run 'shifu varselect' again.
d) With 20 million of records and 1600 variables, 70 minutes (45 minutes for 200 epoch training and 25 minutes for sensitivity variable selection).
* https://github.com/ShifuML/shifu/issues/38: Improve scalability in stats step.
a) 'binningAlgorithm=SPDT' (default value) in ModelConfig.json#stats is to do variable statistics to improve scalability for big data.
Using SPDT, with 20 million of records and 1600 variables, 50 minutes to finish variable selection.
b) 'binningAlgorithm=MunroPat' in ModelConfig.json#stats is another approach to do variable statistics to improve scalability for big data.
* https://github.com/ShifuML/shifu/issues/58: Improve scalability in eval step for HDFS mode.
a) With 20 million of records and 1600 variables, 20 minutes to finish eval step with only 1GB driver memory.
* https://github.com/ShifuML/shifu/issues/61: Embedded zookeeper server support.
a) No need to set zookeeper servers since the embedded zookeeper server is used when training models.
b) For big data training, an independent zookeeper cluster is strongly recommended.
c) Upgrade Guagua to 0.5.0 to get support from Guagua for this feature.
* Add PMML standard model converter.
a) To convert .nn files into pmml, run "shifu export -t pmml" or just "shifu export" (The pmml is default)
All generated pmml files will be under <Model-Directory>/pmmls/
* Bug fix:
a) https://github.com/ShifuML/shifu/issues/45
b) https://github.com/ShifuML/shifu/issues/51
c) https://github.com/ShifuML/shifu/issues/39
d) https://github.com/ShifuML/shifu/issues/40
e) https://github.com/ShifuML/shifu/issues/45

Changes for Shifu-0.2.0
* Make Shifu support Hadoop-2.0 (add -Phdp-yarn when building)
* Show mapper progress in JobTracker and show progress in CLI when using distributed training
@@ -54,7 +290,6 @@ Changes for Shifu-0.0.4
* TA457512 - Fix the bug: the delimiter of evaluation data doesn't take effect in AKKA mode
* TA458788 - Fix the bug: Meta validation fails to report error when - "NumHiddenNodes" : [ "a", 45 ]
* TA459375 - Write in-place QuickSort to replace Collections.sort() for memory consumption


Changes for Shifu-0.0.3
* TA446629 - Fix the bug: when there is an empty file, shifu in akka mode gets stuck
@@ -79,8 +314,8 @@ Changes for Shifu-0.0.2
* DE29230 - Fix the bugs if the training data path is an HDFS glob path
* DE29231 - Users only need to put the configuration in the local file system
* US201443 - PathFinder refactor, split Manager class into several processes
* US207747 - Add option in ModelConfig for job queue name
* US177973 - Update code license and test data license
* Don't copy data and purify data when running `shifu init`
* Add more comments

52 changes: 40 additions & 12 deletions README.md
@@ -1,30 +1,58 @@
[<img src="images/logo/shifu.png" alt="Shifu" align="left">](http://shifu.ml)<div align="right"><div>[![Build Status](https://travis-ci.org/ShifuML/shifu.svg)](https://travis-ci.org/ShifuML/shifu)</div><div>[![Maven Central](https://maven-badges.herokuapp.com/maven-central/ml.shifu/shifu/badge.svg)](https://maven-badges.herokuapp.com/maven-central/ml.shifu/shifu)</div></div>

#

## Getting Started
## Download
Please [download](https://github.com/ShifuML/shifu/wiki/shifu-0.10.5-hdp-yarn.tar.gz) the latest shifu [here](https://github.com/ShifuML/shifu/wiki/shifu-0.10.5-hdp-yarn.tar.gz).

Please visit [shifu.ml](http://shifu.ml) for download information, installation instructions, and tutorials.
## Getting Started
After downloading shifu, build your first model with the Shifu [tutorial](https://github.com/ShifuML/shifu/wiki/Tutorial---Build-Your-First-ML-Model). More details about shifu can be found in our [wiki pages](https://github.com/ShifuML/shifu/wiki).

## What is Shifu?
Shifu is an open-source, end-to-end machine learning and data mining framework built on top of Hadoop. Shifu is designed for data scientists, simplifying the life-cycle of building machine learning models. While originally built for fraud modeling, Shifu is generalized for many other modeling domains.

One of Shifu's strengths is its end-to-end machine learning modeling pipeline. With only configuration settings, a whole machine learning pipeline can be built, which makes models much easier to develop and push to production. The pipeline defined in Shifu is shown below:

![Shifu Pipeline](https://raw.githubusercontent.com/wiki/ShifuML/shifu/images/new-shifu-pipeline.png)

Shifu provides a simple command-line interface for each step of the model building process, including

* Statistic calculation & variable selection to determine the most predictive variables in your data
* Variable normalization
* Distributed neural network model training
* [Variable normalization](https://github.com/ShifuML/shifu/wiki/Variable%20Transform%20in%20Shifu)
* [Distributed variable selection based on sensitivity analysis](https://github.com/ShifuML/shifu/wiki/Variable%20Selection%20in%20Shifu)
* [Distributed neural network model training](https://github.com/ShifuML/shifu/wiki/Distributed%20Neural%20Network%20Training%20in%20Shifu)
* [Distributed tree ensemble model training](https://github.com/ShifuML/shifu/wiki/Distributed%20Tree%20Ensemble%20Model%20Training%20in%20Shifu)
* Post training analysis & model evaluation

Shifu’s fast Hadoop-based, distributed neural network training can reduce model training time from days to hours on 500GB data sets. Shifu integrates with Pig workflows on Hadoop, and Shifu-trained models can be integrated into production code with a simple Java API. Shifu leverages Pig, Akka, Encog and other open source projects.
Shifu’s fast Hadoop-based, distributed neural network / logistic regression / gradient boosted trees training can reduce model training time from days to hours on TB data sets. Shifu integrates with Pig workflows on Hadoop, and Shifu-trained models can be integrated into production code with a simple Java API. Shifu leverages Pig, Akka, Encog and other open source projects.

[Guagua](https://github.com/ShifuML/guagua), an in-memory iterative computing framework on Hadoop YARN, is developed as a sub-project of Shifu to accelerate training.

More details about shifu can be found in our [wiki pages](https://github.com/ShifuML/shifu/wiki)

## Conference

* [QCON Shanghai 2015](http://2015.qconshanghai.com/presentation/2827) [Slides](http://www.slideshare.net/pengshanzhang/large-scale-machine-learning-at-pay-pal-risk)

* [BDTC Beijing 2016](http://bdtc2016.hadooper.cn/dct/page/70107)

* [Strata Beijing 2017](https://strata.oreilly.com.cn/strata-cn/public/schedule/detail/59593?locale=en)

## Contributors

- Zhanghao Hu
- Grahame Jastrebski
- Lavar Li
- Mark Liu
- David Zhang
- Xin Zhong
- Zhanghao Hu (zhanhu@paypal.com)
- Grahame Jastrebski (gjastrebski@paypal.com)
- Lavar Li (lulli@paypal.com)
- Mark Liu (yliu15@paypal.com)
- David Zhang (pengzhang@paypal.com)
- Xin Zhong (xinzhong@paypal.com)
- Simon Zhang (jzhang13@paypal.com)
- Sharma Nitin (nsharma1@paypal.com)

## Google Group

Please join the [Shifu group](https://groups.google.com/forum/#!forum/shifuml) if you have questions, bug reports, or anything else to discuss.

## Copyright and License

Copyright 2012-2014, eBay Software Foundation under [the Apache License](LICENSE.txt).
Copyright 2012-2017, PayPal Software Foundation under the [Apache License](LICENSE.txt).