Skip to content

All project code for the LOCO path high dimensional inference, including glm (logistic + poisson) and gaussian graphical models

Notifications You must be signed in to change notification settings

devcao/LOCOpath_repo

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

40 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

LOCOpath project code repo

This github repo contains all project code for the LOCO(leave-one-covariate-out) path high dimensional inference. Apart from the code listed here, you may also need to install the R package LOCOpath.

What is LOCOpath project

We provide a variable importance/statisical inference framework for LASSO/group LASSO/fused LASSO/glmnet/graphical LASSO regularied linear/generalized linear/gaussian graphical models in lower(p<n)/high dimensional settings(p>=n).

Install the R package

To install, please first install R package devtools and then

devtools::install_github("devcao/LOCOpath")

Road map for the core R code.

NetTS.R: LOCO path statistic calculations for linear/logistic/Poisson regression, backend R package glmnet.

NetResampleTS.R: Bootstrapping and power simulation code based on LOCO path statistic for linear regression, backend R package glmnet.

NetResampleLogisticTS.R: Bootstrapping and power simulation code based on LOCO path statistic for logistic/Poisson regression, backend R package glmnet, gglasso and Logistic_Enet.R .

Logistic_Enet.R: Modified coordinate descent for logistic regression, enabling constraint like beta_1=1 while looping the coordinate descent, backend C++ code lognet.cpp .

graphLASSO.R: All the graphical LASSO code, including our wrapper of glasso package, our own constraint graphical LASSO code and variable screening code, backend R package glasso.

compare_power.R: All the power simulation codes for other method we need to compare, including desparsified LASSO, T/F/Wald test.

Power simulation functions and examples

In the data generating part, we use rho to specify the correlation structure

 #  rho: can be 'equl': compound symmetry with correlation = 0.8
 #              'weak_equl': compound symmetry with correlation = 0.5
 #               positive value:  toeplitz matrix with correlation = rho, the specified value
 #               0: independent case

High-dimensional linear regression

Testing beta_j = 0 or non zero value

Use lars as backend

# simulating power for n = 100, p = 1000, rho = 0 (independent case)
# and testing beta_1 = 0

require(LOCOpath)
n = 100; p = 1000; rho = 0; iter = 500; B = 500;
beta=c(0,rep(1,9),rep(0,p-10);
Path.Resample.Power(n = n, p = p, beta = beta, rho = rho, iter = iter, B = B, setting = 'dep', 
                    which.covariate = 1, betaNull = 0, multiTest = FALSE,  # this line enables testing beta_{which.covariate} = betaNull
                    norm = 'L2.squared', beta.init = 'adaptive',  # this line uses L2 norm and adaptive LASSO as initial estimator
                    parallel = TRUE)  # we set parallel = TRUE, this will enable parallel computing on Mac/Linux machine. May not work on Windows machine.

Use glmnet as backend

# simulating power for n = 100, p = 1000, rho = 0 (independent case)
# and testing beta_1 = 0

require(LOCOpath)
source('NetTS.R')
source('NetResampleTS.R')
n = 100; p = 1000; rho = 0; iter = 500; B = 500; beta=c(0,rep(1,9),rep(0,p-10);
Net.Resample.Power(n = n, p = p, beta = beta, rho=rho, iter = iter, B = B, setting = 'dep', 
                    which.covariate = 1, betaNull = 0, multiTest = FALSE,  # this line enables testing beta_{which.covariate} = betaNull
                    norm = 'L2.squared', beta.init = 'adaptive',  # this line uses L2 norm and adaptive LASSO as initial estimator
                    parallel = TRUE)  # we set parallel = TRUE, this will enable parallel computing on Mac/Linux machine. May not work on Windows machine.

Simultaneous Testing beta_i = 0 and beta_j = 0 and beta_k = 0 or non zero value

# simulating power for n = 100, p = 1000, rho = 0 (independent case)
# and testing beta_1 = 1, beta_10 = 0 and beta_11 = 0 simultaneously

require(LOCOpath)
n = 100; p = 1000; rho = 0; iter = 500; B = 500;
beta = c(1,rep(1,9),rep(0,p-10))
Path.Resample.Power(n = n, p = p, beta = beta, rho=rho, iter = iter, B = B, setting = 'dep',
                              which.covariate = list(c(1,10,11)), betaNull = list(c(1,0,0)), multiTest = TRUE, 
                              # The code above enables testing beta_1 = 1, beta_10 = 0 and beta_11 = 0 simultaneously
                              # multiTest must be set TRUE, which.covariate and betaNull need to have a list of vector as input
                              parallel = TRUE,
                              norm = 'L2.squared', path.method ='lars', beta.init = 'adaptive')

Testing more general hypothesis like D\beta = d

# simulating power for n = 100, p = 1000, rho = 0 (independent case)
# and testing beta_1 = beta_2

require(LOCOpath)
source('NetTS.R')
source('NetResampleTS.R')

n = 100; p = 1000; rho = 0; iter = 500; B = 500;
beta = c(1,rep(1,10),rep(0,p-11))

Path.Resample.Equality.Power(n = n, p = p, beta = beta, rho=rho, iter = iter, B = B, setting = 'dep', 
                             betaNull = 0, # this line specify we are testing beta_1 - beta_2 = 0, can be non-zero value here
                             parallel = TRUE, 
                             norm = 'L2.squared', path.method ='lars', beta.init = 'adaptive')

High-dimensional Logistic/Poisson regression

Testing beta_j = 0

# simulating power for n = 100, p = 1000, rho = 0 (independent case)
# and testing beta_1 = 0 and beta_10 = 0 and beta_11 = 0 simultaneously

require(LOCOpath)
source('NetTS.R')
source('NetResampleTS.R')
source('NetResampleLogisticTS.R')

n = 100; p = 1000; rho = 0; iter = 500; B = 500;
beta = c(0,rep(1,9),rep(0,p-10))

# for Logistic regression
Net.Resample.Logistic.Power(n = n, p = p, beta = beta, intercept = 0.5, rho = rho, iter = iter, B = B, setting = 'dep', 
                            which.covariate = 1, betaNull = 0, multiTest = FALSE, 
                            parallel = TRUE, norm = 'L2.squared', beta.init = 'adaptive')

# for Poisson regression
beta = c(0,rep(1,2),rep(0,p-3))
Net.Resample.Poisson.Power(n = n, p = p, beta = beta, intercept = 2, rho=rho, iter = iter, B = B, setting = 'dep', 
                           which.covariate = 1, betaNull = 0, multiTest = FALSE, 
                           parallel = TRUE, norm = 'L2.squared', beta.init = 'adaptive')

Simultaneous Testing beta_i = 0 and beta_j = 0 and beta_k = 0

# simulating power for n = 100, p = 1000, rho = 0 (independent case)
# testing beta_1 = 0

require(LOCOpath)
source('NetTS.R')
source('NetResampleTS.R')
source('NetResampleLogisticTS.R')

n = 100; p = 1000; rho = 0; iter = 500; B = 500;
beta = c(0,rep(1,9),rep(0,p-10))

# for Logistic regression 
# testing beta_1 = 0, beta_11 = 0 and beta_12 = 0 simultaneously
Net.Resample.Logistic.Power(n = n, p = p, beta = beta, rho=rho, intercept = 0.5, iter = iter, B = B, setting = 'dep',
                              which.covariate = list(c(1,11,12)), betaNull = list(c(0,0,0)), multiTest = TRUE, 
                              parallel = TRUE, norm = 'L2.squared', beta.init = 'adaptive')

# for Poisson regression
testing beta_1 = 0, beta_4 = 0 and beta_5 = 0 simultaneously
beta = c(0,rep(1,2),rep(0,p-3))
Net.Resample.Poisson.Power(n = n, p = p, beta = beta, rho=rho, intercept = 2, iter = iter, B = B, setting = 'dep',
                              which.covariate = list(c(1,4,5)), betaNull = list(c(0,0,0)), multiTest = TRUE, 
                              parallel = TRUE, norm = 'L2.squared', beta.init = 'adaptive')

Testing beta_j = non_zero_value

# simulating power for n = 100, p = 80, rho = 0 (independent case) 
# p = 1000 in this case will cost too much running time
# and testing beta_1 = 1

require(LOCOpath)
source('NetTS.R')
source('NetResampleTS.R')
source('NetResampleLogisticTS.R')
source('Logistic_Enet.R')

n = 100; p = 80; rho = 0; iter = 500; B = 500;
beta = c(1,rep(3,2),rep(0,p-3))

# for Logistic regression
Net.Resample.Logistic.Con.Power(n = n, p = p, beta = beta, rho=rho, iter = iter, B = B, 
                                which.covariate = 1, betaNull = 1, 
                                parallel = TRUE, norm = 'L1', beta.init = 'adaptive')

Variable screening

source("screen_simu.R")
require(mvnfast)
require(LOCOpath)
n = 20; p = 100; 

# LOCO path variable screening 
# logistic_screen_simu returns a list of value L1 and L2. L1: screen rate for L1 norm, L2: screen rate for L2 norm
scrn_rslt = list()
for (beta_1 in 1:5){

  scrn_rslt$rho00 = logistic_screen_simu(n = n, p = p, signal = beta_1, rho = 0, iter = iter)
  scrn_rslt$rho01 = logistic_screen_simu(n = n, p = p, signal = beta_1, rho = 0.1, iter = iter)
  scrn_rslt$rho05 = logistic_screen_simu(n = n, p = p, signal = beta_1, rho = 0.5, iter = iter)
  scrn_rslt$rho09 = logistic_screen_simu(n = n, p = p, signal = beta_1, rho = 0.9, iter = iter)
}

# SIS/Iterated SIS variable screening
# logistic_sis_simu returns a list of value sis and isis. sis: screen rate for sis, isis: screen rate for isis
scrn_rslt = list()
for (beta_1 in 1:5){

  scrn_rslt$rho00 = logistic_sis_simu(n = n, p = p, signal = beta_1, rho = 0, iter = iter)
  scrn_rslt$rho01 = logistic_sis_simu(n = n, p = p, signal = beta_1, rho = 0.1, iter = iter)
  scrn_rslt$rho05 = logistic_sis_simu(n = n, p = p, signal = beta_1, rho = 0.5, iter = iter)
  scrn_rslt$rho09 = logistic_sis_simu(n = n, p = p, signal = beta_1, rho = 0.9, iter = iter)
}

logistic_sis_simu(n = n, p = p, signal = beta_i, rho = rho, iter = iter)

Sparse Gaussian graphical models

Compare the results to glasso

require(glasso)
require(CVglasso)
require(LOCOpath)
source('graphLASSO.R')
source('NetTS.R')

### generate precision matrix
p=20; Theta = diag(rep(1.0,p)); index = 10
for ( i in 1:(p-index)){
  Theta[i, i+index ] = 0.5
  Theta[i+index, i] = Theta[i, i+index ] 
}
### generate data based on precision matrix
Mu=rep(0,p); X=rmvn(n=100, mu = Mu,sigma = Sigma); S=var(X)

### cross validated glasso
a = CVglasso(X=X, S=S)
### our LOCO path statistic
TS_sigma = graph_TS(S = S, n_rho = 50)

### compare the results
par(mfrow=c(1,3))
# we remove the diagonal values
diag(Theta)=0; image_v1(Theta,main='Truth', xaxt='n',yaxt='n',)
diag(a$Omega)=0; image_v1(abs(a$Omega), main = 'glasso', xaxt='n',yaxt='n',)

TS_sd = TS_sigma/sum(TS_sigma, na.rm=TRUE) # normalize
diag(TS_sd) = 0; image_v1(TS_sd, main = 'LOCO path', xaxt='n',yaxt='n')

Results

Variable screening

source('graphLASSO.R')

# for n = 100, p = 50, matrix type = A
n = 100
p = 50
results = simu_graph_screen(n = n, p = p, type = 'A', Iter = 250)
save(results, file = 'type_A_n_100_p_50.RData')

To generate the ROC curve, check out type_AC_n_100.R and type_AC_n_1000.R .

Real data analysis

1st project

Please check the Reproduce the real data analysis section.

2nd project

Please check file logistic_real_data.R.

3rd project

Please check file glasso_real_data.R.

Some code to run on Slurm managed cluster

The cluster code are in directory slurm_cluster_code. To run the power simulation on a slurm managed cluster, check this out. (Not very organized yet)

These 2 files are based on the simulation output from the cluster running jobs.

About

All project code for the LOCO path high dimensional inference, including glm (logistic + poisson) and gaussian graphical models

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published