When fitting models with a large number of variables, Lasso.jl and GLMNet return different paths, and the difference grows as the number of variables increases.
An example to illustrate this:
```julia
using Lasso, GLMNet, Statistics

# Fits identical models in Lasso and GLMNet from mock data
# and returns the mean absolute difference of the betas of both models.
function lasso_glmnet_dif(nrow, ncol, n_col_contributing)
    data = rand(nrow, ncol)
    # per-row mean of the contributing columns (dims = 2, one value per record)
    outcome = mean(data[:, 1:n_col_contributing], dims = 2)[:, 1] .> rand(nrow)
    presence_matrix = [1 .- outcome outcome]
    l = Lasso.fit(LassoPath, data, outcome, Binomial())
    g = GLMNet.glmnet(data, presence_matrix, Binomial())
    lcoefs = Vector(l.coefs[:, end])
    gcoefs = g.betas[:, end]
    mean(abs, lcoefs .- gcoefs)
end

# 1000 records, 5 variables that all contribute to the outcome
lasso_glmnet_dif(1000, 5, 5)    # order of magnitude 1e-9

# 1000 records, 1000 variables of which 5 contribute to the outcome
lasso_glmnet_dif(1000, 1000, 5) # around 0.05
```
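One thing I have not ruled out is that the two packages simply build different λ grids and stop at different tolerances, so the "last" coefficients are not taken at the same penalty. A minimal sketch of that check, reusing GLMNet's λ sequence in the Lasso.jl fit (the `λ` keyword for `Lasso.fit`, the `tol` keyword for `GLMNet.glmnet`, and the assumption that the two packages scale λ identically are all untested assumptions on my part):

```julia
using Lasso, GLMNet, Statistics

# Same mock-data setup as above, but both solvers are forced onto the
# λ grid that GLMNet chose, with a tighter GLMNet tolerance, so the
# final betas are compared at (nominally) the same penalty value.
function lasso_glmnet_dif_matched(nrow, ncol, n_col_contributing)
    data = rand(nrow, ncol)
    outcome = mean(data[:, 1:n_col_contributing], dims = 2)[:, 1] .> rand(nrow)
    presence_matrix = [1 .- outcome outcome]
    # assumed API: `tol` tightens GLMNet's coordinate-descent convergence
    g = GLMNet.glmnet(data, presence_matrix, Binomial(); tol = 1e-10)
    # assumed API: `λ` hands Lasso.jl an explicit penalty sequence,
    # here GLMNet's own grid (assumes both packages scale λ the same way)
    l = Lasso.fit(LassoPath, data, outcome, Binomial(); λ = g.lambda)
    mean(abs, Vector(l.coefs[:, end]) .- g.betas[:, end])
end

lasso_glmnet_dif_matched(1000, 1000, 5)
```

If the difference stays around 0.05 even on a shared grid, the discrepancy is in the solvers themselves rather than in the path construction.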
The context for this problem is that I'm working on a Julia implementation of maxnet, where a biggish model matrix is generated (hundreds of columns) and a lasso path is used to select the most important ones.