Divergence from GLMnet when using a matrix with many variables #78

@tiemvanderdeure

Description

When fitting models with a large number of variables, Lasso.jl and GLMNet return different paths, and the difference grows as the number of variables increases.

An example to illustrate this:

using Lasso, GLMNet, Statistics

# fits identical models in Lasso and GLMNet from mock data
# and returns the mean absolute difference of the betas of both models
function lasso_glmnet_dif(nrow, ncol, n_col_contributing)
    data = rand(nrow, ncol)
    # per-row mean of the contributing columns, so those columns drive the outcome
    outcome = mean(data[:, 1:n_col_contributing], dims = 2)[:, 1] .> rand(nrow)
    presence_matrix = [1 .- outcome outcome]

    l = Lasso.fit(LassoPath, data, outcome, Binomial())
    g = GLMNet.glmnet(data, presence_matrix, Binomial())

    lcoefs = Vector(l.coefs[:,end])
    gcoefs = g.betas[:, end]

    mean(abs, lcoefs .- gcoefs)
end

# 1000 records, 5 variables that all contribute to outcome
lasso_glmnet_dif(1000, 5, 5) # order of magnitude 1e-9
# 1000 records, 1000 variables of which 5 contribute to the outcome
lasso_glmnet_dif(1000, 1000, 5) # around 0.05
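One possible confound in the comparison above (not ruled out in this report) is that each library chooses its own λ sequence, so the last path entries may correspond to different penalties. A sketch that pins both fits to the same λ grid, assuming the `λ` keyword of `Lasso.fit` and the `lambda` keyword of `GLMNet.glmnet` as documented in each package:

```julia
using Lasso, GLMNet, Statistics

# Hypothetical variant of lasso_glmnet_dif that fixes a shared, decreasing λ
# grid so both libraries are compared at the same penalty values.
function lasso_glmnet_dif_fixed_lambda(nrow, ncol, n_col_contributing)
    data = rand(nrow, ncol)
    outcome = mean(data[:, 1:n_col_contributing], dims = 2)[:, 1] .> rand(nrow)
    presence_matrix = [1 .- outcome outcome]

    # log-spaced penalties from 0.1 down to 1e-4 (both packages expect a
    # decreasing sequence)
    λs = exp.(range(log(0.1), log(1e-4); length = 50))

    l = Lasso.fit(LassoPath, data, outcome, Binomial(); λ = λs)
    g = GLMNet.glmnet(data, presence_matrix, Binomial(); lambda = λs)

    mean(abs, Vector(l.coefs[:, end]) .- g.betas[:, end])
end
```

Note that either solver may still stop early on a requested grid, so checking that both returned paths actually cover the full λ sequence before comparing would make the comparison stricter.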

The context for this problem is that I'm working on a Julia implementation of maxnet, where a big-ish model matrix is generated (100s of columns) and a lasso path is used to select the most important ones.
