First I will list a couple of websites. Then I will list a couple of papers. In my comments on the papers are some suggestions about how to use the lasso in practice, using library(glmnet) in R. The same R practices apply to ridge regression and elasticnet.

You can get a quick idea of how and why the lasso works from the following two websites:

https://stats.stackexchange.com/questions/74542/why-does-the-lasso-provide-variable-selection
This gives a nice "calculus" explanation of why the lasso forces some coefficients to be zero if lambda is large enough, using simple regression (y = b0 + b1 x + epsilon); for the explanation when p>1, the Tibshirani paper below is useful.

https://newonlinecourses.science.psu.edu/stat508/lesson/5/5.4
This page gives a nice overview of the lasso, and explains a little about how the geometry of the L1 penalty for the lasso (vs. L2 for ridge regression) forces some coefficients to be zero. It is more intuitive than the previous website, but also more in line with the mathematics (Tibshirani paper below) that is actually needed when p>1. The webpage also gives some hints about how one could construct standard errors for the beta-hats estimated by the lasso (Park & Casella's (2008) "Bayesian lasso" seems best to me, although a complete recipe is not given), and an extension called the "group lasso" (Yuan & Lin, 2007) which could work with categorical variables.

You can learn much more by googling
  * why does the lasso work
  * how does the lasso work

The basic papers/books on the lasso are here:

Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58, 1, 267-288.
The "original paper". A basic idea, which I did not go into in class, is that predictions using beta-hats from the lasso can have lower MSE for predicting new observations than predictions using least-squares beta-hats.

https://web.stanford.edu/~hastie/glmnet/glmnet_alpha.html
A quick tutorial on the glmnet package.

If you are more interested in variable selection per se (because you want a few good models to discuss with a client or collaborator in order to choose the "scientifically best one"), then the approach in the class notes is better, i.e. just use a few different values of lambda to select a few different models to discuss with your collaborator. Using library(glmnet) you might do something like this:

  lasso.fits <- glmnet(Xmatrix, Yvector, alpha=1)  # alpha=1 for lasso
  plot(lasso.fits, xvar="lambda")                  # to get a visualization
  abline(h=0, lty=2)                               # helpful guideline
  coef(lasso.fits)                                 # coefficients for "all" values of lambda
  coef(lasso.fits, s=30)                           # coefficients at lambda=s (30, in this case)
  predict(lasso.fits, s=30, newx=Xmatrix.new)      # predictions based on beta-hats at lambda=s

If you are more interested in prediction, then using the beta-hats from the lasso is definitely better. In that case you want the "single best lambda", which you can choose with cross-validation. Using library(glmnet) you might do something like this:

  cv.choice <- cv.glmnet(Xmatrix, Yvector, alpha=1)  # alpha=1 for lasso
  plot(cv.choice)                                    # to get a visualization
  cv.choice$lambda.min                               # the lambda that minimized the cv MSE
  cv.choice$lambda.1se                               # the largest lambda whose cv MSE is within
                                                     #   1 se of the minimum (this choice may
                                                     #   guard against overfitting)
  coef(cv.choice, s=30)                              # beta-hats at lambda=s (s=30 here)
  predict(cv.choice, newx=Xmatrix.new)               # predictions using the best lasso beta-hats
                                                     #   (by default, at lambda.1se)
  predict(cv.choice, s=30, newx=Xmatrix.new)         # predictions using beta-hats at lambda=s
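One practical note: glmnet() and cv.glmnet() want a numeric predictor matrix rather than a formula and data frame. Below is a minimal sketch of how you might build Xmatrix and Xmatrix.new (including dummy coding of factors) before running the code above; the data frame and column names (mydata, newdata, y) are made up for illustration.

  library(glmnet)
  # Hypothetical data frames: 'mydata' holds the response 'y' plus the predictors
  # (possibly including factors); 'newdata' holds the same predictor columns.
  Xmatrix     <- model.matrix(y ~ ., data = mydata)[, -1]   # drop the intercept column
  Yvector     <- mydata$y
  Xmatrix.new <- model.matrix(~ ., data = newdata)[, -1]    # assumes newdata's factors have the
                                                            #   same levels as mydata's
  cv.choice   <- cv.glmnet(Xmatrix, Yvector, alpha = 1)     # lasso, lambda chosen by cross-validation
  coef(cv.choice, s = "lambda.1se")                         # sparse beta-hats at lambda.1se
  predict(cv.choice, newx = Xmatrix.new, s = "lambda.min")  # predictions at lambda.min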
Zou, H. and Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society, Series B, 67, 2, 301-320.
Introduces the "elasticnet", which combines the lasso and ridge penalties (for glmnet, alpha between 0 and 1). For problems with high collinearity among the X's (and even when p>n), elasticnet can produce lower prediction MSE than the lasso. (A short glmnet sketch for elasticnet and a lasso-penalized logistic regression is given after the references below.)

Friedman, J., Hastie, T., and Tibshirani, R. (2010). Regularization paths for generalized linear models via coordinate descent. Journal of Statistical Software, 33, 1. https://www.jstatsoft.org/v33/i01
Lasso and related ideas for generalized linear models (logistic regression, Poisson regression, etc.). Basically all about the guts and uses of library(glmnet).

Hastie, T., Tibshirani, R., and Friedman, J. (2008). The Elements of Statistical Learning, 2nd edition. Springer, New York.
Great book. You can get a free PDF from one of the authors' websites (I think it's Hastie's, but I forget...).

--------------------------------------------------------------
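As promised above, here is a small sketch tying the Zou & Hastie and Friedman et al. papers back to library(glmnet): an elasticnet fit (alpha between 0 and 1) and a lasso-penalized logistic regression (family="binomial"). Xmatrix, Yvector, and Xmatrix.new are as in the code earlier; 'Ybinary' is a hypothetical 0/1 response vector added for illustration.

  library(glmnet)
  # Elasticnet: alpha between 0 (ridge) and 1 (lasso); alpha = 0.5 is just an illustrative value
  enet.cv <- cv.glmnet(Xmatrix, Yvector, alpha = 0.5)
  coef(enet.cv, s = "lambda.1se")                      # beta-hats at the 1-se lambda
  # Lasso-penalized logistic regression, as in Friedman, Hastie & Tibshirani (2010);
  # 'Ybinary' is a hypothetical 0/1 response of the same length as Yvector
  logit.cv <- cv.glmnet(Xmatrix, Ybinary, family = "binomial", alpha = 1)
  predict(logit.cv, newx = Xmatrix.new, s = "lambda.min", type = "response")  # predicted probabilities

Note that cv.glmnet() only cross-validates over lambda; if you want to tune alpha as well, you have to compare a few alpha values yourself (e.g. reusing the same folds via the foldid argument).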