The linear model and extensions

The linear model is arguably one of the most important ideas in statistics, both in its own right and as a component of other models.

Linear regression

US crime data: "The variables seem to have been rescaled to convenient numbers." (M=percentage of males, So=indicator for southern state, Ed=mean years of schooling, Po1=police expenditure in 1960, LF=labour force participation rate)

library(reshape2)
data(UScrime,package="MASS")
USsub <- subset(UScrime,select=c(y,M,So,Ed,Po1,LF))
mUS <- melt(USsub,id.var="y")
g0 <- ggplot(mUS,aes(x=value,y=y))+
  facet_wrap(~variable,scale="free_x")+
      geom_point()
g0 + geom_smooth(method="lm")

plot of chunk unnamed-chunk-1

(Note limitation on ggplot linear models)

(Note difference between marginal [univariate] and multivariate models)

lm1 <- lm(y~.,data=USsub)
library(coefplot2)
coefplot2(lm1)

plot of chunk lm1

summary(lm1)
## 
## Call:
## lm(formula = y ~ ., data = USsub)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
##   -498   -161     37    125    550 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -2835.10     899.00   -3.15    0.003 ** 
## M               9.29       4.15    2.24    0.031 *  
## So            190.09     123.76    1.54    0.132    
## Ed              5.14       5.57    0.92    0.361    
## Po1            10.87       1.57    6.95    2e-08 ***
## LF              1.64       1.21    1.35    0.184    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 257 on 41 degrees of freedom
## Multiple R-squared:  0.606,  Adjusted R-squared:  0.558 
## F-statistic: 12.6 on 5 and 41 DF,  p-value: 1.88e-07
lm2 <- update(lm1,data=as.data.frame(scale(USsub)))
coefplot2(lm2)

plot of chunk lm2

Polynomial regression:

g0 + geom_smooth(method="lm",formula=y~poly(x,2))

plot of chunk unnamed-chunk-2

lm2P <- update(lm2,.~.^2)
coefplot2(lm2P)

plot of chunk lm2P

ANOVA

Generalized linear models

YF(g1(μ),ϕ)μ=Xβ where F is an exponential family probability distribution (e.g. binomial, Poisson, Gamma) with a known mean-variance relationship; g is a link function (log, logit, probit ...) * logistic regression is (by far) the most common, followed by Poisson regression

These data were scraped from Google Scholar hits on the relevant search terms.

plot of chunk scrapepix

dat <- read.csv("data/dufemalepers.csv")
dat <- transform(dat,
                 tot=du+notdu,
                 duprop=du/(du+notdu))
ggplot(dat,aes(x=density,y=duprop))+geom_point(aes(size=tot),alpha=0.5)+
    geom_smooth(data=subset(dat,duprop>0),
                method="glm",formula=y~I(1/x),aes(weight=tot),
                family=quasibinomial(link="inverse"),
                fullrange=TRUE)

plot of chunk tiwari

Generalized additive models

ggplot(dat,aes(x=density,y=duprop))+geom_point(aes(size=tot),alpha=0.5)+
    geom_smooth(method="gam",
                aes(weight=tot),
                family=quasibinomial,
                fullrange=TRUE)

plot of chunk unnamed-chunk-3

(GAM doesn't actually do very well here - might possibly be able to pin it at zero, but that would be difficult ...)

mquake <- melt(quakes,id.var="mag")
ggplot(mquake,aes(x=value,y=mag))+
    facet_grid(.~variable,scale="free_x")+
  stat_sum(alpha=0.5,aes(size=..n..))+
  geom_smooth(method="gam")

plot of chunk unnamed-chunk-4

ggplot(quakes,aes(x=long,y=lat))+
  geom_point(aes(size=mag,colour=depth),alpha=0.5)

plot of chunk unnamed-chunk-5

Generalized least squares (correlation and heteroscedasticity)

YMVN(μ,Σ)μ=XβΣ=f(θ) Σ is the variance-covariance matrix of the residuals.

Nonlinear least squares

Mixed models

Quantile regression

*

ggplot(dat,aes(x=density,y=duprop))+geom_point(aes(size=tot),alpha=0.5)+
stat_quantile(formula=y~sqrt(x))

plot of chunk unnamed-chunk-6

Penalized regression

Even more: mix and match

hastie1

hastie2

(Hastie, Tibshirani, and Friedman 2009)

library(glmnet)
resp <- as.matrix(subset(UScrime,select=-c(y,So)))
g1 <- glmnet(resp,UScrime$y,alpha=1)
par(las=1,bty="l")
plot(g1,ylim=c(-10,20))

plot of chunk unnamed-chunk-7

(this is a bad example: penalized regression actually doesn't like correlated predictors!)

Hastie, Trevor, Robert Tibshirani, and J. H. Friedman. 2009. The elements of statistical learning data mining, inference, and prediction. New York: Springer. http://public.eblib.com/EBLPublic/PublicView.do?ptiID=437866.

O’Hara, Robert B., and D. Johan Kotze. 2010. “Do not log-transform count data.” Methods in Ecology and Evolution 1 (2) (jun): 118–122. doi:10.1111/j.2041-210X.2010.00021.x. http://onlinelibrary.wiley.com/doi/10.1111/j.2041-210X.2010.00021.x/abstract.

Schielzeth, Holger. 2010. “Simple means to improve the interpretability of regression coefficients.” Methods in Ecology and Evolution 1: 103–113. doi:10.1111/j.2041-210X.2010.00012.x. http://dx.doi.org/10.1111/j.2041-210X.2010.00012.x.

Venables, W. N. 1998. “Exegeses on Linear Models.” In . Washington, DC. http://www.stats.ox.ac.uk/pub/MASS3/Exegeses.pdf.

Warton, David I., and Francis K. C. Hui. 2011. “The arcsine is asinine: the analysis of proportions in ecology.” Ecology 92 (jan): 3–10. doi:10.1890/10-0340.1. http://www.esajournals.org/doi/full/10.1890/10-0340.1.

Wood, Simon N. 2006. Generalized Additive Models: An Introduction with R. Chapman & Hall/CRC.