The linear model is arguably one of the most important ideas in statistics, both in its own right and as a component of other models.
model.matrix()
to create the design matrix by specifying the formula and the datay~x
(or y~1+x
)US crime data: "The variables seem to have been rescaled to convenient numbers." (M=percentage of males, So=indicator for southern state, Ed=mean years of schooling, Po1=police expenditure in 1960, LF=labour force participation rate)
library(reshape2)
data(UScrime,package="MASS")
USsub <- subset(UScrime,select=c(y,M,So,Ed,Po1,LF))
mUS <- melt(USsub,id.var="y")
g0 <- ggplot(mUS,aes(x=value,y=y))+
facet_wrap(~variable,scale="free_x")+
geom_point()
g0 + geom_smooth(method="lm")
(Note limitation on ggplot
linear models)
(Note difference between marginal [univariate] and multivariate models)
lm1 <- lm(y~.,data=USsub)
library(coefplot2)
coefplot2(lm1)
summary(lm1)
##
## Call:
## lm(formula = y ~ ., data = USsub)
##
## Residuals:
## Min 1Q Median 3Q Max
## -498 -161 37 125 550
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -2835.10 899.00 -3.15 0.003 **
## M 9.29 4.15 2.24 0.031 *
## So 190.09 123.76 1.54 0.132
## Ed 5.14 5.57 0.92 0.361
## Po1 10.87 1.57 6.95 2e-08 ***
## LF 1.64 1.21 1.35 0.184
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 257 on 41 degrees of freedom
## Multiple R-squared: 0.606, Adjusted R-squared: 0.558
## F-statistic: 12.6 on 5 and 41 DF, p-value: 1.88e-07
lm2 <- update(lm1,data=as.data.frame(scale(USsub)))
coefplot2(lm2)
y~x+I(x^2)
(or ~poly(x,2)
[orthogonal polynomial] or ~poly(x,2,raw=TRUE)
)poly()
we have a single input variable , but multiple predictor variables (, , ...))g0 + geom_smooth(method="lm",formula=y~poly(x,2))
lm2P <- update(lm2,.~.^2)
coefplot2(lm2P)
y~f
where is an exponential family probability distribution (e.g. binomial, Poisson, Gamma) with a known mean-variance relationship; is a link function (log, logit, probit ...) * logistic regression is (by far) the most common, followed by Poisson regression
These data were scraped from Google Scholar hits on the relevant search terms.
dat <- read.csv("data/dufemalepers.csv")
dat <- transform(dat,
tot=du+notdu,
duprop=du/(du+notdu))
ggplot(dat,aes(x=density,y=duprop))+geom_point(aes(size=tot),alpha=0.5)+
geom_smooth(data=subset(dat,duprop>0),
method="glm",formula=y~I(1/x),aes(weight=tot),
family=quasibinomial(link="inverse"),
fullrange=TRUE)
splines
package: ns
, bs
, periodicSpline
; specify input variable and number of knots (knot placement is done automatically)mgcv
package); use lots of knots, shrink via penalization (generalized cross-validation)ggplot(dat,aes(x=density,y=duprop))+geom_point(aes(size=tot),alpha=0.5)+
geom_smooth(method="gam",
aes(weight=tot),
family=quasibinomial,
fullrange=TRUE)
(GAM doesn't actually do very well here - might possibly be able to pin it at zero, but that would be difficult ...)
mquake <- melt(quakes,id.var="mag")
ggplot(mquake,aes(x=value,y=mag))+
facet_grid(.~variable,scale="free_x")+
stat_sum(alpha=0.5,aes(size=..n..))+
geom_smooth(method="gam")
ggplot(quakes,aes(x=long,y=lat))+
geom_point(aes(size=mag,colour=depth),alpha=0.5)
is the variance-covariance matrix of the residuals.
f()
can be anything, but chosen to be positivecorCAR1
=exponential decay)ramps
package)gls
*
ggplot(dat,aes(x=density,y=duprop))+geom_point(aes(size=tot),alpha=0.5)+
stat_quantile(formula=y~sqrt(x))
(Hastie, Tibshirani, and Friedman 2009)
library(glmnet)
resp <- as.matrix(subset(UScrime,select=-c(y,So)))
g1 <- glmnet(resp,UScrime$y,alpha=1)
par(las=1,bty="l")
plot(g1,ylim=c(-10,20))
(this is a bad example: penalized regression actually doesn't like correlated predictors!)
Hastie, Trevor, Robert Tibshirani, and J. H. Friedman. 2009. The elements of statistical learning data mining, inference, and prediction. New York: Springer. http://public.eblib.com/EBLPublic/PublicView.do?ptiID=437866.
O’Hara, Robert B., and D. Johan Kotze. 2010. “Do not log-transform count data.” Methods in Ecology and Evolution 1 (2) (jun): 118–122. doi:10.1111/j.2041-210X.2010.00021.x. http://onlinelibrary.wiley.com/doi/10.1111/j.2041-210X.2010.00021.x/abstract.
Schielzeth, Holger. 2010. “Simple means to improve the interpretability of regression coefficients.” Methods in Ecology and Evolution 1: 103–113. doi:10.1111/j.2041-210X.2010.00012.x. http://dx.doi.org/10.1111/j.2041-210X.2010.00012.x.
Venables, W. N. 1998. “Exegeses on Linear Models.” In . Washington, DC. http://www.stats.ox.ac.uk/pub/MASS3/Exegeses.pdf.
Warton, David I., and Francis K. C. Hui. 2011. “The arcsine is asinine: the analysis of proportions in ecology.” Ecology 92 (jan): 3–10. doi:10.1890/10-0340.1. http://www.esajournals.org/doi/full/10.1890/10-0340.1.
Wood, Simon N. 2006. Generalized Additive Models: An Introduction with R. Chapman & Hall/CRC.