\documentclass{article} \usepackage{sober} \usepackage[english]{babel} \usepackage[utf8]{inputenc} \usepackage{hyperref} \usepackage{fancyvrb} \usepackage{fullpage} \VerbatimFootnotes \usepackage{amsmath} \usepackage{natbib} \usepackage{color} \usepackage{times} \usepackage{paralist} \bibliographystyle{chicago} \newcommand{\ggp}{\code{ggplot}} \title{\ggp\ for disease ecologists} \author{Jennie Lavine (\code{jslavine@umich.edu}) and Ben Bolker (\code{bolker@mcmaster.ca})} \date{\today} \newcommand{\code}[1]{{\tt #1}} \newcommand{\R}{R} %% nothing fancy \newcommand{\exercise}{\textbf{Exercise}} \newcommand{\bbnote}[1]{\color{red} \small #1 \color{black}} \begin{document} \maketitle \includegraphics[width=2.64cm,height=0.93cm]{cc-attrib-nc.png} \begin{minipage}[b]{3in} {\tiny Licensed under the Creative Commons attribution-noncommercial license (\url{http://creativecommons.org/licenses/by-nc/3.0/}). Please share \& remix noncommercially, mentioning its origin.} \end{minipage} \SweaveOpts{fig.width=5,fig.height=5,out.width="0.7\\textwidth", tidy=FALSE,fig.align="center",fig.path="./figs/", warning=FALSE, message=FALSE} \section{Introduction} \href{http://had.co.nz/ggplot2/}{ggplot2} is an R package by Hadley Wickham \citep{wickham_ggplot2}, based on \emph{The Grammar of Graphics} by Leland Wilkinson \citep{wilkinson_grammar_1999}, that attempts to systematize the construction of exploratory and statistical graphics by defining a system of mappings between variables in the data and ``aesthetics'' --- characteristics of the plot such as $x$ and $y$ coordinates, colour, point size, line width, etc.. The basic structure of a \ggp\ command is that you define the overall plot by specifying a data set, always in the form of a data frame, a set of default mappings, and then \emph{add} so-called ``geoms'' to add objects of layers to the plot. Each layer contains: \begin{enumerate} \item a description of the 'geometric object' (\code{geom}) you want to produce, such as points, bars, lines, etc. \item a statistical transformation (\code{stat}), which could be the binned data to make a histogram, a summary statistic such as the mean of groups with error bars, a smoothed curve from e.g. a linear model summarizing a relationship between two variables, etc. \item a scale (\code{scale}) to map quantities in the data to visual aspects of the plot, such as a color scale, different shaped points for different groups, size of points determined by some characteristic in the data, etc. \end{enumerate} Before you begin, make sure you have the \ggp, \code{plyr}, and \code{reshape2} packages installed, and download the \code{gophertortoise.txt} and \code{malaria.csv} data files from \url{http://www.math.mcmaster.ca/bolker/eeid/private} and put them in your working directory (the username and password will be available in class). \section{\ggp\ objects} Now we look at a dataset on seroprevalence by size in tortoises. We read in data, take a quick look at the distribution of sizes, and familiarize ourselves with the concept of layers in \ggp. Create a \ggp\ object called \code{hist.size}. This object does not have any layers in it, so there is nothing to plot. You can take a look at the object by typing \code{names(hist.size)}. Then make a histogram by adding a \code{geom} to it. <>= require(ggplot2) ## require() and library() are similar require(plyr) dat <- read.table("gophertortoise.txt",header=TRUE) head(dat, 3) hist.size <- ggplot(dat, aes(x=size)) hist.size+geom_histogram(binwidth=10)+facet_grid(.~YEAR) @ To make a histogram of densities instead of counts, add the argument \code{aes(y=..density..)} to \code{geom\_histogram}. \section{Looking at binary data} Now we want to see how serological status is related to size. We \begin{inparaenum} \item create a \ggp\ object called \code{plot1} with size mapped to the $x$ axis and serological status to the $y$ axis, \item change the default theme to black and white. \item plot the binary data as points using transparency (\code{alpha=0.2}) with a black and white background (because the transparency (\code{alpha}) is specified as a fixed value rather than depending on a variable in our data set, we put it on its own, not inside an \code{aes} statement). \end{inparaenum} <>= plot1 <- ggplot(dat, aes(size, status)) theme_update(theme_bw()) ##using transparency to show overlapping points (plot1a <- plot1 + geom_point(alpha=0.2)) @ Perhaps it would be easier to look at the data as proportions rather than counts. When you download \ggp, you will also download two useful packages for manipulating data objects: \code{plyr} and \code{reshape2}. Here we use the function \code{ddply} to take the dataframe \code{dat}, split it by \code{size}, apply a function to each subset, then reshape the results into a new dataframe, which we store in the object \code{sizetab}. If you are familiar with \code{tapply}, \code{sapply}, etc., the input will feel familiar to you. (If not, don't worry about it too much.) <>= sizetab <- ddply(dat,"size", function(x) c(tot=nrow(x),pos=sum(x$status))) head(sizetab,3) @ Now we render a basic plot, with just the barebones: data, an aesthetic saying what values get mapped to the x and y axes, and the type of geometric object <>= ggplot(data=sizetab,aes(x=size,y=pos/tot))+geom_point() @ Now (1) make the size of the points scale with the number of individuals represented by each data point (\code{tot}) and (2) make the points partway transparent. <>= (g1 <- ggplot(sizetab,aes(x=size,y=pos/tot,size=tot))+geom_point(alpha=0.5)) @ Now we will add two more layers, one to add a horizontal line specifying the overall probability and the other to change the y labels (we use \code{with} as a quick way to avoid using to many \verb+$+ to refer to variables inside the \code{sizetab} data frame). Once again, since \code{yintercept} is an absolute value and not a mapping between a variable and an aesthetic, we don't put it inside \code{aes()}. <>= prob0 <- with(sizetab,sum(pos)/sum(tot)) g1+geom_hline(yintercept=prob0,width=2,colour=2)+ ylab("Fraction/probability seropositive") @ We can visualize the proportions a different way using \code{stat\_sum}. To do this, first we bin the data into size groups <>= cut.ind <- seq(from=min(dat$size)-1, to=max(dat$size)+1, length=25) dat$size.cut <- cut(dat$size, cut.ind, labels=head(cut.ind,-1)+diff(cut.ind)/2) dat$size.cut <- as.numeric(as.character(dat$size.cut)) @ Next look at the proportion of positive and negative serostatuses in each one. The size of the dot represents the proportion of tortoises in size class $X$ with serostatus $Y$. <>= plot2 <- ggplot(dat, aes(size.cut, status)) plot2a <- plot2+stat_sum(aes(group=size.cut)) (plot2b <- plot2a+aes(colour=..n..)) @ Similarly to our density-based histogram above, we use the weird-looking variable \code{..n..} to refer to a variable that is calculated internally by \ggp\ --- in this case we want to plot the number of co-occurring data values at each point, rather than the (default) proportion of values. Add a layer to the plot showing the result of a logistic regression. %% took out GAM fit; it collapsed down to a GLM fit anyway <>= (plot2c <- plot2a + geom_smooth(method="glm", family="binomial",colour="red")) @ Now fit a separate curve to the data from each sex category (male, female, juvenile). <>= (plot2d <- plot2a + geom_smooth(aes(group=Sex, colour=Sex), method='glm', family='binomial')) @ We can do the same thing by year instead of sex, but it gets very confusing --- try it for yourself. A better choice may be to plot the data from and curve from each year in a separate panel, using \code{facet\_grid}. <>= (plot2e <- plot2a + geom_smooth(aes(group=YEAR),method='glm', family='binomial')+ facet_grid(YEAR~.)) @ We can add the same sort of curve to the proportion data we looked at before with the following: <>= g1+geom_smooth(method="glm",family=binomial, aes(weight=tot), size=1, colour="red") @ (We use \code{size=1} here because we previously mapped \code{size} to the \code{tot} variable, and we want to override/ignore it for the \verb+geom_smooth+ addition.) Now we take a quick look at a very different sort of dataset. As has been mentioned, these data are not to be freely used after the workshop, but it is a very cool within-host dataset from Andrew Read's lab on malaria parasite and red blood cell densities. Again, we read in data, modify it a little and then make a first fairly simple plot. We look at the trajectory through time for each mouse, (\code{group=mouse}), and we differentiate two blocks of experiments by color. <>= alldat <- read.csv('malaria.csv', colClasses=c(rep('factor',6),rep('numeric',3))) infec.dat <- alldat[alldat$strain!='con'&alldat$day<20,] f=ggplot(infec.dat, aes(day, pathogen)) (f1=f+geom_line(aes(group=mouse, color=block))) @ That plot is not so useful because there is so much data! We can use \code{facet\_grid} to split up the data by dose and strain: <>= f1+facet_grid(dose~strain) @ Finally, perhaps we are not interested in individual trajectories, but instead how the distributions change over the course of the experiment. Let's use boxplots to look at that. <>= f+ facet_grid(dose~strain)+geom_boxplot(aes(group=day)) @ (this graph may be a little slow). Other resources: \begin{itemize} \item The \href{http://had.co.nz/ggplot2/}{ggplot2 web page} has a list of the available components of \ggp\ graphs, and there is a \href{https://github.com/hadley/ggplot2/wiki}{wiki} \item There is an active \href{http://groups.google.com/group/ggplot2}{ggplot Google group}; a lot of answered questions on \href{http://stackoverflow.com/questions/tagged/r+ggplot}{StackOverflow}; and a big variety of blog posts on the \href{http://www.r-bloggers.com/?s=ggplot}{R bloggers} site \item If you use \code{lattice} graphics, \href{http://had.co.nz/ggplot/vs-lattice.html}{this link} and \href{http://learnr.wordpress.com/2009/06/28/ggplot2-version-of-figures-in-lattice-multivariate-data-visualization-with-r-part-1/}{this one} are both useful \item If you get really into \ggp\ you can buy \href{http://amzn.com/0387981403}{the author's book}; it's a little expensive for a paperback, but hey, the software was free! \end{itemize} \bibliography{../eeid} \end{document}