\documentclass{article}
\usepackage{sober}
\usepackage[english]{babel}
\usepackage[utf8]{inputenc}
\usepackage{hyperref}
\usepackage{fancyvrb}
\usepackage{fullpage}
\VerbatimFootnotes
\usepackage{amsmath}
\usepackage{natbib}
\usepackage{color}
\usepackage{times}
\usepackage{paralist}
\bibliographystyle{chicago}

\newcommand{\ggp}{\code{ggplot}}
\title{\ggp\ for disease ecologists}
\author{Jennie Lavine (\code{jslavine@umich.edu}) and Ben Bolker (\code{bolker@mcmaster.ca})}
\date{\today}
\newcommand{\code}[1]{{\tt #1}}
\newcommand{\R}{R}  %% nothing fancy
\newcommand{\exercise}{\textbf{Exercise}}
\newcommand{\bbnote}[1]{\color{red} \small #1 \color{black}}

\begin{document}
\maketitle

\includegraphics[width=2.64cm,height=0.93cm]{cc-attrib-nc.png}
\begin{minipage}[b]{3in}
{\tiny Licensed under the Creative Commons 
  attribution-noncommercial license
(\url{http://creativecommons.org/licenses/by-nc/3.0/}).
Please share \& remix noncommercially,
mentioning its origin.}
\end{minipage}


\SweaveOpts{fig.width=5,fig.height=5,out.width="0.7\\textwidth",
  tidy=FALSE,fig.align="center",fig.path="./figs/", warning=FALSE, message=FALSE}

\section{Introduction}
\href{http://had.co.nz/ggplot2/}{ggplot2} is an R package by Hadley Wickham
\citep{wickham_ggplot2},
based on \emph{The Grammar of Graphics} by Leland Wilkinson
\citep{wilkinson_grammar_1999},
that attempts to systematize the construction of exploratory
and statistical graphics by defining a system of mappings
between variables in the data and ``aesthetics'' --- characteristics
of the plot such as $x$ and $y$ coordinates, colour, point size,
line width, etc..

The basic structure of a \ggp\ command is that you define
the overall plot by specifying a data set, always in the form of a data frame, a set of default
mappings, and then \emph{add} so-called ``geoms'' to add objects
of layers to the plot.

Each layer contains: \begin{enumerate}
\item a description of the 'geometric object' (\code{geom}) you want to produce, such as points, bars, lines, etc.
\item a statistical transformation (\code{stat}), which could be the binned data to make a histogram, a summary statistic such as the mean of groups with error bars, a smoothed curve from e.g. a linear model summarizing a relationship between two variables, etc.
\item a scale (\code{scale}) to map quantities in the data to visual aspects of the plot, such as a color scale, different shaped points for different groups, size of points determined by some characteristic in the data, etc.
  \end{enumerate}

Before you begin, make sure you have the
\ggp, \code{plyr}, and \code{reshape2} packages installed,
and download the \code{gophertortoise.txt} and \code{malaria.csv} data files from
\url{http://www.math.mcmaster.ca/bolker/eeid/private} and put
them in your working directory (the username and password 
will be available in class).

\section{\ggp\ objects}

Now we look at a dataset on seroprevalence by size in tortoises.  We read in data, take a quick look at the distribution of sizes, and familiarize ourselves with the concept of layers in \ggp.  Create a \ggp\ object called \code{hist.size}.  This object does not have any layers in it, so there is nothing to plot.  You can take a look at the object by typing \code{names(hist.size)}. Then make a histogram by adding a \code{geom} to it.

<<readdat, fig.height=3, fig.width=7.5>>=
require(ggplot2)  ## require() and library() are similar
require(plyr)
dat <- read.table("gophertortoise.txt",header=TRUE)
head(dat, 3)

hist.size <- ggplot(dat, aes(x=size))
hist.size+geom_histogram(binwidth=10)+facet_grid(.~YEAR)
@ 

To make a histogram of densities instead of counts, add the argument \code{aes(y=..density..)} to \code{geom\_histogram}.

\section{Looking at binary data}
Now we want to see how serological status is related to size.  We 
\begin{inparaenum}
\item create a \ggp\ object called \code{plot1} with size mapped to the $x$ axis and serological status to the $y$ axis,
\item change the default theme to black and white.
\item plot the binary data as points using transparency (\code{alpha=0.2}) with a black and white background
(because the transparency (\code{alpha}) is specified as a fixed value
rather than depending on a variable in our data set, 
we put it on its own, not inside an \code{aes} statement).
\end{inparaenum}

<<plot1,fig.height=3, fig.width=6>>=
plot1 <- ggplot(dat, aes(size, status))
theme_update(theme_bw())
##using transparency to show overlapping points
(plot1a <- plot1 + geom_point(alpha=0.2)) 
@

Perhaps it would be easier to look at the data as proportions rather than counts.
 When you download \ggp, you will also download two useful packages for manipulating data objects: \code{plyr} and \code{reshape2}.  Here we use the function \code{ddply} to take the dataframe \code{dat}, split it by \code{size}, apply a function to each subset, then reshape the results into a new dataframe, which we store in the object \code{sizetab}.  If you are familiar with \code{tapply}, \code{sapply}, etc., the input will feel familiar to you. (If not, don't worry about it too much.)

<<sizetab>>=
sizetab <- ddply(dat,"size",
                function(x) c(tot=nrow(x),pos=sum(x$status)))
head(sizetab,3)
@ 

Now we render a basic plot, with just the barebones: data, an aesthetic saying what values get mapped to the x and y axes, and the type of geometric object
<<basicplot,fig.height=4>>=
ggplot(data=sizetab,aes(x=size,y=pos/tot))+geom_point()
@ 

Now (1) make the size of the points scale with the number of individuals
represented by each data point (\code{tot}) and (2) make the points partway
transparent.
<<ptscale,fig.height=4>>=
(g1 <- ggplot(sizetab,aes(x=size,y=pos/tot,size=tot))+geom_point(alpha=0.5))
@ 

Now we will add two more layers, one to add a horizontal line specifying the
overall probability and the other to change the y labels (we use \code{with}
as a quick way to avoid using to many \verb+$+ to refer to variables
inside the \code{sizetab} data frame).  Once again, since \code{yintercept}
is an absolute value and not a mapping between a variable and an
aesthetic, we don't put it inside \code{aes()}.
<<addlayers,fig.height=4>>=
prob0 <- with(sizetab,sum(pos)/sum(tot))
g1+geom_hline(yintercept=prob0,width=2,colour=2)+
    ylab("Fraction/probability seropositive")
@ 

We can visualize the proportions a different way using \code{stat\_sum}.  To do this, first we bin the data into size groups

<<cutdata>>=
cut.ind <- seq(from=min(dat$size)-1, to=max(dat$size)+1, length=25) 
dat$size.cut <- cut(dat$size, cut.ind, labels=head(cut.ind,-1)+diff(cut.ind)/2)
dat$size.cut <- as.numeric(as.character(dat$size.cut))
@ 

Next look at the proportion of positive and negative serostatuses in each one.  The size of the dot represents the proportion of tortoises in size class $X$ with serostatus $Y$. 
<<propplot,fig.height=3, fig.width=7>>=
plot2 <- ggplot(dat, aes(size.cut, status))
plot2a <- plot2+stat_sum(aes(group=size.cut))
(plot2b <- plot2a+aes(colour=..n..))
@ 
Similarly to our density-based histogram above, we use the weird-looking
variable \code{..n..} to refer to a variable that is calculated internally
by \ggp\ --- in this case we want to plot the number of co-occurring
data values at each point, rather than the (default) proportion of values.

Add a layer to the plot showing the result of a logistic regression.
%% took out GAM fit; it collapsed down to a GLM fit anyway
<<addlogreg>>=
(plot2c <- plot2a +  
 geom_smooth(method="glm", family="binomial",colour="red"))
@ 


Now fit a separate curve to the data from each sex category (male, female, juvenile).
<<sepcurve,fig.height=4, fig.width=5>>=
(plot2d <- plot2a + geom_smooth(aes(group=Sex, colour=Sex),
                                method='glm', family='binomial'))
@ 


We can do the same thing by year instead of sex, but it gets very confusing --- try it for yourself.  A better choice may be to plot the data from and curve from each year in a separate panel, using \code{facet\_grid}.

<<facet1,fig.height=5, fig.width=5>>=
(plot2e <- plot2a + 
 geom_smooth(aes(group=YEAR),method='glm', family='binomial')+
 facet_grid(YEAR~.))
@ 

We can add the same sort of curve to the proportion data we looked at before with the following:

<<curve2,fig.height=5, fig.width=5>>=
g1+geom_smooth(method="glm",family=binomial,
               aes(weight=tot),
               size=1,
               colour="red")
@ 
(We use \code{size=1} here because we previously mapped \code{size} to the \code{tot} variable, and we want to override/ignore it for the \verb+geom_smooth+ addition.)

Now we take a quick look at a very different sort of dataset.  As has been mentioned, these data are not to be freely used after the workshop, but it is a very cool within-host dataset from Andrew Read's lab on malaria parasite and red blood cell densities.  Again, we read in data, modify it a little and then make a first fairly simple plot.  We look at the trajectory through time for each mouse, (\code{group=mouse}), and we differentiate two blocks of experiments by color.

<<readdat2>>=
alldat <- read.csv('malaria.csv', colClasses=c(rep('factor',6),rep('numeric',3)))
infec.dat <- alldat[alldat$strain!='con'&alldat$day<20,]

f=ggplot(infec.dat, aes(day, pathogen))
(f1=f+geom_line(aes(group=mouse, color=block)))
@

That plot is not so useful because there is so much data!  
We can use \code{facet\_grid} to split up the data by dose and strain:

<<malariafacets>>=
f1+facet_grid(dose~strain)
@

Finally, perhaps we are not interested in individual trajectories, but instead how the distributions change over the course of the experiment.  Let's use boxplots to look at that.

<<boxplots>>=
f+ facet_grid(dose~strain)+geom_boxplot(aes(group=day))
@
(this graph may be a little slow).

Other resources:
\begin{itemize}
  \item The \href{http://had.co.nz/ggplot2/}{ggplot2 web page} has a list of the available components of \ggp\ graphs, and there is a \href{https://github.com/hadley/ggplot2/wiki}{wiki}
  \item There is an active \href{http://groups.google.com/group/ggplot2}{ggplot Google group}; a lot of answered questions on \href{http://stackoverflow.com/questions/tagged/r+ggplot}{StackOverflow}; and a big variety of blog posts on the \href{http://www.r-bloggers.com/?s=ggplot}{R bloggers} site
  \item If you use \code{lattice} graphics,
    \href{http://had.co.nz/ggplot/vs-lattice.html}{this link}
    and \href{http://learnr.wordpress.com/2009/06/28/ggplot2-version-of-figures-in-lattice-multivariate-data-visualization-with-r-part-1/}{this one} are both useful
  \item If you get really into \ggp\ you can buy
    \href{http://amzn.com/0387981403}{the author's book}; it's a little expensive
    for a paperback, but hey, the software was free!
  \end{itemize}

\bibliography{../eeid}
\end{document}