% Ben Bolker % Thu Nov 1 12:45:50 2012

ggplot/data visualization lab

Licensed under the Creative Commons attribution-noncommercial license. Please share & remix noncommercially, mentioning its origin.

Oxboys example (20-30 minutes)

First load the mlmRev and ggplot2 packages:

library(mlmRev)  ## for Oxboys data
library(ggplot2)

As shown in the lecture, make a minimal plot of the Oxboys data set (see ?Oxboys for information on the data set):

ggplot(Oxboys, aes(x = age, y = height)) + geom_point()


Now play around with it.

try using geom_line() (it probably doesn't do what you want)
you can save parts of a ggplot specification to variables: define g0 <- ggplot(Oxboys,aes(x=age,y=height,colour=Subject)) (i.e. add colours to the mapping) and try g0+geom_point()
now try g0+geom_line(), or g0+geom_line()+geom_point()
you can also add smooth lines: try g0+geom_point()+geom_smooth()
you can use geom_smooth(method="lm") to add linear regression lines rather than the default loess (locally-weighted regression) smooths
you may notice that the factor levels aren't in a sensible order (they're sorted lexically, which means that 1, 10…19 come before 2, 20…26, 3, …). You can fix this with Oxboys$Subject <- factor(Oxboys$Subject,levels=1:26) (or Oxboys <- transform(Oxboys,Subject=factor(Subject,levels=1:26))). (You need to redefine g0 after you do this, because it has stored the original version of the data internally.)
the numeric ordering of the subjects is a little more sensible than the lexical order, but it has the “Alberta”/“Alabama” problem (i.e., the subject numbers are as far as we know not meaningful numbers). See if you can figure out how to use the reorder function to get a (slightly) more sensible ordering
I don't like the default gray background: I usually run theme_set(theme_bw()) to change the theme. Try it. (It also makes it easier to see colours.) (Use theme_set(theme_gray()) if you want to restore the default theme.)
If you find the confidence intervals (gray regions) distracting, try removing them by setting se=FALSE in the geom_smooth() specification. If you want to colour the confidence intervals along with the lines, try adding aes(fill=Subject) inside the geom_smooth() call. Or, if you want to give the confidence regions a single colour and make them more transparent (a compromise between the default values and using se=FALSE to turn them off completely), add fill="blue",alpha=0.1 (not wrapped inside an aes() statement) to the geom_smooth() call.
the colours may be somewhat useful for diagnosing problems, but it's a bit hard to tell them apart. You could try making the points different shapes (although this doesn't work really well with such large numbers of groups – nothing does, really): add aes(shape=Subject) to the geom_point() call, and add +scale_shape_manual(values=1:26) to the end of your R command (you will get a series of warnings about unimplemented pch value '26'). scale_shape_manual is our first example of using scales, another component of ggplot: it allows customization of the values (colours, shapes, sizes, etc.) used in mappings.
another alternative would be to label the points with letters (conveniently, there are not more than 26 subjects): instead of geom_point(), use geom_text(aes(label=letters[Subject])) to add text (if you had subjects with names you could use those names as the label aesthetic rather than using the built-in letters vector on the fly)
for presentation, especially with a large number of groups, you might want to turn off the colours entirely. However, this would go back to our original plot, where ggplot didn't distinguish between subjects in drawing the lines: by default, ggplot groups by whatever aesthetic mappings have been defined (e.g. colour, shape, etc.). You can explicitly specify this, or override the default behaviour, by using the group aesthetic.
- draw a graph without colours, but use geom_line(aes(group=Subject)) to draw a separate line for each subject
- add geom_smooth(aes(group=1),method="lm",size=1.5) to one of your previous plots to get a linear regression model of the pooled data (i.e. group=1 specifies that the only group is the whole data set), with a fatter line than usual (in ggplot2 3.4.0 and later, line width is set with linewidth=1.5 rather than size=1.5)
another powerful feature of ggplot is faceting: creating sub-plots (called “facets” or in other contexts “small multiples” or “trellis plots” or “conditioning plots”) for different subsets of the data. Add facet_wrap(~Subject) to one of your previous plots. (To create a two-dimensional grid of plots, use facet_grid(x~y), where x and y are two separate factors you want to use to define the rows and columns of the sub-plot array.)
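Several of the steps above (reordering, explicit grouping, a pooled regression line, and faceting) can be combined roughly as follows. This is only a sketch: ordering subjects by mean height is one possible choice of "sensible" ordering, and the names Oxboys2, g1, p_lines and p_facets are made up for this example.

```r
library(mlmRev)   ## for Oxboys data
library(ggplot2)

## reorder the Subject factor by each boy's mean height
## (reorder() uses FUN = mean by default; this ordering is an assumption,
##  not the only sensible one)
Oxboys2 <- transform(Oxboys, Subject = reorder(Subject, height))

theme_set(theme_bw())
g1 <- ggplot(Oxboys2, aes(x = age, y = height))

## one uncoloured line per subject, plus a single regression line
## fitted to the pooled data (group = 1 means "one group for everything")
p_lines <- g1 +
  geom_line(aes(group = Subject)) +
  geom_smooth(aes(group = 1), method = "lm", se = FALSE, colour = "red")

## one panel per subject; panels appear in the new factor order
p_facets <- g1 +
  geom_point() +
  geom_line(aes(group = Subject)) +
  facet_wrap(~Subject)

print(p_lines)
print(p_facets)
```

Because the reordered data live in a new data frame, the ggplot object has to be built from Oxboys2 rather than reusing an earlier g0.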

Take a look at the online ggplot documentation to get an idea of some of the other options.

Banta et al. fruiting data (20-30 mins)

Get the data set on effects of fruiting and simulated herbivory on Arabidopsis (password-protected), which is described in more detail in the material here (search for “Bolker et al 2009”), and use read.csv to import it. (You can call it whatever you want; in the examples below I'll call it dat.)

The data are the total number of fruits set (total.fruits), subdivided by status (a nuisance variable, the way the plants were handled); amd (simulated herbivory treatment); nutrient (nutrient level; you may want to create an fNutrient variable within the data set that is a factor instead of a number); rack (which of two experimental racks was used); gen (genotype, which again should probably be a factor); popu (population); reg (region).
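One way to create the factor versions of the numeric variables is sketched below. Since the real file is not reproduced here, the data frame is an invented stand-in: the column names follow the description above, but the values (including the treatment labels) are made up purely for illustration.

```r
## hypothetical stand-in for the real data set (invented values!)
set.seed(101)
dat <- expand.grid(nutrient = c(1, 8),
                   amd = c("clipped", "unclipped"),
                   gen = 4:6,
                   rack = 1:2)
dat$total.fruits <- rpois(nrow(dat), lambda = 5 * dat$nutrient)

## convert numeric codes to factors; keep the numeric nutrient column
## and add fNutrient alongside it
dat <- transform(dat,
                 fNutrient = factor(nutrient),
                 gen = factor(gen),
                 rack = factor(rack))

str(dat)  ## check that the conversions worked
```

With the real data you would replace the first block by your read.csv call and keep only the transform() step.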

Start out by mapping the most important variables to the x location, y location, and colour aesthetics:

g0 <- ggplot(dat, aes(x = factor(nutrient), y = total.fruits, colour = amd))

Now experiment with different ways of displaying the data:

try geom_point, or geom_boxplot, or stat_sum(aes(size=..n..)) (the last counts the number of overlapping points and sets the point size according to the number of points; in ggplot2 3.3.0 and later the computed variable is written after_stat(n) rather than ..n..).
try stat_summary(fun.y=mean,geom="line",aes(x=nutrient)): this summarizes each group within the data by the mean and adds a corresponding line. The aes(x=nutrient) is a bit of a trick, required because ggplot won't draw lines across horizontal axes that are defined as factors (as in this case). (In ggplot2 3.3.0 and later the argument is called fun rather than fun.y.) You could also try stat_summary(fun.data=mean_cl_normal), which draws points + normal error bars (it needs the Hmisc package to be installed). Try adding a group aesthetic to stat_summary to get it to take the mean within groups rather than overall.
experiment with faceting, by region or population or genotype
experiment with different groupings, or adding other aesthetics such as shape to distinguish unresolved variables such as status.
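As a rough illustration of how these pieces fit together, the sketch below combines stat_sum, a grouped stat_summary, and faceting. It assumes ggplot2 3.3.0 or later (for after_stat() and the fun argument) and uses an invented toy data frame with the same column names as the real data; the values and the region labels "A" and "B" are made up.

```r
library(ggplot2)

## invented toy data (NOT the real data set)
set.seed(101)
dat <- expand.grid(nutrient = c(1, 8),
                   amd = c("clipped", "unclipped"),
                   reg = c("A", "B"),
                   plant = 1:10)
dat$total.fruits <- rpois(nrow(dat), lambda = 3 * dat$nutrient)

g0 <- ggplot(dat, aes(x = factor(nutrient), y = total.fruits, colour = amd))

p <- g0 +
  ## point size proportional to the number of overlapping observations
  stat_sum(aes(size = after_stat(n)), alpha = 0.5) +
  ## mean within each herbivory treatment; the explicit group aesthetic
  ## lets the line connect the levels of the factor on the x axis
  stat_summary(fun = mean, geom = "line", aes(group = amd)) +
  ## one panel per region
  facet_wrap(~reg)

print(p)
```

Here the explicit group aesthetic is used instead of the aes(x=nutrient) trick to get lines across a factor axis; either approach is worth trying on the real data.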

See if you can successfully produce Figures 1, 3, and 4 from this PDF (although for Figure 1 you might want to use coord_flip to produce horizontal boxplots with easier-to-read, horizontal labels)
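The coord_flip suggestion might look something like the following. This is a sketch of the technique only, not a reconstruction of the figure: the data frame and its genotype labels are invented.

```r
library(ggplot2)

## invented toy data (NOT the real data set)
set.seed(101)
dat <- data.frame(
  gen = factor(rep(c("genotype_4", "genotype_5", "genotype_6"), each = 20)),
  total.fruits = rpois(60, lambda = 20)
)

## boxplots are specified as usual (groups on x, response on y) ...
p <- ggplot(dat, aes(x = gen, y = total.fruits)) +
  geom_boxplot() +
  ## ... and then the axes are swapped, so the group labels
  ## are written horizontally along the vertical axis
  coord_flip()

print(p)
```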

Your own data (20-30 mins, optional)

Alternatively, or in addition to the previous example, use ggplot to explore your own data or another data set you can find lying around (try data() in R, or ask the instructor …)