% Data visualization, focusing on ggplot and mixed models
% Ben Bolker
% Wed Nov 7 12:01:58 2012

Licensed under the Creative Commons Attribution-NonCommercial license. Please share & remix noncommercially, mentioning its origin.
# goals/contexts of data visualization

## exploration
- want nonparametric/robust approaches to avoid imposing assumptions if possible
- boxplots instead of mean/standard deviation (generally base locations on medians rather than means)
- loess instead of linear/polynomial regression (see the sketch below)
- need speed: quick and dirty
- canned routines for standard tasks, flexibility for non-standard tasks
- data manipulation in the context of visualization (subsetting, reshaping, aggregating)
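A minimal sketch of these robust exploratory choices, using the built-in `mtcars` data (my choice of example data, not from the original notes):

```r
library(ggplot2)
## boxplots (median-based) rather than means +/- standard deviations
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot()
## a loess smooth rather than a linear/polynomial fit
ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point() +
    geom_smooth(method = "loess")
```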
## diagnostics
- attempt to diagnose fitting problems graphically: look for the *absence* of patterns in residuals
- e.g. scale-location plot, Q-Q plot (`plot.lm`); see the sketch below
- plot methods: generic (e.g. residuals vs. fitted) vs. specific (e.g. residuals vs. predictors)
- plotting predictions (intuitive) vs. plotting residuals (amplifies/zooms in on discrepancies)
- plotting unmodeled characteristics (e.g. spatial or temporal autocorrelation): much easier to draw a picture than to fit a model
- design plots as contrasts that are easy to judge visually (e.g. deviations from linearity: Q-Q plots, square-root profiles)
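A quick sketch of the standard `plot.lm` diagnostics, fitting an arbitrary model (my choice) to the built-in `mtcars` data:

```r
## plot() dispatches to plot.lm for objects of class "lm":
## residuals vs. fitted, Q-Q, scale-location, residuals vs. leverage
m <- lm(mpg ~ wt + hp, data = mtcars)
op <- par(mfrow = c(2, 2))  # 2x2 panel layout
plot(m)
par(op)
```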
## presentation
- how closely should one match analyses with graphs? “Let the data speak for themselves” vs. “Tell a story”
- display data (e.g. boxplots, standard deviations) or inferences from data (confidence intervals)?
- superimposing model fits (`geom_smooth`)
- avoid excessive cleverness/data density
- coefficient plots vs. parameter tables (Gelman)
- tradeoff between visual design (hand-tweaking) and reproducibility: learning to script label positioning etc. may pay off in the long run (a few tools exist for automatic placement)
- order factors in a sensible order, i.e. not alphabetical or numerical, unless (1) the labels have some intrinsic meaning or (2) you expect that readers will be interested in looking up particular levels in the plot (see the `reorder()` sketch below). This is sometimes called the “what's so special about Alabama?” problem, although the Canadian version would substitute “Alberta”.
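A small sketch of data-driven factor ordering via `reorder()`, using the built-in `InsectSprays` data (my example, not the original's):

```r
library(ggplot2)
## order spray levels by median insect count instead of alphabetically
ggplot(InsectSprays,
       aes(x = reorder(spray, count, FUN = median), y = count)) +
    geom_boxplot() +
    labs(x = "spray (ordered by median count)")
```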
# Basic criteria for data presentation

# challenges

## high-dimensional data (esp. continuous)

Solutions:
- conditioning plots (shingles, facets); see the faceting sketch below
- perspective/contour plots
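A minimal faceting sketch using ggplot2's built-in `mpg` data (an illustrative choice, not from the original notes):

```r
library(ggplot2)
## condition on a third, discrete variable by giving each of its
## levels a separate panel
ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point() +
    facet_wrap(~drv)
```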
## large data sets
- problems with computation, file size, presentation
- file size: raster (PNG) instead of vector (PS/PDF), `pch = "."`
- overplotting: transparency (alpha), kernel density estimation, hexagonal binning (see the sketch below)
- summarize (quantiles, etc.)
- variation in methods with data set size: dotplot -> boxplot -> violin plot
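Two sketches for coping with overplotting, on simulated data (the hexagonal-binning geom requires the `hexbin` package):

```r
library(ggplot2)
set.seed(101)
d <- data.frame(x = rnorm(1e5))
d$y <- d$x + rnorm(1e5)
## semi-transparent points: density shows up as darker regions
ggplot(d, aes(x, y)) + geom_point(alpha = 0.05)
## hexagonal binning: colour encodes the count per hexagon
ggplot(d, aes(x, y)) + geom_hex()
```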
## discrete data

- lots of point overlap; jittering is OK for exploratory work, but you need to summarize/bin appropriately for presentation (see the sketch below)
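A quick jittering sketch with ggplot2's built-in `mpg` data:

```r
library(ggplot2)
## jitter horizontally only, so the (discrete) hwy values stay honest
ggplot(mpg, aes(x = cyl, y = hwy)) +
    geom_point(position = position_jitter(width = 0.2, height = 0))
```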
## spatial data

- the best parts of the Cleveland hierarchy (position on the x and y axes) are already taken by the spatial coordinates; representing uncertainty is a big challenge
## compositional data

- would like to display the “sum to 1.0” constraint but also allow accurate comparison of magnitudes: stacked bars vs. grouped bars (or dotplots)? See the sketch below.
- harder if we also need to represent uncertainty (would like to show correlations among components)
- ternary diagrams: nice, but don't generalize past 3 components
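A toy stacked-vs-grouped comparison (hypothetical proportions, invented for illustration):

```r
library(ggplot2)
d <- data.frame(site      = rep(c("A", "B"), each = 3),
                component = rep(c("x", "y", "z"), 2),
                prop      = c(0.2, 0.5, 0.3, 0.4, 0.4, 0.2))
## stacked bars: make the sum-to-1 constraint visible
ggplot(d, aes(x = site, y = prop, fill = component)) +
    geom_bar(stat = "identity")
## grouped bars: easier comparison of individual magnitudes
ggplot(d, aes(x = site, y = prop, fill = component)) +
    geom_bar(stat = "identity", position = "dodge")
```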
# next-generation tools

- dynamic/exploratory graphics: GGobi, Mondrian, latticist, JMP
- GUI frameworks: JMP, R Commander, Deducer, Rattle, web interface to ggplot2
- presentation technologies: JCGS editorial with supplementary materials
- computational frameworks: lattice, ggplot, Protovis, Gapminder, googleVis
# Data visualization in R

## Base graphics

- simple 'canvas' approach: successive commands add ink to the current plot
- straightforward, easy to customize
- most plot methods are written in base graphics (see the sketch below)
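A minimal sketch of the canvas style, again with the built-in `mtcars` data:

```r
## each call adds to the existing plot rather than building an object
plot(mpg ~ wt, data = mtcars)                 # axes + points
abline(lm(mpg ~ wt, data = mtcars), lty = 2)  # superimpose a fitted line
text(4.5, 25, "heavier cars,\nlower mpg")     # annotate anywhere
```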
## Lattice

- newer
- documented in a book (Sarkar: see below)
- based on the `grid` graphics package
- faceting, conditioning plots (see the sketch below)
- much more automatic, better graphical defaults
- implements banking, other aspect-ratio control
- more 'magic', harder to customize
- some plot methods (e.g. in the `nlme` package)
- the `latticeExtra` and `directlabels` packages may be handy
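A small lattice conditioning-plot sketch, using the `Oxboys` data from the `mlmRev` package (the same data used in the ggplot intro below):

```r
library(lattice)
library(mlmRev)
## one panel per boy, showing his height trajectory over (centered) age
xyplot(height ~ age | Subject, data = Oxboys, type = "l")
```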
## ggplot

- newest
- based on Wilkinson's *Grammar of Graphics*
- documented in a book (see below) and on a web site, as well as an active mailing list
- explicit mapping from variables to *aesthetics*: x, y, colour, size, shape
- implements faceting (not quite as flexibly as lattice: no aspect-ratio control)
- some data summaries etc. built in
- easier to overlay multiple data sets, data summaries, model predictions, etc.
- no 3D plots
- rendering can be slow
- the `gridExtra`, `ggExtra`, and `directlabels` packages may be handy
# ggplot intro

## mappings + geoms

### Data

Specified explicitly as part of a `ggplot()` call:
```r
library(mlmRev)
head(Oxboys)
```

```
##   Subject     age height Occasion
## 1       1 -1.0000  140.5        1
## 2       1 -0.7479  143.4        2
## 3       1 -0.4630  144.8        3
## 4       1 -0.1643  147.1        4
## 5       1 -0.0027  147.7        5
## 6       1  0.2466  150.2        6
```

```r
library(ggplot2)
ggplot(Oxboys)
```

```
## Error: No layers in plot
```
But that isn't quite enough: we need to specify a mapping between variables (columns in the data set) and aesthetics (elements of the graphical display: x-location, y-location, colour, size, shape …)
```r
ggplot(Oxboys, aes(x = age, y = height))
```

```
## Error: No layers in plot
```
but (as you can see) that's still not quite enough. We need to specify some geometric objects (called `geom`s), such as points, lines, etc., that will embody these aesthetics. The weirdest thing about `ggplot` syntax is that these `geom`s get added to the existing `ggplot` object that specifies the data and aesthetics; unless you explicitly specify other aesthetics, they are inherited from the initial `ggplot()` call.
```r
ggplot(Oxboys, aes(x = age, y = height)) + geom_point()
```
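To push the inheritance point one step further (a sketch extending the original example): an added `geom_line()` reuses the `x` and `y` mappings from the `ggplot()` call, and we only supply what is new, a `group` aesthetic so that each boy gets his own growth trajectory:

```r
## geom_line() inherits x = age and y = height from ggplot();
## group = Subject draws one line per boy
ggplot(Oxboys, aes(x = age, y = height)) +
    geom_point() +
    geom_line(aes(group = Subject))
```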