% Data visualization, focusing on ggplot and mixed models
% Ben Bolker
% Wed Nov 7 12:01:58 2012

Licensed under the Creative Commons Attribution-NonCommercial license. Please share & remix noncommercially, mentioning its origin.
# goals/contexts of data visualization

## exploration
- want nonparametric/robust approaches to avoid imposing assumptions if possible
- boxplots instead of mean/standard deviation (generally base locations on medians rather than means)
- loess instead of linear/polynomial regression (see the sketch below)
- need speed: quick and dirty
- canned routines for standard tasks, flexibility for non-standard tasks
- data manipulation in the context of visualization (subsetting, reshaping, aggregating)
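A minimal sketch of these robust exploratory choices, using the built-in `mtcars` data (my choice of example data, not from the original notes):

```r
library(ggplot2)
## boxplots (median-based) rather than means +/- standard deviations
ggplot(mtcars, aes(x = factor(cyl), y = mpg)) + geom_boxplot()
## a loess smooth rather than a linear/polynomial fit
ggplot(mtcars, aes(x = wt, y = mpg)) +
    geom_point() +
    geom_smooth(method = "loess")
```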
## diagnostics
- attempt to diagnose fitting problems graphically: look for the *absence* of patterns in residuals
- e.g. scale-location plot, Q-Q plot (`plot.lm`); see the sketch below
- plot methods: generic (e.g. residuals vs. fitted) vs. specific (e.g. residuals vs. predictors)
- plotting predictions (intuitive) vs. plotting residuals (amplifies/zooms in on discrepancies)
- plotting unmodeled characteristics (e.g. spatial or temporal autocorrelation): much easier to draw a picture than to fit a model
- design plots as contrasts that are easy to judge visually (e.g. deviations from linearity: Q-Q plots, square-root profiles)
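A quick sketch of the standard `plot.lm` diagnostics, fitting an arbitrary model (my choice) to the built-in `mtcars` data:

```r
## plot() dispatches to plot.lm for objects of class "lm":
## residuals vs. fitted, Q-Q, scale-location, residuals vs. leverage
m <- lm(mpg ~ wt + hp, data = mtcars)
op <- par(mfrow = c(2, 2))  # 2x2 panel layout
plot(m)
par(op)
```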
## presentation
- how closely should one match analyses with graphs? “Let the data speak for themselves” vs. “Tell a story”
- display data (e.g. boxplots, standard deviations) or inferences from data (confidence intervals)?
- superimposing model fits (`geom_smooth`)
- avoid excessive cleverness/data density
- coefficient plots vs. parameter tables (Gelman)
- tradeoff between visual design (hand-tweaking) and reproducibility: learning to script label positioning etc. may pay off in the long run (a few tools exist for automatic placement)
- order factors in a sensible order, i.e. not alphabetical or numerical, unless (1) the labels have some intrinsic meaning or (2) you expect that readers will be interested in looking up particular levels in the plot (see the `reorder()` sketch below). This is sometimes called the “what's so special about Alabama?” problem, although the Canadian version would substitute “Alberta”.
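A small sketch of data-driven factor ordering via `reorder()`, using the built-in `InsectSprays` data (my example, not the original's):

```r
library(ggplot2)
## order spray levels by median insect count instead of alphabetically
ggplot(InsectSprays,
       aes(x = reorder(spray, count, FUN = median), y = count)) +
    geom_boxplot() +
    labs(x = "spray (ordered by median count)")
```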
# Basic criteria for data presentation

# challenges

## high-dimensional data (esp. continuous)

Solutions:
- conditioning plots (shingles, facets); see the faceting sketch below
- perspective/contour plots
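A minimal faceting sketch using ggplot2's built-in `mpg` data (an illustrative choice, not from the original notes):

```r
library(ggplot2)
## condition on a third, discrete variable by giving each of its
## levels a separate panel
ggplot(mpg, aes(x = displ, y = hwy)) +
    geom_point() +
    facet_wrap(~drv)
```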
## large data sets
- problems with computation, file size, presentation
- file size: raster (PNG) instead of vector (PS/PDF), `pch = "."`
- overplotting: transparency (alpha), kernel density estimation, hexagonal binning (see the sketch below)
- summarize (quantiles, etc.)
- variation in methods with data set size: dotplot -> boxplot -> violin plot
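Two sketches for coping with overplotting, on simulated data (the hexagonal-binning geom requires the `hexbin` package):

```r
library(ggplot2)
set.seed(101)
d <- data.frame(x = rnorm(1e5))
d$y <- d$x + rnorm(1e5)
## semi-transparent points: density shows up as darker regions
ggplot(d, aes(x, y)) + geom_point(alpha = 0.05)
## hexagonal binning: colour encodes the count per hexagon
ggplot(d, aes(x, y)) + geom_hex()
```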
## discrete data

- lots of point overlap; jittering is OK for exploratory work, but you need to summarize/bin appropriately for presentation (see the sketch below)
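A quick jittering sketch with ggplot2's built-in `mpg` data:

```r
library(ggplot2)
## jitter horizontally only, so the (discrete) hwy values stay honest
ggplot(mpg, aes(x = cyl, y = hwy)) +
    geom_point(position = position_jitter(width = 0.2, height = 0))
```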
## spatial data

- the best parts of the Cleveland hierarchy (position on the x and y axes) are already taken by the spatial coordinates; representing uncertainty is a big challenge
## compositional data

- would like to display the “sum to 1.0” constraint but also allow accurate comparison of magnitudes: stacked bars vs. grouped bars (or dotplots)? See the sketch below.
- harder if we also need to represent uncertainty (would like to show correlations among components)
- ternary diagrams: nice, but don't generalize past 3 components
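A toy stacked-vs-grouped comparison (hypothetical proportions, invented for illustration):

```r
library(ggplot2)
d <- data.frame(site      = rep(c("A", "B"), each = 3),
                component = rep(c("x", "y", "z"), 2),
                prop      = c(0.2, 0.5, 0.3, 0.4, 0.4, 0.2))
## stacked bars: make the sum-to-1 constraint visible
ggplot(d, aes(x = site, y = prop, fill = component)) +
    geom_bar(stat = "identity")
## grouped bars: easier comparison of individual magnitudes
ggplot(d, aes(x = site, y = prop, fill = component)) +
    geom_bar(stat = "identity", position = "dodge")
```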
# next-generation tools

- dynamic/exploratory graphics: GGobi, Mondrian, latticist, JMP
- GUI frameworks: JMP, R Commander, Deducer, Rattle, web interface to ggplot2
- presentation technologies: JCGS editorial with supplementary materials
- computational frameworks: lattice, ggplot, Protovis, Gapminder, googleVis
# Data visualization in R

## Base graphics

- simple 'canvas' approach: successive commands add ink to the current plot
- straightforward, easy to customize
- most plot methods are written in base graphics (see the sketch below)
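A minimal sketch of the canvas style, again with the built-in `mtcars` data:

```r
## each call adds to the existing plot rather than building an object
plot(mpg ~ wt, data = mtcars)                 # axes + points
abline(lm(mpg ~ wt, data = mtcars), lty = 2)  # superimpose a fitted line
text(4.5, 25, "heavier cars,\nlower mpg")     # annotate anywhere
```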
## Lattice

- newer
- documented in a book (Sarkar: see below)
- based on the `grid` graphics package
- faceting, conditioning plots (see the sketch below)
- much more automatic, better graphical defaults
- implements banking, other aspect-ratio control
- more 'magic', harder to customize
- some plot methods (e.g. in the `nlme` package)
- the `latticeExtra` and `directlabels` packages may be handy
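A small lattice conditioning-plot sketch, using the `Oxboys` data from the `mlmRev` package (the same data used in the ggplot intro below):

```r
library(lattice)
library(mlmRev)
## one panel per boy, showing his height trajectory over (centered) age
xyplot(height ~ age | Subject, data = Oxboys, type = "l")
```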
## ggplot

- newest
- based on Wilkinson's *Grammar of Graphics*
- documented in a book (see below) and on a web site, as well as an active mailing list
- explicit mapping from variables to *aesthetics*: x, y, colour, size, shape
- implements faceting (not quite as flexibly as lattice: no aspect-ratio control)
- some data summaries etc. built in
- easier to overlay multiple data sets, data summaries, model predictions, etc.
- no 3D plots
- rendering can be slow
- the `gridExtra`, `ggExtra`, and `directlabels` packages may be handy
# ggplot intro

## mappings + geoms

### Data

Specified explicitly as part of a `ggplot()` call:
```r
library(mlmRev)
head(Oxboys)
```

```
##   Subject     age height Occasion
## 1       1 -1.0000  140.5        1
## 2       1 -0.7479  143.4        2
## 3       1 -0.4630  144.8        3
## 4       1 -0.1643  147.1        4
## 5       1 -0.0027  147.7        5
## 6       1  0.2466  150.2        6
```

```r
library(ggplot2)
ggplot(Oxboys)
```

```
## Error: No layers in plot
```
But that isn't quite enough: we need to specify a mapping between variables (columns in the data set) and aesthetics (elements of the graphical display: x-location, y-location, colour, size, shape …)
```r
ggplot(Oxboys, aes(x = age, y = height))
```

```
## Error: No layers in plot
```
but (as you can see) that's still not quite enough. We need to specify some geometric objects (called `geom`s), such as points, lines, etc., that will embody these aesthetics. The weirdest thing about `ggplot` syntax is that these `geom`s get added to the existing `ggplot` object that specifies the data and aesthetics; unless you explicitly specify other aesthetics, they are inherited from the initial `ggplot()` call.
```r
ggplot(Oxboys, aes(x = age, y = height)) + geom_point()
```
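To push the inheritance point one step further (a sketch extending the original example): an added `geom_line()` reuses the `x` and `y` mappings from the `ggplot()` call, and we only supply what is new, a `group` aesthetic so that each boy gets his own growth trajectory:

```r
## geom_line() inherits x = age and y = height from ggplot();
## group = Subject draws one line per boy
ggplot(Oxboys, aes(x = age, y = height)) +
    geom_point() +
    geom_line(aes(group = Subject))
```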