\input{labskel.tex}
\title{The many flavors of \code{apply}}
\date{\today}
\author{Ben Bolker}

\begin{document}

\maketitle

\includegraphics[width=2.64cm,height=0.93cm]{cc-attrib-nc.png}
\begin{minipage}[b]{3in}
{\tiny Licensed under the Creative Commons 
  attribution-noncommercial license
(\url{http://creativecommons.org/licenses/by-nc/3.0/}).
Please share \& remix noncommercially,
mentioning its origin.}
\end{minipage}

\SweaveOpts{keep.source=TRUE}
<<echo=FALSE>>=
options(continue=" ")
@ 
One of the more powerful capabilities of R is
the ``apply'' family.  These are functions whose
purpose is to take an R function and
some R object that  represents ``a set of things''
and apply the function to each element in the set.
You can often achieve the same results with
a \code{for} loop, stepping through the elements
of the set one by one, but the equivalent \code{*apply}
commands are (1) more compact, making code easier
to read [at least if you understand them!],
(2) slightly more convenient --- various bookkeeping
such as figuring out the number of elements in the
set and setting aside storage for the results
gets done automatically, (3) more ``idiomatic''
in R (in case that matters to you), and (4) [sometimes]
more efficient [although it is no longer always the case,
as it was in early versions of S-PLUS, that \code{for}
loops are much less efficient than the \code{apply}
commands].
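
As a minimal sketch of the bookkeeping difference, here is the same
computation written both ways.  The vector \code{x} and the results
vector \code{res} are made up for illustration.  (In practice
\code{sqrt} is already vectorized, so \code{sqrt(x)} alone would do;
the point is only the comparison.)
<<>>=
x <- c(2,4,6)
## for loop: set aside storage, then fill it element by element
res <- numeric(length(x))
for (i in seq_along(x)) {
  res[i] <- sqrt(x[i])
}
res
## sapply: the length and the storage are handled for you
sapply(x,sqrt)
@ 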

This general approach to programming (define a function,
then apply it to a set of objects) is called (not too
surprisingly) \emph{functional programming}
(\url{http://en.wikipedia.org/wiki/Functional_programming}).
This style of programming started out in LISP, and is
also very common in Mathematica (where it is represented
by the Map function).

\code{*apply}ing is easiest when an existing function does
what you want, but you can also define functions on the fly.
For example, R doesn't have a \code{square()} function.
You could define it:
<<>>=
square <- function(x) {
  x^2
  }
sapply(1:5,square)
@ 
but for this kind of short function you can just say
<<>>=
sapply(1:5,function(x) {x^2})
@ 
(Mathematica has an even slicker way to do this.)

You can also omit the curly brackets when your
function consists of a single statement.  If it
has more than one you can use semicolons to
keep all the statements on the same line, for
compactness; e.g.
<<>>=
sapply(1:5,function(x) {y <- x; y^2})
@ 
(although in this case the extra statement is obviously
pointless).

You'd also be surprised sometimes what can be used
as a function:
<<>>=
sapply(1:5,"^",2)
@ 
This example also represents a powerful and sometimes overlooked
feature of \code{*apply}: extra arguments get passed
through to the function you are applying.  This is
particularly handy when you want to apply the function
to a vector but use the vector as something other
than the first argument to the function.
For example, suppose we wanted to run a linear regression
on a series of different data sets.
Rather than
<<eval=FALSE>>=
datlist = list(dat1,dat2,dat3)
lapply(datlist, function(d) lm(y~x,data=d))
@ 
we could just say
<<eval=FALSE>>=
datlist = list(dat1,dat2,dat3)
lapply(datlist, lm, formula=y~x)
@ 
R will fill in the \code{formula} argument and
then use the elements of \code{datlist} for the
next unfilled argument, which in this case
is \code{data}.
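
Here is a runnable sketch of the same comparison.  The three data sets
are simulated (an assumption for illustration; the names \code{fits1}
and \code{fits2} are also made up), and the last line checks that
both approaches give the same coefficients.
<<>>=
set.seed(101)
## three made-up data sets, each with columns x and y
datlist <- lapply(1:3, function(i) {
  x <- runif(20)
  data.frame(x=x, y=1+2*x+rnorm(20,sd=0.1))
})
## explicit anonymous function
fits1 <- lapply(datlist, function(d) lm(y~x,data=d))
## extra argument passed through; each data set fills in 'data'
fits2 <- lapply(datlist, lm, formula=y~x)
all.equal(sapply(fits1,coef), sapply(fits2,coef))
@ 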

Note that \code{apply}ing can also be overdone:
see section 4 (``Circle 4: Over-Vectorizing'') of 
Patrick Burns' ``R Inferno''
(\url{http://www.burns-stat.com/pages/Tutor/R_inferno.pdf}),
which is a pleasure to read in general.

Reproduced and slightly extended from that reference:

\setlength\parindent{0pt}
\begin{tabular}{rccp{1.5in}}
\textbf{function} & \textbf{input} & \textbf{output} & \textbf{comment} \\
\hline
\code{apply} & matrix or array & vector or array or list & \\
\code{lapply} & list or vector & list & \\
\code{sapply} & list or vector & vector or matrix or list & simplify \\
\code{tapply} & data, categories & array or list & ragged \\
\code{mapply} & lists and/or vectors & vector or matrix or list & multiple \\
\code{rapply} & list & vector or list & recursive \\
\code{eapply} & environment & list & \\
\code{dendrapply} & dendrogram & dendrogram & \\
\code{zoo::rollapply} & data & similar to input &  \\
\code{emdbook::apply2d} & two vectors & matrix & \\
\code{multicore::mclapply} & same as \code{lapply} & same as \code{lapply} &
parallelize across cores (OK on Unix,
experimental for Windows (pre-Vista only): 
see \url{http://rforge.net/multicore}) \\
\end{tabular}

\code{kernapply} has the same pattern, but I don't think it
is really in the \code{*apply} family.

Also: \code{simFrame::simApply}, functions in \code{Rmpi}
(\code{mpi.parapply}, \code{mpi.iapply}, \code{mpi.apply}), 
\code{gridR::apply},
\code{RMySQL::dbApply}, \code{RPostgreSQL::dbApply},
\code{PerformanceAnalytics::apply.rolling}, \code{ff::ffapply},
\code{xts::\{period.apply,apply.monthly\}}, etc.
(these are the results of \code{sos::findFn("apply")}).
Also \code{nlme::gapply}.
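
To make one row of the table concrete, here is a small sketch of
\code{tapply}, which applies a function to the (possibly ragged) groups
of a data vector defined by a vector of categories.  The vectors
\code{size} and \code{group} are made up for illustration.
<<>>=
size <- c(1.2,3.4,2.2,5.1,4.0)
group <- c("a","b","a","b","b")
## mean of size within each level of group (groups have unequal sizes)
tapply(size,group,mean)
@ 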

\section{\code{apply}}

Apply \code{fun} to the ``margins'' of a matrix or array.
``Margin'' here means row, column, or other ``slices'' of a higher-dimensional
array.  The \code{MARGIN} argument is 1 for rows, 2 for columns, and \code{n}
for the \code{n}th dimension of a higher-dimensional array.  You can give more than
one margin:
<<>>=
m = matrix(1:4,byrow=TRUE,ncol=2)
apply(m,c(1,2),function(x) x^2)
@ 
Of course, in this case we don't do any better than just saying
\verb+m^2+.  However, with an array of more than two dimensions we could
\code{apply} over more than one (but not all) of its dimensions.
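
For instance, in this small sketch (the array \code{a} and the result
\code{s12} are made up for illustration) we keep dimensions 1 and 2
of a three-dimensional array and sum over the third:
<<>>=
a <- array(1:24,dim=c(2,3,4))
## one sum for each (row, column) pair, taken across the third dimension
s12 <- apply(a,c(1,2),sum)
dim(s12)
@ 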

\code{colSums}, \code{rowSums}, \code{colMeans}, \code{rowMeans}
are special cases that are considerably faster than the
equivalent \code{apply} commands.  (I think there's an equivalent
for the median somewhere in a Bioconductor package.)
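
A quick check of the equivalence, using a made-up matrix \code{m2}
(this sketch shows only that the results agree, not the difference in speed):
<<>>=
m2 <- matrix(1:6,nrow=2)
apply(m2,2,sum)   ## column sums via apply
colSums(m2)       ## the special-purpose function
@ 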

\section{\code{lapply}}

Apply a function to a list.
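
A minimal sketch, with a made-up list \code{lst} whose elements have
different lengths; the result is always a list, with the names of the
input carried along:
<<>>=
lst <- list(a=1:3,b=c(2.5,4),c=10)
lapply(lst,mean)
@ 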

\section{\code{sapply}}

Apply a function to a list or a vector
and simplify the results if possible.
(\code{lapply} also accepts a vector, so you don't have to say
\code{lapply(as.list(x))}, but it always returns a list.)
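
What ``if possible'' means depends on the lengths of the individual results.
The following sketch (the anonymous functions are made up for illustration)
shows the three cases:
<<>>=
sapply(1:3,function(i) i^2)         ## all of length 1: a vector
sapply(1:3,function(i) c(i,i^2))    ## all of length 2: a matrix, one column each
sapply(1:3,function(i) seq_len(i))  ## lengths differ: stays a list
@ 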

\section{\code{mapply}}

Apply a function of multiple arguments to multiple
lists.  I sometimes use this as a shortcut where
I should probably just give up and use a \code{for} loop.

<<eval=FALSE>>=
mapply(function(dat,i) {
  ## plot each data set in its own color ...
  plot(dat$x,dat$y,col=i)
  ## ... and label it with its name in the list
  text(1,2,names(datlist)[i])
  },
  datlist,seq_along(datlist))
@ 

It would be great to have a way within an \code{*apply} function
to access the current value of the index (or name of the
current element), but I don't know of one \ldots
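
One common workaround, sketched here with a made-up list \code{xlist},
is to apply over the indices rather than over the elements themselves,
and to look up each element (and its name) inside the function:
<<>>=
xlist <- list(first=1:3,second=4:6)
sapply(seq_along(xlist), function(i) {
  ## i is the index; use it to get both the name and the element
  paste(i, names(xlist)[i], sum(xlist[[i]]))
})
@ 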

Additional arguments have to be specified explicitly
with \code{MoreArgs}.
Depending on what you're doing you may want
\code{SIMPLIFY} to be \code{TRUE} or \code{FALSE} \ldots


\section*{Related functions}

\begin{tabular}{lp{4in}}
\textbf{function} & \textbf{purpose} \\
\code{do.call} & apply a function to a list of arguments \\
\code{replicate} & repeat an expression many times \\
\code{outer} & apply a function to all combinations of two
vectors (function must be vectorized; otherwise
see \code{emdbook::apply2d}) \\
\code{Map} & wrapper for \code{mapply} that never simplifies its result (\code{SIMPLIFY=FALSE}): see \code{?funprog} \\
\code{Reduce} & apply a function to successively combine elements \\
\code{cumsum} & (and \code{cummax}, \code{cummin}, \code{cumprod}):
cumulative functions \\
\code{plyr::ddply} & (and friends) split an object, apply a function
to chunks, then recombine the chunks (\code{split}/\code{tapply}/\code{rbind}
on steroids)
\end{tabular}
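
A few of these in action, as a minimal sketch (the inputs are made up
for illustration):
<<>>=
do.call(rbind,list(1:2,3:4))     ## same as rbind(1:2,3:4)
Reduce("+",1:4,accumulate=TRUE)  ## running sum, like cumsum(1:4)
outer(1:2,1:3)                   ## all pairwise products (default FUN is "*")
@ 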

For the truly clever: why does this work?
<<>>=
N <- 0; replicate(20,N <<- N-round(0.25*N)+10)
@ 
\end{document}