\input{labskel.tex}
\title{The many flavors of \code{apply}}
\date{\today}
\author{Ben Bolker}
\begin{document}
\maketitle

\includegraphics[width=2.64cm,height=0.93cm]{cc-attrib-nc.png}
\begin{minipage}[b]{3in}
{\tiny Licensed under the Creative Commons
attribution-noncommercial license
(\url{http://creativecommons.org/licenses/by-nc/3.0/}).
Please share \& remix noncommercially,
mentioning its origin.}
\end{minipage}

\SweaveOpts{keep.source=TRUE}
<<echo=FALSE>>=
options(continue=" ")
@

One of the more powerful capabilities of R is the ``apply'' family. These are functions whose purpose is to take an R function and some R object that represents ``a set of things'' and apply the function to each element in the set. You can often achieve the same results with a \code{for} loop, stepping through the elements of the set one by one, but the equivalent \code{*apply} commands are (1) more compact, making code easier to read [at least if you understand them!], (2) slightly more convenient, since bookkeeping such as figuring out the number of elements in the set and setting aside storage for the results gets done automatically, (3) more ``idiomatic'' in R (in case that matters to you), and (4) [sometimes] more efficient [although it is no longer always the case, as it was in early versions of S-PLUS, that \code{for} loops are much less efficient than the \code{apply} commands].

This general approach to programming (define a function, then apply it to a set of objects) is called (not too surprisingly) \emph{functional programming} (\url{http://en.wikipedia.org/wiki/Functional_programming}). This style of programming started out in LISP, and is also very common in Mathematica (where it is represented by the Map function).

\code{*apply}ing is easiest when an existing function does what you want, but you can also define functions on the fly. For example, R doesn't have a \code{square()} function.
You could define it:
<<>>=
square <- function(x) { x^2 }
sapply(1:5,square)
@
but for this kind of short function you can just say
<<>>=
sapply(1:5,function(x) {x^2})
@
(Mathematica has an even slicker way to do this.)
You can also omit the curly brackets when your function consists of a single statement. If it has more than one you can use semicolons to keep all the statements on the same line, for compactness; e.g.
<<>>=
sapply(1:5,function(x) {y <- x; y^2})
@
(although in this case the extra statement is obviously pointless).

You'd also be surprised sometimes what can be used as a function:
<<>>=
sapply(1:5,"^",2)
@

This example also illustrates a powerful and sometimes overlooked feature of \code{*apply}: extra arguments get passed through to the function you are applying. This is particularly handy when you want to apply the function to a vector but use the vector as something other than the first argument to the function. For example, suppose we wanted to run a linear regression on a series of different data sets. Rather than
<<eval=FALSE>>=
datlist = list(dat1,dat2,dat3)
lapply(datlist, function(d) lm(y~x,data=d))
@
we could just say
<<eval=FALSE>>=
datlist = list(dat1,dat2,dat3)
lapply(datlist, lm, formula=y~x)
@
R will fill in the \code{formula} argument and then use the elements of \code{datlist} for the next unfilled argument, which in this case is \code{data}.

Note that \code{apply}ing can also be overdone: see section 4 of Patrick Burns' ``R Inferno'' (\url{http://www.burns-stat.com/pages/Tutor/R_inferno.pdf}) (which is a pleasure to read in general).
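
As a minimal sketch of what overdoing it can look like (the variable names \code{x}, \code{looped}, and \code{vectorized} are made up for this illustration and do not come from the ``R Inferno''): when an operation is already vectorized, as \verb+^+ is, wrapping it in \code{sapply} gives the same answer as applying it to the whole vector directly, just with one function call per element.
<<>>=
x <- 1:5
## one call to the anonymous function per element of x
looped <- sapply(x, function(v) v^2)
## a single vectorized operation on the whole vector
vectorized <- x^2
## check that the two approaches agree
identical(looped, vectorized)
@
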
The following table is reproduced and slightly extended from the ``R Inferno'':

\setlength\parindent{0pt}
\begin{tabular}{rccp{1.5in}}
\textbf{function} & \textbf{input} & \textbf{output} & \textbf{comment} \\
\hline
\code{apply} & matrix or array & vector or array or list & \\
\code{lapply} & list or vector & list & \\
\code{sapply} & list or vector & vector or matrix or list & simplify \\
\code{tapply} & data, categories & array or list & ragged \\
\code{mapply} & lists and/or vectors & vector or matrix or list & multiple \\
\code{rapply} & list & vector or list & recursive \\
\code{eapply} & environment & list & \\
\code{dendrapply} & dendrogram & dendrogram & \\
\code{zoo::rollapply} & data & similar to input & \\
\code{emdbook::apply2d} & two vectors & matrix & \\
\code{multicore::mclapply} & same as \code{lapply} & same as \code{lapply} & parallelize across cores (OK on Unix, experimental for Windows (pre-Vista only): see \url{http://rforge.net/multicore}) \\
\end{tabular}

\code{kernapply} has the same pattern, but I don't think it is really in the \code{*apply} family.

Also: \code{simFrame::simApply}, functions in \code{Rmpi} (\code{mpi.parapply}, \code{mpi.iapply}, \code{mpi.apply}), \code{gridR::apply}, \code{RMySQL::dbApply}, \code{RPostgreSQL::dbApply}, \code{PerformanceAnalytics::apply.rolling}, \code{ff::ffapply}, \code{xts::\{period.apply,apply.monthly\}}, etc. (these are the results of \code{sos::findFn("apply")}). Also \code{nlme::gapply}.

\section{\code{apply}}

Apply \code{FUN} to the ``margins'' of a matrix or array. ``Margin'' here means row, column, or other ``slices'' of a higher-dimensional array. The \code{MARGIN} argument is 1 for rows, 2 for columns, and \code{n} for another dimension of a higher-dimensional array. You can give more than one margin:
<<>>=
m = matrix(1:4,byrow=TRUE,ncol=2)
apply(m,c(1,2),function(x) x^2)
@
Of course, in this case we don't do any better than just saying \verb+m^2+.
But we could \code{apply} over more than one, but not all, dimensions of an array with $>2$ dimensions.

\code{colSums}, \code{rowSums}, \code{colMeans}, \code{rowMeans} are special cases that are considerably faster than the equivalent \code{apply} commands. (I think there's an equivalent for the median somewhere in a Bioconductor package.)

\section{\code{lapply}}

Apply a function to a list.

\section{\code{sapply}}

Apply a function to a list or a vector (this is handy so you don't have to say \code{lapply(as.list(x))}), and simplify the results if possible.

\section{\code{mapply}}

Apply a function of multiple arguments to multiple lists. I sometimes use this as a shortcut where I should probably just give up and use a \code{for} loop.
<<eval=FALSE>>=
mapply(function(dat,i) {
  plot(dat$x,dat$y,col=i)
  text(1,2,names(dat)[i])
}, datlist,seq_along(datlist))
@
It would be great to have a way within an \code{*apply} function to access the current value of the index (or name of the current element) but I don't know of one \ldots

Additional arguments have to be specified explicitly with \code{MoreArgs}. Depending on what you're doing you may want \code{SIMPLIFY} to be \code{TRUE} or \code{FALSE} \ldots

\section*{Related functions}

\begin{tabular}{lp{4in}}
\textbf{function} & \textbf{purpose} \\
\code{do.call} & apply a function to a list of arguments \\
\code{replicate} & repeat an expression many times \\
\code{outer} & apply a function to all combinations of two vectors (function must be vectorized; otherwise see \code{emdbook::apply2d}) \\
\code{Map} & wrapper for \code{mapply} that does not simplify its result: see \code{?funprog} \\
\code{Reduce} & apply a function to successively combine elements \\
\code{cumsum} & (and \code{cummax}, \code{cummin}, \code{cumprod}): cumulative functions \\
\code{plyr::ddply} & (and friends) split an object, apply a function to chunks, then recombine the chunks (\code{split}/\code{tapply}/\code{rbind} on steroids)
\end{tabular}

For the truly clever: why does this work?
<<>>=
N <- 0; replicate(20,N <<- N-round(0.25*N)+10)
@

\end{document}