\documentclass{article}
\usepackage{Sweave}
\newcommand{\R}{{\sf R}}
\title{Whirlwind R notes for STRANGE}
\author{Ben Bolker}
\date{\today}
\begin{document}
\maketitle

\section{Why R?}

The standard list:
\begin{itemize}
\item{it's an advanced stats package, comparable to SAS etc.; you are very unlikely ever to find a statistical procedure you can't do in R}
\item{complete programming language}
\item{good graphics (although not easy)}
\item{free, both ``as in beer'' (\$) and philosophically (open source)}
\item{cross-platform}
\item{encourages \emph{repeatable} and \emph{automated} data manipulation, analysis, plotting}
\end{itemize}

Disadvantages: speed (vs. C/Java/FORTRAN/MATLAB?); user-friendliness (vs. JMP, Excel, SPSS); handling enormous data sets (vs. SAS; but use database backend); particular problems (vs. MARK, DISTANCE).

\section{Getting in and getting out}

Lebanese proverb: ``when entering, always look for the exit''.

Use {\tt q()} to quit (not {\tt q}, which will list the function for you!). Say ``yes'' to saving the workspace if you want to continue on the same problem later. The Escape ({\tt ESC}) key will stop a running computation or print-out.

\section{Interactive calculations}

When \R\ is launched it opens the \textbf{console} window. This has a few basic menus at the top; check them out on your own. The console window is also where you enter commands for \R\ to execute \emph{interactively}, meaning that the command is executed and the result is displayed as soon as you hit the {\tt Enter} key. For example, at the command prompt \texttt{>}, type in \texttt{2+2} and hit \texttt{Enter}; you will see
<<>>=
2+2
@

To do anything complicated, you have to \emph{assign} the results from calculations to a variable, e.g.
<<>>=
a=2+2
@
The variable \texttt{a} is automatically created and the result (4) is stored in it, but nothing is printed. This may seem strange, but the default is \emph{not} to print results, so that large lists won't fill the screen.
To print the value of a variable, just type the variable name by itself:
<<>>=
a
@
In this case {\tt a} is a \emph{numeric vector} (of length 1), which acts just like a number: we'll talk about data types more shortly.

You can break lines \textbf{anywhere that \R\ can tell you haven't finished your command} and \R\ will give you a ``continuation'' prompt (+) to let you know that it doesn't think you're finished yet: try typing
\begin{verbatim}
a=3*(4+
5)
\end{verbatim}
to see what happens (this often happens e.g. if you forget to close parentheses).

Variable names in \R\ must begin with a letter, followed by alphanumeric characters. Long names can be broken up using a period, as in \texttt{very.long.variable.number.3}, but (beware!) you \textbf{cannot} use blank spaces in variable names. \R\ is case sensitive: \texttt{Abc} and \texttt{abc} are \textbf{not} the same variable. Make names long enough to remember, short enough to type. Avoid: {\tt c}, {\tt l}, {\tt q}, {\tt t}, {\tt C}, {\tt D}, {\tt F}, {\tt I}, {\tt T}, which are all built-in functions or hard to distinguish.

Calculations are done with variables as if they were numbers. \R\ uses \verb! +, -, *, /, and ^ ! for addition, subtraction, multiplication, division and exponentiation, respectively. For example:
<<>>=
x=5
y=2
z1=x*y
z2=x/y
z3=x^y
z2
z3
@
Even though the values of \texttt{x} and \texttt{y} were not displayed, \R\ remembers that values have been assigned to them. Type {\tt x} or {\tt y} to display the values.

You can edit commands to correct or modify them. The \thinspace $\uparrow$ \thinspace key (or \texttt{Control-P}) recalls previous commands to the prompt. For example, you can bring back the third-from-last command and edit it to
<<>>=
z3=2*x^y
@
(experiment with the $\downarrow$, $\rightarrow$, $\leftarrow$, {\tt Home} and {\tt End} keys).
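
When several operators appear in one expression, \R\ follows the usual precedence rules of arithmetic. A small sketch, reusing the values {\tt x=5} and {\tt y=2} from above (the particular expressions are illustrative only and are not part of the original notes):
\begin{verbatim}
x = 5
y = 2
x + y * 2      # 9: multiplication is done before addition
(x + y) * 2    # 14: parentheses override precedence
-x^2           # -25: exponentiation is done before unary minus
2^y^3          # 256: ^ groups right to left, i.e. 2^(y^3)
\end{verbatim}
When in doubt, add parentheses: they cost nothing and make the intent obvious.
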
You can combine several operations in one calculation:
<<>>=
A=3
C=(A+2*sqrt(A))/(A+5*sqrt(A)); C
@

\R\ also has many built-in mathematical functions: {\tt log()} and {\tt exp()} (and {\tt log10()}), {\tt sin()} and {\tt cos()}, etc.

\textbf{Exercise:} the probability density of the standard normal distribution is $\frac{1}{\sqrt{2 \pi}} e^{-x^2/2}$. Compute its value for $x=1$ and $x=2$.

\textbf{Logical operators} produce {\tt TRUE} or {\tt FALSE} as answers: {\tt ==} (double-equals), $>$ and $<$ compare two values; {\tt |} (or), {\tt \&} (and), and {\tt !} (not) combine or modify other logical values.
<<>>=
A == 3
A > 2
(A>2) & (A<2)
(A>2) | (A<4)
!(A>2)
@

\textbf{Exercise:} convince yourself that {\tt !(a \& b)} is equivalent to {\tt !a | !b}, no matter what the values of {\tt a} and {\tt b} are.

\section{Data types and structures}

\subsection{Vectors}

Vectors are sequences of values, all of the same type or \emph{class} (numeric, logical, character). As mentioned above, a number is just a vector of length 1. Some functions create vectors; you can also create and assign vectors with {\tt c()}, e.g.
<<>>=
x = c(1,2,7,8,9)
@
Typing {\tt 1,2,7,8,9} without {\tt c()} gives an error.

Vector elements can be named:
<<>>=
x = c(first=1,second=1.2,third=1.5)
@

Refer to elements in vectors
\begin{itemize}
\item{by \emph{position}: {\tt x[1]} or (multiple elements) {\tt x[c(1,2)]}}
\item{by \emph{name}: {\tt x["first"]}}
\item{by \emph{exclusion}: {\tt x[-1]} drops the first element}
\item{with \emph{logical vectors}: {\tt x[c(TRUE,TRUE,FALSE)]} gives the first two elements; more usefully, {\tt x[x>1.1]} first computes a logical vector {\tt [FALSE, TRUE, TRUE]} and then uses it to select the second and third elements.}
\end{itemize}

Most of \R's functions are \emph{vectorized}: give them vectors and they automatically do the right thing.
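
As a minimal sketch of what vectorization looks like in practice (the vector {\tt v} is a made-up example, not taken from the notes above):
\begin{verbatim}
v = c(1, 4, 9, 16)
sqrt(v)         # 1 2 3 4: the square root of each element
v > 5           # FALSE FALSE TRUE TRUE
v[v > 5]        # 9 16: logical indexing with the vector above
sqrt(v[-1])     # 2 3 4: drop the first element, then take roots
\end{verbatim}
No explicit loop is needed in any of these.
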
Functions of two vectors ({\tt c(1,3,5,7) + c(2,4)}) will automatically \emph{replicate} the shorter vector until it is as long as the longer one, giving a warning message if the length of the longer is not a whole multiple of the length of the shorter.
<<>>=
x = c(1,3,5,7)
2*x
x+c(2,5)
x+c(1,2,4)
@
\begin{verbatim}
Warning message:
longer object length
        is not a multiple of shorter object length in: x + c(1, 2, 4)
\end{verbatim}

Other functions in \R\ are inherently vector functions, taking a whole vector and returning a single summary value: {\tt mean()}, {\tt var()}, {\tt sum()} \ldots

\subsection{Matrices}

Matrices are tables of data, all of the same type (numeric, character, logical). Retrieve elements from matrices, selecting rows, columns, or both, by placing commas appropriately. Rows come first (before the comma) and columns second: {\tt p[1,]} (row 1), {\tt p[,2:5]} (columns 2 through 5), {\tt p[3,4]} (the element in row 3, column 4).

Create a matrix with {\tt matrix()}: by default \R\ fills matrices \emph{column-first}:
<<>>=
matrix(c(1,3,4,5),nrow=2)
matrix(c(1,3,4,5),nrow=2,byrow=TRUE)
@
Matrices act like vectors when appropriate:
<<>>=
log(matrix(c(1,3,4,5),nrow=2))
@

\subsection{Lists}

Lists are collections of \emph{anything}: vectors and matrices of different sizes and types, results of statistical analyses, etc. Use {\tt \$} to pull out the elements of a list by name, and {\tt [[]]} to pull out elements by number.
<<>>=
x = list(x=c(1,2,3),y=c("a","b"))
x[[2]]
x$y
@

\subsection{Data frames}

Data frames are a confusing but useful cross between lists and matrices: lists of equal-length columns, which can be different types, and which are treated like either matrices or lists depending on the context. You can access their elements either like a list ({\tt x\$x}, {\tt x[[1]]}) or like a matrix ({\tt x[2,4]}).

\section{{\tt rep()} and {\tt seq()}}

These two functions are very useful for generating vectors: {\tt seq()} (and its abbreviation, {\tt :}) generates different kinds of regular sequences, while {\tt rep()} replicates existing vectors.
<<>>=
rep(1:4,3)
rep(1:4,each=3)
rep(1:4,c(1,2,3,4))
1:10
seq(1,10)
seq(1,10,by=2)
seq(0,2,length=7)
@

\section{Getting help}

\begin{itemize}
\item{Typing {\tt ?} followed by the name of a function (e.g. {\tt ?mean}) will pop up a window with information on the function (only useful if you already know its name (!) and it is part of a package that has been loaded)}
\item{{\tt help(package=packagename)} lists and gives brief descriptions of all of the functions in a package}
\item{{\tt help.search("word")} looks through all documentation, including installed but unloaded packages, for functions whose name or short description includes {\tt word}. \emph{Does not do full-text search}, nor have R functions been extensively cross-indexed, so you have to be close or get lucky.}
\item{{\tt help.start()} (with parentheses) pops up a browser window: go to Packages (probably {\tt base}, {\tt stats}, or {\tt graphics}) to find info on functions}
\item{{\tt RSiteSearch("word or phrase")} (new!) goes to the R web site and does a query}
\item{{\tt example("function")} runs the examples given in the help page for the function}
\end{itemize}

Ask an instructor!

\section{Getting stuff in and out of R}

\subsection{Data}

Basic input functions are {\tt read.table()} and {\tt read.csv()} (see other notes). Save objects in R format with {\tt save("x","y","z",file="savefile.RData")}. Write data out to a text file with {\tt write.table(mydata,file="mydata.txt")}.

\subsection{Code}

Your own code, written in a text file: {\tt source("myfuns.R")}. Pre-existing code that has been compiled in a {\tt package} and installed on your system (including functions found by {\tt help.start()}): {\tt library(packagename)}. Installing new packages on your system: {\tt install.packages("packagename")}.

For smaller pieces of code, you can simply cut and paste from Notepad or Wordpad, Tinn-R, or another \emph{text editor} such as emacs.
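
The data and code functions in this section can be tried out together as a round trip. This is only a sketch: the data frame {\tt mydata}, the function {\tt square()} and the temporary files are made up for illustration, {\tt tempfile()} is used so that nothing is written to your working directory, and {\tt load()} (not mentioned above) is the counterpart of {\tt save()}.
\begin{verbatim}
mydata = data.frame(id=1:3, mass=c(2.5,3.1,4.0))

## text file: write it out, read it back in
txt = tempfile(fileext=".txt")
write.table(mydata, file=txt, row.names=FALSE)
mydata2 = read.table(txt, header=TRUE)

## R format: save, remove, restore
rda = tempfile(fileext=".RData")
save("mydata", file=rda)
rm(mydata)
load(rda)           # mydata exists again

## code: write a one-line script, then source() it
src = tempfile(fileext=".R")
writeLines("square = function(x) x^2", src)
source(src)
square(3)           # 9
\end{verbatim}
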
\textbf{Do not use Microsoft Word to edit R code: it will screw things up.} Save code by cutting and pasting (also possibly {\tt save()} or {\tt dump()}).

\section{Plotting}

<<>>=
t=seq(0,20,length=50)
x=rnorm(50)
y=rnorm(50)
@
I will explain {\tt par(mfrow=c(2,2))} in a bit: it makes a 2 $\times$ 2 array of plots on the page. Plot array:
<<fig=TRUE>>=
par(mfrow=c(2,2))
plot(x,y)
plot(t,x)
plot(t,x,type="l")
matplot(t,cbind(x,y),type="l")
@

{\tt plot()} is a generic function that \emph{may} do the right thing if you just give it a data object, for example a \emph{factor} (a categorical variable). Sample at random (with replacement):
<<>>=
f <- factor(sample(c("healthy","sick","dead"),size=50,replace=TRUE))
table(f)
@
<<fig=TRUE>>=
plot(f)
@
<<>>=
g <- factor(sample(c("AA","Aa","aa"),size=50,replace=TRUE))
table(f,g)
@
<<fig=TRUE>>=
plot(f,g)
@
<<fig=TRUE>>=
par(mfrow=c(2,2))
boxplot(x)
boxplot(x~f)
boxplot(x~f+g)
barplot(c(5,7,8,9))
@

There are \textbf{many many} options for changing axes, labels, adding legends, colors, line types, line widths. The help page for the {\tt par()} command ({\tt ?par}) is voluminous, but you can search within it for the stuff you want to find. In general, {\tt par("name")} queries the current setting of graphics parameter {\tt name}, while {\tt par(name=value)} sets the value of the parameter (you can set several at once: {\tt par(name1=value1,name2=value2)}). {\tt par()} shows \emph{all} the parameters at once.

\section{Flow control}

\begin{itemize}
\item{{\tt if (condition) \{ \} else \{ \}} (braces around chunks of code)}
\item{{\tt ifelse(x,y,z)} works element by element, returning {\tt y} where condition {\tt x} is true and otherwise (else) returning {\tt z}: e.g. {\tt ifelse(x<0,0,x)} (this particular example is equivalent to {\tt pmax(0,x)} where {\tt pmax()} is ``parallel (vectorized) maximum'')}
\item{{\tt for (i in x) \{ \}}: works through the elements of {\tt x}, setting {\tt i} to each one in turn. Most common is {\tt for (i in 1:n)}. If working through a non-integer vector (e.g.
of parameters; {\tt beta=seq(1,10,by=0.1)}), you still probably want to do {\tt for (i in 1:length(x))} so that you can use {\tt i} as an index for saving the results \ldots}
\item{\verb+while () {}+}
\end{itemize}

\section{Functions}

\R\ has many, \textbf{many} built-in functions. \emph{Arguments} are sometimes obvious, e.g. {\tt sin(x)}. More complex functions can have many different arguments (e.g. {\tt matrix}); these can be specified by name or by position:
\begin{verbatim}
matrix(data=1:6,nrow=2,ncol=3,byrow=TRUE)
\end{verbatim}
is the same as
\begin{verbatim}
matrix(1:6,2,3,TRUE)
\end{verbatim}
but the first is obviously easier to understand. Most complicated functions also have \emph{defaults} for most of their arguments, so you can specify just the ones you want to be different from the defaults. Named arguments can be in any order, and {\tt matrix()} works out the number of rows from the data when {\tt nrow} is omitted, so \verb+matrix(1:6,ncol=3,byrow=TRUE)+ is also equivalent to the variants above.

You can (and should) easily define your own functions: a very simple example is
\begin{verbatim}
square <- function(x) {
  x^2
}
\end{verbatim}
The last statement in the function is its \emph{return value}; you can also use the {\tt return()} function to specify what the function returns. It's a common mistake to forget this:
\begin{verbatim}
myfun <- function(x) {
  if (x<0) {
    y=x^2
  } else {
    y=x^3
  }
}
\end{verbatim}
prints \emph{nothing} when you call it at the prompt, because the last statement is an assignment and its value is returned only invisibly: it should be
\begin{verbatim}
myfun <- function(x) {
  if (x<0) {
    y=x^2
  } else {
    y=x^3
  }
  y
}
\end{verbatim}
An equivalent solution:
\begin{verbatim}
myfun <- function(x) {
  if (x<0) {
    return(x^2)
  } else {
    return(x^3)
  }
}
\end{verbatim}

\textbf{Scope:} inside functions, variables are \emph{local}, insulated from the ``calling environment''. You can re-use a variable name inside a function that you also use outside it, without changing the outside value once the function is done. This means \emph{you can't use a function to change the value of a variable}.
Calling the following function
\begin{verbatim}
squarex <- function(x) {
  x=x^2
  return(x)
}
\end{verbatim}
will not change the value of {\tt x}. You need to say {\tt x = squarex(x)}. If you \emph{absolutely must} you can use the global assignment operator \verb+<<-+, but you probably shouldn't.

\section{Maximization and fitting}

Give {\tt optim()} a function (that takes a vector of parameters and returns a single number, e.g. a goodness-of-fit value), and a vector of parameter starting values, and it will try to find the parameter values that minimize the function. To maximize instead, minimize the negative of the function.

\section{Probability distributions}

R knows about most distributions you would ever care about. For each, there are four associated functions, prefixed {\tt d}, {\tt p}, {\tt q} and {\tt r}, that give the probability density function, cumulative distribution function, quantile (inverse CDF) function, and a random-deviate generator respectively (e.g. {\tt dnorm()}, {\tt pnorm()}, {\tt qnorm()} and {\tt rnorm()} for the normal distribution). Each takes standard arguments for its type of function, plus the parameters of the distribution.

\section{Standard statistics}

\begin{itemize}
\item{Linear regression, ANOVA, ANCOVA: {\tt lm()}}
\item{Generalized linear models (logistic regression, Poisson regression, etc.): {\tt glm()}}
\item{Survival analysis: package {\tt survival}}
\item{Others: {\tt t.test()}, {\tt chisq.test()}, {\tt pairwise.t.test()}, {\tt binom.test()} (difference in proportions, binomial samples), {\tt cor.test()} (correlations), {\tt wilcox.test()} (Wilcoxon/Mann-Whitney nonparametric), {\tt kruskal.test()} (Kruskal-Wallis), {\tt ks.test()} (Kolmogorov-Smirnov), \ldots}
\end{itemize}

\end{document}