\documentclass{article}
\usepackage{Sweave}
\newcommand{\R}{{\sf R}}
\title{Whirlwind R notes for STRANGE}
\author{Ben Bolker}
\date{\today}
\begin{document}
\maketitle
\section{Why R?}
The standard list:
\begin{itemize}
\item{it's an advanced stats package, comparable to
SAS etc.; you are very unlikely ever to find
a statistical procedure you can't do in R}
\item{complete programming language}
\item{good graphics (although not easy)}
\item{free, both ``as in beer'' (\$) and
philosophically (open source)}
\item{cross-platform}
\item{encourages \emph{repeatable}
and \emph{automated} data manipulation,
analysis, plotting}
\end{itemize}

Disadvantages: speed (vs. C/Java/FORTRAN/MATLAB?);
user-friendliness (vs. JMP, Excel, SPSS);
handling enormous data sets (vs. SAS; but use
database backend); particular problems
(vs. MARK, DISTANCE).

\section{Getting in and getting out}
Lebanese proverb: ``when entering, always
look for the exit''

Use {\tt q()} to quit (not {\tt q}, which will
list the function for you!).  Say ``yes'' to
saving the workspace
if you want to continue on the same problem
later.

The Escape ({\tt ESC}) key will stop a
running computation or print-out.

\section{Interactive calculations}
When \R\ is launched it opens the \textbf{console} window.
This has a few basic menus at the top; check them out on your own. The console window is
also where you enter commands for \R\ to execute
\emph{interactively}, meaning that the command is executed and
the result is displayed as soon as you hit the {\tt Enter} key. For example, at
the command prompt \texttt{>}, type in \texttt{2+2} and hit \texttt{Enter}; you will see
<<>>=
2+2
@

To do anything complicated, you have to
\emph{assign} the results from calculations to a variable, e.g.
<<>>=
a=2+2
@
The variable \texttt{a} is automatically created and the result (4) is stored
in it, but nothing is printed.  This may seem strange, but the default is
\emph{not} to print results, so that large lists won't fill the screen.
To print the value of a variable, just type the variable name by itself:
<<>>=
a
@

In this case {\tt a} is a \emph{numeric vector} (of length 1),
which acts just like a number: we'll talk about data types more
shortly.

You can break lines \textbf{anywhere that \R\ can tell you haven't
finished your command} and \R\ will give you a ``continuation'' prompt
(+) to let you know that it doesn't think you're finished yet: try typing
\begin{verbatim}
a=3*(4+
5)
\end{verbatim}
to see what happens
(this often happens e.g. if you forget to close parentheses).

Variable names in \R\ must begin with a letter, followed by alphanumeric
characters. Long names can be broken up using a period, as in
\texttt{very.long.variable.number.3}, but (beware!) you
\textbf{cannot} use  blank spaces
in variable names. \R\ is case sensitive: \texttt{Abc} and \texttt{abc}
are \textbf{not} the same variable. Make names long enough to remember, short
enough to type.  Avoid:
{\tt c}, {\tt l}, {\tt q}, {\tt t}, {\tt C}, {\tt D},
{\tt F}, {\tt I}, {\tt T}, which are either built-in functions or constants, or
hard to distinguish from other characters.

Calculations are done with variables as if they were numbers. \R\ uses
\verb! +, -, *, /, and ^ !
for addition, subtraction, multiplication, division and
exponentiation, respectively. For example:
<<>>=
x=5
y=2
z1=x*y
z2=x/y
z3=x^y
z2
z3
@

Even though the values of \texttt{x} and \texttt{y} were not displayed, \R\
remembers that values have
been assigned to them. Type {\tt x} or {\tt y} to display the values.

You can edit commands to correct or modify them.
The \thinspace $\uparrow$ \thinspace key (or
\texttt{Control-P}) recalls previous
commands to the prompt. For example, you can bring back the third-from-last command and edit it to
<<>>=
z3=2*x^y
@
(experiment with the $\downarrow$, $\rightarrow$, $\leftarrow$, {\tt Home} and {\tt End} keys).

You can combine several operations in one calculation:
<<>>=
A=3
C=(A+2*sqrt(A))/(A+5*sqrt(A)); C
@

\R\ also has many built-in mathematical functions:
{\tt log()} and {\tt exp()} (and {\tt log10()}), {\tt sin()} and {\tt cos()},
etc.

\textbf{Exercise:} the probability density of the standard normal distribution is
$\frac{1}{\sqrt{2 \pi}} e^{-x^2/2}$.  Compute its value for $x=1$
and $x=2$.

\textbf{Logical operators} produce {\tt TRUE} or {\tt FALSE} as answers:
{\tt ==} (double-equals), $>$ and $<$ compare two values,
{\tt |} (or), {\tt \&} (and), and {\tt !} (not) modify other logical
values.
<<>>=
A == 3
A > 2
(A>2) & (A<2)
(A>2) | (A<4)
!(A>2)
@

\textbf{Exercise:} convince yourself that
{\tt !(a \& b)} is equivalent to {\tt !a | !b}, no
matter what the values of {\tt a} and {\tt b} are.

\section{Data types and structures}
\subsection{vectors}
Vectors are ordered sets of values, all of the same
type or \emph{class} (numeric, logical, character).
As mentioned above, a number is just a vector
of length 1.
Some functions create vectors: you can
also create/assign vectors by, e.g.
<<>>=
x = c(1,2,7,8,9)
@
Typing {\tt 1,2,7,8,9}
without {\tt c()} gives an error.
Vector elements can be named:
<<>>=
x = c(first=1,second=1.2,third=1.5)
@
Refer to elements in vectors:
\begin{itemize}
\item{by \emph{position}: {\tt x[1]} or (multiple elements)
  {\tt x[c(1,2)]}}
\item{by \emph{name}: {\tt x["first"]}}
\item{by \emph{exclusion}: {\tt x[-1]} drops the first element}
\item{with \emph{logical vectors}: {\tt x[c(TRUE,TRUE,FALSE)]}
gives the first two elements; more usefully, {\tt x[x>1.1]}
first computes a logical vector {\tt c(FALSE,TRUE,TRUE)}
and then uses it to select the second and third elements.}
\end{itemize}
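
The four kinds of indexing can be tried on a small named vector;
{\tt wt} and its values are made up for illustration.
<<>>=
wt = c(first=1, second=1.2, third=1.5)  # hypothetical named vector
wt[2]                     # by position
wt[c("first","third")]    # by name
wt[-1]                    # by exclusion: drops the first element
wt > 1.1                  # the logical vector used in the next line
wt[wt > 1.1]              # keeps only the elements where the test is TRUE
@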

Most of \R's functions are \emph{vectorized}: give them
vectors and they automatically do the right thing.
Functions of two vectors ({\tt c(1,3,5,7) + c(2,4)}) will
automatically \emph{replicate} the shorter vector until
it is as long as the longer one, giving a warning message if
the length of the longer is not a multiple of the length of the shorter.
<<>>=
x = c(1,3,5,7)
2*x
x+c(2,5)
x+c(1,2,4)
@
\begin{verbatim}
Warning message:
longer object length
        is not a multiple of shorter object length in: x + c(1, 2, 4)
\end{verbatim}

Other functions in \R\ inherently work on a whole vector and reduce it to a single value:
{\tt mean()}, {\tt var()}, {\tt sum()} \ldots
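
For instance, with a short made-up vector {\tt obs} (name and values
chosen only for illustration):
<<>>=
obs = c(2,4,4,6)   # hypothetical data
sum(obs)           # one number for the whole vector
mean(obs)
var(obs)           # sample variance (divides by n-1)
@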

\subsection{Matrices}

Matrices are tables of data, all of the same type
(numeric, character, logical).

Retrieve elements from
matrices, selecting rows, columns, or both,
by placing commas appropriately.
Rows come first (before the comma) and columns second:
for a matrix {\tt p}, {\tt p[1,]} is row 1, {\tt p[,2:5]} is columns 2 through 5,
and {\tt p[3,4]} is the element in row 3, column 4.

Create a matrix with {\tt matrix()}: by default \R\ fills
matrices \emph{column-first}:
<<>>=
matrix(c(1,3,4,5),nrow=2)
matrix(c(1,3,4,5),nrow=2,byrow=TRUE)
@

Matrices act like vectors when appropriate:
<<>>=
log(matrix(c(1,3,4,5),nrow=2))
@
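
Going back to row and column selection, here is a sketch on a made-up
4 $\times$ 5 matrix; {\tt pm} is a hypothetical name, used so as not to
disturb other variables.
<<>>=
pm = matrix(1:20, nrow=4)  # hypothetical matrix, filled column-first
pm[1,]      # row 1
pm[,2:5]    # columns 2 through 5
pm[3,4]     # the single element in row 3, column 4
@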

\subsection{Lists}
Lists are collections of \emph{anything}:
vectors and matrices of different sizes
and types, results of statistical analyses,
etc.  Use {\tt \$} to pull out the elements
of a list by name, and {\tt [[]]} to pull
out elements by number.
<<>>=
x = list(x=c(1,2,3),y=c("a","b"))
x[[2]]
x$y
@

\subsection{Data frames}
Data frames are a confusing but useful cross between lists
and matrices: lists of equal-length columns,
which can be different types, and which are treated
like either matrices or lists depending on the
context.  You can access their elements either
like a list ({\tt x\$x, x[[1]]}) or
like a matrix ({\tt x[2,4]}).
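
A minimal sketch of both styles of access, on a made-up data frame
({\tt dd}, {\tt id} and {\tt mass} are hypothetical names):
<<>>=
# two equal-length columns of different types
dd = data.frame(id=c("a","b","c"), mass=c(1.5,2.0,3.2))
dd$mass            # list-style access by name
dd[[2]]            # list-style access by position
dd[2,2]            # matrix-style access: row 2, column 2
dd[dd$mass>1.8,]   # rows picked out with a logical vector
@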

\section{{\tt rep()} and {\tt seq()}}
These two functions are very useful for
generating vectors: {\tt seq()} (and its
abbreviation, {\tt :}) generates different kinds
of regular sequences, while {\tt rep()} replicates
existing vectors.
<<>>=
rep(1:4,3)
rep(1:4,each=3)
rep(1:4,c(1,2,3,4))
1:10
seq(1,10)
seq(1,10,by=2)
seq(0,2,length=7)
@

\section{Getting help}

\begin{itemize}
\item{Typing {\tt ?} followed by the name of a
function (e.g. {\tt ?mean}) will pop up a window
with information on the function (only useful if
you already know its name (!) and it is part of
a package that has been loaded)}
\item{{\tt help(package=packagename)} lists and
gives brief descriptions of all
of the functions in a package}
\item{{\tt help.search("word")} looks through
all documentation, including installed but
unloaded packages, whose name or short description
includes {\tt word}. \emph{Does not do full-text
search}, nor have R functions been extensively
cross-indexed, so you have to be close or get
lucky.}
\item{{\tt help.start()} (with parentheses) pops
up a browser window: go to Packages (probably
{\tt base}, {\tt stats}, or {\tt graphics}) to
find info on functions}
\item{{\tt RSiteSearch("word or phrase")} (new!) goes to the R
web site and does a query}
\item{{\tt example("function")} runs the examples
given in the help page for the function}
\end{itemize}

Ask an instructor!

\section{Getting stuff in and out of R}
\subsection{Data}
Basic input functions are {\tt read.table()}
and {\tt read.csv()} --- see other notes.

Save objects in R format with {\tt save("x","y","z",file="savefile.RData")}.
Write data out to a file with {\tt write.table(mydata,file="mydata.txt")}.
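
A round trip might look like the following sketch. The data frame
{\tt dat} is made up, and the file names come from {\tt tempfile()} (an
assumption made here so that nothing is left in the working directory);
in practice you would give your own file names.
<<>>=
dat = data.frame(n=1:3, grp=c("a","b","a"))  # hypothetical data
txtfile = tempfile(fileext=".txt")       # temporary file names
rdafile = tempfile(fileext=".RData")
write.table(dat, file=txtfile)           # plain text out ...
dat2 = read.table(txtfile, header=TRUE)  # ... and back in
dat2
save("dat", file=rdafile)                # R's own format
rm(dat)                                  # remove the object ...
load(rdafile)                            # ... and restore it by name
dat
@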

\subsection{Code}
Your own code, written in a text file: {\tt source("myfuns.R")}.
Pre-existing code that has been compiled in a  {\tt package}
and installed on your system (including functions found
by {\tt help.start()}): {\tt library(packagename)}.
Installing new packages on your system: {\tt install.packages("packagename")}.

For smaller pieces of code, you can simply cut and
paste from Notepad or Wordpad, Tinn-R, or another
\emph{text editor} such as emacs.  \textbf{Do not
use Microsoft Word to edit R code, it will screw
things up.}

Save code by cutting and pasting (also possibly
{\tt save()} or {\tt dump()}).

\section{Plotting}
<<>>=
t=seq(0,20,length=50)
x=rnorm(50)
y=rnorm(50)
@

I will explain {\tt par(mfrow=c(2,2))}
in a bit: it makes a 2 $\times$ 2
array of plots on the page:
<<fig=TRUE>>=
par(mfrow=c(2,2))
plot(x,y)
plot(t,x)
plot(t,x,type="l")
matplot(t,cbind(x,y),type="l")
@

{\tt plot()} is a generic function that
\emph{may} do the right thing if you
just give it a data object, for example
a \emph{factor} (a categorical variable):

Sample at random (with replacement):
<<>>=
f <- factor(sample(c("healthy","sick","dead"),size=50,replace=TRUE))
table(f)
@

<<fig=TRUE>>=
plot(f)
@

<<>>=
g <- factor(sample(c("AA","Aa","aa"),size=50,replace=TRUE))
table(f,g)
@

<<fig=TRUE>>=
plot(f,g)
@

<<fig=TRUE>>=
par(mfrow=c(2,2))
boxplot(x)
boxplot(x~f)
boxplot(x~f+g)
barplot(c(5,7,8,9))
@

There are \textbf{many many} options for changing
axes, labels, adding legends, colors, line types,
line widths.  The help page for the {\tt par()} command
({\tt ?par}) is voluminous, but you can search
within it for the stuff you want to find.
In general, {\tt par("name")} queries the current
setting of graphics parameter {\tt name}, while
{\tt par(name=value)} sets the value of the parameter
(you can set several at once: {\tt par(name1=value1,name2=value2)}).
{\tt par()} shows \emph{all} the parameters at once.
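
A short sketch of querying, setting and restoring parameters; {\tt lwd}
(line width) and {\tt las} (axis label orientation) are standard
parameters picked here only as examples, and {\tt oldpar} is a
hypothetical name.
<<>>=
par("lwd")                   # query one parameter
oldpar = par(lwd=2, las=1)   # set two at once; the old values are returned
par("lwd")
par(oldpar)                  # put the previous settings back
@
Saving the old values this way makes it easy to undo changes after a
plot is finished.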


\section{Flow control}
\begin{itemize}
\item{{\tt if (condition) \{
\} else \{ \}}
(braces around chunks of code)
}
\item{{\tt ifelse(x,y,z)} returns {\tt y} if condition
{\tt x} is true, otherwise (else) returns {\tt z}:
e.g. {\tt ifelse(x<0,0,x)} (this particular example
is equivalent to {\tt pmax(0,x)} where {\tt pmax()}
is ``parallel (vectorized) maximum'')}
\item{{\tt for (i in x) \{ \}}:
works through the elements of {\tt x},
setting {\tt i} to each one in turn.
Most common is {\tt for (i in 1:n)}.
If working through a non-integer
vector (e.g. of parameters;
{\tt beta=seq(1,10,by=0.1)}),
you still probably want to do
{\tt for (i in 1:length(x))}
so that you can use {\tt i}
as an index for saving the
results \ldots}
\item{\verb+while (condition) {}+: repeats the code in braces
for as long as {\tt condition} is true}
\end{itemize}
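
The constructs above can be sketched as follows; all variable names
({\tt bvec}, {\tt res}, {\tt cnt}, {\tt u}) and the calculation inside
the loop are made up for illustration.
<<>>=
bvec = seq(1,10,by=0.1)         # non-integer values to work through
res = numeric(length(bvec))     # storage for the results
for (i in 1:length(bvec)) {
  res[i] = bvec[i]^2            # hypothetical calculation, saved by index
}
res[1:3]

cnt = 0
while (cnt < 3) {               # repeats as long as the condition is TRUE
  cnt = cnt + 1
}
cnt

u = c(-2,0,3)
ifelse(u<0, 0, u)               # element-by-element choice
pmax(0, u)                      # same result
@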

\section{Functions}
Many \textbf{many} built-in functions.
\emph{Arguments} are sometimes obvious,
e.g. {\tt sin(x)}.  More complex
functions can have many different arguments
(e.g. {\tt matrix}); these can be
specified by name or by position:
\begin{verbatim}
matrix(data=1:12,nrow=4,ncol=3,byrow=TRUE)
\end{verbatim}
is the same as
\begin{verbatim}
matrix(1:12,4,3,TRUE)
\end{verbatim}
but the first is obviously easier to understand.
Most complicated functions also have \emph{defaults}
for most of their arguments, so you can specify just
the ones you want to be different from the defaults.
Named arguments can be in any order, so
\verb+matrix(1:12,ncol=3,byrow=TRUE)+ is also
equivalent to the variants above.

You can (and should) easily define your own
functions: a very simple example is
\begin{verbatim}
square <- function(x) {
  x^2
}
\end{verbatim}
The last statement in the function is its
\emph{return value}; you can also use the
{\tt return()} function to specify what the
function returns.
It's a common mistake to forget this:
\begin{verbatim}
myfun <- function(x) {
  if (x<0) {
     y=x^2
  } else {
    y=x^3
  }
}
\end{verbatim}
prints \emph{nothing} when called (the value of the last assignment is returned only invisibly): it should be
\begin{verbatim}
myfun <- function(x) {
  if (x<0) {
     y=x^2
  } else {
    y=x^3
  }
  y
}
\end{verbatim}
An equivalent solution:
\begin{verbatim}
myfun <- function(x) {
  if (x<0) {
     return(x^2)
  } else {
    return(x^3)
  }
}
\end{verbatim}

\textbf{Scope:} inside functions,
variables are \emph{local}, insulated
from the ``calling environment''.
You can re-use variables inside a function
that you use outside a function,
without changing the value once
the function is done.  This means
\emph{you can't use a function to change
the value of a variable}.

Calling the following function
\begin{verbatim}
squarex <- function(x) {
   x=x^2
   return(x)
}
\end{verbatim}
will not change the value of {\tt x}.
You need to say {\tt x = squarex(x)}.
If you \emph{absolutely must} you can
use the global assignment operator
\verb+<<-+, but you probably shouldn't.
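
To see this in action (the starting value 3 is arbitrary):
<<>>=
squarex <- function(x) {
   x=x^2
   return(x)
}
x = 3
squarex(x)       # the function returns the squared value ...
x                # ... but x in the calling environment is unchanged
x = squarex(x)   # assign the result to change x
x
@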

\section{Maximization and fitting}
Give
{\tt optim()} a function
(that takes a vector of parameters
and returns a single number, e.g. a
goodness-of-fit value),
and a vector of parameter starting
values, and it will try to find
the parameter values that minimize
the function.
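
As a minimal sketch, assuming a least-squares fit of a straight line to
simulated data (the data, the function {\tt ssq} and all other names are
made up for illustration; in real life {\tt lm()} would be the obvious
tool for this particular problem):
<<>>=
set.seed(101)                    # make the random numbers repeatable
xdat = 1:20
ydat = 1 + 2*xdat + rnorm(20, sd=0.5)  # intercept 1, slope 2, plus noise
# goodness of fit: sum of squared deviations
# for a parameter vector p = (intercept, slope)
ssq = function(p) {
  sum((ydat - (p[1] + p[2]*xdat))^2)
}
fit = optim(par=c(0,1), fn=ssq)  # starting values: intercept 0, slope 1
fit$par                          # estimated parameters
fit$value                        # smallest sum of squares found
@
To \emph{maximize} a function instead, minimize its negative (or use
{\tt control=list(fnscale=-1)}).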

\section{Probability distributions}
R knows about most distributions you would
ever care about.  For each, there are
four associated functions that give the
probability density function, cumulative
distribution function, quantile (inverse CDF)
function, and a random-deviate generator.
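The function names combine a one-letter prefix ({\tt d}, {\tt p}, {\tt q}, {\tt r})
with an abbreviated distribution name (e.g. {\tt dnorm()}, {\tt pnorm()},
{\tt qnorm()}, {\tt rnorm()} for the normal); the arguments are the values
to evaluate (or the number of deviates to draw), followed by the
parameters of the distribution.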
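
For example, for the normal distribution, with one binomial call to
show that the pattern carries over (the particular numbers are
arbitrary):
<<>>=
dnorm(0)                       # density of the standard normal at 0
pnorm(1.96)                    # cumulative probability below 1.96
qnorm(0.975)                   # quantile: the inverse of pnorm()
rnorm(3, mean=10, sd=2)        # three random deviates
dbinom(2, size=10, prob=0.3)   # binomial probability of 2 successes in 10
@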

\section{Standard statistics}
\begin{itemize}
\item{Linear regression, ANOVA, ANCOVA: {\tt lm()}}
\item{Generalized linear models (logistic regression,
Poisson regression, etc.): {\tt glm()}}
\item{Survival analysis: package {\tt survival}}
\item{Others: {\tt t.test()}, {\tt chisq.test()},
{\tt pairwise.t.test()}, {\tt binom.test()} and {\tt prop.test()}
(proportions and differences in proportions, binomial samples),
{\tt cor.test()} (correlations),
{\tt wilcox.test()} (Wilcoxon/Mann-Whitney
nonparametric), {\tt kruskal.test()} (Kruskal-Wallis),
{\tt ks.test()} (Kolmogorov-Smirnov), \ldots}
\end{itemize}
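
A rough sketch of how two of these are called, on simulated data (the
variables {\tt trt}, {\tt dose} and {\tt resp}, and the numbers used to
generate them, are all made up):
<<>>=
set.seed(102)
trt = factor(rep(c("control","treated"), each=10))  # hypothetical groups
dose = rep(1:10, 2)
# made-up response: linear in dose, shifted up by 1 in the treated group
resp = 2 + 0.5*dose + ifelse(trt=="treated", 1, 0) + rnorm(20)
lmfit = lm(resp ~ dose + trt)   # ANCOVA-style linear model
coef(lmfit)                     # intercept, slope, treatment effect
t.test(resp ~ trt)              # two-sample t test, ignoring dose
@
{\tt summary(lmfit)} gives the full table of estimates, standard errors
and p-values.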
\end{document}