\documentclass{article}
\usepackage[dvips]{graphicx,color}
\usepackage{amsmath}
\usepackage{epsfig}
\usepackage{geometry}
\geometry{letterpaper, left=1in, bottom=1in, right=1in, top=1in, foot=0.5in, head=0.5in}
\usepackage{Sweave}
\title{Geostatistics for spatial and space-time data}
\author{Ottar Bj{\o}rnstad}
\date{\today}
\begin{document}
\maketitle

\section*{Introduction}
Exploratory analysis of spatial data is of great interest in the ecology of infectious disease for a number of reasons, such as looking for space-time clustering, detecting hot-spots, and characterizing spatial waves. Spatial statistics are also important for correcting the problem of spurious associations between incidence and environmental data, because spatial autocorrelation can violate the assumption of independence. Spatial statistics is a very large field. In this tutorial we will focus on a subset of methods that are more (or less) commonly used. Many of these involve the notion of spatial autocorrelation in one form or another. While all the methods we will be discussing -- such as Mantel tests, parametric and nonparametric correlation functions, local indicators of spatial association, etc.\ -- come in canned packages (we will be using the `ncf'-package), it is useful to spend a bit of time on the underlying ideas.

\section*{Spatial autocorrelation}
Pearson's product-moment correlation between two random variables, $Z_1$ and $Z_2$ -- let's denote it by $\rho_{12}$ -- is defined as:
$$
\rho_{12} = \mathrm{E}\left[ \frac{(Z_1 - \mu_1)}{\sigma_1} \frac{(Z_2 - \mu_2)}{\sigma_2} \right],
$$
where the $\mu$'s are expectations and the $\sigma$'s are standard deviations. \emph{Auto}correlation has exactly the same definition and is used when the $Z$'s are measurements of the same quantity (e.g.\ prevalence, incidence, presence/absence, etc.) at different spatial locations (or different times). To estimate this we need to know (or have an estimate of) the values of the $\mu$'s and $\sigma$'s. In the case of single-snapshot spatial data we use the \emph{marginal} mean and \emph{marginal} s.d.

Let's explore using the \emph{Filipendula} rust data. The variable `now' is presence/absence of the rust in 1994 and `nxt' is presence/absence in 1995. `X' and `Y' are the spatial coordinates.
<<fig=TRUE>>=
data=read.table("filipendula.txt", header=T)
names(data)
symbols(data$X, data$Y, circles=data$now, inches=.1)
@
<<>>=
#marginal mean:
mu=mean(data$now)
#marginal sd:
sig=sd(data$now)
@
The estimated `autocorrelation matrix' among all 162 locations is then:
<<>>=
zscale=(data$now-mu)/sig
#autocorrelation matrix
rho=outer(zscale, zscale)
@
Note that since the data are binary presence/absence data, the individual pair-wise autocorrelation values can only take one of three values -- those arising from double-zero, double-one and zero-one pairs. Note also that these individual values are not quite constrained to be between $-1$ and $1$. The latter is not a worry, though, because the various geostatistical methods we will be discussing involve simple manipulations of this matrix.\footnote{Note also that the geostatistical methods usually use the MLE estimator of the s.d.\ rather than the BLUE estimator: i.e.\ the denominator is $n$ rather than $n-1$. This is of no importance in this introductory expos\'e.}
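We can verify this with a quick side-calculation of the three possible pair-wise products (the names `zero.zero' etc.\ are just labels for this sketch):
<<>>=
#the three possible pair-wise values for binary data:
z0=(0-mu)/sig #standardized absence
z1=(1-mu)/sig #standardized presence
c(zero.zero=z0*z0, zero.one=z0*z1, one.one=z1*z1)
@
Depending on the marginal prevalence, some of these products will fall outside the $[-1, 1]$ range.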
For several of the methods we also need some sort of spatial distance matrix. Most commonly used is the Euclidean distance, but many other measures are possible. The Euclidean distance matrix among all 162 locations is:
<<>>=
#columns 3 and 4 hold the X and Y coordinates
dst = as.matrix(dist(data[,c(3,4)]))
@
To understand the different geostatistical methods we will consider the plot of the first 500 pairs as a function of their spatial distance. (Plotting all the 26,000+ pairs would clutter up the screen.) I have jittered the autocorrelation values to make the plot a little less cluttered.
<<fig=TRUE>>=
plot(jitter(as.vector(rho[1:500]))~as.vector(dst[1:500]),
     ylab='pair-wise autocorrelation', xlab='pair-wise distance')
@
Now we are ready to understand, conceptually, the different geostatistical methods (we gloss over lots of bells and whistles here, of course):
\begin{itemize}
\item{\textbf{Mantel test:} An overall test for whether there is any significant relationship between the two matrices. This is essentially a test of the linear regression of the data plotted in the figure.}
\item{\textbf{Correlogram:} The most classic tool for testing how autocorrelation depends on distance without assuming any particular functional form -- hack the `x-axis' into segments (given by specifying some increment) and calculate the average within each distance class.}
\item{\textbf{Parametric correlation functions:} Assume the relationship follows some parametric form -- such as the Exponential, Gaussian or Spherical functions -- and do the appropriate nonlinear regression.}
\item{\textbf{Nonparametric correlation function:} Fit a `nonparametric regression' (usually a smoothing spline or a kernel smoother) to the relationship. This also goes by the name of the `spline correlogram'.}
\item{\textbf{LISA:} Local indicators of spatial association. A test for `hotspots'. Specify a neighborhood, and for each location calculate the average autocorrelation with all the other locations that belong to its neighborhood.}
\end{itemize}
There are a bunch of other named methods that are variations on these. Several are extensions to the case where there are multiple observations at each location, in which case it is natural(?) to estimate the `autocorrelation matrix' using the regular correlation matrix. The `non-centred correlogram' is this multivariate extension of the correlogram. The `time-lagged spatial cross-correlation function' has been used to study waves of spread (see below). There are also various directional versions that allow the spatial correlation function to vary by cardinal direction.\footnote{The semivariogram is similar to the correlogram except that instead of using the `autocorrelation' similarity measure it uses the `semivariance' dissimilarity measure.}

\section*{Testing and confidence intervals}
An important reason why specialized packages are needed for these methods -- despite their being conceptually very simple -- is that while the $n$ original data points may (or may not) be statistically independent, the $n^2$ numbers in the autocorrelation matrix are most certainly not, and their interdependence is very intricate. None of the classical ways of testing for significance or generating confidence intervals are therefore applicable. Testing is therefore usually done using permutation tests under the null hypothesis of no spatial pattern: the correlogram (or Mantel test, or ...) of the real data should look no different from that of a random re-allocation of observations to the spatial coordinates. Statistical significance is calculated by comparing the observed estimate to the distribution of estimates for, say, 1000 different randomized datasets. If the observed estimate is more extreme than, say, 950 of the randomized ones, we conclude that there is significant deviation from spatial randomness at a nominal 5\%-level.
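As a minimal hand-rolled sketch of this logic (the `ncf' functions do this internally, and more carefully -- for instance, this sketch keeps the zero-distance diagonal pairs that a proper Mantel test would exclude), we can compare the observed correlation between the two matrices to its permutation distribution:
<<>>=
#observed correlation between pair-wise distance and autocorrelation
obs=cor(as.vector(dst), as.vector(rho))
#the same statistic for 1000 random re-allocations of the observations
perm=replicate(1000, {
   zr=sample(zscale)
   cor(as.vector(dst), as.vector(outer(zr, zr)))
   })
#one-sided permutation p-value (negative correlations are `extreme')
mean(perm<=obs)
@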
For some of the methods it is also possible to generate confidence intervals using bootstrapping (resampling with replacement).

\section*{Practical examples}
All the above methods are implemented in the `ncf'-package in R.

\subsection*{Mantel test}
We use the \emph{Filipendula} data as a case study.
<<>>=
test=mantel.test(x=data$X, y=data$Y, z=data$now)
@
<<>>=
test
@
We see that there is a significant negative association between autocorrelation and distance. This is a crude tool, but it does reveal that locations near each other tend to be more similar in disease status than those separated by greater distances. If we already have distance/similarity matrices, the syntax is:
<<>>=
test=mantel.test(M1=dst, M2=rho)
@
<<>>=
test
@

\subsection*{Correlograms}
Next we try the correlogram, to look at how the autocorrelation changes as a function of distance.
<<>>=
test=correlog(x=data$X, y=data$Y, z=data$now, increment=25)
@
<<>>=
test
@
The first distance class is significantly positive, and the estimated distance at which the local positive autocorrelation decays to zero is 73.6 meters. Let's inspect:
<<fig=TRUE>>=
plot(test)
@

\subsection*{Nonparametric spatial correlation functions}
We can get a bit finer resolution, and confidence intervals for the underlying spatial correlation function, through the spline correlogram. Note that we only use 100 bootstrap resamples here to save some time -- for a real analysis we would need 500 or 1000 to get reasonable precision for the confidence intervals.
<<>>=
test=spline.correlog(x=data$X, y=data$Y, z=data$now, resamp=100)
@
<<>>=
summary(test)
@
The spline correlogram returns a bunch of stuff -- in fact, all the summary statistics I thought might be of interest. These are:
\begin{itemize}
\item{estimates: a vector of benchmark statistics.}
\item{x: the lowest distance at which the function is zero. If the correlation is initially negative, the distance calculated appears as a negative measure.}
\item{e: the lowest distance at which the function is $\leq 1/e$.}
\item{y: the extrapolated value at distance zero.}
\item{quantiles: a matrix summarizing the quantiles in the bootstrap distributions of the benchmark statistics.}
\end{itemize}
<<fig=TRUE>>=
plot(test)
@
The plot shows the estimated correlation function with its bootstrap 95\% confidence intervals. Having confidence intervals allows us to compare different data sets. \textbf{Exercise}: Try to estimate the spline correlogram for the 1995 data. Is there a difference between the 1994 and the 1995 patterns?
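\subsection*{LISA}
Finally, a quick sketch of the LISA method from the list above: specify a neighborhood size (the 50 m used here is an arbitrary choice) and, for each location, calculate the average autocorrelation with all the other locations within its neighborhood.
<<fig=TRUE>>=
#neigh gives the neighborhood radius; locations are tested for
#membership in local hot- or cold-spots by permutation
test=lisa(x=data$X, y=data$Y, z=data$now, neigh=50)
plot(test)
@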
\section*{Two-species and space-time extensions}
The `Sncf'-function (`spatial nonparametric covariance function') allows for a number of more complex (mysterious?) analyses. To explore some of these, let's play with Aaron's CML model from yesterday. Recall yesterday's code:
<<results=hide>>=
niter <- 520
b0 <- 0.04
b1 <- 0.8
mu <- 0.02/26
N <- 1000
S.0 <- 100
xlen <- 30
ylen <- 30
m <- 0.5
#seasonally forced local dynamics
local.dyn <- function(t, S, I, b0, b1, mu, N) {
   beta <- b0*(1+b1*cos(2*pi*t/26))
   I <- S*(1-exp(-beta*I))
   S <- (1-mu)*S+mu*N-I
   list(S=S,I=I)
}
npatch <- xlen*ylen
xy <- expand.grid(1:xlen,1:ylen)
d <- dist(xy)
#Psi flags the (up to 8) nearest neighbors of each patch
Psi <- matrix(0,npatch,npatch)
Psi[(as.matrix(d)>0)&(as.matrix(d)<1.5)] <- 1
#dispersal matrix: a fraction m is shared equally among the neighbors
D <- (m/8)*Psi+diag(1-m,nrow=npatch)
S <- array(0,dim=c(npatch,niter))
I <- array(0,dim=c(npatch,niter))
I[(xy[,1]<5)&(xy[,2]<5),1] <- 10
S[,1] <- S.0-I[,1]
for (t in 2:niter) {
   tmp <- local.dyn(t,S=S[,t-1],I=I[,t-1],b0=b0,b1=b1,mu=mu,N=N)
   S[,t] <- D%*%tmp$S
   I[,t] <- D%*%tmp$I
   cat(t," of ",niter,"\n")
}
x <- xy[,1]
y <- xy[,2]
#for (k in 1:niter) {
#   symbols(x,y,fg=2,circles=I[,k],inches=0.1,bg=2,xlab="",ylab="")
#   Sys.sleep(.1)
#}
@
We can use `Sncf' to calculate the spatial correlation functions. However, we can also look at the spatial \emph{cross-correlation} functions, for example between susceptibles and infecteds:
<<fig=TRUE>>=
fitI = Sncf(x=x, y=y, z=I[,261:520], resamp=0)
fitS = Sncf(x=x, y=y, z=S[,261:520], resamp=0)
fitSI = Sncf(x=x, y=y, z=S[,261:520], w=I[,261:520], resamp=0)
par(mfrow=c(1,3))
plot(fitI)
plot(fitS)
plot(fitSI)
summary(fitI)
@
Like the spline correlogram, `Sncf' returns all the summary statistics I thought might be of interest. These are:
\begin{itemize}
\item{Regional.synch: the regional mean (cross-)correlation.}
\item{Squantile: the quantile distribution from the resampling for the regional correlation.}
\item{estimates: a vector of benchmark statistics.}
\item{x: the lowest distance at which the function is zero. If the correlation is initially negative, the distance calculated appears as a negative measure.}
\item{e: the lowest distance at which the function is $\leq 1/e$.}
\item{y: the extrapolated value at distance zero.}
\item{cbar: the shortest distance at which the function equals the regional mean correlation.}
\item{quantiles: a matrix summarizing the quantiles in the bootstrap (or null) distributions of the benchmark statistics.}
\end{itemize}
One interesting additional application is the so-called time-lagged spatial correlation function. For example, we can look at the spatio-temporal relationship between the infecteds and themselves five time steps later. This may help quantify wave-like spread.
<<fig=TRUE>>=
fitI = Sncf(x=x, y=y, z=I[,261:515], w=I[,266:520], resamp=0)
plot(fitI)
@
\end{document}