\documentclass{article}
\usepackage[dvips]{graphicx,color}
\usepackage{amsmath}
\usepackage{epsfig}
\usepackage{geometry}
\geometry{letterpaper, left=1in, bottom=1in, right=1in, top=1in, foot=0.5in, head=0.5in}
\usepackage{Sweave}
\title{Geostatistics for spatial and space-time data}
\author{Ottar Bj{\o}rnstad}
\date{\today}
\begin{document}
\maketitle

\section*{Introduction}
Exploratory analysis of spatial data is of great interest in the ecology of infectious disease for a number of reasons, such as looking for space-time clustering, detecting hot-spots, and characterizing spatial waves. Spatial statistics are also important for correcting the problem of spurious associations between incidence and environmental data, because spatial autocorrelation can violate the assumption of independence. Spatial statistics is a very large field. In this tutorial we will focus on a subset of methods that are more (or less) commonly used. Many of these involve the notion of spatial autocorrelation in one form or another. While all the methods we will be discussing -- such as Mantel tests, parametric and nonparametric correlation functions, local indicators of spatial association, etc.\ -- come in canned packages (we will be using the `ncf'-package), it is useful to spend a bit of time on the underlying ideas.

\section*{Spatial autocorrelation}
Pearson's product-moment correlation between two random variables, $Z_1$ and $Z_2$ -- let's denote it by $\rho_{12}$ -- is defined as:
$$
\rho_{12} = \mathrm{E}\left[ \frac{(Z_1 - \mu_1)}{\sigma_1} \frac{(Z_2 - \mu_2)}{\sigma_2} \right],
$$
where the $\mu$'s are expectations and the $\sigma$'s are standard deviations. \emph{Auto}correlation has exactly the same definition and is used when the $Z$'s are measurements of the same quantity (e.g.\ prevalence, incidence, presence/absence, etc.) at different spatial locations (or different times). To estimate this we need to know (or have an estimate of) the values of the $\mu$'s and $\sigma$'s. In the case of single-snapshot spatial data we use the \emph{marginal} mean and \emph{marginal} s.d.

Let's explore using the \emph{Filipendula} rust data. The variable `now' is presence/absence of the rust in 1994 and `nxt' is presence/absence in 1995. `X' and `Y' are the spatial coordinates.
<<fig=TRUE>>=
data=read.table("filipendula.txt", header=T)
names(data)
symbols(data$X, data$Y, circles=data$now, inches=.1)
@
<<>>=
#marginal mean:
mu=mean(data$now)
#marginal sd:
sig=sd(data$now)
@
The estimated `autocorrelation matrix' among all 162 locations is then:
<<>>=
zscale=(data$now-mu)/sig
#autocorrelation matrix
rho=outer(zscale, zscale)
@
Note that since the data are binary presence/absence data, the individual pair-wise autocorrelation values can only take one of three values -- those arising from double-zero, double-one and zero-one pairs. Note also that these individual values are not quite constrained to be between $-1$ and $1$. The latter is not a worry, though, because the various geostatistical methods we will be discussing involve simple manipulations of this matrix.\footnote{Note also that the geostatistical methods usually use the MLE estimator of the s.d.\ rather than the BLUE estimator: i.e.\ the denominator is $n$ rather than $n-1$. This is of no importance in this introductory expos\'e.}
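We can verify this with a quick side-calculation of the three possible pair-wise products (the names `zero.zero' etc.\ are just labels for this sketch):
<<>>=
#the three possible pair-wise values for binary data:
z0=(0-mu)/sig #standardized absence
z1=(1-mu)/sig #standardized presence
c(zero.zero=z0*z0, zero.one=z0*z1, one.one=z1*z1)
@
Depending on the marginal prevalence, some of these products will fall outside the $[-1, 1]$ range.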
For several of the methods we also need some sort of spatial distance matrix. Most commonly used is the Euclidean distance, but many other measures are possible. The Euclidean distance matrix among all 162 locations is:
<<>>=
#columns 3 and 4 hold the X and Y coordinates
dst = as.matrix(dist(data[,c(3,4)]))
@
To understand the different geostatistical methods we will consider the plot of the first 500 pairs as a function of their spatial distance. (Plotting all the 26,000+ pairs would clutter up the screen.) I have jittered the autocorrelation values to make the plot a little less cluttered.
<<fig=TRUE>>=
plot(jitter(as.vector(rho[1:500]))~as.vector(dst[1:500]),
     ylab='pair-wise autocorrelation', xlab='pair-wise distance')
@
Now we are ready to understand, conceptually, the different geostatistical methods (we gloss over lots of bells and whistles here, of course):
\begin{itemize}
\item{\textbf{Mantel test:} An overall test for whether there is any significant relationship between the two matrices. This is essentially a test of the linear regression of the data plotted in the figure.}
\item{\textbf{Correlogram:} The most classic tool for testing how autocorrelation depends on distance without assuming any particular functional form -- hack the `x-axis' into segments (given by specifying some increment) and calculate the average within each distance class.}
\item{\textbf{Parametric correlation functions:} Assume the relationship follows some parametric form -- such as the Exponential, Gaussian or Spherical functions -- and do the appropriate nonlinear regression.}
\item{\textbf{Nonparametric correlation function:} Fit a `nonparametric regression' (usually a smoothing spline or a kernel smoother) to the relationship. This also goes by the name of the `spline correlogram'.}
\item{\textbf{LISA:} Local indicators of spatial association. A test for `hotspots'. Specify a neighborhood, and for each location calculate the average autocorrelation with all the other locations that belong to its neighborhood.}
\end{itemize}
There are a bunch of other named methods that are variations on these. Several are extensions to the case where there are multiple observations at each location, in which case it is natural(?) to estimate the `autocorrelation matrix' using the regular correlation matrix. The `non-centred correlogram' is this multivariate extension of the correlogram. The `time-lagged spatial cross-correlation function' has been used to study waves of spread (see below). There are also various directional versions that allow the spatial correlation function to vary by cardinal direction.\footnote{The semivariogram is similar to the correlogram except that instead of using the `autocorrelation' similarity measure it uses the `semivariance' dissimilarity measure.}

\section*{Testing and confidence intervals}
An important reason why specialized packages are needed for these methods -- despite their being conceptually very simple -- is that while the $n$ original data points may (or may not) be statistically independent, the $n^2$ numbers in the autocorrelation matrix are most certainly not, and their interdependence is very intricate. None of the classical ways of testing for significance or generating confidence intervals are therefore applicable. Testing is therefore usually done using permutation tests under the null hypothesis of no spatial pattern: the correlogram (or Mantel test, or ...) of the real data should look no different from that of a random re-allocation of observations to the spatial coordinates. Statistical significance is calculated by comparing the observed estimate to the distribution of estimates for, say, 1000 different randomized datasets. If the observed estimate is more extreme than, say, 950 of the randomized ones, we conclude that there is significant deviation from spatial randomness at a nominal 5\%-level.
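As a minimal hand-rolled sketch of this logic (the `ncf' functions do this internally, and more carefully -- for instance, this sketch keeps the zero-distance diagonal pairs that a proper Mantel test would exclude), we can compare the observed correlation between the two matrices to its permutation distribution:
<<>>=
#observed correlation between pair-wise distance and autocorrelation
obs=cor(as.vector(dst), as.vector(rho))
#the same statistic for 1000 random re-allocations of the observations
perm=replicate(1000, {
   zr=sample(zscale)
   cor(as.vector(dst), as.vector(outer(zr, zr)))
   })
#one-sided permutation p-value (negative correlations are `extreme')
mean(perm<=obs)
@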
For some of the methods it is also possible to generate confidence intervals using bootstrapping (resampling with replacement).

\section*{Practical examples}
All the above methods are implemented in the `ncf'-package in R.

\subsection*{Mantel test}
We use the \emph{Filipendula} data as a case study.
<<>>=
test=mantel.test(x=data$X, y=data$Y, z=data$now)
@
<<>>=
test
@
We see that there is a significant negative association between autocorrelation and distance. This is a crude tool, but it does reveal that locations near each other tend to be more similar in disease status than those separated by greater distances. If we already have distance/similarity matrices, the syntax is:
<<>>=
test=mantel.test(M1=dst, M2=rho)
@
<<>>=
test
@

\subsection*{Correlograms}
Next we try the correlogram, to look at how the autocorrelation changes as a function of distance.
<<>>=
test=correlog(x=data$X, y=data$Y, z=data$now, increment=25)
@
<<>>=
test
@
The first distance class is significantly positive, and the estimated distance at which the local positive autocorrelation decays to zero is 73.6 meters. Let's inspect:
<<fig=TRUE>>=
plot(test)
@

\subsection*{Nonparametric spatial correlation functions}
We can get a bit finer resolution, and confidence intervals for the underlying spatial correlation function, through the spline correlogram. Note that we only use 100 bootstrap resamples here to save some time -- for a real analysis we would need 500 or 1000 to get reasonable precision for the confidence intervals.
<<>>=
test=spline.correlog(x=data$X, y=data$Y, z=data$now, resamp=100)
@
<<>>=
summary(test)
@
The spline correlogram returns a bunch of stuff -- in fact, all the summary statistics I thought might be of interest. These are:
\begin{itemize}
\item{estimates: a vector of benchmark statistics.}
\item{x: the lowest distance at which the function is zero. If the correlation is initially negative, the distance calculated appears as a negative measure.}
\item{e: the lowest distance at which the function is $\leq 1/e$.}
\item{y: the extrapolated value at distance zero.}
\item{quantiles: a matrix summarizing the quantiles in the bootstrap distributions of the benchmark statistics.}
\end{itemize}
<<fig=TRUE>>=
plot(test)
@
The plot shows the estimated correlation function with its bootstrap 95\% confidence intervals. Having confidence intervals allows us to compare different data sets. \textbf{Exercise}: Try to estimate the spline correlogram for the 1995 data. Is there a difference between the 1994 and the 1995 patterns?
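\subsection*{LISA}
Finally, a quick sketch of the LISA method from the list above: specify a neighborhood size (the 50 m used here is an arbitrary choice) and, for each location, calculate the average autocorrelation with all the other locations within its neighborhood.
<<fig=TRUE>>=
#neigh gives the neighborhood radius; locations are tested for
#membership in local hot- or cold-spots by permutation
test=lisa(x=data$X, y=data$Y, z=data$now, neigh=50)
plot(test)
@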
\section*{Two-species and space-time extensions}
The `Sncf'-function (`spatial nonparametric covariance function') allows for a number of more complex (mysterious?) analyses. To explore some of these, let's play with Aaron's CML model from yesterday. Recall yesterday's code:
<<results=hide>>=
niter <- 520
b0 <- 0.04
b1 <- 0.8
mu <- 0.02/26
N <- 1000
S.0 <- 100
xlen <- 30
ylen <- 30
m <- 0.5
#seasonally forced local dynamics
local.dyn <- function(t, S, I, b0, b1, mu, N) {
   beta <- b0*(1+b1*cos(2*pi*t/26))
   I <- S*(1-exp(-beta*I))
   S <- (1-mu)*S+mu*N-I
   list(S=S,I=I)
}
npatch <- xlen*ylen
xy <- expand.grid(1:xlen,1:ylen)
d <- dist(xy)
#Psi flags the (up to 8) nearest neighbors of each patch
Psi <- matrix(0,npatch,npatch)
Psi[(as.matrix(d)>0)&(as.matrix(d)<1.5)] <- 1
#dispersal matrix: a fraction m is shared equally among the neighbors
D <- (m/8)*Psi+diag(1-m,nrow=npatch)
S <- array(0,dim=c(npatch,niter))
I <- array(0,dim=c(npatch,niter))
I[(xy[,1]<5)&(xy[,2]<5),1] <- 10
S[,1] <- S.0-I[,1]
for (t in 2:niter) {
   tmp <- local.dyn(t,S=S[,t-1],I=I[,t-1],b0=b0,b1=b1,mu=mu,N=N)
   S[,t] <- D%*%tmp$S
   I[,t] <- D%*%tmp$I
   cat(t," of ",niter,"\n")
}
x <- xy[,1]
y <- xy[,2]
#for (k in 1:niter) {
#   symbols(x,y,fg=2,circles=I[,k],inches=0.1,bg=2,xlab="",ylab="")
#   Sys.sleep(.1)
#}
@
We can use `Sncf' to calculate the spatial correlation functions. However, we can also look at the spatial \emph{cross-correlation} functions, for example between susceptibles and infecteds:
<<fig=TRUE>>=
fitI = Sncf(x=x, y=y, z=I[,261:520], resamp=0)
fitS = Sncf(x=x, y=y, z=S[,261:520], resamp=0)
fitSI = Sncf(x=x, y=y, z=S[,261:520], w=I[,261:520], resamp=0)
par(mfrow=c(1,3))
plot(fitI)
plot(fitS)
plot(fitSI)
summary(fitI)
@
Like the spline correlogram, `Sncf' returns all the summary statistics I thought might be of interest. These are:
\begin{itemize}
\item{Regional.synch: the regional mean (cross-)correlation.}
\item{Squantile: the quantile distribution from the resampling for the regional correlation.}
\item{estimates: a vector of benchmark statistics.}
\item{x: the lowest distance at which the function is zero. If the correlation is initially negative, the distance calculated appears as a negative measure.}
\item{e: the lowest distance at which the function is $\leq 1/e$.}
\item{y: the extrapolated value at distance zero.}
\item{cbar: the shortest distance at which the function equals the regional mean correlation.}
\item{quantiles: a matrix summarizing the quantiles in the bootstrap (or null) distributions of the benchmark statistics.}
\end{itemize}
One interesting additional application is the so-called time-lagged spatial correlation function. For example, we can look at the spatio-temporal relationship between the infecteds and themselves five time steps later. This may help quantify wave-like spread.
<<fig=TRUE>>=
fitI = Sncf(x=x, y=y, z=I[,261:515], w=I[,266:520], resamp=0)
plot(fitI)
@
\end{document}