Point Pattern Analysis

Many of the most common kinds of spatial analyses in archaeology involve the assessment of patterning in point-located data where observations represent objects or object attributes linked to XY coordinates in Cartesian space. Data commonly organized in this way include point-plotted artifacts, point-plotted burials, and settlement distributions. With these data, we are often interested in determining the degree to which points (or specific attributes) are clustered, dispersed, or random. Further, we are also often interested in determining how point patterns for different kinds of objects or attributes co-vary or patterns of dispersal and clustering change at different spatial scales.

In this module, I will walk you though a series of useful analyses for conducting basic point-pattern analysis using data from the Paleolithic Liencres site in northern Spain. These data consist of point-plotted artifacts assigned to a number of common artifact categories from a well preserved surface at this open site.

First, let’s read in the point data and the shapefile for the excavation boundary. We can then plot the Excavation boundary with points

library(maptools)
library(spatstat)

LiencresEx <- readShapePoly("LiencresEX")  #read in excavation boundary shapefile
LiencresEx <- as(LiencresEx, "SpatialPolygons")
Liencres <- read.csv(file = "Liencres.csv", header = T)  # read in csv file

Next, we need to convert the point plotted data into a ppp object for the spatstat package that we will be using. This requires us define a boundary (window) for the the points. We’ll use the LiencresEx shapefile as our boundary for now. Following that, we can plot the data.

# define the window (class owin) for analyses
W <- as(LiencresEx, "owin")
# Define ppp object based on Liencres point locations
Liencres.pp <- ppp(x = Liencres$E, y = Liencres$N, window = W, marks = Liencres$Name)
# now let's plot it
plot(Liencres.pp, cex = 0.5)

Okay, now let’s calculate the Ripley’s K function for all points at this site and plot the results. Ripley’s K is a method of determining whether a point pattern deviates from complete spatial randomness (CSR) at a given distance. We’ll use the Kest function from the spatstat package. In this example, we’re using the “Ripley” correction factor that helps us deal with edge effects. See help(Kest) for more info.

# calculate and plot Ripley's K function using the 'Ripley' correction
kf <- Kest(Liencres.pp, correction = "Ripley")
plot(kf)

The plot shows that at all tested distances the actual observed value of K-hat is greater than the theoretically derived expected value indicating clustering.

In many cases, it is useful to simulate the potential effects of sampling by creating what is called an “envelope” from the actual data. Specifically, this procedure involves simulating some number of datasets by permuting the locations of input data (99 times by default) and then obtaining the range of K-hat values across a range of distances for each permuted dataset. We can then plot the K estimates at each distance for the original data along with the “envelope” showing the range for the simulated datasets. When the sample size is high, the envelope will tend to be small.

# calculate envelope around K-hat estimates. The arguement verbose=F just
# keeps the output from showing progress as the command runs.
kf.env <- envelope(Liencres.pp, Kest, correction = "Ripley", verbose = F)
plot(kf.env)

Once again, this plot shows that K-hat is considerably higher than the envelope of simulated K values across all but the shortest distances suggesting significant clustering and deviations from CSR.

In many cases, it is preferable to use a modified version of Ripley’s K known as the L function. L is simply the square root of K at a given distance divided by pi. This essentially stabilizes K by the variance, often producing more interpretable results. It is possible to calculate L just like K using the envelope method described above.

# calculate envelope around L-hat estimates.
lf.env <- envelope(Liencres.pp, Lest, correction = "Ripley", verbose = F)
plot(lf.env)

The values of L-hat once again demonstrate considerable clustering at all distances.

It is often further helpful to recenter values of L-hat in these plots by subtracting r at each distance so that L-hat is always defined as 0 at all distances and deviations of L-hat from L are shown in relation to this horizontal line. In the plots below, values above the line indicate clustering and values below the line indicate dispersion.

Note that some packages calculate this recentered version of L as r-L rather than L-r as we’re doing here. For example the PASSaGE 2.0 software that Stojanowski used in your reading for this week calculates r-L so that deviations above line line indicate dispersion and deviations below the line indicate clustering. There seem to be preferences for one or the other equation in different fields but the decision is really arbitrary. Make sure you always check documentation carefully to know which equation your software package uses.

# plot recentered L results
plot(lf.env, . - r ~ r)

Importantly, we can also calculate K and L functions for individual artifact classes using the “marks” argument we defined above.

# calculate L-hat for the Flint core mark only
lf.mark.env <- envelope(Liencres.pp[Liencres.pp$marks == "Flint core"], Lest, 
    correction = "Ripley", verbose = F)
plot(lf.mark.env, . - r ~ r)

As this plot shows, the value of L-hat for Flint cores is within the envelope of simulated values of L for most distances suggesting that this particular artifact category is not considerably more clustered than we might expect by chance given the total number of cores present.

It is also possible to compare two “mark” categories using either the K or L functions using the Kcross or Lcross command respectively. These commands both take two mark names as arguments (denoted as i and j) and determine the degree to which mark j is clustered or dispersed from mark i. The argument i always indicates the type of point from which distances are measured and j indicates the type to which distances are measured. Let’s try an example again using the envelope function.

# Calculate L-hat between Qz flakes and Qz split cobbles
lf.cross <- envelope(Liencres.pp, Lcross, i = "Qz flake", j = "Qz split cobble", 
    verbose = F)
plot(lf.cross, . - r ~ r)

In this example, Qz split cobbles are non-randomly clustered in relation to Qz flakes, but only when distance beyond about 0.3 meters are considered.

We can also calculate Kdot and Ldot which calcuate the relationship between mark i and all other types in a dataset. Here is a quick example.

# Calculate L-hat between Flint decortication flakes and all other artifact
# categories
lf.dot <- envelope(Liencres.pp, Ldot, i = "Flint decort flake", verbose = F)
plot(lf.dot, . - r ~ r)

Not surprisingly, flint decortication flakes are significantly clustered at all consdered distances.

Dealing with inhomogeneous data

So far, all of the examples we have shown have assumed an homogeneous Poisson point process, meaning that we are assuming that the locations of points across the site are independent (a Poisson process) and that they DO NOT differ in intensity from place to place (homogeneous). An inhomogeneous Poisson point process refers to a distribution generated by process where the locations of points across the site are independent but they DO differ in intensity. How can we tell if our point distribution is inhomogeneous? In many cases we can simply create a density map of all points and assess the degree to which there are concentrated areas of high density. Let’s do that for the Liencres data. The value for sigma below defines the interpolation scale for the density plot. I’ve provided three values to give you an idea of how this can impact your assessments of density.

par(mfrow = c(1, 3))  # set up to plot 3 plots on one row
# plot density at various levels of sigma
plot(density(Liencres.pp, sigma = 0.5), main = "Sigma 0.5")
plot(density(Liencres.pp, sigma = 0.75), main = "Sigma 0.75")
plot(density(Liencres.pp, sigma = 1), main = "Sigma 1.0")

The data in the Liencres example are pretty clearly inhomogeneous, but it isn’t always so easy to tell. Another method sometimes used assess the homogeneity of Poisson point processes is known as a quadrat test. Essentially, a grid is placed across the point patterns and a chi-square test is used to assess the degree to which the grid counts differ from expected. We determine the size of the grid squares by specifying the number of cells in both the x and y dimensions. This decision should be informed by the scale at which we wish to assess homogeneity.

# create a quardat test statistic by simulating the p-value using Monte
# Carlo methods to help deal with cells with low counts. In this example,
# nx=4,ny=10 means that we will create a grid 4 squares wide and 10 high (2
# meter resolution in this case)
set.seed(324324)  # set random seed for repeatability
(M <- quadrat.test(Liencres.pp, nx = 4, ny = 10, method = "MonteCarlo"))

## 
##  Conditional Monte Carlo test of CSR using quadrat counts
##  Pearson X2 statistic
## 
## data:  Liencres.pp
## X2 = 1370.8, p-value = 0.001
## alternative hypothesis: two.sided
## 
## Quadrats: 37 tiles (irregular windows)

In this example we get a statistically significant result suggest that the observed quadrats differ from the expected under a homogeneous Poisson point process model. It is sometimes useful to display the quadrat counts directly. In the code below I plot observed and expected quadrat counts on a density map of the Liencres data. In this plot for each grid square the observed count is shown in the top left, the expected is shown in the top right, and the Pearson residual is shown at the bottom. Residuals are calculated as (observed-expected)/sqrt(expected) such that the residuals have a mean of 0 and variance of 1 meaning that deviations far beyond -1 to 1 in either direction are important where as variables within the range of -1 to 1 are expected.

# plot density map and display grid info
plot(density(Liencres.pp), main = "Grid Counts")
plot(M, cex = 0.5, col = "white", add = T)

As we can see, the area of high density in the middle of the site results in huge deviations from expected based on the residuals. So it seems, not surprisingly, that the Liencres point data are highly inhomogeneous. What can we do about this? Luckily, the spatstat package has versions of all of the tools we’ve been using so far for modeling inhomogeneous Poisson point processes. These include the Kinhom and Linhom commands that can be used anywhere we’ve used Kest or Lest above and the Kcross.inhom/Kdot.inhom and Lcross.inhom/Ldot.inhom which take the places of Kcross/Kdot and Lcross/Ldot. See the spatstat guide for more info.

The versions of our point pattern tools designed for inhomogeneous point processes essentially test for complete spatial randomness much like the traditional measures, but they take the intensity the distribution of points across the study area into account when defining expected values. By default, the spatstat tools use a density map just like I’ve produced above to determine the intensity at any given point. It is also possible to feed the function some other kind of raster input. For example, say we are assessing site distributions around several lakes. It might make sense to remove the lakes from consideration because they create an inhomogeneous pattern. Again the spatstat documentation has examples.

Let’s now cacluate L-hat for inhomogeneous distributions using the density function. Just like the direct call to the density function, we can specify a sigma that determines our interpolation factor. Our choice of sigma can to be informed by the distances at which we might expect inhomogeneity or can be based on the distribution of points itself. For the purposes of this example, I’ll use the middle value of sigma at 0.75 from the original density plots above.

In the following plots, I produce L-hat estimates for inhomogeneous Poisson point processes with envelopes adjusted for density based on our original data.

# calculate and plot inhomogeneous L-hat
plot(envelope(Liencres.pp, Linhom, sigma = 0.75, correction = "Ripley", verbose = F), 
    . - r ~ r, main = "Sigma 0.75")

This plot is considerably different from the original L-hat estimates where an homogeneous Poisson point process was assumed. Notably, the Liencres points show considerably clustering (far above the 95% confidence envelope) at distances less than about 0.8 meters but actually show strong dispersion beyond about 1 meter. This further illustrates the inhomogeneity of our data.

As another way of assessing this pattern, we can conduct another quadrat test, this time taking inhomogenity into account using the density map. In the next chunk of code I repeat the quadrat test with density considered and then plot the grid counts once again.

# the lambda argument specifys the density plot for assessing inhomogeneity
(Minhom <- quadrat.test(Liencres.pp, nx = 4, ny = 10, method = "MonteCarlo", 
    lambda = density(Liencres.pp, sigma = 0.75)))

## 
##  Conditional Monte Carlo test of Poisson process with given
##  intensity using quadrat counts
##  Pearson X2 statistic
## 
## data:  Liencres.pp
## X2 = 57.719, p-value = 0.073
## alternative hypothesis: two.sided
## 
## Quadrats: 37 tiles (irregular windows)

# plot grid counts on density map
plot(density(Liencres.pp, sigma = 0.75), main = "Grid Counts for Inhomogenious Density")
plot(Minhom, cex = 0.5, col = "white", add = T)

As the quadrat test and grid plot show, taking density into account no longer results in a statistically significance difference from a Poisson point process (p > 0.05). Few grid cells in the plot show deviations far from the -1 to 1 range as well. This suggests that our L-hat results taking density into account are likely more reliable.

The Pair Correlation Function and other approaches

In your Bevan et al. 2013 reading, the authors advocate a method known as the pair correlation function. This is a quite similar measure to Ripley’s K but it measures the intensity of points around each target point in ever expanding donuts (annuli) rather than in simply expanding radii meaning that it is not cumulative in the same way that K-hat and L-hat are. Again, PCF can be calculated using arguments for assessing inhomogeniety (as described by Bevan et al. 2013, this can also include covariates other than simply the density of points). In the plot below, I calculate pcf for the Liencres data, including an envelope, taking the intensity of points into account just as above.

# calculate and plot PCF for inhomogeneous data
plot(envelope(Liencres.pp, pcfinhom, sigma = 0.75, correction = "Ripley", verbose = F), 
    . - r ~ r, main = "Inhomogenious PCF of Liencres point data, Sigma 0.75")

The results here are fairly similar to those obtained using the inhomogeneous L-hat function in that we see non-random clustering at short distances and non-random dispersion at greater distances. The main difference is that the PCF highlights an area of ambiguity between about 0.4 and 0.6 meters where values fall within the envelope, suggesting that the do not differ significantly from what we would expect if the pattern represented complete spatial randomness (taking intensity into account).

Bevan et al. 2013 provide one more method for assessing non-random patterning where they plot histograms of nearest neighbor distances for the actual data along with an envelope of the expected values after simulation. The spatstat package doesn’t do this automatically but it is possible to implement (though probably not necessary as we will see below). The code to do this is a bit convoluted and involves a couple of things we haven’t covered yet. In short, this procedure involves creating 1000 random Poisson distributed point patterns of a given intensity defined by the density function for the original data. I then calcuate the 95% confidence interval for each “bar” of the histogram and display those directly on top of the original histogram.

par(mfrow = c(1, 2))  # set up for plotting 2 plots on 1 row
# histogram of nearest neighbor distances
hist(nndist(Liencres.pp), breaks = 50, xlab = "Distance in Meters", main = "Nearest Neighbor Distances")


# ugly code for calculating envelope around histogram
n.dist <- nndist(Liencres.pp)  #calculate neareast neighbor distances
nb <- 50
# set up seq of nb points between 0 and max nndist
z.seq <- seq(0, max(n.dist), length.out = nb)
# plot histogram of original data
hist(n.dist, breaks = z.seq, xlab = "Distance in Meters", main = "NN Distances with Confidence Interval")

# calculate nearest neighbor distances for 1000 simulated point patterns and
# then bin into z.seq
nn.hist <- lapply(rpoispp(density(Liencres.pp, sigma = 0.75), nsim = 1000), 
    nndist)
z.int <- lapply(nn.hist, findInterval, vec = z.seq)

# create a matrix containing the values for binned results for all 1000
# simulated runs
z.set <- matrix(NA, length(z.seq), 1000)
for (i in 1:1000) {
    temp1 <- (table(z.int[[i]])/sum(table(z.int[[i]]))) * nrow(Liencres)
    temp2 <- as.numeric(names(temp1))
    for (j in temp2) {
        z.set[j, i] <- temp1[j]
    }
}

# calculate quantiles at 0.05 and 0.95 for each bin in the histogram
q.int <- apply(z.set, 1, quantile, na.rm = T, probs = c(0.05, 0.95))

# add polygon for confidence interval
x <- c(z.seq, rev(z.seq))
y <- c(q.int[1, ], rev(q.int[2, ]))
y[is.na(y)] <- 0
polygon(x, y, col = rgb(0, 0, 1, 0.3))

As this example shows, we get non-random deviations from the expected nearest neighbor distances for a Poisson process of similar density with more points than expected clustered at about 0.1 meters and fewer points than expected clustered at about 0.2 meters, but other than that, the range falls within the envelope suggesting that it approximates a Poisson process (taking intensity into account).

Another way to do this is to use a measure called G-hat which is defined in relation to the cumulative distribution of nearest neighbor distances. What this means is that we plot nearest neighbor distances in order from low to high and see what proportion of all distances fall above or below a particular distance. Again, this procedure works with the envelope function. By default, this procedure uses a random uniform distribution as the null model. Since we want to simulate our pattern in relation to density in our original data, we need to specify that expression. In the example below, I’ll show both the model using a uniform distribution followed by the model using the Poisson point distribution based on the density of Liencres points.

par(mfrow = c(1, 2))  # set up for plotting 2 plots on 1 row
# first I create an expression which defines the simulation to be based on a
# random Poisson process using the intensity of the original data
ex <- expression(rpoispp(density(Liencres.pp, sigma = 0.75)))
# call and plot the envelope function for Gest using the default uniform
# distribution
plot(envelope(Liencres.pp, Gest, verbose = F), xlim = c(0, 1), main = "G-hat, Uniform Distribution")
# call and plot the envelope function for Gest using the specified Poisson
# distribution based on the 'simulate' argument
plot(envelope(Liencres.pp, Gest, simulate = ex, verbose = F), xlim = c(0, 1), 
    main = "G-hat, Poisson Distribution")

These results are similar for both the uniform and Poisson distributions but note that adding the density distribution into th emix tends to pull the envelope closer to the observed values at shorter distances. In both plots, note that observed line in black is above the envelope based on the simulated runs for most distances below about 0.4, but falls within the window thereafter. This suggests that there are more shorter nearest neighbor distances than we would expect based on both a random uniform and random Poisson point process given the intensity distribution of our original data.

As a final step in explaining these results, I’ll display a simulated random uniform and random Poisson point process using density. The differences between the plots below highlight the importances of considering the homogeniety/inhomogeniety of point processes as I discussed above.

par(mfrow = c(1, 2))
# create a random uniform point distribution of a number of points equal to
# the Liencres data within the same window
ru <- runifpoint(n = nrow(Liencres), win = W)
# create a Poisson point distribution based on the intensity of points in
# the Liencres data within the same window
rp <- rpoispp(density(Liencres.pp, sigma = 0.75))
# plot both density maps with the points plotted on top
plot(density(ru), main = "Random Uniform Distribution")
points(ru, cex = 0.25, col = "white")
plot(density(rp), main = "Poisson Distribution based on Liencres")
points(rp, cex = 0.25, col = "white")

Point Pattern Analysis

Matthew Peeples