Cluster Analysis - Pure Locational Clustering

In this module, I expand upon the Pure Locational Clustering (PLC) approach using various tools for defining and evaluating clusters. As we discussed in class, PLC refers to an approach where clusters are defined based on the geographic coordinates of a set of points (artifacts, sites, etc.) rather than any attributes associated with those points. In the example below, I will use the Barmose I dataset from your Ducke reading (2015; see also Blankholm 1991). Again, Barmose I is an early Maglemosian (ca. 7500-6000 B.C.) site in South Zealand, Denmark consisting of a hypothesized hut or sleeping area surrounded by a dense scatter of lithic artifacts. I will focus in particular on the K-means and DBSCAN methods but will also discuss the fuzzy K-means procedure. I’m leaving out the EM expectation maximization procedure but I will talk about that one a bit after you turn in this assignment.

For all clustering methods considered here there are a few decisions an analyst needs to make. For K-means/fuzzy K-means, the analyst much select the number of clusters to be returned. For the DBSCAN method, the analyst must determine the distance and number of points to be considered in calculating density to define clusters (DBSCAN - density based spatial clustering). In the example below I’ve provided some diagnostics that can help you make these decisions, but remember, these are merely heuristic approaches and shouldn’t be thought of as THE ONLY way to parameterize these clustering algorithms.

This module will rely primarily on the “cluster,” “spatstat,” “maptools,” and “dbscan” packages in R.

The first step is to read in the data and create a new object containing only the xy coordinates

# read in the Barmose data file, columns E=Easting, N=Northing, TYPE=artifact
# category, Label=artifact type name
Barmose <- read.csv(file = "Barmose.csv", header = T)
# create new object using only the first two columns
Barmose.coord <- as.matrix(Barmose[, 1:2])

To get a sense of the spatial data, I can then plot the points with symbols indicating different artifact types (unlabeled for now)

plot(Barmose.coord, pch = Barmose$TYPE, cex = 0.5)

K-means clusters

We will first tackle the traditional K-means approach to defining clusters. K-means is a divisive non-hierarchical method of clustering. This is an iterative process, which means that at each step the membership of each individual in a cluster is reevaluated based on the current centers of each existing cluster. This is repeated until the desired number of clusters (or the number of individuals) is reached. Thus, it is non-hierarchical because an individual can be assigned to a cluster, and reassigned at any later stage in the analysis. Clusters are defined based on Euclidean distances so as to reduce the variability of individuals within a cluster, while maximizing the variability between clusters (known as the Sum of Squares Error or SSE).

One approach to choosing an appropriate number of clusters to test is to conduct a randomization procedure to assess changes in the global SSE for varying cluster levels in both the original data and in a large number of randomized versions of those data. In the code below, I create 1000 randomized versions of the points and then compare the SSE at each cluster solution to the actual data. The results are then plotted both in terms of absolute SSE and in terms of the difference between the actual and mean randomized data.The goal of these plots is to identify any inflection points in SSE values and/or the pattern of departures from random at varying cluster levels (see Kintigh 1990 for more details).

In the example below I will evaluate the first 15 cluster solutions.

# define function for calculating within group sum of squares for kmeans
# solutions 1 through max K
kmeans.wss <- function(data, maxk) {
    wss <- (nrow(data) - 1) * sum(apply(data, 2, var))
    for (i in 2:maxk) wss[i] <- sum(kmeans(data, centers = i, iter.max = 50, nstart = 10)$withinss)
    return(wss)
}

set.seed(1275)  #set random seed for repeatability
# calculate wss for cluster levels 1 through 15 for the original data
kmeans.orig <- kmeans.wss(Barmose.coord, maxk = 15)

# the following chunk creates 1000 column randomized version of the original
# dataset and calculates global SSE for cluster solutions 1 through 15. This will
# take some time to calculate
rand.out <- NULL
nsim <- 1000
for (i in 1:nsim) {
    rand.tab <- apply(Barmose.coord, 2, sample)
    rand.out <- rbind(rand.out, kmeans.wss(rand.tab, maxk = 15))
}

# next I plot the SSE scores by cluster solution for the original and the mean of
# the randomized data
plot(log(kmeans.orig), type = "b", col = "blue", xlab = "Number of Clusters", ylab = "Sum of Squares")
points(log(colMeans(rand.out)), type = "l", col = "red", lty = 2)
legend("topright", c("Original Data", "Mean of 1000 Random Runs"), col = c("blue", 
    "red"), lty = c(1, 2))

# finally, I plot the difference between the mean random SSE and the actual SSE
# from the original data
plot(log(colMeans(rand.out)) - log(kmeans.orig), type = "b", col = "blue", xlab = "Number of Clusters", 
    ylab = "Actual SSE - Mean Randomized SSE")

Note in the plots above the actual SSE declines faster than the mean randomized SSE at increasing cluster solutions. It appears here that the difference between actual and randomized data is greatest at the 13 cluster solution. There is also a big difference between the 6 cluster solution and the 7 cluster solution and only small difference thereafter. For the purposes of this example, I select the 8 cluster solution for further evaluation as this is about the point where differences between random and actual SSE apear to level off. It seems that the other cluster solutions may also be worth exploring in the future, though I don’t do that here. Remember, it is good practice to test different solutions and evaluate their impact on the results. Note that the 6 cluster solution was selected by Ducke based on externally derived expectations.

In the code below I calculate the 8 cluster solution and then plot the results on the map, color coded by cluster.

# calculate the 8 cluster soultion and extract cluster assignments
fitK <- kmeans(Barmose.coord, 8, iter.max = 100, nstart = 100)$cluster
# plot the results
plot(Barmose.coord, pch = Barmose$TYPE, cex = 0.5, col = fitK)
legend("bottomleft", as.character(seq(1:8)), pch = 16, col = seq(1:8))

As this image shows there are several dense clusters in the central portion of the site and a few sparser clusters around the edges. Since density differs pretty dramatically across the site as a whole, it might be worth considering a density based clustering algorithm.

DBSCAN - Density Based Clustering

The DBSCAN clustering algorithm is a density based algorithm that groups points based on their relative to user provided parameters. DBSCAN creates neighborhoods around core points of a specified diameter. Core points are defined as those points where a specified minimum number of points falls within the specified diameter from that point. The algorithm identifies all cores points and then determines which core points are mutually reachable. These are then connected until no more points can be added. All of these connected core points and all the non-core points within their neighborhoods are then defined as a single cluster (and multiple clusters are defined in the same way across a sampled area). All non-core points that are not reachable by any core point are excluded and considered “noise” and not explicitly assigned to a cluster.

Thus, for this clustering algorithm we do not choose how many clusters to define as this is generated by the algorithm itself. Instead, we must select parameters regarding the distances across which the density neighborhoods will be evaluated and the minimum number of points to include in a neighborhood. There are a few diagnostics that we can use to help us parameterize this algorithm including kernal density maps and variations on Ripley’s K.

The first thing we need to is to convert the Barmose dataset into a ppp Point Pattern object as required by the ‘spatstat’ package. We will also import a shapefile defining the limits of excavation at Barmose because, as we previously discussed, Ripley’s K is sensitive to the shape and size of the boundary area selected (see Ripley’s K module for more details). For good measure we’ll plot the resulting excavation limits and points in this new ppp format with artifacts symbolized by type.

# Initialize required libraries
library(spatstat)
library(maptools)

# Import shapefile for grid and covert to SpatialPolygons object
Excav <- readShapePoly("BarmoseGrid.shp")
Excav <- as(Excav, "SpatialPolygons")
# this line dfines the window to be used in Ripley's K calculations
W <- as(Excav, "owin")

# Define ppp object based on Barmose point locations
Barmose.pp <- ppp(x = Barmose$E, y = Barmose$N, window = W, marks = Barmose$Label)

plot(Barmose.pp, cex = 0.5)

Next, I create a series of kernal density maps of the point locations across the Barmose site. For these maps I change the sigma scaling parameter to explore structure at the site at varying distances. For this example, I explore values between 0.1 and 0.6 at intervals of 0.1.

par(mfrow = c(2, 3))  # set up plot for two rows and three columns
# plot density maps at various sigma values
plot(density(Barmose.pp, sigma = 0.1))
plot(density(Barmose.pp, sigma = 0.2))
plot(density(Barmose.pp, sigma = 0.3))
plot(density(Barmose.pp, sigma = 0.4))
plot(density(Barmose.pp, sigma = 0.5))
plot(density(Barmose.pp, sigma = 0.6))

As these density plots show, a sigma value of 0.1 shows many very small pockets of density while values of 0.4 through 0.6 essentially smear the point distribution into a single blob. Intermediate values at about 0.2 and 0.3 seem to represent a decent compromise.

Before I select a distance, however, I want to evaluate clustering using the Ripley’s K approach. For all of the examples below I’m going to use the variance adjusted version of Ripley’s K defined as L-hat. I first calculate L-hat for the entire dataset and then plot those values across a range of plausible distances in relation to both the theoretical expectation and the envelope of values defined based on 99 simulations of complete spatial randomness within the area bounded by the limit of excavation (see Ripley’s K overview). I then replot the same data with L centered at 0 for all distances to look at deviations between the observed and theoretical expectation at different distances. I use the Ripley correction for edge effects here. This calculation may take several seconds.

par(mfrow = c(1, 2))  #set up for two plots on one row
# calculate L-hat within windown 'W' using the Ripley correction
L <- envelope(Barmose.pp[W], Lest, correction = "Ripley", verbose = F)

# plot L-hat against distance
plot(L, xlab = "Meters", ylab = expression(hat("L")))
# plot L-hat recentered at 0
plot(L, . - r ~ r, xlab = "Meters", ylab = expression(hat("L")), main = "L recentered")

As these plots indicate, the Barmose point locations are considerably more clustered than would be expected for complete spatial randomness at essentially all distances but the rate of increase in the difference between the observed and simulated points starts to level off after 1.5 meters or so. The next question I might ask is how this pattern plays out for particular artifact types. To explore this, I plot re-centered L-hat plots for several artifact categories (those with at least 15 objects) to compare the results. This time for the sake of calculation time I’ll simply plot the results against the theoretical expectation and forego the simulations (but it would be a good idea to check that later).

par(mfrow = c(3, 3))  # set up for plotting
# set up vector of artifact type names and remove categories with fewer than 15
# artifacts
B.lab <- levels(Barmose.pp$marks)[-c(1, 3, 4)]

# plot recentered L-hat plot for each artifact type
for (i in 1:length(B.lab)) {
    plot(Lest(Barmose.pp[Barmose.pp$marks == B.lab[i]], correction = "Ripley", rmax = 1.5), 
        . - r ~ r, xlab = "Meters", ylab = expression(hat("L")), main = B.lab[i])
}

As these plots demonstrate, most artifact types show their greatest increases in L-hat (indicating increases in clustering) at distances between about 0.2 and 0.4. This corresponds nicely with the kernal density adjustment values that seemed to produce the best results above. Thus, for the purposes of demonstration, I’ll select the same 0.3 meter neighborhood size for the DBSCAN procedure that Ducke used.

Next, I need to determine the minimum number of points to define a cluster. In the example below I use the subjectively selected 10 points. I selected this threshold after experimenting with a range of values and determining that 10 provided a good compromise between interpretability and spatial resolution (this is the same value selected by Ducke for similar reasons).

Some might call the selection of clustering parameters a “Dark Art” but note that all of these decisions are guided by exploratory analysis and substantive knowledge of the data. Again, it is typically a good practice to test a range of parameter values.

In the code below I define clusters using DBSCAN and then plot them on the map of the Barmose site.

library(dbscan)
# calcuate cluster solution and collect values
fitD <- dbscan(Barmose.coord, eps = 0.3, minPts = 10)$cluster

# plot the results
plot(Barmose.coord, pch = Barmose$TYPE, cex = 0.5, col = fitD + 1)
legend("bottomleft", as.character(seq(0:length(unique(fitD))) - 1), pch = 16, col = seq(0:length(unique(fitD))))

Based on the parameters we gave it, the DBSCAN algorithm created 6 clusters all near the center and densest portion of the site and left the vast majority of points as unclassified (labeled 0 in our plot). In other words, most points were not near a core point as defined by DBSCAN for our provided parameters. This is certainly a different way of conceptualizing clustering than K-means.

Fuzzy K-means

Another set of clustering algorithms that have recently gained popularity in many fields (though not yet in archaeology to my knowledge) are fuzzy clustering algorithms. Here I will demonstrate the Fuzzy K-means approach. Fuzzy K-means works much like traditional K-means analysis except that rather than being assigned to a single cluster each point is given a probabilities of belonging to every cluster (which sum to 1). This allows us to consider the potentially “fuzziness” of the edges of clusters which may be useful in many cases, especially when there aren’t clear breaks between the clusters. As this algorithm works on the same principle as traditional K-means, I will test the same 8 cluster solution selected above.

library(cluster)
# calcuate fuzzy cluster memberships using euclidean distances between points
fitF <- fanny(Barmose.coord, 8, diss = F, memb.exp = 1.5)$membership

head(fitF)  # display the first few rows

##            [,1]       [,2]        [,3]        [,4]        [,5]        [,6]
## [1,] 0.30981948 0.09423480 0.054901760 0.214696409 0.080412566 0.111227386
## [2,] 0.27622084 0.38016430 0.068580754 0.090785582 0.054485563 0.049086332
## [3,] 0.02097258 0.92822460 0.017527711 0.008620795 0.007318296 0.005078569
## [4,] 0.36084543 0.39855686 0.051734391 0.067910719 0.038231878 0.031297214
## [5,] 0.81225584 0.06803078 0.020200672 0.043997047 0.018804838 0.015380906
## [6,] 0.92378847 0.01977489 0.008434467 0.023400782 0.008769183 0.006899559
##             [,7]        [,8]
## [1,] 0.049305720 0.085401874
## [2,] 0.048646029 0.032030604
## [3,] 0.009272974 0.002984478
## [4,] 0.033674831 0.017748682
## [5,] 0.013777722 0.007552195
## [6,] 0.005863928 0.003068728

Plotting these fuzzy clusters is a bit more complicated but we can plot clusters one at a time with the transparency indicating their probability of membership in a given cluster (more opaque points have a greater membership probability in a given cluster). In the code below, I plot each of the 8 clusters one at a time.

par(mfrow = c(3, 3))  # set up for drawing multiple plots
# plot with transparency set by fuzzy cluster membership
for (i in 1:8) {
    plot(Barmose.coord, pch = 16, cex = 0.5, col = rgb(0, 0, 0, fitF[, i]), main = i)
}

Now I plot all clusters simultaneously with color coding.

par(mfrow = c(1, 1))
plot(Barmose.coord, type = "n")  # initialize plot
cc <- col2rgb(palette())/255  # set up color palette

# plot with transparency set by fuzzy cluster membership
for (i in 1:8) {
    points(Barmose.coord, pch = 16, cex = 0.5, col = rgb(cc[1, i], cc[2, i], cc[3, 
        i], fitF[, i]), main = "All Clusters")
}
legend("bottomleft", as.character(seq(1:8)), pch = 16, col = palette())

Note that this approach produced clusters that were pretty different than the traditional K-means algorithm. Also, it is interesting that many of the peripheral points in the distribution were grouped together with high probabilities associated with cluster 8. It seems that the fuzzy K-means algorithm might help deal with the “circular clusters” problem described by Ducke (2015).

Now that we have defined clusters a few ways, the next step is to evaluate the content of those clusters.

Evaluating clusters

First, I simply output tables of artifact types by cluster first for the K-means method and then for the DBSCAN method. I’ll leave in the unassigned artifacts as cluster 0 in the DBSCAN table for this example. We’ll leave the Fuzzy K-means solution for last.

(K.tab <- as.matrix(table(Barmose$Label, fitK)))

##                              fitK
##                                1  2  3  4  5  6  7  8
##   Blade/Flake Knives           2  0  5  0  3  3  1  4
##   Burins                       2  2  5  2  5  5  2  2
##   Core Axes                    1  0  0  0  1  1  0  1
##   Core Platforms               1  3  2  0  1  0  0  2
##   Cores                        1  6 19  6  8 14 12 15
##   Denticulated/Notched Pieces  1  3  4  2  9  6  0  1
##   Flake Axes                   4  0  2  1  3  3  2 13
##   Lanceolate Microliths        1  1  8  1 12  7  1  5
##   Microburins                  0  0  6  0  4  6  0  0
##   Scrapers                     2  1  4  1 10  8  3  9
##   Square Knives                8  8 28 11 36 59 11 31

(D.tab <- as.matrix(table(Barmose$Label, fitD)))

##                              fitD
##                                 0   1   2   3   4   5   6
##   Blade/Flake Knives            8   1   1   4   1   1   2
##   Burins                       12   1   2   9   1   0   0
##   Core Axes                     1   0   0   1   1   0   1
##   Core Platforms                6   1   0   2   0   0   0
##   Cores                        50   4   5  14   0   6   2
##   Denticulated/Notched Pieces  13   1   1   9   0   2   0
##   Flake Axes                   17   2   0   2   1   2   4
##   Lanceolate Microliths        22   3   3   5   0   2   1
##   Microburins                   4   0   0   7   0   5   0
##   Scrapers                     22   5   5   3   2   0   1
##   Square Knives               101   9  11  26   4  37   4

Before I jump into interpretations I want to evaluate the degree to which artifact classes are non-randomly concentrated in specific clusters for both methods. To do this, I’ll use the Koetje procedure described by Howell and Kintigh (1996) where I calculate the Simpson’s C measure of concentration for the table of artifacts by cluster and then simulate a large number (10,000 in this case) of tables with the same marginal totals in order to estimate the probability of getting a C value as high or higher than the observed.

# Define function for Koetje procedure and calculate
koetje.sim <- function(x, nsim) {
    Cactual <- weighted.mean(rowSums(prop.table(as.matrix(x), margin = 1)^2), rowSums(x))
    sim.tab <- r2dtable(nsim, rowSums(x), colSums(x))
    out <- 0
    for (i in 1:nsim) {
        temp <- weighted.mean(rowSums(prop.table(as.matrix(sim.tab[[i]]), margin = 1)^2), 
            rowSums(sim.tab[[i]]))
        if (temp >= Cactual) {
            out <- out + 1
        }
    }
    out.final <- round(cbind(Cactual, out/nsim), 3)
    colnames(out.final) <- c("Simpsons C", "Estimated Prob")
    return(out.final)
}

(koetje.sim(t(K.tab), nsim = 10000))  # Kmeans results

##      Simpsons C Estimated Prob
## [1,]      0.239          0.004

(koetje.sim(t(D.tab)[-1, ], nsim = 10000))  # DBSCAN results

##      Simpsons C Estimated Prob
## [1,]      0.264              0

The Koetje results confirm that both method produced clusters where artifacts were highly concentrated. With that in mind, what can we make of the clusters themselves?

A quick glance at these tables reveals some pretty big differences. For example, square knives and cores are much more common than other artifact types in almost every K-means cluster (and overall) but are less common in the DBSCAN clusters as the majority remain unclassified. cluster 7 in the K-means analysis is characterized by relatively high frequencies of cores, core axes, and core platforms, along with flake axes. This cluster is on the east side of the hypothesized hut at Barmose. Perhaps this indicates a primary production/flaking area. Many of the other clusters appear to contain most artifact categories suggesting perhaps general work areas (as propsoed by Blankholm 1991).

The DBSCAN approach produced much smaller clusters (spatially and in terms of artifact counts). These clusters are a bit difficult to interpret but it seems that even the smallest clusters contain a diversity of materials. As Blankholm (1991) suggests, it seems that several of these clusters likely represent multi-purpose work areas.

What about the Fuzzy K-means results. In order to explore these results, I want to create a table displaying the proportional probability of each artifact type in each cluster. In order to do this, I sum the probabilities for each artifact class in each cluster and then divide by the number of artifacts in that class. This gives me, in essesnce a mean probability of a each artifact class in each cluster. As displayed below the probabilities of each row sum to 1.

# create list of fuzzy probabilities by artifact class
F.tab <- by(fitF, Barmose$Label, colSums)
# create table of row probabilites by artifact class
F.tab2 <- round(prop.table(matrix(unlist(F.tab), ncol = 8, byrow = T), 1), 2)
# add row and column labels
row.names(F.tab2) <- names(F.tab)
colnames(F.tab2) <- seq(1:8)

F.tab2  #look at the results

##                                1    2    3    4    5    6    7    8
## Blade/Flake Knives          0.18 0.08 0.13 0.14 0.21 0.12 0.12 0.03
## Burins                      0.08 0.09 0.17 0.19 0.21 0.09 0.05 0.11
## Core Axes                   0.20 0.04 0.20 0.16 0.15 0.08 0.12 0.03
## Core Platforms              0.21 0.02 0.02 0.06 0.21 0.18 0.03 0.27
## Cores                       0.17 0.12 0.12 0.10 0.19 0.10 0.12 0.08
## Denticulated/Notched Pieces 0.06 0.02 0.09 0.25 0.19 0.12 0.15 0.13
## Flake Axes                  0.38 0.11 0.07 0.10 0.10 0.11 0.09 0.06
## Lanceolate Microliths       0.14 0.04 0.14 0.15 0.16 0.23 0.10 0.04
## Microburins                 0.03 0.01 0.08 0.13 0.33 0.10 0.29 0.02
## Scrapers                    0.22 0.07 0.19 0.16 0.10 0.15 0.06 0.05
## Square Knives               0.13 0.07 0.13 0.15 0.15 0.12 0.20 0.05

These results are quite interesting. Tools associated with primary production (Cores, Core Axes, Core Platforms, as well as Blade/Flake Knives) appear to have high probabilities associated with clusters 1 and 5. Cluster 5 is essentially the same area highlighted as a zone of potential primary production based on the traditional K-means analysis above. Microburins show the highest probabilites in clusters 5 and 7 also on the east side of the hyopthesized hut. Flake Axes on the other hand show the highest probabilites with cluster 1 on the west side of the hut. Cluster 8 which represents largely the periphery of the scatter to the SE is associated with high probabilities for several artifact classes but notably not microliths, microburins, or formal tools like Blades and Square Knives.

As the example here has demonstrated, each of these clustering algorithms provides different information about the potential spatial structure of activities at the Barmose site. Each method has both advantages and disadvantages and we can learn quite a bit by comparing the methods.

Cluster Analysis - Pure Locational Clustering

Matthew Peeples

K-means clusters

DBSCAN - Density Based Clustering

Fuzzy K-means

Evaluating clusters