This markdown document provides code and detailed examples associated with the Computer Applications and Quantitative Methods in Archaeology Workshop hosted by Matt Peeples and Tom Brughmans on March 13th, 2017, at Georgia State University in Atlanta. It assumes you have a basic knowledge of network science terminology and at least some familiarity with R and RStudio. If you are new to R, we recommend you first run through the brief Code School examples focused on R available here (http://tryr.codeschool.com/). The examples below can be run by simply copying and pasting the code, but modifications will require a bit of additional experience. For those new to network science in general, we recommend you start by reading the Brughmans (2014) review article (see the recommended bibliography for more details).
For the purposes of this workshop we will be using one real archaeological dataset as an example (along with some simulated data). These data come from research Peeples has conducted and can be used to demonstrate some of the most common methods for archaeological network analysis. Drawn from the Southwest Social Networks Project (http://www.southwestsocialnetworks.net), they represent counts of decorated ceramic wares for a single 50-year interval (A.D. 1050-1100) from 223 settlements with public architectural features located across the northern U.S. Southwest (Chacoan Great Houses and Great Kivas, for those who like such things). The dataset is divided into two .csv files: the first contains the ceramic counts by site and the second contains the attributes of those sites, including their locations. To ensure the security of site locational information as required by the state data repositories charged with maintaining these data, we have randomly relocated each settlement between 7 and 10 kilometers from its actual location. The code below imports these data from the .csv files into objects we can use in R.
Right click and choose “save as” to download the ceramic data and attribute data.
Before running the code below, however, we need to ensure that our R session is set to the correct working directory (the location where you placed the .csv files for this workshop). To do that, go to the menu bar at the top and click Session > Set Working Directory > Choose Directory and navigate to the place on your hard drive where these files reside (alternatively you can hit Ctrl + Shift + H and then navigate to the appropriate directory). Once you have done that, you will be able to execute the code below by simply copying the text and then pasting it into the R console.
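If you prefer to work from the console, you can also set the working directory with setwd(); the path below is just a placeholder and should be replaced with the folder where you saved the workshop files.
# set the working directory from the console (placeholder path; edit this to
# match the location of your workshop .csv files)
setwd("C:/path/to/workshop/files")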
Let’s start by importing our ceramic and attribute data.
# Import chaco ceramic data into an object named chaco. row.names=1 sets the
# first column values to the row names for each item.
chaco <- read.csv(file = "AD1050cer.csv", row.names = 1)
# Import attribute data into an object named chaco.attr
chaco.attr <- read.csv("AD1050attr.csv", row.names = 1)
For the purposes of this workshop, we will rely on a few pre-existing R packages. In order to use these packages in a new installation of R and RStudio, we first need to install them. Note that you will only need to do this once on each new installation of R/RStudio. To install packages, you can click on the "Packages" tab in the window in the bottom right of RStudio, then click the "Install" button at the top and type the names of the packages separated by commas. Alternatively, you can install packages from the console by typing install.packages("nameofpackagehere"), substituting the name of the package you want (the quotation marks around the package name are required).
For the purposes of this workshop we will rely on the statnet and tnet packages, which also install several other required packages like sna, network, and igraph. Copy the line below into the console or follow the instructions above to install using the Packages tab:
install.packages(c("statnet", "tnet"))
Once you have installed these packages, we use the library() function to load them into the current R session.
library(statnet)
library(tnet)
In addition to the pre-existing R packages, there are also a few additional functions that we will need to define manually for several of our analyses below. Specifically, we want to be able to take our ceramic frequency data and convert that into a symmetric similarity or distance matrix that we will use to define and weight our networks. For the examples below, we will rely on three methods for defining these matrices: 1) Co-presence of types, 2) Brainerd-Robinson similarity, and 3) \(\chi^{2}\) distances. These are only a few among a wide variety of options.
The first simple measure we will use for defining networks here is the number of categories that are co-present at pairs of sites. We can calculate this measure through simple matrix algebra. First, we create a binary incidence matrix of ceramic counts by site by simply dichotomizing our count data (a matrix with 1 for all categories that are present and 0 elsewhere). If we define that incidence matrix as \(A\) and its transpose (switching rows and columns) as \(A^{T}\), we can find the number of categories that overlap between pairs of sites, \(P\), as: \[P = A A^{T}\]
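To see what this matrix algebra produces, here is a minimal sketch using a small made-up incidence matrix (three sites by four wares); the site names and values are purely for illustration.
# made-up 3 site by 4 ware incidence matrix (1 = present, 0 = absent)
A <- rbind(Site1 = c(1, 1, 0, 0), Site2 = c(1, 0, 1, 0), Site3 = c(0, 1, 1, 1))
# the diagonal gives the number of wares present at each site and the
# off-diagonal cells give the number of wares shared by each pair of sites
A %*% t(A)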
The result of this procedure is a symmetric matrix, with rows and columns corresponding to nodes, in which each cell contains the count of co-occurring categories for a pair of sites. The diagonal of this matrix gives the total number of categories present at the site denoted by that row. In many cases, we may not want to count rare occurrences as “present” and may instead set a threshold count or proportion that a given category must exceed to be counted as “present” when creating our matrix. In the example we use here, a category must represent at least 10% of a row to be considered “present” in this calculation. The code below can easily be modified to use a different cutoff, and the choice of cutoff is an important decision with substantive impacts on the resulting networks.
The chunk of code below defines a procedure for calculating simple co-presence for all categories representing at least 10% of a given row. We then create a new object called “chacoP” that contains this matrix of co-occurrence from site to site.
co.p <- function(x, thresh = 0.1) {
    # create matrix of proportions from the input count data
    temp <- prop.table(as.matrix(x), 1)
    # define anything greater than or equal to the threshold as present (1)
    temp[temp >= thresh] <- 1
    # define all other cells as absent (0)
    temp[temp < 1] <- 0
    # matrix algebraic calculation to find co-occurrence (%*% indicates matrix
    # multiplication)
    out <- temp %*% t(temp)
    return(out)
}
# run the function
chacoP <- co.p(chaco)
The next metric we will use here is a rescaled version of the Brainerd-Robinson (BR) similarity metric. This BR measure is commonly used in archaeology including in a number of recent (and not so recent) network studies. This measure represents the total similarity in proportional representation of categories and is defined as:
\[S = {\frac{2-\sum_{k} \left|x_{k} - y_{k}\right|} {2}}\]
where, for all categories \(k\), \(x\) is the proportion of \(k\) in the first assemblage and \(y\) is the proportion of \(k\) in the second. This provides a scale of similarity from 0-1 where 1 is perfect similarity and 0 indicates no similarity. The chunk below defines the code for calculating this modified BR similarity measure. Note that this code could certainly be reduced and streamlined in many ways but I have chosen to use the “expanded” version for the sake of clarity for those new to R. By default, this code rounds the resulting similarity metric to three digits past the decimal for ease of viewing the output. This chunk ends by running this function for our sample dataset defining a new object “chacoBR” with the resulting similarity matrix.
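Before defining the full function, here is a quick worked example of the formula for two made-up assemblages expressed as proportions of three wares (the values are arbitrary and only for illustration):
# proportions of three wares at two hypothetical sites
x <- c(0.5, 0.3, 0.2)
y <- c(0.4, 0.4, 0.2)
# rescaled BR similarity: the absolute differences sum to 0.2, so (2 - 0.2)/2 = 0.9
(2 - sum(abs(x - y)))/2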
sim.mat <- function(x) {
    # get names of sites
    names <- row.names(x)
    x <- na.omit(x)  # remove any rows with missing data
    x <- prop.table(as.matrix(x), 1)  # convert to row proportions
    rd <- dim(x)[1]
    # create an empty symmetric matrix of 0s
    results <- matrix(0, rd, rd)
    # the following dreaded double for loop goes through every cell in the
    # output data table and calculates the BR value as described above
    for (s1 in 1:rd) {
        for (s2 in 1:rd) {
            x1Temp <- as.numeric(x[s1, ])
            x2Temp <- as.numeric(x[s2, ])
            results[s1, s2] <- 2 - (sum(abs(x1Temp - x2Temp)))
        }
    }
    row.names(results) <- names  # assign row names to output
    colnames(results) <- names  # assign column names to output
    results <- results/2  # rescale results between 0 and 1
    results <- round(results, 3)  # round results
    return(results)  # return the final output table
}
# Run the function defined above on our sample data
chacoBR <- sim.mat(chaco)
The next measure we will use is the \(\chi^{2}\) distance metric which is the basis of correspondence analysis and related methods commonly used for frequency seriation in archaeology (note that this should probably really be called the \(\chi\) distance since the typical form we use is not squared, but the name persists this way in the literature so that’s what we use here). This measure is defined as:
\[\chi_{jk} = \sqrt{\sum_{j} \frac{1}{c_{j}} \left(x_{j} - y_{j}\right)^{2}}\]
where \(c_j\) denotes the \(j_{th}\) element of the average row profile (the proportional abundance of \(j\) across all rows) and \(x\) and \(y\) represent row profiles for the two sites under comparison. This metric therefore takes raw abundance (rather than simply proportional representation) into account when defining distance between sites. The definition of this metric is such that rare categories play a greater role in defining distances among sites than common categories (as in correspondence analysis). This measure has a minimum value of 0 and no theoretical upper limit.
The code for calculating \(\chi^{2}\) distances is defined in the chunk below and a new object called “chacoX” is created using this measure. It is sometimes preferable to rescale this measure so that it is bounded between 0 and 1. We create a second object called “chacoX01” which represents rescaled distances by simply dividing the matrix by the maximum observed value (there are many other ways to convert a distance to a similarity measure but this will be fine for our current purposes).
chi.dist <- function(x) {
    rowprof <- x/apply(x, 1, sum)  # calculate the profile for every row
    avgprof <- apply(x, 2, sum)/sum(x)  # calculate the average profile
    # create a distance object of chi-square distances
    chid <- dist(as.matrix(rowprof) %*% diag(1/sqrt(avgprof)))
    # return the results
    return(as.matrix(chid))
}
# Run the script and then create the rescaled 0-1 version
chacoX <- chi.dist(chaco)
chacoX01 <- chacoX/max(chacoX)
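If you would like to see how the function behaves on something easier to inspect than the full dataset, the sketch below applies it to a small made-up table of counts (three sites by two wares; the names and counts are hypothetical).
# made-up count table: SiteA and SiteB have broadly similar profiles, while
# SiteC is dominated by the first ware
toy.counts <- rbind(SiteA = c(10, 30), SiteB = c(20, 20), SiteC = c(35, 5))
chi.dist(toy.counts)
# rescale to 0-1 by dividing by the maximum observed distance, as above
chi.dist(toy.counts)/max(chi.dist(toy.counts))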
Now that we have defined our three measures of similarity or distance, the next step is to convert these into network objects that our R packages will be able to work with. We can do this either by creating binary networks (where ties are simply present or absent) or weighted networks (which in many cases are simply the raw similarity/distance matrices we calculated above). We will provide examples of both approaches, starting with simple binary networks. There are many ways to define networks from matrices like those we generated above, and our examples below should not be seen as an exhaustive set of approaches.
First, we will create binary networks using our ceramic ware co-occurrence matrix. Here, any pair of sites sharing at least one co-present ware category will be defined as connected. We then use the “network” function to create a network object, using the argument “directed=F” to let the function know that our data come from a symmetric matrix and that ties always extend in both directions between pairs of linked nodes. See help(network) for more details on the options here.
After we create this new network object we plot it first using the default graph layout (Fruchterman-Reingold; we’ll discuss what this is in the workshop) and then based on the geographic locations of our nodes. We’re not going to spend a lot of time exploring network plotting procedures in this workshop, but for those of you who may be interested in making far prettier graphs than those used here for demonstration purposes, I would recommend exploring the new “ggraph” package (https://github.com/thomasp85/ggraph).
# create network object from co-occurrence
Pnet <- network(chacoP, directed = F)
# Now let's add names for our nodes based on the row names of our original
# matrix
Pnet %v% "vertex.names" <- row.names(chacoP)
# look at the results
Pnet
## Network attributes:
## vertices = 223
## directed = FALSE
## hyper = FALSE
## loops = FALSE
## multiple = FALSE
## bipartite = FALSE
## total edges= 16042
## missing edges= 0
## non-missing edges= 16042
##
## Vertex attribute names:
## vertex.names
##
## Edge attribute names not shown
# set up for plotting two plots side by side by setting the plot window to 1
# row, 2 columns
par(mfrow = c(1, 2))
# plot network using default layout
plot(Pnet, edge.col = "gray", edge.lwd = 0.25, vertex.cex = 0.5, main = "co-presence network")
# plot network using geographic coordinates
plot(Pnet, edge.col = "gray", edge.lwd = 0.25, vertex.cex = 0.5, coord = chaco.attr[,
9:10])
par(mfrow = c(1, 1)) # return to single plotting mode
The next chunk of code will produce a network based on our BR similarity matrix. In this example, we define ties as present between pairs of sites when they share more than 75% commonality (BR > 0.75) in terms of the proportions of ceramic wares recovered from both sites in a dyad.
In the code below, the event2dichot function (from the statnet/network package) takes our matrix and divides it into 1s and 0s based on the cutoff we choose. Here we’re using an ‘absolute’ cutoff, meaning we assign a specific value to use as the cutoff (0.75) and define all similarity values higher than that cutoff as 1 and all other dyads as 0. We then send the output of this function to the network function just as before. After examining our new network, we plot it using both a graph layout and geographic locations.
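If you want to see exactly what event2dichot is doing before applying it to the full similarity matrix, the short sketch below runs it on a small made-up symmetric matrix using both the ‘absolute’ and ‘quantile’ methods (the values are arbitrary).
# made-up symmetric similarity matrix for three nodes
toy.sim <- matrix(c(1, 0.8, 0.2, 0.8, 1, 0.6, 0.2, 0.6, 1), nrow = 3)
# absolute cutoff: values above 0.75 become 1, everything else becomes 0
event2dichot(toy.sim, method = "absolute", thresh = 0.75)
# quantile cutoff: values above the 80th percentile of the matrix become 1
event2dichot(toy.sim, method = "quantile", thresh = 0.8)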
# Define our binary network object from BR similarity
BRnet <- network(event2dichot(chacoBR, method = "absolute", thresh = 0.75),
directed = F)
# Now let's add names for our nodes based on the row names of our original
# matrix
BRnet %v% "vertex.names" <- row.names(chacoBR)
# look at the results.
BRnet
## Network attributes:
## vertices = 223
## directed = FALSE
## hyper = FALSE
## loops = FALSE
## multiple = FALSE
## bipartite = FALSE
## total edges= 7245
## missing edges= 0
## non-missing edges= 7245
##
## Vertex attribute names:
## vertex.names
##
## Edge attribute names not shown
par(mfrow = c(1, 2)) # set up for plotting two plots, side by side
# plot network using default layout
plot(BRnet, edge.col = "gray", edge.lwd = 0.25, vertex.cex = 0.5, main = "BR network")
# plot network using geographic coordinates
plot(BRnet, edge.col = "gray", edge.lwd = 0.25, vertex.cex = 0.5, coord = chaco.attr[,
9:10])
par(mfrow = c(1, 1)) # return to single plotting mode
In the next chunk of code we will use the \(\chi^2\) distances to create binary networks. This time, rather than using an absolute value to define ties, we will define ties as present wherever the similarity (1 minus the rescaled distance) falls above the 80th percentile of all similarities. We will then once again plot the network just as above.
# Note we use 1 minus chacoX01 here to convert a distance to a similarity
Xnet <- network(event2dichot(1 - chacoX01, method = "quantile", thresh = 0.8),
directed = F)
# Once again add vertex names
Xnet %v% "vertex.names" <- row.names(chacoX01)
# look at the results
Xnet
## Network attributes:
## vertices = 223
## directed = FALSE
## hyper = FALSE
## loops = FALSE
## multiple = FALSE
## bipartite = FALSE
## total edges= 4861
## missing edges= 0
## non-missing edges= 4861
##
## Vertex attribute names:
## vertex.names
##
## Edge attribute names not shown
par(mfrow = c(1, 2)) # set up for plotting two plots, side by side
# plot network using default layout
plot(Xnet, edge.col = "gray", edge.lwd = 0.25, vertex.cex = 0.5, main = "Chi-squared network")
# plot network using geographic coordinates
plot(Xnet, edge.col = "gray", edge.lwd = 0.25, vertex.cex = 0.5, coord = chaco.attr[,
9:10])
par(mfrow = c(1, 1)) # return to single plotting mode
It is also possible to use R to create weighted networks where individual edges are valued. We have found that this works reasonably well for networks of co-presence or something similar (counts of mentions in texts or monuments, for example), but it does not perform well when applied to similarity or distance matrices: because every possible link has a value, however small, the network gets unwieldy very fast. In the latter case, we have found it is better to work directly with the underlying similarity/distance matrix.
Creating a weighted network object in R is easy and only requires a slight modification from the procedure above. In the chunk of code below, we will simply add the arguments “ignore.eval=F” and “names.eval=‘weight’” to let the network function know we would like weights to be retained and we would like that attribute called ‘weight’. We will apply this to the matrix of co-presence defined above and then plot the result showing the weights of individual ties. Although it is difficult to tell with a network this size, the lines defining the edges are scaled based on their edge weights. We further color code the nodes based on the “CSN_macro_group” (sub-region) variable in our attribute file. Although we only show this approach for the co-presence network, these same arguments work with the other matrices defined above.
# create weighted network object from co-occurrence matrix by adding the
# ignore.eval=F argument
Pnet2 <- network(chacoP, directed = F, ignore.eval = F, names.eval = "weight")
Pnet2 %v% "vertex.names" <- row.names(chacoP)
Pnet2
## Network attributes:
## vertices = 223
## directed = FALSE
## hyper = FALSE
## loops = FALSE
## multiple = FALSE
## bipartite = FALSE
## total edges= 16042
## missing edges= 0
## non-missing edges= 16042
##
## Vertex attribute names:
## vertex.names
##
## Edge attribute names not shown
par(mfrow = c(1, 2)) # set up for plotting two plots, side by side
# create object to add color by subregion
reg.col <- as.factor(chaco.attr$CSN_macro_group)
# plot weighted network using default layout
plot(Pnet2, edge.col = "gray", edge.lwd = "weight", vertex.cex = 0.5, vertex.col = reg.col,
main = "co-presence weighted network")
# plot weighted network using geographic coordinates
plot(Pnet2, edge.col = "gray", edge.lwd = "weight", vertex.cex = 0.5, coord = chaco.attr[,
9:10], vertex.col = reg.col)
par(mfrow = c(1, 1)) # return to single plotting mode
Now for the sake of visualization, we’ll create and plot a smaller random weighted network to show how edge weights are depicted. There are many more plotting options within R base graphics and ggplot/ggraph and we direct you to the help documents for those packages for more details.
set.seed(45645) # set random seed for repeatability
num_nodes <- 40 # set the number of nodes
num_edges <- 80 # set the number of edges
# create a blank 'edgelist' for this network
edgelist <- matrix("", nrow = num_edges, ncol = 2)
# populate the edgelist with random node number assignments
for (i in 1:num_edges) edgelist[i, ] <- sample(x = seq(1, num_nodes), size = 2,
replace = F)
# convert to network object (empty)
net2 <- network.initialize(num_nodes)
# define ties based on the edgelist as 1
net2[as.matrix(edgelist)] <- 1
# set edge weights using a random uniform draw
edge_weights <- round(runif(num_edges, min = 1, max = 5))
# apply edge weights to network object
set.edge.value(net2, "weight", edge_weights)
# create a scale for color coding edges from gray to darkblue
edge.cols <- colorRampPalette(c("gray", "darkblue"))(5)
# plot network and add legend (get.edge.value uses the values of weight to
# color code edges and define line thickness)
plot(net2, usearrows = F, edge.lwd = get.edge.value(net2, "weight"), edge.col = edge.cols[get.edge.value(net2,
"weight")], main = "random weighted network")
legend("bottomright", legend = seq(1, 5), lwd = seq(1, 5), col = edge.cols)
One of the most common kinds of analysis for archaeological and other networks is the calculation of measures of node/edge centrality and graph centralization. There are many different measures in the literature each appropriate for different kinds of research questions and data formats (see Borgatti and Everett 2006). We will not cover the interpretation of these network metrics in depth in this workshop, but we will be using several common measures as our means of assessing the impact of missing data and other kinds of uncertainty on our interpretations of archaeological networks.
In this section, we briefly describe the network metrics we will use and then define a function to easily calculate multiple measures simultaneously on multiple networks. The primary measures we will use are the binary and weighted versions of: 1) degree centrality, 2) betweenness centrality, and 3) eigenvector centrality. We direct readers to Peeples and Roberts (2013) and especially the online supplemental materials for that article for more details on the calculation of these metrics.
Degree centrality for a node is defined as the total number of direct connections in which that node is involved. In weighted networks, this is simply the total weight of all connections for that node (minus 1 to remove self-loops). Betweenness centrality is defined as the number of shortest paths between pairs of nodes in a network involving the target node divided by the total number of shortest paths in the network as a whole. For binary networks, shortest paths are defined as the smallest number of direct ties that must be crossed to get from one specific node to another. Calculating betweenness centrality for weighted networks is a bit more complicated. The method we use here comes from Opsahl and others (2010). This approach defines shortest paths as the “path of least resistance” between pairs of target nodes; in other words, the stronger the path (the higher the edge weights), the “shorter” it is according to this algorithm. Refer to Peeples and Roberts (2013) for more details. Eigenvector centrality is a measure of a node’s importance in a network defined in relation to the first eigenvector of the adjacency matrix for both binary and weighted networks. A node’s eigenvector centrality score is proportional to the summed scores of the other nodes to which it is connected; in other words, eigenvector centrality for a node will increase if that node is connected to many other nodes or to highly central nodes. We rescale this measure, as is common in the network science literature, so that the sum of squared scores is equal to the total number of nodes in the network.
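As a quick illustration of these three measures, the sketch below calculates each of them for a small made-up undirected graph (five nodes, four ties); the adjacency matrix and resulting values are purely for building intuition.
# made-up symmetric adjacency matrix: ties 1-2, 2-3, 3-4, and 3-5
toy.adj <- matrix(0, 5, 5)
toy.adj[1, 2] <- toy.adj[2, 1] <- 1
toy.adj[2, 3] <- toy.adj[3, 2] <- 1
toy.adj[3, 4] <- toy.adj[4, 3] <- 1
toy.adj[3, 5] <- toy.adj[5, 3] <- 1
sna::degree(toy.adj, gmode = "graph")  # number of direct ties per node
sna::betweenness(toy.adj, gmode = "graph")  # node 3 sits on the most shortest paths
eg <- sna::evcent(toy.adj)  # raw eigenvector centrality scores
sqrt((eg^2) * length(eg))  # rescaled as in the functions defined below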
The chunk of code below contains two functions: 1) the first calculates the three measures of centrality described above for binary networks and 2) the second calculates these measures for weighted networks (or similarity matrices). The results are by default returned in a matrix with a column for each measure and a row for each node. Note that the statnet/sna package uses some of the same names for functions as tnet/igraph. Thus, we must specify which package we mean to use in the code below (e.g., sna::degree means the degree calculation method used in the sna package which is initialized through statnet. If we wished to use the igraph version we would instead use igraph::degree).
# Calculate centrality scores for binary networks
net.stats <- function(y) {
    # calculate degree centrality
    dg <- as.matrix(sna::degree(y, gmode = "graph"))
    # calculate and scale eigenvector centrality
    eg <- as.matrix(sna::evcent(y))
    eg <- sqrt((eg^2) * length(eg))
    # calculate betweenness centrality
    bw <- sna::betweenness(y, gmode = "graph")
    # combine centrality scores into matrix
    output <- cbind(dg, eg, bw)
    rownames(output) <- rownames(as.matrix(y))
    colnames(output) <- c("dg", "eg", "bw")
    # return results of this function
    return(output)
}
# Calculate centrality scores for weighted networks (similarity matrices)
net.stats.wt <- function(y) {
    # calculate weighted degree as the sum of weights - 1
    dg.wt <- as.matrix(rowSums(y) - 1)
    # calculate weighted eigenvector centrality and rescale
    eg.wt <- as.matrix(sna::evcent(y))
    eg.wt <- sqrt((eg.wt^2) * length(eg.wt))
    # calculate weighted betweenness from the tnet package (wrapped in
    # suppressWarnings to avoid notifications)
    bw.wt <- suppressWarnings(betweenness_w(y, directed = F))[, 2]
    output <- cbind(dg.wt, eg.wt, bw.wt)
    rownames(output) <- rownames(as.matrix(y))
    colnames(output) <- c("dg.wt", "eg.wt", "bw.wt")
    # return results of this function
    return(output)
}
Now let’s calculate these measures for our networks defined above and look at the first few rows of the results for the final example based on the \(\chi^2\) network.
# net stats for binary co-presence network
co.p.stats <- net.stats(Pnet)
# net stats for binary BR similarity network
BR.stats <- net.stats(BRnet)
# net stats for binary X^2 similarity network (1-distance)
X.stats <- net.stats(Xnet)
head(X.stats)
## dg eg bw
## Aztec 3 0.017140077 89.73043
## Pueblo Pintado 92 1.638913576 70.11709
## Morris 39 37 0.011199709 947.53069
## Holmes Group 37 0.011199709 947.53069
## Jaquez 13 0.066034620 925.91999
## Fort Wingate 5 0.002894922 284.80973
And now let’s calculate the weighted versions.
# net stats for weighted co-presence network
co.pw.stats <- net.stats.wt(chacoP)
# net stats for weighted BR similarity network
BR.stats.w <- net.stats.wt(chacoBR)
# net stats for X^2 similarity (1-distance)
X.stats.w <- net.stats.wt(1 - chacoX01)
head(X.stats.w)
## dg.wt eg.wt bw.wt
## Aztec 180.9051 1.0567527 0
## Pueblo Pintado 186.9819 1.0961256 0
## Morris 39 179.5033 1.0495716 0
## Holmes Group 179.4261 1.0490867 0
## Jaquez 182.8987 1.0688679 0
## Fort Wingate 167.2580 0.9757783 0
It is also often informative to evaluate graph-level measures of centralization. In the code below, we calculate centralization measures associated with the node-level centrality measures described above. These measures summarize how centralized a network is as a whole relative to the theoretical maximum for a network of that size. We can calculate most measures using existing functions, but we need to create a custom function for calculating weighted betweenness centralization. Again, Peeples and Roberts (2013) provide more details.
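To get a feel for what centralization captures, the minimal sketch below compares a made-up five-node star graph (all ties focused on one node) with a complete graph of the same size (ties spread evenly); because these toy matrices are symmetric, we pass mode = "graph" here.
# star graph: node 1 tied to every other node, no other ties
star <- matrix(0, 5, 5)
star[1, 2:5] <- star[2:5, 1] <- 1
# complete graph: every node tied to every other node
full <- matrix(1, 5, 5)
diag(full) <- 0
centralization(star, sna::degree, mode = "graph", normalize = T)  # maximally centralized (at or near 1)
centralization(full, sna::degree, mode = "graph", normalize = T)  # 0, since all nodes have equal degree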
## calculate centralization measures for binary networks
cent.bin <- function(net) {
    output <- matrix(0, 1, 3)
    # column names ordered to match the values calculated below
    colnames(output) <- c("degree", "eigen", "between")
    # calculate binary degree centralization
    output[1, 1] <- centralization(net, sna::degree, normalize = T)
    # calculate binary eigenvector centralization
    output[1, 2] <- centralization(net, sna::evcent, normalize = T)
    # calculate binary betweenness centralization
    output[1, 3] <- centralization(net, sna::betweenness, normalize = T)
    return(output)
}
# define function for calculating weighted betweenness centralization
bw.cent <- function(x) {
    Cstar <- max(x)
    Csum <- (Cstar - x)
    num <- 2 * (sum(Csum))
    den <- ((length(x) - 1)^2) * (length(x) - 2)
    # output the result of this function
    out <- num/den
    return(out)
}
## calculate centralization measures for weighted networks
cent.wt <- function(sim) {
    output <- matrix(0, 1, 3)
    colnames(output) <- c("degree.wt", "eigen.wt", "between.wt")
    # calculate degree centralization
    output[1, 1] <- centralization(sim, sna::degree, normalize = T)
    # calculate eigenvector centralization
    output[1, 2] <- centralization(sim, sna::evcent, normalize = T)
    # calculate betweenness centralization
    output[1, 3] <- bw.cent(sim)
    return(output)
}
Now let’s look at a couple of example results from these functions, focusing on the BR measure for now.
# BR binary net centralization
cent.bin(BRnet)
## degree eigen between
## [1,] 0.2500917 0.05083638 0.04374436
# BR similarity centralization
cent.wt(chacoBR)
## degree.wt eigen.wt between.wt
## [1,] 0.1610866 0.03173933 4.6861e-10
Now (finally) we’re ready to get into the meat of this workshop: assessing the impact of various forms of uncertainty on archaeological networks. The first example we will work with is the potential impact of missing nodes. In almost any archaeological network study, the networks we create are incomplete (i.e., we know that we are missing nodes for various reasons: site destruction, lack of survey coverage, looting, etc.). How might the fact that our networks are samples of a larger and typically unobtainable “complete network” influence our interpretations of network structure and node position? In this section, we are going to follow the lead of a study previously published in the journal Social Networks focused on this same issue (Costenbader and Valente 2003; see also Borgatti et al. 2006).
The overarching goal of the work by Costenbader and Valente (2003) was to take an existing network sampled from a (perhaps unknowable) population and determine how robust different structural and positional aspects of that network are to sub-sampling. This procedure, along with a careful consideration of the nature and quality of our sample, allows us to estimate the degree to which certain aspects of our sampled network do or do not approximate the same aspects of the original “complete” network from which it was drawn. Essentially, the idea is that if a particular metric is robust to sub-sampling in the data we have, we may have greater faith that it approximates the value in the complete network. Importantly, when we use the same underlying data to produce different networks, as we have here (binary and weighted networks; networks of co-presence, BR similarity, and \(\chi^2\) distance), we can further determine the degree to which different network definitions are vulnerable to different kinds of sampling issues. For this reason, we argue that it is good practice to explore a range of network definitions.
The basic approach advocated by Costenbader and Valente is to take a given network, sub-sample it at intervals of 10% (producing some large number of replicates for each sampling fraction), and then calculate a set of network centrality and centralization metrics for every replicate at each sampling fraction. We then calculate rank-order correlations (Spearman’s \(\rho\)) between the metrics for the original network and those for every sub-sample. This allows us to evaluate the average correlation and the error at each sampling fraction.
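The comparison step itself is just a rank-order correlation; here is a minimal sketch using made-up centrality scores for the same five nodes from a full network and from a hypothetical sub-sample (the values are arbitrary).
# made-up centrality scores for five nodes before and after sub-sampling
full.scores <- c(5, 3, 9, 1, 7)
sub.scores <- c(3, 4, 10, 2, 6)
# Spearman's rho compares the rank order of the two sets of scores (here 0.9)
cor(full.scores, sub.scores, method = "spearman")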
In order to do this, we first provide a modifiable chunk of code that conducts this procedure for a single centrality (or centralization) measure and then plots the results as a boxplot. You may get some warning messages when you run this code if your random sub-samples create a situation where some nodes are connected to nothing (but the output will still be correct). For the sake of time in this workshop, we are only making 25 random replicates, but good practice suggests using many more (set the script running and go to lunch has always been my strategy).
nsim <- 25 # set number of replicates
# define function for testing betweenness centrality stability across
# sub-samples
resamp.test <- function(x) {
    # the next line can be replaced with any centrality measure you'd like
    bw.g <- sna::betweenness(x, gmode = "graph")
    mat <- as.matrix(x)
    dim.x <- dim(mat)[1]
    # create matrix for output and then name the columns by sampling fraction
    out.mat <- matrix(NA, nsim, 9)
    colnames(out.mat) <- c("S90", "S80", "S70", "S60", "S50", "S40", "S30",
        "S20", "S10")
    # this double loop goes through each sampling fraction and each random
    # replicate to calculate centrality statistics and runs a Spearman's rho
    # correlation between the resulting centrality values and the original
    # sample
    for (j in 1:9) {
        for (i in 1:nsim) {
            # this sampling procedure samples without replacement from a sequence from 1
            # to the total number of nodes, with the size of the sample determined by
            # (10 - j)/10. For example, if j = 1, (10 - 1)/10 = 0.9, or a 90% sub-sample.
            sub.samp <- sample(seq(1, dim.x), size = round(dim.x * ((10 - j)/10), 0),
                replace = F)
            # calculate the betweenness statistic for the matrix reduced to only include
            # the sub-sampled rows/columns
            temp.stats <- sna::betweenness(mat[sub.samp, sub.samp], gmode = "graph")
            # calculate Spearman's rho for this replicate
            out.mat[i, j] <- suppressWarnings(cor(temp.stats, bw.g[sub.samp],
                method = "spearman"))
        }
    }
    # return the result
    return(out.mat)
}
# display results as boxplot by sampling fraction
boxplot(resamp.test(BRnet), ylim = c(0, 1), main = "BR - betweenness", xlab = "sampling fraction",
ylab = "Spearmans rho")
The next chunk of code provides a single function that automates all of the steps described above using the centrality measures previously defined for both binary and weighted networks.
# The following function does the same calculation as above but is set up to
# work with the output of net.stats and net.stats.wt
nsim <- 25
samp.frac <- c("S90", "S80", "S70", "S60", "S50", "S40", "S30", "S20", "S10")
cv.resamp.bin <- function(x) {
    # calculate all network stats for the original network
    stats.g <- net.stats(x)
    mat <- as.matrix(x)
    dim.x <- dim(mat)[1]  # count number of rows (nodes)
    # define empty matrices for output
    dg.mat <- matrix(NA, nsim, 9)
    ev.mat <- matrix(NA, nsim, 9)
    bw.mat <- matrix(NA, nsim, 9)
    # add column names based on sampling fraction
    colnames(dg.mat) <- samp.frac
    colnames(ev.mat) <- samp.frac
    colnames(bw.mat) <- samp.frac
    # this double loop goes through each sampling fraction and each random
    # replicate to calculate centrality statistics and runs a Spearman's rho
    # correlation between the resulting centrality values and the original
    # sample
    for (j in 1:9) {
        for (i in 1:nsim) {
            sub.samp <- sample(seq(1, dim.x), size = round(dim.x * ((10 - j)/10), 0),
                replace = F)
            temp.stats <- net.stats(mat[sub.samp, sub.samp])
            dg.mat[i, j] <- suppressWarnings(cor(temp.stats[, 1], stats.g[sub.samp, 1],
                method = "spearman"))
            ev.mat[i, j] <- suppressWarnings(cor(temp.stats[, 2], stats.g[sub.samp, 2],
                method = "spearman"))
            bw.mat[i, j] <- suppressWarnings(cor(temp.stats[, 3], stats.g[sub.samp, 3],
                method = "spearman"))
        }
    }
    # create list for output and populate it
    out.list <- list()
    out.list[[1]] <- dg.mat
    out.list[[2]] <- ev.mat
    out.list[[3]] <- bw.mat
    # return the resulting list
    return(out.list)
}
# And now for the version for weighted networks (similarity matrices)
cv.resamp.wt <- function(x) {
    # calculate network stats for the original network
    stats.g <- net.stats.wt(x)
    mat <- as.matrix(x)
    dim.x <- dim(mat)[1]
    # create empty matrices for output
    dg.mat <- matrix(NA, nsim, 9)
    ev.mat <- matrix(NA, nsim, 9)
    bw.mat <- matrix(NA, nsim, 9)
    # add column names by sampling fraction
    colnames(dg.mat) <- samp.frac
    colnames(ev.mat) <- samp.frac
    colnames(bw.mat) <- samp.frac
    # this double loop goes through each sampling fraction and each random
    # replicate to calculate centrality statistics and runs a Spearman's rho
    # correlation between the resulting centrality values and the original
    # sample
    for (j in 1:9) {
        for (i in 1:nsim) {
            sub.samp <- sample(seq(1, dim.x), size = round(dim.x * ((10 - j)/10), 0),
                replace = F)
            temp.stats <- net.stats.wt(mat[sub.samp, sub.samp])
            dg.mat[i, j] <- suppressWarnings(cor(temp.stats[, 1], stats.g[sub.samp, 1],
                method = "spearman"))
            ev.mat[i, j] <- suppressWarnings(cor(temp.stats[, 2], stats.g[sub.samp, 2],
                method = "spearman"))
            bw.mat[i, j] <- suppressWarnings(cor(temp.stats[, 3], stats.g[sub.samp, 3],
                method = "spearman"))
        }
    }
    # create list for output and populate it
    out.list <- list()
    out.list[[1]] <- dg.mat
    out.list[[2]] <- ev.mat
    out.list[[3]] <- bw.mat
    # return the results
    return(out.list)
}
Now let’s run this for each of our binary and weighted networks defined above.
cop.rs <- cv.resamp.bin(Pnet)
BR.rs <- cv.resamp.bin(BRnet)
X.rs <- cv.resamp.bin(Xnet)
copw.rs <- cv.resamp.wt(chacoP)
BRw.rs <- cv.resamp.wt(chacoBR)
Xw.rs <- cv.resamp.wt(1 - chacoX01)
Once we’ve calculated the statistics for all of these sub-samples, it is then possible to plot the results to see how each network definition and each measure behaves in relation to sub-sampling. Let’s do that first for the three binary networks.
par(mfrow = c(3, 3)) # set up for 3 by 3 plotting
# plot boxplots by sampling fraction for each measure and each network type
boxplot(cop.rs[[1]], ylim = c(0, 1), main = "co-presence - degree", xlab = "sampling fraction",
ylab = "Spearmans rho")
boxplot(cop.rs[[2]], ylim = c(0, 1), main = "co-presence - eigenvector", xlab = "sampling fraction")
boxplot(cop.rs[[3]], ylim = c(0, 1), main = "co-presence - betweenness", xlab = "sampling fraction")
boxplot(BR.rs[[1]], ylim = c(0, 1), main = "BR - degree", xlab = "sampling fraction",
ylab = "Spearmans rho")
boxplot(BR.rs[[2]], ylim = c(0, 1), main = "BR - eigenvector", xlab = "sampling fraction")
boxplot(BR.rs[[3]], ylim = c(0, 1), main = "BR - betweenness", xlab = "sampling fraction")
boxplot(X.rs[[1]], ylim = c(0, 1), main = "Chi squared - degree", xlab = "sampling fraction",
ylab = "Spearmans rho")
boxplot(X.rs[[2]], ylim = c(0, 1), main = "Chi squared - eigenvector", xlab = "sampling fraction")
boxplot(X.rs[[3]], ylim = c(0, 1), main = "Chi squared - betweenness", xlab = "sampling fraction")