R Script for Calculating the Brainerd-Robinson Coefficient of Similarity and Assessing Sampling Error
This document can be cited
as follows:
Society for American Archaeology style:
Peeples, Matthew A.
2011 R Script for Calculating the Brainerd-Robinson Coefficient of Similarity
and Assessing Sampling Error. Electronic document, http://www.mattpeeples.net/br.html,
accessed
APA style:
Peeples, Matthew A. (2011) R Script for Calculating the Brainerd-Robinson Coefficient
of Similarity and Assessing Sampling Error. [online]. Available: http://www.mattpeeples.net/br.html.
(
)
| return to my homepage | more
R scripts |
Brainerd Robinson Coefficient of Similarity:
The Brainerd-Robinson (BR) coefficient is a similarity measure that was developed
within archaeology specifically for comparing assemblages in terms of the proportions
of types or other categorical data. Although this measure is commonly used in
archaeological studies, it cannot be calculated with most commercial software
packages. This script calculates BR coefficients on count or percent data. Further,
following procedures first described by DeBoer, Kintigh, and Rostoker (1996),
this script also calculates the probabilities of obtaining a BR similarity value
less than or equal the actual value by chance for every pair-wise comparison.
These probability values can be useful in determing when differences between
sites might be a function of sampling error and when they are likely not. This
probability assessment is described in more detail below.
BR is a city-block metric of similarity (S) that is calculated as:
where, for all variables (k), P is the total percentage in assemblages i and
j. This provides a scale of similarity from 0-200 where 200 is perfect similarity
and 0 is no similarity.
This remainder of this document provides a brief overview of the BR.R script.
This script is based on the BRSAMPLE originally written by Keith Kintigh as
part of the Tools for Quantitative
Archaeology program suite. You can download a sample data
file along with the script to follow along
with this example. Right click and click Save As for both of the files above.
The sample data used here are based on Huntley's (2004, 2008) ceramic compositional
study focused on the Zuni region. The discussion and interpretations of the
sample data presented below are also based on Huntley's published work.
File Format:
This script is designed to use the *.csv (comma separated value) file format.
Microsoft Excel as well as the open-source program Calc in the Open
Office suite can be used to produce files in this format from any tabular
data. For the purposes of this script, the file should be named "BR.csv".
Note that file names are case sensitive. The text of the script file may be
edited to change the input file name.
Table Format:
Tables should be formatted with each of the samples/sites/observations as rows
and each of the categorical variables as columns. The first row of the spreadsheet
should be a header that labels each of the columns. The first column should
contain the name of each unit (i.e., level, unit, site, etc.). Row names may
not be repeated. All of the remaining columns should contain numerical count
or percent data. The sample data from Huntley's (2004, 2008) study consist of
counts of 5 ceramic compositional groups from 9 sites in the Zuni region :
SITE |
DLH-1 |
DLH-2a |
DLH-2b |
DLH-2c |
DLH-4 |
Atsinna |
16 |
9 |
3 |
0 |
1 |
Cienega | 13 |
3 |
2 |
0 |
0 |
Mirabal | 9 |
5 |
2 |
5 |
0 |
PdMuertos | 14 |
12 |
3 |
0 |
0 |
Hesh | 0 |
26 |
4 |
0 |
0 |
LowPesc | 1 |
26 |
4 |
0 |
0 |
BoxS | 0 |
11 |
3 |
13 |
0 |
OjoBon | 0 |
0 |
17 |
0 |
16 |
Sp170 | 0 |
0 |
18 |
0 |
14 |
Requirements
for Running the Script:
In order to run this
script, you must install the R statistical package (version 2.8 or later). R
can be downloaded for free here. Follow
the instructions on the R site for installation procedures. In addition to this,
this script requires one specific R packages to be installed (statnet). In order
to install this package, simply click on the "packages" drop down
menu at the top of the R window and click on "Install package(s)".
Choose a CRAN mirror (it is best to choose the location closest to you). Select
the "statnet" package and click OK. For further instructions for installing
packages, check here.
Starting the Script:
The first step for running the script is to place the script file "BR.R"
and the data file "BR.csv" in the working directory of R. To change
the working directory, click on "File" in the R window and select
"Change dir", then simply browse to the directory that you would like
to use as the working directory. Next, to actually run the script, type the
following line into the R command line:
source('BR.R')
Running the Script:
After typing the command
above into the command line, the console will request user input with two prompts.
1) Are the
data percents or counts? 1=percents, 2=counts:
At this prompt enter 1 if input data
are percents. For percent data, the script will output a rectangular matrix
of BR values for all pair-wise comparisons and then end without running the
sampling error assessment (which requires count data). If the data are counts,
a second prompt will appear.
2) How many
random runs :
At this prompt, enter the number of
random runs that you wish use to assess sampling error. Increasing the number
of random runs increases the total time it will take to calculate probabilities.
Most relatively recent computers (should be able to calculate 1,000 runs in
a few seconds to several minutes depending on the size of the original data
table.
Output:
After the script has run, one or two files will be placed in the R
working directory depending on whether percent or count data were input.
For both percent and count data, a file named "BR_out.csv" will be
output. This file is a rectangular matrix of Brainerd-Robinson similarity coefficients
for all sites/samples included. The sample output is shown below. Numbers shown
in red in the table below represent comparisons between sites in the
same settlement cluster and comparisons in black represent comparisons between
settlement clusters. Data will not be color coded in the actual output.
Atsinna |
Cienega |
Mirabal |
PdMuertos |
Hesh |
LowPesc |
BoxS |
Ojo Bon |
S170 |
|
Atsinna | 200 |
164 |
152 |
179 |
83 |
89 |
83 |
28 |
28 |
Cienega | 164 |
200 |
138 |
151 |
56 |
62 |
56 |
22 |
22 |
Mirabal | 152 |
138 |
200 |
152 |
67 |
73 |
114 |
19 |
19 |
PdMuertos | 179 |
151 |
152 |
200 |
103 |
110 |
102 |
21 |
21 |
Hesh | 83 |
56 |
67 |
103 |
200 |
194 |
104 |
27 |
27 |
LowPesc | 89 |
62 |
73 |
110 |
194 |
200 |
104 |
26 |
26 |
BoxS | 83 |
56 |
114 |
102 |
104 |
104 |
200 |
22 |
22 |
Ojo Bon | 28 |
22 |
19 |
21 |
27 |
26 |
22 |
200 |
191 |
S170 | 28 |
22 |
19 |
21 |
27 |
26 |
22 |
191 |
200 |
If the input data are counts, a second file named "BR_prob" will also be created. This file is a rectangular matrix of probability values based on the Monte Carlo simulation procedure described below. In this table, a value of 0 indicates a probability less than the number of significant digits considered. For example, if probabilities were calculated to thousandths (i.e., 0.115), a value of 0 indicates some value < 0.001. The sample output table should be similar to the table shown below. Actual values will vary somewhat do to the randomization procedure. The sample output shown here was calculated based on 1,000 random runs. Again, numbers shown in red in the table below represent comparisons between sites in the same settlement cluster and comparisons in black represent comparisons between settlement clusters.
Atsinna |
Cienega |
Mirabal |
PdMuertos |
Hesh |
LowPesc |
BoxS |
Ojo Bon |
S170 |
|
Atsinna | 0 |
0.687 |
0.397 |
0.857 |
0 |
0.001 |
0 |
0 |
0 |
Cienega | 0.687 |
0 |
0.246 |
0.402 |
0 |
0 |
0 |
0 |
0 |
Mirabal | 0.397 |
0.246 |
0 |
0.403 |
0 |
0 |
0.022 |
0 |
0 |
PdMuertos | 0.857 |
0.402 |
0.403 |
0 |
0.004 |
0.005 |
0.004 |
0 |
0 |
Hesh | 0 |
0 |
0 |
0.004 |
0 |
0.997 |
0.001 |
0 |
0 |
LowPesc | 0.001 |
0 |
0 |
0.005 |
0.997 |
0 |
0.002 |
0 |
0 |
BoxS | 0 |
0 |
0.022 |
0.004 |
0.001 |
0.002 |
0 |
0 |
0 |
Ojo Bon | 0 |
0 |
0 |
0 |
0 |
0 |
0 |
0 |
0.994 |
S170 | 0 |
0 |
0 |
0 |
0 |
0 |
0 |
0.994 |
0 |
Interpreting
the Output:
The BR values provided in the first table above are relatively easy to interpret.
A higher value indicates that the assemblages being compared are more similar
in terms of the proportions of categorical variables considered (in this case,
ceramic compositional groups). In the table above, comparisons between sites
in the same settlement cluster (geographic location) are shown in red and comparisons
between clusters are shown in black. A cursory inspection of the BR matrix suggests
that BR values are highest among settlements in the same cluster (Huntley 2004,
2008).
Because the sizes of the samples considered in a similarity matrix can often
vary considerably, however, it is also important to assess the possibility that
the differences among samples are the result of sampling error. For this script,
I follow the procedures described by DeBoer and others (1996). This script estimates
the probability of obtaining a BR value as low as or lower than a given comparison
by chance using a Monte Carlo simulation procedure. Specifically, 1,000 random
samples of a specified sample size (based on the actual number of samples in
each two-way comparison) are drawn with replacement from a population with proportions
defined by the actual number of samples in each category for all samples. In
the example above, the proportions of each category (calculated based on the
raw data table) are as follows:
DLH-1 |
DLH-2a |
DLH-2b |
DLH-2c |
DLH-4 |
TOTAL |
|
Counts | 53 |
92 |
56 |
18 |
31 |
250 |
Percents | 21.2% |
36.8% |
22.4% |
7.2% |
12.4% |
100.0% |
If we were to compare the compositional
sample from Atsinna (n=29) with the sample from Cienega (n=18), we would draw
two random samples (with replacement) from the global pool of all samples from
all sites of n=29 and n=18. We would then calculate the BR similarity coefficient
between these two random samples. This same procedure would then be repeated
until the desired number of random runs has been obtained (in this case 1,000).
The proportion of all of the random samples which produce BR values less than
or equal to the actual BR value for a comparison between two samples provides
an indication of the probability that an observed difference may be due to sampling
error. For example, a probability value of p=0.005 means that in only 5 out
of 1,000 random runs was a BR value less than or equal to the observed obtained.
Such a low probability suggests that the differences between the sites being
compared are extremely unlikely to have been the result of sampling error. On
the other hand, a value of p=0.250 suggests that approximately 1 in 4 random
samples pulled from the global pool produced BR values less than or equal to
the observed value. This suggests that any minor differences among samples may
be related to the vagaries of sampling.
As the table of probabilities above shows, comparisons between sites in the
same settlement cluster (shown in red) all display relatively high probability
values whereas comparisons between sites in different settlement clusters all
have quite low probability values. This suggests that the differences between
sites in different clusters are not likely the result of sampling error. Conversely,
minor differences in similarity values among sites in the same settlement cluster
are similar to what might be expected by chance. Thus, minor differences among
sites in the same area are likely attributable, at least in part, to sampling
error.
It is also be possible to calculate probabilities based on some distribution
other than the global pool of all samples (DeBoer et al. 1996). In order to
do this, replace the section of the script below shown in red with the following:
p.grp <- as.matrix(read.table(file='prob.csv',sep=',',header=T))
Place a tabular *.csv file in the R working directory named "prob.csv"
in the following format. The first row should provide variable names and the
second row should provide the proportions of each variable to be used in the
random sampling procedure. The variables must be exactly the same as in the
original data file. Such a procedure might be useful if you wanted to compare
the proportional representation of objects away from where they are produced
to the proportions in the production area (e.g., Bernardini 2005).
DLH-1 |
DLH-2a |
DLH-2b |
DLH-2c |
DLH-4 |
0.25 |
0.10 |
0.15 |
0.30 |
0.20 |
References Cited:
Bernardini, Wesley
2005 Hopi Oral Tradition and the Archaeology of Identity. University of Arizona
Press, Tucson.
DeBoer, Warren R., Keith W. Kintigh, and Arthur G. Rostoker
1996 Ceramic Seriation and Site Reoccupation in Lowland South America. Latin
American Antiquity, 7(3):263-278.
Huntley, Deborah L.
2004 Interaction, Boundaries, and Identities: A Multiscalar Approach to the
Organizational Scale of Pueblo IV Zuni Society. Ph.D. Dissertation, Arizona
State University.
Huntley, Deborah L.
2008 Ancestral Zuni Glaze-Decorated Pottery: Viewing Pueblo IV Regional Organization
through Ceramic Production and Exchange. Anthropological Papers No. 72. University
of Arizona Press, Tucson.
Script: