R Script for Seriation Using Correspondence Analysis
This document can be cited as follows:
Society for American Archaeology style:
Peeples, Matthew A.
2011 R Script for Seriation Using Correspondence Analysis. Electronic document,
http://www.mattpeeples.net/ca.html, accessed [access date].
APA style:
Peeples, Matthew A. (2011) R Script for Seriation Using Correspondence Analysis.
[online]. Available: http://www.mattpeeples.net/ca.html. ([access date])
This document provides a brief overview of the ca.R script, which can be used
to carry out correspondence analysis (CA) on two-way tables. The script is
designed to quickly produce graphics that are useful for conducting and
evaluating seriations of archaeological materials using CA. The sample data
used here come from an article by Kintigh and others (2004), described below.
To follow along with this example, download the sample data file and the
script (right-click each file and choose Save As). Sample graphical output
and text output are also available for download.
Correspondence analysis (CA) is a statistical method for reducing the
dimensionality of multi-variable frequency data that defines axes of
variability on which both observations and variables can be easily displayed.
CA is similar to principal components analysis but has several advantages
that make it particularly useful for frequency seriation. CA uses the raw
count data for each variable, reducing the possibility of data loss associated
with data transformations or distance calculations. In addition, CA is
simultaneously R and Q mode, which means that both variables and cases can be
evaluated and plotted in the same multi-dimensional space. This allows for
quick visual assessments of the relationships between cases or groups of cases
and specific variables (see Duff 1996; Kintigh et al. 2004).
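To make the idea of shared axes for cases and variables concrete, the following is a minimal sketch of CA computed by singular value decomposition in base R. It is not the code used in ca.R. The small table reuses the first four sites and four types from the sample table format shown in this document, and all object names (counts, row_mass, col_mass, and so on) are chosen here for illustration only.

```r
# Counts for four sites (rows) and four ceramic types (columns),
# copied from the sample table format in this document.
counts <- matrix(
  c(16,  3,  0, 0,
    28,  5,  1, 1,
     5, 16,  8, 2,
     0,  5, 25, 8),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste("Site", 1:4), c("LINO", "KIAT", "RED", "PUBW"))
)

P        <- counts / sum(counts)   # correspondence matrix (proportions)
row_mass <- rowSums(P)             # row masses
col_mass <- colSums(P)             # column masses

# Standardized residuals from the independence model
expected <- row_mass %o% col_mass
S <- (P - expected) / sqrt(expected)

dec     <- svd(S)
inertia <- dec$d^2                 # principal inertias (eigenvalues)

# Principal coordinates: sites and types share the same axes
row_coords <- (dec$u / sqrt(row_mass)) %*% diag(dec$d)
col_coords <- (dec$v / sqrt(col_mass)) %*% diag(dec$d)
rownames(row_coords) <- rownames(counts)
rownames(col_coords) <- colnames(counts)

# Percent of total inertia carried by each dimension
round(100 * inertia / sum(inertia), 1)

# One possible ordering of the sites along the first dimension
# (the sign of an axis is arbitrary, so the order may be reversed)
rownames(counts)[order(row_coords[, 1])]
```

Because no transformation or distance matrix is needed, the raw counts go straight into the decomposition, and the site and type coordinates can be plotted together.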
Kintigh and others (2004) describe in detail the use of CA in seriation and
the evaluation of chronological orderings. The authors started by placing all
habitation sites with systematic ceramic surface collections into a series of
ceramic complexes (i.e., groups with consistently associated ceramic types believed
to be contemporaneous). These sites were then assigned date ranges based on the
overlapping ranges of tree-ring dates associated with the ceramic types within
a complex. Finally, these ceramic complexes were independently evaluated based
on a frequency seriation using CA. Specifically, the CA frequency seriation
was used to check the case-specific ceramic complex assignments and, in an
iterative process, allowed the assignment of specific cases to be reassessed.
The ca.R script is designed to speed up the iterative process of comparing
temporal assignments (in this example, ceramic complexes) with a frequency
seriation based on CA. This method could similarly be used to evaluate
chronological groupings based on other information, such as date ranges based
on absolute dates, or to evaluate stratigraphic levels across a site.
File Format:
This script is designed to use the *.csv (comma separated value) file format.
Microsoft Excel as well as the open-source program Calc in the OpenOffice
suite can be used to produce files in this format from any tabular
data. For the purposes of this script, the file should be named "ca.csv".
Note that file names are case sensitive. The text of the script file may be
edited to change the input file name.
Table Format:
Tables should be formatted with each of the samples/observations as rows and
each of the variables to be included as columns. The first row of the spreadsheet
should be a header that labels each of the columns. The first column should
contain the name of each unit (e.g., level, unit, or site). Row names may
not be repeated. The second column of the table should be the symbol designation
(letter) to be used in the CA plot. The example given here uses letters A-G
designating ceramic complexes. All of the remaining columns should contain count
data that will be used to conduct the correspondence analysis. This analysis
will not work if there are missing data in any rows or columns, so samples with
missing data should be removed before running the script. A sample table format
is shown below:
SITE   | Complex | LINO | KIAT | RED | PUBW | TULA | WING | STJ
Site 1 | A       | 16   | 3    | 0   | 0    | 0    | 0    | 0
Site 2 | A       | 28   | 5    | 1   | 1    | 0    | 0    | 0
Site 3 | B       | 5    | 16   | 8   | 2    | 0    | 0    | 0
Site 4 | C       | 0    | 5    | 25  | 8    | 3    | 1    | 4
Site 5 | A       | 28   | 15   | 0   | 0    | 0    | 0    | 0
Site 6 | D       | 0    | 1    | 35  | 18   | 4    | 5    | 0
Site 7 | F       | 0    | 0    | 1   | 15   | 38   | 9    | 5
Site 8 | B       | 15   | 35   | 17  | 6    | 0    | 0    | 0
Site 9 | C       | 1    | 15   | 40  | 13   | 10   | 0    | 0
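The layout rules above can be checked in R before running the script. The following is a minimal sketch, not part of ca.R, that reads the sample rows as comma separated text and tests them against the requirements described in this section. The object names (csv_text, dat, symbols, counts) are chosen here for illustration. With a real file, the text argument would be replaced by the file name "ca.csv".

```r
# The sample rows from this section written as comma separated text.
# With a real file, use read.csv("ca.csv", ...) instead of text = csv_text.
csv_text <- "SITE,Complex,LINO,KIAT,RED,PUBW,TULA,WING,STJ
Site 1,A,16,3,0,0,0,0,0
Site 2,A,28,5,1,1,0,0,0
Site 3,B,5,16,8,2,0,0,0
Site 4,C,0,5,25,8,3,1,4
Site 5,A,28,15,0,0,0,0,0
Site 6,D,0,1,35,18,4,5,0
Site 7,F,0,0,1,15,38,9,5
Site 8,B,15,35,17,6,0,0,0
Site 9,C,1,15,40,13,10,0,0"

# row.names = 1 takes the first column as row names; read.csv stops with an
# error if any row name is repeated.
dat <- read.csv(text = csv_text, header = TRUE, row.names = 1,
                stringsAsFactors = FALSE)

symbols <- dat[, 1]               # second spreadsheet column: plot symbols
counts  <- as.matrix(dat[, -1])   # remaining columns: count data

# Checks that mirror the requirements in the text
stopifnot(is.numeric(counts))             # count columns hold numbers only
stopifnot(!anyNA(counts))                 # no missing data
stopifnot(all(nchar(symbols) == 1))       # single-letter symbol codes

dim(counts)   # number of samples and number of variables
```

If any check fails, R stops with a message naming the failed condition, which points to the rows or columns that need to be cleaned up in the spreadsheet.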
Requirements for Running the Script:
In order to run this script, you must install the R statistical package
(version 2.8). R is free to download; follow the instructions on the R site
for installation procedures. In addition, this script requires two R packages
to be installed (ca and calibrate). To install these two packages, click on
the "Packages" drop-down menu at the top of the R window and click on
"Install package(s)". Choose a CRAN mirror (it is best to choose the location
closest to you). Select the "ca" or "calibrate" package and click OK. Repeat
the process for the other package. For further instructions on installing
packages, consult the R documentation.
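The same installation can be done from the R command line instead of the menus. The sketch below is one way to do it, using the standard install.packages and requireNamespace functions. The object names needed and missing_pkgs are chosen here for illustration, and the interactive() guard is an assumption added so that nothing is downloaded when the lines are run non-interactively.

```r
# The two packages named in this document
needed <- c("ca", "calibrate")

# Which of them are not yet available in the current library paths?
is_installed <- vapply(needed, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- needed[!is_installed]

# Install only what is missing. This needs an internet connection, and R may
# ask you to choose a CRAN mirror. The interactive() guard skips installation
# when the code is not typed at a live R console.
if (length(missing_pkgs) > 0 && interactive()) {
  install.packages(missing_pkgs)
}
```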
Running the Script:
The first step for running the script is to place the script file "ca.R"
and the data file "ca.csv" in the working directory of R. To change
the working directory, click on "File" in the R window and select
"Change dir", then simply browse to the directory that you would like
to use as the working directory. Next, to actually run the script, type the
following line into the R command line:
source('ca.R')
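The working directory can also be set and checked from the command line. The following sketch assumes a hypothetical folder "C:/seriation" that holds both files; replace it with your own path. It only calls source('ca.R') when both files are found.

```r
# Hypothetical folder holding ca.R and ca.csv; replace with your own path
work_dir <- "C:/seriation"

if (dir.exists(work_dir)) {
  setwd(work_dir)
}
getwd()   # confirm the current working directory

# Check that both files are present before running the script
# (file names are case sensitive on some systems)
files_found <- file.exists(c("ca.R", "ca.csv"))
if (all(files_found)) {
  source("ca.R")
} else {
  message("ca.R and/or ca.csv not found in ", getwd())
}
```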
Script Output:
The ca.R script runs automatically with no need for user input. The script
first displays a plot of the first two dimensions of the CA showing both the
observations and the variables, each labeled with its row or column name.
Click on the console window to display the next plot. The second plot again
displays the first two dimensions of the CA. In this plot, the observations
are shown as characters and colors determined by the single-letter value in
the second column of the data table, and the variables are labeled with their
column names. The second plot is scaled so that the entire contents (including
labels) fit within the bounding box of the scatter plot. After the script is
completed, all graphical output is saved as a PDF file (ca.pdf).
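For readers who want to see how a plot of the second kind could be built, here is a minimal base R sketch. It is not the code in ca.R, which relies on the ca and calibrate packages. It uses the nine sample rows from the table format section, computes the CA by singular value decomposition, and writes the plot to a temporary file with the hypothetical name ca_sketch.pdf. The color assignment and the 10 percent axis padding are assumptions made for illustration.

```r
# Sample rows from the table format section: complex letters and counts
complex <- c("A", "A", "B", "C", "A", "D", "F", "B", "C")
counts <- matrix(
  c(16,  3,  0,  0,  0, 0, 0,
    28,  5,  1,  1,  0, 0, 0,
     5, 16,  8,  2,  0, 0, 0,
     0,  5, 25,  8,  3, 1, 4,
    28, 15,  0,  0,  0, 0, 0,
     0,  1, 35, 18,  4, 5, 0,
     0,  0,  1, 15, 38, 9, 5,
    15, 35, 17,  6,  0, 0, 0,
     1, 15, 40, 13, 10, 0, 0),
  nrow = 9, byrow = TRUE,
  dimnames = list(paste("Site", 1:9),
                  c("LINO", "KIAT", "RED", "PUBW", "TULA", "WING", "STJ"))
)

# CA by singular value decomposition (base R only)
P        <- counts / sum(counts)
row_mass <- rowSums(P)
col_mass <- colSums(P)
expected <- row_mass %o% col_mass
dec  <- svd((P - expected) / sqrt(expected))
rows <- (dec$u / sqrt(row_mass)) %*% diag(dec$d)   # site coordinates
cols <- (dec$v / sqrt(col_mass)) %*% diag(dec$d)   # type coordinates

# One color per complex letter (assumed scheme, for illustration)
pal <- setNames(seq_along(unique(complex)), unique(complex))

# Pad the axis ranges by 10 percent so that labels stay inside the plot box
pad <- function(v) range(v) + c(-0.1, 0.1) * diff(range(v))
xy  <- rbind(rows[, 1:2], cols[, 1:2])

out_file <- file.path(tempdir(), "ca_sketch.pdf")  # hypothetical file name
pdf(out_file)
plot(xy, type = "n", xlim = pad(xy[, 1]), ylim = pad(xy[, 2]),
     xlab = "Dimension 1", ylab = "Dimension 2")
text(rows[, 1], rows[, 2], labels = complex, col = pal[complex])
points(cols[, 1], cols[, 2], pch = 17)
text(cols[, 1], cols[, 2], labels = colnames(counts), pos = 3, cex = 0.8)
invisible(dev.off())
```

Plotting the complex letter in place of the site name makes it easy to spot a site that falls among sites assigned to a different complex.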
The script also outputs a txt file (CA_out.txt) containing
information including the percent of the total variability explained by each
CA dimension as well as the raw row and column inertia values. Examples of the
output using the sample data are shown below.
Sample:
Principal inertias (eigenvalues):
       dim    value      %     cum%   scree plot
 [1,]  1      0.699812   34.1   34.1  *************************
 [2,]  2      0.446337   21.7   55.8  ***************
 [3,]  3      0.279855   13.6   69.5  *********
 [4,]  4      0.163632   8.00   77.4  *****
 [5,]  5      0.131942   6.40   83.9  ****
 [6,]  6      0.119731   5.80   89.7  ***
 [7,]  7      0.097513   4.80   94.5  **
 [8,]  8      0.074401   3.60   98.1  *
 [9,]  9      0.039444   1.90  100.0
[10,]         --------  -----
[11,]  Total: 2.052668  100.0
Rows:
    name     mass  qlt  inr    k=1  cor  ctr    k=2  cor  ctr
1 | LZ0501 |   12  817    8 | -681  322    8 |  844  495   18 |
2 | LZ0502 |   09  873    6 | -827  517    8 |  687  356   09 |
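The percent and cumulative percent columns in the sample output follow directly from the principal inertias. The short sketch below recomputes them from the values printed in the sample; the object names (eig, pct, cum_pct) are chosen here for illustration.

```r
# Principal inertias copied from the sample output
eig <- c(0.699812, 0.446337, 0.279855, 0.163632, 0.131942,
         0.119731, 0.097513, 0.074401, 0.039444)

pct     <- 100 * eig / sum(eig)   # percent of total inertia per dimension
cum_pct <- cumsum(pct)            # cumulative percent

data.frame(dim   = seq_along(eig),
           value = eig,
           pct   = round(pct, 1),
           cum   = round(cum_pct, 1))
```

In this sample the first two dimensions, which are the ones plotted by the script, together account for 55.8 percent of the total inertia.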
References:
Duff, A. I.
1996 Ceramic Micro-seriation: Types or Attributes? American Antiquity 61:89-101.
Kintigh, K. W., D. M. Glowacki, and D. L. Huntley
2004 Long-Term Settlement History and the Emergence of Towns in the Zuni Area.
American Antiquity 69:432-456.
Script: