R Script for Seriation Using Correspondence Analysis

This document can be cited as follows:

Society for American Archaeology style:
Peeples, Matthew A.
2011 R Script for Seriation Using Correspondence Analysis. Electronic document, http://www.mattpeeples.net/ca.html, accessed

APA style:
Peeples, Matthew A. (2011) R Script for Seriation Using Correspondence Analysis. [online]. Available: http://www.mattpeeples.net/ca.html. ( )




This document provides a brief overview of the ca.R script, which carries out correspondence analysis (CA) on two-way tables. The script is designed to quickly produce graphics that are useful for conducting and evaluating seriations of archaeological materials using CA. The sample data used here come from an article by Kintigh and others (2004), described below. To follow along with this example, download the sample data file along with the script. Sample graphical output and text output are also available for download.

Correspondence analysis (CA) is a statistical method for reducing the dimensionality of multi-variable frequency data. It defines axes of variability on which both observations and variables can be easily displayed. CA is similar to principal components analysis but has several advantages that make it particularly useful for frequency seriation. CA uses the raw count data for each variable, reducing the possibility of data loss associated with data transformations or distance calculations. In addition, CA is simultaneously R and Q mode, which means that both variables and cases can be evaluated and plotted in the same multi-dimensional space. This allows for quick visual assessments of the relationships between cases or groups of cases and specific variables (see Duff 1996; Kintigh et al. 2004).
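
To make the idea of placing cases and variables in the same space more concrete, the following is a minimal sketch of the standard CA computation in base R, using a singular value decomposition of the standardized residuals. It is not the code of ca.R, which relies on the ca package. The three sites and three ceramic types are taken from the sample table further down, and all object names are chosen here for illustration only.

```r
# Small table of raw counts (rows = cases, columns = variables).
counts <- matrix(c(16,  3,  0,
                    5, 16,  8,
                    0,  5, 25),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("Site 1", "Site 3", "Site 4"),
                                 c("LINO", "KIAT", "RED")))

P        <- counts / sum(counts)       # correspondence matrix (proportions)
row_mass <- rowSums(P)                 # row masses
col_mass <- colSums(P)                 # column masses
expected <- outer(row_mass, col_mass)  # proportions expected under independence

# Standardized residuals and their singular value decomposition
S   <- (P - expected) / sqrt(expected)
dec <- svd(S)

k       <- min(dim(counts)) - 1        # number of non-trivial dimensions
idx     <- seq_len(k)
inertia <- dec$d[idx]^2                # principal inertias (eigenvalues)

# Principal coordinates: cases and variables share the same axes
row_coord <- sweep(dec$u[, idx, drop = FALSE], 2, dec$d[idx], "*") / sqrt(row_mass)
col_coord <- sweep(dec$v[, idx, drop = FALSE], 2, dec$d[idx], "*") / sqrt(col_mass)
rownames(row_coord) <- rownames(counts)
rownames(col_coord) <- colnames(counts)

# Percent of total inertia carried by each dimension
print(round(100 * inertia / sum(inertia), 1))
print(round(row_coord, 3))
print(round(col_coord, 3))
```

Because the decomposition works directly on the counts, no separate distance matrix or data transformation has to be chosen beforehand, and the row and column coordinates can be drawn on one scatter plot.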

Kintigh and others (2004) describe in detail the use of CA in seriation and the evaluation of chronological orderings. The authors started by placing all habitation sites with systematic ceramic surface collections into a series of ceramic complexes (i.e., groups with consistently associated ceramic types believed to be contemporaneous). These sites were then assigned date ranges based on the overlapping ranges of tree-ring dates associated with the ceramic types within a complex. Finally, these ceramic complexes were independently evaluated based on a frequency seriation using CA. Specifically, the CA frequency seriation was used to check the case-specific ceramic complex assignments and, in an iterative process, allowed the assignment of specific cases to be reassessed. The ca.R script is designed to speed up the iterative process of comparing temporal assignments (in this example, ceramic complexes) with a frequency seriation based on CA. This method could similarly be used to evaluate chronological groupings based on other information, such as date ranges based on absolute dates, or to evaluate stratigraphic levels across a site.

File Format:
This script is designed to use the *.csv (comma-separated values) file format. Microsoft Excel, as well as the open-source program Calc in the OpenOffice suite, can be used to produce files in this format from any tabular data. For the purposes of this script, the file should be named "ca.csv". Note that file names are case sensitive. The text of the script file may be edited to change the input file name.

Table Format:
Tables should be formatted with each of the samples/observations as rows and each of the variables to be included as columns. The first row of the spreadsheet should be a header that labels each of the columns. The first column should contain the name of each unit (i.e., level, unit, site, etc.). Row names may not be repeated. The second column of the table should be the symbol designation (letter) to be used in the CA plot. The example given here uses letters A-G designating ceramic complexes. All of the remaining columns should contain count data that will be used to conduct the correspondence analysis. This analysis will not work if there are missing data in any rows or columns, so samples with missing data should be removed before running the script. A sample table format is shown below:

SITE    Complex  LINO  KIAT  RED  PUBW  TULA  WING  STJ
Site 1  A          16     3    0     0     0     0    0
Site 2  A          28     5    1     1     0     0    0
Site 3  B           5    16    8     2     0     0    0
Site 4  C           0     5   25     8     3     1    4
Site 5  A          28    15    0     0     0     0    0
Site 6  D           0     1   35    18     4     5    0
Site 7  F           0     0    1    15    38     9    5
Site 8  B          15    35   17     6     0     0    0
Site 9  C           1    15   40    13    10     0    0
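
As a rough check that a file follows this layout, a table can be read into R and inspected before running the script. The sketch below is an assumption about how such a check might look, not an excerpt from ca.R. It writes the first four rows of the sample table to a temporary file so that the example is self-contained, and the object names are illustrative.

```r
# Write a few rows of the sample table to a temporary CSV file (illustration only).
csv_lines <- c(
  "SITE,Complex,LINO,KIAT,RED,PUBW,TULA,WING,STJ",
  "Site 1,A,16,3,0,0,0,0,0",
  "Site 2,A,28,5,1,1,0,0,0",
  "Site 3,B,5,16,8,2,0,0,0",
  "Site 4,C,0,5,25,8,3,1,4"
)
csv_file <- file.path(tempdir(), "ca.csv")
writeLines(csv_lines, csv_file)

# The first column becomes the row names; read.csv stops with an error
# if any row name is repeated.
tab <- read.csv(csv_file, header = TRUE, row.names = 1, stringsAsFactors = FALSE)

# Drop any samples with missing data before the analysis.
tab <- tab[complete.cases(tab), ]

groups <- tab[[1]]              # second column of the file: plotting symbol
counts <- as.matrix(tab[, -1])  # remaining columns: count data

# Basic sanity checks on the count data
stopifnot(is.numeric(counts), !anyNA(counts))
print(groups)
print(counts)
```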

Requirements for Running the Script:
To run this script, you must install the R statistical package (version 2.8), which can be downloaded for free from the R project website. Follow the instructions on the R site for installation procedures. The script also requires two R packages, ca and calibrate. To install them, click on the "Packages" drop-down menu at the top of the R window and click on "Install package(s)". Choose a CRAN mirror (it is best to choose the location closest to you), select the "ca" package, and click OK. Repeat the process for the "calibrate" package. For further instructions on installing packages, consult the R documentation.
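
The same packages can also be installed from the R command line with the standard install.packages() function. The sketch below assumes an internet connection and a reachable CRAN mirror. The helper missing_pkgs() is a hypothetical convenience defined here, not part of R or of ca.R, and the install call is left commented out so that nothing is downloaded unless you choose to run it.

```r
# Packages that the ca.R script needs, according to this document.
required <- c("ca", "calibrate")

# Hypothetical helper: return the packages that are not yet installed.
missing_pkgs <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

to_install <- missing_pkgs(required)
print(to_install)

# Uncomment to install whatever is missing (prompts for a CRAN mirror if none is set):
# if (length(to_install) > 0) install.packages(to_install)
```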

Running the Script:
The first step for running the script is to place the script file "ca.R" and the data file "ca.csv" in the working directory of R. To change the working directory, click on "File" in the R window and select "Change dir", then simply browse to the directory that you would like to use as the working directory. Next, to actually run the script, type the following line into the R command line:


source('ca.R')
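
The working directory can also be set from the command line with setwd(). The following sketch checks that both files are present before sourcing the script. The helper missing_inputs() is hypothetical and defined here only for illustration, and the folder path in the commented usage lines is a placeholder to be replaced with your own.

```r
# Hypothetical helper: report which required files are absent from a directory.
missing_inputs <- function(dir = getwd(), files = c("ca.R", "ca.csv")) {
  files[!file.exists(file.path(dir, files))]
}

# Example usage (placeholder path, adjust before running):
# setwd("C:/path/to/seriation")
# if (length(missing_inputs()) == 0) source("ca.R")
```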

Script Output:
The ca.R script runs automatically with no need for user input. The script first displays a plot of the first two eigenvectors of the CA, showing both the observations and the variables labeled with their row and column names. Click on the console to display the next plot. The second plot again displays the first two eigenvectors of the CA. In this plot, the observations are shown as characters and colors determined by the single-letter value in the second column of the data table. The variables are shown on the plot and labeled with the column name. The second plot is scaled so that the entire contents (including labels) fit within the bounding box of the scatter plot. After the script is completed, all graphical output is saved as a PDF file (ca.pdf). The script also outputs a text file (CA_out.txt) containing the percent of the total variability explained by each CA dimension as well as the raw row and column inertia values. Examples of the output using the sample data are shown below.
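
For readers who want to see how output of this kind can be produced, here is a minimal sketch built on the ca package. It is an approximation written for this document, not the contents of ca.R. The function name write_ca_outputs() and its arguments are hypothetical, the five rows of counts come from the sample table above, and files are written to a temporary folder. The example runs only if the ca package is installed.

```r
# Hypothetical helper: fit a CA, save two plots to a PDF and a summary to a text file.
write_ca_outputs <- function(counts, groups,
                             pdf_file = "ca.pdf", txt_file = "CA_out.txt") {
  fit <- ca::ca(counts)

  # Principal coordinates: standard coordinates scaled by the singular values
  rows <- sweep(fit$rowcoord[, 1:2], 2, fit$sv[1:2], "*")
  cols <- sweep(fit$colcoord[, 1:2], 2, fit$sv[1:2], "*")

  grDevices::pdf(pdf_file)
  plot(fit)                            # plot 1: rows and columns labeled by name

  grp  <- factor(groups)               # plot 2: rows drawn as group letters
  both <- rbind(rows, cols)            # shared range so every label fits
  plot(both[, 1], both[, 2], type = "n", asp = 1,
       xlab = "Dimension 1", ylab = "Dimension 2")
  text(rows[, 1], rows[, 2], labels = as.character(grp), col = as.integer(grp))
  text(cols[, 1], cols[, 2], labels = colnames(counts), font = 2)
  grDevices::dev.off()

  # Text summary: principal inertias plus row and column contributions
  writeLines(utils::capture.output(print(summary(fit))), txt_file)
  invisible(list(pdf = pdf_file, txt = txt_file))
}

if (requireNamespace("ca", quietly = TRUE)) {
  counts <- matrix(c(16,  3,  0,  0,  0, 0, 0,
                      5, 16,  8,  2,  0, 0, 0,
                      0,  5, 25,  8,  3, 1, 4,
                      0,  1, 35, 18,  4, 5, 0,
                      0,  0,  1, 15, 38, 9, 5),
                   nrow = 5, byrow = TRUE,
                   dimnames = list(paste("Site", c(1, 3, 4, 6, 7)),
                                   c("LINO", "KIAT", "RED", "PUBW",
                                     "TULA", "WING", "STJ")))
  out <- write_ca_outputs(counts, groups = c("A", "B", "C", "D", "F"),
                          pdf_file = file.path(tempdir(), "ca.pdf"),
                          txt_file = file.path(tempdir(), "CA_out.txt"))
}
```

Coloring the row symbols by group makes it easy to see whether cases assigned to the same ceramic complex fall together along the seriation curve.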



Sample:

Principal inertias (eigenvalues):

       dim     value         %   cum%  scree plot
 [1,]  1       0.699812   34.1   34.1  *************************
 [2,]  2       0.446337   21.7   55.8  ***************
 [3,]  3       0.279855   13.6   69.5  *********
 [4,]  4       0.163632   8.00   77.4  *****
 [5,]  5       0.131942   6.40   83.9  ****
 [6,]  6       0.119731   5.80   89.7  ***
 [7,]  7       0.097513   4.80   94.5  **
 [8,]  8       0.074401   3.60   98.1  *
 [9,]  9       0.039444   1.90  100.0
[10,]          --------  -----
[11,]  Total:  2.052668  100.0


Rows:
    name     mass  qlt  inr     k=1  cor  ctr     k=2  cor  ctr
1 | LZ0501 |   12  817    8 | -681  322    8 |  844  495   18 |
2 | LZ0502 |   09  873    6 | -827  517    8 |  687  356   09 |


References:

Duff, A.I.
1996 Ceramic Micro-seriation: Types or Attributes? American Antiquity 61:89-101.

Kintigh, K. W., D. M. Glowacki, and D. L. Huntley
2004 Long-Term Settlement History and the Emergence of Towns in the Zuni Area. American Antiquity 69:432-456.

Script: