Introduction

This document provides a brief example of the ceramic apportioning function which replicates methods previously published by Roberts et al. 2012. This script is designed to divide ceramic assemblages into a user specified number of periods based on the overlap in the occupation dates of a site, the production dates of each ceramic type, and some assumptions about the shape of the popularity curve of a ceramic type through time. Specifically, following Roberts et al. 2012, ceramic popularity curves are modeled as a normal distribution centered on the midpoint date and truncated at 2 standard deviations above and below the mean.

The code is designed to be run most efficiently on a data frame of a format similar to the example used here. Specifically, the data frame should have a row for every site/type/count (this means that site name and site beginning and ending dates may be repeated many times). The data frame at a minimum must include site name, type name, count, site occupation start date, site occupation end date, type production start date, and type production end date. The example data here can serve as a template. Note that this script can be run for multiple sites combined into a single dataframe.

ceramics <- read.csv('preapportion.csv') # read in formatted data
DT::datatable(ceramics)

The Apportioning Function

The following chunk of code includes the apportioning script in it’s entirety with embedded comments describing each procedure. The user required input includes Site, Type, Count, a beginning date for the period to be considered, an end date for the period to be considered, the interval size for each period, as well as site beginning and ending dates, and type beginning and ending dates. For more details on the calculations included below, see the Roberts et al. 2012 original publication which describes the method in detail.

This GitHub page includes a version of the script called “Apportion.R” that can be run directly from source. The code below is identical to that source file.

require(msm)

apportion <-
  function(Site,      # vector of site names with one entry for every site/type/count
           Type,      # vector of type names with one entry for every site/type/count
           ct,        # vector of counts with one entry for every site/type/count
           beg.date,  # beginning date of the interval to be considered
           end.date,  # end date of the interval to be considered
           interval,  # interval to define period length for apportioning
           site.beg,  # vector of site beginning dates with one entry for every site/type/count
           site.end,  # vector of site ending dates with one entry for every site/type/count
           type.beg,  # vector of type beginning dates with one entry for every site/type/count
           type.end)  # vector of type ending dates with one entry for every site/type/count
                    {
    
    
    # calculate the number of periods of length "interval" and round up to nearest whole number
    nperiods <- ceiling((end.date - beg.date) / interval)
    
    # Create column labels for apportioned  matrix for each interval
    c.lab <- NULL
    for (i in 1:nperiods) {
      c.lab[i] <- beg.date + ((i - 1) * interval)
    }
    
    # create short variable names to satisfy my need for tidiness (and to match variable names in Roberts et al. 2012 paper)
    t0 <- type.beg
    t1 <- type.end
    s0 <- site.beg
    s1 <- site.end
    
    # set truncation points for normal distribution at 2 standard deviations
    zsd <- 2
    
    # Create empty matrix of 0s with a row for each Site/Type/Count and columns for each period defined above and apply column labels
    mat <- matrix(0, length(Site), nperiods)
    colnames(mat) <- c.lab
    
    # Create an additional vector for ceramics not apportioned to any period considered here
    # This vector will contain counts for ceramics outside of the site occupation span or outside of the user defined period.
    unapportioned <- rep(0, length(Site))
    
    # round site occupation start dates to multiples of interval
    for (i in 1:length(s0)) {
      if ((s0[i] / interval) != (round(s0[i] / interval, 0))) {
        s0[i] <- interval * floor(s0[i] / interval)
      }
    }
    
    # round site occupation end dates to multiples of interval
    for (i in 1:length(s1)) {
      if ((s1[i] / interval) != (round(s1[i] / interval, 0))) {
        s1[i] <- interval * ceiling(s1[i] / interval)
      }
    }
    
    # Apply apportioning procedure in a loop across every Site/Type/Count
    for (j in 1:length(Site)) {
      
      # create a sequence by 1 for every year from the type begin date to end date
      type_seq <- seq(t0[j], t1[j], by = 1)
      # create a sequence from -2 to 2 at intervals of 0.0001 to define positions in the truncated normal distribution
      zscore <- seq(-zsd, zsd, by = 0.0001)
      # create a vector of probabilities for the truncated normal distribution at positions defined by "zscore"
      pdist <- ptnorm(zscore, 0, 1, -zsd, zsd)
      
      # define length of period for type
      g <- t1[j] - t0[j]
      # define midpoint type date
      m <- mean(type_seq)
      # define a sequence by 1 from site occupation start to end date
      site_seq <- seq(s0[j], s1[j])
      
      
      # determine the length of intersection between the type date and the site date
      ws_overlap <- intersect(type_seq, site_seq)
      
      # If type dates and sites dates intersect, do the following:
      if (length(ws_overlap) > 0) {
        vj0 <- min(ws_overlap) # minimum date of overlap
        vj1 <- max(ws_overlap) # maximum date of overlap
        
        z20 <- (vj0 - m) / (g / (2 * zsd)) # transform start date of overlap into z values
        z21 <- (vj1 - m) / (g / (2 * zsd)) # transform end date of overlap into z values
        i0 <- which(abs(zscore - z20) == min(abs(zscore - z20))) # find the closest z score value to z20 defined above
        i1 <- which(abs(zscore - z21) == min(abs(zscore - z21))) # find the closest z score value to z21 defined above
        z_phi20 <- pdist[i0] # assign the probability associated with the selected z score value for start of overlap
        z_phi21 <- pdist[i1] # assign the probability associated with the selected z score value for end of overlap
        
        
        # Loop through each period and assign apprtioned counts as ct * phi value 
        for (i in 1:nperiods) {
          # if statement determines if a particular interval should be considered
          if (length(intersect(seq(c.lab[i], (c.lab[i] + interval - 1)), ws_overlap[1:length(ws_overlap) - 1])) > 0) {
            zj0 <- (c.lab[i] - m) / (g / (2 * zsd))
            zj1 <- ((c.lab[i] + interval) - m) / (g / (2 * zsd))
            i0 <- which(abs(zscore - zj0) == min(abs(zscore - zj0)))
            i1 <- which(abs(zscore - zj1) == min(abs(zscore - zj1)))
            z_phi0 <- pdist[i0]
            z_phi1 <- pdist[i1]
            pjt_1a <- (z_phi1 - z_phi0) / (z_phi21 - z_phi20)
            mat[j, i] <- as.numeric(pjt_1a * ct[j])
          } # end if statement
        } # end for loop
      } # end if statement
      
      # add any portion of the ceramics that extends beyond the period considered to the unapportioned vector
      if (s0[j] < beg.date ||
          s1[j] > end.date) {
        unapportioned[j] <- ct[j] - sum(mat[j, ])
      }
      
      # assign any type count that doesn't overlap with site occupation to unapportioned vector
      if (length(ws_overlap) == 0) {
        unapportioned[j] <- unapportioned[j] + ct[j]
      }
    } # end for loop
    
    # round numbers to 3 digits combine vectors and output
    mat <- round(mat, 3)
    unapportioned <- round(unapportioned, 3)
    mat <- cbind(Site, Type, t0, t1, unapportioned, mat)
    mat <- as.data.frame(mat)
    colnames(mat)[1:4] <- c('Site','Type','Start_Date','End_Date')
    mat[,3:ncol(mat)] <- do.call(cbind,lapply(mat[,3:ncol(mat)],as.numeric))
    return(mat)
} # end function

Running the Function

The following chunk of code applies to the apportioning function to the data frame imported above with the period beginning date set to 1200, the end date set to 1450, and with a period interval at 25 years. The results are displayed as an interactive datatable.

results <- apportion(Site = ceramics$Site, Type = ceramics$Type, ct = ceramics$Count, beg.date=1200, end.date=1450, interval=25, site.beg=ceramics$SiteBeg, site.end = ceramics$SiteEnd, type.beg = ceramics$TypeBeg, type.end = ceramics$TypeEnd)

DT::datatable(results)

Note that if we use the same data but alter those three values (beg.date, end.date, and interval) we will get different results. For example, if we were to only consider the date range from 1200-1350, the sherds apportioned to later intervals in the first example would go in the “unapportioned” category.

results2 <- apportion(Site = ceramics$Site, Type = ceramics$Type, ct = ceramics$Count, beg.date=1200, end.date=1350, interval=25, site.beg=ceramics$SiteBeg, site.end = ceramics$SiteEnd, type.beg = ceramics$TypeBeg, type.end = ceramics$TypeEnd)

DT::datatable(results2)

Similarly, we can change the interval size and the apportioned ceramics will change accordingly.

results3 <- apportion(Site = ceramics$Site, Type = ceramics$Type, ct = ceramics$Count, beg.date=1200, end.date=1450, interval=50, site.beg=ceramics$SiteBeg, site.end = ceramics$SiteEnd, type.beg = ceramics$TypeBeg, type.end = ceramics$TypeEnd)

DT::datatable(results3)

If for some reason you want intervals of say, 42 years, you can do that too.

results4 <- apportion(Site = ceramics$Site, Type = ceramics$Type, ct = ceramics$Count, beg.date=1200, end.date=1450, interval=42, site.beg=ceramics$SiteBeg, site.end = ceramics$SiteEnd, type.beg = ceramics$TypeBeg, type.end = ceramics$TypeEnd)

DT::datatable(results4)

Iterative Proportional Fitting

Another analysis presented within the Roberts et al. 2012 paper allows the matrix of apportioned values to be fit to some externally defined population curve for a site. This is based on the assumption that, in general, we would expect more ceramic deposition during periods of greater population. The way this is achieved is using iterative proportional fitting. Essentially, this method adjusts the row and column totals so that the rows are the row sums of counts during the period of interest and the column totals are the population value multiplied by the total ceramic count. This new table finds the best fit cross-classification that preservers the odds ratios of the table while also respecting the desired totals. In some cases, this procedure will fail as it will generate impossible values by, for example, adding counts for a particular type outside of it’s production range.

The following chunk of code adapts a procedure for iterative proportional fitting that expects input that is exactly the output of the apportioning procedure used above. In this function, an IPF fit is generated and then evaluated to determine whether it produces any problems and then outputs either the new IPF fit dataset or the original dataset and a warning indicating that IPF produced impossible results. This procedure needs to be run one site at a time so rows from the ouptut from above need to be selected by site. Let’s look at the example. We are calling the function from source here as it is quite large. We will use the “results” object above which represents apportioned data at 25 year intervals from 1200-1450.

In order to run this procedure you need to supply a vector of population values for each of the apportioned periods. This can be based on anything from a model to absolute date counts or anything. In this case, we are using the simple population modeling assumptions used in the Roberts et al. 2012 article.

After displaying both datatables I then produce a figure showing the proportion of ceramics assigned to each interval in the original apportioned data and then the IPF fit data.

source('ipf.R')

orig <- results[which(results3$Site=="Swingle's Sample"),]
DT::datatable(orig)
pop <- c(0,0,0,0,13,19,23,23,21,0) 

res <- ipf_apportion(apportion_output=orig, pop=pop)
DT::datatable(res)
plot(seq(1200,1425,by=25),colSums(orig[,6:ncol(orig)])/sum(orig[,6:ncol(orig)]),type='l',col='red', xlab='Date',ylab='Proportion of Ceramic Count', ylim=c(0,0.25), xlim=c(1200,1500))
points(seq(1200,1425,by=25),colSums(res[,6:ncol(res)])/sum(res[,6:ncol(res)]),type='l',col='blue')
legend('topleft',col=c('red','blue'),legend=c('Apportioned','IPF'), lty=c(1,1), lwd=c(1,1))

Now let’s look at an example where the IPF procedure fails. Note that in this case I am intentionally generating this error by supplying an incorrect population curve for this site that indicates no population for intervals where this site was occupied.

orig <- results[which(results3$Site=="Leaverton"),]
DT::datatable(orig)
pop <- c(0,0,0,0,13,19,23,23,21,0) 

res <- ipf_apportion(apportion_output=orig, pop=pop)
## [1] "IPF produced impossible results and original apportioned data are returned"
DT::datatable(res)