Marascuillo Procedure

2017/11/22

Categories: Marascuillo Procedure Tags: R Marascuillo

Preface

In an attempt to determine which variables are explanatory or response for a data set of all categorical variables, I came across the marascuilo procedure, which isn’t used per se for determining whether a variable is explanatory or response, but it can tell me if a statistical significant difference exists between a pair. The data are from the General Social Survey.

Categorical Data of Interest

The categorical variables of interest are:

Statisticians prefer to express the intent of their experiments, studies before performing a set of analyses. Unfortunately, a lot of the time a statistician is asked to make sense of data after the fact. Though not a statistician but an applied mathematician, I am attempting to make sense of data after the fact by asking myself some questions.

Which of these variables could be explanatory or response variables?

Sex should be an explanatory variable. It should be pretty consistent among the decades in the US, as sex-selective abortion is not wide-scaled practiced.

A contingency table of the proportions of males and females surveyed per decade is displayed below.

crosstab(gss, row.vars = "decade", col.vars = "sex", type='r')
##        sex   Male Female    Sum
## decade                         
## 00s         44.68  55.32 100.00
## 10s         44.23  55.77 100.00
## 70s         45.76  54.24 100.00
## 80s         42.78  57.22 100.00
## 90s         43.35  56.65 100.00

Are these proportions changing with time?

The Marascuillo procedure can be used to compare pairwise differences between the decades. A pair proportion difference is statistically significant if its value exceeds the critical range value. Here, we state that the following hypotheses:

The critical range will be \(\chi^2\) statistic of 9.49 (0.95 CI, degrees of freedom = 4).

gender = crosstab(gss, row.vars = "decade", col.vars = "sex", type='f')

marascuillo_procedure = function(p){
  n = length(p[, c(1)]) - 1
  print(n)
  prop = p[,1][1:n]/p[,2][1:n]
  print(prop)
  names = row.names(p)
  N = length(prop)
  value = critical.range = compare= c()
  
  ## Compute critical values; small data set; efficiency non-issue
  for (i in 1:(N - 1))
     { for (j in (i + 1):N){
       value = c(value,(abs(prop[i] - prop[j])))
       critical.range = c(critical.range, 
          sqrt(qchisq(.95,4))*sqrt(prop[i]*(1-prop[i])/p[i,2] + prop[j]*(1-prop[j])/p[j,2]))
        compare = c(compare, paste(names[i], names[j], sep = '-'))
      }
  }
  
  return(data.frame(values = round(value, 3),
                    critical = round(critical.range,3), 
                    comparison = compare))
}  

marascuillo_procedure(gender$table[, c(1,3)])
## [1] 5
##       00s       10s       70s       80s       90s 
## 0.4468413 0.4422598 0.4575667 0.4278492 0.4334871
##    values critical comparison
## 1   0.005    0.027    00s-10s
## 2   0.011    0.019    00s-70s
## 3   0.019    0.018    00s-80s
## 4   0.013    0.018    00s-90s
## 5   0.015    0.028    10s-70s
## 6   0.014    0.027    10s-80s
## 7   0.009    0.028    10s-90s
## 8   0.030    0.020    70s-80s
## 9   0.024    0.020    70s-90s
## 10  0.006    0.018    80s-90s

The statistically significant differences are these pairwise comparisons: 80s and 00s; 70s and 80s; and 70s and 90s. The decade demarcation could afford some tweaking. Looking at year-by-year proportion differences, I observed that the years 1972, 73, and 84, when compared to other years, are statistically different.

# list of year contigency tables 
gender.yr = crosstab(gss, row.vars = "year", col.vars = "sex", type='f')
gender.prop.yr = marascuillo_procedure(gender.yr$table[, c(1,3)])
## [1] 29
##      1972      1973      1974      1975      1976      1977      1978      1980      1982      1983      1984      1985      1986      1987      1988      1989      1990      1991      1993      1994      1996      1998      2000      2002      2004      2006      2008      2010      2012 
## 0.5003100 0.4660904 0.4656334 0.4496644 0.4462975 0.4529412 0.4197128 0.4366485 0.4188172 0.4315197 0.4059742 0.4485007 0.4224490 0.4277075 0.4307900 0.4294079 0.4402332 0.4192485 0.4265255 0.4311497 0.4424931 0.4350282 0.4362797 0.4441230 0.4551920 0.4441242 0.4597133 0.4359100 0.4488349
gender.prop.yr[gender.prop.yr$values > gender.prop.yr$critical, ]
##     values critical comparison
## 6    0.081    0.055  1972-1978
## 7    0.064    0.055  1972-1980
## 8    0.081    0.052  1972-1982
## 9    0.069    0.054  1972-1983
## 10   0.094    0.055  1972-1984
## 12   0.078    0.055  1972-1986
## 13   0.073    0.052  1972-1987
## 14   0.070    0.055  1972-1988
## 15   0.071    0.055  1972-1989
## 16   0.060    0.056  1972-1990
## 17   0.081    0.055  1972-1991
## 18   0.074    0.054  1972-1993
## 19   0.069    0.047  1972-1994
## 20   0.058    0.048  1972-1996
## 21   0.065    0.048  1972-1998
## 22   0.064    0.048  1972-2000
## 23   0.056    0.048  1972-2002
## 25   0.056    0.045  1972-2006
## 27   0.064    0.051  1972-2010
## 37   0.060    0.056  1973-1984
## 63   0.060    0.056  1974-1984
## 251  0.054    0.052  1984-2008

What if I were to elminate them?

if I eliminate those years, then I see that there are no pairwise decade-by-decade differences in proportions that are significant. The years of potential exclusion were years during which additional tests, surveys were done that focused on increasing information from demographics undocumented in the past.

gender = crosstab(gss[gss$year > 1973 & gss$year != 1984,], 
                  row.vars = "decade", col.vars = "sex", type='f')
marascuillo_procedure(gender$table[, c(1,3)])
## [1] 5
##       00s       10s       70s       80s       90s 
## 0.4468413 0.4422598 0.4467153 0.4303728 0.4334871
##    values critical comparison
## 1   0.005    0.027    00s-10s
## 2   0.000    0.022    00s-70s
## 3   0.016    0.018    00s-80s
## 4   0.013    0.018    00s-90s
## 5   0.004    0.030    10s-70s
## 6   0.012    0.028    10s-80s
## 7   0.009    0.028    10s-90s
## 8   0.016    0.022    70s-80s
## 9   0.013    0.022    70s-90s
## 10  0.003    0.019    80s-90s

Gender could be an explanatory variable. More of an assessment is needed to determine whether exclusion of those years is warranted and perform subsequent analyses.

References:

  1. NIST: Comparing multiple proportions: The Marascuillo procedure,
    http://www.itl.nist.gov/div898/handbook/prc/section4/prc474.htm