Preface
In an attempt to determine which variables are explanatory or response for a data set of all categorical variables, I came across the marascuilo procedure, which isn’t used per se for determining whether a variable is explanatory or response, but it can tell me if a statistical significant difference exists between a pair. The data are from the General Social Survey.
Categorical Data of Interest
The categorical variables of interest are:
- Nominal (
sex
)- Male or Female
- Ordinal (
degree
)- Little High School, High School, Junior College, Bachelors, and Graduate
- Ordinal (
natenvir
)- Too Little, About Right, and Too Much
Statisticians prefer to express the intent of their experiments, studies before performing a set of analyses. Unfortunately, a lot of the time a statistician is asked to make sense of data after the fact. Though not a statistician but an applied mathematician, I am attempting to make sense of data after the fact by asking myself some questions.
Which of these variables could be explanatory or response variables?
Sex should be an explanatory variable. It should be pretty consistent among the decades in the US, as sex-selective abortion is not wide-scaled practiced.
A contingency table of the proportions of males and females surveyed per decade is displayed below.
crosstab(gss, row.vars = "decade", col.vars = "sex", type='r')
## sex Male Female Sum
## decade
## 00s 44.68 55.32 100.00
## 10s 44.23 55.77 100.00
## 70s 45.76 54.24 100.00
## 80s 42.78 57.22 100.00
## 90s 43.35 56.65 100.00
Are these proportions changing with time?
The Marascuillo procedure can be used to compare pairwise differences between the decades. A pair proportion difference is statistically significant if its value exceeds the critical range value. Here, we state that the following hypotheses:
- \(H_o\): there is no difference between the pairwise proportion differences
- \(H_a\): there is a difference between the pairwise proportion differences
The critical range will be \(\chi^2\) statistic of 9.49 (0.95 CI, degrees of freedom = 4).
gender = crosstab(gss, row.vars = "decade", col.vars = "sex", type='f')
marascuillo_procedure = function(p){
n = length(p[, c(1)]) - 1
print(n)
prop = p[,1][1:n]/p[,2][1:n]
print(prop)
names = row.names(p)
N = length(prop)
value = critical.range = compare= c()
## Compute critical values; small data set; efficiency non-issue
for (i in 1:(N - 1))
{ for (j in (i + 1):N){
value = c(value,(abs(prop[i] - prop[j])))
critical.range = c(critical.range,
sqrt(qchisq(.95,4))*sqrt(prop[i]*(1-prop[i])/p[i,2] + prop[j]*(1-prop[j])/p[j,2]))
compare = c(compare, paste(names[i], names[j], sep = '-'))
}
}
return(data.frame(values = round(value, 3),
critical = round(critical.range,3),
comparison = compare))
}
marascuillo_procedure(gender$table[, c(1,3)])
## [1] 5
## 00s 10s 70s 80s 90s
## 0.4468413 0.4422598 0.4575667 0.4278492 0.4334871
## values critical comparison
## 1 0.005 0.027 00s-10s
## 2 0.011 0.019 00s-70s
## 3 0.019 0.018 00s-80s
## 4 0.013 0.018 00s-90s
## 5 0.015 0.028 10s-70s
## 6 0.014 0.027 10s-80s
## 7 0.009 0.028 10s-90s
## 8 0.030 0.020 70s-80s
## 9 0.024 0.020 70s-90s
## 10 0.006 0.018 80s-90s
The statistically significant differences are these pairwise comparisons: 80s and 00s; 70s and 80s; and 70s and 90s. The decade demarcation could afford some tweaking. Looking at year-by-year proportion differences, I observed that the years 1972, 73, and 84, when compared to other years, are statistically different.
# list of year contigency tables
gender.yr = crosstab(gss, row.vars = "year", col.vars = "sex", type='f')
gender.prop.yr = marascuillo_procedure(gender.yr$table[, c(1,3)])
## [1] 29
## 1972 1973 1974 1975 1976 1977 1978 1980 1982 1983 1984 1985 1986 1987 1988 1989 1990 1991 1993 1994 1996 1998 2000 2002 2004 2006 2008 2010 2012
## 0.5003100 0.4660904 0.4656334 0.4496644 0.4462975 0.4529412 0.4197128 0.4366485 0.4188172 0.4315197 0.4059742 0.4485007 0.4224490 0.4277075 0.4307900 0.4294079 0.4402332 0.4192485 0.4265255 0.4311497 0.4424931 0.4350282 0.4362797 0.4441230 0.4551920 0.4441242 0.4597133 0.4359100 0.4488349
gender.prop.yr[gender.prop.yr$values > gender.prop.yr$critical, ]
## values critical comparison
## 6 0.081 0.055 1972-1978
## 7 0.064 0.055 1972-1980
## 8 0.081 0.052 1972-1982
## 9 0.069 0.054 1972-1983
## 10 0.094 0.055 1972-1984
## 12 0.078 0.055 1972-1986
## 13 0.073 0.052 1972-1987
## 14 0.070 0.055 1972-1988
## 15 0.071 0.055 1972-1989
## 16 0.060 0.056 1972-1990
## 17 0.081 0.055 1972-1991
## 18 0.074 0.054 1972-1993
## 19 0.069 0.047 1972-1994
## 20 0.058 0.048 1972-1996
## 21 0.065 0.048 1972-1998
## 22 0.064 0.048 1972-2000
## 23 0.056 0.048 1972-2002
## 25 0.056 0.045 1972-2006
## 27 0.064 0.051 1972-2010
## 37 0.060 0.056 1973-1984
## 63 0.060 0.056 1974-1984
## 251 0.054 0.052 1984-2008
What if I were to elminate them?
if I eliminate those years, then I see that there are no pairwise decade-by-decade differences in proportions that are significant. The years of potential exclusion were years during which additional tests, surveys were done that focused on increasing information from demographics undocumented in the past.
gender = crosstab(gss[gss$year > 1973 & gss$year != 1984,],
row.vars = "decade", col.vars = "sex", type='f')
marascuillo_procedure(gender$table[, c(1,3)])
## [1] 5
## 00s 10s 70s 80s 90s
## 0.4468413 0.4422598 0.4467153 0.4303728 0.4334871
## values critical comparison
## 1 0.005 0.027 00s-10s
## 2 0.000 0.022 00s-70s
## 3 0.016 0.018 00s-80s
## 4 0.013 0.018 00s-90s
## 5 0.004 0.030 10s-70s
## 6 0.012 0.028 10s-80s
## 7 0.009 0.028 10s-90s
## 8 0.016 0.022 70s-80s
## 9 0.013 0.022 70s-90s
## 10 0.003 0.019 80s-90s
Gender could be an explanatory variable. More of an assessment is needed to determine whether exclusion of those years is warranted and perform subsequent analyses.
References:
- NIST: Comparing multiple proportions: The Marascuillo procedure,
http://www.itl.nist.gov/div898/handbook/prc/section4/prc474.htm