Making Ecological Inference for R × C Tables Easy: Standard Errors for EMax
Martin Elff
Department of Political Science
University of Mannheim
68131 Mannheim, Germany
+49.621.181.2093 (voice)
+49.621.181.2099 (fax)
webrum.uni-mannheim.de/sowi/elff
www.sowi.uni-mannheim.de/lehrstuehle/lspwivs.htm
Thomas Gschwend
Mannheimer Zentrum fuer Europaeische Sozialforschung (MZES)
University of Mannheim
68131 Mannheim, Germany
+49.621.181.2809 (voice)
+49.621.181.2845 (fax)
www.sowi.uni-mannheim.de/lehrstuehle/lspol1/gschwend.htm
For most substantively relevant ecological inference problems,
scholars face an R × C table of which only the marginals can
be observed. We argue that the most frequently used ecological
inference methods are computationally demanding and inefficient,
since they do not make use of all the information that is
available. We take a fresh look at an estimator that was developed
precisely for these kinds of problems: EMax. This estimator
utilizes more of the available information at the estimation stage
than previous ecological inference estimators in the literature. As
a methodological innovation that remedies the main disadvantage
of EMax in substantive applications, we derive model-based
standard errors.
Requirements for ecological inference estimators
• Making efficient use of all available data: fewer assumptions will be
necessary, results will be generalizable
• But: There is no best solution; only data- or application-specific
solutions are available
Limits of conventional ecological inference methods
• Computationally demanding
• Inefficient because not all available information is used
• Stable results only for the 2 × 2 case
EMax as data- and application-specific solution
• Suitable in situations where aggregate data are supplemented by
survey data
• Allows for efficient estimation of cell probabilities and cell counts
for each of the aggregates
• Does not lead to out-of-bounds cell estimates
• Reliable results even in the R × C case
Applications
• Useful for the estimation of voter transitions or split-ticket voting.
• Requires three pieces of information: two sets of district-level
marginals and, in addition, national-level data.
• National-level information is derived from cross-tabulations of survey items.
• Examples:
– Johnston, R. J. and Hay, A. M. 1983. “Voter Transition Probability
Estimates: an Entropy-Maximizing Approach.” European Journal of
Political Research 11: 93-98.
– Johnston, R. J. and Pattie, C. J. 2000. “Ecological Inference and
Entropy-Maximizing: an Alternative Estimation Procedure for
Split-Ticket Voting.” Political Analysis 8: 333-345.
Entropy-Maximizing (EMax) principle
• Entropy: measure of the amount of uncertainty that is contained in a set
of quantities
• Entropy in the case of discrete probabilities without constraints:
$$E = -\sum_i \pi_i \log \pi_i$$
• Principle of maximal entropy:
– A principle for constructing probability models, not an estimation
procedure!
– Maximal entropy subject to constraints reflects our uncertainty
about unknown quantities
– Example: among distributions over the real line with given mean and
variance, the normal distribution maximizes entropy
– In the case of the ecological inference problem: a distribution of
maximal entropy describes the likelihood of all possible cell counts for
given marginal constraints, and the cell probabilities describe the most
likely configuration of cell proportions. The underlying question: how
many individual-level configurations are consistent with the observed
result, given the district-level marginals and the national-level
frequencies?
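To make the entropy measure above concrete, the following minimal sketch evaluates E = −Σ π_i log π_i for three small probability vectors. The function name `entropy` and the example vectors are illustrative assumptions, not part of the EMax software. In the unconstrained case, the uniform distribution should yield the largest value and a degenerate distribution a value of zero.

```python
import math


def entropy(probs):
    """Shannon entropy E = -sum_i p_i * log(p_i); terms with p_i = 0 count as 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


# Hypothetical probability vectors over four categories (illustration only).
uniform = [0.25, 0.25, 0.25, 0.25]   # maximal uncertainty
skewed = [0.70, 0.10, 0.10, 0.10]    # some categories more likely than others
certain = [1.0, 0.0, 0.0, 0.0]       # no uncertainty at all

for name, probs in [("uniform", uniform), ("skewed", skewed), ("certain", certain)]:
    print(name, entropy(probs))
```

Adding constraints (such as fixed marginals) restricts the set of admissible probability vectors; the maximum-entropy principle then picks the least informative vector within that set.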
Our contribution: Standard errors for the EMax approach to ecological inference
• Derivation of theoretical standard errors, which have not yet been
available for EMax in ecological inference
• Standard errors that are computable even when one of the
margins comes from a survey sample
• Validation of the theoretical standard errors by a Monte Carlo study
• Validation of the EMax application with real-world data
Derivation of an entropy maximizing probability model for ecological inference
1. Complete data constitute a three-dimensional (I × J × K) contingency
table $(n_{ijk})$.
2. Marginal tables $(n_{+jk})$ and $(n_{i+k})$ are known.
3. Marginal table $(n_{ij+})$ is estimated as $(\hat{n}_{ij+})$ by a survey sample.
4. Cell entries $n_{ijk}$ for any combination of indices $i, j, k$ are a realization
of a multinomial random variable $N_{ijk}$ with parameters $\pi_{ijk}$ (note that
this implies $\sum_{i,j,k} \pi_{ijk} = 1$).
5. Parameters $\pi_{ijk}$ maximize entropy subject to the restrictions
$$\sum_i \pi_{ijk} = n_{+jk}/n =: p_{+jk}$$
$$\sum_j \pi_{ijk} = n_{i+k}/n =: p_{i+k}$$
$$\sum_k \pi_{ijk} = \hat{n}_{ij+}/\hat{n} =: p_{ij+}$$
where $\hat{n}$ denotes the sample size and $n$ denotes the size of the population.
Maximizing entropy under these restrictions is equivalent to maximizing
the following Lagrangian
$$E_r = -\sum_{i,j,k} \pi_{ijk} \log \pi_{ijk} - \tau\Bigl(1 - \sum_{i,j,k} \pi_{ijk}\Bigr) - \sum_{i,j} \alpha_{ij}\Bigl(p_{ij+} - \sum_k \pi_{ijk}\Bigr) - \sum_{i,k} \beta_{ik}\Bigl(p_{i+k} - \sum_j \pi_{ijk}\Bigr) - \sum_{j,k} \gamma_{jk}\Bigl(p_{+jk} - \sum_i \pi_{ijk}\Bigr)$$
with Lagrange multipliers $\tau$, $\alpha_{ij}$, $\beta_{ik}$, $\gamma_{jk}$. The $\pi_{ijk}$ maximize this function if
and only if
$$\frac{\partial E_r}{\partial \pi_{ijk}} = -\log \pi_{ijk} - 1 + \tau + \alpha_{ij} + \beta_{ik} + \gamma_{jk} = 0$$
$$\Leftrightarrow \quad \log \pi_{ijk} = \tau - 1 + \alpha_{ij} + \beta_{ik} + \gamma_{jk}$$
that is, if and only if the $\pi_{ijk}$ conform to a log-linear model with no three-way
interactions. The parameters of such a model can thus be computed using
a standard procedure like iterated proportional fitting (IPF).
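The IPF step can be illustrated with a short sketch. This is a minimal pure-Python illustration under stated assumptions, not the authors' R implementation: the function name `ipf_three_way`, the fixed iteration count, and the 2 × 2 × 2 example values are all hypothetical, and every margin is assumed to be strictly positive. Starting from a uniform table and rescaling towards each two-way margin in turn keeps the fit inside the log-linear model without three-way interaction.

```python
import itertools


def ipf_three_way(p_ij, p_ik, p_jk, iterations=500):
    """Fit pi[i][j][k] to three two-way margins by iterated proportional fitting.

    Assumes strictly positive, mutually consistent margins that each sum to 1.
    """
    I, J, K = len(p_ij), len(p_ij[0]), len(p_ik[0])
    # Uniform start: contains no interactions of any order.
    pi = [[[1.0 / (I * J * K)] * K for _ in range(J)] for _ in range(I)]
    for _ in range(iterations):
        # Rescale so that sum_k pi[i][j][k] equals p_ij[i][j].
        for i, j in itertools.product(range(I), range(J)):
            s = sum(pi[i][j])
            for k in range(K):
                pi[i][j][k] *= p_ij[i][j] / s
        # Rescale so that sum_j pi[i][j][k] equals p_ik[i][k].
        for i, k in itertools.product(range(I), range(K)):
            s = sum(pi[i][j][k] for j in range(J))
            for j in range(J):
                pi[i][j][k] *= p_ik[i][k] / s
        # Rescale so that sum_i pi[i][j][k] equals p_jk[j][k].
        for j, k in itertools.product(range(J), range(K)):
            s = sum(pi[i][j][k] for i in range(I))
            for i in range(I):
                pi[i][j][k] *= p_jk[j][k] / s
    return pi


# Hypothetical 2 x 2 x 2 table built as a product of two-way terms,
# so it has no three-way interaction by construction.
a = [[1.0, 2.0], [3.0, 1.0]]
b = [[1.0, 1.5], [2.0, 1.0]]
c = [[1.0, 0.5], [1.0, 2.0]]
raw = [[[a[i][j] * b[i][k] * c[j][k] for k in range(2)]
        for j in range(2)] for i in range(2)]
total = sum(x for plane in raw for row in plane for x in row)
truth = [[[x / total for x in row] for row in plane] for plane in raw]

# Only the three two-way margins are passed to the fitting routine.
p_ij = [[sum(truth[i][j]) for j in range(2)] for i in range(2)]
p_ik = [[sum(truth[i][j][k] for j in range(2)) for k in range(2)] for i in range(2)]
p_jk = [[sum(truth[i][j][k] for i in range(2)) for k in range(2)] for j in range(2)]

fit = ipf_three_way(p_ij, p_ik, p_jk)
max_error = max(abs(fit[i][j][k] - truth[i][j][k])
                for i, j, k in itertools.product(range(2), repeat=3))
print(max_error)
```

Because the example table has no three-way interaction, the fit should reproduce it from its margins alone. With real data the full table is unknown, and the fitted values are the maximum-entropy cell probabilities.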
Derivation of the standard errors
Observational data are available only for the marginal tables. This is reflected
in the following log-likelihood of the marginal tables for given $\pi_{ijk}$:
$$\ell = \sum_{i,j} n_{ij+} \log \pi_{ij+} + \sum_{i,k} n_{i+k} \log \pi_{i+k} + \sum_{j,k} \hat{n}_{+jk} \log \pi_{+jk} + C$$
with
$$\pi_{ij+} = \sum_t \pi_{ijt}, \quad \pi_{i+k} = \sum_s \pi_{isk} \quad \text{and} \quad \pi_{+jk} = \sum_r \pi_{rjk}$$
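Second derivatives of $\ell$ are given, for example, by
$$\frac{\partial^2 \ell}{\partial \alpha_{ij}\, \partial \alpha_{i'j'}} = -(n + n + \hat{n})\left(\delta_{ii'}\delta_{jj'}\pi_{ij+} - \pi_{ij+}\pi_{i'j'+}\right) + \delta_{ii'}\delta_{jj'}\left[\sum_t n_{i+t}\left(\frac{\pi_{ijt}}{\pi_{i+t}} - \left(\frac{\pi_{ijt}}{\pi_{i+t}}\right)^2\right) + \sum_t \hat{n}_{+jt}\left(\frac{\pi_{ijt}}{\pi_{+jt}} - \left(\frac{\pi_{ijt}}{\pi_{+jt}}\right)^2\right)\right]$$
$$\frac{\partial^2 \ell}{\partial \alpha_{ij}\, \partial \beta_{i'k}} = -(n + n + \hat{n})\left(\delta_{ii'}\pi_{ijk} - \pi_{ij+}\pi_{i'+k}\right) + \delta_{ii'}\,\hat{n}_{+jk}\left(\frac{\pi_{ijk}}{\pi_{+jk}} - \left(\frac{\pi_{ijk}}{\pi_{+jk}}\right)^2\right)$$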
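Application of the delta-method then leads to the standard error formula
$$\mathrm{SE}(\hat{\pi}_{ijk}) = \hat{\pi}_{ijk}(1 - \hat{\pi}_{ijk})\left(\hat{G}^{ij,ij}_{\alpha\alpha} + \hat{G}^{ik,ik}_{\beta\beta} + \hat{G}^{jk,jk}_{\gamma\gamma} + 2\hat{G}^{ij,ik}_{\alpha\beta} + 2\hat{G}^{ij,jk}_{\alpha\gamma} + 2\hat{G}^{ik,jk}_{\beta\gamma}\right)^{1/2}$$
where $\hat{G}^{ij,ij}_{\alpha\alpha}$, $\hat{G}^{ik,ik}_{\beta\beta}$, $\hat{G}^{jk,jk}_{\gamma\gamma}$, $\hat{G}^{ij,ik}_{\alpha\beta}$, $\hat{G}^{ij,jk}_{\alpha\gamma}$ and $\hat{G}^{ik,jk}_{\beta\gamma}$ are elements of the asymptotic
covariance matrix of $\alpha_{ij}$, $\beta_{ik}$ and $\gamma_{jk}$ based on the second derivatives of the
above likelihood.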
A Monte Carlo study
1. Artificial data: population size 20,000,000; “sample size” for one of
the margins: 1,000; cell probabilities given, conforming to the
no-three-way-interaction model; size of the array of probabilities: 40 × 5 × 5
2. Compute theoretical standard errors
3. 200 replications:
a) Generate random numbers (multinomial distribution), compute
marginal counts and a “sample” of one marginal table
b) Compute IPF estimates of cell probabilities
4. Compare root mean square error from simulations with theoretical
standard errors
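The logic of steps 2 to 4 can be sketched in a deliberately simplified setting. The code below is not the EMax simulation: it assumes plain multinomial proportions, for which the theoretical standard error is the textbook sqrt(p(1 − p)/n), instead of IPF estimates with the standard errors derived above. The cell probabilities and function names are hypothetical; only the sample size of 1,000 and the 200 replications mirror the design described in the list.

```python
import math
import random


def simulate_rmse(probs, sample_size, replications, seed=1):
    """Root mean squared error of simulated multinomial cell proportions."""
    rng = random.Random(seed)
    cells = range(len(probs))
    squared_errors = [0.0] * len(probs)
    for _ in range(replications):
        # Draw one multinomial sample and tabulate the cell counts.
        counts = [0] * len(probs)
        for cell in rng.choices(cells, weights=probs, k=sample_size):
            counts[cell] += 1
        # Accumulate the squared deviation of each estimated proportion.
        for cell in cells:
            estimate = counts[cell] / sample_size
            squared_errors[cell] += (estimate - probs[cell]) ** 2
    return [math.sqrt(total / replications) for total in squared_errors]


def theoretical_se(probs, sample_size):
    """Standard error of a multinomial proportion: sqrt(p * (1 - p) / n)."""
    return [math.sqrt(p * (1.0 - p) / sample_size) for p in probs]


# Hypothetical cell probabilities (illustration only).
probs = [0.4, 0.3, 0.2, 0.1]
rmse = simulate_rmse(probs, sample_size=1000, replications=200)
se = theoretical_se(probs, sample_size=1000)
ratios = [r / s for r, s in zip(rmse, se)]
print(ratios)
```

Ratios close to 1 indicate that the simulated RMSE and the theoretical standard errors agree, which is the same comparison the study performs cell by cell for the EMax estimates.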
[Figure 1 plot omitted. Axes: theoretical standard errors for each cell; root mean squared errors for each cell; both scaled from 0.0 to 3.5.]
Figure 1: Comparison of simulated root mean square errors for cell
probabilities and theoretical standard errors (colors denote a 2D
kernel density estimate)
Simulation shows that standard errors and RMSE are of the same order
of magnitude, yet numerical instabilities lead to some deviations of RMSE
for some of the cells.
Application of EMax to the analysis of split-ticket voting in New Zealand
Johnston and Pattie (2000) use data from New Zealand on split-ticket
voting to validate EMax estimates. In our application, we supply standard
errors, which Johnston and Pattie do not report.
The numbers of split-ticket voters in individual voting districts are
estimated from the following known quantities:
1. the numbers $n_{ij+}$ of votes for candidates of the parties $j$ in districts $i$,
2. the numbers $n_{i+k}$ of votes for party lists $k$ in districts $i$,
3. sample estimates $\hat{n}_{+jk}$ of the numbers of combinations of candidate and
list votes at the national level.
The following tables give examples of these data.
Table 1: List votes in New Zealand districts – extract
District               Labour  National  Alliance  NZ First  · · ·
Albany                  10271     13583      1967      1033  · · ·
Aoraki                  14413     10393      2881       992  · · ·
Auckland Central        13647      7747      2321       671  · · ·
Banks Peninsula         14018     12643      2844       788  · · ·
Bay of Plenty           11342     11350      1769      3178  · · ·
Christchurch Central    13407      8887      3369       880  · · ·
Christchurch East       15084      7816      3665       719  · · ·
Clutha-Southland         9182     12882      1883      1043  · · ·
Coromandel              12390     10747      2241      2406  · · ·
Dunedin North           15052      6427      3902       401  · · ·
⋮                           ⋮         ⋮         ⋮         ⋮
Table 2: Candidate votes in New Zealand districts – extract
District               Labour  National  Alliance  NZ First  · · ·
Albany                   8753     13701      3775       751  · · ·
Aoraki                  17415     10276      2031       705  · · ·
Auckland Central        12645      7360      6129         0  · · ·
Banks Peninsula         15475     14020      1474       510  · · ·
Bay of Plenty            8679     15781      1338      4185  · · ·
Christchurch Central    17229      7825      2690       641  · · ·
Christchurch East       18157      6995      2127       528  · · ·
Clutha-Southland         9218     15619      1049      1131  · · ·
Coromandel               3892     13432      1217      1237  · · ·
Dunedin North           18856      6161      1968       224  · · ·
⋮                           ⋮         ⋮         ⋮         ⋮
Table 3: Survey data on New Zealand list and candidate votes – extract
              Candidate votes
List votes     Labour  National  Alliance  NZ First  · · ·
National          197        26         6         7  · · ·
Labour             13       134        15        30  · · ·
NZ First            7        18        52         9  · · ·
Alliance            9        30         7        63  · · ·
⋮                   ⋮         ⋮         ⋮         ⋮
On the basis of these data, applying the EMax procedure yields
the following estimates:
Table 4: Predicted list and candidate votes for the district Albany
              Candidate votes
List votes     Labour  National  Alliance  NZ First  · · ·
National        11949      1122       234       786  · · ·
Labour            403      3584       341      2476  · · ·
NZ First          219       486      1224       809  · · ·
Alliance          140       360        88      3088  · · ·
⋮                   ⋮         ⋮         ⋮         ⋮
[Figure 2 plot omitted. Axes: real number of straight tickets; predicted number of straight tickets; both scaled from 5000 to 30000.]
Figure 2: Comparison of predicted and observed straight-ticket votes
in New Zealand districts
Software for EMax with standard errors
• Implemented in R
• Implementation in Stata is planned