Making Ecological Inference for R × C Tables Easy - Standard Errors for EMax

Authors: Martin Elff and Thomas Gschwend
Martin Elff
Department of Political Science
University of Mannheim
68131 Mannheim, Germany
+49.621.181.2093 (voice)
+49.621.181.2099 (fax)
webrum.uni-mannheim.de/sowi/elff
www.sowi.uni-mannheim.de/lehrstuehle/lspwivs.htm
Thomas Gschwend
Mannheimer Zentrum für Europäische Sozialforschung (MZES)
University of Mannheim
68131 Mannheim, Germany
+49.621.181.2809 (voice)
+49.621.181.2845 (fax)
www.sowi.uni-mannheim.de/lehrstuehle/lspol1/gschwend.htm
In most substantively relevant ecological inference problems,
scholars face an R × C table of which only the marginals can
be observed. We argue that the most frequently used ecological
inference methods are computationally demanding and inefficient,
since they do not make use of all the information that is
available. We take a fresh look at an estimator that was developed
precisely for these kinds of problems: EMax. This estimator
utilizes more of the available information at the estimation stage
than previous ecological inference estimators in the literature.
As a methodological innovation that remedies the main disadvantage
of EMax in substantive applications, we derive model-based
standard errors.
Requirements for ecological inference estimators
• Making efficient use of all available data: fewer assumptions will be
necessary, results will be generalizable
• But: There is no best solution; only data- or application-specific
solutions are available
Limits of conventional ecological inference methods
• Computationally demanding
• Inefficient because not all available information is used
• Stable results only for the 2 × 2 case
EMax as a data- and application-specific solution
• Suitable in situations where aggregate data are supplemented by
survey data
• Allows for efficient estimation of cell probabilities and cell counts
for each of the aggregates
• Does not lead to out-of-bounds cell estimates
• Reliable results even in the R × C case
Applications
• Useful for the estimation of voter transitions or split-ticket voting.
• Requires three pieces of information: the two district-level marginal
tables and, in addition, national-level data.
• National-level information is derived from cross-tabulations of survey
items.
• Examples:
– Johnston, R. J. and Hay, A. M. 1983. “Voter Transition Probability
Estimates: an Entropy-Maximizing Approach.” European
Journal of Political Research 11: 93-98.
– Johnston, R. J. and Pattie, C. J. 2000. “Ecological Inference and
Entropy-Maximizing: an Alternative Estimation Procedure for
Split-Ticket Voting.” Political Analysis 8: 333-345.
Entropy-Maximizing (EMax) principle
• Entropy: a measure of the amount of uncertainty that is contained in
a set of quantities
• Entropy in the case of discrete probabilities without constraints:
$$E = -\sum_i \pi_i \log \pi_i$$
• Principle of maximal entropy:
– A principle for constructing probability models, not an estimation
procedure!
– Maximal entropy subject to constraints reflects our uncertainty
about unknown quantities
– Example: for a given mean and variance, the distribution over the
real line that maximizes entropy is the normal distribution
– In the case of the ecological inference problem: a distribution of
maximal entropy describes the likelihood of all possible cell counts
for given marginal constraints; the cell probabilities describe the
most likely configuration of cell proportions
– How many possibilities on the individual level are consistent with
the observed result, given the district-level marginals and the
national-level frequencies?
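The entropy formula in the list above can be illustrated with a minimal Python sketch. The function name `entropy` and the three example distributions are illustrative assumptions and do not come from the poster; the sketch only shows that, in the absence of constraints, the uniform distribution has the largest entropy of the three.

```python
import math


def entropy(probs):
    """Entropy E = -sum_i pi_i * log(pi_i), treating 0 * log(0) as 0."""
    if abs(sum(probs) - 1.0) > 1e-9:
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log(p) for p in probs if p > 0)


# Hypothetical distributions over four categories.
uniform = [0.25, 0.25, 0.25, 0.25]
skewed = [0.70, 0.10, 0.10, 0.10]
certain = [1.0, 0.0, 0.0, 0.0]

for name, dist in [("uniform", uniform), ("skewed", skewed), ("certain", certain)]:
    print(f"{name:8s} entropy = {entropy(dist):.4f}")
```

The more evenly the probability mass is spread, the larger the entropy; a distribution concentrated on a single category has entropy zero.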
Our contribution: Standard errors for the EMax
approach to ecological inference
• Derivation of theoretical standard errors, which have not yet been
available for EMax in ecological inference
• Standard errors that are computable even in the case where one of the
margins is a survey sample
• Validation of the theoretical standard errors by a Monte Carlo study
• Validation of the EMax application with real-world data
Derivation of an entropy-maximizing probability
model for ecological inference
1. Complete data constitute a three-dimensional ($I \times J \times K$)
contingency table $(n_{ijk})$.
2. Marginal tables $(n_{+jk})$ and $(n_{i+k})$ are known.
3. Marginal table $(n_{ij+})$ is estimated as $(\hat{n}_{ij+})$ by a survey sample.
4. Cell entries $n_{ijk}$ for any combination of indices $i, j, k$ are a
realization of a multinomial random variable $N_{ijk}$ with parameters
$\pi_{ijk}$ (note that this implies $\sum_{i,j,k} \pi_{ijk} = 1$).
5. Parameters $\pi_{ijk}$ maximize entropy subject to the restrictions
$$\sum_i \pi_{ijk} = n_{+jk}/n =: p_{+jk}, \qquad
\sum_j \pi_{ijk} = n_{i+k}/n =: p_{i+k}, \qquad
\sum_k \pi_{ijk} = \hat{n}_{ij+}/\hat{n} =: p_{ij+},$$
where $\hat{n}$ denotes the sample size and $n$ denotes the size of the
population.
Maximizing entropy under these restrictions is equivalent to maximizing
the following Lagrangian
$$E_r = -\sum_{i,j,k} \pi_{ijk} \log \pi_{ijk}
- \tau\Bigl(1 - \sum_{i,j,k} \pi_{ijk}\Bigr)
- \sum_{i,j} \alpha_{ij}\Bigl(p_{ij+} - \sum_k \pi_{ijk}\Bigr)
- \sum_{i,k} \beta_{ik}\Bigl(p_{i+k} - \sum_j \pi_{ijk}\Bigr)
- \sum_{j,k} \gamma_{jk}\Bigl(p_{+jk} - \sum_i \pi_{ijk}\Bigr)$$
with Lagrange multipliers $\tau$, $\alpha_{ij}$, $\beta_{ik}$, $\gamma_{jk}$. The $\pi_{ijk}$ maximize
this function if and only if
$$\frac{\partial E_r}{\partial \pi_{ijk}}
= -\log \pi_{ijk} - 1 + \tau + \alpha_{ij} + \beta_{ik} + \gamma_{jk} = 0
\iff \log \pi_{ijk} = \tau - 1 + \alpha_{ij} + \beta_{ik} + \gamma_{jk},$$
that is, if and only if the $\pi_{ijk}$ conform to a log-linear model with no
three-way interactions. The parameters of such a model can thus be
computed using a standard procedure like iterative proportional fitting
(IPF).
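The IPF step can be sketched as follows. This is a minimal Python illustration, not the authors' R implementation: the function names `ipf3` and `margins` and the log-linear parameters `a`, `b`, `c` are hypothetical. The sketch builds cell probabilities without three-way interaction, keeps only their three two-way margins, and checks that IPF started from a uniform table recovers the cells.

```python
import math
from itertools import product


def ipf3(p_ij, p_ik, p_jk, max_iter=1000, tol=1e-12):
    """Fit pi[i][j][k] to three two-way margins, starting from a uniform table."""
    I, J, K = len(p_ij), len(p_jk), len(p_ik[0])
    pi = [[[1.0 / (I * J * K)] * K for _ in range(J)] for _ in range(I)]
    for _ in range(max_iter):
        for i, j in product(range(I), range(J)):  # match sum_k pi_ijk = p_ij+
            s = sum(pi[i][j])
            if s > 0:
                for k in range(K):
                    pi[i][j][k] *= p_ij[i][j] / s
        for i, k in product(range(I), range(K)):  # match sum_j pi_ijk = p_i+k
            s = sum(pi[i][j][k] for j in range(J))
            if s > 0:
                for j in range(J):
                    pi[i][j][k] *= p_ik[i][k] / s
        for j, k in product(range(J), range(K)):  # match sum_i pi_ijk = p_+jk
            s = sum(pi[i][j][k] for i in range(I))
            if s > 0:
                for i in range(I):
                    pi[i][j][k] *= p_jk[j][k] / s
        # The (j, k) margin was fitted last; stop once the other two fit too.
        err_ij = max(abs(sum(pi[i][j]) - p_ij[i][j])
                     for i, j in product(range(I), range(J)))
        err_ik = max(abs(sum(pi[i][j][k] for j in range(J)) - p_ik[i][k])
                     for i, k in product(range(I), range(K)))
        if max(err_ij, err_ik) < tol:
            break
    return pi


def margins(pi):
    """Return the (i, j), (i, k) and (j, k) margins of a three-way table."""
    I, J, K = len(pi), len(pi[0]), len(pi[0][0])
    p_ij = [[sum(pi[i][j]) for j in range(J)] for i in range(I)]
    p_ik = [[sum(pi[i][j][k] for j in range(J)) for k in range(K)] for i in range(I)]
    p_jk = [[sum(pi[i][j][k] for i in range(I)) for k in range(K)] for j in range(J)]
    return p_ij, p_ik, p_jk


# Hypothetical log-linear parameters for a 2 x 3 x 2 table.
a = [[0.0, 0.5, -0.3], [0.2, -0.1, 0.4]]    # alpha_ij
b = [[0.1, -0.2], [0.0, 0.3]]               # beta_ik
c = [[0.0, 0.2], [-0.4, 0.1], [0.3, 0.0]]   # gamma_jk
raw = [[[math.exp(a[i][j] + b[i][k] + c[j][k]) for k in range(2)]
        for j in range(3)] for i in range(2)]
total = sum(x for plane in raw for row in plane for x in row)
truth = [[[x / total for x in row] for row in plane] for plane in raw]

fitted = ipf3(*margins(truth))
max_diff = max(abs(fitted[i][j][k] - truth[i][j][k])
               for i, j, k in product(range(2), range(3), range(2)))
print(f"largest cell difference: {max_diff:.2e}")
```

Starting from a uniform table matters here: IPF preserves the interaction structure of its starting values, so a uniform start yields the fit without three-way interaction that the derivation calls for.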
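Derivation of the standard errors
Observational data are available only for the marginal tables. This is
reflected in the following log-likelihood of the marginal tables for
given $\pi_{ijk}$:
$$\ell = \sum_{i,j} n_{ij+} \log \pi_{ij+}
+ \sum_{i,k} n_{i+k} \log \pi_{i+k}
+ \sum_{j,k} \hat{n}_{+jk} \log \pi_{+jk} + C$$
with
$$\pi_{ij+} = \sum_t \pi_{ijt}, \qquad
\pi_{i+k} = \sum_s \pi_{isk} \qquad \text{and} \qquad
\pi_{+jk} = \sum_r \pi_{rjk}.$$
Note that here, as in the application to the New Zealand data, the margin
estimated from the survey sample is $(\hat{n}_{+jk})$.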
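As a rough illustration of this log-likelihood, the following Python sketch evaluates it (up to the constant C) for a given table of cell probabilities. The function name `marginal_loglik` and the toy counts are hypothetical; the third count table plays the role of the survey-based margin.

```python
import math


def marginal_loglik(pi, n_ij, n_ik, n_jk):
    """Log-likelihood of the three marginal tables, omitting the constant C.

    pi[i][j][k] are cell probabilities; n_ij, n_ik and n_jk are the observed
    (or, for one of them, surveyed) two-way marginal counts.
    """
    I, J, K = len(pi), len(pi[0]), len(pi[0][0])
    loglik = 0.0
    for i in range(I):
        for j in range(J):
            if n_ij[i][j] > 0:
                loglik += n_ij[i][j] * math.log(sum(pi[i][j]))
    for i in range(I):
        for k in range(K):
            if n_ik[i][k] > 0:
                loglik += n_ik[i][k] * math.log(sum(pi[i][j][k] for j in range(J)))
    for j in range(J):
        for k in range(K):
            if n_jk[j][k] > 0:
                loglik += n_jk[j][k] * math.log(sum(pi[i][j][k] for i in range(I)))
    return loglik


# Toy 2 x 2 x 2 example with uniform cell probabilities (all margins 0.25).
pi_uniform = [[[0.125, 0.125], [0.125, 0.125]], [[0.125, 0.125], [0.125, 0.125]]]
n_ij = [[10, 20], [30, 40]]   # population margin, n = 100
n_ik = [[25, 25], [25, 25]]   # population margin, n = 100
n_jk = [[1, 2], [3, 4]]       # survey margin, n_hat = 10
value = marginal_loglik(pi_uniform, n_ij, n_ik, n_jk)
print(f"log-likelihood (without C): {value:.4f}")
```

In this toy case every marginal probability equals 0.25, so the result is simply (n + n + n̂) · log 0.25.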
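Second derivatives of $\ell$ are given, for example, by
$$\frac{\partial^2 \ell}{\partial \alpha_{ij}\, \partial \alpha_{i'j'}}
= -(n + n + \hat{n})\bigl(\delta_{ii'}\delta_{jj'}\pi_{ij+} - \pi_{ij+}\pi_{i'j'+}\bigr)
+ \delta_{ii'}\delta_{jj'} \Biggl\{
\sum_t n_{i+t} \Biggl[\frac{\pi_{ijt}}{\pi_{i+t}} - \Bigl(\frac{\pi_{ijt}}{\pi_{i+t}}\Bigr)^2\Biggr]
+ \sum_t \hat{n}_{+jt} \Biggl[\frac{\pi_{ijt}}{\pi_{+jt}} - \Bigl(\frac{\pi_{ijt}}{\pi_{+jt}}\Bigr)^2\Biggr]
\Biggr\}$$
$$\frac{\partial^2 \ell}{\partial \alpha_{ij}\, \partial \beta_{i'k}}
= -(n + n + \hat{n})\bigl(\delta_{ii'}\pi_{ijk} - \pi_{ij+}\pi_{i'+k}\bigr)
+ \delta_{ii'}\, \hat{n}_{+jk} \Biggl[\frac{\pi_{ijk}}{\pi_{+jk}} - \Bigl(\frac{\pi_{ijk}}{\pi_{+jk}}\Bigr)^2\Biggr]$$
where $\delta$ denotes the Kronecker delta.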
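Application of the delta method then leads to the standard error formula
$$\mathrm{SE}(\hat{\pi}_{ijk}) = \hat{\pi}_{ijk}(1 - \hat{\pi}_{ijk})
\Bigl[\hat{G}^{ij,ij}_{\alpha\alpha} + \hat{G}^{ik,ik}_{\beta\beta} + \hat{G}^{jk,jk}_{\gamma\gamma}
+ 2\hat{G}^{ij,ik}_{\alpha\beta} + 2\hat{G}^{ij,jk}_{\alpha\gamma} + 2\hat{G}^{ik,jk}_{\beta\gamma}\Bigr]^{1/2}$$
where $\hat{G}^{ij,ij}_{\alpha\alpha}$, $\hat{G}^{ik,ik}_{\beta\beta}$, $\hat{G}^{jk,jk}_{\gamma\gamma}$, $\hat{G}^{ij,ik}_{\alpha\beta}$, $\hat{G}^{ij,jk}_{\alpha\gamma}$ and $\hat{G}^{ik,jk}_{\beta\gamma}$
are elements of the asymptotic covariance matrix of $\alpha_{ij}$, $\beta_{ik}$ and $\gamma_{jk}$,
based on the second derivatives of the above log-likelihood.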
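Once the covariance elements are available, evaluating the standard error formula for a single cell is simple arithmetic. The Python sketch below assumes those elements have already been obtained elsewhere; the function name `emax_standard_error`, its argument names, and the numbers in the example are hypothetical.

```python
import math


def emax_standard_error(pi_hat, g_aa, g_bb, g_cc, g_ab, g_ac, g_bc):
    """Standard error of one estimated cell probability.

    g_aa, g_bb, g_cc are the variances of alpha_ij, beta_ik and gamma_jk for
    this cell; g_ab, g_ac, g_bc are the corresponding covariances.
    """
    variance_term = g_aa + g_bb + g_cc + 2.0 * (g_ab + g_ac + g_bc)
    if variance_term < 0:
        raise ValueError("covariance elements imply a negative variance")
    return pi_hat * (1.0 - pi_hat) * math.sqrt(variance_term)


# Hypothetical inputs: cell probability 0.2, variances 0.01, no covariances.
se = emax_standard_error(0.2, 0.01, 0.01, 0.01, 0.0, 0.0, 0.0)
print(f"SE = {se:.5f}")
```

Positive covariances between the three parameters inflate the standard error, negative ones reduce it.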
A Monte Carlo study
1. Artificial data: population size 20,000,000; “sample size” for one of
the margins: 1,000; cell probabilities given, conforming to the
no-three-way-interaction model; size of the array of probabilities:
40 × 5 × 5
2. Compute theoretical standard errors
3. 200 replications:
a) Generate random numbers (multinomial distribution), compute
marginal counts and a “sample” of one marginal table
b) Compute IPF estimates of cell probabilities
4. Compare root mean square errors from the simulations with the
theoretical standard errors
[Figure 1: Comparison of simulated root mean square errors for cell
probabilities and theoretical standard errors (colors denote a 2D kernel
density estimate). Axes: theoretical standard errors for each cell; root
mean squared errors for each cell; both range from 0.0 to 3.5.]
The simulation shows that the theoretical standard errors and the RMSEs
are of the same order of magnitude, yet numerical instabilities lead to
deviations of the RMSE for some of the cells.
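The simulation side of such a study can be sketched in Python on a much smaller scale. Everything specific here is an assumption made for illustration: the 3 × 2 × 2 table, the parameter values, the sample sizes, the choice of the (j, k) table as the surveyed margin, and the raking step that makes the surveyed margin consistent with the known one-way margins. The sketch computes only the simulated RMSE per cell, not the theoretical standard errors.

```python
import math
import random
from collections import Counter
from itertools import product

I, J, K = 3, 2, 2  # deliberately tiny; the poster uses 40 x 5 x 5
CELLS = list(product(range(I), range(J), range(K)))


def margin(table, axes):
    """Collapse a {cell_tuple: value} table onto the given axes."""
    out = Counter()
    for cell, value in table.items():
        out[tuple(cell[a] for a in axes)] += value
    return out


def ipf(start, targets, cycles=200):
    """Rescale `start` repeatedly so that it matches each target margin."""
    table = dict(start)
    for _ in range(cycles):
        for axes, target in targets.items():
            current = margin(table, axes)
            for cell in table:
                key = tuple(cell[a] for a in axes)
                if current[key] > 0:
                    table[cell] *= target.get(key, 0.0) / current[key]
    return table


def true_probabilities():
    """Hypothetical cell probabilities without three-way interaction."""
    raw = {(i, j, k): math.exp(0.3 * ((i + j) % 2) - 0.2 * ((i + k) % 2)
                               + 0.4 * ((j + k) % 2))
           for i, j, k in CELLS}
    total = sum(raw.values())
    return {cell: value / total for cell, value in raw.items()}


def replicate(truth, n, n_hat, rng):
    """One replication: draw data, keep only margins, re-estimate the cells."""
    people = rng.choices(CELLS, weights=[truth[c] for c in CELLS], k=n)
    pop = {c: v / n for c, v in Counter(people).items()}
    survey = {c: v / n_hat for c, v in Counter(rng.sample(people, n_hat)).items()}
    # Assumption of this sketch: rake the surveyed (j, k) table to the known
    # one-way margins so that all three target margins are consistent.
    p_jk = ipf(margin(survey, (1, 2)),
               {(0,): margin(pop, (1,)), (1,): margin(pop, (2,))}, cycles=50)
    targets = {(0, 1): margin(pop, (0, 1)), (0, 2): margin(pop, (0, 2)),
               (1, 2): p_jk}
    uniform = {c: 1.0 / len(CELLS) for c in CELLS}
    return ipf(uniform, targets)


def monte_carlo_rmse(replications=50, n=5000, n_hat=500, seed=1):
    """Root mean square error of the IPF estimate for every cell."""
    rng = random.Random(seed)
    truth = true_probabilities()
    squared_error = {c: 0.0 for c in CELLS}
    for _ in range(replications):
        estimate = replicate(truth, n, n_hat, rng)
        for c in CELLS:
            squared_error[c] += (estimate[c] - truth[c]) ** 2
    return {c: math.sqrt(squared_error[c] / replications) for c in CELLS}


rmse = monte_carlo_rmse()
for cell, value in rmse.items():
    print(cell, round(value, 4))
```

In a full study, each RMSE would be set against the theoretical standard error of the same cell, as in step 4 of the list above.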
Application of EMax to the analysis of split-ticket
voting in New Zealand
Johnston and Pattie (2000) use data from New Zealand on split-ticket
voting to validate EMax estimates. In our application, we supply standard
errors, which are not given by Johnston and Pattie.
The numbers of split-ticket voters in individual voting districts are
estimated from the following known quantities:
1. the numbers $n_{ij+}$ of votes for candidates of parties $j$ in districts $i$,
2. the numbers $n_{i+k}$ of votes for party lists $k$ in districts $i$,
3. sample estimates $\hat{n}_{+jk}$ of the numbers of combinations of candidate
and list votes at the national level.
The following tables give examples of these data.
Table 1: List votes in New Zealand districts – extract
District              Labour  National  Alliance  NZ First  · · ·
Albany                 10271     13583      1967      1033  · · ·
Aoraki                 14413     10393      2881       992  · · ·
Auckland Central       13647      7747      2321       671  · · ·
Banks Peninsula        14018     12643      2844       788  · · ·
Bay of Plenty          11342     11350      1769      3178  · · ·
Christchurch Central   13407      8887      3369       880  · · ·
Christchurch East      15084      7816      3665       719  · · ·
Clutha-Southland        9182     12882      1883      1043  · · ·
Coromandel             12390     10747      2241      2406  · · ·
Dunedin North          15052      6427      3902       401  · · ·
...                      ...       ...       ...       ...
Table 2: Candidate votes in New Zealand districts – extract
District              Labour  National  Alliance  NZ First  · · ·
Albany                  8753     13701      3775       751  · · ·
Aoraki                 17415     10276      2031       705  · · ·
Auckland Central       12645      7360      6129         0  · · ·
Banks Peninsula        15475     14020      1474       510  · · ·
Bay of Plenty           8679     15781      1338      4185  · · ·
Christchurch Central   17229      7825      2690       641  · · ·
Christchurch East      18157      6995      2127       528  · · ·
Clutha-Southland        9218     15619      1049      1131  · · ·
Coromandel              3892     13432      1217      1237  · · ·
Dunedin North          18856      6161      1968       224  · · ·
...                      ...       ...       ...       ...
Table 3: Survey data on New Zealand list and candidate votes – extract
              Candidate votes
List votes    Labour  National  Alliance  NZ First  · · ·
National         197        26         6         7  · · ·
Labour            13       134        15        30  · · ·
NZ First           7        18        52         9  · · ·
Alliance           9        30         7        63  · · ·
...              ...       ...       ...       ...
On the basis of these data, the application of the EMax procedure results
in the following estimates:
Table 4: Predicted list and candidate votes for the district Albany
              Candidate votes
List votes    Labour  National  Alliance  NZ First  · · ·
National       11949      1122       234       786  · · ·
Labour           403      3584       341      2476  · · ·
NZ First         219       486      1224       809  · · ·
Alliance         140       360        88      3088  · · ·
...              ...       ...       ...       ...
[Figure 2: Comparison of predicted and observed straight-ticket votes
in New Zealand districts. Axes: real number of straight tickets;
predicted number of straight tickets; both range from 5000 to 30000.]
Software for EMax with standard errors
• Implemented in R
• Implementation in Stata is planned

