overdispersions $\omega_k$ are constrained to the range $(1, \infty)$, and so it is convenient to put a model on the inverses $1/\omega_k$, which fall in $(0, 1)$.) The sample size of the McCarty et al. dataset is large enough that this noninformative model works fine; in general, however, it would be more appropriate to model the $\omega_k$'s hierarchically also.
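We complete the Bayesian model with a noninformative uniform prior distribution for the hyperparameters $\mu_\alpha$, $\mu_\beta$, $\sigma_\alpha$, $\sigma_\beta$. The joint posterior density can then be written as
\[
p(\alpha, \beta, \omega, \mu_\alpha, \mu_\beta, \sigma_\alpha, \sigma_\beta \mid y) \propto \prod_{i=1}^{n} \prod_{k=1}^{K} \binom{y_{ik} + \xi_{ik} - 1}{\xi_{ik} - 1} \left(\frac{1}{\omega_k}\right)^{\xi_{ik}} \left(\frac{\omega_k - 1}{\omega_k}\right)^{y_{ik}} \prod_{i=1}^{n} \mathrm{N}(\alpha_i \mid \mu_\alpha, \sigma_\alpha^2) \prod_{k=1}^{K} \mathrm{N}(\beta_k \mid \mu_\beta, \sigma_\beta^2),
\]
where $\xi_{ik} = e^{\alpha_i + \beta_k}/(\omega_k - 1)$, from the definition of the negative binomial distribution.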
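As a minimal sketch of the likelihood term above, the following Python function evaluates the log of one negative binomial factor for a single count $y_{ik}$. The function name and the numerical values are illustrative assumptions, not part of the original analysis; the binomial coefficient is computed with log-gamma functions because $\xi_{ik}$ is generally not an integer.

```python
import math

def neg_binomial_logpmf(y, alpha_i, beta_k, omega_k):
    """Log of one likelihood factor: count y with mean exp(alpha_i + beta_k)
    and overdispersion omega_k > 1 (hypothetical helper, for illustration)."""
    if omega_k <= 1.0:
        raise ValueError("overdispersion omega_k must be greater than 1")
    # xi_ik = exp(alpha_i + beta_k) / (omega_k - 1)
    xi = math.exp(alpha_i + beta_k) / (omega_k - 1.0)
    # log C(y + xi - 1, xi - 1) = lgamma(y + xi) - lgamma(xi) - lgamma(y + 1)
    log_coef = math.lgamma(y + xi) - math.lgamma(xi) - math.lgamma(y + 1)
    return (log_coef
            - xi * math.log(omega_k)
            + y * math.log((omega_k - 1.0) / omega_k))

# Made-up parameter values, used only to check the parameterization.
alpha_i, beta_k, omega_k = 1.0, -0.5, 2.5
probs = [math.exp(neg_binomial_logpmf(y, alpha_i, beta_k, omega_k))
         for y in range(500)]
total = sum(probs)
mean = sum(y * p for y, p in enumerate(probs))
var = sum((y - mean) ** 2 * p for y, p in enumerate(probs))
```

Under this parameterization the probabilities sum to one, the mean is $e^{\alpha_i + \beta_k}$, and the variance is $\omega_k$ times the mean, which is why $\omega_k$ is read as an overdispersion.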
The model as given has a nonidentifiability. Any constant $C$ can be added to all the $\alpha_i$'s and subtracted from all the $\beta_k$'s, and the likelihood will remain unchanged (since it depends on these parameters only through sums of the form $\alpha_i + \beta_k$). If we also add $C$ to $\mu_\alpha$ and subtract $C$ from $\mu_\beta$, then the prior density is unchanged as well. It would be possible to identify the model by anchoring it at some arbitrary point (for example, setting $\mu_\alpha$ to zero), but we prefer to let all the parameters float, since including this redundancy can speed the Gibbs sampler computation (van Dyk and Meng, 2001).
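The invariance can be checked numerically. The sketch below is an illustration under assumed, made-up data and parameter values (the function names are hypothetical): it evaluates the negative binomial log-likelihood and the normal prior terms before and after shifting by a constant $C$.

```python
import math

def log_likelihood(y, alpha, beta, omega):
    """Sum of negative binomial log-probabilities over respondents i and groups k."""
    total = 0.0
    for i, a in enumerate(alpha):
        for k, (b, w) in enumerate(zip(beta, omega)):
            xi = math.exp(a + b) / (w - 1.0)
            total += (math.lgamma(y[i][k] + xi) - math.lgamma(xi)
                      - math.lgamma(y[i][k] + 1)
                      - xi * math.log(w)
                      + y[i][k] * math.log((w - 1.0) / w))
    return total

def normal_logpdf(x, mu, sigma):
    """Log density of N(x | mu, sigma^2)."""
    return -0.5 * math.log(2 * math.pi * sigma ** 2) - (x - mu) ** 2 / (2 * sigma ** 2)

# Made-up counts and parameters: 2 respondents, 3 groups.
y = [[0, 2, 1], [3, 0, 5]]
alpha, beta, omega = [5.5, 6.1], [-6.0, -5.2, -4.8], [1.5, 2.0, 3.2]
mu_alpha, mu_beta, sigma_alpha, sigma_beta = 5.8, -5.3, 0.5, 0.7

C = 0.7  # arbitrary shift
alpha_s = [a + C for a in alpha]   # add C to every alpha_i
beta_s = [b - C for b in beta]     # subtract C from every beta_k

lik_before = log_likelihood(y, alpha, beta, omega)
lik_after = log_likelihood(y, alpha_s, beta_s, omega)

prior_before = (sum(normal_logpdf(a, mu_alpha, sigma_alpha) for a in alpha)
                + sum(normal_logpdf(b, mu_beta, sigma_beta) for b in beta))
# The prior is preserved only if the hyperparameter means shift too.
prior_after = (sum(normal_logpdf(a, mu_alpha + C, sigma_alpha) for a in alpha_s)
               + sum(normal_logpdf(b, mu_beta - C, sigma_beta) for b in beta_s))
```

Both pairs of values agree up to floating-point rounding, so the posterior is flat along the direction of the shift.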
However, in summarizing the model we would like to identify the $\alpha$'s and $\beta$'s so that each $b_k = e^{\beta_k}$ represents the proportion of the links in the network that go to members of group $k$. We identify the model in this way by renormalizing the $b_k$'s for the rarest names (in the McCarty et al. survey, these are Jacqueline, Christina, and Nicole) so that they line up with their proportions in the general population. We renormalize to the rare names rather than to all 12 names because there is evidence that respondents have difficulty recalling all their acquaintances with common names (see Killworth et al., 2003, and also Section 4.2 below). Finally, since the rarest names asked about in our survey are female names, and people tend to know more persons of their own sex, we further adjust by adding half the discrepancy between a set of intermediately popular male and female names in our dataset.
This procedure is complicated but is our best attempt at an accurate normalization for the general
population (which is roughly half women and half men) given the particularities of the data we have at
hand. In the future, it would be desirable to gather data on a balanced set of rare female and male names.
The left panel of Figure 5 illustrates how, after renormalization, the rare names in the dataset have group
sizes equal to their proportions in the population. This specific procedure is designed for the recall problems
that exist in the McCarty et al. dataset. Researchers working with different datasets may have to develop
a procedure that is appropriate to their specific data.
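The renormalization described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' code: the $\beta_k$ values and population shares are made up, the keys `mid_female_*` and `mid_male_*` are placeholders for the intermediately popular names, and each log ratio is read as the log of the summed $e^{\beta_k}$ over a name set divided by that set's population share.

```python
import math

def log_ratio(beta, names, pop_share):
    """log( sum over the name set of exp(beta_k) / population share of the set )."""
    return math.log(sum(math.exp(beta[k]) for k in names) / pop_share)

def renormalization_constant(beta, rare_f, mid_m, mid_f, pop):
    """Return (C, C1, C2): rare-name adjustment plus half the male/female discrepancy."""
    c1 = log_ratio(beta, rare_f, pop["rare_f"])
    c2 = log_ratio(beta, mid_m, pop["mid_m"]) - log_ratio(beta, mid_f, pop["mid_f"])
    return c1 + 0.5 * c2, c1, c2

# One made-up simulation draw of beta (all numbers are hypothetical).
beta = {"Jacqueline": -6.9, "Christina": -6.2, "Nicole": -6.4,
        "mid_female_1": -5.6, "mid_female_2": -5.8,
        "mid_male_1": -5.3, "mid_male_2": -5.1}
rare_f = ["Jacqueline", "Christina", "Nicole"]
mid_f = ["mid_female_1", "mid_female_2"]
mid_m = ["mid_male_1", "mid_male_2"]
pop = {"rare_f": 0.004, "mid_m": 0.010, "mid_f": 0.006}  # assumed population shares

C, C1, C2 = renormalization_constant(beta, rare_f, mid_m, mid_f, pop)

# Subtracting C1 alone from each beta_k makes the rare names match their share exactly.
rare_after_c1 = sum(math.exp(beta[k] - C1) for k in rare_f)
```

Subtracting the first term alone aligns the rare female names with their population share; the second term then moves the scale by half the estimated gap between the male and female name sets. The constant would be recomputed for every simulation draw.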
In summary, for each simulation draw of the vector of model parameters, we define the constant
\[
C = C_1 + \tfrac{1}{2} C_2, \tag{6}
\]
where $C_1 = \log\left(\sum_{k \in G_1} e^{\beta_k} / P_{G_1}\right)$ adjusts for the rare girls' names, and $C_2 = \log\left(\sum_{k \in B_2} e^{\beta_k} / P_{B_2}\right) - \log\left(\sum_{k \in G_2} e^{\beta_k} / P_{G_2}\right)$ represents the difference between boys' and girls' names. In these expressions, $G_1$, $G_2$, and $B_2$ are the sets of rare girls' names (Jacqueline, Christina, and Nicole), somewhat popular girls'