Going beyond accuracy: estimating homophily in social networks using predictions

In online social networks, it is common to use predictions of node categories to estimate measures of homophily and other relational properties. However, online social network data often lacks basic demographic information about the nodes. Researchers must rely on predicted node attributes to estimate measures of homophily, but little is known about the validity of these measures. We show that estimating homophily in a network can be viewed as a dyadic prediction problem, and that homophily estimates are unbiased when dyad-level residuals sum to zero in the network. Node-level prediction models, such as the use of names to classify ethnicity or gender, do not generally have this property and can introduce large biases into homophily estimates. Bias occurs due to error autocorrelation along dyads. Importantly, node-level classification performance is not a reliable indicator of estimation accuracy for homophily. We compare estimation strategies that make predictions at the node and dyad levels, evaluating performance in different settings. We propose a novel "ego-alter" modeling approach that outperforms standard node and dyad classification strategies. While this paper focuses on homophily, results generalize to other relational measures which aggregate predictions along the dyads in a network. We conclude with suggestions for research designs to study homophily in online networks. Code for this paper is available at https://github.com/georgeberry/autocorr.

Figure 1: A depiction of how the models studied in this paper (node, dyad, and ego-alter) turn a simple network into a prediction task. The node model trains on each ground truth node using only ego features $X_i$. The dyad model trains on each ground truth dyad using features from both ego and alter, $X_i$, $X_j$. The ego-alter model fits an "ego model" predicting ego's category for each link from ego, and a second "alter model" for each link incoming to each alter. Both ego and alter models incorporate features of ego and alter, $X_i$ and $X_j$, for each prediction, producing a "family of predictions" for each node.
Online social networks present a particular challenge for understanding this fundamental aspect of networks: demographic and attitudinal information is often absent. The common strategy for addressing this is to predict demographics or attitudinal attributes [Cesare et al., 2017b, Messias et al., 2017] based on publicly available information such as names, profile photos, text, or other information [Barberá, 2016, Al Zamal et al., 2012, Hofstra and de Schipper, 2018, Choudhury, 2011, Messias et al., 2017]. This information is combined with ground truth labels (known values of the category of interest for a set of individuals), and a supervised learning classifier [Molina and Garip, 2019] is then used to predict the node category for all nodes in the network.
Although predicted node attributes are widely used to empirically measure homophily and other relational properties [Cesare et al., 2017b, Messias et al., 2017, Himelboim et al., 2016, Boutyline and Willer, 2017, Hobbs et al., 2016, Bakshy et al., 2015, Colleoni et al., 2014, Choudhury, 2011], there is a lack of theoretical methodological work investigating when and to what extent the predictions produce reasonable estimates [Berry et al., 2018]. The most common strategy is to choose a model which maximizes a node-level measure of classification performance. Because of the complexities of networks, this criterion is not sufficient for reliable estimation of relational measures such as homophily.
In this paper, we formalize the homophily estimation problem as a dyadic prediction problem. This expression clarifies the difficulty in using node-level predictions to draw larger inferences: node residuals are multiplied along edges, magnifying a node's residual in proportion to its degree, and opening the door to residual autocorrelation along dyads. Theoretically, we should expect correlated errors along edges due to latent homophily [DellaPosta et al., 2015], or the correlation of unobserved factors in the network. For example, name-based classifiers [Hofstra and de Schipper, 2018, Choudhury, 2011, Hobbs et al., 2016] have frequently been used for gender classification. If a name-based method codes "Leslie" as "woman" because this is more common in the population, yet for a specific network community the name "Leslie" tends to indicate "man", model errors will be correlated with the network and can bias overall gender homophily estimates. This issue is compounded by the highly skewed degree distributions of online networks [Kato et al., 2012], which introduce the additional possibility that misclassification for high degree nodes will bias the overall estimate.
We show that dyad-level predictions produce unbiased homophily estimates. However, such estimates are often high-variance for a given ground truth labeling budget. This motivates a two-step modeling procedure (called "ego-alter") which predicts the category of a node from the perspective of each one of its network neighbors. This allows incorporating network information beyond the ego level, while still using standard modeling tools such as logistic regression. This ego-alter approach is theoretically less biased than a node-level model, and across a range of simulations outperforms both node-level and dyad-level models in overall error. Figure 1 visually compares these three approaches. While we primarily study homophily with node demographics in mind, results here apply to a wide range of networked outcomes, such as estimating the fraction of people belonging to a certain race/ethnicity who experience hate speech in their social media feeds [Davidson et al., 2019].

Homophily Measure
We study the average fraction of ego networks composed of ingroup members (visualized in Figure 2). Average ego network composition has been extensively studied in sociology, primarily in research concerning the General Social Survey network module [Marsden, 1987, McPherson et al., 2006]. The average ego network composition measures what the network tends to look like from the perspective of members of a given group. For instance, Black respondents to the GSS have been found to have higher average racial heterogeneity in their core discussion networks than White respondents [Marsden, 1987].
Figure 2: We use average ego network composition to capture homophily from the perspective of the dark blue nodes. We assume node categories (light and dark blue) must be predicted with a model. This estimand is expressed analytically in Equation 2.
Average egonet composition can be written as a sum over ego networks, taking into account the category of both ego and alter. Let $Y_i$ indicate the category of $i$; for instance, in the case of racial homophily, category $a$ may indicate Black, category $b$ may indicate White, and so on. Without loss of generality, assume that we are studying a binary outcome where $Y_i \in \{a, b\}$. For compactness, we write $Y^a_i$ to denote $\mathbb{1}_{Y_i = a}$. Then the average fraction of group $a$'s ego networks which are composed of alters in group $a$ (Figure 2) can be written,
$$H_{aa} = \frac{1}{T[Y^a_i]} \sum_{i \in V} Y^a_i \frac{1}{D_i} \sum_{j \in N(i)} Y^a_j, \tag{1}$$
where $D_i$ indicates the degree of node $i$, $N(i)$ is a function which returns the neighbors of $i$, and $T[Y^a_i]$ indicates the size of group $a$, $\sum_{i \in V} Y^a_i$. For example, if $H_{aa} = 0.7$, it means that an average ego network for group $a$ is composed of 70% ingroup members.
Note that Equation 1 can be re-written as a sum over dyads in the network by rearranging the summation,
$$H_{aa} = \frac{1}{T[Y^a_i]} \sum_{(i,j) \in E} \frac{1}{D_i} Y^a_i Y^a_j,$$
where $E$ are edges in graph $G$. Rewriting the edge-level outcome $Y^a_i Y^a_j$ as a single random variable $Y^{aa}_{ij}$ provides an expression in terms of edge categories,
$$H_{aa} = \frac{1}{T[Y^a_i]} \sum_{(i,j) \in E} \frac{1}{D_i} Y^{aa}_{ij}. \tag{2}$$
Estimating $H_{aa}$ can therefore be considered either a node or dyadic prediction task. Note that in the node case, predictions are multiplied.
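The equivalence between the node form and the dyad form can be checked numerically. The following is a minimal sketch on a made-up five-node network; the graph, the labels, and the function names are illustrative assumptions and do not come from the paper's code.

```python
from collections import defaultdict

# Toy undirected graph; each link is stored in both directions (bidirected).
undirected = [(0, 1), (0, 2), (1, 2), (2, 3), (3, 4)]
edges = undirected + [(j, i) for i, j in undirected]

# Indicator Y^a_i: 1 if node i belongs to group a, else 0 (made-up labels).
y_a = {0: 1, 1: 1, 2: 1, 3: 0, 4: 0}

neighbors = defaultdict(list)
for i, j in edges:
    neighbors[i].append(j)
degree = {i: len(js) for i, js in neighbors.items()}
group_size = sum(y_a.values())  # T[Y^a_i], the size of group a


def h_aa_node_form():
    """Sum over egos of the ingroup fraction of each group-a ego network."""
    total = 0.0
    for i, js in neighbors.items():
        total += y_a[i] * sum(y_a[j] for j in js) / degree[i]
    return total / group_size


def h_aa_dyad_form():
    """The same quantity as one sum over directed dyads, weighted by 1/D_i."""
    return sum(y_a[i] * y_a[j] / degree[i] for i, j in edges) / group_size


print(h_aa_node_form(), h_aa_dyad_form())
```

Nodes 0 and 1 have only ingroup neighbors and node 2 has two ingroup neighbors out of three, so both forms return (1 + 1 + 2/3) / 3 = 8/9.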

Dyadic regression as an unbiased estimator
Equation 2 shows how homophily can be estimated with knowledge of edge categories $Y^{aa}_{ij}$. Assume edge categories $Y^{aa}_{ij}$ are obtained for a random sample of the edges $S$, and features correlated with edge categories $X_{ij}$ are available both for the sample $S$ and the population $P$. Assuming random sampling simplifies the argument, and we discuss deviations from this assumption in the Appendix.
Assume a model predicting $Y^{aa}_{ij}$ is chosen which has the property that the residuals sum to zero in the population and are uncorrelated with features, $\sum_{ij} e_{ij} = 0$ and $\mathrm{Cov}(e_{ij}, X_{ij}) = 0$. Then this model trained on the sample $S$ provides an unbiased estimate of homophily in the population, given that $\frac{1}{D_i}$ is included in $X_{ij}$ as a feature.
The reason for including $\frac{1}{D_i}$ as a variable can be seen by the following argument. First, recall the conditional expectation function (CEF) expression [Angrist and Pischke, 2009]:
$$Y^{aa}_{ij} = E[Y^{aa}_{ij} \mid X_{ij}] + e_{ij},$$
where $E[Y^{aa}_{ij} \mid X_{ij}]$ can be estimated with a model such as OLS. An estimator for Equation 2 can be written in terms of predictions,
$$\hat{H}_{aa} = \frac{1}{T[Y^a_i]} \sum_{(i,j) \in E} \frac{1}{D_i} E[Y^{aa}_{ij} \mid X_{ij}]. \tag{3}$$
We now need to examine when using model predictions in Equation 3 provides an answer equal to Equation 2 in expectation. This can be done by substituting the CEF into Equation 3 to obtain,
$$\hat{H}_{aa} = \frac{1}{T[Y^a_i]} \sum_{(i,j) \in E} \frac{1}{D_i} \left( Y^{aa}_{ij} - e_{ij} \right) = H_{aa} - \frac{1}{T[Y^a_i]} \sum_{(i,j) \in E} \frac{1}{D_i} e_{ij}. \tag{4}$$
This indicates that the homophily estimate will be unbiased when the sum of residuals term is zero.
Since we assumed a model is used where $\sum_{ij} e_{ij} = 0$ and $\mathrm{Cov}(e_{ij}, X_{ij}) = 0$, if $\frac{1}{D_i}$ is included in $X_{ij}$ then $E\left[\sum_{ij} e_{ij} \frac{1}{D_i}\right] = 0$, and the expectation of the estimate equals the true value, $E[\hat{H}_{aa}] = H_{aa}$.
This argument concerns model residuals, not error terms. No assumptions have been made about causality, true functional form, or predictive accuracy. With a random edge sample and OLS, $\frac{1}{D_i}$ is the only required variable for an unbiased estimate, although this can produce a high variance estimate. Including robust predictive features is therefore still important for reducing variance and for addressing cases of non-random sampling.
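The role of inverse degree can be illustrated with a small in-sample check. The sketch below fits OLS with an intercept and the single feature $1/D_i$ to simulated dyads, then confirms that the degree-weighted residual sum vanishes on the data used for fitting. The data-generating rule and all variable names are assumptions made for illustration, and the check only shows the algebraic property on the fitted data, not unbiasedness under sampling.

```python
import random

random.seed(0)

# Hypothetical dyads: (ego degree D_i, edge outcome Y^aa_ij in {0, 1}).
dyads = []
for _ in range(200):
    d = random.randint(1, 20)
    # Arbitrary illustrative rule: the outcome probability depends on degree.
    outcome = 1 if random.random() < 0.3 + 0.5 / d else 0
    dyads.append((d, outcome))

x = [1.0 / d for d, _ in dyads]   # the feature 1/D_i
y = [float(v) for _, v in dyads]  # the edge category Y^aa_ij
n = len(dyads)

# Closed-form OLS for y = b0 + b1 * x.
x_bar, y_bar = sum(x) / n, sum(y) / n
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar

pred = [b0 + b1 * xi for xi in x]
resid = [yi - pi for yi, pi in zip(y, pred)]

# OLS residuals are orthogonal to every included regressor, so both the plain
# residual sum and the 1/D_i-weighted residual sum are zero up to rounding.
weighted_resid_sum = sum(xi * ei for xi, ei in zip(x, resid))
true_numerator = sum(xi * yi for xi, yi in zip(x, y))
pred_numerator = sum(xi * pi for xi, pi in zip(x, pred))

print(sum(resid), weighted_resid_sum, true_numerator - pred_numerator)
```

Because the weighted residual sum is zero, the degree-weighted sum of predictions matches the degree-weighted sum of true edge categories, which is the numerator of the homophily measure.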

Approximating dyadic regression with node-level data
Sampling and labeling limitations often make collecting a large set of ground truth dyads infeasible. In this situation, node-level ground truth information can be employed to estimate homophily. We present a two-step modeling strategy which we term "ego-alter" which uses only node-level ground truth information, reduces bias over a standard node-level model, and reduces variance compared to the edge model in Section 3. The ego-alter approach is biased, although the magnitude of bias in simulations we examine below is generally substantially less than that of a node-level model.
The ego-alter model is a hybrid approach: it uses dyadic features $X_{ij}$ to predict the node-level ground truth $Y_i$ and $Y_j$ separately, producing one prediction per edge for both ego and alter (see Figure 1 for a visual representation). This has the benefit of reducing bias in two ways: first, predictions for $Y_i$ and $Y_j$ are improved by including a richer set of features; second, dyadic residual autocorrelation is reduced.
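Since $Y^{aa}_{ij} = Y^a_i Y^a_j$, $H_{aa}$ can be estimated with the product of node predictions,
$$\begin{aligned} \hat{H}_{aa} &= \frac{1}{T[Y^a_i]} \sum_{(i,j) \in E} \frac{1}{D_i} E[Y^a_i \mid X_{ij}] \, E[Y^a_j \mid X_{ij}] \\ &= \frac{1}{T[Y^a_i]} \sum_{(i,j) \in E} \frac{1}{D_i} \left( Y^a_i - e^a_i \right) \left( Y^a_j - e^a_j \right), \end{aligned}$$
where the second line follows by substituting the CEF. This can be expressed as the true homophily value plus two bias terms,
$$\hat{H}_{aa} = H_{aa} + R_1 + R_2, \tag{5}$$
with
$$R_1 = -\frac{1}{T[Y^a_i]} \sum_{(i,j) \in E} \frac{1}{D_i} e^a_i Y^a_j, \qquad R_2 = -\frac{1}{T[Y^a_i]} \sum_{(i,j) \in E} \frac{1}{D_i} E[Y^a_i \mid X_{ij}] \, e^a_j.$$
The bias terms $R_1$ and $R_2$ both indicate dyadic correlation of the model residuals with neighbor outcomes. Assuming $\frac{1}{D_i}$ is included as a model feature, $R_1$ and $R_2$ can be thought of similarly: when model residuals are correlated with neighbor outcomes, the terms will be non-zero. This can happen when models produce pockets of similar errors in the network due to unobserved, network-correlated features. When inverse ego degree $\frac{1}{D_i}$ is not included as a model feature, the bias terms become substantially more complex because of the interaction between degree, errors by degree, and errors along dyads.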
Equation 5 indicates that the estimate equals its true value when $R_1 = R_2 = 0$ or when $R_1 = -R_2$. Note that $R_1 = -R_2$ is unlikely because residuals for $e_i$ and $e_j$ have similar correlations with neighbor true outcomes $Y_j$ and $Y_i$, since both ego and alter models score the entire network.
Note that $R_2$ is the result of combining two terms, since under the CEF $E[Y^a_i \mid X_{ij}] \, e^a_j = Y^a_i e^a_j - e^a_i e^a_j$. This suggests an "augmented" ego-alter model: first, fit an ego model for $i$, and then fit an alter model for $j$ which includes the ego predictions $E[Y^a_i \mid X_{ij}]$ as a feature. This, in expectation, eliminates the $R_2$ term and reduces bias assuming $R_1$ and $R_2$ have the same signs.
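The basic (non-augmented) ego-alter procedure can be sketched in a few dozen lines. The example below is an illustration under stated assumptions, not the authors' implementation: the toy network, the features, the hand-rolled gradient-descent logistic regression, and the choice to estimate the size of group $a$ from the ego predictions are all hypothetical.

```python
import math
import random

random.seed(1)


def fit_logistic(rows, targets, steps=500, lr=0.1):
    """Plain gradient-descent logistic regression; an intercept is added."""
    w = [0.0] * (len(rows[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(w)
        for r, t in zip(rows, targets):
            p = predict(w, r)
            grad[0] += p - t
            for k, rk in enumerate(r):
                grad[k + 1] += (p - t) * rk
        w = [wk - lr * g / len(rows) for wk, g in zip(w, grad)]
    return w


def predict(w, r):
    z = w[0] + sum(wk * rk for wk, rk in zip(w[1:], r))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical toy network: a ring lattice plus a few random chords.
n = 60
links = {(i, (i + 1) % n) for i in range(n)} | {(i, (i + 2) % n) for i in range(n)}
while len(links) < 150:
    i, j = random.sample(range(n), 2)
    if (j, i) not in links:
        links.add((i, j))
edges = list(links) + [(j, i) for i, j in links]  # links are bidirected

degree = [0] * n
for i, _ in edges:
    degree[i] += 1

x = [random.gauss(0.0, 1.0) for _ in range(n)]                 # node feature X_i
y = [1 if xi + random.gauss(0.0, 1.0) > 0 else 0 for xi in x]  # 1 means group a
labeled = set(random.sample(range(n), 30))                     # node-level ground truth


def features(i, j):
    """Dyadic features X_ij: ego feature, alter feature, inverse ego degree."""
    return [x[i], x[j], 1.0 / degree[i]]


# Ego model: one training row per edge leaving a labeled ego, target Y_i.
ego_rows = [(features(i, j), y[i]) for i, j in edges if i in labeled]
# Alter model: one training row per edge entering a labeled alter, target Y_j.
alter_rows = [(features(i, j), y[j]) for i, j in edges if j in labeled]
w_ego = fit_logistic([r for r, _ in ego_rows], [t for _, t in ego_rows])
w_alter = fit_logistic([r for r, _ in alter_rows], [t for _, t in alter_rows])

# Score every edge with both models and multiply the two predictions.
numerator, size_a = 0.0, 0.0
for i, j in edges:
    f = features(i, j)
    p_ego, p_alter = predict(w_ego, f), predict(w_alter, f)
    numerator += p_ego * p_alter / degree[i]
    size_a += p_ego / degree[i]  # assumed estimate of the size of group a
h_aa_hat = numerator / size_a
print(h_aa_hat)
```

The augmented variant described above would add the ego prediction as a fourth feature when fitting and scoring the alter model.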

Simulation
We use simulations to evaluate the effectiveness of dyadic regression and ego-alter regression for estimating average egonet composition (Equation 2; Figure 2). For an outcome $Y_i$ which takes on values $\{a, b\}$, the probability of $Y_i = a$ is simulated as a function of $X_i$ and $Z_i$, which represent individual and network-correlated features, respectively. The individual-level feature is drawn from a normal distribution, $X_i \sim N(0, 1)$, while the network feature is the maximum of the individual feature among the neighbors of each node $i$: $Z_i = \max_{j \in N(i)}(X_j)$. $Z_i$ is then standardized to follow a normal distribution. This creates outcomes correlated along some dyads in the network, where nodes with large values of $X_i$ "influence" neighbors. If $Z_i$ is omitted, model residuals will be correlated along dyads and bias homophily estimates. The level of homophily generated by these parameters is moderate: the average ego network for group $a$ contains 59% of nodes in group $a$ ($H_{aa} = 0.59$), while Coleman's homophily index for group $a$ is 0.14. We choose $Z_i$ to be the maximum $X_j$ value among the alters $j$ of ego $i$ to provide a challenging setting for models: the true response is determined at the ego-network level yet models operate at the node or dyad levels, meaning a dyadic regression cannot capture the true functional form of the data generating process. This both approximates real-world scenarios where the data generating process is unknown, and demonstrates the argument about bias in Section 3. Five alternative simulation specifications are examined in the Appendix, with qualitatively similar results to this simulation.
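The construction of the network-correlated feature can be sketched as follows. This is a minimal illustration on a made-up five-node graph; interpreting "standardized" as a plain z-score is an assumption, and the adjacency list and variable names are hypothetical.

```python
import random
import statistics

random.seed(2)

# Hypothetical adjacency list for a small undirected network.
neighbors = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2, 4], 4: [3]}

# Individual-level feature X_i drawn from a standard normal distribution.
x = {i: random.gauss(0.0, 1.0) for i in neighbors}

# Network feature: the maximum X_j among the neighbors j of each node i.
z_raw = {i: max(x[j] for j in js) for i, js in neighbors.items()}

# Assumed standardization: subtract the mean and divide by the standard deviation.
values = list(z_raw.values())
mu = statistics.mean(values)
sd = statistics.pstdev(values)
z = {i: (v - mu) / sd for i, v in z_raw.items()}

print(z)
```

Node 4 has a single neighbor, so its raw network feature is simply the individual feature of node 3, which shows how one node with a large $X_i$ can shape the feature values of everyone around it.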
Networks with 4000 nodes are generated according to a preferential attachment graph [Barabasi and Albert, 1999] with five links per node and a power-law exponent $k = 0.8$. Links are considered bidirected. Preferential attachment graphs have high degree disparities, providing a challenging setting for the estimation task considered here, since model errors on individual high degree nodes can bias estimates. We conduct simulations for both node and edge sampling, selecting 20% of nodes or 2.5% of edges randomly as ground truth cases. This produces roughly 800 ground truth nodes for both types of sampling. Note that sampling nodes into the ground truth set provides some ground truth dyads (and vice versa), meaning both node and dyad models can be fit with either type of sampling.
Using the ground truth sample to estimate a model, we classify all edges and estimate homophily across 500 simulation runs. Model performance is estimated in two ways: bias and absolute error. Bias is the average of $(\hat{H}_{aa} - H_{aa}) / H_{aa}$ across all simulation runs, and represents the systematic deviation from the true value. Absolute error is the average absolute error relative to the true underlying value, or the average of $|\hat{H}_{aa} - H_{aa}| / H_{aa}$ across all simulation runs. It captures how far estimates tend to be from the true value.
Since both absolute error and bias are normalized, they have the interpretation of "percent error." The bias-variance tradeoff means that we should not expect the method with the lowest bias to also have the lowest absolute error.
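The two performance measures are simple to compute once per-run estimates are available. The sketch below uses made-up estimates around a true value of 0.59; the function names and numbers are illustrative assumptions.

```python
def relative_bias(estimates, truths):
    """Average of (estimate - truth) / truth across simulation runs."""
    return sum((e - t) / t for e, t in zip(estimates, truths)) / len(estimates)


def relative_abs_error(estimates, truths):
    """Average of |estimate - truth| / truth across simulation runs."""
    return sum(abs(e - t) / t for e, t in zip(estimates, truths)) / len(estimates)


# Hypothetical runs: errors of equal size and opposite sign around H_aa = 0.59.
estimates = [0.49, 0.69]
truths = [0.59, 0.59]

bias = relative_bias(estimates, truths)
abs_error = relative_abs_error(estimates, truths)
print(bias, abs_error)
```

Here the bias is zero up to rounding while the absolute error is about 17%, which illustrates why an unbiased method need not have the lowest absolute error.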
We evaluate three types of models: node, dyad, and ego-alter (see Figure 1 for a depiction), all fit with logistic regression. For the node model, we examine models with and without network features. For the ego-alter model, we examine both the basic version and the "augmented" version. This gives a total of 5 models, which are described here in terms of their regression equations, where $f(X_i, X_j)$ indicates a main-effects linear model. The notation $Y^a_i(j)$ indicates that we predict $i$'s category for each neighbor $j$ separately, using features of both $i$ and $j$ in the prediction. Ego and alter degree are included in models because they tend to reduce the bias and variance of estimates and are available to researchers conducting network studies.
Figure 3: The bias and absolute error of homophily estimates using five different models, for random node and random edge sampling. Node level models without network variables display large biases in the presence of network-correlated unobserved features. Including network information reduces this bias, and using edge or ego-alter models reduces this bias further. Note that while dyadic regression is unbiased, it does not provide the lowest error estimates. Since roughly similar numbers of nodes are sampled in both edge and node sampling, edge sampling is more efficient.
As shown in Figure 3, the default approach of using a node-level classifier with no network features performs poorly. Homophily is underestimated by between 10% and 20%, with average absolute error of about the same magnitude. Even when accounting for the inverse degree term $\frac{1}{D_i}$, the node-level approach still underestimates homophily by around 3%. This large reduction in bias indicates the importance of including network information in the model predicting node categories, while the remaining bias indicates the limitations of a node-level approach in the presence of network-correlated outcomes.
In this simulation, homophily is under-estimated. This indicates that errors tend to be positively correlated along dyads, increasing the sum of residuals term in Equation 4 and reducing the overall homophily estimate. In other words, there are pockets of the network where the model errors are similar. An alternative scenario exists where a model produces negatively correlated dyadic errors and an over-estimate of homophily. An instance where this happens is residual-degree correlation in the network. When high degree nodes have positive residuals and low degree nodes have negative residuals, the overall residual term in Equation 4 can be negative and cause an over-estimate of homophily.
A dyadic model produces an unbiased estimate of homophily, in accordance with the argument in Section 3. However, the dyadic approach does not produce the lowest absolute error. Despite a small amount of bias (around 1%), the ego-alter approach produces lower absolute error on average than the dyadic approach. In alignment with the theoretical argument in Section 4, including ego predictions in the alter model reduces bias by about 20% on average. While this simulation uses random sampling, the ego-alter model is also more robust to deviations from random sampling compared to other methods, as shown in the Appendix. In the presence of non-random edge samples, an edge model can be brittle. One potential corrective is weighting the ground truth data, but the often high-dimensional nature of edge features risks large design effects due to the curse of dimensionality [Iacus et al., 2012]. Additionally, a "meta-network-correlation" problem can arise, where errors in weights are network correlated.
A more practical approach is to employ a modeling strategy such as ego-alter which can more flexibly learn class probabilities in a network-aware way. While we intentionally restrict models here to logistic regression with only main effects, more flexible functional forms can also be employed to better approximate class probabilities within subgroups.

Node-level performance and network-level estimands
Machine learning models are usually evaluated on observation-level performance metrics such as precision, recall, and area under the curve (AUC). When using predictions to estimate an aggregate such as homophily, strong observation-level performance is encouraging but not sufficient for high-quality estimates of the aggregate. An error-free model will by definition produce a perfect estimate of homophily, but even models with strong out-of-sample observation-level performance can make dyad-autocorrelated errors that bias homophily estimates.
This can be seen clearly in Figure 4, which plots model performance against bias in estimating homophily. Models differ only slightly on traditional performance measures, yet produce large differences in homophily bias. The best model's AUC is 0.8% better than the worst model's, yet it reduces bias by 95% (worst: 17.6% bias; best: 0.8% bias).
Note that a meta-analysis of research on demographic classification on social media [Cesare et al., 2017a] found a median accuracy of 0.81 for predicting race/ethnicity, while simulations presented here have an average accuracy of around 0.77. This indicates that similar biases may be present with the type of classification performance found in real-world tasks.

Practical advice
Figure 4: Homophily bias compared to node-level performance. Models with similar node-level classification performance produce different levels of bias when estimating homophily. The three models in Figure 4 have average AUCs between 0.846 and 0.853, yet produce average biases ranging from 0.8% to 17.6%. This demonstrates that observation-level classification performance and estimation of relational measures are distinct tasks.
When studying homophily in online communities, researchers can potentially improve the quality of estimates in five ways: including network information in models, using the ego-alter modeling strategy, improving model flexibility, sampling edges, and using cross-validation to check for the presence of network-residual correlation. First and most importantly, network information should be incorporated into prediction models. Evidence from Sections 5 and 8.1 indicates the single largest improvement in model performance comes from including degree information ($\frac{1}{D_i}$) in models. The specific information to include is dependent on the estimand, and can even extend to behavioral information when outcomes such as political affiliation are studied in networks. We give an example of applying the process from Section 3 to new estimands in the Appendix (Sections 8.2 and 8.3).
Second, the simulation results consistently demonstrate that the ego-alter modeling strategy performs well both in terms of bias and absolute error. This is true even in the presence of a non-random ground truth sample. Since the ego-alter strategy is new, we recommend that researchers present results from both a node-level model and an ego-alter model, with network information included for both models.
Third, and closely related to the choice of modeling strategy, is the choice of model itself: a logistic regression with only main effects in the presence of a non-random sample can produce large biases, as seen with the edge model and non-random edge sampling in the Appendix (Table 1). A more flexible model can mitigate this by better learning conditional class probabilities, although the performance will depend on having sufficient ground truth data.
Fourth, edges should be sampled instead of nodes when possible. A consistent finding of our simulations is that for the same labeling budget, a random edge sample outperforms a random node sample in terms of absolute error. In practical settings, such as Twitter, it is often much easier to randomly sample nodes than edges. One strategy for edge sampling is to use a result from the respondent driven sampling literature [Salganik and Heckathorn, 2004] that a random walk through an undirected network approximates an edge sample (see [Berry et al., 2018] for a discussion in the context of online networks). While this may not be feasible in some research settings, researchers may want to consider edge sampling if a random walk approach is possible.
Finally, researchers can obtain an estimate of network residual correlation by using cross-validation (see the discussion in [Molina and Garip, 2019] for a brief introduction to cross-validation; see Chapter 7 of [Hastie et al., 2008] for a more extensive discussion). Cross-validation splits the training data into a number of folds (usually 5 or 10), and uses all but one fold to train a model, with the held-out fold used to evaluate the model. This proceeds in a round-robin fashion so that the entire training set is scored in a way approximating out-of-sample prediction. In the context of homophily estimation, estimating the residual term in Equation 4 can provide important information about network residual correlation. This can be accomplished in a cross-validation setting by dividing up all dyads in the training set into folds, and performing cross-validation on the ground truth dyads. If $\sum_{(i,j) \in S_{train}} \frac{1}{D_i} e_{ij} \neq 0$, where $S_{train}$ is the training set, then models may need adjustment before providing reliable estimates of homophily. This strategy does not ensure unbiased homophily estimates, particularly in the presence of non-random ground truth sampling, but it does provide a useful diagnostic.
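The cross-validation diagnostic can be sketched as follows. This is a minimal illustration that assumes simulated ground truth dyads, a one-feature OLS model, and 5 folds; none of these choices come from the paper's code, and in practice the fitted model would be whichever dyad-level model is under evaluation.

```python
import random

random.seed(3)

# Hypothetical ground truth dyads: (1/D_i, Y^aa_ij). Both columns are simulated.
dyads = []
for _ in range(300):
    d = random.randint(1, 25)
    outcome = 1.0 if random.random() < 0.25 + 0.5 / d else 0.0
    dyads.append((1.0 / d, outcome))


def fit_ols(rows):
    """Closed-form OLS of the outcome on an intercept and one feature."""
    n = len(rows)
    x_bar = sum(x for x, _ in rows) / n
    y_bar = sum(y for _, y in rows) / n
    sxx = sum((x - x_bar) ** 2 for x, _ in rows)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in rows)
    b1 = sxy / sxx
    return y_bar - b1 * x_bar, b1


k = 5
order = list(range(len(dyads)))
random.shuffle(order)
folds = [order[f::k] for f in range(k)]  # each dyad lands in exactly one fold

weighted_residuals = {}
for held_out in folds:
    held = set(held_out)
    train = [dyads[t] for t in order if t not in held]
    b0, b1 = fit_ols(train)
    for t in held_out:
        inv_deg, outcome = dyads[t]
        # Out-of-fold residual e_ij, weighted by inverse ego degree.
        weighted_residuals[t] = inv_deg * (outcome - (b0 + b1 * inv_deg))

diagnostic = sum(weighted_residuals.values())
print(diagnostic)
```

A value far from zero relative to the degree-weighted sum of outcomes would suggest that residuals are correlated with network position and that the model may need additional network features.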

Conclusion
We have examined the problem of estimating homophily when predictions must be used for node attributes. While the problem is challenging, the results we present indicate that homophily can be studied in online networks when classification performance is strong and network information is incorporated directly into models.
The strategies outlined here also provide a pathway for the measurement of other network-level properties. Examples are triadic properties, such as social closure by demographic group. In studies of dynamic network processes such as contagion, models to reduce measurement error [Berry et al., 2019] may benefit from the results here. In the case of signed or multiplex networks, the distribution of different types of edges across groups may be important. Similarly to homophily estimation, consideration of how model errors intersect with graph properties is important for reliable use of predictions in networks.

Example extensions: Coleman's homophily index
We studied average egonet composition in the main text, but another popular measure of homophily is Coleman's homophily index [Coleman, 1958]. This measure studies the fraction of within-group links from the perspective of a certain group, relative to the proportion expected by chance.
The challenge is estimating the proportion of within-group links from the perspective of a given group $a$. This can be done in a manner similar to Equation 2,
$$\frac{\sum_{(i,j) \in E} Y^{aa}_{ij}}{\sum_{(i,j) \in E} Y^a_i}.$$
This turns out to be a simpler version of the egonet estimand considered in the main text, and can be addressed with similar modeling strategies.
