9 Network Centrality and Hierarchy
This tutorial walks through the analysis of centrality and hierarchy in R. We will begin by looking at various centrality measures, to determine how they are interrelated and to discern what they mean. In this way, we develop different metrics for a node's position in the network. We will then explore network-level measures of hierarchy, moving away from individual-level positioning to look at the structure of the whole network. This tutorial builds directly on the material from previous tutorials, most clearly the network measurement tutorial (Chapter 3) and the dyad/triad tutorial (Chapter 7).
9.1 Setting up the Session
We will work primarily with the igraph package for this tutorial.
library(igraph)
This tutorial uses classroom network data collected by Daniel McFarland. The class is a biology 2 class at a public high school. We will focus on two network relations, one based on social interaction (i talks in a social way with j) and another based on task-based interactions (i actively engages in a task with j). Let's go ahead and read in the network data for the social relation (read in from a URL).
url1 <- "https://github.com/JeffreyAlanSmith/Integrated_Network_Science/raw/master/data/social_interactions_s641.csv"
social_data <- read.csv(file = url1)
Looking at the first six rows:
head(social_data)
## ego alter social_tie
## 1 1 1 0.000
## 2 1 2 0.000
## 3 1 3 0.000
## 4 1 4 0.000
## 5 1 5 5.625
## 6 1 6 1.500
We now have a data frame called social_data. The first column is the ego, the second column is the alter, and the third column shows the frequency of social interactions between the two students. We will reduce the data frame to just include those dyads where social interaction occurred:
edgelist_social <- social_data[social_data$social_tie > 0, ]
head(edgelist_social)
## ego alter social_tie
## 5 1 5 5.625
## 6 1 6 1.500
## 22 1 22 1.875
## 44 2 22 0.375
## 74 4 8 1.875
## 89 5 1 5.250
Now we can go ahead and create our igraph object based on the edgelist defined above. The size of the network is 22, so we will set the vertices input to define the ids of the nodes, running from 1 to 22.
s641_social <- graph_from_data_frame(d = edgelist_social, directed = T,
                                     vertices = (id = 1:22))
Note that if we did not want the isolates included we could have done:
net641_social_noisolates <- graph_from_data_frame(d = edgelist_social,
                                                  directed = T)
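To see what the vertices argument changes, here is a minimal sketch on a made-up five-node edgelist (the toy_edges data frame and the other object names are hypothetical, not part of the classroom data). Node 3 sends and receives no ties, so it only appears in the graph when the full list of ids is supplied.

```r
library(igraph)

# Hypothetical toy edgelist: node 3 never appears as ego or alter.
toy_edges <- data.frame(ego = c(1, 2, 2, 5),
                        alter = c(2, 1, 4, 4))

# Supplying all ids keeps the isolate in the graph.
toy_with_isolates <- graph_from_data_frame(d = toy_edges, directed = TRUE,
                                           vertices = data.frame(id = 1:5))

# Omitting vertices builds the node set from the edgelist alone.
toy_no_isolates <- graph_from_data_frame(d = toy_edges, directed = TRUE)

vcount(toy_with_isolates)
vcount(toy_no_isolates)
```

With the isolate retained, node-level summaries are computed over every listed node rather than only those with at least one tie.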
And now we read in the task data:
url2 <- "https://github.com/JeffreyAlanSmith/Integrated_Network_Science/raw/master/data/task_interactions_s641.csv"
task_data <- read.csv(file = url2)
head(task_data)
## ego alter task_tie
## 1 1 1 0
## 2 1 2 0
## 3 1 3 0
## 4 1 4 0
## 5 1 5 0
## 6 1 6 0
The task_tie variable shows the frequency of task-based interactions between nodes i and j. We will now reduce the data to just those dyads where a task interaction occurred and create the igraph object. We will treat the task network as undirected because, if i does a task with j, then j does a task with i. We will thus reduce the edgelist so that each edge is only listed once (accomplished by reducing the edgelist to rows where the ego id is smaller than the alter id).
edgelist_task <- task_data[task_data$task_tie > 0, ]
edgelist_task <- edgelist_task[edgelist_task$ego < edgelist_task$alter, ]
s641_task <- graph_from_data_frame(d = edgelist_task, directed = F,
                                   vertices = (id = 1:22))
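The ego < alter filter assumes that every task tie is recorded in both directions, so that dropping one direction loses nothing. Below is a minimal sketch of that step on a hypothetical symmetric edgelist (toy_task is made up for illustration), including a check of the symmetry assumption before filtering.

```r
# Hypothetical symmetric edgelist: each tie is listed as (i, j) and (j, i).
toy_task <- data.frame(ego = c(1, 2, 1, 3, 2, 3),
                       alter = c(2, 1, 3, 1, 3, 2),
                       task_tie = c(2, 2, 1, 1, 4, 4))

# Check the assumption: every (ego, alter) row has a matching (alter, ego) row.
forward <- paste(toy_task$ego, toy_task$alter)
reverse <- paste(toy_task$alter, toy_task$ego)
is_symmetric <- all(forward %in% reverse)

# Keep each undirected edge once.
toy_task_once <- toy_task[toy_task$ego < toy_task$alter, ]

is_symmetric
nrow(toy_task_once)
```

If the check returned FALSE, the filter would silently drop any tie recorded only from the higher id to the lower id, so it is worth running before reducing the edgelist.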
We will now plot both networks.
par(mfrow = c(1, 2))
plot(s641_social, vertex.frame.color = NA, edge.arrow.size = .25,
vertex.size = 8, main = "Social Interactions", margin = -.08)
plot(s641_task, vertex.frame.color = NA, edge.arrow.size = .25,
vertex.size = 8, main = "Task Interactions", margin = -.08)
From the figure alone we can see that the features of these two networks are very different. The task network would appear to have one very central node, with lots of ties, while the social network splits more clearly into groups with one node acting as a bridge between the groups. We will use measures of centrality and centralization to more formally explore these features. Our main substantive goal is to determine which nodes are most important in the classroom and how (or if) this varies across network relation and measure of centrality. Are individuals who are prominent in the task network also prominent in the social interaction network? Which nodes act as bridges? Are they the same nodes with the highest degree? We also want to uncover something about the overall level of inequality and hierarchy that exists in this classroom. Is this a world where one node dominates?
9.2 Centrality
9.2.2 Correlations between Centrality Measures
We have so far seen which nodes are the most central on different measures. We now want to formalize this a bit more by computing the correlations between the centrality scores, showing how closely these measures of centrality are interrelated. More substantively, we want to know which measures tend to yield the same nodes as central and which tend to disagree on the most important nodes. Here we generate a table of pairwise correlations. Again, we take out the first column of ids and the second column showing the network type when doing the correlation matrix (we also round the values in the correlation matrix when printing).
cor_tab1 <- cor(central_social[, -c(1, 2)])
round(cor_tab1, 3)
## indegree outdegree incloseness2 outcloseness2 between eigen
## indegree 1.000 0.960 0.874 0.860 0.629 0.940
## outdegree 0.960 1.000 0.822 0.874 0.738 0.914
## incloseness2 0.874 0.822 1.000 0.879 0.548 0.794
## outcloseness2 0.860 0.874 0.879 1.000 0.595 0.800
## between 0.629 0.738 0.548 0.595 1.000 0.507
## eigen 0.940 0.914 0.794 0.800 0.507 1.000
Indegree and outdegree are very closely correlated (rho = 0.96), indicating that social talk with others is almost always reciprocated (i.e., if you talk to others, they tend to talk back to you). Indegree and outdegree are also highly correlated with eigenvector centrality, indicating that the students that talk the most to others (or, relatedly, are talked to the most by others) are also the ones that are connected to other highly connected students -- possibly indicating high density cliques around these individuals. The degree centralities are less correlated with our closeness centrality scores, suggesting that nodes with high degree are not always (although often) close to other nodes.
Betweenness shows the highest correlation with outdegree, followed by indegree. In the case of this particular network, it seems that individuals that talk the most to others are the likeliest to serve as bridges between the particular cliques (see, e.g., 22 in the plot). Note that betweenness is not all that highly correlated with closeness centrality. This suggests that nodes may sit between groups, and thus have high betweenness, but not necessarily be close to all nodes, on average. For example, node 19 has high closeness centrality but not especially high betweenness centrality. If we look at the last plot, we can see that node 19 is deeply embedded in one social group and has ties to node 22, who has high betweenness, connecting different parts of the network. Thus, node 19 has high closeness as they can reach everyone else (through node 22) but low betweenness, as the shortest paths connecting different groups would not have to run through node 19.
Thus, if the process that we thought was most important was about information flow based on shortest paths, we may think that node 19 is well positioned to influence the rest of the network. If, however, the key is being the bridge itself, then 19 is clearly not as important as node 22. Thus, while there is much agreement between the centrality scores (with nodes 22, 16, 18 and 19 showing up consistently as central) it is possible for a node to be high on one measure and low on another.
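The contrast between closeness and betweenness can be reproduced on a small made-up network (a sketch, not the classroom data). Two fully connected groups of three are joined only through a hub, node 4, which is tied to everyone. The object names are hypothetical, and closeness is computed as the mean inverse distance to all other nodes.

```r
library(igraph)

# Hypothetical network: groups {1, 2, 3} and {5, 6, 7}, with node 4 tied to all.
toy_el <- matrix(c(1, 2,  1, 3,  2, 3,
                   5, 6,  5, 7,  6, 7,
                   4, 1,  4, 2,  4, 3,
                   4, 5,  4, 6,  4, 7),
                 ncol = 2, byrow = TRUE)
toy_net <- graph_from_edgelist(toy_el, directed = FALSE)

# Betweenness: only the hub lies on shortest paths between the groups.
toy_between <- betweenness(toy_net, normalized = FALSE)

# Closeness as mean inverse distance to all other nodes.
toy_dist <- distances(toy_net)
diag(toy_dist) <- NA
toy_close <- apply(1 / toy_dist, MARGIN = 1, FUN = mean, na.rm = TRUE)

data.frame(node = 1:7, between = toy_between, closeness = round(toy_close, 2))
```

In this toy case the group members reach everyone in at most two steps, giving them a closeness of 0.75 against 1 for the hub, yet their betweenness is 0 while the hub's is 9. This mirrors the position of node 19 relative to node 22 described above.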
9.2.3 Centrality for Task Interactions
We now repeat the analysis for the task interaction network. Note that the in and out measures will be the same as the network is undirected, meaning that we only need one calculation for measures like degree or closeness. It also means that we do not need to set mode as an argument.
degree_task <- degree(s641_task)
dist_mat_task <- distances(graph = s641_task)
diag(dist_mat_task) <- NA
dist_mat_task_inverted <- 1 / dist_mat_task
closeness_task2 <- apply(dist_mat_task_inverted, MARGIN = 1,
                         FUN = mean, na.rm = T)
betweenness_task <- betweenness(s641_task, normalized = F)
ev_obj_task <- evcent(s641_task)
eigen_task <- ev_obj_task$vector
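Inverting the distances is what lets this closeness measure handle disconnected networks: unreachable pairs have infinite distance, and 1 / Inf is 0 in R, so they contribute nothing to the mean rather than breaking it. Here is a minimal sketch with a hand-written distance matrix (toy_dist is hypothetical: a three-node path plus one isolate).

```r
# Hypothetical distances: nodes 1-2-3 form a path, node 4 is an isolate.
toy_dist <- matrix(c(0,   1,   2,   Inf,
                     1,   0,   1,   Inf,
                     2,   1,   0,   Inf,
                     Inf, Inf, Inf, 0),
                   nrow = 4, byrow = TRUE)

# Ignore the distance from each node to itself.
diag(toy_dist) <- NA

# Unreachable pairs become 0 after inversion.
toy_inverted <- 1 / toy_dist

# Mean inverse distance to all other nodes.
toy_closeness <- apply(toy_inverted, MARGIN = 1, FUN = mean, na.rm = TRUE)
toy_closeness
```

The isolate receives a score of 0 and the middle node of the path scores highest, which is the same pattern an isolate such as node 3 shows in the task network.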
And now we put the results together, as before, into a data frame with all the centrality values.
central_task <- data.frame(ids = ids, net = "task",
                           degree = degree_task,
                           closeness2 = closeness_task2,
                           between = betweenness_task,
                           eigen = eigen_task)
head(central_task)
## ids net degree closeness2 between eigen
## 1 1 task 1 0.42857143 0 0.2154856
## 2 2 task 1 0.42857143 0 0.2154856
## 3 3 task 0 0.00000000 0 0.0000000
## 4 4 task 1 0.04761905 0 0.0000000
## 5 5 task 1 0.42857143 0 0.2154856
## 6 6 task 1 0.42857143 0 0.2154856
We will now quickly take a look at the nodes with the top centrality scores for the task network.
apply(central_task[, -c(1, 2)], MARGIN = 2, FUN = order, decreasing = T)
## degree closeness2 between eigen
## [1,] 22 22 22 22
## [2,] 18 18 18 18
## [3,] 17 17 19 17
## [4,] 19 19 1 21
## [5,] 21 21 2 19
## [6,] 13 13 3 13
## [7,] 16 16 4 20
## [8,] 20 20 5 16
## [9,] 1 1 6 1
## [10,] 2 2 7 5
## [11,] 4 5 8 6
## [12,] 5 6 9 7
## [13,] 6 7 10 10
## [14,] 7 9 11 14
## [15,] 8 10 12 11
## [16,] 9 11 13 15
## [17,] 10 14 14 2
## [18,] 11 15 15 9
## [19,] 14 4 16 8
## [20,] 15 8 17 3
## [21,] 3 3 20 4
## [22,] 12 12 21 12
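Because order() returns the positions of the sorted values, each column above lists node ids from most to least central. Ties keep their original order, which is why the betweenness column runs 1, 2, 3, and so on after the top three nodes: those nodes all share a betweenness of 0 rather than being ranked against one another. A small sketch with hypothetical scores for four nodes:

```r
# Hypothetical centrality scores; row position stands in for node id.
toy_scores <- data.frame(degree = c(1, 0, 3, 1),
                         between = c(0, 0, 2, 0))

# order() gives row positions from highest to lowest score, column by column.
toy_ranks <- apply(toy_scores, MARGIN = 2, FUN = order, decreasing = TRUE)
toy_ranks
```

In the between column, nodes 1, 2 and 4 are tied at 0 and therefore appear in id order after node 3, so only the top of each column should be read as a meaningful ranking.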
In this case, we can see nodes 22, 18 and 17 are consistently the most important nodes, but node 22 is by far the most central. This becomes clear if we plot the network, scaling the nodes by degree:
plot(s641_task, vertex.size = central_task$degree,
vertex.label = V(s641_task)$name,
edge.arrow.size = 0.25, layout = layout.fruchterman.reingold,
main = "Classroom S641 Task Interactions", margin = -.08)
9.3 Centralization
We have so far seen which nodes are most important in the classroom using different definitions of centrality. We have also seen how this differs across social and task interactions. To flesh out the story more clearly, it will be useful to formally summarize the distribution of the centrality measures, telling us something about the network as a whole. For example, a network with one node capturing the vast majority of ties is highly centralized, or highly unequal, as all of the activity in the network is focused on a particular node. A highly centralized network is also relatively fragile, as removing the one central node would greatly reduce the connectivity of the network. We could examine the distribution of any of the centrality scores we calculated above. Here, let's focus on degree as a way of exploring the level of centralization in the two networks. Let's start with a summary of the degree distributions (indegree for social and degree for task).
summary(indegree_social)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 2.500 2.591 3.750 7.000
summary(degree_task)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.000 1.000 1.000 2.182 2.000 17.000
sd(indegree_social)
## [1] 2.03912
sd(degree_task)
## [1] 3.459099
We can see that the mean and median of degree are higher in the social interaction network than in the task network. In contrast, the maximum value and the standard deviation are much higher in the task network. This confirms our story from above, where the task network is centered strongly on one node, while the social interaction network is based more on groups, where a single node won't necessarily dominate. Note that a simple standard deviation score can serve as an effective measure of centralization. It is also possible to employ traditional centralization measures.
To calculate centralization, we take the centrality scores of interest and sum up the total deviations from the highest value. We then typically divide the total summation by the maximum possible level of centralization in a network of that size (i.e., the centralization we would have observed in a hub and spoke structure).
igraph has different centralization functions for each centrality score. For degree the function is centr_degree(). The arguments are graph (network of interest), mode (in, out, total), loops (T/F, should self-loops be considered?), and normalized (T/F, should we divide by the theoretical max?). Here we calculate indegree centralization for the social interaction network, ignoring self-loops and dividing by the theoretical max.
cent_social <- centr_degree(graph = s641_social, mode = "in",
                            loops = FALSE, normalized = TRUE)
cent_social
## $res
## [1] 3 1 0 1 3 3 1 1 1 2 2 3 0 0 1 5 4 7 5 3 5 6
##
## $centralization
## [1] 0.2199546
##
## $theoretical_max
## [1] 441
We could also calculate this directly by doing:
sum(max(indegree_social) - indegree_social) / sum(21 - rep(0, 21))
## [1] 0.2199546
The code simply takes the max centrality score and subtracts the centrality of each node in the network, summing over all nodes. We then divide by the theoretical max, the centralization score if one node received nominations from everyone (indegree = 21 in this case) and everyone else received none (indegree = 0).
And now we do the same thing for the task network.
cent_task <- centr_degree(graph = s641_task,
                          loops = FALSE, normalized = TRUE)
cent_task
## $res
## [1] 1 1 0 1 1 1 1 1 1 1 1 0 2 1 1 2 3 4 3 2 3 17
##
## $centralization
## [1] 0.7761905
##
## $theoretical_max
## [1] 420
Note that the theoretical max is a little different as we treated the task network as undirected. Clearly, the task network is considerably more centralized. In fact, the task network almost approaches maximum centralization, or a perfect hub and spoke structure.
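The two theoretical maxima can be recovered by hand from the hub and spoke structure. The sketch below uses hypothetical degree vectors for a 22-node star; it illustrates the normalizing constant and is not igraph's internal code.

```r
n <- 22

# Directed star, indegree: the hub receives n - 1 ties, everyone else none.
star_indegree <- c(n - 1, rep(0, n - 1))

# Undirected star: the hub has n - 1 ties, and each spoke has 1 (its tie to the hub).
star_degree <- c(n - 1, rep(1, n - 1))

# Sum of deviations from the maximum, as in the centralization numerator.
max_directed <- sum(max(star_indegree) - star_indegree)
max_undirected <- sum(max(star_degree) - star_degree)

max_directed
max_undirected
```

These equal the theoretical_max values of 441 and 420 reported above. In the undirected case each spoke necessarily has degree 1, so the largest possible gap per node is 20 rather than 21.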
Now, let's do a simple plot of the two degree distributions. We will put degree on the x-axis and plot a smoothed density curve for each distribution. First, we need to get the density curves for each network, starting with the social interaction network (we set from to 0 in the density() function as indegree cannot be less than 0).
den_social <- density(indegree_social, from = 0)
And now for the task network:
den_task <- density(degree_task, from = 0)
And now we set up the plot, plot the two lines and add a legend.
plot(range(den_social$x, den_task$x), range(den_social$y, den_task$y),
type = "n", xlab = "degree",
ylab = "density",
main = "Degree Distribution for Social and Task Networks")
lines(den_social, col = "red" , lty = 2, lwd = 2)
lines(den_task, col = "light blue", lty = 2, lwd = 2)
legend("topright", c("Social", "Task"),
col = c("red", "light blue"), lty = 2, lwd = 2)
Here we see that for the task network most people have one or two ties and one person has a very high degree. The social interaction network has a much more even distribution, with many people close to the mean. The story is clear that the task network is highly centralized, with one node being the focal point of all of the task interactions. Social interactions are much more evenly dispersed, occurring within groups of people but not centered on a single well-connected node.
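Since degree takes only integer values, a plain frequency table is a useful companion to the smoothed curves. The sketch below re-enters the degree vectors printed in the centr_degree() output above (the object names are hypothetical) and counts how many students have each degree.

```r
# Degree vectors copied from the $res output shown earlier.
indegree_social_res <- c(3, 1, 0, 1, 3, 3, 1, 1, 1, 2, 2,
                         3, 0, 0, 1, 5, 4, 7, 5, 3, 5, 6)
degree_task_res <- c(1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1,
                     0, 2, 1, 1, 2, 3, 4, 3, 2, 3, 17)

# Count students at each degree value.
table(indegree_social_res)
table(degree_task_res)

# Number of students with exactly one or two ties in each network.
n_low_social <- sum(indegree_social_res %in% c(1, 2))
n_low_task <- sum(degree_task_res %in% c(1, 2))
```

In the task network 15 of the 22 students have one or two ties, against 8 of 22 for social indegree, which is the concentration that the density curves smooth over.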
More generally, our centrality and centralization analyses paint a picture of two different kinds of interactional tendencies. For the social interaction network, we have a set of divided groups bridged by one focal node with high betweenness. Within each group there are prominent nodes with high degree, closeness, etc. but only one node holds the whole network together. For the task network, there is only one focal node, with everyone doing task interactions with them and few interactions happening otherwise.
9.4 Clustering and Hierarchy
We have so far used centrality and centralization to explore the classroom networks. Centrality is focused on individual positions in the network and can tell us who holds important positions and who does not. Centralization helps us understand how unequally distributed centrality is in the network. Neither measure (centrality nor centralization) can tell us much about hierarchy at the group level. We may, however, want to know if the groups that exist in our classroom are themselves hierarchically arranged.
To explore hierarchy at the group-level, it will be useful to consider other kinds of measures. Here, we will use the tau statistic. The tau statistic captures how micro processes aggregate to create different macro structures. The basic idea is to create hypotheses in the form of different triad counts (the micro processes), that should yield different features at the macro-level. Thus, different micro hypotheses (about which triads should be in the network at high/low rates) correspond to different kinds of emergent features at the macro level. By comparing the observed triad counts to that expected under a null hypothesis, we can see what kinds of hierarchical arrangements exist in the network of interest. Formally, we compare the observed triad counts to the expectations under a null hypothesis of a random network with the same dyad census. This is analogous to the kinds of tests we explored in Chapter 7, but the tau statistic is presented as a z-score (how many standard deviations from the null hypothesis is the observed value), making it akin to more traditional statistical tests.
Let's first transform our network from an igraph object into a network object using the intergraph package (as the function we need assumes a network object).
library(intergraph)
For this analysis we will focus on the social interaction network.
s641_social_network <- asNetwork(s641_social)
Here we read in a function to calculate the tau statistic.
source(file = "https://github.com/JeffreyAlanSmith/Integrated_Network_Science/raw/master/R/tau_functions.R")
Now we load the ergm and sna packages, which are used by the tau function read in above.
library(ergm)
library(sna)
Let's test different hypotheses about the distribution of triads in the network, telling us something about the macro structure in terms of hierarchy. We will consider a ranked clustering hypothesis, a clustering hypothesis and a balance hypothesis. Each hypothesis is represented by a vector, indicating which triads should be summed up and compared to our baseline expectations. The triad types are:
- 003 A, B, C, empty triad
- 012 A->B, C
- 102 A<->B, C
- 021D A<-B->C
- 021U A->B<-C
- 021C A->B->C
- 111D A<->B<-C
- 111U A<->B->C
- 030T A->B<-C, A->C
- 030C A<-B<-C, A->C
- 201 A<->B<->C
- 120D A<-B->C, A<->C
- 120U A->B<-C, A<->C
- 120C A->B->C, A<->C
- 210 A->B<->C, A<->C
- 300 A<->B<->C, A<->C, completely connected
A ranked clustering hypothesis poses that 003, 102, 021D, 021U, 030T, 120D, 120U and 300 should be present in the network at higher rates than what we expect based on dyadic processes alone. The idea is that a network with these triads will tend to create macro structures that correspond to ranked clustering, where there are mutual ties within groups and asymmetric ties across groups; where the lower status groups send ties to higher status groups but not vice versa. Let's create a vector that corresponds to the ranked clustering hypothesis, putting a 1 in each spot of the triad census (following the order above) that corresponds to a triad in that hypothesis.
weights_rankedcluster <- c(1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1)
A clustering hypothesis poses that 003, 102, and 300 should be present in the network at higher rates than what we expect based on dyadic processes alone. The clear difference with the ranked clustering hypothesis is that triads that create hierarchies (021U, 120U, etc.) are not included here. The macro network structure implied by a triad census fitting the clustering model is one with a number of social groups, with mutual ties within groups and few ties between groups. Thus, there are a number of groups differentiated by high internal rates of interaction (see Chapter 8) but there is no clear hierarchy between the groups.
weights_cluster <- c(1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)
A balance hypothesis is the simplest hypothesis and only includes 102 and 300. This is very similar to the clustering hypothesis but differs in the exclusion of the null triad, 003. The key macro structural difference is that the clustering hypothesis implies that a number of social groups will emerge (with no hierarchy) while the balance hypothesis implies that only two groups should emerge, with mutual ties within the groups and few ties between.
weights_balance <- c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)
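Typing sixteen 0s and 1s by position is easy to get wrong, so one option is to build each weight vector from the triad names instead. The sketch below assumes the census order listed above; make_weights() is a hypothetical helper, not part of the sourced tau functions.

```r
# Triad types in the order of the triad census listed above.
triad_types <- c("003", "012", "102", "021D", "021U", "021C", "111D", "111U",
                 "030T", "030C", "201", "120D", "120U", "120C", "210", "300")

# Hypothetical helper: 1 for triads named in the hypothesis, 0 otherwise.
make_weights <- function(included) {
  stopifnot(all(included %in% triad_types))  # guard against misspelled names
  as.numeric(triad_types %in% included)
}

check_rankedcluster <- make_weights(c("003", "102", "021D", "021U", "030T",
                                      "120D", "120U", "300"))
check_cluster <- make_weights(c("003", "102", "300"))
check_balance <- make_weights(c("102", "300"))

check_rankedcluster
```

The three vectors produced this way match weights_rankedcluster, weights_cluster and weights_balance as typed out above.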
The function is tau_stat_function(). The arguments are network and weight.vector (the vector of weights). For ranked clustering:
tau_rankedcluster <- tau_stat_function(network = s641_social_network,
                                       weight.vector = weights_rankedcluster)
tau_rankedcluster
## $tau
## [,1]
## [1,] 2.968397
##
## [[2]]
## observed.triads expected.triads weight.vector
## triadcensus.003 1029 1.012569e+03 1
## triadcensus.012 37 4.579457e+01 0
## triadcensus.102 403 4.121511e+02 1
## triadcensus.021D 1 1.144864e-01 1
## triadcensus.021U 0 1.144864e-01 1
## triadcensus.021C 0 2.289728e-01 0
## triadcensus.111D 4 6.182267e+00 0
## triadcensus.111U 10 6.182267e+00 0
## triadcensus.030T 0 5.695842e-04 1
## triadcensus.030C 0 1.898614e-04 0
## triadcensus.201 38 5.357965e+01 0
## triadcensus.120D 0 1.537877e-02 1
## triadcensus.120U 0 1.537877e-02 1
## triadcensus.120C 0 3.075755e-02 0
## triadcensus.210 7 7.996962e-01 0
## triadcensus.300 11 2.221378e+00 1
The output is a list, with the first element the tau statistic and the second a data frame with the observed and expected triads, as well as the weighting vector. Now for the clustering hypothesis:
tau_cluster <- tau_stat_function(network = s641_social_network,
                                 weight.vector = weights_cluster)
tau_cluster
## $tau
## [,1]
## [1,] 2.867246
##
## [[2]]
## observed.triads expected.triads weight.vector
## triadcensus.003 1029 1.012569e+03 1
## triadcensus.012 37 4.579457e+01 0
## triadcensus.102 403 4.121511e+02 1
## triadcensus.021D 1 1.144864e-01 0
## triadcensus.021U 0 1.144864e-01 0
## triadcensus.021C 0 2.289728e-01 0
## triadcensus.111D 4 6.182267e+00 0
## triadcensus.111U 10 6.182267e+00 0
## triadcensus.030T 0 5.695842e-04 0
## triadcensus.030C 0 1.898614e-04 0
## triadcensus.201 38 5.357965e+01 0
## triadcensus.120D 0 1.537877e-02 0
## triadcensus.120U 0 1.537877e-02 0
## triadcensus.120C 0 3.075755e-02 0
## triadcensus.210 7 7.996962e-01 0
## triadcensus.300 11 2.221378e+00 1
Now for the balance hypothesis:
tau_balance <- tau_stat_function(network = s641_social_network,
                                 weight.vector = weights_balance)
tau_balance
## $tau
## [,1]
## [1,] -0.03377649
##
## [[2]]
## observed.triads expected.triads weight.vector
## triadcensus.003 1029 1.012569e+03 0
## triadcensus.012 37 4.579457e+01 0
## triadcensus.102 403 4.121511e+02 1
## triadcensus.021D 1 1.144864e-01 0
## triadcensus.021U 0 1.144864e-01 0
## triadcensus.021C 0 2.289728e-01 0
## triadcensus.111D 4 6.182267e+00 0
## triadcensus.111U 10 6.182267e+00 0
## triadcensus.030T 0 5.695842e-04 0
## triadcensus.030C 0 1.898614e-04 0
## triadcensus.201 38 5.357965e+01 0
## triadcensus.120D 0 1.537877e-02 0
## triadcensus.120U 0 1.537877e-02 0
## triadcensus.120C 0 3.075755e-02 0
## triadcensus.210 7 7.996962e-01 0
## triadcensus.300 11 2.221378e+00 1
In general, larger values offer support for the hypothesis in question. We can see here that there is little support for the balance hypothesis compared to the other hypotheses. This suggests that the balance hypothesis is too simple. More specifically, it looks like there are many more null triads (003) than we would expect in a network where everyone falls into two groups (under the balance hypothesis). The tau statistics are similar between the ranked clustering and clustering hypotheses. A value of 2.968 (for ranked clustering) suggests that the observed (summed) counts are about 3 standard deviations away from what we expect under the null. Values over 2 offer support for the hypothesis in question under traditional hypothesis testing criteria.
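The direction of each tau statistic can be read off the printed tables by comparing the weighted sum of observed triads with the weighted sum of expected triads. The sketch below re-enters the observed and expected counts from the output above and computes that difference for each hypothesis. It reproduces only the numerator; the standard deviation used to scale it comes from the sourced tau_stat_function() and is not reconstructed here. The object names are hypothetical.

```r
# Observed and expected triad counts, copied from the output above (census order).
observed <- c(1029, 37, 403, 1, 0, 0, 4, 10, 0, 0, 38, 0, 0, 0, 7, 11)
expected <- c(1.012569e+03, 4.579457e+01, 4.121511e+02, 1.144864e-01,
              1.144864e-01, 2.289728e-01, 6.182267e+00, 6.182267e+00,
              5.695842e-04, 1.898614e-04, 5.357965e+01, 1.537877e-02,
              1.537877e-02, 3.075755e-02, 7.996962e-01, 2.221378e+00)

# Weight vectors for the three hypotheses, as defined earlier.
weights <- cbind(
  rankedcluster = c(1, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1),
  cluster       = c(1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1),
  balance       = c(0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1)
)

# Weighted observed minus weighted expected counts, one value per hypothesis.
excess <- colSums(weights * (observed - expected))
round(excess, 2)
```

The excess is positive and of similar size for ranked clustering and clustering (roughly 16 to 17 triads), and slightly negative for balance, in line with the signs and ordering of the tau values above.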
Let's take a closer look at the results for the ranked clustering model. We will focus on the expected and observed triad counts, particularly those triads that are in the ranked clustering model but not the clustering model (021D, 021U, 030T, 120D, 120U). The idea is to grab the data frame from the ranked cluster results, only keeping those rows for certain triads.
triad_names <- c("triadcensus.021D", "triadcensus.021U", "triadcensus.030T",
                 "triadcensus.120D", "triadcensus.120U")
tau_rankedcluster[[2]][rownames(tau_rankedcluster[[2]]) %in% triad_names, ]
## observed.triads expected.triads weight.vector
## triadcensus.021D 1 0.1144864249 1
## triadcensus.021U 0 0.1144864249 1
## triadcensus.030T 0 0.0005695842 1
## triadcensus.120D 0 0.0153787735 1
## triadcensus.120U 0 0.0153787735 1
In every case but 021D, the observed counts are basically the same as those expected under the null model. In fact, we see 0 observed triads for 021U, 030T, 120D and 120U. This would suggest that the ranked clustering model really isn't offering much over the clustering model. The ranked clustering model offers similar fit to the clustering model but is more complicated, adding 5 triads that do not seem to deviate much from chance expectations. We may then have good reason to interpret the network in terms of the clustering model, where there are multiple groups but few asymmetries.
Overall, the analysis shows that the social interaction network is best characterized as a network with multiple groups without a clear hierarchical arrangement. Given the very high levels of reciprocity in social interactions, asymmetries are rare and do not consistently emerge between groups. The tau statistic reinforces our story of the social interaction network consisting of distinct social groups with one bridge and no clear hierarchy. Compare this to the task network, which has a clear hub and spoke structure, but no emergent groups.
We end the tutorial by noting that centrality and hierarchy will come up again in a number of tutorials; for example, in Chapter 11 (two-mode), Chapter 13 (statistical network models), Chapter 14 (diffusion) and Chapter 15 (social influence).