Lecture 7 — Stochastic Block Models and Continuous Latent Space Models

Agenda

Reminder about block models
Stochastic block models
SBMs and community discovery
Continuous latent space models
Extensions and side-lights (time permitting)

Notation for today

\(m =\) total number of edges
\(k_i =\) degree of node \(i\) in undirected graph
- \(\sum_{i}{k_i} = 2m\)

Block Models

\[ \newcommand{\Prob}[1]{\mathbb{P}\left( #1 \right)} \DeclareMathOperator*{\logit}{logit} \DeclareMathOperator*{\Tr}{Tr} \]

\(n\) nodes, divided into \(k\) blocks, \(Z_i =\) block of node \(i\), \(k\times k\) affinity matrix \(\mathbf{b}\)

\[ \Prob{ A_{ij}=1| Z_i = r, Z_j = s } = b_{rs} \]

Independence across edges
Inference as easy as could be hoped
Presumes: block assignments are known
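
With known blocks, the maximum-likelihood estimate is just the fraction of possible edges present between each pair of blocks, \(\hat{b}_{rs} = e_{rs}/n_{rs}\). A minimal sketch, assuming an undirected graph without self-loops stored as a 0/1 list of lists; the name `estimate_affinity` and the 0-based block labels are illustrative choices, not part of the lecture:

```python
from itertools import combinations

def estimate_affinity(adj, z, k):
    """MLE of the k x k affinity matrix b, given known block labels.

    adj: symmetric 0/1 adjacency matrix (list of lists), no self-loops.
    z:   block label of each node, in 0..k-1.
    Returns b_hat with b_hat[r][s] = e_rs / n_rs, or None where
    no node pairs fall in blocks (r, s).
    """
    edges = [[0] * k for _ in range(k)]  # e_rs: edges between blocks r and s
    pairs = [[0] * k for _ in range(k)]  # n_rs: node pairs between blocks r and s
    for i, j in combinations(range(len(z)), 2):
        r, s = sorted((z[i], z[j]))      # store each unordered block pair once
        pairs[r][s] += 1
        edges[r][s] += adj[i][j]
    b_hat = [[None] * k for _ in range(k)]
    for r in range(k):
        for s in range(r, k):
            if pairs[r][s] > 0:
                b_hat[r][s] = b_hat[s][r] = edges[r][s] / pairs[r][s]
    return b_hat
```

Each entry is an independent binomial proportion, which is why inference here is as easy as could be hoped.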

Stochastic Block Models (SBMs)

"SBM" means:

\[ \begin{eqnarray} Z_i & \sim_{IID} & \mathrm{Multinomial}(\rho)\\ A_{ij} | Z_i, Z_j & \sim_{ind} & \mathrm{Bernoulli}(b_{Z_i Z_j}) \end{eqnarray} \]

i.e., block assignment is stochastic (but IID)
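
The two-stage definition translates directly into a simulator. A minimal sketch, assuming an undirected graph without self-loops; `simulate_sbm` and its `seed` argument are hypothetical names introduced for illustration:

```python
import random

def simulate_sbm(n, rho, b, seed=None):
    """Draw block labels and an undirected adjacency matrix from an SBM.

    rho: block probabilities (length k); b: k x k symmetric affinity matrix.
    """
    rng = random.Random(seed)
    k = len(rho)
    # Z_i ~ IID Multinomial(rho)
    z = rng.choices(range(k), weights=rho, k=n)
    adj = [[0] * n for _ in range(n)]
    # A_ij | Z_i, Z_j ~ independent Bernoulli(b[Z_i][Z_j]), for i < j
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < b[z[i]][z[j]]:
                adj[i][j] = adj[j][i] = 1
    return z, adj
```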

The log-likelihood gets complicated

\[ \ell(b, \rho) = \log{\sum_{z \in \{1:k\}^n}{\left[\prod_{i<j}{b_{z_i z_j}^{A_{ij}} {(1-b_{z_i z_j})}^{(1-A_{ij})}} \prod_{i=1}^{n}{\rho_{z_i}}\right]}} \]

Define \(n_r(z)\), \(e_{rs}(z)\), \(n_{rs}(z)\) in the obvious ways

\[ \ell(b, \rho) = \log{\sum_{z \in \{1:k\}^n}{\left[\prod_{r,s}{b_{rs}^{e_{rs}(z)} (1-b_{rs})^{n_{rs}(z) - e_{rs}(z)}} \prod_{r}{\rho_r^{n_r(z)}}\right]}} \]

and \(\log{\sum} \neq \sum{\log} \ldots\)
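
To see why this is a mess, the sum can be evaluated by brute force for a very small graph. A sketch only, assuming an undirected graph (each unordered pair counted once), all \(b_{rs}\) strictly between 0 and 1, and all \(\rho_r > 0\); `sbm_log_likelihood` is an illustrative name. The loop visits all \(k^n\) assignments, so it is hopeless beyond a dozen or so nodes:

```python
import math
from itertools import combinations, product

def sbm_log_likelihood(adj, rho, b):
    """Exact SBM log-likelihood by summing over all k**n block assignments."""
    n, k = len(adj), len(rho)
    log_terms = []
    for z in product(range(k), repeat=n):
        # log of the bracketed term for this particular assignment z
        lt = sum(math.log(rho[r]) for r in z)
        for i, j in combinations(range(n), 2):
            p = b[z[i]][z[j]]
            lt += math.log(p) if adj[i][j] else math.log(1.0 - p)
        log_terms.append(lt)
    # log-sum-exp, shifted by the maximum for numerical stability
    m = max(log_terms)
    return m + math.log(sum(math.exp(t - m) for t in log_terms))
```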

How do we get out of this mess?

If we knew \(Z\), estimating \(\mathbf{b}\) and \(\rho\) would be easy

If we knew \(\mathbf{b}\) and \(\rho\), getting \(\Prob{Z|A}\) is at least conceivable

EM algorithm
EM algorithm with "belief propagation"
- Node \(i\) takes in current guesses about blocks of its neighbors, \(\rho\)
- Node \(i\) finds posterior distribution for \(Z_i\); iterate
- Usually special handling of non-edges
Gibbs sampling
Treat \(Z\) as a fixed (unknown) parameter and maximize the likelihood over it
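
The building block for Gibbs sampling (and, in spirit, for the node-by-node updates above) is the conditional distribution of one \(Z_i\) given everything else: \(\Prob{Z_i = r | Z_{-i}, A} \propto \rho_r \prod_{j \neq i}{b_{r z_j}^{A_{ij}} (1-b_{r z_j})^{1-A_{ij}}}\). A minimal sketch of that one step, assuming \(0 < b_{rs} < 1\) and \(\rho_r > 0\); `block_conditional` is a hypothetical helper, not a complete sampler:

```python
import math

def block_conditional(i, adj, z, rho, b):
    """Conditional distribution of Z_i given the graph and all other blocks."""
    log_w = []
    for r in range(len(rho)):
        lw = math.log(rho[r])
        for j in range(len(z)):
            if j == i:
                continue
            p = b[r][z[j]]
            # edges and non-edges both carry information about Z_i
            lw += math.log(p) if adj[i][j] else math.log(1.0 - p)
        log_w.append(lw)
    m = max(log_w)                       # normalize on the log scale
    w = [math.exp(x - m) for x in log_w]
    total = sum(w)
    return [x / total for x in w]
```

A Gibbs sampler would draw a new \(Z_i\) from this distribution for each node in turn and repeat.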

And after all that…

SBM is not identified!
Swap any two of the block labels:
- Exchange those rows and columns of \(\mathbf{b}\)
- Also exchange those entries in \(\rho\)
- Distribution over graphs is unchanged
Measure differences in \(Z\)s between estimates in permutation-invariant ways
- e.g., min over permuting \(1:k\)
- or use mutual information
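
The "min over permuting \(1:k\)" comparison can be sketched as follows, using 0-based labels; `min_label_disagreement` is an illustrative name. It tries all \(k!\) relabelings, so it is only sensible for small \(k\):

```python
from itertools import permutations

def min_label_disagreement(z_a, z_b, k):
    """Smallest number of nodes on which two block assignments disagree,
    minimized over all relabelings of the k blocks in z_a."""
    best = len(z_a)
    for perm in permutations(range(k)):
        mismatches = sum(1 for a, b in zip(z_a, z_b) if perm[a] != b)
        best = min(best, mismatches)
    return best
```

For example, `[0, 0, 1, 1]` and `[1, 1, 0, 0]` describe the same partition, and the function returns 0 for that pair.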

Modularity

Assortative mixing in networks = nodes with the same value of a discrete characteristic have more links to each other than you'd expect
How many is that?

\[ \begin{eqnarray} \kappa_{rs} & \equiv & e_{rs}/2m\\ \kappa_{r} & \equiv & \sum_{s}{\kappa_{rs}}\\ Q & \equiv & \sum_{r}{\left(\kappa_{rr} - \kappa_r^2\right)} \end{eqnarray} \]

Note: \(\Tr{\mathbf{\kappa}}\) maximized if all nodes are in one block!
Assortativity usually refers to observed characteristics

Modularity (cont'd)

We can use \(Q\) when \(z\) is something we make up:

\[ Q(z) = \sum_{r}{\left(\kappa_{rr}(z) - \kappa_r(z)^2\right)} \]

This is the (Newman-Girvan) modularity of the block-assignment vector \(z\)

Equivalent (exercise!) to a sum over node pairs:

\[ Q = \frac{1}{2m}\sum_{i,j}{\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta_{Z_i Z_j}} \]

Break this down:
- \(k_i k_j / 2m =\) probability of an \((i,j)\) edge if nodes are paired randomly but degrees are preserved
- \(A_{ij} - k_i k_j/2m > 0\) for \(A_{ij} = 1\), \(<0\) for \(A_{ij} = 0\)
- \(Q\) likes within-block edges, dislikes within-block non-edges
- Substitute other null models to taste
- Substitute divergences other than \(-\) to taste
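
The equivalence of the block form and the node-pair form can be checked numerically. A sketch under one stated assumption: \(e_{rs}\) counts edge ends, i.e. \(e_{rs} = \sum_{i,j}{A_{ij}\delta_{Z_i r}\delta_{Z_j s}}\), so within-block edges are counted twice and \(\sum_{r,s}{\kappa_{rs}} = 1\). The function names are illustrative:

```python
def modularity_blocks(adj, z, k):
    """Q = sum_r (kappa_rr - kappa_r**2), computed from block-level fractions."""
    n = len(adj)
    two_m = sum(sum(row) for row in adj)      # sum of degrees = 2m
    kappa = [[0.0] * k for _ in range(k)]
    for i in range(n):
        for j in range(n):
            kappa[z[i]][z[j]] += adj[i][j] / two_m
    return sum(kappa[r][r] - sum(kappa[r]) ** 2 for r in range(k))

def modularity_pairs(adj, z):
    """Q = (1/2m) sum_{i,j} [A_ij - k_i k_j / 2m] delta(z_i, z_j)."""
    n = len(adj)
    deg = [sum(row) for row in adj]
    two_m = sum(deg)
    total = 0.0
    for i in range(n):
        for j in range(n):
            if z[i] == z[j]:
                total += adj[i][j] - deg[i] * deg[j] / two_m
    return total / two_m
```

For two triangles joined by a single edge, with each triangle as a block, both functions give \(Q = 5/14\); putting every node in one block gives \(Q = 0\).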

Community Discovery

Community or module: group of nodes with dense internal connections, but few connections to other communities
Community discovery: given a graph, divide it into good communities
"Good" often means: high modularity \(Q\)
Huge literature since Newman and Girvan 2003

Community Discovery (cont'd.)

General maximization problem is NP-hard
Many, many heuristics
- Find highest-betweenness edge, remove it, recalculate betweenness, repeat
- Turn into an eigen-problem
- Assign random initial communities, take majority vote (like HW 1 Prob 3)
- Find most likely \(Z\) in an SBM
- Many of these built in to igraph

Consistency of Community Discovery

Theoretical literature has focused on a very strong form of consistency: as \(n\rightarrow \infty\), \[ \Prob{\hat{Z} \neq Z} \rightarrow 0 \] i.e., probability that all nodes are correctly assigned to communities goes to 1
- Could instead imagine something like "proportion of mis-assigned nodes goes to zero in probability"
- Permuting over community labels always allowed
Growing theoretical literature, typically assuming:
- Graph really comes from SBM
- Expected degree grows sufficiently rapidly with \(n\)
- \(\mathbf{b}\) is diagonally dominated
- Columns of \(\mathbf{b}\) are sufficiently different from each other

Continuous Latent Space Models

The classic approach, due to Hoff, Raftery and Handcock:

Node \(i\) lives at a (latent) point \(Z_i \in \mathbb{R}^d\)
- HRH proposed these are IID \(\sim \mathcal{N}(0, \mathbf{I}_d)\)
Edges become unlikely as nodes separate
- HRH proposed \(\logit{\Prob{ A_{ij}=1|Z_i, Z_j}} = \beta_0 - \| Z_i - Z_j\|\)
Marginally, \(A_{ij}\) that share a node are dependent (through the shared latent location)
All \(A_{ij}\) are independent given locations
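
The HRH specification is easy to simulate. A minimal sketch, assuming an undirected graph without self-loops; `simulate_latent_space` and its `seed` argument are names introduced here for illustration:

```python
import math
import random

def simulate_latent_space(n, d, beta0, seed=None):
    """Draw latent positions and an undirected graph from the HRH model."""
    rng = random.Random(seed)
    # Z_i ~ IID N(0, I_d)
    z = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # logit P(A_ij = 1 | Z_i, Z_j) = beta0 - ||Z_i - Z_j||
            eta = beta0 - math.dist(z[i], z[j])
            p = 1.0 / (1.0 + math.exp(-eta))
            if rng.random() < p:
                adj[i][j] = adj[j][i] = 1
    return z, adj
```

Larger `beta0` gives denser graphs; edges are drawn independently once the positions are fixed.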

Symmetry again

Why just \(\beta_0 - \| Z_i - Z_j \|\)? Why not \(\beta_0 - \beta_1 \| Z_i - Z_j\|\)?
Why \(Z_i \sim \mathcal{N}(0, \mathbf{I}_d)\), instead of some other variance?
- If we multiply all the \(Z_i\) by the same scalar \(r\), and \(\beta_1\) by \(1/r\), nothing observable changes
- Thus fix \(\beta_1 = 1\), and prior variance at unity
The \(Z_i\) are still not identified:
- Nothing changes if we rotate all the \(Z_i\) the same way
- Or if we translate all the \(Z_i\) along the same vector
- Or if we reflect all the \(Z_i\) about the same plane
- Or combine rotations, translations and reflections

Isometry

Isometry = transformations which leave all distances (metric) the same (iso-)
- For Euclidean space, isometry group built from rotations, translations and reflections
- The \(Z\)s are "identified up to isometry"
Procrustes problem = given two sets of points in \(\mathbb{R}^d\), find isometry which minimizes the distance between them
- Good algorithms for this (especially if not too many points and \(d\) small)
- Often useful as an intermediate stage in working with continuous-space models
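
For \(d = 2\) the Procrustes problem has a closed-form solution, which makes for a compact illustration. A sketch only, assuming the criterion is the sum of squared distances between matched points and that the two point sets are listed in the same order; `procrustes_2d` is a hypothetical helper (for general \(d\) the rotation is usually obtained from a singular value decomposition instead):

```python
import math

def _center(points):
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    return [(x - cx, y - cy) for x, y in points], (cx, cy)

def _best_rotation(src, tgt):
    # Maximizing sum_i tgt_i . R(theta) src_i gives theta = atan2(b, a)
    a = sum(px * qx + py * qy for (px, py), (qx, qy) in zip(src, tgt))
    b = sum(px * qy - py * qx for (px, py), (qx, qy) in zip(src, tgt))
    theta = math.atan2(b, a)
    c, s = math.cos(theta), math.sin(theta)
    rotated = [(c * px - s * py, s * px + c * py) for px, py in src]
    err = sum((rx - qx) ** 2 + (ry - qy) ** 2
              for (rx, ry), (qx, qy) in zip(rotated, tgt))
    return rotated, err

def procrustes_2d(source, target):
    """Align source to target by translation, rotation and optional reflection.

    Returns (aligned_points, sum_of_squared_errors).
    """
    src, _ = _center(source)
    tgt, (tx, ty) = _center(target)
    candidates = [
        _best_rotation(src, tgt),                             # no reflection
        _best_rotation([(x, -y) for x, y in src], tgt),       # reflect first
    ]
    aligned, err = min(candidates, key=lambda pair: pair[1])
    return [(x + tx, y + ty) for x, y in aligned], err
```

If the target is an exact isometric copy of the source, the returned error is zero up to rounding, which is one way to compare two embeddings of the same network.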

What to do with continuous-space models?

Embedding: given \(A_{ij}\), guess at \(Z\)
Inference: on \(\beta_0\) and/or posterior distribution of \(Z\)
Of course, easy to simulate

Variants

Add in node covariates
Other distributions for locations
- Isometry: set mean at 0, variance at \(\mathbf{I}_d\) w.l.o.g.
- Why think anything is Gaussian?
Other link-probability functions
- Why think anything is logit-linear?
- Zero outside maximum radius?
- Step-function ("Heaviside") link probabilities?
Motion over time (Moore and Sarkar)
Other latent spaces
- Smooth manifolds
- Positively-curved (spherical) spaces
- Negatively-curved (hyperbolic) spaces

The cycle

[Figure: three panels linked by \(\Rightarrow\) arrows (D. Asta)]

Hyperbolic spaces

Lots of real networks are tree-like; this leads to non-Euclidean, hyperbolic spaces

Hierarchical, tree-like structures embed isometrically into hyperbolic spaces
The origin is like the root of the tree
Volume within \(r\) of the origin grows exponentially with \(r\)
Shortest paths between points far from the origin curve back towards the origin
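
Two of these claims can be checked numerically. The sketch below assumes the Poincaré disk model of the hyperbolic plane with curvature \(-1\) (a choice made here for illustration; the lecture does not fix a model), where the distance is \(\cosh^{-1}\left(1 + 2\|u-v\|^2 / ((1-\|u\|^2)(1-\|v\|^2))\right)\) and a disk of radius \(r\) has area \(2\pi(\cosh r - 1)\):

```python
import math

def poincare_distance(u, v):
    """Hyperbolic distance between two points of the open unit disk."""
    diff2 = (u[0] - v[0]) ** 2 + (u[1] - v[1]) ** 2
    nu2 = u[0] ** 2 + u[1] ** 2
    nv2 = v[0] ** 2 + v[1] ** 2
    return math.acosh(1.0 + 2.0 * diff2 / ((1.0 - nu2) * (1.0 - nv2)))

def hyperbolic_disk_area(r):
    """Area of a hyperbolic disk of radius r (curvature -1)."""
    return 2.0 * math.pi * (math.cosh(r) - 1.0)

# Area grows roughly like e^r: doubling the radius from 5 to 10
# multiplies the area by about 150, versus 4 in the Euclidean plane.
growth = hyperbolic_disk_area(10.0) / hyperbolic_disk_area(5.0)

# For two points near the boundary, detouring through the origin costs
# little extra, reflecting how geodesics bend towards the origin.
u, v = (0.9, 0.0), (0.0, 0.9)
direct = poincare_distance(u, v)
via_origin = poincare_distance(u, (0.0, 0.0)) + poincare_distance((0.0, 0.0), v)
```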

Geodesic paths in the hyperbolic disk

[Two figures, credited to D. Asta and M. C. Escher]

Use a hyperbolic latent space, with link probabilities decaying in distance, and you get (Krioukov et al. 2005):

Highly skewed degree distribution (higher degrees closer to origin)
Lots of clustering
Core-periphery structure

Inference

Lots of algorithmic work on embeddings that maximize particular likelihoods, minimize some distortion, etc.
HRH and related: MCMC for the posterior distribution of \(Z\)
- Consistency: who knows?
First proof that MLE is consistent: Shalizi & Asta forthcoming
- General metric spaces with not-too-complex isometry groups
- Presumes smooth, known link function
- No assumption on distribution of \(Z\)

The general picture

Each node gets an IID latent variable \(Z_i\)
\(\Prob{A_{ij} = 1|Z_i=u, Z_j=v} = w(u,v)\)
Edges are independent given \(Z\)
It turns out all exchangeable graph models take this form
- For details, take 781 in mini-2

Time permitting…

Some physics jargon

Analogy to magnetism; \(Z_i\) = "spin" of atom or molecule \(i\)
Nearby spins interact; all spins coupled to external magnetic fields
Energy ("Hamiltonian") of the state \(z\) has the form \[ h(z) = \sum_{i,j}{c_{ij}(z_i, z_j)} + \sum_{i}{r(z_i, \rho)} \]
\(\Prob{z} \propto e^{-\beta h(z)}\), with \(\beta=\) inverse temperature
- "Boltzmann distribution", "canonical ensemble" (= exponential family)
- Low temperature = low-energy states strongly preferred
- High temperature = all states tend towards being equally probable
Lowest-energy state = ground state = state of maximum likelihood
Free energy = energy that could be extracted, above thermal noise \(= -\beta^{-1}\log{\sum_{z}{e^{-\beta h(z)}}}\)
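
The temperature behaviour is easy to see on a toy state space. A minimal sketch, assuming a finite list of state energies; `boltzmann` is an illustrative name:

```python
import math

def boltzmann(energies, beta):
    """Boltzmann probabilities P(z) proportional to exp(-beta * h(z))."""
    e_min = min(energies)                # shift energies for numerical stability
    weights = [math.exp(-beta * (e - e_min)) for e in energies]
    total = sum(weights)
    return [w / total for w in weights]
```

With energies `[0.0, 1.0, 2.0]`, `beta = 0.0` (infinite temperature) gives the uniform distribution, while a large `beta` (low temperature) puts nearly all the mass on the ground state.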

Rarer approaches to SBM inference

Prior distributions over \(\mathbf{b}\), \(\rho\) and MCMC
- Priors are devices for smoothing, i.e., adding bias and reducing variance
- What might be good biases to have here? How would you know?
Simulation-based inference
- Simulate many networks from each candidate \(\mathbf{b}, \rho\)
- Compute summary statistics on simulations
- Adjust parameters to match observed graph

SBM Variant I: Degree-Corrected SBM

Each node gets a popularity \(\theta_i\)
Then edge probabilities follow \[ \Prob{A_{ij} = 1 | Z_i=r, Z_j=s, \theta_i, \theta_j} = \theta_i \theta_j b_{rs} \]
\(\theta\) helps account for broad degree distributions
\(\theta\) does nothing to explain those degree distributions

Degree-Corrected SBMs (cont'd.)

Math simplifies if we pretend \(A_{ij} \sim \mathrm{Poisson}(\theta_i \theta_j b_{Z_i Z_j})\)
- Little difference in distribution when means are \(\ll 1\)
Symmetry under "dilation"
- \(\theta_i \mapsto \theta_i/c\) for all \(i: Z_i = r\), with \(b_{rs} \mapsto c b_{rs}\) for \(s \neq r\) and \(b_{rr} \mapsto c^2 b_{rr}\), changes nothing
- \(\therefore\) impose one linear constraint on \(\theta_i\) per block
Fix \(\sum_{i: Z_i = r}{\theta_i} = 1\) for each block \(r\); then \(\hat{\theta}_i = k_i/\sum_{j: Z_j = Z_i}{k_j}\)
High-dimensional problem: the dimension of \(\theta\) grows with \(n\)!
- OK in a dense graph, with \(O(n)\) d.o.f. per \(\theta_i\)
- Standard asymptotics break down for sparse graphs
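
Under the sum-to-one constraint, the popularity estimate is each node's share of its block's total degree. A minimal sketch, assuming known block labels; `degree_corrected_theta` is an illustrative name, and returning 0.0 for a block with no edges at all is a convention chosen here, since the estimate is undefined in that case:

```python
def degree_corrected_theta(adj, z):
    """theta_hat_i = k_i / (total degree of the block containing node i)."""
    deg = [sum(row) for row in adj]
    block_total = {}
    for i, r in enumerate(z):
        block_total[r] = block_total.get(r, 0) + deg[i]
    return [
        deg[i] / block_total[z[i]] if block_total[z[i]] > 0 else 0.0
        for i in range(len(z))
    ]
```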

SBM Variant II: Mixed-Membership SBMs

Nodes don't have fixed-but-random \(Z_i\) any more
Instead, each node has a distribution \(\rho_i\) over \(1:k\)
When pairing with node \(j\), node \(i\) draws \(Z_{i(j)}\) from \(\rho_i\)
- Similarly node \(j\) draws \(Z_{j(i)}\) independently from \(\rho_j\)
Then look up edge probability from \(b_{Z_{i(j)} Z_{j(i)}}\)

Mixed-Membership SBMs (cont'd.)

Origin myth:
- Blocks = social roles
- \(\rho_i =\) distribution of \(i\)'s social life over different roles
- \(Z_{i(j)} =\) role \(i\) takes on when meeting \(j\)
Myth is random switching, not gradual transitions
OTOH, marginalize over \(Z_{i(j)}, Z_{j(i)}\): \[ \Prob{A_{ij}=1|\rho_i, \rho_j} = \sum_{r,s}{b_{rs} \rho_{ir} \rho_{js}} \]
Degree-corrected MMSBM left as exercise
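
The marginal edge probability is the bilinear form \(\rho_i^T \mathbf{b} \rho_j\). A minimal sketch with plain lists; `mmsbm_edge_prob` is an illustrative name:

```python
def mmsbm_edge_prob(rho_i, rho_j, b):
    """P(A_ij = 1 | rho_i, rho_j) = sum_{r,s} b[r][s] * rho_i[r] * rho_j[s]."""
    return sum(
        rho_i[r] * b[r][s] * rho_j[s]
        for r in range(len(rho_i))
        for s in range(len(rho_j))
    )
```

When both membership vectors put all their mass on single blocks \(r\) and \(s\), this reduces to \(b_{rs}\), so the ordinary SBM is recovered as a special case.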

Force-Directed Layout

Force-directed layout is a classic way to draw graphs:
- Each node is represented by a point in space
- Attractive forces between nodes with edges
- Repulsive forces between nodes without edges
- Run until equilibrium \(\equiv\) minimize total energy
Look at modularity by node pairs again:

\[ Q = \frac{1}{2m}\sum_{i,j}{\left[A_{ij} - \frac{k_i k_j}{2m}\right]\delta_{Z_i Z_j}} \]

Up to sign, this is an energy (maximizing \(Q\) = minimizing energy):
- "Attraction" between nodes with edges (if in same block)
- "Repulsion" between nodes without edges (if in same block)
- Modularity is actually a special case of the usual energy function for force-directed layout (Noack, 2009)

Going beyond classic continuous-latent-space models

Add covariates, etc., etc.
CLS conflates stochastic equivalence and homophily
- Homophily = preference for friends who are like you
- Stochastic equivalence = two nodes have the same link probabilities
- Diagonally-dominated SBMs also conflate these
Hoff (2007) introduces a more general model: replace \(-\| Z_i - Z_j \|\) with \(Z_i^T \mathbf{\Lambda} Z_j\) for some diagonal matrix \(\mathbf{\Lambda}\)
- Allows attraction on some dimensions but repulsion on others
- Allows for stochastic equivalence without homophily
- General SBM a special case
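
The effect of the sign of each diagonal entry can be seen in a small sketch. It assumes the logit link is kept, so that \(\logit{\Prob{A_{ij}=1|Z_i,Z_j}} = \beta_0 + Z_i^T \mathbf{\Lambda} Z_j\); `eigenmodel_edge_prob` is an illustrative name and `lam` holds the diagonal of \(\mathbf{\Lambda}\):

```python
import math

def eigenmodel_edge_prob(z_i, z_j, lam, beta0):
    """Edge probability with logit = beta0 + sum_d lam[d] * z_i[d] * z_j[d]."""
    eta = beta0 + sum(l * a * b for l, a, b in zip(lam, z_i, z_j))
    return 1.0 / (1.0 + math.exp(-eta))
```

Two nodes at the same position are more likely to link when the relevant entry of `lam` is positive (homophily) and less likely when it is negative, which is how the model separates stochastic equivalence from homophily.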