Belief Propagation

一个图模型包含 NN 个随机变量 x=(x1,,xN)\underline{x}=(x_1,\dots,x_N),从有限字母表 X\mathcal{X} 中取值. 需要解决一些问题,例如:

但是我们不会去直接计算联合分布 p(x)p(\underline{x}),因为它的状态空间是指数级别 XN|\mathcal{X}|^N 个取值. BP 将这个问题转化为局部计算,利用图的稀疏性/特殊结构来避免指数级别的计算.

Example 1: Ising chain

Ferromagnetic Ising model has σ=(σ1,,σN),σi{1,1}\underline{\sigma}=(\sigma_1,\dots, \sigma_N), \sigma_i\in\{1,-1\} with joint distribution

μβ(σ)=1Zexp(βi=1N1σiσi+1+βBi=1Nσi)\mu_\beta(\underline{\sigma}) = \frac{1}{Z} \exp\left(\beta \sum_{i=1}^{N-1} \sigma_i \sigma_{i+1}+\beta B\sum_{i=1}^N\sigma_i\right)

where ZZ is the partition function.

我们计算 σj\sigma_j 的 marginal distribution, 这里用 \propto 来省略归一化因子:

μ(σj)σ1,,σj1,σj+1,,σNexp(βi=1N1σiσi+1+βBi=1Nσi)=σ1,,σj1,σj+1,,σNexp(βi=1j1σiσi+1+βBi=1j1σi)exp(βi=jN1σiσi+1+βBi=j+1Nσi)exp(βBσj)\mu(\sigma_j)\propto\sum_{\sigma_1,\dots,\sigma_{j-1},\sigma_{j+1},\dots,\sigma_{N}}\exp\left(\beta \sum_{i=1}^{N-1} \sigma_i \sigma_{i+1}+\beta B\sum_{i=1}^N\sigma_i\right) \\=\sum_{\sigma_1,\dots,\sigma_{j-1},\sigma_{j+1},\dots,\sigma_{N}}\exp\left(\beta \sum_{i=1}^{j-1} \sigma_i \sigma_{i+1}+\beta B\sum_{i=1}^{j-1}\sigma_i\right)\exp\left(\beta\sum_{i=j}^{N-1} \sigma_i \sigma_{i+1}+\beta B\sum_{i=j+1}^{N}\sigma_i\right)\exp\left(\beta B\sigma_{j}\right)

分配拆开即可,是两个和的乘积. 定义 "message":

This decomposition is interesting because the various messages can be computed iteratively.

而这两个部分都是 local 信息,左边的求和和右边无关. 例如 ν^1(σ1)\hat{\nu}_{\to1}(\sigma_1)ν^n(σn)\hat{\nu}_{\leftarrow n}(\sigma_n) 就是均匀二项分布. 这样就可以递推求出各个分布了.

ν^i(σi)σi1ν^(i1)(σi1)exp ⁣(βσi1σi+βBσi1)\hat{\nu}_{\to i}(\sigma_i) \propto \sum_{\sigma_{i-1}} \hat{\nu}_{\to(i-1)}(\sigma_{i-1}) \exp\!\left( \beta \, \sigma_{i-1}\sigma_i +\beta B \sigma_{i-1}\right)

ν^i(σi)σi+1ν^(i+1)(σi+1)exp ⁣(βσiσi+1+βBσi+1)\hat{\nu}_{\leftarrow i}(\sigma_i) \propto \sum_{\sigma_{i+1}} \hat{\nu}_{\leftarrow(i+1)}(\sigma_{i+1}) \exp\!\left( \beta \, \sigma_{i}\sigma_{i+1} +\beta B \sigma_{i+1}\right)

Example 2. a tree-parity-check code

We start from a concrete example.

左图是这个方程组的 factor graph.

已知纠错矩阵,将一个 xx 通过 BSC(pp) 信道传输(每一 bit 以 pp 的概率翻转),接收方得到 yy. 考虑 distribution conditioned on yy,令 Q(ii)=1p,Q(ij)=p,  ijQ(i|i)=1-p,Q(i|j)=p,\forall\;i\neq j.

我们可以逐 bit 分析,需要在给定 yy 的情况下计算每个 xix_i 的 marginal distribution. 假设 y=(1000010)y=(1000010).

μ(x1)x0x2,,x6I(x0x1x2=0)I(x0x3x4=0)I(x0x5x6=0)i=06Q(yixi)=x0,x2I(x0x1x2=0)Q(y0x0)Q(y2x2)×x3,x4I(x0x3x4=0)Q(y3x3)Q(y4x4)×x5,x6I(x0x5x6=0)Q(y5x5)Q(y6x6)×Q(y1x1)=Q(y1x1)x0,x2I(x0x1x2=0)Q(y0x0)Q(y2x2)ν^b0(x0)ν^c0(x0)\mu(x_1)\propto\sum_{x_0x_2,\dots,x_6} \mathbb{I}(x_{0} \oplus x_{1} \oplus x_{2} = 0)\, \mathbb{I}(x_{0} \oplus x_{3} \oplus x_{4} = 0)\, \mathbb{I}(x_{0} \oplus x_{5} \oplus x_{6} = 0)\, \prod_{i=0}^{6} Q(y_i | x_i) \,\\ =\sum_{x_0,x_2}\mathbb{I}(x_{0} \oplus x_{1} \oplus x_{2} = 0)Q(y_0|x_0)Q(y_2|x_2)\\\times \sum_{x_3,x_4}\mathbb{I}(x_{0} \oplus x_{3} \oplus x_{4} = 0)Q(y_3|x_3)Q(y_4|x_4)\\\times \sum_{x_5,x_6}\mathbb{I}(x_{0} \oplus x_{5} \oplus x_{6} = 0)Q(y_5|x_5)Q(y_6|x_6)\\\times Q(y_1 | x_1) \, \\ = Q(y_1|x_1) \sum_{x_0,x_2}\mathbb{I}(x_{0} \oplus x_{1} \oplus x_{2} = 0)Q(y_0|x_0)Q(y_2|x_2)\hat{\nu}_{b\to0}(x_0)\hat{\nu}_{c\to0}(x_0)

μ(x0)x1,,x6I(x0x1x2=0)I(x0x3x4=0)I(x0x5x6=0)i=06Q(yixi)=x1,x2I(x0x1x2=0)Q(y1x1)Q(y2x2)×x3,x4I(x0x3x4=0)Q(y3x3)Q(y4x4)×x5,x6I(x0x5x6=0)Q(y5x5)Q(y6x6)×Q(y0x0)\mu(x_0)\propto\sum_{x_1,\dots,x_6} \mathbb{I}(x_{0} \oplus x_{1} \oplus x_{2} = 0)\, \mathbb{I}(x_{0} \oplus x_{3} \oplus x_{4} = 0)\, \mathbb{I}(x_{0} \oplus x_{5} \oplus x_{6} = 0)\, \prod_{i=0}^{6} Q(y_i | x_i) \, \\ =\sum_{x_1,x_2}\mathbb{I}(x_{0} \oplus x_{1} \oplus x_{2} = 0)Q(y_1|x_1)Q(y_2|x_2)\\\times \sum_{x_3,x_4}\mathbb{I}(x_{0} \oplus x_{3} \oplus x_{4} = 0)Q(y_3|x_3)Q(y_4|x_4)\\\times \sum_{x_5,x_6}\mathbb{I}(x_{0} \oplus x_{5} \oplus x_{6} = 0)Q(y_5|x_5)Q(y_6|x_6)\\\times Q(y_0 | x_0) \,

乘积的第一项,就是指 x0x_0{0,a,1,2}\{0,a,1,2\} 子图中的 marginal distribution, 它和其他节点无关. 可以引入 ν^a0(x0)x1,x2I(x0x1x2=0)Q(y1x1)Q(y2x2)\hat{\nu}_{a\to0}(x_0)\propto\sum_{x_1,x_2}\mathbb{I}(x_{0} \oplus x_{1} \oplus x_{2} = 0)Q(y_1|x_1)Q(y_2|x_2) 等等,得到

μ(x1)Q(y1x1)ν^a1(x1)\mu(x_1)\propto Q(y_1|x_1)\hat{\nu}_{a\to1}(x_1)

μ(x0)Q(y0x0)ν^a0(x0)ν^b0(x0)ν^c0(x0)\mu(x_0)\propto Q(y_0|x_0)\hat{\nu}_{a\to0}(x_0)\hat{\nu}_{b\to0}(x_0)\hat{\nu}_{c\to0}(x_0)

Belief Propagation on Trees

In this section,

μ(x)=1Zψa(xa)\mu(\underline{x})=\frac{1}{Z}\psi_a(\underline{x}_{\partial a})

where ψa\psi_a has input in {xi:ia}\{x_i:i\in\partial a\}, i.e. the neighborhood of aa. Similar sets are defined for notation i\partial i.

We have BP rules:

νja(xj)    bjaν^bj(xj)ν^aj(xj)    xajψa(xa)kajνka(xk).\nu_{j \to a}(x_j) \;\cong\; \prod_{b \in \partial j \setminus a} \hat{\nu}_{b \to j}(x_j)\\ \hat{\nu}_{a \to j}(x_j) \;\cong\; \sum_{\underline{x}_{\partial a \setminus j}} \psi_a(\underline{x}_{\partial a}) \prod_{k \in \partial a \setminus j} \nu_{k \to a}(x_k).

Simple computation is able to show that the example of parity check code is consistent with the equations above.

If jj has only 11 neighbor aa, then νja(xj)1\nu_{j \to a}(x_j)\propto1, thus uniform.

Let us verify ν^a0(x0)x1,x2I(x0x1x2=0)Q(y1x1)Q(y2x2)\hat{\nu}_{a\to0}(x_0)\propto\sum_{x_1,x_2}\mathbb{I}(x_{0} \oplus x_{1} \oplus x_{2} = 0)Q(y_1|x_1)Q(y_2|x_2):

LHS=x1,x2I(x0x1x2=0)Q(y1x1)Q(y2x2)k=1,2νka(xk)\text{LHS}=\sum_{x_1,x_2}\mathbb{I}(x_{0} \oplus x_{1} \oplus x_{2} = 0)Q(y_1|x_1)Q(y_2|x_2)\prod_{k=1,2}\nu_{k\to a}(x_k)

A graph: and a clarifying lecnote:

实际上,以上的 BP 应该按照迭代的方式理解.

这个有点像 Bellmann-ford, 明明我们要对整体进行分析(比如计算边缘分布)但是我们可以对局部(也就是每一条边)进行迭代得到结果.

We have converging theorem:

Correlations and Energy

现在我们更进一步,想要计算某两个变量的联合边缘分布 μij(xi,xj)\mu_{ij}(x_i,x_j). 因为我们可以计算 μ(xi)\mu(x_i), 只需计算 Pxμ(xj=xjxi=xi)\mathbb{P}_{\underline{x}\sim\mu}(\underline{x}_j=x_j|\underline{x}_i=x_i) 即可. (注意这里稍微有点符号混用,你可以将 x\underline{x} 看成一个随机变量). 可以定义

μ(xxi=b)a=1Mψ(xa)I(xi=b).\mu(\underline{x}|x_i=b)\propto\prod_{a=1}^M\psi(\underline{x}_{\partial a})\mathbb{I}(x_i=b).

其实相当于加一条约束 xi=bx_i=b, 在 factor graph 中的体现为 ii 号节点挂出去了一个方块节点.

FRF_R 是某个由方块节点构成的集合(也就是由约束节点构成),VR=FRV_R=\partial F_R 是和 FRF_R 相连的变量节点,其中 RR 代表诱导子图. 令 xR\underline{x}_R 代表 RR 中变量构成的联合向量. 不妨设 RR 是连通图.

根据 message 的物理意义,可以计算 xR\underline{x}_R 的边缘联合分布,它由内部因子 ϕa(aFR)\phi_a(a\in F_R) 和外部 message (相当于对子树代表的变量积分) 构成:

μ(xR)aFRψa(xa)aRν^ai(a)(xi(a))\mu(\underline{x}_R)\propto\prod_{a\in F_R}\psi_a(\underline{x}_{\partial a})\prod_{a\in\partial R}\hat{\nu}_{a\to i(a)}(x_{i(a)})

其中 i(a)i(a) 表示 aaRR 中的唯一邻居(如果有两个邻居就总会成环,可以对连通图 RR 证明这一点).

Internal Energy

In physics problems, the compatibility functions ψa\psi_a take the form ψa(xa)=eβEa(xa)\psi_a(\underline{x}_{\partial a})=e^{-\beta E_a(\underline{x}_{\partial a})}.

The internal Energy is defined to be the expectation of total energy among all possible configurations x\underline{x}.

U=E=xμ(x)a=1MEa(xa)=1βxμ(x)a=1Mlogψa(xa)U=\langle E\rangle=\sum_{\underline{x}}\mu(\underline{x})\sum_{a=1}^ME_a(\underline{x}_{\partial a})=-\frac{1}{\beta}\sum_{\underline{x}}\mu(\underline{x})\sum_{a=1}^M\log \psi_a(\underline{x}_{\partial a})

Exchange the summing order, with fixed aa our aim is to compute

xμ(x)logψa(xa)=xaxiaμ(x)logψa(xa)=xaμ(xa)logψa(xa)\sum_{\underline{x}}\mu(\underline{x})\log\psi_a(\underline{x}_{\partial a})\\ =\sum_{\underline{x}_{\partial a}}\sum_{\underline{x}_{i\notin\partial a}}\mu(\underline{x})\log\psi_a(\underline{x}_{\partial a})\\ =\sum_{\underline{x}_{\partial a}}\mu(\underline{x}_{\partial a})\log\psi_a(\underline{x}_{\partial a})

这里我们要算 μ(xa)\mu(\underline{x}_{\partial a}), 相当于在上一节中我们取 FR={a}F_R=\{a\}, 得到最终结果

U=a=1M1Zaxa(ψa(xa)logψa(xa)iaνia(xi))U=-\sum_{a=1}^M\frac{1}{Z_a}\sum_{\underline{x}_{\partial a}}\left(\psi_a(\underline{x}_{\partial a})\log\psi_a(\underline{x}_{\partial a})\prod_{i\in\partial a}\nu_{i\to a}(x_i)\right)

Entropy

First, a surprising Thm:

这样一来就可以算 H(x)=logμ(x)μH(\underline{x})=\langle\log\mu(\underline{x})\rangle_\mu 了.

有自由熵 Φ=HU\Phi=H-U.(第一眼量纲不对;在热统中,有定义 Helmholtz 自由能 F=UTSF=U-TS,实际上这个差了一个负号和一些因数:Φ=βF=SkBβU=HβU\Phi=-\beta F=\frac{S}{k_B}-\beta U=H-\beta U, 用 β=1\beta=1 简化).


参考资料