Structure Learning: Likelihood Scores

Using likelihood scores to find the best structure

The Likelihood Structure Scores lecture introduces likelihood scores. The overall concept is quite clear: we can evaluate how good a candidate graph is by using its structure to fit the data (with MLE, of course) and measuring the resulting likelihood.

What I find exciting here is the following:

  • one can see a strong connection between Machine Learning and Information Theory.
  • if two graphs have very similar scores, can we run inference in the sparser graph instead of the more complex one?
  • inference optimization can be replaced by another type of optimization, e.g. one focused on memory

Decomposable score theorem

The theorem states that:

\[\operatorname{score}_{L}(\mathcal{G}: \mathcal{D})=M \sum_{i=1}^{n} \mathbf{I}_{\hat{P}}\left(X_{i} ; \mathbf{P a}_{X_{i}}^{G}\right)-M \sum_{i=1}^{n} \boldsymbol{H}_{\hat{P}}\left(X_{i}\right)\]
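To make the formula concrete, here is a minimal sketch (not from the lecture) of how the decomposed score could be computed for discrete data. It assumes the data is an \(M \times n\) integer NumPy array and the graph is a plain dict mapping each node index to a list of parent indices; the helper names (`empirical_entropy`, `empirical_mutual_information`, `likelihood_score`) are my own illustrative choices.

```python
# A minimal sketch of the decomposed likelihood score, assuming discrete data
# in an (M, n) integer array and a graph given as {node: [parent indices]}.
# All names here are illustrative, not from the lecture.
import numpy as np
from collections import Counter


def empirical_entropy(columns: np.ndarray) -> float:
    """Plug-in entropy H_P̂ of the joint distribution of the given columns
    (shape (M, k)), using natural logarithms."""
    M = columns.shape[0]
    counts = Counter(map(tuple, columns))
    probs = np.array([c / M for c in counts.values()])
    return float(-np.sum(probs * np.log(probs)))


def empirical_mutual_information(x: np.ndarray, pa: np.ndarray) -> float:
    """I_P̂(X; Pa) = H(X) + H(Pa) - H(X, Pa); zero when the parent set is empty."""
    if pa.shape[1] == 0:
        return 0.0
    return (empirical_entropy(x) + empirical_entropy(pa)
            - empirical_entropy(np.hstack([x, pa])))


def likelihood_score(data: np.ndarray, parents: dict) -> float:
    """score_L(G : D) = M * sum_i I_P̂(X_i; Pa_i) - M * sum_i H_P̂(X_i)."""
    M, n = data.shape
    score = 0.0
    for i in range(n):
        x = data[:, [i]]
        pa = data[:, parents.get(i, [])]
        score += M * empirical_mutual_information(x, pa)
        score -= M * empirical_entropy(x)
    return score
```

With natural logarithms throughout, this quantity is exactly the log-likelihood of the data under the MLE parameters for the given structure.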

Property

Let’s assume that \(G_0\) is the empty graph, in which all variables are independent. Its likelihood score can be written as:

\[\operatorname{score}_{L}(\mathcal{G_0}: \mathcal{D})= - M \sum_{i=1}^{n} \boldsymbol{H}_{\hat{P}}\left(X_{i}\right)\]

Now let’s connect a pair of nodes (e.g. \(X_i \rightarrow X_j\)) and denote the new graph as \(G_1\). Its likelihood score will be:

\[\operatorname{score}_{L}(\mathcal{G_1}: \mathcal{D})=M \mathbf{I}_{\hat{P}}\left(X_{i};X_{j}\right)-M \sum_{i=1}^{n} \boldsymbol{H}_{\hat{P}}\left(X_{i}\right)\]

It can be seen that the single added edge contributes exactly \(M\mathbf{I}_{\hat{P}}\left(X_{i};X_{j}\right)\) to the score.

If \(X_i \perp\!\!\!\perp X_j\) under \(\hat{P}\), then \(\mathbf{I}_{\hat{P}}\left(X_{i};X_{j}\right) = 0\) and we get the same likelihood score as for graph \(G_0\).
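Continuing with the hypothetical `likelihood_score` sketch from above, we can check this property numerically on synthetic data: the score gain from adding the single edge \(X_0 \rightarrow X_1\) should be exactly \(M\,\mathbf{I}_{\hat{P}}(X_0; X_1)\).

```python
# Numerical check of the property, using the sketch above (illustrative only).
rng = np.random.default_rng(0)
M = 1000
x0 = rng.integers(0, 2, size=M)
x1 = np.where(rng.random(M) < 0.8, x0, 1 - x0)  # X1 copies X0 80% of the time
data = np.column_stack([x0, x1])

g0 = {}          # empty graph G0: all variables independent
g1 = {1: [0]}    # G1: single edge X0 -> X1

gain = likelihood_score(data, g1) - likelihood_score(data, g0)
mi = empirical_mutual_information(data[:, [1]], data[:, [0]])
print(gain, M * mi)  # the two numbers coincide
```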

Does direction play a role?

Edge direction by itself doesn’t play any role: mutual information is symmetric, so reversing \(X_i \rightarrow X_j\) into \(X_j \rightarrow X_i\) leaves the score unchanged. But if a reversal changes some node’s parent set (!), the likelihood score changes.
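A quick check with the same hypothetical sketch (reusing `data` and `likelihood_score` from the blocks above): since \(\mathbf{I}_{\hat{P}}(X_0; X_1) = \mathbf{I}_{\hat{P}}(X_1; X_0)\), the score is the same whichever way the single edge points.

```python
# Edge direction alone does not change the score: I_P̂ is symmetric,
# so X0 -> X1 and X1 -> X0 give identical likelihood scores here.
assert np.isclose(likelihood_score(data, {1: [0]}),   # X0 -> X1
                  likelihood_score(data, {0: [1]}))   # X1 -> X0
```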

Interesting facts

For a node without parents the mutual-information term vanishes: \(\mathbf{I}_{\hat{P}}\left(X_{i};\emptyset\right) = 0\), so such a node contributes only its entropy term \(-M \boldsymbol{H}_{\hat{P}}\left(X_{i}\right)\) to the score.

Last update: 16 August 2020