# Conditional relative information and its axiomatizations

In this post, we will study the conditional form of relative information. We will also look at how conditional relative information can be axiomatized and extended to non-real-valued measures.

This post is a continuation from our series on spiking networks, path integrals and motivic information.

## What is conditional relative information?

Suppose we have two random variables $X, Y$ and probability measures $P, Q$ on $\Omega$. We are interested in how far the model conditional $P_{Y|X}$ is from the true conditional $Q_{Y|X}$ on average over $Q_X$, and we want to ignore the model marginal $P_X$.

Let $P_{XY}, Q_{XY}$ be the induced joint distributions for $X, Y$. We first construct a distribution $R_{XY}$ which has the same conditional distribution $R_{Y|X} = P_{Y|X}$ as the model but has a marginal $R_X = Q_X$ equal to that of the true distribution [Gra11]. Namely,

$$R_{XY}(F \times G) = \int_F P_{Y|X}(G \,|\, x) \, dQ_X(x)$$

where $F$ and $G$ are measurable sets over the state spaces of $X$ and $Y$ respectively.
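For finite state spaces, this construction amounts to pairing the model's conditional table with the truth's marginal. Below is a minimal sketch in Python; the joint tables `q` and `p` are hypothetical numbers chosen only for illustration.

```python
# Hypothetical joint distributions over (x, y): rows indexed by x, columns by y.
q = [[0.30, 0.20],   # true joint Q_XY
     [0.10, 0.40]]
p = [[0.10, 0.30],   # model joint P_XY
     [0.35, 0.25]]

def marginal_x(joint):
    """Marginal over x: sum each row."""
    return [sum(row) for row in joint]

def conditional_y_given_x(joint):
    """Conditional table: joint(x, y) / marginal(x)."""
    mx = marginal_x(joint)
    return [[joint[x][y] / mx[x] for y in range(len(joint[x]))]
            for x in range(len(joint))]

# R_XY(x, y) = P(y|x) * Q_X(x): the model's conditional, the truth's marginal.
p_cond = conditional_y_given_x(p)
qx = marginal_x(q)
r = [[p_cond[x][y] * qx[x] for y in range(2)] for x in range(2)]
```

By construction, `r` has marginal `qx` and conditional `p_cond`, which is exactly what the integral formula above produces in the discrete case.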

We then define the conditional relative information to be

$$I_{Q \Vert P}(Y|X) = I_{Q_{XY} \Vert R_{XY}}.$$

In the case where the corresponding densities are well-defined, we have

$$I_{Q \Vert P}(Y|X) = \int \left[ \int q(y|x) \log \frac{q(y|x)}{p(y|x)} \, dy \right] q(x) \, dx = \iint q(y,x) \log \frac{q(y|x)}{p(y|x)} \, dy \, dx,$$

which is the relative information to $q(y|x)$ from $p(y|x)$ averaged over $q(x)$.
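The two descriptions agree: the definition via $R_{XY}$ and the averaged density formula give the same number. Here is a small numerical sketch with hypothetical discrete joint tables.

```python
import math

q = [[0.30, 0.20],   # true joint q(x, y), hypothetical numbers
     [0.10, 0.40]]
p = [[0.10, 0.30],   # model joint p(x, y), hypothetical numbers
     [0.35, 0.25]]

def marg_x(joint):
    return [sum(row) for row in joint]

def cond(joint):
    mx = marg_x(joint)
    return [[joint[x][y] / mx[x] for y in range(2)] for x in range(2)]

qx, qc, pc = marg_x(q), cond(q), cond(p)

# Averaged density formula: sum_{x,y} q(x, y) log( q(y|x) / p(y|x) )
direct = sum(q[x][y] * math.log(qc[x][y] / pc[x][y])
             for x in range(2) for y in range(2))

# Via R: r(x, y) = p(y|x) q(x), then take I to Q_XY from R_XY
r = [[pc[x][y] * qx[x] for y in range(2)] for x in range(2)]
via_r = sum(q[x][y] * math.log(q[x][y] / r[x][y])
            for x in range(2) for y in range(2))
```

The equality is exact, since $q(x,y) / (p(y|x)\,q(x)) = q(y|x)/p(y|x)$ term by term.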

## What is the chain rule for conditional relative information?

In statistics and machine learning, we often think of $Q$ as a true distribution that we are trying to uncover and $P$ as a model distribution for approximating $Q$. The relative information $I_{Q \Vert P}$ measures how far the model is from the truth.

To uncover the truth, it makes strategic sense to study different facets $X, Y$ of reality, and to build up reality one facet at a time. For example, we may want to know how far our model is from reality in modeling $X$ and focus on modeling $X$, before moving on to what our model says about both $X$ and $Y$. The chain rule of conditional relative information says that the divergence of our model from reality for $Y, X$ is simply the sum of the divergences for $X$ and for $Y|X$:

$$I_{Q \Vert P}(Y,X) = I_{Q \Vert P}(Y|X) + I_{Q \Vert P}(X). \tag{CR}$$

Therefore, to get a good model of $X, Y$, we could attempt to minimize the divergences for $X$ and for $Y|X$ in parallel.
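The chain rule is easy to verify numerically for finite distributions. The sketch below uses hypothetical joint tables and checks that the joint divergence splits into the marginal term plus the conditional term computed directly.

```python
import math

q = [[0.30, 0.20], [0.10, 0.40]]   # true joint, hypothetical numbers
p = [[0.10, 0.30], [0.35, 0.25]]   # model joint, hypothetical numbers

def rel_info(a, b):
    """Relative information to a from b, for parallel lists of weights."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

qf = [q[x][y] for x in range(2) for y in range(2)]   # flattened joint Q
pf = [p[x][y] for x in range(2) for y in range(2)]   # flattened joint P
qx = [sum(row) for row in q]                         # marginal Q_X
px = [sum(row) for row in p]                         # marginal P_X

i_joint = rel_info(qf, pf)   # I_{Q||P}(Y, X)
i_marg = rel_info(qx, px)    # I_{Q||P}(X)

# Conditional term I_{Q||P}(Y|X), computed directly from the conditionals:
i_cond = sum(q[x][y] * math.log((q[x][y] / qx[x]) / (p[x][y] / px[x]))
             for x in range(2) for y in range(2))
```

Algebraically the identity is exact: subtracting the marginal log-ratio from the joint log-ratio leaves the conditional log-ratio.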

## How do we derive conditional entropy from conditional relative information?

Just as the entropy of a random variable $X$ with distribution $P_X$ can be defined as the relative information to the dependent distribution $P_{XX}$ from the independent distribution $P_X \times P_X$, we will do the same for conditional entropy.

Given random variables $X, Y$ with joint distribution $P_{XY}$, we define the conditional entropy of $Y|X$ as

$$H(Y|X) = I_{P_{XY,XY} \Vert P_{XY} \times P_{XY}}(Y|X),$$

the conditional relative information of $Y|X$ to the dependent distribution $P_{XY,XY}$ from the independent distribution $P_{XY} \times P_{XY}$.

According to the chain rule of conditional relative information,

$$I_{P_{XY,XY} \Vert P_{XY} \times P_{XY}}(Y,X) = I_{P_{XY,XY} \Vert P_{XY} \times P_{XY}}(Y|X) + I_{P_{XY,XY} \Vert P_{XY} \times P_{XY}}(X).$$

By the definition of entropy in our introduction, the first and third terms are the entropies $H(Y,X)$ and $H(X)$ respectively, while the second term is the conditional entropy $H(Y|X)$. Thus, we recover the classical chain rule for conditional entropy

$$H(Y,X) = H(Y|X) + H(X).$$
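For a discrete joint distribution, this chain rule has a concrete fiberwise reading: $H(Y|X)$ is the entropy of each conditional $P_{Y|X=x}$ weighted by $P_X(x)$. A small sketch with hypothetical numbers:

```python
import math

p = [[0.30, 0.20],   # joint P_XY, hypothetical numbers
     [0.10, 0.40]]

def H(weights):
    """Shannon entropy of a list of probability weights."""
    return -sum(w * math.log(w) for w in weights if w > 0)

h_joint = H([p[x][y] for x in range(2) for y in range(2)])   # H(Y, X)
px = [sum(row) for row in p]
h_x = H(px)                                                  # H(X)

# H(Y|X): entropy of each fiber P_{Y|X=x}, weighted by P_X(x)
h_cond = sum(px[x] * H([p[x][y] / px[x] for y in range(2)])
             for x in range(2))
```

The identity `h_joint == h_cond + h_x` holds exactly (up to floating point), since the weighted fiber entropies telescope against the marginal term.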

## Is there an axiomatization of conditional entropy?

As described in our previous post, we allow the total measures of $P, Q$ to be different from one, but we require their total measures to be the same.

We start with an axiomatization of conditional entropy, with the hope of deriving an axiomatization of conditional relative information. I like the following categorical view of conditional entropy [BFL11]. I’ve taken the liberty of rewriting it in our notation.

Given a measured space $(\Omega, \mathcal{B}, P)$, a finite measurable function $Y: \Omega \to S_Y$ and a morphism $f: S_Y \to S_X$ between finite sets, let $X = f \circ Y$ and let $P_Y, P_X$ be the induced measures on $S_Y, S_X$.

In this case, the conditional entropy of Y|X is

$$\begin{aligned} H(Y|X) &= H(Y,X) - H(X) = H(Y) - H(X) \\ &= -T \sum_y \bar P_Y(y) \log \bar P_Y(y) + T \sum_x \bar P_X(x) \log \bar P_X(x) \\ &= -\sum_y P_Y(y) \log P_Y(y) + \sum_x P_X(x) \log P_X(x) \end{aligned}$$

where $T$ is the total measure of $P$, and $\bar P_X = P_X/T$ and $\bar P_Y = P_Y/T$ are probability measures. The equality $H(Y,X) = H(Y)$ holds because $X = f \circ Y$ is determined by $Y$, and the $T \log T$ terms cancel in the last equality.
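We can sanity-check the two expressions on an unnormalized measure, say with total measure $T = 2$. The measure and the map $f$ below are hypothetical choices for illustration; the point is that the normalized expression (with the factor $T$) and the unnormalized one agree.

```python
import math

# Hypothetical unnormalized measure P_Y on S_Y = {0, 1, 2, 3}, total T = 2.
py = [0.8, 0.2, 0.4, 0.6]
T = sum(py)

# f : S_Y -> S_X collapses {0, 1} -> 0 and {2, 3} -> 1, and X = f . Y.
f = [0, 0, 1, 1]
px = [0.0, 0.0]
for y, w in enumerate(py):
    px[f[y]] += w

def H(weights):
    """Entropy of an unnormalized measure: -T * sum of p_bar log p_bar."""
    t = sum(weights)
    return -t * sum((w / t) * math.log(w / t) for w in weights if w > 0)

h_normalized = H(py) - H(px)                      # with the factor T
h_raw = (-sum(w * math.log(w) for w in py)
         + sum(w * math.log(w) for w in px))      # unnormalized form
```

Both expressions compute $H(Y|X)$; the $T \log T$ contributions from $H(Y)$ and $H(X)$ cancel against each other.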

Given two measured spaces $(\Omega_1, \mathcal{B}_1, P_1)$ and $(\Omega_2, \mathcal{B}_2, P_2)$, let $(\Omega_1 \sqcup \Omega_2, \mathcal{B}_1 \oplus \mathcal{B}_2, P_1 \oplus P_2)$ be their direct sum. Here, $\Omega_1 \sqcup \Omega_2$ is the disjoint union, and $S \in \mathcal{B}_1 \oplus \mathcal{B}_2$ if and only if $S \cap \Omega_1 \in \mathcal{B}_1$ and $S \cap \Omega_2 \in \mathcal{B}_2$. The measure of $S$ is the sum of that of $S \cap \Omega_1$ and of $S \cap \Omega_2$.

Let $F_P$ be a family of maps indexed by measures $P$, such that each $F_P$ sends morphisms $f: S_Y \to S_X$ between finite sets to $[0, \infty)$. Suppose that the family $F_P$ satisfies the following four axioms.

  1. Functoriality. $F_P(f \circ g) = F_P(f) + F_P(g)$

  2. Homogeneity. $F_{\lambda P}(f) = \lambda F_P(f)$

  3. Additivity. $F_{P_1 \oplus P_2}(f_1 \oplus f_2) = F_{P_1}(f_1) + F_{P_2}(f_2)$

  4. Continuity. $F$ is continuous

Then, $F_P(f)$ must be $c\,H(Y|X)$ for some constant $c \geq 0$.

Given a classical conditional entropy $H(Y|X)$, we can now write this as $F_P(f)$ where $f$ is the projection $(y, x) \mapsto x$.

The nice thing about the above categorical axiomatization of conditional entropy is that it fits into the view where the objects of study are spaces $E, B$ and fibrations $\pi: E \to B$ equipped with measures. The conditional entropy is the sum of the entropies of the fibers $\pi^{-1}(b)$ weighted by $P_B(b)$.

## Is there an axiomatization of conditional relative information?

We prefer to work with conditional relative information rather than conditional entropy. Its axiomatization should tell us how it behaves with respect to products and coproducts of the measures being compared.

Our axioms. Note the addition of the product rule. I’m not sure if the product axiom can be derived from the others when the state spaces are not finite. Perhaps it will follow from continuity and the fact that the limit of coproducts is the product of limits.

  1. Functoriality. $G_{Q \Vert P}(f \circ g) = G_{Q \Vert P}(f) + G_{Q \Vert P}(g)$

  2. Homogeneity. $G_{\lambda Q \Vert \lambda P}(f) = \lambda G_{Q \Vert P}(f)$

  3. Coproduct. $G_{Q_1 \oplus Q_2 \Vert P_1 \oplus P_2}(f_1 \oplus f_2) = G_{Q_1 \Vert P_1}(f_1) + G_{Q_2 \Vert P_2}(f_2)$

  4. Product. $G_{Q_1 \times Q_2 \Vert P_1 \times P_2}(f_1 \times f_2) = T_2\, G_{Q_1 \Vert P_1}(f_1) + T_1\, G_{Q_2 \Vert P_2}(f_2)$

  5. Continuity. $G$ is continuous.

Here, $T_1$ and $T_2$ are the total measures of $Q_1$ and $Q_2$ respectively.
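The product axiom can be checked numerically when $G_{Q \Vert P}(f)$ is taken to be the conditional relative information $I_{Q \Vert P}(Y|X)$ with $X = f \circ Y$ on finite sets. The weights below are hypothetical, chosen so that each $Q_i$ and $P_i$ have equal total measures as required.

```python
import math

def G(q, p, f):
    """Conditional relative information G_{Q||P}(f) for unnormalized
    weights q, p on S_Y and a map f given as a list S_Y -> S_X."""
    nx = max(f) + 1
    qx, px = [0.0] * nx, [0.0] * nx
    for y in range(len(f)):
        qx[f[y]] += q[y]
        px[f[y]] += p[y]
    return sum(q[y] * math.log((q[y] / qx[f[y]]) / (p[y] / px[f[y]]))
               for y in range(len(f)))

# Hypothetical unnormalized weights with matching totals (T1 = 2, T2 = 3).
q1, p1, f1 = [0.9, 0.1, 0.5, 0.5], [0.4, 0.6, 0.7, 0.3], [0, 0, 1, 1]
q2, p2, f2 = [1.5, 0.5, 1.0], [1.0, 1.0, 1.0], [0, 0, 1]
T1, T2 = sum(q1), sum(q2)

# Product measures and the product map on S_{Y1} x S_{Y2}.
qp = [a * b for a in q1 for b in q2]
pp = [a * b for a in p1 for b in p2]
nx2 = max(f2) + 1
fp = [f1[y1] * nx2 + f2[y2] for y1 in range(len(f1)) for y2 in range(len(f2))]

lhs = G(qp, pp, fp)
rhs = T2 * G(q1, p1, f1) + T1 * G(q2, p2, f2)
```

The identity holds because the log-ratio of a product splits into a sum, and summing each term over the other factor contributes the total measure $T_2$ or $T_1$.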

The axioms for conditional entropy follow immediately from these axioms for conditional relative information, because we can write

$$F_{P_X}(f) = G_{T P_{XX} \Vert P_X \times P_X}(f)$$

where $T$ is the total measure of $P_X$.

## References

[BFL11] John C. Baez, Tobias Fritz, and Tom Leinster. A characterization of entropy in terms of information loss. *Entropy*, 13(11):1945–1957, 2011.

[Gra11] Robert M. Gray. *Entropy and Information Theory*. Springer Science & Business Media, 2011.