# Conditional relative information and its axiomatizations

In this post, we will study the conditional form of relative information. We will also look at how conditional relative information can be axiomatized and extended to non-real-valued measures.

This post is a continuation from our series on spiking networks, path integrals and motivic information.

## What is conditional relative information?

Suppose we have two random variables $X, Y$ and probability measures $P, Q$ on $\Omega$. We are interested in how far the model conditional $P_{Y|X}$ is from the true conditional $Q_{Y|X}$ on average over $Q_X$, and we want to ignore the model marginal $P_X$.

Let $P_{XY}, Q_{XY}$ be the induced joint distributions for $X, Y$. We first construct a distribution $R_{XY}$ which has the same conditional distribution $R_{Y|X} = P_{Y|X}$ as the model but has a marginal $R_X = Q_X$ equal to that of the true distribution [Gra11]. Namely,

$$R_{XY}(F \times G) = \int_F P_{Y|X}(G \,|\, x) \, dQ_X(x)$$

where $F$ and $G$ are measurable sets over the state spaces of $X$ and $Y$ respectively.
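For finite state spaces, this construction amounts to pairing the model's conditional table with the truth's marginal. Below is a minimal sketch in Python; the joint tables `q` and `p` are hypothetical numbers chosen only for illustration.

```python
# Hypothetical joint distributions over (x, y): rows indexed by x, columns by y.
q = [[0.30, 0.20],   # true joint Q_XY
     [0.10, 0.40]]
p = [[0.10, 0.30],   # model joint P_XY
     [0.35, 0.25]]

def marginal_x(joint):
    """Marginal over x: sum each row."""
    return [sum(row) for row in joint]

def conditional_y_given_x(joint):
    """Conditional table: joint(x, y) / marginal(x)."""
    mx = marginal_x(joint)
    return [[joint[x][y] / mx[x] for y in range(len(joint[x]))]
            for x in range(len(joint))]

# R_XY(x, y) = P(y|x) * Q_X(x): the model's conditional, the truth's marginal.
p_cond = conditional_y_given_x(p)
qx = marginal_x(q)
r = [[p_cond[x][y] * qx[x] for y in range(2)] for x in range(2)]
```

By construction, `r` has marginal `qx` and conditional `p_cond`, which is exactly what the integral formula above produces in the discrete case.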

We then define the conditional relative information to be

$$I_{Q \Vert P}(Y|X) = I_{Q_{XY} \Vert R_{XY}}.$$

In the case where the corresponding densities are well-defined, we have

$$I_{Q \Vert P}(Y|X) = \int \left[ \int q(y|x) \log \frac{q(y|x)}{p(y|x)} \, dy \right] q(x) \, dx = \iint q(y,x) \log \frac{q(y|x)}{p(y|x)} \, dy \, dx,$$

which is the relative information to $q(y|x)$ from $p(y|x)$ averaged over $q(x)$.
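The two descriptions agree: the definition via $R_{XY}$ and the averaged density formula give the same number. Here is a small numerical sketch with hypothetical discrete joint tables.

```python
import math

q = [[0.30, 0.20],   # true joint q(x, y), hypothetical numbers
     [0.10, 0.40]]
p = [[0.10, 0.30],   # model joint p(x, y), hypothetical numbers
     [0.35, 0.25]]

def marg_x(joint):
    return [sum(row) for row in joint]

def cond(joint):
    mx = marg_x(joint)
    return [[joint[x][y] / mx[x] for y in range(2)] for x in range(2)]

qx, qc, pc = marg_x(q), cond(q), cond(p)

# Averaged density formula: sum_{x,y} q(x, y) log( q(y|x) / p(y|x) )
direct = sum(q[x][y] * math.log(qc[x][y] / pc[x][y])
             for x in range(2) for y in range(2))

# Via R: r(x, y) = p(y|x) q(x), then take I to Q_XY from R_XY
r = [[pc[x][y] * qx[x] for y in range(2)] for x in range(2)]
via_r = sum(q[x][y] * math.log(q[x][y] / r[x][y])
            for x in range(2) for y in range(2))
```

The equality is exact, since $q(x,y) / (p(y|x)\,q(x)) = q(y|x)/p(y|x)$ term by term.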

## What is the chain rule for conditional relative information?

In statistics and machine learning, we often think of $Q$ as a true distribution that we are trying to uncover and $P$ as a model distribution for approximating $Q$. The relative information $I_{Q \Vert P}$ measures how far the model is from the truth.

To uncover the truth, it makes strategic sense to study different facets $X, Y$ of reality, and to build up reality one facet at a time. For example, we may want to know how far our model is from reality in modeling $X$ and focus on modeling $X$, before moving on to what our model says about both $X$ and $Y$. The chain rule of conditional relative information says that the divergence of our model from reality for $Y, X$ is simply the sum of the divergences for $X$ and for $Y|X$:

$$I_{Q \Vert P}(Y,X) = I_{Q \Vert P}(Y|X) + I_{Q \Vert P}(X). \tag{CR}$$

Therefore, to get a good model of $X, Y$, we could attempt to minimize the divergences for $X$ and for $Y|X$ in parallel.
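The chain rule is easy to verify numerically for finite distributions. The sketch below uses hypothetical joint tables and checks that the joint divergence splits into the marginal term plus the conditional term computed directly.

```python
import math

q = [[0.30, 0.20], [0.10, 0.40]]   # true joint, hypothetical numbers
p = [[0.10, 0.30], [0.35, 0.25]]   # model joint, hypothetical numbers

def rel_info(a, b):
    """Relative information to a from b, for parallel lists of weights."""
    return sum(ai * math.log(ai / bi) for ai, bi in zip(a, b))

qf = [q[x][y] for x in range(2) for y in range(2)]   # flattened joint Q
pf = [p[x][y] for x in range(2) for y in range(2)]   # flattened joint P
qx = [sum(row) for row in q]                         # marginal Q_X
px = [sum(row) for row in p]                         # marginal P_X

i_joint = rel_info(qf, pf)   # I_{Q||P}(Y, X)
i_marg = rel_info(qx, px)    # I_{Q||P}(X)

# Conditional term I_{Q||P}(Y|X), computed directly from the conditionals:
i_cond = sum(q[x][y] * math.log((q[x][y] / qx[x]) / (p[x][y] / px[x]))
             for x in range(2) for y in range(2))
```

Algebraically the identity is exact: subtracting the marginal log-ratio from the joint log-ratio leaves the conditional log-ratio.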

## How do we derive conditional entropy from conditional relative information?

Just as the entropy of a random variable $X$ with distribution $P_X$ can be defined as the relative information to the dependent distribution $P_{XX}$ from the independent distribution $P_X \times P_X$, we will do the same for conditional entropy.

Given random variables $X, Y$ with joint distribution $P_{XY}$, we define the conditional entropy of $Y|X$ as

$$H(Y|X) = I_{P_{XY,XY} \Vert P_{XY} \times P_{XY}}(Y|X),$$

the conditional relative information of $Y|X$ to the dependent distribution $P_{XY,XY}$ from the independent distribution $P_{XY} \times P_{XY}$.

According to the chain rule of conditional relative information,

$$I_{P_{XY,XY} \Vert P_{XY} \times P_{XY}}(Y,X) = I_{P_{XY,XY} \Vert P_{XY} \times P_{XY}}(Y|X) + I_{P_{XY,XY} \Vert P_{XY} \times P_{XY}}(X).$$

By the definition of entropy in our introduction, the first and third terms are the entropies $H(Y,X)$ and $H(X)$ respectively, while the second term is the conditional entropy $H(Y|X)$. Thus, we recover the classical chain rule for conditional entropy

$$H(Y,X) = H(Y|X) + H(X).$$
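For a discrete joint distribution, this chain rule has a concrete fiberwise reading: $H(Y|X)$ is the entropy of each conditional $P_{Y|X=x}$ weighted by $P_X(x)$. A small sketch with hypothetical numbers:

```python
import math

p = [[0.30, 0.20],   # joint P_XY, hypothetical numbers
     [0.10, 0.40]]

def H(weights):
    """Shannon entropy of a list of probability weights."""
    return -sum(w * math.log(w) for w in weights if w > 0)

h_joint = H([p[x][y] for x in range(2) for y in range(2)])   # H(Y, X)
px = [sum(row) for row in p]
h_x = H(px)                                                  # H(X)

# H(Y|X): entropy of each fiber P_{Y|X=x}, weighted by P_X(x)
h_cond = sum(px[x] * H([p[x][y] / px[x] for y in range(2)])
             for x in range(2))
```

The identity `h_joint == h_cond + h_x` holds exactly (up to floating point), since the weighted fiber entropies telescope against the marginal term.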

## Is there an axiomatization of conditional entropy?

As described in our previous post, we allow the total measures of $P, Q$ to be different from one, but we require their total measures to be the same.

We start with an axiomatization of conditional entropy, with the hope of deriving an axiomatization of conditional relative information. I like the following categorical view of conditional entropy [BFL11]. I’ve taken the liberty of rewriting it in our notation.

Given a measured space $(\Omega, \mathcal{B}, P)$, a finite measurable function $Y: \Omega \to S_Y$ and a morphism $f: S_Y \to S_X$ between finite sets, let $X = f \circ Y$ and let $P_Y, P_X$ be the induced measures on $S_Y, S_X$.

In this case, the conditional entropy of Y|X is

$$\begin{aligned} H(Y|X) &= H(Y,X) - H(X) = H(Y) - H(X) \\ &= -T \sum_y \bar P_Y(y) \log \bar P_Y(y) + T \sum_x \bar P_X(x) \log \bar P_X(x) \\ &= -\sum_y P_Y(y) \log P_Y(y) + \sum_x P_X(x) \log P_X(x) \end{aligned}$$

where $T$ is the total measure of $P$, and $\bar P_X = P_X/T$ and $\bar P_Y = P_Y/T$ are probability measures. The equality $H(Y,X) = H(Y)$ holds because $X = f \circ Y$ is determined by $Y$, and the $T \log T$ terms cancel in the last equality.
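We can sanity-check the two expressions on an unnormalized measure, say with total measure $T = 2$. The measure and the map $f$ below are hypothetical choices for illustration; the point is that the normalized expression (with the factor $T$) and the unnormalized one agree.

```python
import math

# Hypothetical unnormalized measure P_Y on S_Y = {0, 1, 2, 3}, total T = 2.
py = [0.8, 0.2, 0.4, 0.6]
T = sum(py)

# f : S_Y -> S_X collapses {0, 1} -> 0 and {2, 3} -> 1, and X = f . Y.
f = [0, 0, 1, 1]
px = [0.0, 0.0]
for y, w in enumerate(py):
    px[f[y]] += w

def H(weights):
    """Entropy of an unnormalized measure: -T * sum of p_bar log p_bar."""
    t = sum(weights)
    return -t * sum((w / t) * math.log(w / t) for w in weights if w > 0)

h_normalized = H(py) - H(px)                      # with the factor T
h_raw = (-sum(w * math.log(w) for w in py)
         + sum(w * math.log(w) for w in px))      # unnormalized form
```

Both expressions compute $H(Y|X)$; the $T \log T$ contributions from $H(Y)$ and $H(X)$ cancel against each other.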

Given two measured spaces $(\Omega_1, \mathcal{B}_1, P_1)$ and $(\Omega_2, \mathcal{B}_2, P_2)$, let $(\Omega_1 \sqcup \Omega_2, \mathcal{B}_1 \oplus \mathcal{B}_2, P_1 \oplus P_2)$ be their direct sum. Here, $\Omega_1 \sqcup \Omega_2$ is the disjoint union, and $S \in \mathcal{B}_1 \oplus \mathcal{B}_2$ if and only if $S \cap \Omega_1 \in \mathcal{B}_1$ and $S \cap \Omega_2 \in \mathcal{B}_2$. The measure of $S$ is the sum of that of $S \cap \Omega_1$ and of $S \cap \Omega_2$.

Let $F_P$ be a family of maps indexed by measures $P$, such that each $F_P$ sends morphisms $f: S_Y \to S_X$ between finite sets to $[0, \infty)$. Suppose that the family $F_P$ satisfies the following four axioms.

  1. Functoriality. $F_P(f \circ g) = F_P(f) + F_P(g)$

  2. Homogeneity. $F_{\lambda P}(f) = \lambda F_P(f)$

  3. Additivity. $F_{P_1 \oplus P_2}(f_1 \oplus f_2) = F_{P_1}(f_1) + F_{P_2}(f_2)$

  4. Continuity. $F$ is continuous

Then, $F_P(f)$ must be $c\,H(Y|X)$ for some constant $c \geq 0$.

Given a classical conditional entropy $H(Y|X)$, we can now write this as $F_P(f)$ where $f$ is the projection $(y, x) \mapsto x$.

The nice thing about the above categorical axiomatization of conditional entropy is that it fits into the view where the objects of study are spaces $E, B$ and fibrations $\pi: E \to B$ equipped with measures. The conditional entropy is the sum of the entropies of the fibers $\pi^{-1}(b)$ weighted by $P_B(b)$.

## Is there an axiomatization of conditional relative information?

We prefer to work with conditional relative information rather than conditional entropy. Its axiomatization should tell us how it behaves with respect to products and coproducts of the measures being compared.

Our axioms. Note the addition of the product rule. I’m not sure if the product axiom can be derived from the others when the state spaces are not finite. Perhaps it will follow from continuity and the fact that the limit of coproducts is the product of limits.

  1. Functoriality. $G_{Q \Vert P}(f \circ g) = G_{Q \Vert P}(f) + G_{Q \Vert P}(g)$

  2. Homogeneity. $G_{\lambda Q \Vert \lambda P}(f) = \lambda G_{Q \Vert P}(f)$

  3. Coproduct. $G_{Q_1 \oplus Q_2 \Vert P_1 \oplus P_2}(f_1 \oplus f_2) = G_{Q_1 \Vert P_1}(f_1) + G_{Q_2 \Vert P_2}(f_2)$

  4. Product. $G_{Q_1 \times Q_2 \Vert P_1 \times P_2}(f_1 \times f_2) = T_2\, G_{Q_1 \Vert P_1}(f_1) + T_1\, G_{Q_2 \Vert P_2}(f_2)$

  5. Continuity. $G$ is continuous.

Here, $T_1$ and $T_2$ are the total measures of $Q_1$ and $Q_2$ respectively.
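The product axiom can be checked numerically when $G_{Q \Vert P}(f)$ is taken to be the conditional relative information $I_{Q \Vert P}(Y|X)$ with $X = f \circ Y$ on finite sets. The weights below are hypothetical, chosen so that each $Q_i$ and $P_i$ have equal total measures as required.

```python
import math

def G(q, p, f):
    """Conditional relative information G_{Q||P}(f) for unnormalized
    weights q, p on S_Y and a map f given as a list S_Y -> S_X."""
    nx = max(f) + 1
    qx, px = [0.0] * nx, [0.0] * nx
    for y in range(len(f)):
        qx[f[y]] += q[y]
        px[f[y]] += p[y]
    return sum(q[y] * math.log((q[y] / qx[f[y]]) / (p[y] / px[f[y]]))
               for y in range(len(f)))

# Hypothetical unnormalized weights with matching totals (T1 = 2, T2 = 3).
q1, p1, f1 = [0.9, 0.1, 0.5, 0.5], [0.4, 0.6, 0.7, 0.3], [0, 0, 1, 1]
q2, p2, f2 = [1.5, 0.5, 1.0], [1.0, 1.0, 1.0], [0, 0, 1]
T1, T2 = sum(q1), sum(q2)

# Product measures and the product map on S_{Y1} x S_{Y2}.
qp = [a * b for a in q1 for b in q2]
pp = [a * b for a in p1 for b in p2]
nx2 = max(f2) + 1
fp = [f1[y1] * nx2 + f2[y2] for y1 in range(len(f1)) for y2 in range(len(f2))]

lhs = G(qp, pp, fp)
rhs = T2 * G(q1, p1, f1) + T1 * G(q2, p2, f2)
```

The identity holds because the log-ratio of a product splits into a sum, and summing each term over the other factor contributes the total measure $T_2$ or $T_1$.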

The axioms for conditional entropy follow immediately from these axioms for conditional relative information, because we can write

$$F_{P_X}(f) = G_{T P_{XX} \Vert P_X \times P_X}(f)$$

where $T$ is the total measure of $P_X$.

## References

[BFL11] John C. Baez, Tobias Fritz, and Tom Leinster. A characterization of entropy in terms of information loss. *Entropy*, 13(11):1945–1957, 2011.

[Gra11] Robert M. Gray. *Entropy and Information Theory*. Springer Science & Business Media, 2011.