Exercise 3.2: Chain Rule for Kullback-Leibler Divergence
It will be sufficient to prove the case \(n = 2\) as the general \(n \in \N\) follows by induction.
Denote the spaces in which \(X_1\) and \(X_2\) take values by \(\mathcal{X}_1\) and \(\mathcal{X}_2\) respectively. Let us assume \(\mathcal{X}_1\) and \(\mathcal{X}_2\) are non-empty Polish spaces, and \((X_1 , X_2)\) are jointly measurable w.r.t. to the Borel product \(\sigma\)-algebra \(\mathcal{B}_{X_1, X_2}\) on \(\mathcal{X}_1 \times \mathcal{X}_2\) (see the notes below for further discussion). Then for each \(x_1 \in \mathcal{X}_1\), there exists a set of finite measures \(\{ \mathbb{Q}_{x_1} \}_{x_1 \in \mathcal{X}_1}\) s.t. \begin{align}\label{eq:iterated_integral} \int f \, \sd \mathbb{Q} = \int \left[ \int f(x_1, x_2) \, \sd \mathbb{Q}_{x_1} (x_2) \right] \, \sd \mathbb{Q}_{1} (x_1) \, , \end{align} whenever \(f \in L^1 (\mathbb{Q})\), and analogously for \(\{ \mathbb{P}_{x_1} \}_{x_1 \in \mathcal{X}_1}\) (see, e.g., theorems 10.2.1 and 10.2.2 by Dudley (2002)).
By the Radon-Nikodym theorem (e.g., theorem 5.5.4 by Dudley (2002)), \(\tfrac{\sd \mathbb{Q}}{\sd \mathbb{P}}\) exists if \(\mathbb{Q}\) is absolutely continuous w.r.t. \(\mathbb{P}\) (denoted \(\mathbb{Q} \ll \mathbb{P}\)). We treat the two cases when \(\tfrac{\sd \mathbb{Q}}{\sd \mathbb{P}}\) does and does not exist separately.
(i): \(\mathbb{Q}\) is not absolutely continuous w.r.t. \(\mathbb{P}\)
In this case the case, \(\KL (\mathbb{Q} \, \| \, \mathbb{P}) = \infty\) by definition, and we thus have to show \begin{align} \E_{\mathbb{Q}_{1}} [ \KL (\mathbb{Q}_{X_1} \, \| \, \mathbb{P}_{X_1}) ] + \KL (\mathbb{Q}_{1} \, \| \, \mathbb{P}_{1} ) \, , \end{align} is also infinite. Take a set \(E \in \mathcal{B}_{X_1, X_2}\) such that \(\mathbb{P} (E) = 0\) but \(\mathbb{Q} (E) > 0\), and assume both \(\mathbb{Q}_{1} \ll \mathbb{P}_1\) and \(\mathbb{Q}_{X_1} \ll \mathbb{P}_{X_1}\) (\(\mathbb{Q}_1\)-a.s. in the latter case). Then by the iterated integration identity from Equation \eqref{eq:iterated_integral} \begin{align} \mathbb{Q}(E) &= \int \! \! \int \ind_E (x_1 , x_2) \, \sd \mathbb{Q}_{x_1} (x_2) \, \sd \mathbb{Q}_{1} (x_1) \newline &= \int \! \! \int \ind_E (x_1 , x_2) \tfrac{\sd \mathbb{Q}_{x_1}}{\sd \mathbb{P}_{x_1}} (x_2) \tfrac{\sd \mathbb{Q}_1}{\sd \mathbb{P}_1} (x_1) \, \sd \mathbb{P}_{x_1}(x_2) \, \sd \mathbb{P}_{1}(x_1) \newline &= \int \ind_E (x_1 , x_2) \tfrac{\sd \mathbb{Q}_{x_1}}{\sd \mathbb{P}_{x_1}} (x_2) \tfrac{\sd \mathbb{Q}_1}{\sd \mathbb{P}_1} (x_1) \, \sd \mathbb{P} (x_1 , x_2) \, , \end{align} where the last equality holds again by the iterated integration identity combined with the boundedness of \(\ind_E\) and integrability of the Radon-Nikodym derivatives (which we assumed exist, and are integrable by definition). Using the integrability again and recalling \(\mathbb{P} (E) = 0\), the last integral is equal to zero which contradicts \(\mathbb{Q} (E) > 0\), implying the desired \begin{align} \E_{\mathbb{Q}_{1}} [ \KL (\mathbb{Q}_{X_1} \, \| \, \mathbb{P}_{X_1}) ] + \KL (\mathbb{Q}_{1} \, \| \, \mathbb{P}_{1} ) = \infty \, . \end{align}
(ii): \(\mathbb{Q}\) is absolutely continuous w.r.t. \(\mathbb{P}\)
Using the rectangles \(A \times \mathcal{X}_2\), \(A \subseteq \mathcal{X}_1\) measurable, it is clear \(\mathbb{Q}_1 \ll \mathbb{P}_1\). To show \(\mathbb{Q}_{X_1} \ll \mathbb{P}_{X_1}\) holds \(\mathbb{Q}_1\)-a.s., we can take a measurable rectangle \(A \times B \subset \mathcal{X}_1 \times \mathcal{X}_2\) and use the definition of \(\{ \mathbb{Q}_{x_1} \}_{x_1 \in \mathcal{X}_1}\) \begin{align} \mathbb{Q} (A \times B) &\overset{\text{(i)}}{=} \int_{A} \int_{B} \tfrac{\sd \mathbb{Q}}{\sd \mathbb{P}}(x_1 , x_2) \biggl( \ind_{\tfrac{\sd \mathbb{Q}_1}{\sd \mathbb{P}_1} > 0} (x_1) \tfrac{ \tfrac{\sd \mathbb{Q}_1}{\sd \mathbb{P}_1} (x_1) }{ \tfrac{\sd \mathbb{Q}_1}{\sd \mathbb{P}_1} (x_1) } + \ind_{\tfrac{\sd \mathbb{Q}_1}{\sd \mathbb{P}_1} = 0} (x_1) \biggr) \, \sd \mathbb{P}_{x_1}(x_2) \, \sd \mathbb{P}_{1} (x_1) \newline &\overset{\text{(ii)}}{=} \int_{A} \biggl[ \int_{B} \ind_{\tfrac{\sd \mathbb{Q}_1}{\sd \mathbb{P}_1} > 0} (x_1) \tfrac{ \tfrac{\sd \mathbb{Q}}{\sd \mathbb{P}} (x_1, x_2) }{ \tfrac{\sd \mathbb{Q}_1}{\sd \mathbb{P}_1} (x_1) } \, \sd \mathbb{P}_{x_1} (x_2) \biggr] \, \sd \mathbb{Q}_1(x_1) + \int_{A} \int_{B} \ind_{\tfrac{\sd \mathbb{Q}_1}{\sd \mathbb{P}_1} = 0} \tfrac{\sd \mathbb{Q}_1}{\sd \mathbb{P}_1} \, \sd \mathbb{Q}_{x_1} (x_2) \, \sd \mathbb{P}_{1} (x_1) \, , \end{align} where \(\tfrac{0}{0}\) should be interpreted as zero, (i) uses the iterated integration identity from Equation \eqref{eq:iterated_integral} combined with \(\mathbb{Q} \ll \mathbb{P}\), and (ii) follows by the the fact \(\bigl\{ \tfrac{\sd \mathbb{Q}_1}{\sd \mathbb{P}_1} = 0 \bigr\}\) is measurable and the Radon-Nikodym derivatives are integrable. Noticing the second integral in (ii) above vanishes for any \(A \times B\), and that \(\mathbb{Q}(A \times B) = \int_A \int_B \sd\mathbb{Q}_{x_1} \sd\mathbb{Q}_1\), we can select a version of \(\tfrac{\sd \mathbb{Q}_1}{\sd \mathbb{P}_1}\) (assuming the axiom of choice), and define \begin{align} \tfrac{\sd \mathbb{Q}_{x_1}}{\sd \mathbb{P}_{x_1}} (x_2) = \begin{cases} \bigl[ \tfrac{\sd \mathbb{Q}_{1}}{\sd \mathbb{P}_{1}}(x_1) \bigr]^{-1} \tfrac{\sd \mathbb{Q}}{\sd \mathbb{P}}(x_1 , x_2) & \text{if } \tfrac{\sd \mathbb{Q}_{1}}{\sd \mathbb{P}_{1}}(x_1) > 0 \, , \newline 1 & \text{otherwise,} \end{cases} \end{align} which by the above will \(\mathbb{Q}_1\)-a.s. correspond to a Radon-Nikodym derivative of \(\mathbb{Q}_{X_1}\) w.r.t. \(\mathbb{P}_{X_1}\). Since the measurable rectangles \(A \times B\) form a generating algebra for the product \(\sigma\)-algebra, they are measure determining (e.g., theorem 3.2.7 in [1]), implying \begin{align} \tfrac{\sd \mathbb{Q}}{\sd \mathbb{P}} = \tfrac{\sd \mathbb{Q}_{X_1}}{\sd \mathbb{P}_{X_1}} \tfrac{\sd \mathbb{Q}_{1}}{\sd \mathbb{P}_{1}} \, , \qquad \mathbb{Q}\text{-a.s.} \end{align}
If \(\log \tfrac{\sd \mathbb{Q}}{\sd \mathbb{P}} \in L^{1} (\mathbb{Q})\), and either \(\KL (\mathbb{Q}_1 \, \| \, \mathbb{P}_1) < \infty\) or \(\E_{\mathbb{Q}_1} [ \KL(\mathbb{Q}_{X_1} \, \| \, \mathbb{P}_{X_1} ) ] < \infty\) \begin{align} \KL (\mathbb{Q} \, \| \, \mathbb{P}) &= \int \log \left( \tfrac{\sd \mathbb{Q}_{X_1}}{\sd \mathbb{P}_{X_1}} \tfrac{\sd \mathbb{Q}_1}{\sd \mathbb{P}_1} \right) \, \sd \mathbb{Q} \newline &\overset{\text{(i)}}{=} \int \left[ \int \log \tfrac{\sd \mathbb{Q}_{X_1}}{\sd \mathbb{P}_{X_1}} + \log \tfrac{\sd \mathbb{Q}_1}{\sd \mathbb{P}_1} \, \sd \mathbb{Q}_{2 \given 1} \right] \, \sd \mathbb{Q}_1 \newline &\overset{\text{(ii)}}{=} \E_{\mathbb{Q}_1} [ \KL (\mathbb{Q}_{X_1} \, \| \, \mathbb{P}_{X_1}) ] + \KL (\mathbb{Q}_1 \, \| \, \mathbb{P}_1) \, , \end{align} where (i) follows by the iterated integration identity from Equation \eqref{eq:iterated_integral}, and (ii) by the assumed integrability conditions combined with the definition of KL-divergence.
Notes
-
Throughout, we are implicitly using that joint measurability of \((X_1, X_2)\) w.r.t. the product \(\sigma\)-algebra implies marginal measurability of \(X_1\) and \(X_2\) on the Borel \(\sigma\)-algebras associated with the (by assumption Polish) spaces \(\mathcal{X}_1\) and \(\mathcal{X}_2\) respectively (just use the measurable rectangles \(A \times \mathcal{X}_1\) and \(\mathcal{X}_2 \times B\)).
-
Since we made the assumption \(\mathcal{X}_1\) and \(\mathcal{X}_2\) are both Polish, they are both automatically second-countable, i.e., there exists a countable base for their topology. Borel \(\sigma\)-algebras on second-countable spaces are thus generated by a countable number of sets, a useful fact when constructing regular conditional probabilities. This was implicitly used when invoking theorem 10.2.2 by Dudley (2002) which exploits this fact in its proof. It is a question whether the Polish space assumption we make can be relaxed, but existence of a countable measure determining class of sets generating the \(\sigma\)-algebra at least on \(\mathcal{X}_2\) is likely to be necessary.
-
Second-countable spaces have the nice property that the \(\sigma\)-algebra generated by the product topology agrees with the product \(\sigma\)-algebra generated by the Borel \(\sigma\)-algebras on the constituent spaces (see, e.g., proposition 4.1.7 by Dudley (2002)), which allowed us to ignore the distinction throughout.