On Hidden Markov Processes with Infinite Excess Entropy
We investigate stationary hidden Markov processes for which mutual information between the past and the future is infinite. It is assumed that the number of observable states is finite and the number of hidden states is countably infinite. Under this assumption, we show that the block mutual information of a hidden Markov process is upper bounded by a power law determined by the tail index of the hidden state distribution. Moreover, we exhibit three examples of processes. The first example, considered previously, is nonergodic and the mutual information between the blocks is bounded by the logarithm of the block length. The second example is also nonergodic but the mutual information between the blocks obeys a power law. The third example obeys the power law and is ergodic.

Mathematics Subject Classification: 60J10, 94A17, 37A25
1 Introduction
In recent years, there has been a surge of interdisciplinary interest in excess entropy,
which is the Shannon mutual information between the past and the future of a
stationary discrete-time process. The initial motivation for this interest was a paper by Hilberg
[22], who supposed that certain processes with infinite excess entropy may be useful
for modeling texts in natural language. Subsequently, it was noticed that processes with
infinite excess entropy appear also in research on other so-called complex systems
[1,5,6,13,14,19,23,25]. Also from a purely mathematical point of view, excess
entropy is an interesting measure of dependence for nominal-valued random processes,
where the analysis of autocorrelation does not provide sufficient insight into process
memory.
Briefly reviewing earlier works, let us mention that excess entropy has already been
studied for several classes of processes. The most classical results concern Gaussian
processes, where Grenander and Szegő [20, Sect. 5.5] gave an integral formula for
excess entropy (in disguise) and Finch [18] evaluated this formula for autoregressive
moving average (ARMA) processes. In the ARMA case excess entropy is finite. A
few more papers concern processes over a finite alphabet with infinite excess entropy.
For instance, Bradley [3] constructed the first example of a mixing process having this
property. Gramss [19] investigated a process which is formed by the frequencies of 0s
and 1s in the rabbit sequence. Travers and Crutchfield [26] researched some hidden
Markov processes with a countably infinite number of hidden states. Some attempts
were also made to generalize excess entropy to two-dimensional random fields [4,17].
Excess entropy is an intuitive measure of memory stored in a stochastic process.
Although this quantity only measures the memory capacity, without characterizing
how the future of the process depends on the past, it can be given interesting general
interpretations. Mahoney, Ellison, and Crutchfield [15,24] developed a formula for excess
entropy in terms of predictive and retrodictive machines, which are minimal unifilar
hidden Markov representations of the process [23,25]. In our previous works [9–12],
we also investigated excess entropy of stationary processes that model texts in natural
language. We showed that a power-law growth of mutual information between
adjacent blocks of text arises when the text describes certain facts in a logically consistent
and highly repetitive way. Moreover, if the mutual information between blocks grows
according to a power law then a similar power law is obeyed by the number of distinct
words, identified formally as codewords in a certain text compression [7]. The latter
power law is known as Herdan's law [21], which is an integral version of the famous
Zipf's law observed for natural language [28].
In this paper, we will study several examples of stationary hidden Markov processes
over a finite alphabet for which excess entropy is infinite. The first study of such
processes was developed by Travers and Crutchfield [26]. A few more words about
the adopted setting are in order. First, excess entropy is finite for hidden Markov chains
with a finite number of hidden states. This is the usually studied case [16], for which
the name of finite-state sources is also used. To allow for hidden Markov processes
with unbounded mutual information, we need to assume that the number of hidden
states is at least countably infinite. Second, we want to restrict the class of studied
models. If we admitted an uncountable number of hidden states or a nonstationary
distribution over the hidden states then the class of hidden Markov processes would
cover all processes (over a countable alphabet). For that reason, we will assume that
the underlying Markov process is stationary and the number of hidden states is exactly
countably infinite. In contrast, the number of observable states is fixed as finite to focus
on nontrivial examples. In all these assumptions we follow [26].
The modest aim of this paper is to demonstrate that power-law growth of mutual
information between adjacent blocks may arise for very simple hidden Markov
processes. Presumably, stochastic processes which exhibit this power law appear in
modeling of natural language [10,22]. But the processes that we study here do not
have a clear linguistic interpretation. They are only mathematical instances presented
to show what is possible in theory. Although these processes are simple to define, we
perceive them as somewhat artificial because of the way the memory of the past
is stored in the present and revealed in the future. Understanding which mechanisms
of memory are acceptable in realistic stochastic models of complex systems is an
important challenge for future research.
The further organization of the paper is as follows: In Sect. 2, we present the results,
whereas the proofs are deferred to Sect. 3.
2 Results
Now we begin the formal presentation of our results. First, let (Y_i)_{i∈ℤ} be a stationary
Markov process on (Ω, 𝒥, P) where variables Y_i : Ω → Y take values in a countably
infinite alphabet Y. This process is called the hidden process. Next, for a function
f : Y → X, where the alphabet X = {0, 1, ..., D − 1} is finite, we construct process
(X_i)_{i∈ℤ} with

$$X_i = f(Y_i).$$
Process (X_i)_{i∈ℤ} will be called the observable process. The process is called unifilar
if Y_{i+1} = g(Y_i, X_{i+1}) for a certain function g : Y × X → Y. Such a construction of
hidden Markov processes, historically the oldest one [2], is called state-emitting (or
Moore) in contrast to another construction named edge-emitting (or Mealy). The Mealy
construction, with a requirement of unifilarity, has been adopted in previous works [5,
23,26]. Here, we adopt the Moore construction and we drop the unifilarity assumption
since this leads to a simpler presentation of processes. It should be noted that the standard
definition of hidden Markov processes in statistics and signal processing differs yet to a
degree, namely, the observed process (X_i)_{i∈ℤ} depends on the hidden process
(Y_i)_{i∈ℤ} via a probability distribution and X_i is conditionally independent of the other
observables given Y_i. All the presented definitions are, however, equivalent and the
terminological discussion can be put aside.
Next, we inspect the mutual information. Given the entropy H(X) =
E[−log P(X)], with log denoting the binary logarithm throughout this paper, mutual
information is defined as I(X; Y) = H(X) + H(Y) − H(X, Y). Here, we will be
interested in the block mutual information of the observable process

$$E(n) := I(X_{-n+1}^0; X_1^n),$$

where X_k^l denotes the block (X_i)_{k≤i≤l}. More specifically, we are interested in
processes for which excess entropy E = lim_{n→∞} E(n) is infinite and E(n) diverges
at a power-law rate. We want to show that such an effect is possible for very simple
hidden Markov processes. (Travers and Crutchfield [26] considered some examples of
nonergodic and ergodic hidden Markov processes with infinite excess entropy but they
did not investigate the rate of divergence of E(n).) Notice that by the data processing
inequality for the Markov process (Y_i)_{i∈ℤ}, we have

$$E(n) \le I(Y_{-n+1}^0; Y_1^n) = I(Y_0; Y_1) \le H(Y_0).$$

Thus, the block mutual information E(n) may diverge only if the entropy of the hidden
state is infinite. To achieve this effect, the hidden variable Y_0 must necessarily assume
an infinite number of values.
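As a side illustration (not part of the formal development), the block mutual information E(n) can be estimated from a finite sample path by the plug-in method. The helper below is a hypothetical sketch of ours; note that the plug-in estimator is severely biased for long blocks, so it is only indicative for small n.

```python
from collections import Counter
from math import log2

def plugin_block_mi(xs, n):
    """Plug-in estimate of E(n) = I(X_{-n+1}^0; X_1^n) from one sample path.

    Slides a window of length 2n over xs and treats its two halves as the
    past block and the future block.  Purely illustrative: the estimate is
    biased when the number of windows is small relative to the block space."""
    past, fut, joint = Counter(), Counter(), Counter()
    total = 0
    for i in range(len(xs) - 2 * n + 1):
        p = tuple(xs[i:i + n])
        f = tuple(xs[i + n:i + 2 * n])
        past[p] += 1
        fut[f] += 1
        joint[p, f] += 1
        total += 1
    # I(X;Y) = sum over (x,y) of p(x,y) * log2( p(x,y) / (p(x) p(y)) )
    return sum(c / total * log2(c * total / (past[p] * fut[f]))
               for (p, f), c in joint.items())
```

For a constant sequence the estimate is exactly zero, while for a deterministic alternating sequence and n = 1 it approaches 1 bit.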
Now, we introduce our class of examples. Let us assume that hidden states n_k may
be grouped into levels

$$T_n = \{ n_k : 1 \le k \le r(n) \}, \qquad n \ge 2,$$

that comprise equiprobable values. Moreover, we suppose that the level indicator

$$N_i := n \iff Y_i \in T_n$$

is distributed according to

$$P(N_i = n) = p(n) := \frac{1}{A\, n \log^{\alpha} n}, \qquad A = \sum_{m=2}^{\infty} \frac{1}{m \log^{\alpha} m}, \tag{6}$$

so that the marginal distribution of the hidden states is

$$P(Y_i = n_k) = \frac{p(n)}{r(n)} \quad \text{for } n_k \in T_n. \tag{7}$$

For α ∈ (1, 2], entropy H(N_i) is infinite and so is H(Y_i) ≥ H(N_i) since N_i is a
function of Y_i. In the following, we work with this specific distribution of Y_i.

As we will show, the rate of growth of the block mutual information E(n) is bounded
in terms of exponent α from Eq. (6). Let us write f(n) = O(g(n)) if f(n) ≤ K g(n)
for a K > 0 and f(n) = Θ(g(n)) if K_1 g(n) ≤ f(n) ≤ K_2 g(n) for K_1, K_2 > 0.

Theorem 1 Assume that Y = {n_k}_{1≤k≤r(n), n≥2}, where function r(n) satisfies r(n) =
O(n^p) for a p ∈ ℕ. Moreover, assume that the hidden states are distributed according
to (6) and (7) with α ∈ (1, 2]. Then

$$E(n) = \begin{cases} O(\log n), & \alpha = 2, \\ O(n^{2-\alpha}), & \alpha \in (1, 2). \end{cases} \tag{8}$$

The interesting question becomes whether there exist hidden Markov processes that
achieve the upper bound established in Theorem 1. If so, can they be ergodic? The
answer to both questions is positive and we will exhibit some simple examples of such
processes.

The first example that we present is nonergodic and the mutual information diverges
more slowly than expected from Theorem 1.
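One can check numerically that a level distribution with p(n) proportional to 1/(n log^α n) has slowly diverging entropy for α ∈ (1, 2], which is the mechanism behind an infinite H(Y_0). The helper below is our illustration; the function name and the truncation are assumptions, not part of the text.

```python
from math import log2

def truncated_level_entropy(alpha, n_max):
    """Entropy (bits) of the level distribution p(n) proportional to
    1/(n * log2(n)**alpha), truncated to 2 <= n <= n_max.  For alpha in
    (1, 2] the entropy keeps growing (very slowly) as n_max increases."""
    weights = [1.0 / (n * log2(n) ** alpha) for n in range(2, n_max + 1)]
    total = sum(weights)
    return sum(-(w / total) * log2(w / total) for w in weights)
```

Comparing, e.g., n_max = 10^3 with n_max = 10^5 at alpha = 2 shows the entropy still increasing, in line with its divergence.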
Example 1 (Heavy Tailed Periodic Mixture I) This example has been introduced in
[26]. We assume Y = {n_k}_{1≤k≤r(n), n≥2}, where r(n) = n. Then we set the transition
probabilities

$$P(Y_{i+1} = y \mid Y_i = n_k) = \begin{cases} 1, & y = n_{k+1} \text{ and } 1 \le k < r(n), \\ 1, & y = n_1 \text{ and } k = r(n). \end{cases}$$

We can see that the transition graph associated with the process (Y_i)_{i∈ℤ} consists of
disjoint cycles on levels T_n. The stationary distribution of the Markov process is not
unique and the process is nonergodic if more than one cycle has a positive probability.
Here, we assume the cycle distribution (6) so the stationary marginal distribution of
Y_i equals (7). Moreover, the observable process is set as

$$X_i = \begin{cases} 1, & Y_i = n_1, \\ 0, & \text{otherwise}. \end{cases}$$
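A minimal simulation sketch of the observable process of Example 1 may help intuition. The level is drawn once per trajectory, since a single path never leaves its cycle; the truncation parameter n_max and all identifiers are our own assumptions, not part of the construction.

```python
import random
from math import log2

def sample_example1(alpha, length, n_max=1000, rng=None):
    """Sample one trajectory of the observable process of Example 1.

    The level n is drawn from weights proportional to 1/(n * log2(n)**alpha),
    truncated at n_max (a simulation convenience).  Within level n the hidden
    state cycles through n_1, ..., n_n and the observable emits 1 exactly at
    the state n_1, so the output is periodic with period n."""
    rng = rng or random.Random()
    levels = range(2, n_max + 1)
    weights = [1.0 / (n * log2(n) ** alpha) for n in levels]
    n = rng.choices(levels, weights=weights)[0]
    k = rng.randrange(n)  # uniform phase: the stationary law within a cycle
    xs = []
    for _ in range(length):
        xs.append(1 if k == 0 else 0)  # delimiter 1 at state n_1
        k = (k + 1) % n
    return xs
```

Every sampled trajectory is exactly periodic, which makes the nonergodicity of the mixture plain to see.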
In the above example, the level indicator N_i has infinite entropy and is measurable
with respect to the shift-invariant algebra of the observable process (X_i)_{i∈ℤ}. Hence,
E(n) tends to infinity by the ergodic decomposition of excess entropy [8, Theorem 5].
A more precise bound on the block mutual information is given below.

Proposition 1 For Example 1, we have

$$E(n) = \begin{cases} \Theta(\log \log n), & \alpha = 2, \\ \Theta(\log^{2-\alpha} n), & \alpha \in (1, 2). \end{cases} \tag{11}$$
The next example is also nonergodic but the rate of growth of mutual information
reaches the upper bound. This seems to happen because the information about the hidden
state level is coded in the observable process in a more concise way.
Example 2 (Heavy Tailed Periodic Mixture II) We assume that Y = {n_k}_{1≤k≤r(n), n≥2},
where r(n) = s(n) is the length of the binary expansion of number n. Then we set the
transition probabilities

$$P(Y_{i+1} = y \mid Y_i = n_k) = \begin{cases} 1, & y = n_{k+1} \text{ and } 1 \le k < r(n), \\ 1, & y = n_1 \text{ and } k = r(n). \end{cases}$$

Again, the transition graph associated with the process (Y_i)_{i∈ℤ} consists of disjoint
cycles on levels T_n. As previously, we assume the cycle distribution (6) and the
marginal distribution (7). Moreover, let b(n, k) be the kth digit of the binary expansion of
number n. (We have b(n, 1) = 1.) The observable process is set as

$$X_i = \begin{cases} 2, & Y_i = n_1, \\ b(n, k), & Y_i = n_k \text{ and } 2 \le k \le s(n). \end{cases}$$
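The key difference from Example 1 is that a cycle on level n now spells the level out in s(n) = O(log n) symbols rather than n symbols. A small sketch of ours (helper names are hypothetical) makes this explicit:

```python
def s(n):
    """s(n): length of the binary expansion of n."""
    return n.bit_length()

def emissions_example2(n):
    """Observable symbols emitted during one full cycle on level n in
    Example 2: the leading digit b(n, 1) = 1 is replaced by the delimiter 2,
    followed by b(n, 2), ..., b(n, s(n)).  The level is thus encoded in
    s(n) = O(log n) symbols, which is why the block mutual information can
    grow much faster than in Example 1."""
    digits = [int(d) for d in bin(n)[2:]]  # b(n, 1), ..., b(n, s(n))
    return [2] + digits[1:]
```

For instance, level 6 = 110_2 yields the cycle 2, 1, 0, and every cycle on level n has length s(n).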
In the third example, the rate of growth of mutual information also reaches the upper
bound and the process is additionally ergodic. The process resembles the branching copy (BC)
process introduced in [26]. There are three main differences between the BC process
and our process. First, we discuss a simpler non-unifilar presentation of the process
rather than a more complicated unifilar one. Second, we add strings of s(m) + 1
separators '3' in the observable process. Third, we put slightly different transition
probabilities to obtain a simpler stationary distribution. All these changes lead to a
simpler computation of mutual information.
Example 3 (Heavy Tailed Mixing Copy) Let Y = {n_k}_{1≤k≤r(n), n≥2} with r(n) =
3s(n) and s(n) being the length of the binary expansion of number n. Then we set the
transition probabilities

$$P(Y_{i+1} = y \mid Y_i = n_k) = \begin{cases} 1, & y = n_{k+1} \text{ and } 1 \le k < r(n), \\ \bar p(m), & y = m_1 \text{ and } k = r(n), \end{cases}$$

where

$$\bar p(n) = \frac{\Delta}{r(n)\, n \log^{\alpha} n} \quad \text{and} \quad \Delta^{-1} = \sum_{n=2}^{\infty} \frac{1}{r(n)\, n \log^{\alpha} n}.$$

This time levels T_n communicate through
transitions m_{r(m)} → n_1 happening with probabilities \bar p(n). The transition graph of the
process (Y_i)_{i∈ℤ} is strongly connected and there is a unique stationary distribution.
Hence the process is ergodic. It can be easily verified that the stationary distribution
is (7) so the levels are distributed according to (6). As previously, let b(n, k) be the
kth digit of the binary expansion of number n. The observable process is set as
$$X_i = \begin{cases} 2, & Y_i = n_1, \\ b(n, k), & Y_i = n_k \text{ and } 2 \le k \le s(n), \\ 3, & Y_i = n_k \text{ and } s(n) + 1 \le k \le 2s(n) + 1, \\ b(n, k - 2s(n)), & Y_i = n_k \text{ and } 2s(n) + 2 \le k \le 3s(n). \end{cases}$$

Proposition 2 For Example 2, we have

$$E(n) = \begin{cases} \Theta(\log n), & \alpha = 2, \\ \Theta(n^{2-\alpha}), & \alpha \in (1, 2). \end{cases} \tag{14}$$
Proposition 3 For Example 3, E (n) satisfies (14).
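To make the cycle structure used in the proof of Proposition 3 concrete, the following sketch of ours (with hypothetical names) emits one full cycle of observable symbols on level n: a delimiter 2, the digits b(n, 2), ..., b(n, s(n)), a run of s(n) + 1 separators 3, and a second copy of the digits. A long past block therefore ends with one copy of the digits of n, while the matching future block begins with the other copy.

```python
def s(n):
    """Length of the binary expansion of n."""
    return n.bit_length()

def cycle_example3(n):
    """One full cycle (length r(n) = 3*s(n)) of observable symbols on
    level n in Example 3: delimiter 2, digits b(n,2..s(n)), a run of
    s(n) + 1 separators 3, and a second copy of the digits."""
    digits = [int(d) for d in bin(n)[2:]][1:]  # b(n, 2), ..., b(n, s(n))
    return [2] + digits + [3] * (s(n) + 1) + digits
```

For instance, level 6 = 110_2 yields the cycle 2, 1, 0, 3, 3, 3, 3, 1, 0 of length 9 = 3 s(6).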
Summing up our results, let us make the following comment. The power-law growth of block
mutual information has been previously considered a hallmark of stochastic processes
that model complex behavior, such as texts in natural language [1,5,22]. However,
the constructed examples of hidden Markov processes feature quite simple transition
graphs. Consequently, one may doubt whether power-law growth of mutual
information is a sufficient reason to call a given stochastic process a model of complex
behavior, even when we restrict the class of processes to processes over a finite
alphabet. Based on our experience with other processes with rapidly growing block mutual
information [9–12], which are more motivated linguistically, we think that infinite
excess entropy is just one of the necessary conditions. Identifying other conditions for
stochastic models of complex systems is a matter of further interdisciplinary research.
We believe that these conditions depend on a particular system to be modeled.
3 Proofs
We begin with two simple bounds, valid for α ∈ (1, 2], which follow by comparing
the sums with the corresponding integrals:

$$\sum_{m=2}^{n} \frac{1}{m \log^{\alpha-1} m} = \begin{cases} (\ln^2 2) \log \log n + O(1), & \alpha = 2, \\[4pt] \dfrac{\ln 2}{2-\alpha} \log^{2-\alpha} n + O(1), & \alpha \in (1, 2), \end{cases} \tag{18}$$

$$\sum_{m=n}^{\infty} \frac{1}{m \log^{\alpha} m} = \frac{\ln 2}{\alpha - 1} \log^{1-\alpha} n + O\!\left(\frac{1}{n}\right). \tag{19}$$
For an event B, let us introduce conditional entropy H(X | B) and mutual
information I(X; Y | B), which are, respectively, the entropy of variable X and mutual
information between variables X and Y taken with respect to probability measure
P(· | B). The conditional entropy H(X | Z) and information I(X; Y | Z) for a variable Z
are the averages of expressions H(X | Z = z) and I(X; Y | Z = z) taken with weights
P(Z = z). These are standard notions. Now comes a handy fact that we will also
use. Let I_B be the indicator function of event B. Observe that

$$I(X; Y) = I(X; Y \mid I_B) + I(X; Y; I_B) = P(B)\, I(X; Y \mid B) + P(B^c)\, I(X; Y \mid B^c) + I(X; Y; I_B), \tag{20}$$

where the triple information I(X; Y; I_B) satisfies |I(X; Y; I_B)| ≤ H(I_B) ≤ 1 by the
information diagram [27].
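On toy joint distributions, the identity (20) and the bound |I(X; Y; I_B)| ≤ H(I_B) ≤ 1 can be verified numerically. The helpers below are our illustration, not part of the proofs.

```python
from math import log2

def mi(joint):
    """Mutual information (in bits) of a joint pmf given as {(x, y): prob}."""
    px, py = {}, {}
    for (x, y), p in joint.items():
        px[x] = px.get(x, 0.0) + p
        py[y] = py.get(y, 0.0) + p
    return sum(p * log2(p / (px[x] * py[y]))
               for (x, y), p in joint.items() if p > 0)

def triple_information(joint, event):
    """I(X; Y; I_B) computed via (20) as I(X; Y) - I(X; Y | I_B), where I_B
    is the indicator of the event B described by the predicate `event`."""
    pb = sum(p for xy, p in joint.items() if event(xy))
    joint_b = {xy: p / pb for xy, p in joint.items() if event(xy)}
    joint_bc = {xy: p / (1 - pb) for xy, p in joint.items() if not event(xy)}
    return mi(joint) - (pb * mi(joint_b) + (1 - pb) * mi(joint_bc))
```

For B = (X = 0) the variable X is constant on both B and its complement, so both conditional informations vanish and the triple information equals I(X; Y), which indeed stays below 1 bit.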
Proof of Theorem 1 Consider the event B = (N_0 ≤ 2^n), where N_0 is the level
indicator of variable Y_0. On the one hand, by Markovianity of (Y_i)_{i∈ℤ} and the data
processing inequality, we have

$$I(X_{-n+1}^0; X_1^n \mid B) \le I(Y_{-n+1}^0; Y_1^n \mid B) \le I(Y_0; Y_1 \mid B) \le H(Y_0 \mid B).$$

On the other hand, for B^c, the complement of B, we have

$$I(X_{-n+1}^0; X_1^n \mid B^c) \le H(X_{-n+1}^0 \mid B^c) \le n \log \operatorname{card} X,$$

where card X, the cardinality of set X, is finite. Hence, using (20), we obtain

$$E(n) \le P(B)\, H(Y_0 \mid B) + n P(B^c) \log \operatorname{card} X + 1. \tag{21}$$

Using (18) yields further

$$P(B) = \sum_{m=2}^{2^n} p(m) \le 1$$

and

$$P(B)\, H(Y_0 \mid B) = -\sum_{m=2}^{2^n} p(m) \log \frac{p(m)}{r(m)} + P(B) \log P(B) \le \sum_{m=2}^{2^n} p(m) \log \frac{r(m)}{p(m)} = \begin{cases} O(\log n), & \alpha = 2, \\ O(n^{2-\alpha}), & \alpha \in (1, 2), \end{cases}$$

since r(m) = O(m^p) implies log(r(m)/p(m)) = O(log m). On the other hand, by (19), we have

$$n P(B^c) = n \sum_{m=2^n+1}^{\infty} p(m) = O(n^{2-\alpha}).$$

Plugging both bounds into (21) yields the requested bound (8). □
Now we prove Propositions 1–3. The proofs are very similar and consist in
constructing variables D_n that are both functions of X_{-n+1}^0 and functions of X_1^n. Given
this property, we obtain

$$E(n) = I(X_{-n+1}^0; X_1^n) = H(D_n) + I(X_{-n+1}^0; X_1^n \mid D_n). \tag{22}$$

Hence, some lower bounds for the block mutual information E(n) follow from the
respective bounds for the entropies of D_n.
Proof of Proposition 1 Introduce random variable

$$D_n = \begin{cases} N_0, & 2 N_0 \le n, \\ 0, & \text{otherwise}. \end{cases}$$

Equivalently, we have

$$D_n = \begin{cases} N_0, & N_0 \le \lfloor n/2 \rfloor, \\ 0, & N_0 > \lfloor n/2 \rfloor. \end{cases}$$

It can be seen that D_n is both a function of X_{-n+1}^0 and a function of X_1^n. On the
one hand, observe that if 2N_0 ≤ n then we can identify N_0 given X_{-n+1}^0 because the
full period is visible in X_{-n+1}^0, bounded by two delimiters 1. On the other hand, if
2N_0 > n then given X_{-n+1}^0 we may conclude that the period's length N_0 exceeds n/2,
regardless of whether the whole period is visible or not. Hence variable D_n is a function
of X_{-n+1}^0. In a similar way, we show that D_n is a function of X_1^n. Given both facts,
we derive (22).

Next, we bound the terms appearing on the right-hand side of (22). For a given N_0,
variable X_{-n+1}^0 assumes at most N_0 distinct values, which depend on N_0. Hence

$$H(X_{-n+1}^0 \mid D_n = m) \le \log m \quad \text{for } 2 \le m \le \lfloor n/2 \rfloor.$$

On the other hand, if we know that N_0 > n then the number of distinct values of variable
X_{-n+1}^0 equals n + 1. Consequently, if we know that D_n = 0, i.e., N_0 ≥ ⌊n/2⌋ + 1,
then the number of distinct values of X_{-n+1}^0 is bounded above by

$$n + 1 + \sum_{m=\lfloor n/2 \rfloor + 1}^{n} m \le \frac{25 n^2}{8}.$$

In this way, we obtain

$$H(X_{-n+1}^0 \mid D_n = 0) \le \log \frac{25 n^2}{8}.$$

Hence, by (18) and (19), the conditional mutual information may be bounded as

$$I(X_{-n+1}^0; X_1^n \mid D_n) \le \sum_{m=2}^{\lfloor n/2 \rfloor} P(D_n = m)\, H(X_{-n+1}^0 \mid D_n = m) + P(D_n = 0)\, H(X_{-n+1}^0 \mid D_n = 0)$$
$$\le \sum_{m=2}^{\lfloor n/2 \rfloor} p(m) \log m + P(D_n = 0) \log \frac{25 n^2}{8} = \begin{cases} O(\log \log n), & \alpha = 2, \\ O(\log^{2-\alpha} n), & \alpha \in (1, 2). \end{cases}$$

The entropy of D_n may be bounded similarly,

$$H(D_n) = -\sum_{m=2}^{\lfloor n/2 \rfloor} p(m) \log p(m) - P(D_n = 0) \log P(D_n = 0) = \begin{cases} \Theta(\log \log n), & \alpha = 2, \\ \Theta(\log^{2-\alpha} n), & \alpha \in (1, 2). \end{cases}$$

Hence, because E(n) satisfies (22), we obtain (11). □
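The recipe for computing D_n from the past block in the proof above can be sketched in a few lines (our illustration; names are hypothetical): locate two delimiters 1, read off the period, and report it only when twice the period fits into the window.

```python
def d_n_from_past(past, n):
    """Compute D_n of the proof of Proposition 1 from the past block
    X_{-n+1}^0 of Example 1, a 0-1 sequence in which 1s mark the state n_1.

    Returns the period N_0 when two delimiters are visible and 2*N_0 <= n,
    and 0 otherwise (the period then exceeds n/2)."""
    ones = [i for i, x in enumerate(past) if x == 1]
    if len(ones) >= 2:
        period = ones[1] - ones[0]
        if 2 * period <= n:
            return period
    return 0
```

Note that 2*N_0 <= n guarantees that at least two delimiters fall into a window of length n, so the two cases of the definition are exhaustive.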
Proof of Proposition 2 Introduce random variable

$$D_n = \begin{cases} N_0, & 2 s(N_0) \le n, \\ 0, & \text{otherwise}. \end{cases}$$

Equivalently, we have

$$D_n = \begin{cases} N_0, & N_0 \le 2^{\lfloor n/2 \rfloor} - 1, \\ 0, & N_0 > 2^{\lfloor n/2 \rfloor} - 1. \end{cases}$$

As in the previous proof, the newly constructed variable D_n is both a function of
X_{-n+1}^0 and a function of X_1^n. If 2s(N_0) ≤ n then we can identify N_0 given X_{-n+1}^0
because the full period is visible in X_{-n+1}^0, bounded by two delimiters 2. If 2s(N_0) > n
then given X_{-n+1}^0 we may conclude that the period's length s(N_0) exceeds n/2,
regardless of whether the whole period is visible or not. Hence variable D_n is a function
of X_{-n+1}^0. In a similar way, we demonstrate that D_n is a function of X_1^n. By these two
facts, we infer (22).

Observe that the largest m such that s(m) = ⌊log m⌋ + 1 ≤ n/2 is m = 2^{⌊n/2⌋} − 1.
Using (18), the entropy of D_n may be bounded as

$$H(D_n) = -\sum_{m=2}^{2^{\lfloor n/2 \rfloor} - 1} p(m) \log p(m) - P(D_n = 0) \log P(D_n = 0) = \begin{cases} \Theta(\log n), & \alpha = 2, \\ \Theta(n^{2-\alpha}), & \alpha \in (1, 2). \end{cases}$$

Thus (14) follows by (22) and Theorem 1. □
Proof of Proposition 3 Introduce random variable

$$D_n = \begin{cases} N_0, & 2 s(N_0) \le n \text{ and } Y_0 = n_k \text{ with } s(N_0) + 1 \le k \le 2 s(N_0), \\ 0, & \text{otherwise}. \end{cases}$$

Equivalently, we have

$$D_n = \begin{cases} N_0, & N_0 \le 2^{\lfloor n/2 \rfloor} - 1 \text{ and } Y_0 = n_k \text{ with } s(N_0) + 1 \le k \le 2 s(N_0), \\ 0, & \text{otherwise}. \end{cases}$$

The way of computing D_n given X_{-n+1}^0 is as follows. If

$$X_{-n+1}^0 = (\ldots, 2, b(m, 2), b(m, 3), \ldots, b(m, s(m)), \underbrace{3, \ldots, 3}_{l \text{ times}})$$

for some m such that 2s(m) ≤ n and 1 ≤ l ≤ s(m) then we return D_n = m. Otherwise
we return D_n = 0. The recipe for D_n given X_1^n is mirror-like. If

$$X_1^n = (\underbrace{3, \ldots, 3}_{l \text{ times}}, b(m, 2), b(m, 3), \ldots, b(m, s(m)), 2, \ldots)$$

for some m such that 2s(m) ≤ n and 1 ≤ l ≤ s(m) then we return D_n = m. Otherwise
we return D_n = 0. In view of these observations we derive (22), as in the previous
two proofs.

Now, for m ≠ 0 and s(m) ≤ n/2, the distribution of D_n is

$$P(D_n = m) = \frac{p(m)}{3}.$$

Notice that the largest m such that s(m) = ⌊log m⌋ + 1 ≤ n/2 is m = 2^{⌊n/2⌋} − 1.
Hence, by (18), the bound for the entropy of D_n is

$$H(D_n) = -\sum_{m=2}^{2^{\lfloor n/2 \rfloor} - 1} \frac{p(m)}{3} \log \frac{p(m)}{3} - P(D_n = 0) \log P(D_n = 0) = \begin{cases} \Theta(\log n), & \alpha = 2, \\ \Theta(n^{2-\alpha}), & \alpha \in (1, 2). \end{cases}$$

Consequently, (14) follows by (22) and Theorem 1. □
Acknowledgments I thank Nick Travers, Jan Mielniczuk, and an anonymous referee for comments and
remarks.
Open Access This article is distributed under the terms of the Creative Commons Attribution License
which permits any use, distribution, and reproduction in any medium, provided the original author(s) and
the source are credited.