Optimal Behavior is Easier to Learn than the Truth
Minds & Machines
Ronald Ortner
Department Mathematik und Informationstechnologie, Montanuniversität Leoben, Franz-Josef-Straße 18, 8700 Leoben, Austria
We consider a reinforcement learning setting where the learner is given a set of possible models containing the true model. While there are algorithms that are able to successfully learn optimal behavior in this setting, they do so without trying to identify the underlying true model. Indeed, we show that there are cases in which the attempt to find the true model is doomed to failure.

In reinforcement learning problems, an agent acts in an unknown environment: the agent takes actions, each of which is followed by a response of the environment. The standard paradigm for representing such reinforcement learning problems is the Markov decision process, where, starting in some initial state s_1, the agent at time steps t = 1, 2, ... chooses an action a_t from a set of actions A, obtains a random reward depending on the current state s_t and the chosen action a_t, and then moves to state s_{t+1} according to transition probabilities that also depend on the state-action pair (s_t, a_t). Formally, a Markov decision process is defined as follows.
Markov decision processes
Definition 1 A Markov decision process (MDP) M consists of a set of states S with some distinguished initial state s_1, a set of actions A, reward distributions with mean r(s, a) for the reward when choosing action a in state s, and transition probabilities p(s'|s, a) for the probability of moving to state s' when choosing action a in state s.
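For concreteness, the following minimal Python sketch (with hypothetical names and, for brevity, deterministic rewards instead of reward distributions) represents such an MDP by its mean rewards and transition probabilities and simulates the interaction described above: at each step the agent chooses an action, obtains a reward, and moves to the next state.

    import random

    class MDP:
        """Minimal finite MDP: states S, actions A, mean rewards r(s, a), transitions p(s'|s, a)."""
        def __init__(self, states, actions, rewards, transitions, initial_state):
            self.states = states            # state space S
            self.actions = actions          # action space A
            self.rewards = rewards          # rewards[(s, a)] = r(s, a); deterministic here for brevity
            self.transitions = transitions  # transitions[(s, a)] = {s': p(s'|s, a)}
            self.state = initial_state      # distinguished initial state s_1

        def step(self, action):
            """Choose an action in the current state, collect the reward, move to the next state."""
            s = self.state
            reward = self.rewards[(s, action)]
            successors = self.transitions[(s, action)]
            self.state = random.choices(list(successors), weights=list(successors.values()))[0]
            return reward, self.state

    # Tiny two-state example: 'switch' toggles the state, 'stay' keeps it; staying in s1 pays off.
    mdp = MDP(
        states=['s0', 's1'],
        actions=['stay', 'switch'],
        rewards={('s0', 'stay'): 0.0, ('s0', 'switch'): 0.1,
                 ('s1', 'stay'): 1.0, ('s1', 'switch'): 0.1},
        transitions={('s0', 'stay'): {'s0': 1.0}, ('s0', 'switch'): {'s1': 1.0},
                     ('s1', 'stay'): {'s1': 1.0}, ('s1', 'switch'): {'s0': 1.0}},
        initial_state='s0',
    )
    for t in range(5):  # t = 1, 2, ...: choose a_t, obtain r_t, move to s_{t+1}
        r_t, s_next = mdp.step(random.choice(mdp.actions))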
However, in many practical reinforcement learning problems (e.g. applications in robotics) the underlying state space can be huge or even unknown. For example, a chess playing robot may be confronted with the same board position on two different occasions, yet the respective video signals of the board may differ. Thus, it makes sense to distinguish between observations and states. In more complex scenarios, as we will consider here, the agent only has direct access to observations (like the video signal of the robot) but has no information about the underlying state (the respective board position). In the motivating example, states could be considered as sets of observations (those corresponding to the same state), or equivalently, one may consider mappings from the set of observations O to a state space S. Without prior knowledge the agent has to consider various such models that map observations to states. In our example, the true model would map all images showing the same board position to the same state, but of course when learning from scratch it is not clear that this is the correct model. Rather, it seems that the learner has to learn the true model, or a good approximation of it, as well. Actually, we will consider more general models that aggregate not just observations but, more generally, histories, that is, the sequences of observations, rewards, and chosen actions. This notion of model was introduced by Hutter (2009).
Definition 2 In a reinforcement learning problem, the history h_t after t time steps is the sequence h_t := o_1, a_1, r_1, o_2, a_2, r_2, ..., a_t, r_t, o_{t+1} of observations o_s in O, collected rewards r_s in R, and chosen actions a_s in A at time steps s = 1, ..., t. A model then maps each such history h_t to a state in some state space S.
That is, similar to our motivating example, a model assigns a respective state to each situation in which the agent can find herself (i.e., a history). In the example of the chess playing robot, the observation of the board is actually not always sufficient to decide whose move it is, and one also has to take into account the recent history of observations to determine the correct state of the game. Note that the notion of model in Definition 2 is still rather modest, as we do not demand that a model also specify the precise values of the mean rewards and transition probabilities of all state-action pairs, which would obviously make things much harder.
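To make this concrete, here is a minimal Python sketch (hypothetical, not from the paper) of a history as in Definition 2 and of two models in the above sense: one that aggregates histories by their last observation only, and one that, in the spirit of the chess example, also uses the length of the history to recover whose move it is. Note that neither model specifies rewards or transition probabilities; they only group histories into states.

    from typing import Callable, Hashable, List, Tuple

    Observation = Hashable
    Action = Hashable
    Reward = float
    State = Hashable

    # A history h_t = o_1, a_1, r_1, ..., a_t, r_t, o_{t+1}: the (o, a, r) triples so far,
    # together with the latest observation o_{t+1}.
    History = Tuple[List[Tuple[Observation, Action, Reward]], Observation]

    # A model maps histories to states; it only says which histories are treated as the
    # same state, not what the rewards or transition probabilities of these states are.
    Model = Callable[[History], State]

    def last_observation_model(history: History) -> State:
        """Hypothetical model that identifies the state with the most recent observation."""
        _, latest_obs = history
        return latest_obs

    def board_and_turn_model(history: History) -> State:
        """Hypothetical chess-like model: the current board observation alone does not
        reveal whose move it is, so the side to move is inferred from the parity of the
        number of actions taken so far (an assumed encoding, for illustration only)."""
        triples, latest_obs = history
        side_to_move = 'white' if len(triples) % 2 == 0 else 'black'
        return (latest_obs, side_to_move)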
Now, we assume that there is a true model u_true that maps histories to states with respect to which the environment behaves like a Markov decision process.1 However, this model is unknown to the learner. Rather, the learner has at her disposal a set of possible models U, each mapping histories to states, which we assume to contain the true model u_true.
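The learner's situation can thus be pictured as follows (again a hypothetical sketch, reusing the example models from the previous sketch): she is handed a set U of candidate state mappings that is known to contain the true model, without being told which element it is.

    # The set U of possible models at the learner's disposal; the true model u_true is
    # known to be among them, but its identity is not revealed to the learner.
    U = {
        'last-observation': last_observation_model,
        'board-and-turn': board_and_turn_model,
    }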
In this paper we are interested in the following questions: Is it possible to identify
the true model? Can an agent learn to behave optimally in the underlying true
Markov decision process? Is the identification of the true model a necessary
prerequisite for optimal behavior? Surprisingly, it turns out that not only is the answer to the latter question negative, but in general it is not even possible to identify the true model, while it is still possible to learn optimal behavior.

1 Note that the crucial property of an MDP is its Markovian behavior, that is, rewards and transitions only depend on the current state and not on the history.
Learnin (...truncated)