Optimal Behavior is Easier to Learn than the Truth
Minds & Machines
Ronald Ortner
Department Mathematik und Informationstechnologie, Montanuniversität Leoben, Franz-Josef-Straße 18, 8700 Leoben, Austria
We consider a reinforcement learning setting where the learner is given a set of possible models containing the true model. While there are algorithms that are able to successfully learn optimal behavior in this setting, they do so without trying to identify the underlying true model. Indeed, we show that there are cases in which the attempt to find the true model is doomed to failure.

In reinforcement learning problems, an agent acts in an unknown environment: the agent takes actions, each of which is followed by a response of the environment. The standard paradigm for representing such reinforcement learning problems is the Markov decision process, where, starting in some initial state s_1, the agent at time steps t = 1, 2, ... chooses an action a_t from a set of actions A, obtains a random reward depending on the current state s_t and the chosen action a_t, and then moves to state s_{t+1} according to transition probabilities that also depend on the state-action pair (s_t, a_t). Formally, a Markov decision process is defined as follows.
Markov decision processes
Definition 1 A Markov decision process (MDP) M consists of a set of states S with some distinguished initial state s_1, a set of actions A, reward distributions with mean r(s, a) for the reward when choosing action a in state s, and transition probabilities p(s'|s, a) for the probability of moving to state s' when choosing action a in state s.
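For concreteness, the following minimal Python sketch (with hypothetical names and, for brevity, deterministic rewards instead of reward distributions) represents such an MDP by its mean rewards and transition probabilities and simulates the interaction described above: at each step the agent chooses an action, obtains a reward, and moves to the next state.

    import random

    class MDP:
        """Minimal finite MDP: states S, actions A, mean rewards r(s, a), transitions p(s'|s, a)."""
        def __init__(self, states, actions, rewards, transitions, initial_state):
            self.states = states            # state space S
            self.actions = actions          # action space A
            self.rewards = rewards          # rewards[(s, a)] = r(s, a); deterministic here for brevity
            self.transitions = transitions  # transitions[(s, a)] = {s': p(s'|s, a)}
            self.state = initial_state      # distinguished initial state s_1

        def step(self, action):
            """Choose an action in the current state, collect the reward, move to the next state."""
            s = self.state
            reward = self.rewards[(s, action)]
            successors = self.transitions[(s, action)]
            self.state = random.choices(list(successors), weights=list(successors.values()))[0]
            return reward, self.state

    # Tiny two-state example: 'switch' toggles the state, 'stay' keeps it; staying in s1 pays off.
    mdp = MDP(
        states=['s0', 's1'],
        actions=['stay', 'switch'],
        rewards={('s0', 'stay'): 0.0, ('s0', 'switch'): 0.1,
                 ('s1', 'stay'): 1.0, ('s1', 'switch'): 0.1},
        transitions={('s0', 'stay'): {'s0': 1.0}, ('s0', 'switch'): {'s1': 1.0},
                     ('s1', 'stay'): {'s1': 1.0}, ('s1', 'switch'): {'s0': 1.0}},
        initial_state='s0',
    )
    for t in range(5):  # t = 1, 2, ...: choose a_t, obtain r_t, move to s_{t+1}
        r_t, s_next = mdp.step(random.choice(mdp.actions))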
However, in many practical reinforcement learning problems (e.g. applications in robotics) the underlying state space can be huge or even unknown. For example, a chess playing robot may be confronted with the same board position on two different occasions, yet the respective video signals of the board may differ. Thus, it makes sense to distinguish between observations and states. In more complex scenarios, as we will consider here, the agent only has direct access to observations (like the video signal of the robot) but has no information about the underlying state (the respective board position). In the motivating example, states could be considered as sets of observations (those corresponding to the same state), or equivalently, one may consider mappings from the set of observations O to a state space S. Without prior knowledge the agent has to consider various such models that map observations to states. In our example, the true model would map all images showing the same board position to the same state, but of course when learning from scratch it is not clear that this is the correct model. Rather, it seems that the learner has to learn the true model, or a good approximation of it, as well. Actually, we will consider more general models that aggregate not just observations but, more generally, histories, that is, the sequences of observations, rewards, and chosen actions. This notion of model was introduced by Hutter (2009).
Definition 2 In a reinforcement learning problem, the history h_t after t time steps is the sequence h_t := o_1, a_1, r_1, o_2, a_2, r_2, ..., a_t, r_t, o_{t+1} of observations o_s in O, collected rewards r_s in R, and chosen actions a_s in A at time steps s = 1, ..., t. A model then maps each such history h_t to a state in some state space S.
That is, similar to our motivating example, a model assigns a respective state to each situation in which the agent can find herself (i.e., a history). In the example of the chess playing robot, the observation of the board is actually not always sufficient to decide whose move it is, and one also has to take into account the recent history of observations to determine the correct state of the game. Note that the notion of model in Definition 2 is still rather modest, as we do not demand that a model also specify the precise values of the mean rewards and transition probabilities of all state-action pairs, which would obviously make things much harder.
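To make this concrete, here is a minimal Python sketch (hypothetical, not from the paper) of a history as in Definition 2 and of two models in the above sense: one that aggregates histories by their last observation only, and one that, in the spirit of the chess example, also uses the length of the history to recover whose move it is. Note that neither model specifies rewards or transition probabilities; they only group histories into states.

    from typing import Callable, Hashable, List, Tuple

    Observation = Hashable
    Action = Hashable
    Reward = float
    State = Hashable

    # A history h_t = o_1, a_1, r_1, ..., a_t, r_t, o_{t+1}: the (o, a, r) triples so far,
    # together with the latest observation o_{t+1}.
    History = Tuple[List[Tuple[Observation, Action, Reward]], Observation]

    # A model maps histories to states; it only says which histories are treated as the
    # same state, not what the rewards or transition probabilities of these states are.
    Model = Callable[[History], State]

    def last_observation_model(history: History) -> State:
        """Hypothetical model that identifies the state with the most recent observation."""
        _, latest_obs = history
        return latest_obs

    def board_and_turn_model(history: History) -> State:
        """Hypothetical chess-like model: the current board observation alone does not
        reveal whose move it is, so the side to move is inferred from the parity of the
        number of actions taken so far (an assumed encoding, for illustration only)."""
        triples, latest_obs = history
        side_to_move = 'white' if len(triples) % 2 == 0 else 'black'
        return (latest_obs, side_to_move)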
Now, we assume that there is a true model u_true that maps histories to states with respect to which the environment behaves like a Markov decision process.1 However, this model is unknown to the learner. Rather, the learner has at her disposal a set of possible models U, each mapping histories to states, which we assume to contain the true model u_true.
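The learner's situation can thus be pictured as follows (again a hypothetical sketch, reusing the example models from the previous sketch): she is handed a set U of candidate state mappings that is known to contain the true model, without being told which element it is.

    # The set U of possible models at the learner's disposal; the true model u_true is
    # known to be among them, but its identity is not revealed to the learner.
    U = {
        'last-observation': last_observation_model,
        'board-and-turn': board_and_turn_model,
    }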
In this paper we are interested in the following questions: Is it possible to identify
the true model? Can an agent learn to behave optimally in the underlying true
Markov decision process? Is the identification of the true model a necessary
prerequisite for optimal behavior? Surprisingly, it turns out that not only is the answer to the latter question negative, but in general it is not even possible to identify the true model, while it is still possible to learn optimal behavior.

1 Note that the crucial property of an MDP is its Markovian behavior, that is, rewards and transitions only depend on the current state and not on the history.
Learnin (...truncated)