This dissertation describes sequential decision making problems in non-stationary environments. Online learning algorithms deal with non-stationary environments, but generally there is no notion of a dynamic state to model future impacts of past actions. State-based models are common in stochastic control settings, but well-known frameworks such as Markov decision processes (MDPs) assume a known stationary environment. In recent years, there has been a growing interest in fusing the above two important learning frameworks and considering an MDP setting in which the cost function is allowed to change arbitrarily over time. A number of online MDP algorithms have been designed to work under various assumptions about the dynamics of state transitions so far and provide performance guarantees, i.e. bounds on the regret defined as the performance gap between the total cost incurred by the learner and the total cost of the best available stationary policy that could have been chosen in hindsight.
However, most of the work in this area has been algorithmic: given a problem, one
would develop an algorithm almost from scratch and prove the performance guarantees on a case-by-case basis. Moreover, the presence of the state and the assumption of an arbitrarily varying environment complicate both the theoretical analysis and the development of computationally efficient methods. Another potential issue is that, by removing distributional assumptions about the mechanism generating the cost sequences, the existing methods have to consider the worst-case scenario, which may render their solutions too conservative in situations where the environment exhibits some degree of predictability.
This dissertation contributes several novel techniques to address the above challenges of the online MDP framework and opens up new research directions for online MDPs.
Our proposed general framework for deriving algorithms in the online MDP setting leads to a unifying view of existing methods and provides a general procedure for constructing new ones. Several new algorithms are developed and analyzed using this framework. We develop convex-analytical algorithms that take advantage of possible regularity of observed sequences, yet maintain the worst case performance guarantees. To further study the convex-analytic methods we applied above, we take a step back to consider the traditional MDP problem and extend the LP approach to MDPs by adding a relative entropy regularization term. A computationally efficient algorithm for this class of MDPs is constructed under mild assumptions on the state transition models. Two-player zero-sum stochastic games are also investigated in this dissertation as an important extension of the online MDP setting. In short, this dissertation provides in-depth analysis of the online MDP problem and answers several important questions in this field.