An ϵ -Greedy Multiarmed Bandit Approach to Markov Decision Processes

[摘要] We present REGA, a new adaptive-sampling-based algorithm for the control of finite-horizon Markov decision processes (MDPs) with very large state spaces and small action spaces. We apply a variant of theϵ -greedy multiarmed bandit algorithm to each stage of the MDP in a recursive manner, thus computing an estimation of the “reward-to-go” value at each stage of the MDP. We provide a finite-time analysis of REGA. In particular, we provide a bound on the probability that the approximation error exceeds a given threshold, where the bound is given in terms of the number of samples collected at each stage of the MDP. We empirically compare REGA against another sampling-based algorithm called RASA by running simulations against the SysAdmin benchmark problem with 2 10 states. The results show that REGA and RASA achieved similar performance. Moreover, REGA and RASA empirically outperformed an implementation of the algorithm that uses the “original”ϵ -greedy algorithm that commonly appears in the literature.

[发布日期] [发布机构]

[效力级别] [学科分类] 农艺学与作物科学

[关键词] multiarmed bandits;epsilon-greedy method;Markov decision process (MDP);sampling;optimization under uncertainties [时效性]

浏览次数：8

统一登录查看全文激活码登录查看全文