Policy Learning for Robust Markov Decision Process with a Mismatched Generative Model

Jialian Li; Tongzheng Ren; Dong Yan; Hang Su; Jun Zhu

[AAAI-22] Main Track

Abstract: In high-stakes scenarios like medical treatment and auto-piloting, collecting online experimental data to train the agent is risky or even infeasible. Simulation-based training can alleviate this issue, but it may suffer from inherent mismatches between the simulator and the real environment. It is therefore imperative to use the simulator to learn a policy that is robust for real-world deployment.

In this work, we consider policy learning for Robust Markov Decision Processes (RMDPs), where the agent seeks a policy that is robust to unexpected perturbations of the environment. Specifically, we focus on the setting where the training environment can be characterized as a generative model and a constrained perturbation can be added to the model during testing. Our goal is to identify a near-optimal robust policy for the perturbed testing environment. This introduces additional technical difficulties, as we need to simultaneously estimate the training environment uncertainty from samples and find the worst-case perturbation for testing. To address this, we propose a generic method that formalizes the perturbation as an opponent to obtain a two-player zero-sum game, and we further show that the Nash equilibrium corresponds to the robust policy. We prove that, with a polynomial number of samples from the generative model, our algorithm can find a near-optimal robust policy with high probability. Thanks to the game-theoretic formulation, our method can handle general perturbations under some mild assumptions and can also be extended to more complex problems such as robust partially observable Markov decision processes.
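The max-min structure described in the abstract can be illustrated with a small tabular sketch. This is not the paper's algorithm: it assumes a hypothetical two-state MDP, a hypothetical contamination-style perturbation set (the opponent may move a fraction `eps` of the transition mass to any state, independently for each state-action pair), and plain robust value iteration on a model estimated from generative-model samples. All names (`sample_next_state`, `estimate_model`, `robust_value_iteration`) and all numbers are illustrative assumptions.

```python
import random

def sample_next_state(true_P, s, a, rng):
    """Hypothetical generative model: draw one next state for (s, a)."""
    probs = true_P[s][a]
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]

def estimate_model(true_P, n_samples, rng):
    """Empirical transition estimate from n_samples draws per (s, a)."""
    n_states = len(true_P)
    P_hat = []
    for s in range(n_states):
        row = []
        for a in range(len(true_P[s])):
            counts = [0] * n_states
            for _ in range(n_samples):
                counts[sample_next_state(true_P, s, a, rng)] += 1
            row.append([c / n_samples for c in counts])
        P_hat.append(row)
    return P_hat

def backup(P, R, V, gamma, eps, s, a):
    """Agent's payoff for (s, a) after the opponent's best response.

    Assumed perturbation set: the opponent shifts a fraction eps of the
    probability mass to the state with the lowest value.
    """
    nominal = sum(p * v for p, v in zip(P[s][a], V))
    return R[s][a] + gamma * ((1 - eps) * nominal + eps * min(V))

def robust_value_iteration(P, R, gamma, eps, iters=500):
    """Iterate the max-min backup, then read off a greedy robust policy."""
    n_states = len(P)
    V = [0.0] * n_states
    for _ in range(iters):
        V = [max(backup(P, R, V, gamma, eps, s, a) for a in range(len(P[s])))
             for s in range(n_states)]
    policy = [max(range(len(P[s])),
                  key=lambda a, s=s: backup(P, R, V, gamma, eps, s, a))
              for s in range(n_states)]
    return V, policy

# Illustrative two-state, two-action environment (made-up numbers).
true_P = [[[0.9, 0.1], [0.5, 0.5]],
          [[0.2, 0.8], [0.6, 0.4]]]
R = [[1.0, 0.5],
     [0.0, -0.2]]

rng = random.Random(0)
P_hat = estimate_model(true_P, n_samples=200, rng=rng)
V_nominal, _ = robust_value_iteration(P_hat, R, gamma=0.9, eps=0.0)
V_robust, pi_robust = robust_value_iteration(P_hat, R, gamma=0.9, eps=0.2)
print("nominal values:", V_nominal)
print("robust values: ", V_robust)
print("robust policy: ", pi_robust)
```

With `eps=0.0` the sketch reduces to ordinary value iteration on the estimated model; increasing `eps` gives the opponent more power, so the robust values can only decrease. The paper's setting allows more general perturbations and comes with sample-complexity guarantees that this toy example does not attempt to reproduce.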

Sessions where this paper appears

Poster Session 6

Sat, February 26 8:45 AM - 10:30 AM (+00:00)

Blue 1

Poster Session 12

Mon, February 28 8:45 AM - 10:30 AM (+00:00)

Blue 1
