Abstract
We consider the problem of learning models of stochastic environments for planning, under the assumption that transitions on the various state factors are independent. We assume access to example trajectories produced by policies for randomly drawn problems executing in an unknown environment, and we would like to match the success rate of those policies. We present a polynomial-time algorithm that, given a polynomial number of trajectories, produces a PPDDL model under which optimal policies provably achieve future goals at a rate almost as high as, or higher than, that of the policies that produced the training trajectories. We further consider a variant of PPDDL in which the transition probabilities are uncertain, specified by an interval for each factor that contains the respective true transition probability. We give a second polynomial-time algorithm that, given the example trajectories, produces such an imprecise-PPDDL environment model; it guarantees that, with high probability, the true environment is indeed captured by the uncertain parameters, and that the optimal bound on the success probability in this model is almost as good as, or better than, the success rate of the policies that produced the training examples.
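Although the abstract states only the guarantees, the interval-based imprecise-PPDDL model suggests a familiar construction: estimate each factor's transition probability empirically from the trajectories and widen it by a concentration bound, so that with high probability every true probability falls inside its interval. The sketch below is illustrative only; the flat trajectory encoding, the function name `estimate_factor_intervals`, and the choice of a Hoeffding bound with a union bound over parameters are assumptions for exposition, not the paper's algorithm.

```python
import math
from collections import defaultdict

def estimate_factor_intervals(trajectories, delta=0.05):
    """Estimate, for each (state-factor, action) pair, an empirical
    transition probability plus a Hoeffding-style interval containing
    the true probability with probability >= 1 - delta overall.

    `trajectories` is a list of episodes; each episode is a list of
    (factor_id, action, flipped) triples, where `flipped` is 1 if the
    factor changed value on that step and 0 otherwise.  This encoding
    is an illustrative assumption, not the paper's format.
    """
    counts = defaultdict(lambda: [0, 0])  # (factor, action) -> [flips, trials]
    for episode in trajectories:
        for factor, action, flipped in episode:
            c = counts[(factor, action)]
            c[0] += flipped
            c[1] += 1

    intervals = {}
    k = max(len(counts), 1)  # number of parameters, for a union bound
    for key, (flips, trials) in counts.items():
        p_hat = flips / trials
        # Hoeffding: P(|p_hat - p| >= eps) <= 2 exp(-2 * trials * eps^2);
        # solving 2 exp(-2 * trials * eps^2) = delta / k gives eps below.
        eps = math.sqrt(math.log(2 * k / delta) / (2 * trials))
        intervals[key] = (max(0.0, p_hat - eps), min(1.0, p_hat + eps))
    return intervals
```

Shrinking `delta` or collecting more trajectories tightens the intervals, which in turn tightens the optimal bound on success probability computed under the imprecise model.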