Curious Explorer: A Provable Exploration Strategy in Policy Learning

Marco Miani, Maurizio Parton, Marco Romito

Research output: Contribution to journal › Journal article › Research › peer-review

Abstract

A coverage assumption is critical for policy gradient methods: the objective function is insensitive to updates in rarely visited states, yet the agent may need to improve in those states to reach a nearly optimal payoff. However, this assumption can be infeasible in certain settings, for instance in online learning, or when restarts are possible only from a fixed initial state. In these cases, classical policy gradient algorithms such as REINFORCE can have poor convergence properties and poor sample efficiency. Curious Explorer is an iterative pure-exploration strategy over the state space that improves the coverage of any restart distribution ρ. Using ρ and intrinsic rewards, Curious Explorer produces a sequence of policies, each more exploratory than the previous one, and outputs a restart distribution whose coverage is based on the state visitation distributions of these exploratory policies. The paper's main results are a theoretical upper bound on how often an optimal policy visits poorly visited states, and a bound on the error of the return obtained by REINFORCE without any coverage assumption. Finally, we conduct ablation studies with REINFORCE and TRPO on two hard-exploration tasks, supporting the claim that Curious Explorer can improve the performance of very different policy gradient algorithms.
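
The abstract does not spell out the exact form of the intrinsic reward or of the output restart distribution, so the following is only an illustrative sketch of the iterative loop it describes, not the paper's algorithm or API. It assumes a tabular setting, a hypothetical count-based curiosity bonus, a plain REINFORCE update on intrinsic returns, and a simple 50/50 mix of ρ with the empirical visitation distribution; the function name curious_explorer_sketch and all parameters are made up for illustration.

```python
# Illustrative sketch (assumptions flagged above, not the paper's method):
# iteratively train policies on count-based intrinsic rewards, accumulate
# their state visitations, and return a mixed restart distribution.
import numpy as np

def curious_explorer_sketch(n_states, n_actions, step_fn, rho,
                            n_iters=5, episodes_per_iter=50,
                            horizon=30, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    visit_counts = np.ones(n_states)          # pseudo-counts for the curiosity bonus
    visitation = np.zeros(n_states)           # accumulated state visitation
    logits = np.zeros((n_states, n_actions))  # tabular softmax policy

    for _ in range(n_iters):
        for _ in range(episodes_per_iter):
            s = rng.choice(n_states, p=rho)   # restart from rho
            traj = []
            for _ in range(horizon):
                p = np.exp(logits[s] - logits[s].max())
                p /= p.sum()
                a = rng.choice(n_actions, p=p)
                s_next = step_fn(s, a)
                r_int = 1.0 / np.sqrt(visit_counts[s_next])  # assumed intrinsic reward
                traj.append((s, a, r_int))
                visit_counts[s_next] += 1
                visitation[s_next] += 1
                s = s_next
            # REINFORCE update on the intrinsic return of this episode
            G = 0.0
            for s_t, a_t, r_t in reversed(traj):
                G = r_t + G
                p = np.exp(logits[s_t] - logits[s_t].max())
                p /= p.sum()
                grad = -p
                grad[a_t] += 1.0              # grad of log softmax at the taken action
                logits[s_t] += lr * G * grad

    mu = visitation / visitation.sum()        # empirical visitation distribution
    return 0.5 * rho + 0.5 * mu               # assumed mixing of rho with visitation

# Toy usage: a 10-state chain where action 1 moves right and action 0 moves left,
# with restarts only from state 0 (a deliberately hard-to-cover setup).
n = 10
step = lambda s, a: min(s + 1, n - 1) if a == 1 else max(s - 1, 0)
rho = np.zeros(n); rho[0] = 1.0
print(curious_explorer_sketch(n, 2, step, rho))
```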
Original language: English
Journal: IEEE Transactions on Pattern Analysis and Machine Intelligence
Volume: 46
Issue number: 12
Pages (from-to): 11422-11431
ISSN: 0162-8828
Publication status: Published - 2024

Keywords

  • Exploration
  • PAC algorithms
  • Reinforcement learning
