
Finite-time Analysis of the Multiarmed Bandit Problem

    Research output: Contribution to journal › Journal article › Research › peer-review

    Abstract

    Reinforcement learning policies face the exploration versus exploitation dilemma, i.e., the search for a balance between exploring the environment to find profitable actions and taking the empirically best action as often as possible. A popular measure of a policy's success in addressing this dilemma is the regret, that is, the loss incurred because the globally optimal policy is not followed at all times. One of the simplest examples of the exploration/exploitation dilemma is the multi-armed bandit problem. Lai and Robbins were the first to show that the regret for this problem has to grow at least logarithmically in the number of plays. Since then, policies that asymptotically achieve this regret have been devised by Lai and Robbins and many others. In this work we show that the optimal logarithmic regret is also achievable uniformly over time, with simple and efficient policies, and for all reward distributions with bounded support.
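
To make the notion of regret concrete, here is a minimal Python sketch. The abstract does not name a specific policy, so everything in the code is an assumption made for illustration: the policy is a UCB1-style upper-confidence-bound index rule (empirical mean plus a `sqrt(2 ln t / n_j)` bonus), the arms are Bernoulli, and the function names `ucb1` and `pseudo_regret` are hypothetical. Regret is computed as the expected loss relative to always playing the arm with the highest mean.

```python
import math
import random


def ucb1(means, n_plays, seed=0):
    """Run a UCB1-style index policy on Bernoulli arms; return pull counts.

    `means` holds the true success probabilities (unknown to the policy);
    it is used here only to simulate rewards. Assumed setup, not taken
    from the abstract.
    """
    rng = random.Random(seed)
    k = len(means)
    counts = [0] * k      # number of times each arm was played
    totals = [0.0] * k    # sum of rewards observed per arm

    for t in range(1, n_plays + 1):
        if t <= k:
            arm = t - 1   # initialization: play each arm once
        else:
            # Index = empirical mean + exploration bonus that shrinks
            # as an arm is played more often.
            arm = max(
                range(k),
                key=lambda j: totals[j] / counts[j]
                + math.sqrt(2.0 * math.log(t) / counts[j]),
            )
        reward = 1.0 if rng.random() < means[arm] else 0.0
        counts[arm] += 1
        totals[arm] += reward
    return counts


def pseudo_regret(means, counts):
    """Expected loss relative to always playing the best arm."""
    best = max(means)
    return sum((best - m) * c for m, c in zip(means, counts))


if __name__ == "__main__":
    arm_means = [0.2, 0.5, 0.8]          # illustrative values only
    pulls = ucb1(arm_means, n_plays=2000)
    print("pulls per arm:", pulls)
    print("pseudo-regret:", pseudo_regret(arm_means, pulls))
```

Running the script for increasing values of `n_plays` is one informal way to see how the regret of such a policy grows with the number of plays; the bounded-support assumption in the abstract corresponds here to rewards in {0, 1}.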
    Original language: English
    Journal: Machine Learning
    Volume: 47
    Issue number: 3
    Pages (from-to): 235-256
    ISSN: 0885-6125
    Publication status: Published - 2002
