MultiArmedBandit_RL

Implementation of various multi-armed bandits algorithms using Python.

Algorithms Implemented

The following algorithms are implemented on a 10-arm testbed, as described in Reinforcement Learning: An Introduction by Sutton and Barto.

  • Epsilon-Greedy Algorithm
  • Softmax Algorithm
  • Upper Confidence Bound (UCB1) (a minimal selection-rule sketch follows this list)
  • Median Elimination Algorithm (MEA)
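
As an illustration, the UCB1 rule selects, at step t, the arm maximising the estimated value plus an exploration bonus. The following is a minimal sketch of that selection step with numpy; the function and variable names are illustrative and not taken from the repository's code:

```python
import numpy as np

def ucb1_action(Q, counts, t):
    """UCB1 selection: argmax_a Q[a] + sqrt(2 * ln(t) / counts[a]).

    Q      : current action-value estimates, one per arm
    counts : number of times each arm has been pulled so far
    t      : current time step (1-indexed)
    """
    # Pull every arm at least once before applying the confidence bound
    if np.any(counts == 0):
        return int(np.argmin(counts))
    bonus = np.sqrt(2.0 * np.log(t) / counts)
    return int(np.argmax(Q + bonus))
```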

Dependencies

  • numpy (version 1.13.3 used)
  • matplotlib (version 1.5.1 used)

Description of the Experiment

A 10-arm bandit testbed was generated as described in the section 'The 10-armed Testbed' of the book. The testbed contains 2000 bandit problems with 10 arms each; the true action value q(a) of every arm in every problem is sampled from a normal distribution N(0,1). When a learning algorithm selects action At on one of these problems at time step t, it receives a reward Rt sampled from N(q(At),1). The performance of each implemented algorithm is measured as the average reward (or the percentage of optimal actions picked) at each time step.
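
For concreteness, a testbed and reward model of this kind could be set up with numpy roughly as follows; this is a sketch under the assumptions stated above, not the repository's exact code:

```python
import numpy as np

N_PROBLEMS, N_ARMS = 2000, 10

# True action values q(a): one row per bandit problem, each value drawn from N(0, 1)
q_true = np.random.normal(0.0, 1.0, size=(N_PROBLEMS, N_ARMS))

def pull(problem, arm):
    """Reward for pulling `arm` of bandit `problem`, sampled from N(q(arm), 1)."""
    return np.random.normal(q_true[problem, arm], 1.0)
```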

Implementations

The code for each algorithm, along with the plots it generates, can be found in the respective folder.

The 10-arm testbed is implemented separately for each algorithm in the corresponding code file. The epsilon-greedy, softmax, and UCB1 algorithms are each run for 1000 time steps with varying values of their respective parameters. The Median Elimination Algorithm is run for different values of ε and δ (the total number of time steps for MEA is determined by these hyperparameters). Running an algorithm for these 1000 time steps (or, in the case of MEA, the number of steps determined by ε and δ) constitutes one run. This is repeated for 2000 independent runs, each with a different bandit problem, and the rewards obtained over all runs are averaged to characterise the algorithm's average behaviour.
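
Schematically, one such experiment averages rewards over 2000 independent runs of 1000 steps each. The sketch below uses epsilon-greedy as an example; the constants and variable names are illustrative rather than the repository's own:

```python
import numpy as np

N_RUNS, N_STEPS, N_ARMS, EPSILON = 2000, 1000, 10, 0.1
avg_reward = np.zeros(N_STEPS)

for run in range(N_RUNS):
    q_true = np.random.normal(0.0, 1.0, N_ARMS)   # a fresh bandit problem per run
    Q = np.zeros(N_ARMS)                           # action-value estimates
    counts = np.zeros(N_ARMS)                      # pulls per arm
    for t in range(N_STEPS):
        if np.random.rand() < EPSILON:
            a = np.random.randint(N_ARMS)          # explore: uniform random arm
        else:
            a = int(np.argmax(Q))                  # exploit: current best estimate
        r = np.random.normal(q_true[a], 1.0)       # reward drawn from N(q(a), 1)
        counts[a] += 1
        Q[a] += (r - Q[a]) / counts[a]             # incremental sample-average update
        avg_reward[t] += r

avg_reward /= N_RUNS                               # average behaviour over all runs
```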

Comparison graphs have been plotted and can be found in the UCB and MEA folders.

More detailed descriptions can be found in the respective folders.

References

  • Richard S. Sutton and Andrew G. Barto, Reinforcement Learning: An Introduction, MIT Press.
