Implementation of various multi-armed bandit algorithms in Python.
The following algorithms are implemented on a 10-arm testbed, as described in *Reinforcement Learning: An Introduction* by Sutton and Barto (a short sketch of the first three selection rules follows the list):
- Epsilon-Greedy Algorithm
- Softmax Algorithm
- Upper Confidence Bound (UCB1)
- Median Elimination Algorithm (MEA)
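A minimal sketch of the epsilon-greedy, softmax, and UCB1 selection rules, assuming sample-average value estimates `Q` and per-arm pull counts `counts` (variable names and the exploration constant `c` are illustrative, not taken from the repository code):

```python
import numpy as np

def epsilon_greedy(Q, epsilon):
    """With probability epsilon pick a random arm, otherwise the greedy arm."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(Q))
    return int(np.argmax(Q))

def softmax(Q, temperature):
    """Sample an arm from a Boltzmann (softmax) distribution over the estimates."""
    prefs = np.exp(Q / temperature - np.max(Q / temperature))  # shift for numerical stability
    return np.random.choice(len(Q), p=prefs / prefs.sum())

def ucb1(Q, counts, t, c=2.0):
    """Pick the arm maximising Q plus an exploration bonus; untried arms go first."""
    if np.any(counts == 0):
        return int(np.argmin(counts))
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / counts)))
```

MEA proceeds by repeatedly halving the set of candidate arms rather than by a per-step selection rule, so it is not sketched here.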
- numpy (version 1.13.3 used)
- matplotlib (version 1.5.1 used)
A 10-arm bandit testbed was generated as described in the section "The 10-armed Testbed" of the book. The testbed contains 2000 bandit problems with 10 arms each; the true action values q∗(a) for each arm of each bandit problem are sampled from a normal distribution N(0, 1). When a learning algorithm is applied to one of these bandit problems, at each time step t it selects an action At and receives a reward Rt sampled from N(q∗(At), 1). The performance of each implemented algorithm is measured as the "Average Reward" or the "% Optimal Actions picked" at each time step.
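A minimal sketch of how such a testbed could be generated and sampled with numpy (the array layout and the `reward` helper are illustrative assumptions, not the repository's actual code):

```python
import numpy as np

n_problems, n_arms = 2000, 10

# True action values q*(a) for every bandit problem, drawn from N(0, 1).
q_star = np.random.normal(0.0, 1.0, size=(n_problems, n_arms))

def reward(problem, action):
    """Reward for taking `action` on bandit `problem`, sampled from N(q*(action), 1)."""
    return np.random.normal(q_star[problem, action], 1.0)
```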
The code for each algorithm and the corresponding plots can be found in the respective folders.
The 10-arm testbed is implemented separately for each algorithm in the corresponding code file. The epsilon-greedy, softmax, and UCB1 algorithms are each run for 1000 time steps with varying values of their respective parameters. The Median Elimination Algorithm is run for different values of ε and δ (the total number of time steps for MEA is determined by these hyperparameters). Running an algorithm for these 1000 time steps (or, for MEA, for the number of steps determined by ε and δ) constitutes one run. This is repeated for 2000 independent runs, each with a different bandit problem, and the rewards obtained over all runs are averaged to give an idea of the algorithm's average behaviour.
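For illustration, a sketch of one such experiment using epsilon-greedy with sample-average updates (the parameter values and variable names here are assumptions for the example, not the repository's actual settings):

```python
import numpy as np

n_runs, n_steps, n_arms = 2000, 1000, 10
epsilon = 0.1
avg_reward = np.zeros(n_steps)

for run in range(n_runs):
    q_star = np.random.normal(0.0, 1.0, n_arms)   # a fresh bandit problem for each run
    Q = np.zeros(n_arms)                          # value estimates
    N = np.zeros(n_arms)                          # action counts
    for t in range(n_steps):
        if np.random.rand() < epsilon:
            a = np.random.randint(n_arms)         # explore
        else:
            a = int(np.argmax(Q))                 # exploit
        r = np.random.normal(q_star[a], 1.0)      # reward from N(q*(a), 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                 # sample-average update
        avg_reward[t] += r

avg_reward /= n_runs                              # average behaviour over all runs
```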
Comparison graphs have been plotted and can be found in the UCB and MEA folders.
More descriptions can be found in the respective folders.