Implementation of various multi-armed bandit algorithms in Python.
The following algorithms are implemented on a 10-arm testbed, as described in *Reinforcement Learning: An Introduction* by Sutton and Barto (a short sketch of the first three selection rules follows the list):
- Epsilon-Greedy Algorithm
- Softmax Algorithm
- Upper Confidence Bound (UCB1)
- Median Elimination Algorithm (MEA)
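A minimal sketch of the epsilon-greedy, softmax, and UCB1 selection rules, assuming sample-average value estimates `Q` and per-arm pull counts `counts` (variable names and the exploration constant `c` are illustrative, not taken from the repository code):

```python
import numpy as np

def epsilon_greedy(Q, epsilon):
    """With probability epsilon pick a random arm, otherwise the greedy arm."""
    if np.random.rand() < epsilon:
        return np.random.randint(len(Q))
    return int(np.argmax(Q))

def softmax(Q, temperature):
    """Sample an arm from a Boltzmann (softmax) distribution over the estimates."""
    prefs = np.exp(Q / temperature - np.max(Q / temperature))  # shift for numerical stability
    return np.random.choice(len(Q), p=prefs / prefs.sum())

def ucb1(Q, counts, t, c=2.0):
    """Pick the arm maximising Q plus an exploration bonus; untried arms go first."""
    if np.any(counts == 0):
        return int(np.argmin(counts))
    return int(np.argmax(Q + c * np.sqrt(np.log(t) / counts)))
```

MEA proceeds by repeatedly halving the set of candidate arms rather than by a per-step selection rule, so it is not sketched here.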
- numpy (version 1.13.3 used)
- matplotlib (version 1.5.1 used)
A 10-arm bandit testbed was generated as described in the section "The 10-armed Testbed" of the book. The testbed contains 2000 bandit problems with 10 arms each; the true action values q∗(a) for each arm of each bandit problem are sampled from a normal distribution N(0, 1). When a learning algorithm is applied to one of these bandit problems, at each time step t it selects an action At and receives a reward Rt sampled from N(q∗(At), 1). The performance of each implemented algorithm is measured as the "Average Reward" or the "% Optimal Actions picked" at each time step.
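A minimal sketch of how such a testbed could be generated and sampled with numpy (the array layout and the `reward` helper are illustrative assumptions, not the repository's actual code):

```python
import numpy as np

n_problems, n_arms = 2000, 10

# True action values q*(a) for every bandit problem, drawn from N(0, 1).
q_star = np.random.normal(0.0, 1.0, size=(n_problems, n_arms))

def reward(problem, action):
    """Reward for taking `action` on bandit `problem`, sampled from N(q*(action), 1)."""
    return np.random.normal(q_star[problem, action], 1.0)
```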
The code for each algorithm and the corresponding plots can be found in the respective folders.
The 10-arm testbed is implemented separately for each algorithm in the corresponding code file. The epsilon-greedy, softmax, and UCB1 algorithms are each run for 1000 time steps with varying values of their respective parameters. The Median Elimination Algorithm is run for different values of ε and δ (the total number of time steps for MEA is determined by these hyperparameters). Running an algorithm for these 1000 time steps (or, for MEA, for the number of steps determined by ε and δ) constitutes one run. This is repeated for 2000 independent runs, each with a different bandit problem, and the rewards obtained over all runs are averaged to give an idea of the algorithm's average behaviour.
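For illustration, a sketch of one such experiment using epsilon-greedy with sample-average updates (the parameter values and variable names here are assumptions for the example, not the repository's actual settings):

```python
import numpy as np

n_runs, n_steps, n_arms = 2000, 1000, 10
epsilon = 0.1
avg_reward = np.zeros(n_steps)

for run in range(n_runs):
    q_star = np.random.normal(0.0, 1.0, n_arms)   # a fresh bandit problem for each run
    Q = np.zeros(n_arms)                          # value estimates
    N = np.zeros(n_arms)                          # action counts
    for t in range(n_steps):
        if np.random.rand() < epsilon:
            a = np.random.randint(n_arms)         # explore
        else:
            a = int(np.argmax(Q))                 # exploit
        r = np.random.normal(q_star[a], 1.0)      # reward from N(q*(a), 1)
        N[a] += 1
        Q[a] += (r - Q[a]) / N[a]                 # sample-average update
        avg_reward[t] += r

avg_reward /= n_runs                              # average behaviour over all runs
```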
Comparison graphs have been plotted and can be found in the UCB and MEA folders.
More descriptions can be found in the respective folders.