ReadMe.rtf


In this project, I implemented the reinforcement learning algorithm called REINFORCE. This algorithm belongs to the general class of policy gradient methods for reinforcement learning, meaning it works directly on the policy and adjusts it as the agent learns more about the environment.

In this specific algorithm, the agent goes through many episodes. In each episode, it first does a Monte Carlo rollout, meaning it acts in the environment according to its parameters (the weights of the neural network, which outputs the probabilities of the two possible actions the agent can take) until it either “dies” (in CartPole this means the cart going off position or the pole going off angle) or “survives” for 200 time steps. At each time step, the state, action, and reward are recorded.
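The rollout described above can be sketched roughly as follows. This is only an illustration, not the project's actual code: `ToyEnv` is a hypothetical stand-in (it is not CartPole), `uniform_policy` is a placeholder for the neural network, and the `reset()`/`step()` interface returning `(next_state, reward, done)` is an assumption.

```python
import random


class ToyEnv:
    """Hypothetical stand-in environment (NOT CartPole).

    Each step gives a reward of 1.0 and ends the episode with
    probability `fail_prob`.
    """

    def __init__(self, fail_prob=0.1, seed=0):
        self.rng = random.Random(seed)
        self.fail_prob = fail_prob

    def reset(self):
        return 0.0  # dummy initial state

    def step(self, action):
        done = self.rng.random() < self.fail_prob
        next_state = self.rng.uniform(-1.0, 1.0)
        return next_state, 1.0, done


def uniform_policy(state):
    """Placeholder policy: equal probability for the two actions."""
    return [0.5, 0.5]


def rollout(env, policy, max_steps=200, rng=None):
    """Act until the episode ends or max_steps is reached.

    Returns the recorded states, actions, and rewards.
    """
    if rng is None:
        rng = random.Random(0)
    states, actions, rewards = [], [], []
    state = env.reset()
    for _ in range(max_steps):
        probs = policy(state)
        # Sample an action according to the policy's probabilities.
        action = rng.choices(range(len(probs)), weights=probs)[0]
        next_state, reward, done = env.step(action)
        states.append(state)
        actions.append(action)
        rewards.append(reward)
        state = next_state
        if done:
            break
    return states, actions, rewards
```

For example, `rollout(ToyEnv(fail_prob=0.0), uniform_policy)` runs for the full 200 steps, which corresponds to the “survives” case.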

After the rollout, the algorithm looks at each recorded state, action, and reward, and it calculates, based on the reward the action got, how to adjust the neural network's weights (which represent the policy) so that the probability of choosing that action either increases or decreases. This is repeated until the agent consistently gets a score of 200 for 100 episodes (which is considered a solve in the CartPole environment).
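As a rough sketch of that update signal: a common way to write REINFORCE weights the log-probability of each chosen action by the discounted sum of rewards that followed it. The discount factor `gamma` and both function names below are assumptions for illustration; this README does not say whether discounting or a baseline was used.

```python
import math


def discounted_returns(rewards, gamma=0.99):
    """Return G_t = r_t + gamma * r_{t+1} + ... for every time step t."""
    returns = []
    running = 0.0
    # Walk backwards so each return reuses the one after it.
    for r in reversed(rewards):
        running = r + gamma * running
        returns.append(running)
    returns.reverse()
    return returns


def reinforce_loss(chosen_action_probs, returns):
    """Negative sum of log-probability times return.

    chosen_action_probs[t] is the probability the policy gave to the
    action actually taken at step t. Minimizing this loss raises the
    probability of actions followed by large returns.
    """
    return -sum(math.log(p) * g for p, g in zip(chosen_action_probs, returns))
```

In a PyTorch version, the log-probabilities would be tensors attached to the network, so calling `backward()` on this loss would give the gradient used to update the weights.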

I used PyTorch to implement the policy neural network. The biggest challenge I faced was that I made a dumb mistake: at first I only allowed the agent to pick the action with the highest probability. This obviously meant the agent did not explore much of its environment! After fixing this, the agent was able to solve the environment after about 2500 episodes, so this is certainly not the best implementation, but it's a great and fun start!
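To illustrate the mistake, here is a small sketch comparing always picking the most probable action with sampling from the probabilities. The README does not show the actual fix, so sampling is an assumption here (it is the usual approach), and the function names and the example probabilities are made up. In PyTorch, the sampling step would typically be `torch.distributions.Categorical(probs).sample()`.

```python
import random


def greedy_action(probs):
    """Always pick the action with the highest probability (no exploration)."""
    return max(range(len(probs)), key=lambda a: probs[a])


def sampled_action(probs, rng):
    """Pick an action at random, weighted by the policy's probabilities."""
    return rng.choices(range(len(probs)), weights=probs)[0]


rng = random.Random(0)
probs = [0.7, 0.3]  # made-up policy output for one state

greedy = [greedy_action(probs) for _ in range(1000)]
sampled = [sampled_action(probs, rng) for _ in range(1000)]

# Greedy never tries action 1; sampling tries it roughly 30% of the time.
print(sum(greedy), sum(sampled))
```

With greedy selection the second action is never tried in this state, so the agent never finds out whether it would have led to a better outcome.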