Using TensorFlow 2.0 / TF Agents
Using PPO
Here, I create a BlackJack game (the environment), allow the agent to play through many decks, and let the agent optimize its actions (with PPO).
The environment is set up as follows (a minimal code sketch follows the note below):
- One episode is playing through one deck of 52 cards
- The observations are the player's cards, the dealer's visible card, and every card previously shown (because I was hoping the agent would learn to count cards).
- Possible actions are 0) Stand, 1) Hit, 2) DoubleDown
- Rewards are +1 for a win, -1 for a loss
- If DoubleDown is chosen, the player receives a single card and the reward is doubled.
Note that because the player acts before the dealer, and automatically loses if he goes over 21 without the dealer having to draw, the dealer is greatly favored in BlackJack. In the actual game, the player can only DoubleDown when holding exactly 2 cards, but in this simulated environment the agent can DoubleDown on any turn.
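Here is a minimal sketch of how such an environment might be written against the TF Agents PyEnvironment API. The class and method names are my own, the episode is simplified to a single hand (the actual environment plays through a full deck and exposes the cards already seen), Aces are counted only as 1, and ties are scored 0:

```python
import random

import numpy as np
from tf_agents.environments import py_environment
from tf_agents.specs import array_spec
from tf_agents.trajectories import time_step as ts

STAND, HIT, DOUBLE_DOWN = 0, 1, 2


class BlackJackEnv(py_environment.PyEnvironment):

    def __init__(self):
        super().__init__()
        # Actions: 0) Stand 1) Hit 2) DoubleDown
        self._action_spec = array_spec.BoundedArraySpec(
            shape=(), dtype=np.int32, minimum=0, maximum=2, name='action')
        # Observation: [player total, dealer up-card]; the real environment
        # also includes the counts of every card seen so far in the deck.
        self._observation_spec = array_spec.BoundedArraySpec(
            shape=(2,), dtype=np.float32, minimum=0.0, name='observation')
        self._reset_hand()

    def action_spec(self):
        return self._action_spec

    def observation_spec(self):
        return self._observation_spec

    def _draw(self):
        # Card values 1..10; tens (10/J/Q/K) are four times as likely.
        return min(random.randint(1, 13), 10)

    def _reset_hand(self):
        self._player = self._draw() + self._draw()
        self._dealer_up = self._draw()
        self._done = False

    def _obs(self):
        return np.array([self._player, self._dealer_up], dtype=np.float32)

    def _reset(self):
        self._reset_hand()
        return ts.restart(self._obs())

    def _step(self, action):
        if self._done:
            return self.reset()
        multiplier = 2.0 if action == DOUBLE_DOWN else 1.0
        if action in (HIT, DOUBLE_DOWN):
            self._player += self._draw()
            if self._player > 21:  # player busts before the dealer ever draws
                self._done = True
                return ts.termination(self._obs(), reward=-multiplier)
            if action == HIT:
                return ts.transition(self._obs(), reward=0.0)
        # Stand (or a successful DoubleDown): the dealer now draws to 17+.
        dealer = self._dealer_up + self._draw()
        while dealer < 17:
            dealer += self._draw()
        self._done = True
        if dealer > 21 or self._player > dealer:
            reward = 1.0
        elif self._player < dealer:
            reward = -1.0
        else:
            reward = 0.0  # push; the +1/-1 scheme above doesn't specify ties
        return ts.termination(self._obs(), reward=reward * multiplier)
```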
I trained the agent using the TF Agents PPO implementation. The learned action network follows how an intelligent human player would play. The learned value network has some oddities, but overall it shows that the agent expects to lose.
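For context, this is roughly how a TF Agents PPO agent can be wired up for this environment; the layer sizes, learning rate, and number of epochs below are illustrative placeholders rather than the exact values used, and the trajectory-collection driver, replay buffer, and training loop are omitted:

```python
import tensorflow as tf
from tf_agents.agents.ppo import ppo_agent
from tf_agents.environments import tf_py_environment
from tf_agents.networks import actor_distribution_network, value_network

# Wrap the Python environment so it can be driven from TensorFlow.
train_env = tf_py_environment.TFPyEnvironment(BlackJackEnv())

# Actor (action) network: outputs a categorical distribution over the 3 actions.
actor_net = actor_distribution_network.ActorDistributionNetwork(
    train_env.observation_spec(),
    train_env.action_spec(),
    fc_layer_params=(64, 64))

# Value network: estimates the expected remaining reward from an observation.
value_net = value_network.ValueNetwork(
    train_env.observation_spec(),
    fc_layer_params=(64, 64))

agent = ppo_agent.PPOAgent(
    train_env.time_step_spec(),
    train_env.action_spec(),
    optimizer=tf.compat.v1.train.AdamOptimizer(learning_rate=1e-3),
    actor_net=actor_net,
    value_net=value_net,
    num_epochs=10)
agent.initialize()

# Training then alternates between collecting whole episodes with
# agent.collect_policy and calling agent.train(experience) on them.
```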
For example, when the player has 19 and the dealer shows a 2, the action network suggests action 0 (Stand) with high confidence.
[19] [2] [[0.8581143 0.09495622 0.04692944]]
When the player has 15 and the dealer shows a 10, the agent's best move is still to Stand, but with low confidence.
[15] [10] [[0.4792729 0.2612255 0.25950167]]
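For anyone curious where these numbers come from, one way to read the action probabilities out of the trained actor network is to call it directly on a batched observation. The helper below is my own and assumes the simplified two-element observation from the environment sketch above:

```python
import tensorflow as tf

def action_probs(player_total, dealer_card, actor_net):
    # Batch of one observation: [player total, dealer up-card].
    obs = tf.constant([[player_total, dealer_card]], dtype=tf.float32)
    # ActorDistributionNetwork returns (distribution, network_state).
    distribution, _ = actor_net(obs, step_type=None, training=False)
    # Categorical probabilities over [Stand, Hit, DoubleDown].
    return distribution.probs_parameter().numpy()

print(action_probs(19, 2, actor_net))  # e.g. roughly [[0.86 0.09 0.05]] after training
```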
Observing the value network outputs, the agent expects to lose roughly 2 hands by the end of the deck, and as the deck plays out, it values its situation less negatively. Aside from this overall upward trend, the agent values some observations more highly than others.
When the player has 20 and the dealer shows a 2, the state is valued positively.
[20] [2] [[0.6782062]]
When the player has 19 but the dealer shows an Ace, the state is valued very negatively. Even though 19 is a high total, the Ace's flexibility to count as either 1 or 11 in BlackJack makes it strong for the dealer. This is worsened by the fact that ten-valued cards are the most numerous in the deck.
[19] [1] [[-1.744565]]
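The value estimates can be read out of the value network the same way (again, this helper is my own and assumes the simplified observation from the sketch above):

```python
import tensorflow as tf

def state_value(player_total, dealer_card, value_net):
    obs = tf.constant([[player_total, dealer_card]], dtype=tf.float32)
    # ValueNetwork returns (value_estimate, network_state).
    value, _ = value_net(obs, step_type=None, training=False)
    return value.numpy()

print(state_value(20, 2, value_net))  # valued positively, e.g. around 0.68
print(state_value(19, 1, value_net))  # valued very negatively, e.g. around -1.7
```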
Overall, I was disappointed that the agent could not learn a way to consistently beat the dealer, even with card counting and the relaxed DoubleDown rule. However, I did not investigate whether it is theoretically possible to beat the dealer 1-on-1.
Looking at the results another way: if I had no knowledge of the actual mathematical odds, I might reason that because the environment is simple, and because the PPO agent is good at finding an optimal set of actions in simple environments yet still loses, it is probably impossible to beat the dealer in this setup.