
Deep Learning for Poker:
 Inference From Patterns in an
 Adversarial Environment

Nikolai Yakovenko, PokerPoker LLC
CU Neural Networks Reading Group, Dec 2, 2015

• This work is not complicated

• Fully explaining the problem would take all available time

• So please interrupt, for clarity and with suggestions!

Convolutional Network for Poker
Our approach:
• 3D tensor representation for any poker game
• Learn from self-play
• Stronger than a rule-based heuristic
• Competitive with expert human players
• Data-driven, gradient-based approach

Poker as a Function

Inputs:
• Private cards
• Public cards
• Previous bets (public)

Outputs:
• Action policy: Bet/Raise = X%, Check/Call = Y%, Fold = Z% (X + Y + Z = 100%)
• Action values: Bet = $X, Check = $Y, Fold = $0.0

Explore & exploit using the rewards.
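To make this function view concrete, here is a minimal Python sketch of the interface described above; the type names and signature are illustrative only and do not come from the paper or source code.

from typing import NamedTuple, Sequence, Tuple

class ActionPolicy(NamedTuple):
    bet_raise: float    # X%
    check_call: float   # Y%
    fold: float         # Z%  (X + Y + Z = 100%)

class ActionValues(NamedTuple):
    bet_raise: float    # $X
    check_call: float   # $Y
    fold: float         # $0.0 by definition

def poker_function(private_cards: Sequence[str],
                   public_cards: Sequence[str],
                   previous_bets: Sequence[float]) -> Tuple[ActionPolicy, ActionValues]:
    # This is the mapping the convolutional network on later slides learns.
    raise NotImplementedError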

Poker as Turn-Based Video Game

A special case of Atari games?
• Input -> convolutional network -> action values (Call, Raise, Fold)
• Value estimate before every action
• Frame ~ turn-based poker action
• Discounted reward ~ value of the hand before the next action [how much would you sell it for?]

More Specific
• Our network plays three poker games:
  – Casino video poker
  – Heads-up (1 on 1) limit Texas Hold'em
  – Heads-up (1 on 1) limit 2-7 Triple Draw
  – Can learn other heads-up limit games
• We are working on heads-up no-limit Texas Hold'em
• Let's focus on Texas Hold'em

Texas Hold'em
• Private cards -> betting round
• Flop (public) -> betting round
• Turn -> betting round
• River -> betting round
• Showdown: best 5-card hand wins
• In the example hand, hero makes a flush and the opponent two pairs

Representation: Cards as 2D Tensors

Each set of cards is encoded as a binary suit x rank matrix (rows c, d, h, s; columns 2 through A).

Private cards [AhQs]:
x23456789TJQKA
c.............
d.............
h............1
s..........1..

After the flop, [AhQs] + [As9s6s] (flush draw, pair of aces):
x23456789TJQKA
c.............
d.............
h............1
s....1..1..1.1

Full board at showdown, [AhQs As9s6s 9c 2s] (flush):
x23456789TJQKA
c.......1.....
d.............
h............1
s1...1..1..1.1
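A minimal NumPy sketch of this encoding; the 4 x 13 suit-by-rank grid comes from the slide, while the zero-padding and centered placement inside a 17 x 17 plane are assumptions about the layout.

import numpy as np

RANKS = '23456789TJQKA'   # 13 ranks
SUITS = 'cdhs'            # clubs, diamonds, hearts, spades

def cards_to_matrix(cards, pad_to=17):
    """Encode cards like ['Ah', 'Qs'] as a binary suit x rank matrix,
    embedded in a pad_to x pad_to plane (padding scheme is an assumption)."""
    grid = np.zeros((4, 13), dtype=np.float32)
    for card in cards:
        rank, suit = card[0], card[1]
        grid[SUITS.index(suit), RANKS.index(rank)] = 1.0
    plane = np.zeros((pad_to, pad_to), dtype=np.float32)
    r0, c0 = (pad_to - 4) // 2, (pad_to - 13) // 2
    plane[r0:r0 + 4, c0:c0 + 13] = grid
    return plane

# The example from the slide: hero's private cards, then the flop
hole = cards_to_matrix(['Ah', 'Qs'])
board = cards_to_matrix(['As', '9s', '6s'])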

Convnet for Texas Hold'em Basics
• Input: private cards + public cards, no bets (6 x 17 x 17 3D tensor)
• Architecture: convolutions -> max pool -> conv -> pool -> dense layer -> 50% dropout -> output
• Outputs:
  – Win % against a random hand
  – Probability of each hand category: pair, two pairs, flush, etc. (as rectified linear units)
• 98.5% accuracy after 10 epochs (500k Monte Carlo examples)
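A hypothetical PyTorch sketch of this network; the original code (linked at the end) may use a different framework, and the filter counts, kernel sizes, and dense width here are assumptions. Only the layer ordering, the 6 x 17 x 17 input, the 50% dropout, and the two outputs come from the slide.

import torch
import torch.nn as nn

class PokerCNN(nn.Module):
    def __init__(self, in_channels=6, num_categories=9):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=3, padding=1),  # convolutions
            nn.ReLU(),
            nn.MaxPool2d(2),                                       # max pool
            nn.Conv2d(32, 64, kernel_size=3, padding=1),           # conv
            nn.ReLU(),
            nn.MaxPool2d(2),                                       # pool
        )
        self.dense = nn.Sequential(
            nn.Flatten(),
            nn.Linear(64 * 4 * 4, 256),   # dense layer (17x17 -> 8x8 -> 4x4)
            nn.ReLU(),
            nn.Dropout(0.5),              # 50% dropout layer
        )
        self.win_pct = nn.Linear(256, 1)                  # win % against a random hand
        self.hand_class = nn.Linear(256, num_categories)  # pair, two pairs, flush, ...

    def forward(self, x):
        h = self.dense(self.features(x))
        return torch.sigmoid(self.win_pct(h)), self.hand_class(h)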

What About the Adversary?
• Our network learned the Texas Hold'em probabilities.
• Can it learn to bet against an opponent?
• Three strategies:
  – Solve for equilibrium in the 2-player game [huge state space]
  – Online simulation [exponential complexity]
  – Learn a value function over a dataset: expert player games, or games generated with self-play [over-fitting, unexplored states]
• We take the data-driven approach…

Add Bets to Convnet
• Input (31 x 17 x 17 3D tensor):
  – Private cards
  – Public cards
  – Pot size as a numerical encoding
  – Position as an all-1 or all-0 tensor
  – Up to 5 all-1 or all-0 tensors for each previous betting round
• Same architecture: convolutions -> max pool -> conv -> pool -> dense layer -> 50% dropout -> output
• Output action values: Bet/Raise, Check/Call, Fold ($0.0, if allowed)
• Masked loss: single-trial $ win/loss, only for the action taken (or implied)
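A minimal PyTorch sketch of the masked loss: the network predicts a value for every action, but only the action actually taken receives a gradient, against the single-trial dollar result. Tensor names are illustrative.

import torch

def masked_value_loss(pred_values, action_taken, observed_return):
    # pred_values:     (batch, num_actions) predicted $ value per action
    # action_taken:    (batch,) long tensor, index of the action played
    # observed_return: (batch,) realized $ win/loss for that hand
    taken = pred_values.gather(1, action_taken.unsqueeze(1)).squeeze(1)
    return torch.mean((taken - observed_return) ** 2)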

That's it?
• Much better than naïve player models
• Better than a heuristic model (based on all-in value)
• Competitive with expert human players

What is everyone else doing?

CFR: Equilibrium Approximation
• Counterfactual regret minimization (CFR)
  – Dominant approach in poker research
  – University of Alberta, 2007
  – Used by all Annual Computer Poker Competition (ACPC) winners since 2007
• Optimal solutions for small 1-on-1 games
• Within 1% of unexploitable for 1-on-1 limit Texas Hold'em
• Statistical tie against world-class players
  – 80,000 hands of heads-up no limit Texas Hold'em
• Useful solutions for 3-player and 6-player games

CFR Algorithm
• Start with a balanced strategy.
• Loop over all canonical game states:
  – Compute "regret" for each action by modeling the opponent's optimal response
  – Re-balance the player strategy in proportion to "regret"
  – Keep iterating until the strategy is stable
• Group game states into "buckets" to reduce memory and runtime complexity
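For reference, the re-balancing step inside that loop is ordinary regret matching; here is a minimal NumPy version, with the bucketing and game-tree traversal that make it tractable for poker omitted.

import numpy as np

def regret_matching(cumulative_regret):
    """Play each action in proportion to its accumulated positive regret;
    fall back to uniform if no action has positive regret."""
    positive = np.maximum(cumulative_regret, 0.0)
    total = positive.sum()
    if total > 0:
        return positive / total
    return np.ones_like(positive) / len(positive)

# e.g. regrets accumulated for [bet, call, fold] at one bucketed game state
print(regret_matching(np.array([2.0, 0.5, -1.0])))   # -> [0.8, 0.2, 0.0]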

Equilibrium vs Convnet
CFR (equilibrium):
• Visits every state
• Regret for every action
• Optimal opponent response
• Converges to an unexploitable equilibrium

Convnet:
• Visits only the states in the data
• Gradient only on the actions taken
• Actual opponent response
• Over-fitting, even with 1M examples
• No explicit balance for overall equilibrium

It's not even close!

But Weaknesses Can Be Strengths
Weaknesses:
• Visits only the states in the data
• Gradient only for the actions taken
• Actual opponent response
• Over-fitting, even with 1M examples
• No explicit balance for overall equilibrium

Strengths:
• Usable model for large-state games
• Can train on human games without counterfactuals
• Can optimize strategy for a specific opponent
• Distill a network for generalization?
• Unclear how important balance is in practice…

Balance for Game Theory?
• U of Alberta's limit Hold'em CFR is within 1% of unexploitable
• 90%+ of its preflop strategies are not stochastic
• Several ACPC winners use "Pure-CFR"
  – Opponent response modeled by a single-action strategy

Explore & Exploit for Limit Hold’em • Sample tail-distribution noise for action values – ε * Gumbel – Better options?

• We also learn an action-percentage

– (bet_values) * action_percent / norm(action_percent) – 100% single-action in most cases – Generalizes more to game context than to specific cards • No intuition why

– Useful for exploration

• Similar cases from other problems??
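A minimal NumPy sketch of the ε * Gumbel exploration rule above; the ε value and the standard Gumbel scale are assumptions, and the learned action-percentage mixing is left out.

import numpy as np

def choose_action(action_values, epsilon=0.05, rng=np.random.default_rng()):
    """Perturb each estimated action value with epsilon-scaled Gumbel
    (tail-distribution) noise and play the argmax."""
    noise = rng.gumbel(loc=0.0, scale=1.0, size=len(action_values))
    return int(np.argmax(np.asarray(action_values) + epsilon * noise))

# e.g. estimated values (in bets) for [bet, call, fold]
action = choose_action([0.40, 0.35, 0.0])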

Observations from Model Evolution
• The first iteration of the learned model bluffs like crazy
• Each re-training beats the previous version, but is sometimes weaker against older models
  – Over-fitting, or forgetting?
• Difficulty learning hard truths about extreme cases
  – Cannot possibly win, cannot possibly lose
  – We are fixing this with a side output reinforcing Hold'em basics
• Extreme rollout variance in the single-trial training data
  – Over-fitting after ~10 epochs, even with a 1M-example dataset
  – Prevents learning higher-order patterns?

Network Improvements
• Training with cards in canonical form
  – Improves generalization
  – ≈0.15 bets/hand over the previous model
• Training with "1% leaky" rectified linear units
  – Relieves saturation in negative network values
  – ≈0.20 bets/hand over the previous model
• Gains are not cumulative
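Two small sketches of these improvements: a first-appearance suit relabeling as one plausible "canonical form" (the paper's exact canonicalization rule may differ), and PyTorch's LeakyReLU standing in for the "1% leaky" rectified linear units.

import torch.nn as nn

def canonicalize_suits(cards):
    """Relabel suits in order of first appearance so strategically identical
    hands (e.g. AhQh and AsQs) map to the same tensor. Illustrative rule only."""
    order, mapping, out = 'cdhs', {}, []
    for card in cards:
        rank, suit = card[0], card[1]
        if suit not in mapping:
            mapping[suit] = order[len(mapping)]
        out.append(rank + mapping[suit])
    return out

print(canonicalize_suits(['Ah', 'Qh']))   # ['Ac', 'Qc'], same as for ['As', 'Qs']

# "1% leaky" rectified linear unit: negative inputs are scaled by 0.01
# instead of clipped to zero, so negative activations do not saturate.
leaky_relu = nn.LeakyReLU(negative_slope=0.01)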

TODO: Improvements
• Things we are not doing…
  – Input normalization
  – Disk-based loading for 10M+ data points per epoch
  – Full automation for batched self-play
  – Database sampling for experience replay
• Reinforcement learning
  – Bet sequences are short, but RL would still help
  – "Optimism in the face of uncertainty" is a real problem
• RNN for memory…

Memory Units Change the Game?
• If the opponent called preflop, his hand is in the blue region
• If he raised, it is in the green region
• Use LSTM/GRU memory units to explicitly train for this information?
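A sketch of how GRU memory units could summarize the betting so far within a hand; the feature layout and sizes are assumptions, not the actual architecture.

import torch
import torch.nn as nn

class BetSequenceMemory(nn.Module):
    """Run a GRU over per-action feature vectors within a hand, so the network
    can carry forward range information like 'opponent only called preflop'."""
    def __init__(self, action_features=16, hidden=64):
        super().__init__()
        self.gru = nn.GRU(action_features, hidden, batch_first=True)

    def forward(self, bet_sequence):
        # bet_sequence: (batch, num_actions_so_far, action_features)
        _, h_n = self.gru(bet_sequence)
        return h_n.squeeze(0)   # one summary vector per hand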

Next: No Limit Texas Hold’em

Take It to the Limit
• The vast majority of tournament poker games are no limit Texas Hold'em
• With limit Hold'em "weakly solved," the 2016 ACPC is no limit Hold'em only
• Despite the Carnegie Mellon team's success, no limit Hold'em is not close to a perfect solution

No Limit Hold'em: Variable Betting
• Example decision from a poker client: Fold, Call $200, or Raise $650
• The raise can be any amount from the minimum ($400) up to all-in ($20,000)

From Binary to Continuous Control
[Charts: estimated action values for Fold / Call / Raise in limit Hold'em, versus high / low / average action values for Fold / Raise 3x / All-in in no limit Hold'em, where bet sizing is continuous]

CFR for No Limit Hold'em
• "Buttons" for several fixed bet sizes
  – Fixed at a % of the chips in the pot
  – e.g. Call, Raise 2x, Raise 5x, Raise 10x, Raise All-in, Fold
• Linear (or log) interpolation between known states
• Best-response rules assume future bets increase in size, culminating in an all-in bet
• Without such rules, response tree traversal is impossible

CFR for NLH: Observations
• Live demo from the 2012-2013 ACPC medal winner NeoPoker: http://www.neopokerbot.com/play
  – It was easy to find a "3x bet" strategy that allowed me to win most hands
  – This does not win a lot, but it requires no poker knowledge to beat the "approximate equilibrium"
  – Effective at heads-up NLH, 3-player NLH, and 6-max NLH
• 45 hands in 2.5 minutes; I raised 100% of hands
• A human would push back…

Next Generation CFR
• 2014 ACPC NLH winner Slumbot, based on CFR
• Much harder to beat!
• Better than most human players (including me)
  – 2014 Slumbot: +0.12 bets/hand over 1,000+ hands
• Still easy to win 80%+ of hands preflop with well-sized aggressive betting
• Why?
  – A game-theory equilibrium does not adjust to the opponent
  – Implicit assumptions in opponent response modeling

CFR is an Arms Race
• Slumbot specs (from the 2013 AAAI paper)
  – 11 bet-size options for the first bet
    • Pot * {0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 4.0, 8.0, 15.0, 25.0, 50.0}
  – 8, 3, and 1 bet-size options for subsequent bets
  – 5.7 billion information sets
  – 14.5 billion information-set/action pairs
  – Each state sampled with at least 50 run-outs
  – Precise stochastic strategies for each information set
• Exclusively plays heads-up NLH, resetting to 200 bets after every hand
• The 2016 ACPC competition is increasing the agent disk allotment to 200 GB…
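For illustration, the first-bet pot fractions listed above mapped to concrete chip amounts; the clipping-to-legal-range helper is hypothetical and not Slumbot's actual code.

# Slumbot's first-bet "buttons", as fractions of the pot (2013 AAAI specs above)
FIRST_BET_POT_FRACTIONS = [0.25, 0.5, 0.75, 1.0, 1.5, 2.0, 4.0, 8.0, 15.0, 25.0, 50.0]

def button_bet_sizes(pot, stack, min_raise):
    """Map the fixed pot-fraction buttons to chip amounts, clipped to the
    legal range between the minimum raise and all-in (illustrative only)."""
    return sorted({min(max(f * pot, min_raise), stack) for f in FIRST_BET_POT_FRACTIONS})

print(button_bet_sizes(pot=100, stack=20000, min_raise=200))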

Another Way: Multi-Armed Bandit?
• A Beta distribution for each bet bucket (e.g. Fold, Raise 3x, All-in)
• How to update it with a convolutional network?
• Hack:
  – SGD update for the Beta mean
  – Offline process or a global constant for σ
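A minimal sketch of this "hack", assuming the dollar values are normalized into [0, 1] so a Beta distribution applies; the bucket names, learning rate, and the fixed concentration standing in for σ are assumptions.

import numpy as np

class BetBucketBandit:
    """Beta distribution per bet-size bucket, Thompson-style sampling to pick
    a bucket, and an SGD-style update of the mean only, with the spread held
    as a global constant."""

    def __init__(self, buckets, lr=0.05, concentration=20.0):
        self.buckets = buckets                    # e.g. ['fold', 'raise_3x', 'allin']
        self.mean = np.full(len(buckets), 0.5)    # Beta means, values scaled to [0, 1]
        self.k = concentration                    # fixed stand-in for sigma
        self.lr = lr

    def choose(self, rng=np.random.default_rng()):
        alpha = self.mean * self.k
        beta = (1.0 - self.mean) * self.k
        return int(np.argmax(rng.beta(alpha, beta)))   # one draw per bucket

    def update(self, bucket, normalized_value):
        # SGD step of the Beta mean toward the observed (normalized) $ result
        self.mean[bucket] += self.lr * (normalized_value - self.mean[bucket])
        self.mean[bucket] = np.clip(self.mean[bucket], 1e-3, 1 - 1e-3)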

Using Convnet Output for No Limit Betting
• Fold_value = 0.0
• Call_value = network output
• Bet_value = network output
• Can the network estimate a confidence?
• If (Bet):
  – Sample the bet-bucket distributions, OR
  – stats.beta.fit(buckets)
  – Fit a multinomial distribution to the point estimates?
  – MAP estimator?
  – Ockham's razor?
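The stats.beta.fit option above, sketched with SciPy; the per-bucket estimates are made-up numbers, normalized into (0, 1) for illustration.

from scipy import stats

# Network point estimates for a few bet buckets, rescaled into (0, 1)
bucket_estimates = [0.15, 0.48, 0.62, 0.55]

# Fit a Beta distribution to the point estimates (loc/scale fixed to [0, 1])
# and sample a bet value from it instead of always taking the argmax.
a, b, loc, scale = stats.beta.fit(bucket_estimates, floc=0.0, fscale=1.0)
sampled_value = stats.beta.rvs(a, b, loc=loc, scale=scale)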

Advantages of Betting with a ConvNet
• Forced to generalize from any size of dataset
  – CFR requires a full traversal, at least once
  – CFR requires defining a game-state generalization
• The model can be trained on actual hands
  – Such as last year's ACPC competition
  – Opponent hand histories are not useful for CFR
• Tune-able explore & exploit
• Adaptable to RL with continuous control
  – Learn optimal bet sizes directly

Build ConvNet, then Add Memory
• Intra-hand memory
  – Remember the context of previous bets
  – Side output [win % vs opponent] for visualization
• Inter-hand memory
  – Exploit predictable opponents
  – "Coach" systems for specific opponents
  – Focus on strategies that actually happen

This is a work in progress… ACPC no limit Hold’em: code due January 2016

Thank you!

Questions?

Citations, Links
• Poker-CNN paper, to appear in AAAI 2016: http://arxiv.org/abs/1509.06731
• Source code (needs a bit of cleanup): https://github.com/moscow25/deep_draw
• Q-Learning for Atari games (DeepMind): http://www.nature.com/nature/journal/v518/n7540/full/nature14236.html
• Counterfactual regret minimization (CFR)
  – Original paper (NIPS 2007): http://webdocs.cs.ualberta.ca/~games/poker/publications/AAMAS13-abstraction.pdf
  – Heads-up Limit Hold'em is Solved (within 1%): https://www.sciencemag.org/content/347/6218/145
  – Heads-up No Limit Hold'em "statistical tie" vs professional players: https://www.cs.cmu.edu/brains-vs-ai
• CFR-based AI agents:
  – NeoPoker, 2012-2013 ACPC medals: http://www.neopokerbot.com/play
  – Slumbot, 2014 ACPC winner (AAAI paper): https://www.aaai.org/ocs/index.php/WS/AAAIW13/paper/viewFile/7044/6479