January 26, 2025
Machine learning involves interpolation. Given data $(x_i, y_i)$ we want to find a function $f$ with $f(x_i) = y_i$ for all $i$. If the $x_i$ and $y_i$ are real numbers then there is a well-developed theory for this. I can highly recommend learning about basis splines.
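As a minimal sketch of what that looks like in code (assuming numpy and scipy are available; the data points are made up):

```python
# Cubic B-spline interpolation through made-up data points.
import numpy as np
from scipy.interpolate import make_interp_spline

x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])   # the x_i
y = np.sin(x)                              # the y_i

f = make_interp_spline(x, y, k=3)          # cubic basis-spline interpolant

assert np.allclose(f(x), y)                # f(x_i) = y_i at every data point
print(f(2.5), np.sin(2.5))                 # close, but not equal, off the data
```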
If the $x_i$ are each several megabytes of pixels and the $y_i$ are only allowed the values true or false, the problem becomes much more complicated. Machines identifying pictures that look like a blur to the human eye as stop signs is a well-known example indicating the difficulty.
Moore’s law has covered up the fact that there has been little recent advancement in the mathematical theory underpinning machine learning. Even people who never learned proper programming skills can now use a language as inefficient as Python to produce a plethora of results that are not-obviously-wrong. Maybe one day AI will help us sort the wheat from the chaff, but given the long history of false claims about its potential it is safe to say that will not be anytime soon.
The human brain has evolved over the centuries to make up stories. Michael Gazzaniga’s research starting in the 1970s on people who had their corpus callosum severed brought to light that there is not merely a right hemisphere/left hemisphere phenomenon; there are thousands of independent agents simultaneously vying for the first-place prize of human consciousness. The tricky little bastards will make up any story to get that. The human brain definitely did not evolve to evaluate the validity of all the stories 7.8 billion other people’s brains cook up.
The success of AlphaZero in beating human experts at go, shogi, and chess as if they were small babies has been a shining achievement of machine learning, but let’s not forget Arthur Samuel’s checkers program from 1959. Lee Se-dol, Yoshiharu Habu, and Magnus Carlsen can take comfort in the fact that the checkers world champion at the time had his ego slowly but relentlessly crushed as he realized a computer program written by someone who admitted to knowing little about checkers could beat him game after game. Samuel used reinforcement learning and alpha-beta pruning; AlphaZero also learns by reinforcement, with Monte Carlo tree search in place of alpha-beta. Of course Samuel did not have 5,000 TPUs and four 44-core CPUs at his disposal to plumb much greater plies.
Reinforcement learning originated from the multi-armed bandit problem. One-armed bandit is slang for a slot machine. Given a collection of slot machines, how should you play to maximize your take? Strategies involve a tradeoff between exploring the returns of each machine and exploiting the machines that seem to have the highest payoff. Some wag at RAND suggested dropping leaflets over Germany during WWII describing the problem in order to divert their best minds.
There is still no canonical solution to this important and practical problem. If you are running a research group, how do you allocate your budget to various teams? The composition of teams can change over time, and you can use pep talks to inspire people, but mathematics is limited in its ability to capture the messy reality of the world we live in.
The mathematical model used for reinforcement learning is a Markov decision process. An MDP is defined by states $s \in \mathcal{S}$, actions $a \in \mathcal{A}$, rewards $r \in \mathcal{R}$, and transition probabilities $p(s', r \mid s, a)$: the probability of moving to state $s'$ and receiving reward $r$ given you are in state $s$ and take action $a$ at time $t$. Some models specify $\mathcal{A}(s)$, for $s \in \mathcal{S}$, the set of possible actions when in state $s$.
At time $t$ in state $s$ an action $a$ results in a new state $s'$ and reward $r$ at time $t + 1$ with probability $p(s', r \mid s, a)$. The process is called Markov because the probability does not depend on what happened before time $t$. There is no reason it couldn’t, but that would make it harder to find your keys in a dark alley.
In the multi-armed bandit problem there is only one state: the set of machines to play. The action is which machine to play. This results in reward $r$ with probability $p(r \mid a)$ after you drop a dollar in machine $a$. The next round still has the same set of machines available for play.
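A sketch of how those probabilities might be tabulated for a two-machine bandit; the machines, payouts, and the helper name are invented for illustration.

```python
# A two-machine bandit as a one-state MDP: p[(s, a)] is a list of
# (probability, next_state, reward) triples.  Payouts are made up.
S0 = "the machines"          # the single state
p = {
    (S0, "machine 1"): [(0.45, S0, 2.0), (0.55, S0, 0.0)],
    (S0, "machine 2"): [(0.10, S0, 8.0), (0.90, S0, 0.0)],
}

def expected_reward(s, a):
    """sum over (s', r) of p(s', r | s, a) * r"""
    return sum(prob * r for prob, _, r in p[(s, a)])

print({a: expected_reward(S0, a) for (_, a) in p})
```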
A policy $\pi(a \mid s)$ specifies the probability of taking action $a$ given you are in state $s$. This results in a sequence of random variables describing possible future states and rewards: $A_t$, $R_{t+1}$, $S_{t+1}$, $A_{t+1}$, $\ldots$ given $S_t = s$. A more clever approach might look at the history of the results of prior actions, but we are stumbling under Markov’s street light for now.
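A sketch of sampling such a trajectory under a fixed stochastic policy; the bandit dynamics and the 70/30 policy are invented.

```python
# Sample a trajectory A_t, R_{t+1}, S_{t+1}, ... under a policy pi(a | s).
import random

def step(s, a):
    """Sample (next_state, reward); the payouts are made up."""
    if a == "machine 1":
        return s, 2.0 if random.random() < 0.45 else 0.0
    return s, 8.0 if random.random() < 0.10 else 0.0

def pi(s):
    """A fixed stochastic policy: play machine 1 with probability 0.7."""
    return "machine 1" if random.random() < 0.7 else "machine 2"

s = "the machines"
for t in range(5):
    a = pi(s)
    s, r = step(s, a)
    print(t, a, r)
```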
Reinforcement learning is the study of how to find the optimal policy, for some definition of optimal.
A gain function $G_t$ is any function of the future rewards $R_{t+1}, R_{t+2}, \ldots$. Common choices are the average of future rewards over a horizon $T$, $G_t = (R_{t+1} + \cdots + R_{t+T})/T$, and exponential discounting of future rewards, $G_t = \sum_{k \ge 0} \gamma^k R_{t+k+1}$, where $0 \le \gamma < 1$ is the discount factor.
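A few lines computing the exponentially discounted gain for a finite run of rewards (a truncation if the true sequence is infinite):

```python
# Exponentially discounted gain G_t = sum_k gamma^k R_{t+k+1},
# accumulated from the back of a finite reward sequence.
def discounted_gain(rewards, gamma=0.9):
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

print(discounted_gain([1.0, 0.0, 2.0]))   # 1 + 0.9*0 + 0.81*2 = 2.62
```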
The state-value function for policy $\pi$ is $v_\pi(s) = E_\pi[G_t \mid S_t = s]$, the expected gain from following the policy starting in state $s$. Reinforcement learning assumes you want a policy that maximizes this. Nobel prizes have been won by people who demonstrated that sure rewards seem to be preferred over uncertain rewards having larger expected value. People also seem to prefer uncertain losses over a certain loss. (Reflection. Machina, Rappoport, Tversky, Kahneman, …) A similar problem exists in the financial world for managing a portfolio of investments. Other Nobel prizes have been awarded for systematic approaches to incorporating the risks involved.
But let’s get back to our expected value street light. For exponential decay the law of total probability yields $v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma v_\pi(s')]$. We want to find the optimal state-value function $v_*(s) = \max_\pi v_\pi(s)$. The equation $v_*(s) = \max_a \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma v_*(s')]$ is the Bellman optimality equation.
Exercise. Show $v_\pi(s) \le v_*(s)$ for every state $s$ and any policy $\pi$.
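Here is a minimal value-iteration sketch that repeatedly applies the Bellman optimality update to a tiny two-state MDP. The states, actions, rewards, and transition probabilities are all invented for illustration.

```python
# Value iteration: repeatedly apply the Bellman optimality update
#   v(s) <- max_a sum_{s', r} p(s', r | s, a) [r + gamma v(s')]
# p[(s, a)] is a list of (probability, next_state, reward) triples.
p = {
    ("A", "stay"): [(1.0, "A", 1.0)],
    ("A", "move"): [(0.8, "B", 0.0), (0.2, "A", 1.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "move"): [(1.0, "A", 0.0)],
}
states = {"A", "B"}
gamma = 0.9

def actions(s):
    return [a for (s_, a) in p if s_ == s]

v = {s: 0.0 for s in states}
for _ in range(200):
    v = {s: max(sum(prob * (r + gamma * v[s1]) for prob, s1, r in p[(s, a)])
                for a in actions(s))
         for s in states}

print({s: round(x, 2) for s, x in v.items()})  # approximates v_*
```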
The action-value function for $\pi$ is $q_\pi(s, a) = E_\pi[G_t \mid S_t = s, A_t = a]$. It is the state-value function for a given action. We want to find $q_*(s, a) = \max_\pi q_\pi(s, a)$. Note for exponential decay $q_*(s, a) = \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma \max_{a'} q_*(s', a')]$ gives the optimal action-value function.
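The same invented two-state dynamics, this time iterating directly on the action-value function; again only a sketch, but it shows how the greedy policy falls out of $q_*$.

```python
# Q-value iteration: repeatedly apply
#   q(s, a) <- sum_{s', r} p(s', r | s, a) [r + gamma max_{a'} q(s', a')]
p = {
    ("A", "stay"): [(1.0, "A", 1.0)],
    ("A", "move"): [(0.8, "B", 0.0), (0.2, "A", 1.0)],
    ("B", "stay"): [(1.0, "B", 2.0)],
    ("B", "move"): [(1.0, "A", 0.0)],
}
gamma = 0.9
q = {sa: 0.0 for sa in p}

def v(s):
    """max over actions of q(s, a)"""
    return max(q[sa] for sa in p if sa[0] == s)

for _ in range(200):
    q = {(s, a): sum(prob * (r + gamma * v(s1)) for prob, s1, r in p[(s, a)])
         for (s, a) in p}

print({sa: round(x, 2) for sa, x in q.items()})
# The greedy policy takes argmax_a q_*(s, a) in each state.
```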
An $n$-armed bandit is an MDP with one state and $n$ actions. Solutions should explore the available actions and exploit the most promising. In this case the action-value function does not depend on the state. If we knew the reward distributions for each action then the optimal strategy would be to always select the action with the largest expected value, but we don’t need to estimate each individual reward distribution in order to find the optimal policy.
The $\varepsilon$-greedy strategy selects the action maximizing the current action-value function with probability $1 - \varepsilon$ and a random action with probability $\varepsilon$. The action-value function is updated based on the observed reward. Hopefully this converges to $q_*$, and we can find a method to make this happen quickly.
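A sketch of $\varepsilon$-greedy on a three-armed Bernoulli bandit. The payout means are invented and, of course, unknown to the player; the action-value estimate for each arm is the incremental sample average of its observed rewards.

```python
# epsilon-greedy on a three-armed Bernoulli bandit with made-up payout means.
import random

means = [0.3, 0.6, 0.5]           # unknown to the player
def pull(a):
    """Bernoulli reward with mean means[a]."""
    return 1.0 if random.random() < means[a] else 0.0

eps = 0.1
q = [0.0] * len(means)            # current action-value estimates
n = [0] * len(means)              # pull counts

for t in range(10_000):
    if random.random() < eps:
        a = random.randrange(len(means))                 # explore
    else:
        a = max(range(len(means)), key=lambda i: q[i])   # exploit
    r = pull(a)
    n[a] += 1
    q[a] += (r - q[a]) / n[a]     # incremental sample average

print([round(x, 3) for x in q], n)   # q should approach means; arm 1 played most
```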
First-visit Monte Carlo prediction approximates the state-value function for a given policy $\pi$. Choose an initial state-value function $V$ and generate episodes using $\pi$. For each state first visited in an episode, update $V$ at that state to be the sample average of the returns following the first visit.
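A first-visit Monte Carlo sketch on a tiny invented episodic chain A → B → end; the policy is fixed and the rewards are made up.

```python
# First-visit Monte Carlo prediction: V(s) is the sample average of the
# returns following the first visit to s in each episode.
import random

gamma = 1.0

def episode():
    """Follow the fixed policy on the made-up chain A -> B -> end."""
    traj, s = [], "A"
    while s != "end":
        r = random.choice([0.0, 1.0])           # invented reward
        s_next = {"A": "B", "B": "end"}[s]
        traj.append((s, r))
        s = s_next
    return traj

returns = {"A": [], "B": []}
for _ in range(1_000):
    traj = episode()
    g, gains = 0.0, {}
    for s, r in reversed(traj):                 # return following each step
        g = r + gamma * g
        gains[s] = g                            # ends up as the first visit's return
    for s, g in gains.items():
        returns[s].append(g)

V = {s: sum(v) / len(v) for s, v in returns.items()}
print(V)   # roughly {'A': 1.0, 'B': 0.5} for this made-up chain
```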
Temporal difference learning (TDL) is a model-free method that learns by bootstrapping the value function.
The simplest example, TD(0), specifies a learning rate $\alpha$ and updates $V(S_t)$ to $V(S_t) + \alpha\,[R_{t+1} + \gamma V(S_{t+1}) - V(S_t)]$.
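A TD(0) sketch on the same invented chain: after every step, nudge $V(S_t)$ toward the bootstrapped target.

```python
# TD(0): after each step update V(S_t) toward R_{t+1} + gamma * V(S_{t+1}).
import random

alpha, gamma = 0.05, 1.0
V = {"A": 0.0, "B": 0.0, "end": 0.0}

for _ in range(5_000):
    s = "A"
    while s != "end":
        r = random.choice([0.0, 1.0])           # invented reward
        s_next = {"A": "B", "B": "end"}[s]
        V[s] += alpha * (r + gamma * V[s_next] - V[s])
        s = s_next

print({s: round(x, 2) for s, x in V.items()})   # roughly A: 1.0, B: 0.5
```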