19/12/2024

Designing reward functions: reinforcement learning

Disha Dagli, of the IFoA’s Professionalism, Regulation and Ethics Working Party, sets out some of the considerations for reinforcement learning and reward functions

The problem

Say you’re teaching a robot to play chess: how do you teach it to win? You can’t sit there with every possible board position set up and assign a ‘correct’ move to make in every single case. If we really dive into the metaphysics of it, is there even a ‘correct’ move in isolation? So we scratch this idea.

Maybe you could try assigning a ‘response’ to each move the opponent makes. But then you realise the threat posed by an opponent’s move depends on the current position, so you’d once again need to consider every possible setup and feed that into your mapping too. So you give up again.

Maybe there is simply no known way of deterministically mapping each state to a perfect answer. The game isn’t simple enough for a programmer to come up with a deterministic answer to ‘code up’ for a robot to execute. There’s no easy way to say X is a good move or Y is a bad move in isolation, without knowing the context of the game. Part of the problem here is ‘learning’ this optimal mapping. (Fun fact: if a pre-existing labelling of good and bad moves did exist, we could use supervised learning.)

 

The solution: reinforcement learning

So what do we do? Well, let’s start with how we teach humans things. We don’t go telling people how to behave in every possible scenario – that would be absurd. Instead, we use a little trial and error. We do something, see how our environment responds and interpret this response through our internal reward system.

Say I tell a joke and everyone laughs; this translates into a little dopamine hit. I realise this is a ‘good action’ and categorise it as something I want to do more of. I push someone, they start crying, and their tears make me feel guilty. This releases some sadness hormones and boom, I decide I shouldn’t do this again. ‘Bad action’ categorised. This is my internal reward system, which works to decide how I behave. It’s based on previous experiences and is continuously evolving. Hello, reinforcement learning.
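To make that trial-and-error loop concrete, here is a minimal sketch in Python of such an internal reward system, framed as a two-action bandit. The actions, reward numbers and exploration rate are all invented for illustration rather than drawn from any real system.

```python
import random

# Toy 'internal reward system': two possible actions and the (hidden)
# average reward each one tends to produce. These numbers are made up
# purely for illustration.
TRUE_MEAN_REWARD = {"tell_joke": 1.0, "push_someone": -1.0}

def environment_response(action):
    """The world reacts with a noisy reward (dopamine hit or guilt)."""
    return TRUE_MEAN_REWARD[action] + random.gauss(0, 0.5)

# The agent starts with no opinion about either action.
value_estimate = {a: 0.0 for a in TRUE_MEAN_REWARD}
counts = {a: 0 for a in TRUE_MEAN_REWARD}
epsilon = 0.1  # how often we explore rather than exploit

for step in range(1000):
    # Mostly pick the action currently believed to be best,
    # occasionally try something at random (trial and error).
    if random.random() < epsilon:
        action = random.choice(list(value_estimate))
    else:
        action = max(value_estimate, key=value_estimate.get)

    reward = environment_response(action)

    # Update the running average reward for that action:
    # this is the 'categorising good vs bad actions' step.
    counts[action] += 1
    value_estimate[action] += (reward - value_estimate[action]) / counts[action]

print(value_estimate)  # tell_joke drifts towards +1, push_someone towards -1
```

After enough trials the agent’s estimates settle close to the true average rewards, so it tells jokes and stops pushing people – without ever being told which action was ‘correct’ in advance.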

This is exactly what we should do. And it is exactly what researchers at DeepMind and Google did to create AlphaZero in 2017. AlphaZero is a computer program that achieved a ‘superhuman level of play’ within 24 hours by using reinforcement learning. By playing against itself and learning, the program taught itself from scratch how to beat the strongest existing chess engines. It’s no exaggeration to say this was a revolution in the chess world, and it has forever changed the trajectory of the game.

Before this, chess computers like Deep Blue and Stockfish relied on brute-force methods. In practice, this meant running massive parallel calculations to explore trees of possible moves and picking the best option found. This was by no means easy: Deep Blue used 32 processors to evaluate around 200 million chess positions per second, and in 1997 it beat Garry Kasparov, the reigning world champion, in a six-game match. This approach, while effective, wasn’t efficient, and we needed something bigger to even begin the battle against Go, a game with more board variations than there are atoms in the observable universe!

And it doesn’t stop there. Reinforcement learning has become an integral part of robotic manipulation and has enabled robots to learn tasks like picking up objects, moving in new environments and even folding laundry. It’s also used by Tesla in self-driving cars to navigate dynamic, complex environments, and it sits at the core of J.P. Morgan’s Deep Neural Network for Algo Execution (DNA) market pricing toolset.

 

Reward function risks and how to address them

So what does this mean for us? Is reinforcement learning a new magic wand that we can wave at any problem to create a shiny new solution? Well… no, not really. The program learns through trial and error: each outcome gives us a learning point and, based on the reward system we’ve designed, adjusts the parameters of our neural network. The problem is that this process can become a ‘black box’. We might not know what biases we’re incorporating, or whether the reward function we’ve chosen is in our best interests. There’s also the risk of opening ourselves up to adversarial attacks. Today we’ll focus on the reward function risks.
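To see how the reward signal ends up steering those parameters, here is a small sketch of a REINFORCE-style update on a toy two-action problem, using a softmax policy in place of a full neural network. The reward values and learning rate are assumptions chosen purely for demonstration; the point is that whatever reward function we plug in becomes the thing the parameters are adjusted to maximise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hidden average reward for two actions (made-up numbers for illustration).
true_reward = np.array([0.2, 0.8])

theta = np.zeros(2)   # policy parameters: one preference per action
learning_rate = 0.1

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

for step in range(2000):
    probs = softmax(theta)
    action = rng.choice(2, p=probs)
    reward = true_reward[action] + rng.normal(0, 0.1)

    # REINFORCE-style update: nudge the parameters so that actions
    # followed by higher reward become more probable. The reward we
    # designed *is* the learning signal - change it, and the behaviour
    # the policy converges to changes with it.
    grad_log_pi = -probs
    grad_log_pi[action] += 1.0
    theta += learning_rate * reward * grad_log_pi

print(softmax(theta))  # probability mass concentrates on the better-rewarded action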

For example, OpenAI tried using reinforcement learning on CoastRunners, a game in which the player has to hit targets along a route and finish a boat race as quickly as possible. Using the player’s score (from hitting targets) as the reward function led to some interesting behaviour. The bot learned it could earn a higher score by veering off the course and hitting more targets while never finishing the actual race. This discrepancy between the actual goal and the reward function is a very common issue. While innocuous here, it can have potentially dangerous outcomes (think of the paperclip problem: a thought experiment in which a sufficiently intelligent AI could eventually wipe out humanity in its quest to make as many paperclips as possible).
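A stripped-down sketch of that mismatch might look like the following. The two ‘policies’ and their scores are invented for this illustration, but they mimic the CoastRunners outcome: the proxy reward ranks the degenerate behaviour above the behaviour we actually wanted.

```python
# A toy illustration of the CoastRunners-style gap between the proxy reward
# (points for hitting targets) and the true goal (finishing the race).
# Both policies and all numbers are invented purely for this sketch.

def proxy_reward(targets_hit):
    """What the agent is actually trained to maximise."""
    return targets_hit

def true_objective(finished_race):
    """What the designers actually wanted."""
    return 1.0 if finished_race else 0.0

policies = {
    "race_to_finish":   {"targets_hit": 10, "finished_race": True},
    "loop_for_targets": {"targets_hit": 50, "finished_race": False},
}

for name, outcome in policies.items():
    print(f"{name:18s}  proxy reward = {proxy_reward(outcome['targets_hit']):3.0f}"
          f"   true goal met = {true_objective(outcome['finished_race'])}")

# An agent maximising the proxy reward prefers 'loop_for_targets',
# even though that policy never achieves the real objective.
```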

Designing a reward function should be a trial-and-error process that incorporates human input and feedback at regular intervals. Be wary of sparse functions that don’t offer much learning signal (such as 1 when you win and 0 otherwise). Shape reward functions with intermediate rewards to guide the agent towards the goal, but also be cautious of functions that may incentivise behaviours other than those desired. Often a simpler reward is preferable to an overcomplicated structure that could create loopholes. Also, keep reward magnitudes consistent to ensure stable learning: very large rewards can cause instability, while very small rewards might make learning too slow.
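As a rough illustration of the difference between a sparse reward and a shaped one, here is a sketch for a hypothetical grid-world task. The Manhattan-distance shaping term and the 0.01 scaling factor are assumptions chosen for demonstration, not recommended values.

```python
# Two candidate reward functions for a toy grid-world agent heading for a
# goal square. The names, grid and scaling factor are illustrative only.

def sparse_reward(next_state, goal):
    """1 on reaching the goal, 0 otherwise. Correct, but the agent gets
    almost no learning signal until it stumbles onto the goal by chance."""
    return 1.0 if next_state == goal else 0.0

def shaped_reward(state, next_state, goal):
    """Adds a small intermediate reward for moving closer to the goal.
    The shaping term is kept small relative to the terminal reward, so the
    agent isn't tempted to optimise the shaping signal instead of winning."""
    def manhattan(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])

    progress = manhattan(state, goal) - manhattan(next_state, goal)
    return sparse_reward(next_state, goal) + 0.01 * progress

# Example: one step towards a goal at (3, 3).
print(sparse_reward((1, 2), (3, 3)))          # 0.0 - nothing to learn from
print(shaped_reward((1, 1), (1, 2), (3, 3)))  # 0.01 - a gentle nudge
```

Keeping the shaping term an order of magnitude or two smaller than the terminal reward is one simple way of respecting the point above about consistent magnitudes.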

 
