Marcus Hutter
And that is a very interesting question.
And I'm asked a lot about this question.
Where do the rewards come from?
And that depends.
And I'll give you a couple of answers now.
So if you want to build agents, let's start simple.
So let's assume we want to build an agent based on the AIXI model, which performs a particular task.
Let's start with something super simple like playing chess or Go or something.
Then the reward is: winning the game is plus 1, losing the game is minus 1, done.
You apply this agent.
If you have enough compute, you let it self-play, and it will learn the rules of the game and play perfect chess.
After a while, problem solved.
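As a minimal sketch of the game reward just described, it could be written as a function that only pays out when the game ends. The function name `terminal_reward`, the outcome labels, and the value 0 for a draw are assumptions made for illustration; only plus 1 for a win and minus 1 for a loss come from the discussion.

```python
def terminal_reward(outcome: str) -> int:
    """Reward handed to the agent once the game is over.

    "win" and "loss" follow the values mentioned in the talk;
    the 0 for "draw" is an assumption added for completeness.
    """
    rewards = {"win": 1, "loss": -1, "draw": 0}
    return rewards[outcome]


if __name__ == "__main__":
    for outcome in ("win", "loss", "draw"):
        print(outcome, terminal_reward(outcome))
```

The point of the sketch is that the reward says nothing about how to play; the agent has to work that out from the final result alone.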
So if you have more complicated problems, then you may believe that you have the right reward, but you don't.
So a nice, cute example is elevator control.
That is also in Rich Sutton's book, which is a great book, by the way.
So you control the elevator and you think, well, maybe the reward should be coupled to how long people wait in front of the elevator.
You know, long wait is bad.
You program it and you do it.
And what happens is the elevator eagerly picks up all the people but never drops them off.
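The failure can be reproduced in a toy model. This is only a sketch under strong assumptions, not the setup from the book: one person arrives per time step, the elevator performs one action per step, and the reward is the negative number of people still waiting in front of the elevator. All names (`simulate`, `greedy_pickup`, `pickup_then_deliver`) are hypothetical.

```python
def simulate(policy, steps=10):
    """Run a toy elevator and return (total_reward, delivered)."""
    waiting = 0      # people standing in front of the elevator
    riding = 0       # people inside the car
    delivered = 0    # people actually dropped off
    total_reward = 0
    for t in range(steps):
        waiting += 1  # toy assumption: one new person arrives each step
        action = policy(t, waiting, riding)
        if action == "pickup" and waiting > 0:
            waiting -= 1
            riding += 1
        elif action == "dropoff" and riding > 0:
            riding -= 1
            delivered += 1
        # The reward only penalizes people who are still waiting.
        # Nothing rewards a drop-off, and riding forever costs nothing.
        total_reward -= waiting
    return total_reward, delivered


def greedy_pickup(t, waiting, riding):
    """Always pick people up, never drop anyone off."""
    return "pickup"


def pickup_then_deliver(t, waiting, riding):
    """Drop off whoever is riding before picking up the next person."""
    return "dropoff" if riding > 0 else "pickup"


if __name__ == "__main__":
    print("greedy pickup:      ", simulate(greedy_pickup))
    print("pickup then deliver:", simulate(pickup_then_deliver))
```

In this model the policy that never drops anyone off collects the higher reward while delivering nobody, which is the behavior described above. A reward that also counted time spent inside the elevator, or paid out on delivery, would remove the loophole in this toy case.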