Marcus Hutter
And first, I should say: how do we measure performance?
So we measure performance by giving the agent reward.
That's the so-called reinforcement learning framework.
So every time step, you can give it a positive reward, a negative reward, or maybe no reward.
It could be very sparse, right?
Like if you play chess, just at the end of the game, you give plus one for winning or minus one for losing.
So in the AIXI framework, that's completely sufficient.
So occasionally you give a reward signal and you ask the agent to maximize reward, but not greedily, just taking the next reward and the next one, because being greedy is very bad in the long run.
But over the lifetime of the agent.
So let's assume the agent lives for m time steps, let's say it dies sharply after 100 years.
That's just the simplest model to explain.
So it looks at the future reward sum and asks: what is my action sequence, or more precisely my policy, which leads in expectation, because I don't know the world, to the maximum reward sum?
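This finite-horizon objective, maximizing the expected sum of rewards over an m-step lifetime, can be sketched with a Monte Carlo estimate. This is only an illustration of the objective, not AIXI itself (AIXI does not know the environment); the environment, policy, and all names here are made up for the example.

```python
import random

def expected_reward(policy, step, m, episodes=2000, seed=0):
    """Monte Carlo estimate of E[sum of rewards over an m-step lifetime].
    `policy` maps a state to an action; `step` maps (state, action, rng)
    to (next_state, reward). Illustrative only, not AIXI."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        state = 0
        for _ in range(m):
            action = policy(state)
            state, reward = step(state, action, rng)
            total += reward
    return total / episodes

# Toy world: action 1 pays 1 with probability 0.7, action 0 pays 0.4 surely.
def step(state, action, rng):
    return state, (1.0 if rng.random() < 0.7 else 0.0) if action else 0.4

risky = lambda s: 1   # expected lifetime reward about 70 over m=100 steps
safe = lambda s: 0    # expected lifetime reward exactly 40
```

The optimal policy is the one with the larger expected reward sum over the whole lifetime, which is how the framework avoids greedy behavior.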
Let me give you an analogy.
In chess, for instance, we know how to play optimally in theory.
It's just a minimax strategy.
I play the move which seems best to me, under the assumption that the opponent plays the move which is best for him, so worst for me, under the assumption that I play, again, the best move. You expand this expectimax tree to the end of the game, then you backpropagate the values, and you get the best possible move.
So that is the optimal strategy, which von Neumann already figured out a long time ago, for playing adversarial games.
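The backpropagation over the game tree described above can be sketched in a few lines. Here the tree is a toy nested-list structure (leaves are terminal rewards from the maximizer's view), which is an assumption of this example, not how a real chess engine represents positions.

```python
def minimax(node, maximizing=True):
    """Exhaustive minimax over a nested-list game tree.
    Leaves are terminal rewards (+1 win, -1 loss, 0 draw) for the maximizer."""
    if isinstance(node, (int, float)):
        return node  # terminal position: return its value
    child_values = [minimax(child, not maximizing) for child in node]
    # My move: pick the best child; opponent's move: assume the worst for me.
    return max(child_values) if maximizing else min(child_values)

# Depth-2 tree: I choose a branch, then the opponent picks the leaf worst for me.
tree = [[+1, -1], [0, 0]]
print(minimax(tree))  # 0: the second branch guarantees at least a draw
```

The first branch looks tempting (+1 is reachable), but the opponent would steer to -1, so the optimal root value is 0, exactly the worst-case reasoning in the passage above.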
Luckily, or maybe unluckily for the theory, it becomes harder.