
So, we've learned quite a bit so far.

We've learned about Markov Decision Processes.

These are fully observable, with a set of states

and corresponding actions that have stochastic effects,

characterized by a conditional probability distribution P(s' | s, a):

the probability of reaching state s' given that we apply action a in state s.
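As a concrete sketch, here is a stochastic transition model for a hypothetical two-state MDP (the states, actions, and probabilities are made up for illustration):

```python
import random

# Sketch of a stochastic transition model P(s' | s, a).
# Hypothetical two-state MDP; all probabilities are assumed.
P = {
    (0, 'a'): [(0, 0.5), (1, 0.5)],  # from state 0, action 'a' is stochastic
    (0, 'b'): [(1, 1.0)],
    (1, 'a'): [(1, 1.0)],
    (1, 'b'): [(0, 1.0)],
}

def sample_next_state(s, a, rng=random):
    """Draw s' according to P(s' | s, a)."""
    outcomes, probs = zip(*P[(s, a)])
    return rng.choices(outcomes, weights=probs, k=1)[0]
```

This is the sense in which nature, not the agent, picks the outcome: the agent chooses a, and s' is then sampled from the distribution.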

We seek to maximize a reward function

that we define over states.

You could equally define it over state-action pairs.

The objective was to maximize the expected

future cumulative, discounted rewards,

as shown by this formula over here.
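The objective can be sketched in a few lines of code, computing the sum of gamma^t times r_t over a reward sequence (the discount factor and rewards below are made-up numbers):

```python
# Sketch of the objective: the cumulative discounted reward
# sum over t of gamma**t * r_t, for one reward sequence.
def discounted_return(rewards, gamma=0.9):
    """gamma is an assumed discount factor in [0, 1)."""
    return sum(gamma**t * r for t, r in enumerate(rewards))

# e.g. discounted_return([1, 1, 1], gamma=0.5) gives 1 + 0.5 + 0.25 = 1.75
```

The expectation in the formula is over the stochastic action outcomes; this sketch just evaluates one realized sequence.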

The key to solving them was called value iteration

where we assigned a value to each state.

There are alternative techniques that assign values

to state-action pairs, often called Q(s, a),

but we didn't really consider this so far.

We defined a recursive update rule

to update V(s) that was very logical

after we understood that we have an action choice,

but nature chooses for us the outcome of the action

according to the stochastic transition probability over here.
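The recursive update can be sketched as follows, on a hypothetical two-state MDP (all states, actions, probabilities, and rewards are made up; the discount factor is assumed):

```python
GAMMA = 0.9  # assumed discount factor

# P[s][a] = list of (next_state, probability); all numbers are made up.
P = {
    0: {'a': [(0, 0.5), (1, 0.5)], 'b': [(1, 1.0)]},
    1: {'a': [(1, 1.0)], 'b': [(0, 1.0)]},
}
R = {0: 0.0, 1: 1.0}  # reward defined over states, as in the lecture

# Recursive update: V(s) <- R(s) + gamma * max_a sum_s' P(s'|s,a) V(s')
# The agent maximizes over actions; nature averages over outcomes.
V = {s: 0.0 for s in P}
for _ in range(100):
    V = {s: R[s] + GAMMA * max(sum(p * V[sp] for sp, p in P[s][a])
                               for a in P[s])
         for s in P}
```

For this particular MDP the values converge to V(1) = 10 and V(0) = 9: the max expresses the agent's action choice, and the inner sum expresses nature's stochastic outcome.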

And then we observed that value iteration converged,

and we were able to define a policy by taking

the argmax over actions in the value iteration expression,

which I don't spell out over here.
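Spelled out as a sketch, the policy is the argmax over actions of the expected next-state value (same hypothetical two-state MDP as before; all numbers are made up, and the values are assumed already converged):

```python
# P[s][a] = list of (next_state, probability); hypothetical MDP.
P = {
    0: {'a': [(0, 0.5), (1, 0.5)], 'b': [(1, 1.0)]},
    1: {'a': [(1, 1.0)], 'b': [(0, 1.0)]},
}
V = {0: 9.0, 1: 10.0}  # assumed converged values from value iteration

# pi(s) = argmax_a sum_s' P(s'|s,a) V(s').
# R(s) and gamma are the same for every action in s,
# so they drop out of the argmax.
policy = {s: max(P[s], key=lambda a: sum(p * V[sp] for sp, p in P[s][a]))
          for s in P}
```

Here the policy prefers 'b' in state 0 (expected value 10 versus 9.5) and 'a' in state 1.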

This is a beautiful framework.

It's really different from the planning we saw before

because of the stochasticity of the action effects.

Rather than making a single sequence of states and actions,

as would be the case in deterministic planning,

now we make an entire field of actions, a so-called policy,

that assigns an action to every possible state.

And we compute this using a technique called value iteration

that spreads value in reverse order through the field of states.