This time, we would like to use a Markov Decision Process in order to model the possibility that some actions have an uncertain effect on the agent and the environment. In particular, we will consider the scenario where at each time step the agent gets stuck in the current city with some probability.
For simplicity of implementation, we will consider that for each time step $t$ of the MDP, the associated state variable $S_t$ always has the same domain $D$, consisting of all 20 cities on the map.
$$ S_0,S_1,\dots,S_{T-1} \in D \qquad D = \{\text{Arad},\text{Urziceni},\dots \} $$
Furthermore, we will consider that from every state and every value of that state, we always have the option of taking an action that moves to any city, regardless of the road connectivity, i.e. $A_0,A_1,\dots,A_{T-2} \in D$. To eliminate solutions that do not respect the road constraints, we incorporate penalties into the reward nodes $R_0, R_1, \dots$: specifically, we generate a strongly negative reward whenever such an illegal action is chosen. As a result, the inference algorithm will effectively discard these actions. The corresponding MDP is illustrated below.
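As a quick plausibility check on this penalty construction (our own back-of-the-envelope reasoning, not part of the original exercise statement): a policy that respects the road constraints, e.g. one that simply stays in its start city, secures a total reward of at least $0$, whereas any plan containing even a single illegal action collects at most
$$ -99 + \underbrace{20}_{\text{best possible final reward}} = -79, $$
so an optimal policy can never include an illegal action.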
import numpy
import utils

na = numpy.newaxis  # shorthand used below when inserting new array axes

cities = utils.cities
nbcities = len(cities)
(a) Building Probability and Reward Arrays
As a first step, we would like to build arrays of probability and reward values which will support our inferences. We will build three arrays for that purpose:
- A numpy array P of size $N \times N \times N$, where $N$ is the number of cities, and where the first dimension represents the current state, the second dimension represents the action being taken, and the third dimension represents the state in the next time step. The array should be built such that one arrives in the intended city (specified by our action) with probability 0.8 and stays stuck in the current city with probability 0.2. Because the process is time-independent, we only need one such array.
- A numpy array Rmid of size $N \times N \times N$ representing the rewards at each time step. It includes an artificial negative reward of $-99$ for actions that are not allowed from the current state. It further adds a negative reward of $-1$ for transitions between two distinct cities, in order to model the cost of fuel. As for the array P, the first dimension represents the state, the second dimension the action, and the last dimension the state we transition to.
- A numpy array Rlast of size $1 \times 1 \times N$ that adds to the reward of the last time step a bonus if the goal has been achieved. Specifically, it adds a reward of $20$ if the agent arrives in Urziceni in the final step (our goal), $10$ if it instead terminates in Craiova (a fallback objective), and a reward of $0$ otherwise.
In other words, the rewards $R_0,\dots,R_4$ are given by Rmid and the reward $R_5$ is given by the sum of Rmid and Rlast.
I = numpy.eye(len(utils.cities))
Illegal = (1-I-utils.A)  # 1 wherever the move is neither staying put nor along a road
print(Illegal)

[[0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 0.]
 [1. 0. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 0. 1. 1.]
 [1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1.]
 [1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 0. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1.]
 [1. 0. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 0. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1.]
 [1. 1. 1. 0. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 0. 1. 1. 1. 0.]
 [1. 0. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1. 1. 1. 1. 1.]
 [1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 0. 1. 1. 1. 1.]
 [0. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 0. 1. 0. 0. 1. 1. 1. 1.]
 [0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1.]
 [1. 0. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1.]
 [1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 1. 1. 0. 0. 1.]
 [0. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 1. 0. 1. 1. 1. 1. 1. 1. 0.]]

def buildArrays():
    N = len(utils.cities)
    I = numpy.eye(N)

    # Illegal[s, a] is 1 if action a cannot be taken from city s,
    # i.e. a is neither the current city nor directly connected to it
    Illegal = 1 - I - utils.A

    # Rmid[s, a, s']: -99 penalty for illegal actions,
    # plus -1 fuel cost when the realized transition moves between two distinct cities
    Rmid = (-99 * Illegal)[:, :, na] - 1 * utils.A[:, na, :]

    # Rlast[0, 0, s']: +20 for ending in Urziceni, +10 for ending in Craiova
    Rlast = (
        20 * I[utils.cities.index('U')] +
        10 * I[utils.cities.index('C')]
    )[na, na, :]

    # P[s, a, s']: stay stuck with probability 0.2,
    # arrive in the intended city with probability 0.8
    P = numpy.zeros((N, N, N))
    for s in range(N):
        for a in range(N):
            P[s, a, s] = 0.2
            P[s, a, a] += 0.8  # += so that P[s, a, s] = 1 when a == s

    return P, Rmid, Rlast
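Before testing, a quick optional check (our own addition, not part of the exercise): the transition model should define a valid probability distribution over next states for every state-action pair, and staying put should always be a legal, zero-cost action.

P, Rmid, Rlast = buildArrays()

# every (state, action) pair must define a probability distribution over next states
assert numpy.allclose(P.sum(axis=-1), 1.0)

# staying in the current city is always legal and consumes no fuel
assert all(Rmid[s, s, s] == 0 for s in range(nbcities))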
We can test the implementation using the following code.
P,Rmid,Rlast = buildArrays()
print(P[:1,:2,:])
print(Rmid[0,-2:,:])
print(Rlast[:,:,:])
[[[1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0. ]
  [0.2 0.8 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   0.  0.  0. ]]]
[[ -99.  -99.  -99.  -99.  -99.  -99.  -99.  -99.  -99.  -99.  -99.  -99.
   -99.  -99.  -99. -100. -100.  -99.  -99. -100.]
 [  -0.   -0.   -0.   -0.   -0.   -0.   -0.   -0.   -0.   -0.   -0.   -0.
    -0.   -0.   -0.   -1.   -1.   -0.   -0.   -1.]]
[[[ 0.  0. 10.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.
   20.  0.  0.]]]
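These slices confirm the construction: from Arad (the first state), the second-to-last action (moving to Vaslui) is illegal and penalized with $-99$ (or $-100$ where fuel would additionally be consumed), whereas the last action (moving to the neighboring city Zerind) is legal and only incurs the fuel cost of $-1$ when the agent actually moves to a neighboring city.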
(b) Value Iteration and Expected Utilities under Rational Acting
Now, we can carry out the value iteration algorithm. This will enable us to measure the expected utility of being in a given city on the first day and acting rationally. The recursive equations for generating the expected utilities when following an optimal policy are given by:
\begin{align*} Q(A_t,S_t) &= \sum_{S_{t+1} \in D} \big[R(S_t,A_t,S_{t+1}) + U(S_{t+1})\big] P(S_{t+1}|A_t,S_t)\\ U(S_t) &= \max_{A_t \in D} Q(A_t,S_t) \end{align*}
One first sets $U(S_6)$ to an array of zeros (there is no future reward beyond the last day), and then computes $U(S_5),U(S_4),\dots,U(S_0)$ by repeated application of these equations. This can be implemented compactly as:
U = numpy.zeros([len(utils.cities)])

for t in [5, 4, 3, 2, 1, 0]:
    R = Rmid + (Rlast if t == 5 else 0)  # include the terminal bonus at the last step
    Q = (P * (U + R)).sum(axis=-1)       # Q(A_t, S_t): expected utility of each action
    U = Q.max(axis=-1)                   # U(S_t): utility when acting optimally

utils.printvalues(U)
Arad: 14.335
Bucharest: 18.999
Craiova: 16.739
Dobreta: 14.699
Eforie: 17.970
Fagaras: 17.970
Giurgiu: 17.970
Hirsova: 18.999
Iasi: 17.970
Lugoj: 7.593
Mehadia: 11.279
Neamt: 16.686
Oradea: 14.335
Pitesti: 17.972
Rimnicu Vilcea: 16.737
Sibiu: 16.694
Timisoara: 9.692
Urziceni: 20.000
Vaslui: 18.999
Zerind: 9.692
and the resulting utilities $U(S_0)$ can then be printed. They correspond to the expected utility one would obtain when starting in each city on the map and acting rationally. As can be observed, the utility associated with starting in Arad is approximately 14. Had we started in a different city, e.g. one closer to Urziceni, the expected utility would be higher, with the maximum utility obtained when already being in Urziceni on day 1.
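One of these values is easy to verify by hand (our own check): an agent already in Urziceni can select the action "stay in Urziceni" on every day. Staying is always legal, consumes no fuel, and both possible outcomes (getting stuck and arriving as intended) leave the agent in Urziceni, so the final reward of $20$ is collected with certainty, giving
$$ U(\text{Urziceni}) = \underbrace{0 + \dots + 0}_{\text{fuel costs}} + 20 = 20, $$
exactly as printed above.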
(c) Extracting the Policy
We would now like to determine what policy is required at each time step in order to achieve the desired high expected utility. Implement an extension of the value iteration procedure above that this time also keeps track of the best actions at each time step. As a reminder, the best actions can be extracted using the equation:
\begin{align*} \pi_t(S_t) &= \arg\max_{A_t \in D} Q(A_t,S_t) \end{align*}
The function should return a matrix $A$ of size $6 \times \#\,\text{cities}$ indicating, for each time step and state value (city) at that time step, the best possible action to take.
def getPolicy():
    P, Rmid, Rlast = buildArrays()
    U = numpy.zeros(len(utils.cities))
    A = numpy.zeros((6, len(utils.cities)), dtype=int)

    for t in [5, 4, 3, 2, 1, 0]:
        R = Rmid + (Rlast if t == 5 else 0)
        Q = (P * (U + R)).sum(axis=-1)
        A[t] = Q.argmax(axis=-1)  # pi_t: best action in each state at time t
        U = Q.max(axis=-1)

    return A
The policy is visualized by running the code below.
utils.printpolicy(getPolicy())
day 1  | day 2  | day 3  | day 4  | day 5  | day 6
A -> S | A -> S | A -> S | A -> S | A -> A | A -> A
B -> U | B -> U | B -> U | B -> U | B -> U | B -> U
C -> P | C -> P | C -> P | C -> P | C -> C | C -> C
D -> C | D -> C | D -> C | D -> C | D -> C | D -> C
E -> H | E -> H | E -> H | E -> H | E -> H | E -> E
F -> B | F -> B | F -> B | F -> B | F -> B | F -> F
G -> B | G -> B | G -> B | G -> B | G -> B | G -> G
H -> U | H -> U | H -> U | H -> U | H -> U | H -> U
I -> V | I -> V | I -> V | I -> V | I -> V | I -> I
L -> M | L -> M | L -> M | L -> M | L -> L | L -> L
M -> D | M -> D | M -> D | M -> D | M -> D | M -> M
N -> I | N -> I | N -> I | N -> I | N -> N | N -> N
O -> S | O -> S | O -> S | O -> S | O -> O | O -> O
P -> B | P -> B | P -> B | P -> B | P -> B | P -> C
R -> P | R -> P | R -> P | R -> P | R -> C | R -> C
S -> F | S -> F | S -> F | S -> F | S -> R | S -> S
T -> A | T -> A | T -> A | T -> T | T -> T | T -> T
U -> U | U -> U | U -> U | U -> U | U -> U | U -> U
V -> U | V -> U | V -> U | V -> U | V -> U | V -> U
Z -> A | Z -> A | Z -> A | Z -> Z | Z -> Z | Z -> Z
We observe that the best action when starting in Arad is to first move towards Sibiu. If the action is carried out successfully, one then moves from Sibiu towards Fagaras, and so on. Interestingly, as time passes, reaching Urziceni may no longer be possible from certain locations. In these cases, the rational action may be to opt for the fallback objective of going to Craiova or, less ambitiously, to simply stay in the current city in order to avoid any further spending on fuel.
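As a final optional cross-check (a minimal sketch of our own, not part of the exercise), we can validate these results empirically: repeatedly simulating six-day trajectories under the extracted policy and averaging the collected rewards should approach the expected utilities computed in part (b), e.g. roughly 14.3 when starting in Arad.

def simulate(P, Rmid, Rlast, policy, start, rng):
    # roll out one six-day trajectory under the given policy and return its total reward
    s, total = start, 0.0
    for t in range(6):
        a = policy[t, s]
        snext = rng.choice(nbcities, p=P[s, a])  # 0.8 intended city, 0.2 stuck
        total += Rmid[s, a, snext]
        if t == 5:
            total += Rlast[0, 0, snext]  # terminal bonus on the last day
        s = snext
    return total

rng = numpy.random.default_rng(0)
policy = getPolicy()
start = utils.cities.index('A')  # Arad
print(numpy.mean([simulate(P, Rmid, Rlast, policy, start, rng)
                  for _ in range(10000)]))  # should be close to U(Arad) from part (b)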