Mehran Sahami Handout #10
CS121 August 7, 2001
Midterm Examination Solutions
Problem 1: Uninformed Search (10 points)
Show which nodes are expanded, and the order in which they are expanded using:

Breadthfirst Search
A, B, C, D, E, F, G, H, I, J

Depthfirst Search
A, B, E, L, M, F, C, G, H, N, O

Iterative Deepening Search (starting with depth 0, and incrementing the depth
by 1 in each iteration)
A, A, B, C, D, A, B, E, F, C, G, H, I, D, J
Problem 2: Heuristic Search (18 points)
Suppose that you are given a heuristic function hhat. Although hhat is not admissible, it is nearly admissible, in the sense that there is some small constant such that:
0 hhat(n) h(n) +
for every node n. Prove that, with such a heuristic, the A* algorithm will find a path to the goal (assuming one exists) for which the total cost g is within of optimal.
Proof:
Let y be a goal node with optimal cost. We denote this optimal cost g(y) = g*.
We perform a proof by contradiction. Negating the desired result, we assume that A* has found a path to a goal node v which has total cost g(v) > g* + .
In order for node v to be expanded, it means than no other node in the OPEN list for A* could have a smaller fhat value.
fhat(v) = g(v) + hhat(v)
> g* + (since g(v) > g* + , and hhat(v) 0)
First, we note that the node v cannot be the same as the node y, since fhat(y) < fhat(v).
fhat(y) = g(y) + hhat(y)
= g* + hhat(y)
g* + (since h(y) = 0)
Now, let node u be some node on the path to the optimal goal y that has not yet been expanded, but is on the OPEN list. Note that such a node u must exist since the goal y is on some path from the root, and y has not yet been expanded (otherwise we would have found a goal node prior to v).
Realize that, fhat(u) = g(u) + hhat(u)
g(u) + h(u) + .
Now, we note that g(u) + h(u) is simply the cost of reaching the optimal goal y, going through the node u. But since u lies on the path to the optimal goal y, it means that the cost of reaching the optimal goal y going through node u, is simply g* the cost of reaching the optimal goal node.
Thus, fhat(u) g* + .
This fhat value is lower than v, which means that node u must get expanded before expanding node v. However, if this were the case, and node u can represent any node on the path to y, it means that all nodes on the path to y (including y) must be expanded before node v. Hence we have a contradiction, so the goal node v that is found cannot have cost > g* + . So the goal node found must be within of optimal.
Problem 3: AlphaBeta (12 points)
Consider the following game tree. The value for each leaf node in the tree is shown immediately below each node. The first player is the maximizing player.

Perform alphabeta search, and show which nodes in the search tree are pruned. Also give the “backedup” values at each node of the game tree that is not pruned.
Pruned nodes: Q
I (and, thus, T and U)
K (and, thus, X and Y)

What move should the first player make?
The first player should make the move that leads to node B.
Problem 4: DelayedReinforcement Learning (12 points)

Briefly explain what the “exploration vs. exploitation tradeoff” is in delayedreinforcement learning. Explain why it arises and how occasionally taking random actions helps to address this problem.
The exploitation vs. exploration tradeoff refers to the fact that an agent may learn a policy that is nonoptimal, but still achieves a positive reward. As a result, the agent may choose to follow this policy and obtain the reward. This exploitation of the existing policy, however, may cause the agent to foregoing the opportunity to learn a more optimal policy that achieves a greater discounted reward. Such a policy may exist, but if the agent has not explored the space of policies sufficiently, then this better policy may not be found.
Occasionally taking random actions helps address this problem, for two reasons. First, since the agent is exploiting its current policy most of the time, it is still obtaining positive rewards in general. However, by taking occasional random actions, the agent will explored parts of its world that it would otherwise have not ventured into if it was rigidly exploiting its current policy. Hence, the agent can balance exploiting getting rewards with its current policy, while also occasionally exploring other parts of its worls to try to learn better policies.

You are given a robot simulator (like the one you wrote in the first programming assignment). In this simulation, the robot is using delayedreinforcement learning with 0 < < 1, and takes a random action (as opposed to the action that maximizes expected reward) with probability P = c, where c is a very small (much less than 0.1) positive constant. If the robot simulation runs for many (possibly infinite) number of time steps, is it guaranteed to eventually find an optimal policy? Informally, explain why or why not.
Yes, the simulation is guaranteed to eventually find an optimal policy. The reason for this is that there are two criteria for guaranteeing finding an optimal policy. The first is that the reinforcement learning be done using 0 < < 1. This is given. The second is that every state in the world be visited infinitely often. Since the simulation contains an element of randomness and may proceed for possibly infinite time steps, every state in the world will be visited a potentially infinite number of times. This theoretically guarantees that an optimal policy will be converged on by the agent. In practice, an optimal policy is generally found after a large number of simulation steps when a small element of randomness is used.
Problem 5: Propositional Logic (15 points)
Use resolution to perform the following proof:
Given: (P Q R) S
Q R
S P
R
Prove: S

{P, Q, R, S} Given

{Q, R} Given

{S, P} Given

{R} Given

{S} Negation of goal

{P, R, S} Resolve 1 and 2 on Q

{R, S} Resolve 3 and 6 on P

{S} Resolve 4 and 7 on R

{} Resolve 5 and 8 on S
Problem 6: Predicate Logic (23 points)

Convert the following English sentences into firstorder predicate logic.

Angelica is a mouse.

Bob is a friend of all dogs.

Cecil is a cat.

David is a dog.

Angelica is friends with all of Bob’s friends who are not cats.

Dogs are not cats.
Mouse(Angelica)
(forall x) (Dog(x) Friend(Bob, x))
Cat(Cecil)
Dog(David)
(forall x) ([Friend(Bob, x) Cat(x)] Friend(Angelica, x))
(forall x) (Dog(x) Cat(x))

Convert the logical sentences from part (a) into clausal form.

{Mouse(Angelica)}

{Dog(x1), Friend(Bob, x1)}

{Cat(Cecil)}

{Dog(David)}

{Friend(Bob, x2), Cat(x2), Friend(Angelica, x2)}

{Dog(x3), Cat(x3)}

Use resolution to prove: Angelica is friends with someone.

{Friend(Angelica, x4)} Negating the goal

{Friend(Bob, x5), Cat(x5)} Resolving 5 and 7 on Friend(Angelica, x4)
with s = {x2/x4}

{Dog(x6), Cat(x6)} Resolving 2 and 8 on Friend(Bob, x5)
with s = {x1/x5}

{Dog(x7)} Resolving 6 and 9 on Cat(x6)
with s = {x3/x6}

{} Resolving 4 and 10 on Dog(David)
with s = {x7/David} 