In the context of this exercise, why does an overly large learning rate cause the graph to diverge?
The graph in the exercise shows what happens when the learning rate is too large: the parameter updates repeatedly overshoot the minimum, producing a zig-zag pattern of divergence.
When an update overshoots the minimum, the next gradient step points back in the opposite direction, toward the minimum. But with a large learning rate, each corrective step is itself too large, so every update lands farther from the minimum than the one before it. The error grows with each step, and the iterates bounce from side to side with increasing magnitude, producing the zig-zag pattern of divergence seen in the graph.
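A minimal sketch of this behavior, using gradient descent on the simple function f(x) = x² (gradient 2x), where the update rule multiplies x by (1 − 2·lr) each step. The function name and starting point here are illustrative, not from the exercise:

```python
def gradient_descent(lr, x0=1.0, steps=10):
    """Minimize f(x) = x^2 (gradient 2x) starting from x0; return all iterates."""
    xs = [x0]
    x = x0
    for _ in range(steps):
        x = x - lr * 2 * x  # each update multiplies x by (1 - 2*lr)
        xs.append(x)
    return xs

# Small learning rate: |1 - 2*lr| < 1, so steps shrink toward the minimum at x = 0.
converging = gradient_descent(lr=0.1)

# Too-large learning rate: |1 - 2*lr| > 1, so each step overshoots zero,
# flips sign, and lands farther away than before -- the zig-zag divergence.
diverging = gradient_descent(lr=1.1)
```

With lr = 1.1 the iterates alternate in sign while growing in magnitude, which is exactly the zig-zag divergence the graph shows; with lr = 0.1 they decay smoothly toward zero.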