Why is the episode reward recorded this way? The moving average makes the reward curve look smooth, but the curve no longer shows the actual per-episode rewards.
Why not just record the raw value of ep_r?
if global_ep_r.value == 0.:
    global_ep_r.value = ep_r
else:
    global_ep_r.value = global_ep_r.value * 0.99 + ep_r * 0.01
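
To illustrate the concern, here is a minimal sketch (not the repository's code) that replays the same update rule over a list of rewards, so the raw and smoothed values can be compared side by side. The function name `smooth_rewards` and the sample rewards are hypothetical, chosen only for illustration.

```python
def smooth_rewards(episode_rewards, decay=0.99):
    """Apply the update rule from the snippet above to a list of raw rewards.

    Hypothetical helper for illustration; `decay=0.99` mirrors the
    0.99 / 0.01 weights in the quoted code.
    """
    running = 0.0
    smoothed = []
    for ep_r in episode_rewards:
        if running == 0.0:
            # First episode (or a running value of exactly 0): take ep_r as is.
            running = ep_r
        else:
            # Exponential moving average: each new episode contributes only 1%.
            running = running * decay + ep_r * (1 - decay)
        smoothed.append(running)
    return smoothed


# Made-up rewards that swing widely between episodes.
raw = [10.0, 200.0, 10.0, 200.0]
smoothed = smooth_rewards(raw)

for ep_r, avg in zip(raw, smoothed):
    print(f"raw={ep_r:6.1f}  smoothed={avg:8.3f}")
```

With these made-up numbers the raw rewards jump between 10 and 200, while the smoothed value stays below 14, so a plot of the smoothed value hides almost all of the episode-to-episode variance. Logging both values would show the trend without losing the raw signal.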