[RL] Improve reward function

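Instead of `(p_0 - vwap_t)`, compare against `p_0 - (max([p_0; p_t]) + min([p_0; p_t])) / 2`, normalized to the range -1 to 1. That is, measure `p_0` against the midpoint between the highest and lowest price seen from `p_0` to `p_t`, rather than against the VWAP. This should give a stable reward for any kind of price fluctuation.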
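
A minimal sketch of the proposed reward, to make the formula concrete. The function name `midrange_reward` and the normalization are assumptions, not part of the proposal: the issue only says "normalized between -1 and 1", so this sketch divides by half the price range, which bounds the result because `p_0` always lies between the min and the max. It also assumes `[p_0; p_t]` means the full price series from `p_0` to `p_t`.

```python
from typing import Sequence


def midrange_reward(prices: Sequence[float]) -> float:
    """Reward of p_0 relative to the midpoint of the observed price range.

    `prices` is assumed to be the series [p_0, ..., p_t], with p_0 first.
    """
    if not prices:
        raise ValueError("prices must contain at least p_0")

    p_0 = prices[0]
    high, low = max(prices), min(prices)

    # Assumed normalization: half the range, so the result is in [-1, 1].
    half_range = (high - low) / 2
    if half_range == 0:
        # Flat price series: no fluctuation, so no reward signal.
        return 0.0

    midpoint = (high + low) / 2
    return (p_0 - midpoint) / half_range


print(midrange_reward([10, 11, 12]))  # p_0 is the lowest price  -> -1.0
print(midrange_reward([12, 11, 10]))  # p_0 is the highest price -> 1.0
print(midrange_reward([10, 12, 8]))   # p_0 sits at the midpoint -> 0.0
```

With this normalization the reward depends only on where `p_0` sits within the observed range, not on the absolute size of the move.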

![evernote snapshot 20180310 225733](https://user-images.githubusercontent.com/955179/37247110-7ec7117a-24b6-11e8-972e-334979d0a9f6.jpg)