Alpyne Stock Game - Multiprocessing SubprocVecEnv not Working #15

negodfre · 2022-02-11T08:15:35Z

negodfre
Feb 11, 2022

Has anyone been able to run the stock game using multiprocessing?

The default for make_vec_env is to use DummyVecEnv which doesn't actually perform multiprocessing.

I altered the rlpolicy_train.py file to try to get the program to run with SubprocVecEnv.
I created multiple sims but only kept the one client.
I also tried one sim that was used to instantiate multiple environments.

However, each of these approaches resulted in a broken pipe error.
Specifically, the error occurs when SubprocVecEnv attempts to send 'get_spaces' and then receive it.

subproc_vec_env.py within stable_baselines3
CODE:

        print("About to send get spaces...")
        self.remotes[0].send(("get_spaces", None))
        print("Sent the get_spaces to pipe connection...")
        observation_space, action_space = self.remotes[0].recv()

rlpolicy_train.py
CODE:

from pathlib import Path

import numpy as np
from gym import spaces
import gym
import sys
import os

try:  # handle sb2 or sb3
    # v- stops pycharm from complaining
    # noinspection PyUnresolvedReferences
    from stable_baselines3 import PPO, A2C
    from stable_baselines3.common.env_util import make_vec_env
    from stable_baselines3.common.env_checker import check_env
    from stable_baselines3.common.vec_env import DummyVecEnv, SubprocVecEnv
    from stable_baselines3.common.utils import set_random_seed
except ModuleNotFoundError:
    try:
        # v- stops pycharm from complaining
        # noinspection PyUnresolvedReferences
        from stable_baselines import PPO2 as PPO
    except ModuleNotFoundError:
        raise ModuleNotFoundError("Need stable baselines to run this example")
    
from alpyne.client.alpyne_client import AlpyneClient
from alpyne.client.model_run import ModelRun
from alpyne.data.spaces import Observation, Action

from alpyne.client.abstract import BaseAlpyneEnv

#region MyStockClass
class MyStockGameEnv(BaseAlpyneEnv):
    """
    Custom Gym Environment for the Stock Management Game example model.

    Observation:
        Type: Box(2)
        Index   Name            Min     Max
        0       stock_amount    0.0     10000.0
        1       last_order_rate 0.0     50.0

    Actions:
        Type: Box(1)
        Index   Name            Min     Max
        0       order_rate      0       50.0

    Reward:
        1 if stock amount at 5000; falls off quartically, reaching 0 at +- 3000 and bottoming out at -1 by +- 3500
        (Reference: https://www.desmos.com/calculator/vlaaprjxvv)

    Episode termination:
        Default episode end is at/after 2 years.

        This may be different based on the configuration passed with the provided sim
        or if any additional terminal criteria is implemented.
    """

    def __init__(self, sim: ModelRun):
        super().__init__(sim)
        self.steps_near_bounds = 0  # number of steps the sim has spent near the stock bounds

    def _get_observation_space(self) -> spaces.Space:
        return spaces.Box(low=np.array([0.0, 0.0]), high=np.array([10000.0, 50.0]))

    def _convert_from_observation(self, observation: Observation):
        return np.array([observation.stock_amount, observation.last_order_rate])

    def _get_action_space(self) -> spaces.Space:
        return spaces.Box(low=0.0, high=50.0, shape=(1,), dtype=np.float16)

    def _convert_to_action(self, action: np.ndarray) -> Action:
        return Action(order_rate=float(action[0]))

    def _calc_reward(self, observation: Observation) -> float:
        return max(-1,-125e-16*(observation.stock_amount-5000)**4+1)

    def _terminal_alternative(self, observation: Observation) -> bool:
        """ Additional logic to stop the sim if too long is spent in the extremes ends """
        if 100 <= observation.stock_amount <= 9900:
            self.steps_near_bounds = 0
        else:  # +- 100 from the bounds
            self.steps_near_bounds += 1

        return self.steps_near_bounds >= 5  # arbitrarily chosen small(ish) number
#endregion

def get_configuration(cfg):
    cfg.acquisition_lag_days = 1
    return cfg

def make_env(sim, seed):
    def _f():
        env = MyStockGameEnv(sim)
        env.seed(seed)
        return env
    return _f

if __name__ == "__main__":
    cur_dir = Path(__file__).parent
    port = 51150
    n_workers = 4
    client = AlpyneClient(str(cur_dir/"Exported"/"model.jar"), port=port, blocking=True)

    ## create multiple sims...
    sims = [client.create_reinforcement_learning(get_configuration(client.configuration_template)) for _ in range(n_workers)]

    envs = []
    for i, j in enumerate(sims):
        envs.append(make_env(j, i))

    envs = SubprocVecEnv([lambda: elem for elem in envs])

    model = PPO('MlpPolicy', envs, verbose=1)

    model.learn(7000)

    model.save("StockGamePolicy_PPO.zip")
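A side note on the `SubprocVecEnv([lambda: elem for elem in envs])` line, separate from the pickling failure reported here: `make_env` already returns a zero-argument factory, so wrapping it in another lambda makes each callable return the factory rather than an environment. In addition, closures inside a comprehension bind late, so every lambda ends up seeing the last `elem`. The sketch below shows this with plain Python stand-ins (`make_thing` is a hypothetical name, not part of Alpyne or stable-baselines3).

```python
def make_thing(rank):
    """Hypothetical stand-in for make_env: returns a zero-argument factory."""
    def _f():
        return f"env-{rank}"
    return _f


factories = [make_thing(i) for i in range(3)]

# Wrapping each factory in a lambda: `elem` is looked up when the lambda is
# called, after the comprehension has finished, so all lambdas see the last one.
wrapped = [lambda: elem for elem in factories]
wrapped_results = [w() for w in wrapped]  # these are factories, not environments
all_same = all(r is factories[-1] for r in wrapped_results)

# Calling the factories directly gives one distinct result per worker.
direct_results = [f() for f in factories]

print(all_same)
print(direct_results)
```

With stable-baselines3 this would correspond to passing the list directly, `SubprocVecEnv(envs)`, since `envs` is already a list of factories.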

ERROR:

`2022-02-11 03:12:27,659 - sLogger - INFO: Starting model in \Alpyne-main\examples\Stock Management Game\Exported
2022-02-11 03:12:27,674 - sLogger - DEBUG: Executing:
java -cp \anylogic\lib\site-packages\alpyne\resources*;\anylogic\lib\site-packages\alpyne\resources\alpyne_lib*;*;lib*;StockManagementGame*;lib\database*;lib\sa*;lib\database\querydsl*;lib\sa\jackson*;lib\sa\spark* com.anylogic.alpyne.AlpyneServer -p 51150 -o \Alpyne-main\examples\Stock Management Game -l WARNING .

2022-02-11 03:12:27,752 - sLogger - INFO: Started app | PID = 26328
2022-02-11 03:12:37,762 - alpyne.client.http_client - DEBUG: GET /versions/number/0: None
2022-02-11 03:12:38,103 - alpyne.client.http_client - DEBUG: => 200 (OK) [('Content-Type', 'application/json'), ('Connection', 'close'), ('Content-Length', '1568')] {'version': 0, 'experimentTemplate': {'outputs': [{'name': 'stockValueDS', 'type': 'DATA_SET', 'value': None, 'units': None}, {'name': 'orderRateDS', 'type': 'DATA_SET', 'value': None, 'units': None}, {'name': 'demandRateDS', 'type': 'DATA_SET', 'value': None, 'units': None}, {'name': 'fullnessDistribution', 'type': 'HISTOGRAM_DATA', 'value': None, 'units': None}, {'name': 'amountSold', 'type': 'DOUBLE'}, {'name': 'amountWasted', 'type': 'DOUBLE'}], 'reinforcement_learning': {'action': [{'name': 'order_rate', 'type': 'DOUBLE'}], 'configuration': [{'name': 'acquisition_lag_days', 'type': 'INTEGER'}, {'name': '{START_TIME}', 'type': 'DOUBLE', 'value': 0.0, 'units': 'DAY'}, {'name': '{START_DATE}', 'type': 'DATE_TIME', 'value': '2020-01-01T05:00:00Z', 'units': None}, {'name': '{STOP_TIME}', 'type': 'DOUBLE', 'value': 731.0, 'units': 'DAY'}, {'name': '{STOP_DATE}', 'type': 'DATE_TIME', 'value': '2022-01-01T05:00:00Z', 'units': None}, {'name': '{RANDOM_SEED}', 'type': 'LONG'}], 'observation': [{'name': 'stock_amount', 'type': 'DOUBLE'}, {'name': 'recent_stock_amounts', 'type': 'DOUBLE_ARRAY'}, {'name': 'last_order_rate', 'type': 'DOUBLE'}, {'name': 'time_days', 'type': 'DOUBLE'}]}, 'inputs': [{'name': 'inRLMode', 'type': 'BOOLEAN'}, {'name': 'acquisitionLag', 'type': 'DOUBLE'}, {'name': 'maxHoldingAmount', 'type': 'DOUBLE'}, {'name': '{START_TIME}', 'type': 'DOUBLE', 'value': 0.0, 'units': 'DAY'}, {'name': '{START_DATE}', 'type': 'DATE_TIME', 'value': '2020-01-01T05:00:00Z', 'units': None}, {'name': '{STOP_TIME}', 'type': 'DOUBLE', 'value': 731.0, 'units': 'DAY'}, {'name': '{STOP_DATE}', 'type': 'DATE_TIME', 'value': '2022-01-01T05:00:00Z', 'units': None}, {'name': '{RANDOM_SEED}', 'type': 'LONG'}]}}
What is self.remotes type: <class 'multiprocessing.connection.PipeConnection'>
About to do cloud pickle wrapper...
About to start process...
About to do cloud pickle wrapper...
About to start process...
About to do cloud pickle wrapper...
About to start process...
About to do cloud pickle wrapper...
About to start process...
About to send get spaces...
Sent the get_spaces to pipe connection...
Traceback (most recent call last):
Traceback (most recent call last):
File "", line 1, in
File "\Python\Python38\lib\multiprocessing\connection.py", line 312, in _recv_bytes
File "\Python\Python38\lib\multiprocessing\spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "Programs\Python\Python38\lib\multiprocessing\spawn.py", line 126, in _main
self = reduction.pickle.load(from_parent)
File "anylogic\lib\site-packages\stable_baselines3\common\vec_env\base_vec_env.py", line 374, in __setstate__
self.var = cloudpickle.loads(var)
TypeError: 'dict' object is not callable
nread, err = ov.GetOverlappedResult(True)
BrokenPipeError: [WinError 109] The pipe has been ended

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "/Alpyne-main/examples/Stock Management Game/rlpolicy_train.py", line 171, in
envs = SubprocVecEnv([lambda: elem for elem in envs])
File "\anylogic\lib\site-packages\stable_baselines3\common\vec_env\subproc_vec_env.py", line 116, in __init__
observation_space, action_space = self.remotes[0].recv()
File "\Python38\lib\multiprocessing\connection.py", line 250, in recv
buf = self._recv_bytes()
File "\Python38\lib\multiprocessing\connection.py", line 321, in _recv_bytes
raise EOFError
EOFError
2022-02-11 03:12:41,745 - alpyne.client.http_client - DEBUG: DELETE /: None
2022-02-11 03:12:41,752 - alpyne.client.http_client - DEBUG: => 202 (Accepted) [('Connection', 'close'), ('Content-Length', '0')] None`

negodfre · 2022-02-11T10:44:24Z

negodfre
Feb 11, 2022
Author

The error occurs when cloudpickle loads the environment in the worker process.
It is caused by the 'inputs' attribute of the 'sim' instance.
If that attribute is removed, the error goes away.
Specifically, client.configuration_template cannot be round-tripped through cloudpickle (dumps succeeds, loads fails).

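ERROR OCCURS WITH 'INPUTS'
Code:

import cloudpickle

env = MyStockGameEnv(sim)
cp_dump_env = cloudpickle.dumps(env)  # dumping succeeds
cloudpickle.loads(cp_dump_env)        # loading raises the TypeError below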

Error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~\AppData\Local\Temp/ipykernel_18532/3468939914.py in <module>
----> 1 cloudpickle.loads(cp_dump_env)

TypeError: 'dict' object is not callable

ERROR NO LONGER OCCURS - NO 'INPUTS'

CODE:

sim = client.create_reinforcement_learning(get_configuration(client.configuration_template))
env = MyStockGameEnv(sim)

# drop the attribute that breaks unpickling, then round-trip the environment
env.sim.__dict__.pop('inputs')

cp_dump_env = cloudpickle.dumps(env)
cloudpickle.loads(cp_dump_env)  # no error
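The attribute-by-attribute hunt above can be automated. The following is a minimal sketch that round-trips each instance attribute and reports the ones that fail. It uses the standard-library `pickle` instead of `cloudpickle` so that it has no dependencies, and the `SimpleNamespace` object is only a stand-in for a `ModelRun`; the helper name `find_unpicklable_attrs` is made up for this example.

```python
import pickle
import threading
from types import SimpleNamespace


def find_unpicklable_attrs(obj):
    """Return {attribute name: error message} for every instance attribute
    that fails a dumps/loads round trip."""
    failures = {}
    for name, value in vars(obj).items():
        try:
            pickle.loads(pickle.dumps(value))
        except Exception as exc:  # pickling errors come in several types
            failures[name] = f"{type(exc).__name__}: {exc}"
    return failures


# Stand-in object: a lock is a convenient value that cannot be pickled.
fake_sim = SimpleNamespace(
    port=51150,
    config={"acquisition_lag_days": 1},
    inputs=threading.Lock(),
)

report = find_unpicklable_attrs(fake_sim)
print(report)
```

Because `cloudpickle` exposes the same `dumps`/`loads` pair, swapping it in for `pickle` should be enough to run the same check against a real environment's `sim`.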

5 replies

rokitatomer May 2, 2022

Hello,

Were you able to make multiprocess work?
Did you have any success training a working policy for the Stock game? I did not succeed yet, and it is pretty frustrating.

negodfre May 2, 2022
Author

I stopped trying to get multiprocessing to work.
I'd like to come back to it and see if I can figure it out.
I was able to get both the stock game and my own simulation to train.

However, my own simulation uses doubles, and I believe I had to add that data type to the file.

I still run into some odd errors; sometimes, after several attempts, it eventually runs.

I don't have my computer at the moment, but when I do in a week I can give more detail on the changes I made.

t-wolfeadam May 2, 2022

Hi @rokitatomer and @negodfre, multiprocessing is known to be an issue due to problems serializing the objects used to connect to the alpyne server. IIRC, the current alternative is to run multiple instances of the communicator, each with its own single sim, in separate processes.
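One way to read that suggestion: instead of creating the client and sims in the parent process and shipping them to the workers, hand each worker a small factory that carries only plain values (such as a port number) and builds its own connection once it runs. The sketch below uses a stand-in class, since whether several Alpyne servers can run side by side on consecutive ports is an assumption here, not something confirmed in this thread; `StandInClient` and `make_env_factory` are hypothetical names.

```python
class StandInClient:
    """Hypothetical placeholder for a per-process connection to a simulator."""

    def __init__(self, port):
        self.port = port


def make_env_factory(rank, base_port=51150, connect=StandInClient):
    """Return a zero-argument factory that opens its own connection when called.

    The closure captures two integers and a class reference, so no object
    holding an open connection has to cross the process boundary.
    """
    def _init():
        # This body is meant to run inside the worker process.
        return connect(base_port + rank)
    return _init


factories = [make_env_factory(rank) for rank in range(4)]
ports = [factory().port for factory in factories]
print(ports)
```

In a real setup, `_init` would presumably create the client, create the sim, and return the wrapped environment; the resulting list of factories has the shape that a vectorized environment such as `SubprocVecEnv` expects.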

Also, the stock management example may be too simple for a trained policy to come close to what a simple heuristic can do. Sorry for not making this clearer: it's not intended to be a "good" toy example, just one that's easy to understand and put together, while also using a model from the default AL repo.

As a general note, I had to put this project on the back burner for the past few months to work on other tasks, but I will be getting back to a rewrite within the coming weeks. Thank you both for your patience while this project is in beta!

rokitatomer May 3, 2022

Hi,
I am not particularly interested in multiprocessing; I just want to make alpyne work for a single process and a toy problem. I think the stock game is simple, so it should be possible to train a policy for it in a simple manner (if not, it would be impossible for more complicated models). Since I was not able to make it work, I am curious whether anybody has. It seems that @negodfre was successful, and it would be helpful if you could share your code and update the example model.
Thanks!

negodfre May 3, 2022
Author

Yeah, sorry, to clarify, I got it to run. I'm not quite sure how successful the policy was. I won't have access to my PC till next week and I can take a look.

In my own example, I needed to normalize both the input and the reward to get successful training. I used the tensorboard support in stable-baselines3 to see how the model progressed. I looked at cumulative reward along with explained variance. After that, I looked at specific cases to see whether the learned behavior was as expected. But that was for my own example, not the stock example.

There are some pre-written scripts for stable-baselines3 that handle normalization as a wrapper. Those didn't seem to work and I didn't spend a ton of time trying to figure out why. Instead I created my own normalization procedure in the custom environment.

rokitatomer · 2022-05-02T16:37:27Z

rokitatomer
May 2, 2022

Thank you for your reply. Can you please share your code for creating a successful stock game policy, along with the policy itself? How did you check the policy and its performance? Best, Tomer


0 replies

rokitatomer · 2022-07-03T03:43:30Z

rokitatomer
Jul 3, 2022

Hi, I am still having difficulty training the stock game successfully (single process). I have tried normalizing the action and the reward, but I still get random policy behavior. @negodfre I would appreciate it if you could share your code for a successful training run and its test. Best, Tomer


2 replies

negodfre Jul 3, 2022
Author

Sorry about the delay!
Life has been quite busy!

I was able to successfully train a policy by pseudo-'normalizing' the inputs to values between 0 and 1.
(Order amounts should be fine, but I think stock amounts could technically go above 10,000 or below 0. I normalized
stock as though that doesn't happen, simply dividing by 10,000.)

You can see the changes by looking at the "_old" and regular functions.
For example "_get_action_space_old" vs "_get_action_space"
I didn't change the reward whatsoever.

from pathlib import Path
import tensorboard

import numpy as np
from gym import spaces
try:  # handle sb2 or sb3
    # v- stops pycharm from complaining
    # noinspection PyUnresolvedReferences
    from stable_baselines3 import PPO, A2C
except ModuleNotFoundError:
    try:
        # v- stops pycharm from complaining
        # noinspection PyUnresolvedReferences
        from stable_baselines import PPO2 as PPO
    except ModuleNotFoundError:
        raise ModuleNotFoundError("Need stable baselines to run this example")
    
from alpyne.client.alpyne_client import AlpyneClient
from alpyne.client.model_run import ModelRun
from alpyne.data.spaces import Observation, Action
from stable_baselines3.common.callbacks import CheckpointCallback

from alpyne.client.abstract import BaseAlpyneEnv

class MyStockGameEnv(BaseAlpyneEnv):
    """
    Custom Gym Environment for the Stock Management Game example model.

    Observation:
        Type: Box(2)
        Index   Name            Min     Max
        0       stock_amount    0.0     10000.0
        1       last_order_rate 0.0     50.0

    Actions:
        Type: Box(1)
        Index   Name            Min     Max
        0       order_rate      0       50.0

    Reward:
        1 if stock amount at 5000; falls off quartically, reaching 0 at +- 3000 and bottoming out at -1 by +- 3500
        (Reference: https://www.desmos.com/calculator/vlaaprjxvv)

    Episode termination:
        Default episode end is at/after 2 years.

        This may be different based on the configuration passed with the provided sim
        or if any additional terminal criteria is implemented.
    """

    def __init__(self, sim: ModelRun):
        super().__init__(sim)
        self.steps_near_bounds = 0  # number of steps the sim has spent near the stock bounds

    def _get_observation_space(self) -> spaces.Space:
        return spaces.Box(low=np.array([0.0, 0.0]), high=np.array([1, 1]))

    def _get_observation_space_old(self) -> spaces.Space:
        return spaces.Box(low=np.array([0.0, 0.0]), high=np.array([10000.0, 50.0]))

    def _convert_from_observation(self, observation: Observation):
        return np.array([observation.stock_amount/10000, observation.last_order_rate/50])

    def _convert_from_observation_old(self, observation: Observation):
        return np.array([observation.stock_amount, observation.last_order_rate])

    def _get_action_space(self) -> spaces.Space:
        return spaces.Box(low=0.0, high=1.0, shape=(1,), dtype=np.float16)

    def _get_action_space_old(self) -> spaces.Space:
        return spaces.Box(low=0.0, high=50.0, shape=(1,), dtype=np.float16)

    def _convert_to_action(self, action: np.ndarray) -> Action:
        return Action(order_rate=float(action[0]*50))

    def _calc_reward(self, observation: Observation) -> float:
        return max(-1,-125e-16*(observation.stock_amount-5000)**4+1)

    def _terminal_alternative(self, observation: Observation) -> bool:
        """ Additional logic to stop the sim if too long is spent in the extremes ends """
        if 100 <= observation.stock_amount <= 9900:
            self.steps_near_bounds = 0
        else:  # +- 100 from the bounds
            self.steps_near_bounds += 1

        return self.steps_near_bounds >= 5  # arbitrarily chosen small(ish) number


if __name__ == '__main__':

    file_name = "test_2"
    client = AlpyneClient(r"Exported\model.jar")
    tensorboard_path = f"./stock_tensorboard/{file_name}"

    # create new model run from basic configuration
    cfg = client.configuration_template
    cfg.acquisition_lag_days = 1
    sim = client.create_reinforcement_learning(cfg)

    # wrap in our custom gym environment
    env = MyStockGameEnv(sim)

    # # (optional) to distribute across parallel runs,
    # #               vectorize the environment with stable-baselines' method
    # env = make_vec_env(env, n_envs=4)

    checkpoint_callback = CheckpointCallback(save_freq=25000, save_path=str(Path(tensorboard_path)/"models"),
                                            name_prefix='Stock_PPO_simple')

    # pass it to stable-baselines for RL training
    model = PPO('MlpPolicy', env, verbose=1, tensorboard_log=tensorboard_path)

    model.learn(500000, tb_log_name=file_name, callback=checkpoint_callback)
    model.save("StockGamePolicy_PPO.zip")

negodfre Jul 3, 2022
Author

The orange line is the original model script as-is.
The blue line is the adjusted input values.

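If you use pypeline to make predictions with the saved model, remember to scale the observations to the 0 to 1 range, just as during training. The predicted action is likewise on a 0 to 1 scale and has to be multiplied by 50 to get the order rate.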
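As a sketch of that scaling step, the helpers below mirror the divisors used in the training environment above (10,000 for stock, 50 for order rate) and clip the result so that out-of-range stock values stay inside the declared 0 to 1 bounds. The clipping and the function names are additions for illustration, not part of the posted code.

```python
STOCK_MAX = 10_000.0  # divisor used for stock_amount in the training env
RATE_MAX = 50.0       # divisor used for order rates in the training env


def clip01(x):
    """Clamp a value to the closed interval [0, 1]."""
    return min(1.0, max(0.0, x))


def scale_observation(stock_amount, last_order_rate):
    """Map raw simulator values to the 0-1 range the policy was trained on."""
    return [clip01(stock_amount / STOCK_MAX), clip01(last_order_rate / RATE_MAX)]


def unscale_action(action_value):
    """Map the policy's 0-1 output back to an order rate between 0 and 50."""
    return clip01(action_value) * RATE_MAX


obs = scale_observation(stock_amount=12_000.0, last_order_rate=25.0)
order_rate = unscale_action(0.4)
print(obs, order_rate)
```

Keeping the two constants in one place makes it harder for the training and prediction code to drift apart.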

negodfre · 2022-07-03T07:03:42Z

negodfre
Jul 3, 2022
Author

This was the playthrough with a model trained on 50,000 iterations.
It does a fairly good job of staying near 5000, which gives the maximum reward (test out different values in the reward function if you wish to see this).

Note: I used pypeline to get the prediction from the trained PPO model.
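To try out a few values as suggested, the reward expression from `_calc_reward` can be evaluated on its own. This is just the posted formula copied into a standalone function (the name `reward` is chosen here).

```python
def reward(stock_amount):
    """Reward formula copied from _calc_reward in the environment above."""
    return max(-1, -125e-16 * (stock_amount - 5000) ** 4 + 1)


# Print the reward at a handful of stock levels around the 5000 target.
for stock in (0, 2000, 4000, 5000, 6000, 8000, 10000):
    print(stock, round(reward(stock), 4))
```

The reward stays close to 1 within about 1,000 units of 5000 (0.9875 at 4000 and 6000), is already slightly negative at 2000 and 8000, and is clamped to -1 at the extremes.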

0 replies
