We are going to use the REINFORCE algorithm, in which our agent learns directly from complete episode samples, from the starting state to the goal state, collected from the environment. Our policy network has two linear layers: the first takes 4 input features (the 4 state variables) and the second produces 2 outputs (one score per action). We also define an action buffer, `saved_log_probs`, and a rewards buffer. In the forward pass, the output of the first layer is passed through dropout and a ReLU before the second layer computes the score for each action. Finally, we return a softmax over these scores, i.e. a probability for each action.
```python
class Policy(nn.Module):
    def __init__(self):
        super(Policy, self).__init__()
        self.affine1 = nn.Linear(4, 128)
        self.dropout = nn.Dropout(p=0.6)
        self.affine2 = nn.Linear(128, 2)

        self.saved_log_probs = []
        self.rewards = []

    def forward(self, x):
        x = self.affine1(x)
        x = self.dropout(x)
        x = F.relu(x)
        action_scores = self.affine2(x)
        return F.softmax(action_scores, dim=1)
```
And then we initialize our model, optimizer, epsilon and timesteps.
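That initialization code is not shown here; as a minimal sketch of what it could look like (the learning rate, the discount factor `gamma`, and the names `policy`, `optimizer`, and `eps` are assumptions based on the surrounding code, not taken from the original):

```python
import numpy as np
import torch.optim as optim

# Assumed hyperparameters: Adam with lr=1e-2 and a discount factor gamma.
gamma = 0.99
policy = Policy()
optimizer = optim.Adam(policy.parameters(), lr=1e-2)
# Small epsilon used later to avoid division by zero when normalizing returns.
eps = np.finfo(np.float32).eps.item()
```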
Next we need to select an action to take. After we get the list of action probabilities from the policy, we build a categorical distribution over them, sample an action, save its log probability in the buffer, and return the sampled action.
```python
def select_action(policy, observation):
    state = torch.from_numpy(observation).float().unsqueeze(0)
    probs = policy(state)
    m = Categorical(probs)
    action = m.sample()
    policy.saved_log_probs.append(m.log_prob(action))
    return action.item()
```
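To give a sense of how `select_action` is used, here is a hedged sketch of a single episode of interaction; the environment name `CartPole-v1`, the Gymnasium-style `reset`/`step` API, and the 1000-step cap are assumptions for illustration, not part of the original.

```python
import gymnasium as gym

env = gym.make("CartPole-v1")

observation, _ = env.reset()
ep_reward = 0
for t in range(1000):  # assumed per-episode step cap
    action = select_action(policy, observation)
    observation, reward, terminated, truncated, _ = env.step(action)
    # Store the reward so the end-of-episode update can compute returns.
    policy.rewards.append(reward)
    ep_reward += reward
    if terminated or truncated:
        break
```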
We initialize a list for the policy loss terms and compute the true (discounted) returns from the rewards collected from the environment. Then we calculate each policy loss term as `-log_prob * R`, where `R` is the return for that step. Finally, we reset the gradients, backpropagate through the summed policy loss, take an optimizer step, and clear the rewards and action buffers for the next episode.
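As a sketch of what this end-of-episode update could look like (the function name `finish_episode`, the normalization of returns by their mean and standard deviation, and the variable names are assumptions based on the description above):

```python
def finish_episode():
    # Compute discounted returns by walking the rewards backwards.
    R = 0
    returns = []
    for r in policy.rewards[::-1]:
        R = r + gamma * R
        returns.insert(0, R)
    returns = torch.tensor(returns)
    # Normalizing the returns (assumed here) tends to stabilize training.
    returns = (returns - returns.mean()) / (returns.std() + eps)

    # Policy loss: -log_prob * return for every step of the episode.
    policy_loss = []
    for log_prob, R in zip(policy.saved_log_probs, returns):
        policy_loss.append(-log_prob * R)

    # Reset gradients, backprop through the summed loss, and update.
    optimizer.zero_grad()
    loss = torch.cat(policy_loss).sum()
    loss.backward()
    optimizer.step()

    # Clear the rewards and action buffers for the next episode.
    del policy.rewards[:]
    del policy.saved_log_probs[:]
```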