I just finished reading your paper, and I noticed that it is an on-policy method.
I was wondering whether anyone has tested it with an off-policy RL method that uses a replay buffer.
As far as I know, for off-policy methods with a recurrent or attention-based structure (LSTM, GRU, Transformer, ...), if the hidden state is stored together with a sample (s, a, r, s'), that hidden state becomes stale after many training updates, since it was computed by older network parameters. Is this issue addressed by the adaptive-transformer?
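
To make the concern concrete, here is a minimal sketch (hypothetical names, not from your code) of a replay buffer that stores the recurrent state alongside each transition. The stored `h` was produced by the network parameters in use at collection time, so transitions sampled many updates later carry hidden states that no longer match what the current network would compute:

```python
import collections
import random

# Each transition carries the recurrent/hidden state h that the network
# produced when the transition was collected.
Transition = collections.namedtuple("Transition", ["s", "a", "r", "s_next", "h"])

class ReplayBuffer:
    def __init__(self, capacity=100_000):
        self.buffer = collections.deque(maxlen=capacity)

    def push(self, s, a, r, s_next, h):
        # h comes from the parameters at collection time.
        self.buffer.append(Transition(s, a, r, s_next, h))

    def sample(self, batch_size):
        # Sampled long after collection: the stored h is "stale", i.e. it
        # differs from the hidden state the updated network would produce
        # for the same history, which biases the recurrent update.
        return random.sample(self.buffer, batch_size)
```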