Hi, I'm interested in your research and am planning to do similar research myself. In this implementation, the initial observations are collected with a random policy. Could I use the autopilot instead during this initial collection step? I think it might help stabilize learning, especially in the early training steps. Thank you for the great paper and implementation.
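Concretely, I have in mind something like the sketch below. It is only a hypothetical illustration: `prefill_buffer`, `autopilot`, and `autopilot_ratio` are my own placeholder names, not part of your code, and the environment is reduced to plain `reset`/`step` callables. During prefill, each step takes the autopilot's action with some probability and a uniform random action otherwise.

```python
import random
from typing import Callable, List, Tuple

# Hypothetical transition layout: (obs, action, reward, next_obs, done)
Transition = Tuple[float, float, float, float, bool]


def prefill_buffer(
    reset: Callable[[], float],
    step: Callable[[float], Tuple[float, float, bool]],
    autopilot: Callable[[float], float],
    num_steps: int,
    autopilot_ratio: float = 0.5,
    seed: int = 0,
) -> List[Transition]:
    """Collect initial transitions before training (hypothetical sketch).

    With probability `autopilot_ratio` the autopilot's action is used;
    otherwise a uniform random action in [-1, 1] is taken.
    """
    rng = random.Random(seed)
    buffer: List[Transition] = []
    obs = reset()
    for _ in range(num_steps):
        if rng.random() < autopilot_ratio:
            action = autopilot(obs)          # scripted "expert" action
        else:
            action = rng.uniform(-1.0, 1.0)  # random exploration action
        next_obs, reward, done = step(action)
        buffer.append((obs, action, reward, next_obs, done))
        # Start a new episode when the current one ends.
        obs = reset() if done else next_obs
    return buffer
```

Keeping a fraction of random actions (rather than using only the autopilot) is meant to preserve some action diversity in the initial data; setting `autopilot_ratio=0.0` reduces to the current random-policy prefill.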