Best practices question for an op passing data forward -> backward #19355

DickJC123 · 2020-10-15T22:12:43Z

DickJC123
Oct 15, 2020
Collaborator

What's the recommended approach for an operator that has a computed value from Forward() that is helpful in the Backward() calculation, given that the computed value is not an I/O of the operator?

I'm familiar with the long-standing approach of declaring a 'hidden output', but is that still the recommended approach?

@samskalicky @MoisesHer @szha @ptrendx @leezu

Answered by szha


szha · 2020-10-15T22:19:48Z

szha
Oct 15, 2020
Collaborator

FStatefulCompute is designed for this purpose:
https://github.com/apache/incubator-mxnet/blob/527573ec2b9b2696ffcafd1570cd94e2187f4c32/src/operator/rnn.cu#L35
https://github.com/apache/incubator-mxnet/blob/4a8da9ec62e8cadd7df6ad5e9ba305b777104068/src/operator/rnn-inl.h#L1523
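To make the idea concrete, here is a minimal sketch of the stateful pattern in plain C++. It is not the MXNet API: `PoolState`, `Forward`, and `Backward` are hypothetical names, and the state struct is only a stand-in for whatever per-instance object the framework passes to both functions. The forward pass computes argmax indices that are not an input or output of the op, and the backward pass reuses them.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-instance state (not an MXNet type): a stand-in for the
// object a stateful op shares between its forward and backward functions.
struct PoolState {
  std::vector<std::size_t> argmax;  // computed in Forward, consumed in Backward
};

// Forward: non-overlapping max over windows of 2 elements.
// Records which input element won each window.
std::vector<float> Forward(PoolState& state, const std::vector<float>& x) {
  assert(x.size() % 2 == 0);
  std::vector<float> y(x.size() / 2);
  state.argmax.resize(y.size());
  for (std::size_t i = 0; i < y.size(); ++i) {
    const std::size_t a = 2 * i;
    const std::size_t b = 2 * i + 1;
    const std::size_t win = (x[b] > x[a]) ? b : a;
    state.argmax[i] = win;
    y[i] = x[win];
  }
  return y;
}

// Backward: routes each output gradient to the input that won in Forward,
// using only the saved indices (no access to x or y needed).
std::vector<float> Backward(const PoolState& state, const std::vector<float>& dy) {
  assert(state.argmax.size() == dy.size());
  std::vector<float> dx(2 * dy.size(), 0.0f);
  for (std::size_t i = 0; i < dy.size(); ++i) {
    dx[state.argmax[i]] = dy[i];
  }
  return dx;
}
```

With input `{1, 3, 5, 2}` the forward result is `{3, 5}`, and an output gradient of `{10, 20}` comes back as `{0, 10, 20, 0}`.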

1 reply

DickJC123 Oct 15, 2020
Collaborator Author

I can think of two choices when using the FStatefulCompute approach:

1. Dynamically Alloc() the fwd->bwd Tensor in Forward(), then Free() it in Backward(), or
2. Alloc() the fwd->bwd Tensor once and hold onto it forever.

Approach 1 is not CUDA Graphs compatible, since the captured fwd->bwd Tensor is no longer valid when the graph is replayed.
Approach 2 holds onto the memory past Backward(), so past when it's actually needed.
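The lifetime difference between the two approaches can be sketched in plain C++ with host memory. This is an illustration only: `PerIterationState` and `PersistentState` are hypothetical classes, `std::unique_ptr` stands in for the real device allocator, and reallocating on a size change in the second class is my own assumption rather than something stated above.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Approach 1 (hypothetical sketch): allocate in Forward, free in Backward.
// The buffer address can differ from one iteration to the next, which is
// what a captured graph with baked-in pointers cannot tolerate.
class PerIterationState {
 public:
  float* Forward(std::size_t n) {
    buf_.reset(new float[n]());  // fresh allocation every iteration
    return buf_.get();
  }
  void Backward() {
    assert(buf_ != nullptr);  // Forward must have run first
    // ... read the buffer to compute gradients ...
    buf_.reset();  // released as soon as it is no longer needed
  }
  bool HoldsMemory() const { return buf_ != nullptr; }

 private:
  std::unique_ptr<float[]> buf_;
};

// Approach 2 (hypothetical sketch): allocate once and keep the buffer.
// The address stays stable, but the memory is held past Backward.
class PersistentState {
 public:
  float* Forward(std::size_t n) {
    if (!buf_ || n != size_) {  // assumption: reallocate only on a size change
      buf_.reset(new float[n]());
      size_ = n;
    }
    return buf_.get();
  }
  void Backward() {
    assert(buf_ != nullptr);  // Forward must have run first
    // ... read the buffer to compute gradients; nothing is freed ...
  }
  bool HoldsMemory() const { return buf_ != nullptr; }

 private:
  std::unique_ptr<float[]> buf_;
  std::size_t size_ = 0;
};
```

After one Forward()/Backward() pair, `PerIterationState::HoldsMemory()` is false while `PersistentState::HoldsMemory()` is still true, and a second same-sized `PersistentState::Forward()` returns the same pointer as the first.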

The 'hidden output' approach does not have these problems. It is CUDA Graphs compatible, assuming a static graph memory layout, plus the fwd->bwd Tensor is up for reallocation past the Backward() call.

DickJC123 · 2020-10-19T18:42:12Z

DickJC123
Oct 19, 2020
Collaborator Author

Another point to consider is how the operator behaves in an inference-only graph. It seems that with the stateful op approach, one can easily react to is_train == false and skip allocating the fwd->bwd Tensor when it isn't needed. I'm not sure whether the 'hidden output' approach can avoid allocating that space.
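A rough sketch of what reacting to is_train could look like in a stateful op, again in plain C++ rather than the MXNet API. `MaskState`, `Forward`, and `Backward` are hypothetical names, and the clamp-at-zero op is chosen only because its saved mask is easy to follow.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical state object (not an MXNet type) holding the fwd->bwd data.
struct MaskState {
  std::vector<unsigned char> mask;  // 1 where the input was passed through
};

// Forward: y = max(x, 0). The mask is only allocated and filled when training.
std::vector<float> Forward(MaskState& state, const std::vector<float>& x,
                           bool is_train) {
  std::vector<float> y(x.size());
  if (is_train) {
    state.mask.assign(x.size(), 0);
  } else {
    // Inference: release any previously held mask storage.
    std::vector<unsigned char>().swap(state.mask);
  }
  for (std::size_t i = 0; i < x.size(); ++i) {
    const bool keep = x[i] > 0.0f;
    y[i] = keep ? x[i] : 0.0f;
    if (is_train) state.mask[i] = keep ? 1 : 0;
  }
  return y;
}

// Backward: passes the gradient through wherever the mask is set.
std::vector<float> Backward(const MaskState& state, const std::vector<float>& dy) {
  assert(state.mask.size() == dy.size());  // requires a training-mode Forward
  std::vector<float> dx(dy.size());
  for (std::size_t i = 0; i < dy.size(); ++i) {
    dx[i] = state.mask[i] ? dy[i] : 0.0f;
  }
  return dx;
}
```

Calling `Forward(state, x, false)` leaves `state.mask` empty, so an inference-only graph never pays for the fwd->bwd buffer in this sketch.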

0 replies
