DTensor implementation #3117
YuliangLiu0306
started this conversation in
Development | Core
Replies: 1 comment
What's the status of this work, and can I participate in it? 😊
Proposal
We have investigated the current implementations of DTensor in PyTorch and TensorFlow. Inspired by them, we propose a new design for DTensor.
Motivation
Supply a uniform way of checkpointing: in automatic parallelism and other flexible distributed training paradigms, we need to save and load checkpoints in a flexible, fine-grained way.
DTensor serves as a tensor abstraction carrying distributed information. It is a key component to support both SPMD automatic parallelism and Gemini.
Refactor related components such as
CommSpec
, ShardingSpec
, LayoutConverter
, DeviceMesh
, etc. These components are tightly coupled with the automatic parallelism feature, which makes them hard to reuse elsewhere.
Design
We design several components for the API refactoring.

Possible class definitions (pseudo-code):
DTensor
Layout
ShardingSpec
CommSpec
LayoutConverter
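The proposal only names these classes, so the following is a minimal, dependency-free Python sketch of how DeviceMesh, ShardingSpec, and Layout might fit together. All attributes and methods here are illustrative assumptions, not the actual ColossalAI API.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

@dataclass
class DeviceMesh:
    """Logical grid of devices, e.g. mesh_shape=(2, 4) for 8 GPUs.
    (Hypothetical sketch; the real class carries physical device info.)"""
    mesh_shape: Tuple[int, ...]

    @property
    def num_devices(self) -> int:
        n = 1
        for d in self.mesh_shape:
            n *= d
        return n

@dataclass
class ShardingSpec:
    """Maps each sharded tensor dim to the mesh dims it is split over;
    tensor dims absent from the mapping are replicated."""
    dim_partition: Dict[int, List[int]]

@dataclass
class Layout:
    """Distributed layout: a device mesh plus a sharding spec,
    enough to derive the per-device shard shape."""
    device_mesh: DeviceMesh
    sharding_spec: ShardingSpec
    global_shape: Tuple[int, ...]

    def local_shape(self) -> Tuple[int, ...]:
        """Shape of the shard held by a single device."""
        shape = list(self.global_shape)
        for dim, mesh_dims in self.sharding_spec.dim_partition.items():
            for md in mesh_dims:
                # Each mesh dim splits the tensor dim evenly (assumed).
                shape[dim] //= self.device_mesh.mesh_shape[md]
        return tuple(shape)
```

For example, a (8, 16) tensor sharded over a (2, 4) mesh, with tensor dim 0 on mesh dim 0 and tensor dim 1 on mesh dim 1, yields a (4, 4) shard on each of the 8 devices.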
Future work
After refactoring/implementing the above features, we could use them to implement a new abstraction called

DProxy

to serve as a proxy for the real tensor in automatic parallelism contexts. It will carry the information needed to estimate the memory and compute overhead of distributed operations.
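Since DProxy is only described here as carrying metadata for overhead estimation, a minimal sketch under that assumption might track shape, dtype size, and shard count to estimate per-device memory without materializing any tensor. The class name comes from the proposal; every attribute and method is a hypothetical illustration.

```python
from dataclasses import dataclass
from typing import Tuple

@dataclass
class DProxy:
    """Hypothetical proxy for a distributed tensor: holds only metadata,
    so cost models can run without allocating the real tensor."""
    global_shape: Tuple[int, ...]
    element_size: int   # bytes per element, e.g. 4 for fp32
    num_shards: int     # devices the tensor is evenly sharded over

    def global_numel(self) -> int:
        n = 1
        for d in self.global_shape:
            n *= d
        return n

    def local_memory_bytes(self) -> int:
        """Estimated per-device memory for one shard (even split assumed)."""
        return self.global_numel() * self.element_size // self.num_shards
```

For instance, a fp32 tensor of shape (1024, 1024) sharded over 8 devices would be estimated at 4 MiB / 8 = 512 KiB per device.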