-
Notifications
You must be signed in to change notification settings - Fork 834
feat(docs): KEP-2442-Support JAX Training Runtime #2643
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat(docs): KEP-2442-Support JAX Training Runtime #2643
Conversation
Pull Request Test Coverage Report for Build 18256788359Details
💛 - Coveralls |
c39fc67
to
271a5d1
Compare
b137639
to
cadaf06
Compare
@Electronic-Waste @andreyvelich please take a look when you have time. |
regarding data loading I mentioned in our sync meetings, after going over jax document, as you said, it will be handled within objective function by the user, so its sharding of data in memory and no need for further mechanism to handle permanent data (PVC, PV). however user can take care of it anytime with a patch. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mahdikhashan Thanks for this great work. I've left my initial comments for you.
PTAL when you have time.
/cc @kubeflow/wg-training-leads @astefanutti @franciscojavierarceo
@mahdikhashan Hey, do you get a chance to address the comments above? |
I'll address them by tomorrow before our sync meeting, hope its fine with you. |
c92f96c
to
860a8a7
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the update @mahdikhashan! I left some comments for you:)
/cc @kubeflow/wg-training-leads @astefanutti @sandipanpanda @franciscojavierarceo
### Communication Backend | ||
|
||
#### OpenMPI | ||
|
||
**Pros:** | ||
|
||
* Compatible with existing MPI runtime in Kubeflow Trainer v2, making deployment easier. | ||
|
||
**Cons:** | ||
|
||
* Typically requires more complex environment setup compared to simpler backends like Gloo. | ||
|
||
#### Gloo | ||
|
||
**Pros:** | ||
|
||
* Lightweight and simple to use. | ||
|
||
**Cons:** | ||
|
||
* Significantly slower than OpenMPI (10–20×) for distributed JAX training on CPUs and GPUs. | ||
* Less optimized for multi-node scaling and lacks native support for high-speed interconnects like InfiniBand. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I remember that @sandipanpanda implemented the Jax Runtime with Gloo backend in the last year:)
Which one do you prefer? Any suggestion? @kubeflow/wg-training-leads @astefanutti @sandipanpanda
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
asking me, i'd say, open-mpi has upper hand, and lack of gpu support in gloo is a significant design decision maker, at least for me. i'd appreciate @sandipanpanda feedback on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe in the future we can use the gpu ci runner: #2674. But if we decide to use Gloo, we can test it on cpu first like what we did in the last year:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
other points could be raise from @andreyvelich , since deepspeed and mlx are based on openmpi at the moment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
asking me, i'd say, open-mpi has upper hand, and lack of gpu support in gloo is a significant design decision maker, at least for me. i'd appreciate @sandipanpanda feedback on it.
The main motive behind using gloo was it was simple to set up and use, making development iterations faster, in long term open-mpi provides more value
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
asking me, i'd say, open-mpi has upper hand, and lack of gpu support in gloo is a significant design decision maker, at least for me. i'd appreciate @sandipanpanda feedback on it.
The main motive behind using gloo was it was simple to set up and use, making development iterations faster, in long term open-mpi provides more value
thanks for your reply
The number of JAX processes can be defined using the `numNodes` field within the `mlPolicy` section of the `ClusterTrainingRuntime` configuration. This allows for specifying how many JAX processes/controller run in the distributed setup. | ||
|
||
|
||
 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you also describe the whole procedure step-by-step with words? I think, it will be more clear to readers if the graph is companied with descriptive words:)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also need test plans section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right. i'll add it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
Signed-off-by: Mahdi Khashan <mahdikhashan1@gmail.com>
Signed-off-by: Mahdi Khashan <mahdikhashan1@gmail.com>
Signed-off-by: Mahdi Khashan <mahdikhashan1@gmail.com>
Signed-off-by: Mahdi Khashan <mahdikhashan1@gmail.com>
Signed-off-by: Mahdi Khashan <mahdikhashan1@gmail.com>
Signed-off-by: Mahdi Khashan <mahdikhashan1@gmail.com>
Signed-off-by: Mahdi Khashan <mahdikhashan1@gmail.com>
Signed-off-by: Mahdi Khashan <mahdikhashan1@gmail.com>
Signed-off-by: Mahdi Khashan <mahdikhashan1@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mahdikhashan Great work! Thanks for your instant response. Just one more suggestion. We're close to the end.
### Non-Goals | ||
|
||
- No TPU testing, tests will use CPU | ||
- No GPU testing, tests will use CPU |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now, we've had access to GPU test cluster. So it might not be accurate now. How about saying non-goals:
- Set up test in TPU cluster
- Add advanced JAX APIs
WDYT? @andreyvelich @mahdikhashan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you are right.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done.
Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mahdikhashan Thanks for your great work! Let's move forward with this KEP!
/approve
/lgtm
/assign @andreyvelich @tenzen-y @astefanutti
/hold for other maintainers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for this @mahdikhashan!
/lgtm
/approve
/hold cancel
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: andreyvelich, Electronic-Waste The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
What this PR does / why we need it:
Which issue(s) this PR fixes (optional, in
Fixes #<issue number>, #<issue number>, ...
format, will close the issue(s) when PR gets merged):part of #2442
Checklist: