Skip to content

Conversation

mahdikhashan
Copy link
Member

@mahdikhashan mahdikhashan commented May 10, 2025

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in Fixes #<issue number>, #<issue number>, ... format, will close the issue(s) when PR gets merged):
part of #2442

Checklist:

  • Docs included if any changes are user facing

@coveralls
Copy link

coveralls commented May 10, 2025

Pull Request Test Coverage Report for Build 18256788359

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage remained the same at 55.174%

Totals Coverage Status
Change from base Build 18223898761: 0.0%
Covered Lines: 1093
Relevant Lines: 1981

💛 - Coveralls

@google-oss-prow google-oss-prow bot added size/L and removed size/XS labels May 10, 2025
@mahdikhashan mahdikhashan changed the title [proposal] GSoC Project 8: Jax Runtime for V2 [proposal] GSoC Project 8: JAX Runtime for V2 May 10, 2025
@mahdikhashan mahdikhashan changed the title [proposal] GSoC Project 8: JAX Runtime for V2 (draft)[proposal] GSoC Project 8: JAX Runtime for V2 May 28, 2025
@mahdikhashan mahdikhashan force-pushed the gsoc-2442-jax-runtime-proposal branch from c39fc67 to 271a5d1 Compare May 28, 2025 16:53
@mahdikhashan mahdikhashan marked this pull request as draft May 28, 2025 17:01
@google-oss-prow google-oss-prow bot added size/L and removed size/M labels Jun 20, 2025
@mahdikhashan mahdikhashan changed the title (draft)[proposal] GSoC Project 8: JAX Runtime for V2 [proposal] GSoC Project 8: JAX Runtime for V2 Jun 20, 2025
@mahdikhashan mahdikhashan marked this pull request as ready for review June 20, 2025 17:51
@mahdikhashan mahdikhashan force-pushed the gsoc-2442-jax-runtime-proposal branch from b137639 to cadaf06 Compare June 20, 2025 17:53
@mahdikhashan
Copy link
Member Author

@Electronic-Waste @andreyvelich please take a look when you have time.

@mahdikhashan
Copy link
Member Author

mahdikhashan commented Jun 20, 2025

regarding data loading I mentioned in our sync meetings, after going over jax document, as you said, it will be handled within objective function by the user, so its sharding of data in memory and no need for further mechanism to handle permanent data (PVC, PV). however user can take care of it anytime with a patch.

cc @Electronic-Waste

Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mahdikhashan Thanks for this great work. I've left my initial comments for you.

PTAL when you have time.

/cc @kubeflow/wg-training-leads @astefanutti @franciscojavierarceo

@google-oss-prow google-oss-prow bot requested review from a team and franciscojavierarceo June 23, 2025 10:14
@Electronic-Waste
Copy link
Member

@mahdikhashan Hey, do you get a chance to address the comments above?

@mahdikhashan
Copy link
Member Author

@mahdikhashan Hey, do you get a chance to address the comments above?

I'll address them by tomorrow before our sync meeting, hope its fine with you.

@mahdikhashan mahdikhashan force-pushed the gsoc-2442-jax-runtime-proposal branch from c92f96c to 860a8a7 Compare June 27, 2025 07:49
Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the update @mahdikhashan! I left some comments for you:)

/cc @kubeflow/wg-training-leads @astefanutti @sandipanpanda @franciscojavierarceo

Comment on lines 170 to 191
### Communication Backend

#### OpenMPI

**Pros:**

* Compatible with existing MPI runtime in Kubeflow Trainer v2, making deployment easier.

**Cons:**

* Typically requires more complex environment setup compared to simpler backends like Gloo.

#### Gloo

**Pros:**

* Lightweight and simple to use.

**Cons:**

* Significantly slower than OpenMPI (10–20×) for distributed JAX training on CPUs and GPUs.
* Less optimized for multi-node scaling and lacks native support for high-speed interconnects like InfiniBand.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember that @sandipanpanda implemented the Jax Runtime with Gloo backend in the last year:)

Which one do you prefer? Any suggestion? @kubeflow/wg-training-leads @astefanutti @sandipanpanda

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

asking me, i'd say, open-mpi has upper hand, and lack of gpu support in gloo is a significant design decision maker, at least for me. i'd appreciate @sandipanpanda feedback on it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe in the future we can use the gpu ci runner: #2674. But if we decide to use Gloo, we can test it on cpu first like what we did in the last year:)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other points could be raise from @andreyvelich , since deepspeed and mlx are based on openmpi at the moment.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

asking me, i'd say, open-mpi has upper hand, and lack of gpu support in gloo is a significant design decision maker, at least for me. i'd appreciate @sandipanpanda feedback on it.

The main motive behind using gloo was it was simple to set up and use, making development iterations faster, in long term open-mpi provides more value

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

asking me, i'd say, open-mpi has upper hand, and lack of gpu support in gloo is a significant design decision maker, at least for me. i'd appreciate @sandipanpanda feedback on it.

The main motive behind using gloo was it was simple to set up and use, making development iterations faster, in long term open-mpi provides more value

thanks for your reply

The number of JAX processes can be defined using the `numNodes` field within the `mlPolicy` section of the `ClusterTrainingRuntime` configuration. This allows for specifying how many JAX processes/controller run in the distributed setup.


![user-roles](./drawing.drawio.svg)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you also describe the whole procedure step-by-step with words? I think, it will be more clear to readers if the graph is companied with descriptive words:)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, sure.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We also need test plans section.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right. i'll add it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Signed-off-by: Mahdi Khashan <mahdikhashan1@gmail.com>
Signed-off-by: Mahdi Khashan <mahdikhashan1@gmail.com>
Signed-off-by: Mahdi Khashan <mahdikhashan1@gmail.com>
Signed-off-by: Mahdi Khashan <mahdikhashan1@gmail.com>
Signed-off-by: Mahdi Khashan <mahdikhashan1@gmail.com>
Signed-off-by: Mahdi Khashan <mahdikhashan1@gmail.com>
Signed-off-by: Mahdi Khashan <mahdikhashan1@gmail.com>
Signed-off-by: Mahdi Khashan <mahdikhashan1@gmail.com>
Signed-off-by: Mahdi Khashan <mahdikhashan1@gmail.com>
@mahdikhashan mahdikhashan reopened this Oct 4, 2025
@google-oss-prow google-oss-prow bot added size/L and removed size/XS labels Oct 4, 2025
Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mahdikhashan Great work! Thanks for your instant response. Just one more suggestion. We're close to the end.

Comment on lines 52 to 55
### Non-Goals

- No TPU testing, tests will use CPU
- No GPU testing, tests will use CPU
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Now, we've had access to GPU test cluster. So it might not be accurate now. How about saying non-goals:

  1. Set up test in TPU cluster
  2. Add advanced JAX APIs

WDYT? @andreyvelich @mahdikhashan

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

Signed-off-by: mahdikhashan <mahdikhashan1@gmail.com>
Copy link
Member

@Electronic-Waste Electronic-Waste left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mahdikhashan Thanks for your great work! Let's move forward with this KEP!

/approve
/lgtm
/assign @andreyvelich @tenzen-y @astefanutti
/hold for other maintainers

Copy link
Member

@andreyvelich andreyvelich left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this @mahdikhashan!
/lgtm
/approve
/hold cancel

@google-oss-prow
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: andreyvelich, Electronic-Waste

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [Electronic-Waste,andreyvelich]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@google-oss-prow google-oss-prow bot merged commit 92c3f9b into kubeflow:master Oct 5, 2025
33 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

7 participants