model fragments for diloco #1446

Open
wants to merge 2 commits into
base: main

Conversation

tushar00jain
Contributor

@tushar00jain commented Jul 23, 2025

Summary:

  • Add a configuration option that lets users specify how to partition the model into fragments.
  • If the option is set, the model must implement `FaultTolerantTrainingSpec`, which defines the fragmentation function that splits the model according to that configuration.
  • Determine the model fragments in the training script and pass them to the ft manager.
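
To make the three points above concrete, here is a minimal, framework-free sketch of the idea. It is not the code in this PR: `FragmentConfig`, `split_points`, `FaultTolerantSpecSketch`, and `fragment_layers` are hypothetical names chosen for illustration, and layers are represented by plain strings rather than real modules.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class FragmentConfig:
    # Hypothetical option: names of the layers at which a new fragment starts.
    split_points: list[str] = field(default_factory=list)


def fragment_layers(layer_names: list[str], config: FragmentConfig) -> list[list[str]]:
    """Split an ordered list of layer names into contiguous fragments."""
    unknown = set(config.split_points) - set(layer_names)
    if unknown:
        raise ValueError(f"unknown split points: {sorted(unknown)}")
    fragments: list[list[str]] = [[]]
    for name in layer_names:
        # Open a new fragment at each split point (unless the current one is empty).
        if name in config.split_points and fragments[-1]:
            fragments.append([])
        fragments[-1].append(name)
    return [frag for frag in fragments if frag]


@dataclass
class FaultTolerantSpecSketch:
    # Hypothetical stand-in for the spec a model would provide:
    # it only carries the fragmentation function.
    fragment_fn: Callable[[list[str], FragmentConfig], list[list[str]]]


# The "training script" side: compute fragments once, then hand them on.
layers = ["tok_embeddings", "layers.0", "layers.1", "layers.2", "layers.3", "norm", "output"]
config = FragmentConfig(split_points=["layers.2"])
spec = FaultTolerantSpecSketch(fragment_fn=fragment_layers)
fragments = spec.fragment_fn(layers, config)
```

With a single split point the model ends up in two fragments; in the real training script these would be module groups passed to the ft manager rather than lists of names.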

Test Plan:
Ran Llama 3 8B with 2 fragments and a 1-step delay, syncing each fragment every 20 steps.
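
For intuition about these settings, the sketch below lists when each fragment would sync under one plausible schedule. Two things here are assumptions rather than facts from this PR: that the fragments are staggered evenly across the 20-step window, and that "1 step delay" means the synced result is applied one step after the sync starts. The function and parameter names are hypothetical.

```python
def sync_schedule(num_steps: int, num_fragments: int, sync_every: int, delay: int):
    """Return (start_step, apply_step, fragment_index) tuples.

    Assumed staggering: fragment i starts syncing at steps
    (i + 1) * offset, (i + 1) * offset + sync_every, ...
    where offset = sync_every // num_fragments.
    """
    offset = sync_every // num_fragments
    events = []
    for step in range(1, num_steps + 1):
        for frag in range(num_fragments):
            if (step - (frag + 1) * offset) % sync_every == 0:
                events.append((step, step + delay, frag))
    return events


# Settings from the test plan: 2 fragments, 1-step delay, sync every 20 steps.
events = sync_schedule(num_steps=40, num_fragments=2, sync_every=20, delay=1)
for start, apply_at, frag in events:
    print(f"fragment {frag}: sync starts at step {start}, applied at step {apply_at}")
```

Under these assumptions each fragment still syncs once every 20 steps, but some fragment begins a sync every 10 steps, which spreads the communication out instead of syncing the whole model at once.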

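<img width="944" height="545" alt="image" src="https://github.com/user-attachments/assets/6d16f486-7260-49d6-8ba3-3e98cd331e58" />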

Stack created with Sapling. Best reviewed with ReviewStack.

@facebook-github-bot added the CLA Signed label Jul 23, 2025
@tushar00jain force-pushed the pr1446 branch 4 times, most recently from be87993 to 04977f1, July 24, 2025 00:55