
Conversation

@joecummings (Member) commented Nov 26, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Please link to any issues this PR addresses.

Addresses the concerns in #2071

Changelog

What are the changes made in this PR?

  • Add a parameter to accept custom_sharded_layers in the DPO and LoRA distributed recipes (see the sketch after this list)
  • Update configs
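
For reference, here is a minimal sketch of how the new option might look in a LoRA distributed config. The key name custom_sharded_layers comes from this PR's description; the layer names follow torchtune's Llama module naming (tok_embeddings, output), and the exact values and placement should be treated as an assumption.

```yaml
# Sketch: shard the (large) token embedding and output projection layers
# individually with FSDP instead of replicating them on every rank.
custom_sharded_layers: ['tok_embeddings', 'output']
```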

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any of these, just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

Note that in both cases, memory usage without sharding is higher than with sharding, confirming that the feature works.
WandB link for LoRA distributed: https://wandb.ai/jcummings/test-123
WandB link for DPO LoRA distributed: https://wandb.ai/jcummings/test-123-dpo

Configs
Testing for Qwen2.5 72B LoRA: https://wandb.ai/jcummings/qwen-sharding
Testing for Qwen2.5 32B LoRA: https://wandb.ai/jcummings/qwen2.5-32-sharding
Testing for Llama3 70B LoRA: https://wandb.ai/jcummings/llama3-sharded
(Llama3.1 70B LoRA was tested above)

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example and a tutorial example.

  • I did not change any public API
  • I have added an example to docs or docstrings

pytorch-bot (bot) commented Nov 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2072

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 91f5e04 with merge base d7f8eb0:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Nov 26, 2024
@joecummings changed the title from "[WIP] Add ability to shard custom layers for DPO and LoRA distributed" to "Add ability to shard custom layers for DPO and LoRA distributed" Nov 26, 2024
@felipemello1 (Contributor) left a comment

stamping to unblock.

  1. Can you please add a comparison without sharding the output weight, i.e., only shard the embeddings? I believe it's not helpful.

  2. Are you going to add defaults to the configs?

@felipemello1 (Contributor) left a comment

.

@joecummings (Member, Author) commented

> stamping to unblock.
>
> 1. Can you please add a comparison without sharding the output weight, i.e., only shard the embeddings? I believe it's not helpful.

That would be the case if the embedding and output weights were tied; however, Llama does not tie them.

> 2. Are you going to add defaults to the configs?

Yep, I added it, commented out, to all of the large LoRA configs.
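
For illustration, a sketch of what that commented-out default might look like in the large LoRA configs, assuming the same key and layer names as in the sketch above (the exact placement in each config is an assumption):

```yaml
# Uncomment to shard the token embedding and output projection layers with FSDP,
# reducing peak memory on large models at the cost of extra communication.
# custom_sharded_layers: ['tok_embeddings', 'output']
```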

@ebsmothers mentioned this pull request Nov 26, 2024
@felipemello1 merged commit b1aecb1 into meta-pytorch:main Nov 26, 2024
17 checks passed