
Conversation

@joecummings (Member) commented Nov 26, 2024

Context

What is the purpose of this PR? Is it to

  • add a new feature
  • fix a bug
  • update tests and/or documentation
  • other (please add here)

Please link to any issues this PR addresses.

Addresses the concerns in #2071

Changelog

What are the changes made in this PR?

  • Add a parameter to accept custom_sharded_layers in the DPO and LoRA distributed recipes (see the sketch after this list)
  • Update configs
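
For reference, here is a minimal sketch of how the new option might look in a LoRA distributed config. The key name custom_sharded_layers comes from this PR's description; the layer names follow torchtune's Llama module naming (tok_embeddings, output), and the exact values and placement should be treated as an assumption.

```yaml
# Sketch: shard the (large) token embedding and output projection layers
# individually with FSDP instead of replicating them on every rank.
custom_sharded_layers: ['tok_embeddings', 'output']
```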

Test plan

Please make sure to do each of the following if applicable to your PR. If you're unsure about any of these, just ask and we will happily help. We also have a contributing page for some guidance on contributing.

  • run pre-commit hooks and linters (make sure you've first installed via pre-commit install)
  • add unit tests for any new functionality
  • update docstrings for any new or updated methods or classes
  • run unit tests via pytest tests
  • run recipe tests via pytest tests -m integration_test
  • manually run any new or modified recipes with sufficient proof of correctness
  • include relevant commands and any other artifacts in this summary (pastes of loss curves, eval results, etc.)

Note that in both cases, memory usage without sharding is higher than with sharding, confirming that the feature works.
WandB link for LoRA distributed: https://wandb.ai/jcummings/test-123
WandB link for DPO LoRA distributed: https://wandb.ai/jcummings/test-123-dpo

Configs
Testing for Qwen2.5 72B LoRA: https://wandb.ai/jcummings/qwen-sharding
Testing for Qwen2.5 32B LoRA: https://wandb.ai/jcummings/qwen2.5-32-sharding
Testing for Llama3 70B LoRA: https://wandb.ai/jcummings/llama3-sharded
(Llama3.1 70B LoRA was tested above)

UX

If your function changed a public API, please add a dummy example of what the user experience will look like when calling it.
Here is a docstring example and a tutorial example.

  • I did not change any public API
  • I have added an example to docs or docstrings

pytorch-bot (bot) commented Nov 26, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/2072

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit 91f5e04 with merge base d7f8eb0:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@facebook-github-bot added the CLA Signed label Nov 26, 2024
@joecummings changed the title from "[WIP] Add ability to shard custom layers for DPO and LoRA distributed" to "Add ability to shard custom layers for DPO and LoRA distributed" Nov 26, 2024
@felipemello1 (Contributor) left a comment

stamping to unblock.

  1. Can you please add a comparison without sharding the output weight, i.e., only shard the embeddings? I believe it's not helpful.

  2. Are you going to add defaults to the configs?

@felipemello1 (Contributor) left a comment

.

@joecummings (Member, Author) commented

> stamping to unblock.
>
> 1. Can you please add a comparison without sharding the output weight, i.e., only shard the embeddings? I believe it's not helpful.

That would be the case if the embedding and output weights were tied; however, Llama does not tie them.

> 2. Are you going to add defaults to the configs?

Yep, I added it, commented out, to all of the large LoRA configs.
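
For illustration, a sketch of what that commented-out default might look like in the large LoRA configs, assuming the same key and layer names as in the sketch above (the exact placement in each config is an assumption):

```yaml
# Uncomment to shard the token embedding and output projection layers with FSDP,
# reducing peak memory on large models at the cost of extra communication.
# custom_sharded_layers: ['tok_embeddings', 'output']
```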

@ebsmothers mentioned this pull request Nov 26, 2024
@felipemello1 merged commit b1aecb1 into meta-pytorch:main Nov 26, 2024
17 checks passed