DoRA #1115
Conversation
🔗 Helpful Links
🧪 See artifacts and rendered test results at hud.pytorch.org/pr/pytorch/torchtune/1115
Note: Links to docs will display an error until the docs builds have been completed.
✅ No Failures as of commit 5f243af with merge base 0bdd308.
This comment was automatically generated by Dr. CI and updates every 15 minutes.
Thanks for introducing DoRA to torchtune!
torchtune/models/convert_weights.py
Outdated
_TO_PEFT_KEYS = {
    "lora_a": "lora_A",
    "lora_b": "lora_B",
    "lora.a": "lora_A",
Why do we rename the LoRA keys?
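Presumably this is for interoperability with the PEFT checkpoint format, which spells the adapter keys as lora_A/lora_B (and lora_magnitude_vector for the DoRA magnitude, per the changelog below). A minimal sketch of the idea; the helper below is illustrative, not torchtune's actual converter:

```python
# Hypothetical helper illustrating the key mapping above; torchtune's real
# convert_weights logic is more involved (full-key templates, per-model mappings).
_TO_PEFT_KEYS = {
    "lora_a": "lora_A",
    "lora_b": "lora_B",
}


def tune_to_peft_adapter_key(key: str) -> str:
    for tune_name, peft_name in _TO_PEFT_KEYS.items():
        key = key.replace(tune_name, peft_name)
    return key


# e.g. "layers.0.attn.q_proj.lora_a.weight" -> "layers.0.attn.q_proj.lora_A.weight"
```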
torchtune/modules/peft/lora.py
Outdated
        lora_out = self._dora_forward(x, base_weight, lora_out)
        return base_out + lora_out

    def _dora_forward(self, x: Tensor, base_weight: Tensor, lora_out: Tensor) -> Tensor:
The DoRA logic looks good to me based on HF's implementation
torchtune/modules/peft/lora.py
Outdated
-class LoRALinear(nn.Module, AdapterModule):
+class LoRALinear(nn.Module):
For this LoRA refactor part, I'll defer to @ebsmothers to check.
Update: I commandeered this PR and pushed some changes with help from @weifengpy. Instead of the previous version where we split out the adapter logic into a separate `LowRankAdapter` submodule, DoRA now gets its own `DoRALinear` class. A lot of the reason for the design of the original version was the usage of FSDP1 APIs, which were a bit too restrictive in terms of sharding params with different values of `requires_grad`.

There is still some more work to do here though:

(1) Checkpoint save does not work (I think this should be relatively straightforward though)

I think QDoRA will also need more extensive testing too. cc @SLR722
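For orientation, here is a rough sketch of the shape of that refactor. This is not the PR's actual `DoRALinear`, just a minimal illustration of keeping `lora_a`, `lora_b`, and the magnitude directly on one module so the whole module can be sharded as a single FSDP unit:

```python
import math

import torch
import torch.nn.functional as F
from torch import nn


class DoRALinearSketch(nn.Module):
    """Illustrative DoRA linear: frozen base weight plus trainable lora_a/lora_b
    and a per-output-channel magnitude, all living on the same module."""

    def __init__(self, in_dim: int, out_dim: int, rank: int, alpha: float):
        super().__init__()
        self.scaling = alpha / rank
        self.weight = nn.Parameter(torch.empty(out_dim, in_dim), requires_grad=False)
        self.lora_a = nn.Linear(in_dim, rank, bias=False)
        self.lora_b = nn.Linear(rank, out_dim, bias=False)
        self.magnitude = nn.Parameter(torch.empty(out_dim))
        nn.init.kaiming_uniform_(self.weight, a=math.sqrt(5))
        nn.init.kaiming_uniform_(self.lora_a.weight, a=math.sqrt(5))
        nn.init.zeros_(self.lora_b.weight)

    @torch.no_grad()
    def initialize_dora_magnitude(self) -> None:
        # Magnitude starts as the column-wise norm of the adapted weight
        adapted = self.weight + self.scaling * self.lora_b.weight @ self.lora_a.weight
        self.magnitude.copy_(torch.linalg.norm(adapted, dim=1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        base_out = F.linear(x, self.weight)
        lora_out = self.lora_b(self.lora_a(x)) * self.scaling
        adapted = self.weight + self.scaling * self.lora_b.weight @ self.lora_a.weight
        # Rescale each output channel by magnitude / ||W + scaling * B @ A||
        mag_norm_scale = (
            self.magnitude / torch.linalg.norm(adapted, dim=1).detach()
        ).view(1, -1)
        return mag_norm_scale * (base_out + lora_out)
```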
        raise AssertionError("Unexpected key loading adapter")


def load_dora_magnitudes(model: nn.Module) -> None:
FYI, after https://github.com/pytorch/pytorch/pull/132954/files landed, we can now remove this hack by following the example there.
for m in reversed(list(model.modules())):
    if isinstance(m, nn.Linear) and m.weight.requires_grad:
        fully_shard(m, **fsdp_kwargs)
    if isinstance(m, DoRALinear):
Why do we treat DoRALinear separately here? Why don't we treat it the same as LoRALinear?
I think it's necessary for proper initialization of the magnitude from the sharded LoRA A and B weights. Otherwise they will be in different FSDP param groups, so we would probably need to manually gather and then reshard the weights. You can double-check by commenting out L319-320 to do the usual LoRA sharding; my guess is you will see some kind of DTensor error.
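A sketch of that sharding order, paraphrased from the recipe snippet above and this discussion (the import path and the `initialize_dora_magnitude` helper name are assumptions, not the recipe's exact code):

```python
# FSDP2-style wrapping: each DoRALinear becomes its own fully_shard unit, so its
# lora_a, lora_b, and magnitude land in one param group and the magnitude can be
# initialized from the sharded A/B without manual gather/reshard.
from torch.distributed._composable.fsdp import fully_shard  # path may vary by PyTorch version

for m in reversed(list(model.modules())):
    if isinstance(m, DoRALinear):
        fully_shard(m, **fsdp_kwargs)
    elif isinstance(m, nn.Linear) and m.weight.requires_grad:
        fully_shard(m, **fsdp_kwargs)
fully_shard(model, **fsdp_kwargs)

# After loading the base weights, initialize each magnitude; the DTensor math
# stays within the module's own param group.
for m in model.modules():
    if isinstance(m, DoRALinear):
        m.initialize_dora_magnitude()  # hypothetical helper, see sketch earlier in the thread
```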
adapter_params = get_adapter_params(model)
set_trainable_params(model, adapter_params)
-num_lora_ab, num_transformer_layers = _get_n_lora_and_tformer_layers(model)
+num_lora, num_transformer_layers = _get_n_lora_and_tformer_layers(model)
I saw multiple changes to the test_lora_fsdp_wrap test, but the unit test doesn't pass with those changes.
Sorry, can you elaborate? Do you mean changes when you merge with the latest main, or changes from Calvin's initial version?
There are several changes to test_distributed.py compared with master, and I think they come either from Calvin or from your FSDP refactor work. Those tests are broken and I don't have context on why we need these changes.
OK yeah I can take a look. The only changes I made were these ones
I fixed the test_distributed.py issue by reverting some of Calvin's changes to _distributed.py, which are unnecessary after the DoRA refactor.
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1115      +/-   ##
==========================================
+ Coverage   69.80%   70.12%   +0.32%
==========================================
  Files         272      274       +2
  Lines       13053    13271     +218
==========================================
+ Hits         9111     9306     +195
- Misses       3942     3965      +23

☔ View full report in Codecov by Sentry.
I made several updates to this PR to push it closer to the finish line.
There are some remaining issues that may need help from @ebsmothers or other people.
cc: @ebsmothers
To further verify the correctness of this DoRA work:
[comparison images]
# Training
epochs: 1
max_steps_per_epoch: 50
debug code sneaking in 😃
torchtune/modules/peft/dora.py
Outdated
""" | ||
# NOTE: this function has to be updated if the names of "lora_a" and "lora_b" | ||
# in this module change. | ||
# TODO: need to add back magnitude, but causing initial check to error out cause it's not defined yet |
Seems like this can be removed now
torchtune/modules/peft/__init__.py
Outdated
"_lora_a_init_params", | ||
"_lora_b_init_params", |
I don't think we want these to be public
class _DoraReference(nn.Module):
    """
    DoRA linear layer reference.
    Paper: https://arxiv.org/abs/2402.09353
    Based on the code from:
    https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora/layer.py
    https://github.com/huggingface/peft/blob/main/src/peft/tuners/lora/dora.py
    For more info, see the discussion here:
    https://github.com/huggingface/peft/pull/1474
    """

    def __init__(
        self,
        dtype: torch.dtype,
        in_dim: int,
        out_dim: int,
        rank: int,
        alpha: float,
        dropout: float = 0.0,
        use_bias: bool = False,
        quantize_base: bool = False,
        use_dora: bool = True,
    ):
        super().__init__()
        self.use_bias = use_bias
        self.quantize_base = quantize_base
        self.use_dora = use_dora

        linear = nn.Linear(
            in_features=in_dim, out_features=out_dim, bias=use_bias, dtype=dtype
        )
        weight = linear.weight if not quantize_base else to_nf4(linear.weight)
        bias = None
        if use_bias:
            if quantize_base:
                raise NotImplementedError()
            bias = linear.bias
        self.register_parameter("weight", nn.Parameter(weight))
        self.register_parameter(
            "bias", nn.Parameter(bias) if bias is not None else None
        )

        self.lora_a = nn.Linear(in_dim, rank, bias=False, dtype=dtype)
        self.lora_b = nn.Linear(rank, out_dim, bias=False, dtype=dtype)
        nn.init.kaiming_uniform_(self.lora_a.weight, a=math.sqrt(5))
        nn.init.zeros_(self.lora_b.weight)
        self.scaling = alpha / rank
        if use_dora:
            self.lora_magnitude = nn.Parameter(torch.randn(out_dim, dtype=dtype))
        self.dropout = nn.Dropout(p=dropout)

    def initialize_dora(self):
        weight = self.weight.to(self.lora_a.weight.dtype)
        lora_weight = self.lora_b.weight @ self.lora_a.weight
        weight_norm = self._get_weight_norm(weight, lora_weight)
        self.lora_magnitude = nn.Parameter(weight_norm, requires_grad=True)

    def forward(self, x):
        result = self._base_forward(x)
        torch_result_dtype = result.dtype
        x = x.to(self.lora_a.weight.dtype)
        if not self.use_dora:
            result = result + self.lora_b(self.lora_a(self.dropout(x))) * self.scaling
        else:
            x = self.dropout(x)
            result = result + self._dora_forward(x)
        result = result.to(torch_result_dtype)
        return result

    def _base_forward(self, x):
        if self.quantize_base:
            return linear_nf4(input=x, weight=self.weight)
        return F.linear(x, self.weight, self.bias)

    def _dora_forward(self, x):
        lora_result = self.lora_b(self.lora_a(x))
        x_eye = torch.eye(
            self.lora_a.weight.shape[1], device=self.lora_a.weight.device, dtype=x.dtype
        )
        lora_weight = self.lora_b(self.lora_a(x_eye)).T

        magnitude = self.lora_magnitude
        weight = self.weight.to(x.dtype)
        weight_norm = self._get_weight_norm(weight, lora_weight.detach())
        weight_norm = weight_norm.detach()
        mag_norm_scale = (magnitude / weight_norm).view(1, -1)
        result_dora = (mag_norm_scale - 1) * (
            F.linear(x, weight)
        ) + mag_norm_scale * lora_result * self.scaling
        return result_dora

    def _get_weight_norm(self, weight, lora_weight):
        weight = weight + self.scaling * lora_weight
        weight_norm = torch.linalg.norm(weight, dim=1).to(weight.dtype)
        return weight_norm
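To make the mag_norm_scale trick in _dora_forward concrete, here is a small self-contained check (not part of the PR) that base_out + result_dora reproduces the paper's W' = m * (W + scaling * BA) / ||W + scaling * BA|| applied to x, under the simplifying assumptions of fp32, no dropout, and no quantization:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
in_dim, out_dim, rank, scaling = 8, 6, 2, 0.5
x = torch.randn(4, in_dim)
W = torch.randn(out_dim, in_dim)
A = torch.randn(rank, in_dim)
B = torch.randn(out_dim, rank)
m = torch.randn(out_dim)

# Direct evaluation of the paper's formula: per-row (output-channel) norm of the
# adapted weight, rescaled by the magnitude vector.
W_adapted = W + scaling * (B @ A)
col_norm = torch.linalg.norm(W_adapted, dim=1)
dora_out = F.linear(x, (m / col_norm).unsqueeze(-1) * W_adapted)

# The decomposition used in the reference module: base_out + result_dora
base_out = F.linear(x, W)
lora_out = F.linear(F.linear(x, A), B)
mag_norm_scale = (m / col_norm).view(1, -1)
result_dora = (mag_norm_scale - 1) * F.linear(x, W) + mag_norm_scale * lora_out * scaling

torch.testing.assert_close(base_out + result_dora, dora_out, rtol=1e-5, atol=1e-5)
```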
We should refactor this test. In general we try not to put the reference implementation directly in the unit test, but instead use it to determine the expected values, hardcode those in the test, and point to where we got them
Addressed this comment by only comparing the numerical values in the unit test and moving the reference implementation to tests/torchtune/models/llama2/scripts/compare_dora.py.
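The resulting test pattern looks roughly like the sketch below (the expected number and fixture names are placeholders for illustration, not values from the PR):

```python
import torch


def test_dora_forward(dora_linear, inputs):
    # Expected value generated offline by
    # tests/torchtune/models/llama2/scripts/compare_dora.py; the number below is
    # a placeholder, not an actual value from the PR.
    expected = torch.tensor(1.1252)
    actual = dora_linear(inputs).mean()
    torch.testing.assert_close(actual, expected, rtol=1e-4, atol=1e-4)
```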
torch.manual_seed(0)
qdora_linear_out = qdora_linear(inputs)
torch.testing.assert_close(
    dora_linear_out, qdora_linear_out, rtol=1e-01, atol=1e-01
I need to make this assert very loose, since I found that to_nf4 changes the original values significantly.
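One way to see how lossy the NF4 path is on its own (an illustrative check using the same torchao helpers as the reference module above; shapes are chosen to satisfy NF4's block-size requirements):

```python
import torch
import torch.nn.functional as F
from torchao.dtypes.nf4tensor import linear_nf4, to_nf4

torch.manual_seed(0)
w = torch.randn(512, 512, dtype=torch.bfloat16)
x = torch.randn(4, 512, dtype=torch.bfloat16)

out_ref = F.linear(x, w)                         # full-precision base forward
out_nf4 = linear_nf4(input=x, weight=to_nf4(w))  # NF4-quantized base forward
print((out_ref - out_nf4).abs().max())           # quantization error that motivates the loose rtol/atol
```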
Hmm, this is a bit suspicious to me. For comparison, in the corresponding LoRA test we do not have to do this. Can we make sure nothing unexpected is happening with the magnitude vector here?
A few small things; otherwise I want to make sure there are no correctness issues around quantization with the magnitude. After that this looks good to go!
    rank=lora_rank,
    alpha=lora_alpha,
    dropout=lora_dropout,
    use_dora=use_dora,
Shouldn't have to pass `use_dora` to `adapter_cls` or `DoRALinear` (seems there are a bunch of such instances, but just in this file).
  path: /tmp/Meta-Llama-3-8B-Instruct/original/tokenizer.model

checkpointer:
  _component_: torchtune.utils.FullModelMetaCheckpointer
FYI @RdoubleA is working on landing some changes that rename this path from `.utils.` to `.training.`. Let's coordinate since these are both big sets of changes.
        quantize_base=False,
    )

    # @pytest.mark.parametrize("dtype", [torch.bfloat16, torch.float32])
How come this is commented out?
This one is finally ready to go! Huge thanks to both @calvinpelletier for all the initial work and to @SLR722 for taking this one the last mile (really quite a few miles). Really excited to be able to support DoRA!
Context
Adding support for DoRA: https://arxiv.org/abs/2402.09353
Also refactoring `LoRALinear`. The adapter logic is now encapsulated in a submodule `LowRankAdapter` while the base layer logic remains in `LoRALinear`. This lets us just wrap the `LowRankAdapter` module for FSDP, instead of separately wrapping the `lora_a`/`lora_b` projections and the new `magnitude` parameter.

Addresses #1100, #893, #936
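A rough structural illustration of that originally proposed split (names follow this description and the checkpoint keys mentioned elsewhere in the PR; this is not the actual torchtune code):

```python
import torch
from torch import nn


class LowRankAdapter(nn.Module):
    # Holds everything FSDP needs to wrap as one unit: the low-rank projections
    # plus the optional DoRA magnitude.
    def __init__(self, in_dim: int, out_dim: int, rank: int, use_dora: bool = False):
        super().__init__()
        self.a = nn.Linear(in_dim, rank, bias=False)
        self.b = nn.Linear(rank, out_dim, bias=False)
        self.magnitude = nn.Parameter(torch.ones(out_dim)) if use_dora else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.b(self.a(x))


class LoRALinear(nn.Module):
    # Base (frozen) linear; the adapter logic lives in the `lora` submodule, which
    # yields checkpoint keys like "<...>.lora.a.weight" and "<...>.lora.magnitude".
    def __init__(self, in_dim: int, out_dim: int, rank: int, use_dora: bool = False):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(out_dim, in_dim), requires_grad=False)
        self.lora = LowRankAdapter(in_dim, out_dim, rank, use_dora=use_dora)
```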
Changelog
- Refactored `LoRALinear` into `LoRALinear`/`LowRankAdapter`
- Added DoRA to `LoRALinear` (enabled via `use_dora` in the constructor)
- Updated `get_merged_lora_ckpt`
- Added `use_dora` to all model/component builders
- Updated `recipes/dev/lora_finetune_fsdp2.py`
- Checkpoint key conversion for the DoRA magnitude (`*.lora_magnitude_vector.weight` <-> `*.lora.magnitude`)
Test plan
- `pytest tests/torchtune/modules/peft/test_lora.py::TestLoRALinear::test_dora`
- Compared finetunes for LoRA variants (config based on `llama3/8B_lora(_single_device)`, for 2 epochs):
  - LoRA 1xH100
  - DoRA 1xH100
  - QLoRA 1xH100
  - QDoRA 1xH100
  - DoRA 2xH100
Attempted to replicate section 3.2 ("Weight Decomposition Analysis") of the DoRA paper.
The code is here.
Results (for comparison with figure 2 of the paper):



Each point is a single query projection (avg direction delta on the X axis, avg magnitude delta on the Y axis).
Full finetune
LoRA finetune
DoRA fintune
The difference from the paper's results is probably due to a different experimental setup (Llama3 finetuned on Alpaca instead of VL-BART finetuned on four image-text tasks).