feat(transformers): add doge (v4.54.1) #1392

alien-0119 · 2025-10-27T03:13:28Z

What does this PR do?

Adds the Doge model and fast unit tests.

Usage example:

from transformers import AutoTokenizer
from mindone.transformers import AutoModelForCausalLM
import mindspore as ms

# Load the tokenizer and model weights for SmallDoge/Doge-20M.
tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-20M")
model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-20M")

# Tokenize to NumPy arrays, then convert each array to a MindSpore tensor.
inputs = tokenizer("Hey how are you doing?", return_tensors="np")
inputs = {k: ms.tensor(v) for k, v in inputs.items()}

# Generate up to 100 new tokens and decode them back to text.
outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.batch_decode(outputs))

Performance:
Experiments were run on Ascend Atlas 800T A2 machines with MindSpore 2.7.0 in PyNative mode.

Model	Precision	Weight load (s)	Speed (s/step)
SmallDoge/Doge-20M	fp32	3.407	0.333
SmallDoge/Doge-20M	fp16	3.602	0.330
SmallDoge/Doge-20M	bf16	3.724	0.353

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline?
Did you make sure to update the documentation with your changes? E.g. record bug fixes or new features in What's New. Here are the documentation guidelines
Did you build and run the code without any errors?
Did you report the running environment (NPU type/MS version) and performance in the doc? (better record it for data loading, model inference, or training tasks)
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag members/contributors who may be interested in your PR.

@xxx

gemini-code-assist · 2025-10-27T03:13:47Z

Summary of Changes

Hello @alien-0119, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the Doge model, a new small language model, into the mindone.transformers framework. The changes encompass the full integration of the model's architecture, including specialized components like RMS normalization, rotary embeddings, and a Mixture of Experts (MoE) layer with load balancing. This addition expands the library's capabilities by providing a new model option, complete with auto-configuration support and robust unit tests to verify its implementation.

Highlights

New Doge Model Integration: The Doge model, including its CausalLM, SequenceClassification, and base model components, has been added to the mindone.transformers library.
Core Architecture Components: New modules like DogeRMSNorm, DogeRotaryEmbedding, DogeAttention, DogeMLP, and DogeCDMoE (Mixture of Experts) are introduced to support the Doge model's architecture.
Auto-Configuration and Modeling Support: The Doge model and its configuration are now registered within the mindone.transformers auto-configuration and auto-modeling systems, allowing for easy instantiation.
Mixture of Experts (MoE) Load Balancing: A load_balancing_loss_func has been implemented to manage the routing of tokens to experts within the DogeCDMoE layer, promoting balanced expert utilization.
Comprehensive Unit Testing: Dedicated unit tests have been added for the Doge model, DogeForCausalLM, and DogeForSequenceClassification to ensure functional correctness and precision alignment between PyTorch and MindSpore implementations.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request adds the Doge model, including its MoE (Mixture of Experts) variant, to the library. The changes are well-structured, adding the model implementation, tests, and updating auto-discovery classes.

My review has identified a critical bug in the MoE implementation path within DogeDecoderLayer that would cause a runtime error. I've also pointed out some areas for improvement regarding code duplication and an incorrect type hint.

A significant concern is the lack of tests for the MoE functionality. Given the bug found, I strongly recommend adding tests for the MoE configuration to ensure its correctness and prevent future regressions.

Once these issues are addressed, this will be a solid contribution.

gemini-code-assist · 2025-10-27T03:15:37Z

mindone/transformers/models/doge/modeling_doge.py

+        # state transformation
+        residual = hidden_states
+        hidden_states = self.post_attention_layernorm(hidden_states)
+        hidden_states = self.mlp(hidden_states)


There's a critical bug in the DogeDecoderLayer when using the Mixture of Experts (MoE) functionality. If config.is_moe is true, self.mlp is an instance of DogeCDMoE, and its construct method returns a tuple (hidden_states, router_logits).

On this line, the tuple is assigned to the single variable hidden_states. This will cause a TypeError on the next line when mint.nn.functional.dropout is called on the tuple.

To fix this, you should unpack the output of self.mlp and assign only the hidden_states tensor to the hidden_states variable. The router_logits are captured by the OutputRecorder mechanism and are not needed here.

Suggested change

-        hidden_states = self.mlp(hidden_states)
+        mlp_output = self.mlp(hidden_states)
+        hidden_states = mlp_output[0] if isinstance(mlp_output, tuple) else mlp_output
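
The guard in the suggestion can be illustrated without MindSpore. This is a minimal sketch that uses plain Python lists as stand-ins for tensors; `take_hidden_states`, `dense_mlp`, and `moe_mlp` are hypothetical names for illustration and are not part of the PR.

```python
def take_hidden_states(mlp_output):
    """Return the hidden states whether the MLP returned them bare or in a tuple."""
    if isinstance(mlp_output, tuple):
        return mlp_output[0]
    return mlp_output


def dense_mlp(x):
    # Stand-in for a dense MLP: returns a single value.
    return [v * 2 for v in x]


def moe_mlp(x):
    # Stand-in for an MoE layer: returns (hidden_states, router_logits).
    return [v * 2 for v in x], [0.1, 0.9]


dense_out = take_hidden_states(dense_mlp([1.0, 2.0]))
moe_out = take_hidden_states(moe_mlp([1.0, 2.0]))
```

Both calls yield the same hidden states, so the code that follows the MLP call does not need to know which variant is configured.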

gemini-code-assist · 2025-10-27T03:15:37Z

tests/transformers_tests/models/doge/test_modeling_doge.py

+class DogeModelTester:
+    def __init__(
+        self,
+        batch_size=8,
+        seq_length=16,
+        is_training=True,
+        use_input_mask=True,
+        use_token_type_ids=False,
+        use_labels=True,
+        vocab_size=128,
+        hidden_size=32,
+        num_hidden_layers=2,
+        num_attention_heads=4,
+        intermediate_size=64,
+        hidden_act="silu",
+        max_position_embeddings=512,
+        type_vocab_size=16,
+        type_sequence_label_size=2,
+        initializer_range=0.02,
+        num_labels=3,
+        pad_token_id=0,
+        scope=None,
+    ):
+        self.batch_size = batch_size
+        self.seq_length = seq_length
+        self.is_training = is_training
+        self.use_input_mask = use_input_mask
+        self.use_token_type_ids = use_token_type_ids
+        self.use_labels = use_labels
+        self.vocab_size = vocab_size
+        self.hidden_size = hidden_size
+        self.num_hidden_layers = num_hidden_layers
+        self.num_attention_heads = num_attention_heads
+        self.intermediate_size = intermediate_size
+        self.hidden_act = hidden_act
+        self.max_position_embeddings = max_position_embeddings
+        self.type_vocab_size = type_vocab_size
+        self.type_sequence_label_size = type_sequence_label_size
+        self.initializer_range = initializer_range
+        self.num_labels = num_labels
+        self.pad_token_id = pad_token_id
+        self.scope = scope


The added tests for the Doge model do not seem to cover the Mixture of Experts (MoE) functionality (is_moe=True). The DogeModelTester class doesn't include MoE-related configuration parameters like is_moe, num_experts, and num_experts_per_tok.

Given that a critical bug was found in the MoE code path, it's very important to add test cases that specifically enable and validate the MoE implementation. Please update DogeModelTester to support MoE configurations and add corresponding test cases to DOGE_CASES. This will help prevent regressions and ensure this key feature works as expected.
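
One way to cover both code paths is to generate a dense and an MoE variant of the same test configuration. The sketch below is hypothetical: `TesterConfigSketch` and `build_cases` are illustrative names, the default values for `num_experts` and `num_experts_per_tok` are assumptions chosen to keep the test small, and the real DogeModelTester and DOGE_CASES may be structured differently.

```python
from dataclasses import asdict, dataclass


@dataclass
class TesterConfigSketch:
    # Hypothetical stand-in for the config arguments held by DogeModelTester.
    hidden_size: int = 32
    num_hidden_layers: int = 2
    num_attention_heads: int = 4
    is_moe: bool = False
    # Assumed small values; 16 is a perfect square on the assumption that
    # the router splits experts into two equally sized key sets.
    num_experts: int = 16
    num_experts_per_tok: int = 4


def build_cases():
    """Return (case_name, config_kwargs) pairs covering both MLP paths."""
    cases = []
    for is_moe in (False, True):
        name = "moe" if is_moe else "dense"
        cases.append((name, asdict(TesterConfigSketch(is_moe=is_moe))))
    return cases


cases = build_cases()
```

Each pair could then feed the existing parametrized tests, so the MoE branch of the decoder layer runs at least once per test function.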

gemini-code-assist · 2025-10-27T03:15:37Z

mindone/transformers/models/doge/modeling_doge.py

+        use_cache: Optional[bool] = False,
+        cache_position: Optional[Tensor] = None,
+        **kwargs: Unpack[TransformersKwargs],
+    ) -> tuple[Tensor, Optional[tuple[Tensor, Tensor]]]:


The return type hint for DogeDecoderLayer.construct appears to be incorrect. It's specified as -> tuple[Tensor, Optional[tuple[Tensor, Tensor]]] on line 398, but the function implementation on line 422 returns a single Tensor (hidden_states). This is misleading and should be corrected to match the actual return type for better code clarity and correctness. The past_key_value is handled via the Cache object, which is updated in-place, so it doesn't need to be returned.

Suggested change

-    ) -> tuple[Tensor, Optional[tuple[Tensor, Tensor]]]:
+    ) -> Tensor:

gemini-code-assist · 2025-10-27T03:15:38Z

mindone/transformers/models/doge/modeling_doge.py

+    for layer_gate_logits in gate_logits:
+        (scores_x, scores_y), (indices_x, indices_y) = layer_gate_logits.topk(num_keys, dim=-1)
+
+        all_scores = scores_x.unsqueeze(-1) + scores_y.unsqueeze(-2)
+        all_indices = indices_x.unsqueeze(-1) * num_keys + indices_y.unsqueeze(-2)
+        all_scores = all_scores.view(*all_scores.shape[:-2], -1)
+        all_indices = all_indices.view(*all_indices.shape[:-2], -1)
+
+        _, position_indices = all_scores.topk(top_k, dim=-1)
+        expert_indices = all_indices.gather(-1, position_indices)
+
+        routing_weights = mint.nn.functional.softmax(all_scores, dim=-1)


The logic for calculating expert scores and indices within load_balancing_loss_func is a duplication of the logic found in DogeCDMoE.construct (lines 352-359).

To improve maintainability and reduce redundancy, consider refactoring this shared logic into a separate helper function. This function could take router_logits and num_keys as input and return the routing_weights and expert_indices. Both DogeCDMoE and load_balancing_loss_func could then call this helper.
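
A rough sketch of such a helper, written in NumPy rather than MindSpore so it runs standalone. It mirrors only the lines quoted above; `route_experts`, `topk`, and `softmax` are hypothetical names, the input layout `(2, num_tokens, n)` is an assumption inferred from the tuple unpacking in the snippet, and `top_k` is passed in addition to the two inputs mentioned in the comment.

```python
import numpy as np


def topk(values, k):
    """Return the k largest entries along the last axis and their indices."""
    indices = np.argsort(-values, axis=-1)[..., :k]
    return np.take_along_axis(values, indices, axis=-1), indices


def softmax(x):
    """Numerically stable softmax over the last axis."""
    exp = np.exp(x - x.max(axis=-1, keepdims=True))
    return exp / exp.sum(axis=-1, keepdims=True)


def route_experts(router_logits, num_keys, top_k):
    """Hypothetical shared helper mirroring the quoted routing logic.

    router_logits is assumed to have shape (2, num_tokens, n), n >= num_keys.
    """
    scores, indices = topk(router_logits, num_keys)
    scores_x, scores_y = scores
    indices_x, indices_y = indices

    # Combine the two key sets into a num_keys x num_keys grid of candidates.
    all_scores = scores_x[..., :, None] + scores_y[..., None, :]
    all_indices = indices_x[..., :, None] * num_keys + indices_y[..., None, :]
    all_scores = all_scores.reshape(*all_scores.shape[:-2], -1)
    all_indices = all_indices.reshape(*all_indices.shape[:-2], -1)

    # Pick the top_k candidates and map their positions back to expert ids.
    _, position_indices = topk(all_scores, top_k)
    expert_indices = np.take_along_axis(all_indices, position_indices, axis=-1)
    routing_weights = softmax(all_scores)
    return routing_weights, expert_indices


rng = np.random.default_rng(0)
logits = rng.normal(size=(2, 5, 4))  # 2 key sets, 5 tokens, 4 keys each
weights, experts = route_experts(logits, num_keys=4, top_k=2)
```

With a helper like this, the MoE layer and the loss function would share one definition of how candidate experts are scored, so a change to the routing rule cannot leave the two out of sync.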

alien-0119 requested a review from vigo999 as a code owner October 27, 2025 03:13

gemini-code-assist bot reviewed Oct 27, 2025

alien-0119 force-pushed the doge_master branch from 88d0e74 to e5041b7 Compare October 27, 2025 03:25

add doge

b00deda

alien-0119 force-pushed the doge_master branch from e5041b7 to b00deda Compare October 27, 2025 03:41
