Conversation

@alien-0119 (Collaborator) commented Oct 27, 2025

What does this PR do?

Adds # (feature)
Adds the Doge model and fast unit tests (UT).

Usage example:

from transformers import AutoTokenizer
from mindone.transformers import AutoModelForCausalLM
import mindspore as ms

# load the tokenizer and model weights from the Hugging Face Hub
tokenizer = AutoTokenizer.from_pretrained("SmallDoge/Doge-20M")
model = AutoModelForCausalLM.from_pretrained("SmallDoge/Doge-20M")

# tokenize to NumPy arrays, then convert to MindSpore tensors
inputs = tokenizer("Hey how are you doing?", return_tensors="np")
inputs = {k: ms.tensor(v) for k, v in inputs.items()}

outputs = model.generate(**inputs, max_new_tokens=100)
print(tokenizer.batch_decode(outputs))

Performance:
Experiments were run on Ascend Atlas 800T A2 machines with MindSpore 2.7.0 in PyNative mode.

model                precision   weight load (s)   s/step
SmallDoge/Doge-20M   fp32        3.407             0.333
SmallDoge/Doge-20M   fp16        3.602             0.330
SmallDoge/Doge-20M   bf16        3.724             0.353

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline?
  • Did you make sure to update the documentation with your changes? E.g. record bug fixes or new features in What's New. Here are the
    documentation guidelines
  • Did you build and run the code without any errors?
  • Did you report the running environment (NPU type/MS version) and performance in the doc? (better record it for data loading, model inference, or training tasks)
  • Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.

@xxx

@alien-0119 alien-0119 requested a review from vigo999 as a code owner October 27, 2025 03:13
@gemini-code-assist (Contributor)

Summary of Changes

Hello @alien-0119, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the Doge model, a new small language model, into the mindone.transformers framework. The changes encompass the full integration of the model's architecture, including specialized components like RMS normalization, rotary embeddings, and a Mixture of Experts (MoE) layer with load balancing. This addition expands the library's capabilities by providing a new model option, complete with auto-configuration support and robust unit tests to verify its implementation.

Highlights

  • New Doge Model Integration: The Doge model, including its CausalLM, SequenceClassification, and base model components, has been added to the mindone.transformers library.
  • Core Architecture Components: New modules like DogeRMSNorm, DogeRotaryEmbedding, DogeAttention, DogeMLP, and DogeCDMoE (Mixture of Experts) are introduced to support the Doge model's architecture.
  • Auto-Configuration and Modeling Support: The Doge model and its configuration are now registered within the mindone.transformers auto-configuration and auto-modeling systems, allowing for easy instantiation.
  • Mixture of Experts (MoE) Load Balancing: A load_balancing_loss_func has been implemented to manage the routing of tokens to experts within the DogeCDMoE layer, promoting balanced expert utilization (a generic sketch of this kind of auxiliary loss follows this list).
  • Comprehensive Unit Testing: Dedicated unit tests have been added for the Doge model, DogeForCausalLM, and DogeForSequenceClassification to ensure functional correctness and precision alignment between PyTorch and MindSpore implementations.
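
For readers unfamiliar with auxiliary load-balancing losses, below is a minimal NumPy sketch of the standard Switch-Transformer-style formulation. It is illustrative only: the function name, top_k handling, and plain-softmax routing are assumptions, and it does not reproduce the product-key routing that Doge's actual load_balancing_loss_func uses (see the review comment on lines +573 to +585 further down).

import numpy as np

def switch_style_load_balancing_loss(router_logits: np.ndarray, top_k: int = 1) -> float:
    """Generic auxiliary loss: num_experts * sum_i(f_i * P_i), where f_i is the
    fraction of tokens routed to expert i and P_i is its mean routing probability.
    The loss is smallest when tokens are spread evenly across experts."""
    num_tokens, num_experts = router_logits.shape
    # softmax over the expert dimension
    probs = np.exp(router_logits - router_logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)
    # hard top_k assignment: which experts each token is actually routed to
    chosen = np.argsort(-probs, axis=-1)[:, :top_k]
    tokens_per_expert = np.zeros(num_experts)
    np.add.at(tokens_per_expert, chosen.reshape(-1), 1.0)
    fraction_routed = tokens_per_expert / (num_tokens * top_k)  # f_i
    mean_prob = probs.mean(axis=0)                              # P_i
    return float(num_experts * np.sum(fraction_routed * mean_prob))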
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature                Command                Description
Code Review            /gemini review         Performs a code review for the current pull request in its current state.
Pull Request Summary   /gemini summary        Provides a summary of the current pull request in its current state.
Comment                @gemini-code-assist    Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help                   /gemini help           Displays a list of available commands.

Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist (bot) left a comment

Code Review

This pull request adds the Doge model, including its MoE (Mixture of Experts) variant, to the library. The changes are well-structured, adding the model implementation, tests, and updating auto-discovery classes.

My review has identified a critical bug in the MoE implementation path within DogeDecoderLayer that would cause a runtime error. I've also pointed out some areas for improvement regarding code duplication and an incorrect type hint.

A significant concern is the lack of tests for the MoE functionality. Given the bug found, I strongly recommend adding tests for the MoE configuration to ensure its correctness and prevent future regressions.

Once these issues are addressed, this will be a solid contribution.

# state transformation
residual = hidden_states
hidden_states = self.post_attention_layernorm(hidden_states)
hidden_states = self.mlp(hidden_states)

critical

There's a critical bug in the DogeDecoderLayer when using the Mixture of Experts (MoE) functionality. If config.is_moe is true, self.mlp is an instance of DogeCDMoE, and its construct method returns a tuple (hidden_states, router_logits).

On this line, the tuple is assigned to the single variable hidden_states. This will cause a TypeError on the next line when mint.nn.functional.dropout is called on the tuple.

To fix this, you should unpack the output of self.mlp and assign only the hidden_states tensor to the hidden_states variable. The router_logits are captured by the OutputRecorder mechanism and are not needed here.

Suggested change
- hidden_states = self.mlp(hidden_states)
+ mlp_output = self.mlp(hidden_states)
+ hidden_states = mlp_output[0] if isinstance(mlp_output, tuple) else mlp_output

Comment on lines +34 to +75
class DogeModelTester:
def __init__(
self,
batch_size=8,
seq_length=16,
is_training=True,
use_input_mask=True,
use_token_type_ids=False,
use_labels=True,
vocab_size=128,
hidden_size=32,
num_hidden_layers=2,
num_attention_heads=4,
intermediate_size=64,
hidden_act="silu",
max_position_embeddings=512,
type_vocab_size=16,
type_sequence_label_size=2,
initializer_range=0.02,
num_labels=3,
pad_token_id=0,
scope=None,
):
self.batch_size = batch_size
self.seq_length = seq_length
self.is_training = is_training
self.use_input_mask = use_input_mask
self.use_token_type_ids = use_token_type_ids
self.use_labels = use_labels
self.vocab_size = vocab_size
self.hidden_size = hidden_size
self.num_hidden_layers = num_hidden_layers
self.num_attention_heads = num_attention_heads
self.intermediate_size = intermediate_size
self.hidden_act = hidden_act
self.max_position_embeddings = max_position_embeddings
self.type_vocab_size = type_vocab_size
self.type_sequence_label_size = type_sequence_label_size
self.initializer_range = initializer_range
self.num_labels = num_labels
self.pad_token_id = pad_token_id
self.scope = scope

high

The added tests for the Doge model do not seem to cover the Mixture of Experts (MoE) functionality (is_moe=True). The DogeModelTester class doesn't include MoE-related configuration parameters like is_moe, num_experts, and num_experts_per_tok.

Given that a critical bug was found in the MoE code path, it's very important to add test cases that specifically enable and validate the MoE implementation. Please update DogeModelTester to support MoE configurations and add corresponding test cases to DOGE_CASES. This will help prevent regressions and ensure this key feature works as expected.
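
A minimal sketch of how the tester could expose MoE settings is given below. The field names is_moe, num_experts, and num_experts_per_tok come from this review comment; the get_config method and the shape of a DOGE_CASES entry are assumptions about the test harness, not the repository's actual code.

# Hypothetical sketch only — assumes DogeModelTester (defined above in the test
# file) has a get_config() method returning a DogeConfig, which is an assumption.
class DogeMoEModelTester(DogeModelTester):
    def __init__(self, is_moe=True, num_experts=4, num_experts_per_tok=2, **kwargs):
        super().__init__(**kwargs)
        self.is_moe = is_moe
        self.num_experts = num_experts
        self.num_experts_per_tok = num_experts_per_tok

    def get_config(self):
        config = super().get_config()
        # enable the MoE path so DogeCDMoE and its routing code are exercised
        config.is_moe = self.is_moe
        config.num_experts = self.num_experts
        config.num_experts_per_tok = self.num_experts_per_tok
        return config

A corresponding MoE entry could then be appended to DOGE_CASES so that both the dense and MoE configurations are covered by the precision-alignment tests.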

use_cache: Optional[bool] = False,
cache_position: Optional[Tensor] = None,
**kwargs: Unpack[TransformersKwargs],
) -> tuple[Tensor, Optional[tuple[Tensor, Tensor]]]:

medium

The return type hint for DogeDecoderLayer.construct appears to be incorrect. It's specified as -> tuple[Tensor, Optional[tuple[Tensor, Tensor]]] on line 398, but the function implementation on line 422 returns a single Tensor (hidden_states). This is misleading and should be corrected to match the actual return type for better code clarity and correctness. The past_key_value is handled via the Cache object, which is updated in-place, so it doesn't need to be returned.

Suggested change
- ) -> tuple[Tensor, Optional[tuple[Tensor, Tensor]]]:
+ ) -> Tensor:

Comment on lines +573 to +585
for layer_gate_logits in gate_logits:
(scores_x, scores_y), (indices_x, indices_y) = layer_gate_logits.topk(num_keys, dim=-1)

all_scores = scores_x.unsqueeze(-1) + scores_y.unsqueeze(-2)
all_indices = indices_x.unsqueeze(-1) * num_keys + indices_y.unsqueeze(-2)
all_scores = all_scores.view(*all_scores.shape[:-2], -1)
all_indices = all_indices.view(*all_indices.shape[:-2], -1)

_, position_indices = all_scores.topk(top_k, dim=-1)
expert_indices = all_indices.gather(-1, position_indices)

routing_weights = mint.nn.functional.softmax(all_scores, dim=-1)

medium

The logic for calculating expert scores and indices within load_balancing_loss_func is a duplication of the logic found in DogeCDMoE.construct (lines 352-359).

To improve maintainability and reduce redundancy, consider refactoring this shared logic into a separate helper function. This function could take router_logits and num_keys as input and return the routing_weights and expert_indices. Both DogeCDMoE and load_balancing_loss_func could then call this helper.
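
One possible shape for such a helper, built from the operations already shown in the quoted snippet — the function name, signature, and docstring are illustrative, not the repository's actual code:

from mindspore import mint

def _compute_routing(router_logits, num_keys, top_k):
    """Combine the two key-score sets into per-expert-pair scores, then select the
    top_k expert indices and the softmax routing weights over all pairs."""
    (scores_x, scores_y), (indices_x, indices_y) = router_logits.topk(num_keys, dim=-1)

    # outer combination of the two score/index sets -> one entry per (x, y) expert pair
    all_scores = scores_x.unsqueeze(-1) + scores_y.unsqueeze(-2)
    all_indices = indices_x.unsqueeze(-1) * num_keys + indices_y.unsqueeze(-2)
    all_scores = all_scores.view(*all_scores.shape[:-2], -1)
    all_indices = all_indices.view(*all_indices.shape[:-2], -1)

    # keep the top_k pairs and normalize all scores into routing weights
    _, position_indices = all_scores.topk(top_k, dim=-1)
    expert_indices = all_indices.gather(-1, position_indices)
    routing_weights = mint.nn.functional.softmax(all_scores, dim=-1)
    return routing_weights, expert_indices

DogeCDMoE.construct and load_balancing_loss_func could then both call this helper with their respective logits, keeping the two code paths in sync.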
