GraniteMoeHybrid (based on v4.51.3)
A new model is added to transformers: GraniteMoeHybrid
It is added on top of the v4.51.3 release, and can be installed from the following tag: v4.51.3-GraniteMoeHybrid-preview.
To install this version, run the following command:
pip install git+https://github.com/huggingface/transformers@v4.51.3-GraniteMoeHybrid-preview
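To confirm the preview build is installed, you can check the package version. A minimal sanity check; note that since the tag is cut from the main branch, the reported version string may differ slightly from 4.51.3:

import transformers

# The preview tag is built on top of v4.51.3; a dev version suffix
# may appear because the tag is cut from the main branch.
print(transformers.__version__)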
If fixes are needed, they will be applied to this tag, so this installation may be considered stable and improving.
As its name implies, this tag is a preview of the GraniteMoeHybrid model. It is a tagged version of the main branch and does not follow semantic versioning. The model will be included in the next minor release: v4.52.0.
GraniteMoeHybrid
The GraniteMoeHybrid model builds on top of GraniteMoeSharedModel and Bamba. Its decoding layers consist of state space layers or MoE attention layers with shared experts. By default, the attention layers do not use positional encoding.
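To see how these layer types are mixed in a given checkpoint, you can inspect the model config. A minimal sketch, assuming the config exposes a per-layer block-type list (as Bamba-style hybrids do) and a position_embedding_type field; the exact attribute names are an assumption here, so getattr is used to degrade gracefully:

from transformers import AutoConfig

config = AutoConfig.from_pretrained("ibm-granite/granite-4.0-tiny-preview")

# Assumed attribute names; fields may be named differently in the released config.
print(getattr(config, "layer_types", None))              # e.g. a list of "mamba" / "attention" entries
print(getattr(config, "position_embedding_type", None))  # expected "nope", i.e. no positional encoding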
Usage example
GraniteMoeHybrid checkpoints can be found on the Hugging Face Hub.
from transformers import AutoModelForCausalLM, AutoTokenizer
model_path = "ibm-granite/granite-4.0-tiny-preview"
tokenizer = AutoTokenizer.from_pretrained(model_path)
# drop device_map if running on CPU
model = AutoModelForCausalLM.from_pretrained(model_path, device_map="auto")
model.eval()
# change input text as desired
prompt = "Write a code to find the maximum value in a list of numbers."
# tokenize the text
input_tokens = tokenizer(prompt, return_tensors="pt").to(model.device)
# generate output tokens
output = model.generate(**input_tokens, max_new_tokens=100)
# decode output tokens into text
output = tokenizer.batch_decode(output)
# loop over the batch to print, in this example the batch size is 1
for i in output:
    print(i)
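If the checkpoint ships a chat template (instruction-tuned variants usually do), the same model can also be prompted through the tokenizer's chat interface. A hedged sketch reusing the model, tokenizer, and prompt from above, assuming a chat template is present:

# Assumes the tokenizer provides a chat template; if not, use plain prompts as above.
messages = [{"role": "user", "content": prompt}]
chat_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
chat_output = model.generate(chat_ids, max_new_tokens=100)
print(tokenizer.decode(chat_output[0], skip_special_tokens=True))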