GitHub - nvidia-cosmos/cosmos-reason1: Cosmos-Reason1 models understand physical common sense and generate appropriate embodied decisions in natural language through long chain-of-thought reasoning processes.

Paper | Website | HuggingFace

Cosmos-Reason1 is a suite of models, ontologies, and benchmarks that we develop with the goal of enabling multimodal LLMs to generate physically grounded responses. We release one multimodal LLM, Cosmos-Reason1-7B, post-trained with Physical AI SFT and Physical AI reinforcement learning. We define ontologies for physical common sense and embodied reasoning, and build benchmarks to evaluate the Physical AI reasoning capabilities of multimodal LLMs.

Model

Cosmos-Reason1-7B

Getting Started

Inference

NOTE: We suggest using fps=4 for the input video and max_tokens=4096 to avoid truncated responses.

from transformers import AutoProcessor
from vllm import LLM, SamplingParams
from qwen_vl_utils import process_vision_info

# MODEL_PATH can also point to a local folder containing the safetensors weights
MODEL_PATH = "nvidia/Cosmos-Reason1-7B"

llm = LLM(
    model=MODEL_PATH,
    limit_mm_per_prompt={"image": 10, "video": 10},
)

sampling_params = SamplingParams(
    temperature=0.6,
    top_p=0.95,
    repetition_penalty=1.05,
    max_tokens=4096,
)

video_messages = [
    {
        "role": "system",
        "content": "You are a helpful assistant. Answer the question in the following format: <think>\nyour reasoning\n</think>\n\n<answer>\nyour answer\n</answer>.",
    },
    {
        "role": "user",
        "content": [
            {"type": "text", "text": "Is it safe to turn right?"},
            {
                "type": "video",
                "video": "assets/sample.mp4",
                "fps": 4,
            },
        ],
    },
]

# Here we use video messages as a demonstration
messages = video_messages

processor = AutoProcessor.from_pretrained(MODEL_PATH)
prompt = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
image_inputs, video_inputs, video_kwargs = process_vision_info(messages, return_video_kwargs=True)

mm_data = {}
if image_inputs is not None:
    mm_data["image"] = image_inputs
if video_inputs is not None:
    mm_data["video"] = video_inputs

llm_inputs = {
    "prompt": prompt,
    "multi_modal_data": mm_data,

    # FPS will be returned in video_kwargs
    "mm_processor_kwargs": video_kwargs,
}

outputs = llm.generate([llm_inputs], sampling_params=sampling_params)
generated_text = outputs[0].outputs[0].text

print(generated_text)

User prompt: Is it safe to turn right?

Assistant response:

<think> Okay, let's see. The user is asking if it's safe to turn right based on the video provided. First, I need to ... The key points here are the parked vehicles. There are cars parked on both sides of the road, ... </think>

<answer> Based on the video, turning right may not be entirely safe due to the following factors: ... </answer>
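Because the system prompt asks for a `<think>...</think>` block followed by an `<answer>...</answer>` block, downstream code usually needs to separate the two. The following is a minimal sketch of one way to do that with regular expressions. The names `ParsedResponse` and `parse_response` are hypothetical helpers introduced here for illustration and are not part of this repository. The sketch assumes the model follows the requested format and returns `None` for any block that is missing or was cut off, which can happen if `max_tokens` is too small.

```python
import re
from typing import NamedTuple, Optional


class ParsedResponse(NamedTuple):
    """Hypothetical container for the two parts of a model response."""
    reasoning: Optional[str]
    answer: Optional[str]


# Non-greedy matches; DOTALL lets "." span the newlines inside each block.
_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_response(text: str) -> ParsedResponse:
    """Split generated text into reasoning and answer.

    A block that is absent or not closed (e.g. a truncated response)
    is reported as None rather than raising an error.
    """
    think = _THINK_RE.search(text)
    answer = _ANSWER_RE.search(text)
    return ParsedResponse(
        reasoning=think.group(1).strip() if think else None,
        answer=answer.group(1).strip() if answer else None,
    )


# Illustrative input only; not real model output.
sample = (
    "<think>\nParked cars narrow the lane.\n</think>\n\n"
    "<answer>\nNot entirely safe.\n</answer>"
)
parsed = parse_response(sample)
print(parsed.answer)
```

In the inference script above, `parse_response(generated_text)` could be called right after generation. A `None` answer is a reasonable signal to retry with a larger `max_tokens`.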

SFT and RL Training

Please check our User Guide.

SFT and RL Training System Architecture

Cosmos-Reason1 provides a toolchain for large-scale SFT and RL training workloads, with the following features:

HuggingFace Integration
- Qwen-2.5
- Qwen-2.5-VL
- Qwen-3
- Qwen-3-MoE
Parallelism
- Tensor Parallelism
- Sequence Parallelism
- Context Parallelism
- FSDP Parallelism
- Pipeline Parallelism
Fully asynchronous (replicas specialization)
- Policy (Consumer): Replicas of training instances
- Rollout (Producer): Replicas of generation engines
- Low-precision training (FP8) and rollout (FP8 & FP4) support
Single-Controller Architecture
- Efficient messaging system (e.g., weight-sync, rollout, evaluate) to coordinate policy and rollout replicas
- Dynamic NCCL Process Groups for on-the-fly replica registration/unregistration to enable fault-tolerant and elastic large-scale RL training
- Dynamic hyperparameter adjustment
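The producer/consumer split between rollout and policy replicas listed above can be illustrated with a toy example. This sketch is not the Cosmos-Reason1 implementation: it uses Python threads and an in-process queue in place of separate replicas, a controller, and NCCL, and every name in it (`rollout_worker`, `policy_worker`, `run_demo`) is hypothetical. It only shows the general pattern of generation workers producing samples asynchronously while a training worker consumes them in batches.

```python
import queue
import threading


def rollout_worker(worker_id, prompts, out_queue):
    """Producer: stands in for a replica of a generation engine."""
    for prompt in prompts:
        # Placeholder for real generation; here the "completion" is a stub.
        out_queue.put(
            {"worker": worker_id, "prompt": prompt, "completion": prompt.upper()}
        )


def policy_worker(in_queue, num_expected, batch_size):
    """Consumer: stands in for a training replica that batches rollouts."""
    batches, current = [], []
    for _ in range(num_expected):
        current.append(in_queue.get())  # blocks until a rollout is available
        if len(current) == batch_size:
            batches.append(current)  # a real trainer would take a step here
            current = []
    if current:
        batches.append(current)  # final partial batch
    return batches


def run_demo():
    # A bounded queue applies backpressure if producers outpace the consumer.
    rollouts = queue.Queue(maxsize=4)
    prompt_shards = [["a", "b", "c"], ["d", "e"]]
    producers = [
        threading.Thread(target=rollout_worker, args=(i, shard, rollouts))
        for i, shard in enumerate(prompt_shards)
    ]
    for thread in producers:
        thread.start()
    total = sum(len(shard) for shard in prompt_shards)
    batches = policy_worker(rollouts, total, batch_size=2)
    for thread in producers:
        thread.join()
    return batches


batches = run_demo()
print([len(batch) for batch in batches])
```

The order in which rollouts arrive depends on thread scheduling, so only the batch sizes are deterministic here. In a real system the two roles run as separate replicas and also need a weight-synchronization path from policy to rollout, which this sketch omits.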

License and Contact

This project will download and install additional third-party open source software projects. Review the license terms of these open source projects before use.

NVIDIA Cosmos source code is released under the Apache 2 License.

NVIDIA Cosmos models are released under the NVIDIA Open Model License. For a custom license, please contact cosmos-license@nvidia.com.
