NVIDIA Cosmos Reason 2 Brings Advanced Reasoning To Physical AI

Community Article Published January 5, 2026

NVIDIA today released Cosmos Reason 2, the latest advancement in open reasoning vision-language models for physical AI. Cosmos Reason 2 surpasses its predecessor in accuracy and tops the Physical AI Bench and Physical Reasoning leaderboards as the #1 open model for visual understanding.

NVIDIA Cosmos Reason 2: Reasoning Vision Language Model for Physical AI

Since their introduction, vision-language models have rapidly improved at tasks like object and pattern recognition in images. But they still struggle with tasks humans find natural, like planning several steps ahead, dealing with uncertainty or adapting to new situations. Cosmos Reason is designed to close this gap by giving robots and AI agents stronger common sense and reasoning to solve complex problems step by step.

Cosmos Reason 2 is a state-of-the-art, open reasoning vision-language model (VLM) that enables robots and AI agents to see, understand, plan, and act in the physical world like humans. It draws on common sense, physics, and prior knowledge to understand how objects move across space and time, which lets it handle complex tasks, adapt to new situations, and work through problems step by step.

✨ Key Highlights

  • Improved spatio-temporal understanding and timestamp precision.

  • Optimized performance and flexible deployment from edge to cloud, with 2B and 8B parameter model sizes.

  • Support for an expanded set of spatial understanding and visual perception capabilities: 2D/3D point localization, bounding box coordinates, trajectory data, and OCR.

  • Improved long-context understanding with 256K input tokens, up from 16K with Cosmos Reason 1.

  • Adaptable to multiple use cases with easy-to-use Cosmos Cookbook recipes.

🤖 Popular Use Cases

  • Video analytics AI agents: These agents can extract valuable insights from massive volumes of video data to optimize processes. Cosmos Reason 2 builds on the capabilities of Cosmos Reason 1 and now provides OCR support, 2D/3D point localization, and set-of-mark understanding.

    Example of how Cosmos Reason can understand text embedded within a video to determine the condition of the road during a rainstorm.

    Developers can jumpstart development of video analytics AI agents by using the NVIDIA blueprint for video search and summarization (VSS) with Cosmos Reason as the VLM.

    Salesforce is transforming workplace safety and compliance by analyzing video footage captured by Cobalt robots using Agentforce and the VSS blueprint, with Cosmos Reason as the VLM.

  • Data annotation and critique: Cosmos Reason enables developers to automate high-quality annotation and critique of massive, diverse training datasets, providing timestamps and detailed descriptions for real or synthetically generated training videos.

    Example of a sample prompt to generate detailed, time-stamped captions for a race car video.

    Uber is exploring Cosmos Reason 2 to deliver accurate, searchable video captions for autonomous vehicle (AV) training data, enabling efficient identification of critical driving scenarios. The co-authored Reason 2 for AV Video Captioning and VQA recipe demonstrates how to fine-tune and evaluate Cosmos Reason 2-8B on annotated AV videos. Measurable improvements were achieved across multiple evaluation metrics: BLEU improved 10.6% (0.113 → 0.125), MCQ-based VQA gained 0.67 percentage points (80.18% → 80.85%), and LingoQA rose 13.8 percentage points (63.2% → 77.0%). These gains demonstrate effective domain adaptation for AV applications.

  • Robot planning and reasoning: Cosmos Reason 2 can act as the brain for deliberate, methodical decision-making in a robot vision-language-action (VLA) model, and it now provides trajectory coordinates in addition to determining next steps.

    Example of a prompt and the JSON output from Cosmos Reason 2 giving the steps and trajectory the robot gripper needs to move the painter’s tape into the basket.

    Encord provides native support for Cosmos Reason 2 in its Data Agent library and AI data platform, enabling developers to leverage Cosmos Reason 2 as a VLA for robotics and other physical AI use cases.
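The Uber figures mix two kinds of change: BLEU is reported as a relative gain, while the VQA and LingoQA scores are already percentages, so their differences are naturally stated in percentage points. For LingoQA, 63.2% → 77.0% is a gain of 13.8 percentage points, which corresponds to roughly a 21.8% relative improvement. The short Python sketch below is plain arithmetic (no Cosmos APIs involved, and the function names are illustrative) that recomputes both forms from the reported before/after values.

```python
# Recompute the reported AV captioning deltas, separating relative change
# (percent of the baseline) from absolute change (percentage points when
# the metric itself is a percentage). Values are the before/after numbers
# quoted in the article; function names are illustrative only.


def relative_change_pct(before: float, after: float) -> float:
    """Change expressed as a percentage of the baseline value."""
    return (after - before) / before * 100.0


def absolute_change(before: float, after: float) -> float:
    """Raw difference; in percentage points when inputs are percentages."""
    return after - before


# (before, after) pairs as reported for Cosmos Reason 2-8B fine-tuning.
METRICS = {
    "BLEU": (0.113, 0.125),
    "MCQ-based VQA (%)": (80.18, 80.85),
    "LingoQA (%)": (63.2, 77.0),
}

if __name__ == "__main__":
    for name, (before, after) in METRICS.items():
        print(
            f"{name}: absolute {absolute_change(before, after):+.2f}, "
            f"relative {relative_change_pct(before, after):+.1f}%"
        )
```

Stating the unit explicitly (percent of baseline versus percentage points) keeps deltas on different scales comparable: the MCQ gain is 0.67 points but under 1% relative, while LingoQA's 13.8 points is over 20% relative.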

Companies like Hitachi, Milestone and VAST Data are using Cosmos Reason to advance robotics, autonomous driving, and video analytics AI agents for traffic and workplace safety.

Try Cosmos Reason 2 on build.nvidia.com and experience the latest features with sample prompts for generating bounding boxes and robot trajectories. Upload your own videos and images for further analysis.
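When a prompt asks for bounding boxes or robot trajectories, the client-side step is turning the model's JSON into pixel-space geometry. The sketch below assumes a hypothetical response schema (the `objects`, `label`, `bbox_2d`, and `trajectory` keys) and coordinates normalized to [0, 1]; neither is taken from the Cosmos Reason 2 documentation, so check the model card and your own prompt for the actual format before adapting it. It parses the response, scales boxes and trajectory points to a given frame size, and measures the trajectory length in pixels.

```python
import json
import math
from dataclasses import dataclass

# HYPOTHETICAL response for illustration only: real field names and the
# coordinate convention depend on your prompt and the model documentation.
SAMPLE_RESPONSE = """
{
  "objects": [
    {"label": "painter's tape", "bbox_2d": [0.42, 0.55, 0.51, 0.63]}
  ],
  "trajectory": [[0.46, 0.59], [0.50, 0.45], [0.62, 0.40], [0.70, 0.52]]
}
"""


@dataclass
class Box:
    label: str
    x1: int
    y1: int
    x2: int
    y2: int


def to_pixels(x: float, y: float, width: int, height: int) -> tuple[int, int]:
    """Scale ASSUMED normalized [0, 1] coordinates to integer pixel coordinates."""
    return round(x * width), round(y * height)


def parse_response(text: str, width: int, height: int):
    """Return (boxes, trajectory) in pixel space from the hypothetical schema."""
    data = json.loads(text)
    boxes = []
    for obj in data.get("objects", []):
        x1, y1, x2, y2 = obj["bbox_2d"]
        px1, py1 = to_pixels(x1, y1, width, height)
        px2, py2 = to_pixels(x2, y2, width, height)
        boxes.append(Box(obj["label"], px1, py1, px2, py2))
    path = [to_pixels(x, y, width, height) for x, y in data.get("trajectory", [])]
    return boxes, path


def path_length(path) -> float:
    """Total Euclidean length of a polyline, in the same units as its points."""
    return sum(math.dist(a, b) for a, b in zip(path, path[1:]))


if __name__ == "__main__":
    boxes, path = parse_response(SAMPLE_RESPONSE, width=1280, height=720)
    print(boxes)
    print(path, f"{path_length(path):.1f}px")
```

If the model returns absolute pixel coordinates instead, drop the scaling in `to_pixels` and keep the rest of the parsing unchanged.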

Download the Cosmos Reason 2 models (2B and 8B) from Hugging Face, or use Cosmos Reason 2 in the cloud. The model will soon be available on Amazon Web Services, Google Cloud and Microsoft Azure. To get started, check out the Cosmos Reason 2 documentation and the Cosmos Cookbook.

Other Models From The Cosmos Family:

🔮 Cosmos Predict 2.5

Cosmos Predict is a generative AI model that predicts future states of the physical world as video, based on text, image, or video inputs.

  • Physical AI Bench leader for quality, accuracy and overall consistency.
  • Up to 30 seconds of physically and temporally consistent video per generation.
  • Supports multiple frame rates and resolutions.
  • Pre-trained on 200 million clips.
  • Available as 2B and 14B pre-trained models and various 2B post-trained models for multiview, action conditioning and autonomous vehicle training.

Check out the model card.

๐Ÿ” Cosmos Transfer 2.5

Cosmos Transfer is our lightest multicontrol model, built for video-to-world style transfer.

  • Scale a single simulation or spatial video across various environments and lighting conditions.
  • Improved prompt adherence and physics alignment.
  • Use with NVIDIA Isaac Sim™ or NVIDIA Omniverse NuRec for simulation-to-real transformation.

Check out the model card.

🤖 NVIDIA GR00T N1.6

NVIDIA GR00T N1.6 is an open reasoning vision language action (VLA) model, purpose-built for humanoid robots, that unlocks full body control and uses NVIDIA Cosmos Reason for better reasoning and contextual understanding.

Resources

๐Ÿง‘๐Ÿปโ€๐Ÿณ Read the Cosmos Cookbook โ†’ https://nvda.ws/4qevli8

๐Ÿ“š Explore Models & Datasets โ†’ https://github.com/nvidia-cosmos

โฌ‡๏ธ Try Cosmos Models in our Hosted Catalog โ†’ https://nvda.ws/3Yg0Dcx

๐Ÿ’ป Join the Cosmos Community โ†’ https://discord.gg/u23rXTHSC9

๐Ÿ—ณ๏ธ Contribute to the Cosmos Cookbook โ†’ https://nvda.ws/4aQcBkk
