
feat(transform, chat, gemini, media): Gemini enable video processing #6150


Open
wants to merge 2 commits into
base: main
from

Conversation

VooDisss

@VooDisss VooDisss commented Jul 24, 2025

Related GitHub Issue

Closes: #6144
Big thanks to jordanrendric for providing a working proof of concept showing that this is possible.

Roo Code Task Context (Optional)

Description

This pull request introduces support for video content processing for Gemini models. The changes generalize the media handling in the chat UI to support both images and videos, and update the data transformation logic to correctly format video content for the Gemini API.

Key implementation details:

  • Gemini Transformer Update: The gemini-format.ts transformer has been extended to process video content blocks, ensuring they are correctly converted to the Gemini API format. It also now sorts content parts to place media before text.
  • Generalized Media UI: The chat interface has been refactored to handle generic media types instead of just images. This includes:
    • Dynamic File Type Acceptance: ChatView.tsx now dynamically determines the accepted file types based on the selected model's capabilities, enabling video formats like MP4, MOV, etc., for supported Gemini models.
    • MIME Type Utility: A new getMimeType utility function was added to reliably identify media types from data URIs.
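
For readers who have not opened the diff, here is a rough sketch of what a data-URI MIME lookup along the lines of getMimeType could look like. It is an illustration only and not the code in this PR: the return type, the regular expression, and the isVideoDataUri helper are assumptions made for the example.

```typescript
// Hypothetical sketch, not the PR's actual implementation.
// A data URI has the form: data:<mime-type>[;param=value][;base64],<payload>
function getMimeType(dataUri: string): string | null {
	// Capture everything between "data:" and the first ";" or ",".
	const match = /^data:([^;,]+)[;,]/i.exec(dataUri)
	return match?.[1]?.toLowerCase() ?? null
}

// Assumed helper (not named in the PR) showing how a UI could branch on the result.
function isVideoDataUri(dataUri: string): boolean {
	return getMimeType(dataUri)?.startsWith("video/") ?? false
}
```

A thumbnail component could use a check like isVideoDataUri to decide whether to render a video preview or an image preview for a given attachment.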

Test Procedure

The changes have been tested through both automated unit tests and manual verification.

Unit Tests:

  • New unit tests have been added to gemini-format.spec.ts to cover video and mixed-media content transformations.
  • To run the tests, execute the following command from the src directory: npx vitest run api/transform/__tests__/gemini-format.spec.ts

Manual Testing:
Reviewers can verify the changes by following these steps:

  1. Select a Gemini model that supports video input (e.g., gemini-2.5-pro).
  2. Drag and drop or paste a video file (e.g., an .mp4 or .mov file) into the chat text area.
  3. Verify that a video thumbnail appears in the composer.
  4. Send a message containing the video.
  5. Verify that the message is sent and the model processes the video content in its response.
  6. Test with other media combinations (e.g., images, text and video) to ensure they are handled correctly.

Pre-Submission Checklist

  • Issue Linked: This PR is linked to an approved GitHub Issue (see "Related GitHub Issue" above).
  • Scope: My changes are focused on the linked issue (one major feature/fix per PR).
  • Self-Review: I have performed a thorough self-review of my code.
  • Testing: New and/or updated tests have been added to cover my changes (if applicable).
  • Documentation Impact: I have considered if my changes require documentation updates (see "Documentation Updates" section below).
  • Contribution Guidelines: I have read and agree to the Contributor Guidelines.

Screenshots / Videos

Code_24_17_29_20___8306

Documentation Updates

  • No documentation updates are required.

Additional Notes

This change paves the way for supporting more multimodal inputs in the future.

Get in Touch


Important

Enable video processing for Gemini models by extending media handling in transformation logic and UI components.

  • Behavior:
    • Extend convertAnthropicContentToGemini in gemini-format.ts to support video content blocks.
    • Sort content parts to place media before text.
  • UI Components:
    • Replace selectedImages with selectedMedia in ChatView.tsx, ChatRow.tsx, and ChatTextArea.tsx.
    • Add MediaThumbnails component for displaying media thumbnails.
    • ChatView.tsx dynamically determines accepted file types based on model capabilities.
  • Utilities:
    • Add getMimeType utility to identify media types from data URIs.
  • Testing:
    • Add unit tests in gemini-format.spec.ts for video content transformation.
    • Update ChatTextArea.spec.tsx to test new media handling logic.
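
The transformation behavior summarized above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in gemini-format.ts: the block types are simplified, the shape of the video block is assumed to mirror a base64 image block, and toGeminiParts is a hypothetical name. The output shape (a text part or an inlineData part with mimeType and data) follows the Gemini API's part format.

```typescript
// Simplified input blocks. The "video" shape is an assumption for this sketch.
type MediaSource = { type: "base64"; media_type: string; data: string }
type TextBlock = { type: "text"; text: string }
type ImageBlock = { type: "image"; source: MediaSource }
type VideoBlock = { type: "video"; source: MediaSource }
type ContentBlock = TextBlock | ImageBlock | VideoBlock

// Gemini-style parts: either plain text or inline base64 media.
type GeminiPart = { text: string } | { inlineData: { mimeType: string; data: string } }

// Hypothetical helper name; converts blocks and places media parts before text parts.
function toGeminiParts(blocks: ContentBlock[]): GeminiPart[] {
	const parts = blocks.map((block): GeminiPart => {
		switch (block.type) {
			case "text":
				return { text: block.text }
			case "image":
			case "video":
				// Images and videos are both sent as inline data with their MIME type.
				return { inlineData: { mimeType: block.source.media_type, data: block.source.data } }
		}
	})
	// Array.prototype.sort is stable, so relative order within each group is preserved.
	const rank = (part: GeminiPart): number => ("inlineData" in part ? 0 : 1)
	return [...parts].sort((a, b) => rank(a) - rank(b))
}
```

Because the sort is stable, a message with several media attachments keeps them in the order the user added them, and only the text is moved to the end.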

This description was created by Ellipsis for ec674d2.

This commit introduces support for video content by updating the Gemini transformer and generalizing media handling in the UI.

Key changes include:
- Extending gemini-format.ts to process video content blocks.
- Adding tests to gemini-format.spec.ts to validate video and mixed media handling.
- Refactoring the chat UI to use a generic MediaThumbnails component.
- Introducing a getMimeType utility for identifying media types from data URIs.

Closes: RooCodeInc#6144
@VooDisss VooDisss requested review from mrubens, cte and jr as code owners July 24, 2025 05:10
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. enhancement New feature or request labels Jul 24, 2025
@hannesrudolph hannesrudolph added the Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. label Jul 24, 2025
hannesrudolph added a commit that referenced this pull request Jul 24, 2025
- Consolidate duplicate getMimeType functions into shared utilities
- Remove duplicate MediaThumbnails component, enhance Thumbnails to support video
- Add JSDoc comments to VideoContentBlock interface
- Convert inline styles to Tailwind classes in ChatRow
- Add robust error handling for video processing
- Create centralized media configuration for accepted file types
- Ensure consistent test naming conventions
- Fix ESLint warnings
@hannesrudolph
Collaborator

Hi @VooDisss,

I've addressed all the review feedback in a new branch pr-6150. The changes include:

  1. ✅ Consolidated duplicate getMimeType functions into shared utilities
  2. ✅ Removed duplicate MediaThumbnails component and enhanced Thumbnails to support video
  3. ✅ Added JSDoc comments to the VideoContentBlock interface
  4. ✅ Converted inline styles to Tailwind classes in ChatRow
  5. ✅ Added robust error handling for video processing
  6. ✅ Created centralized media configuration for accepted file types
  7. ✅ Fixed all ESLint warnings
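
To illustrate item 6 for anyone skimming the thread, a centralized configuration for accepted file types might look something like the sketch below. It is not the code on the pr-6150 branch: the capability flag names, the extension lists, and both function names are assumptions chosen for the example (only MP4 and MOV are named in the PR description).

```typescript
// Hypothetical capability flags; the real model info type may differ.
interface ModelCapabilities {
	supportsImages?: boolean
	supportsVideo?: boolean
}

// Assumed extension lists kept in one place so every component agrees on them.
const IMAGE_EXTENSIONS = ["png", "jpeg", "webp"] as const
const VIDEO_EXTENSIONS = ["mp4", "mov"] as const

// Returns the extensions the composer should accept for the selected model.
function getAcceptedFileTypes(caps: ModelCapabilities): string[] {
	return [
		...(caps.supportsImages ? IMAGE_EXTENSIONS : []),
		...(caps.supportsVideo ? VIDEO_EXTENSIONS : []),
	]
}

// Formats the list for an <input type="file" accept="..."> attribute.
function toAcceptAttribute(extensions: string[]): string {
	return extensions.map((ext) => `.${ext}`).join(",")
}
```

With this shape, a model that only supports images never offers video formats in the file picker, and adding a new format is a one-line change to the shared list.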

You can view the changes at: https://github.com/RooCodeInc/Roo-Code/tree/pr-6150

Since this PR is from your fork, you'll need to either:

  • Cherry-pick the commits from the pr-6150 branch into your fork's main branch
  • Or close this PR and I can create a new one with all the changes

Thank you for your contribution!

@dosubot dosubot bot added size:XL This PR changes 500-999 lines, ignoring generated files. and removed size:L This PR changes 100-499 lines, ignoring generated files. labels Jul 24, 2025
@VooDisss
Author

@hannesrudolph thank you for your edits in your pr-6150.

I ran gh pr checkout 6150 and compiled the extension, and it works. Your edits even fixed some parts of the UI, thank you!
I'm attaching a .gif showing that it works:

Code_24_17_29_20___8306

@daniel-lxs daniel-lxs moved this from Triage to PR [Needs Prelim Review] in Roo Code Roadmap Jul 24, 2025
@hannesrudolph hannesrudolph added PR - Needs Preliminary Review and removed Issue/PR - Triage New issue. Needs quick review to confirm validity and assign labels. labels Jul 24, 2025
Labels
enhancement New feature or request PR - Needs Preliminary Review size:XL This PR changes 500-999 lines, ignoring generated files.
Projects
Status: PR [Needs Prelim Review]
Development

Successfully merging this pull request may close these issues.

feat: Enable Video Uploads for Multimodal Analysis
2 participants