
[WIP]: Add multimodal support via lance #39


Open
wants to merge 9 commits into
base: main


@AyushExel commented Jun 9, 2025

  • Add multimodal parsing
    • PDF
    • DOCX
    • HTML (experimental)
    • PPT
  • Support create task
    • QA
    • CoT

AyushExel and others added 8 commits May 7, 2025 14:30
This commit introduces multimodal parsing capabilities to the `ingest` command, allowing the extraction of both text and associated image data from various document formats.

Key changes include:

- A new `--multimodal` flag for the `ingest` command in `synthetic_data_kit/cli.py`.
- Updates to `synthetic_data_kit/core/ingest.py` to manage the multimodal workflow.
- Modifications to the following parsers in `synthetic_data_kit/parsers/` (DOCX, PDF, PPTX, HTML) to:
    - Extract images when `--multimodal` is enabled.
    - Implement initial image-text association logic:
        - DOCX: First image in document associated with all text blocks.
        - PDF: First image on a page associated with all text from that page.
        - PPTX: First image on a slide associated with all text from that slide.
        - HTML: Text from content tags and `alt`-text from `<img>` tags are extracted, with images linked to their `alt`-text entries.
    - Create Lance datasets in their `save` methods, with 'text' (string) and 'image' (binary, or None) columns, when in multimodal mode.
- Unit tests for the parsers' `parse()` methods have been added. The HTML and TXT tests are functional, the DOCX tests use mocking, and the PDF and PPTX `parse()` tests still have unresolved mocking issues. All `save()` method tests are currently blocked by an environment issue with the external `lance` library.
- `README.md` has been updated to document the new feature, usage, and association heuristics.
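
The "first image" heuristic listed above for PDF and PPTX can be sketched roughly as follows. This is an illustration only, not the code in this PR: the `Page` container and the `associate_first_image` function are hypothetical names, and real parsers would fill them from their own page or slide objects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Page:
    """Hypothetical container: text blocks and raw image bytes for one page or slide."""
    texts: list[str]
    images: list[bytes]


def associate_first_image(pages: list[Page]) -> list[dict]:
    """Pair every text block with the first image found on the same page.

    Pages without images yield rows whose 'image' value is None.
    """
    rows = []
    for page in pages:
        # Only the first image is used; any further images on the page are ignored.
        first_image: Optional[bytes] = page.images[0] if page.images else None
        for text in page.texts:
            rows.append({"text": text, "image": first_image})
    return rows
```

Under the same assumptions, the DOCX behaviour corresponds to passing a single `Page` that holds every text block and image in the document.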

This feature provides a foundational capability for processing documents with embedded images. Future enhancements may include more sophisticated image-text association techniques.
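
As a rough sketch of the two-column layout described for the `save` methods, the rows could be turned into 'text' and 'image' columns and written with `lance.write_dataset`. The helper names `rows_to_columns` and `write_lance` are hypothetical, and the sketch assumes the `lance` and `pyarrow` packages are installed; it is not the implementation in this PR.

```python
def rows_to_columns(rows: list[dict]) -> dict[str, list]:
    """Split row dicts into parallel 'text' and 'image' columns.

    Rows with no image keep None in the 'image' column.
    """
    texts = [row["text"] for row in rows]
    images = [row.get("image") for row in rows]
    return {"text": texts, "image": images}


def write_lance(rows: list[dict], uri: str) -> None:
    """Write the rows to a Lance dataset at `uri` (hypothetical helper)."""
    # Imported lazily so the column helper above works without these packages.
    import lance
    import pyarrow as pa

    columns = rows_to_columns(rows)
    table = pa.table({
        "text": pa.array(columns["text"], type=pa.string()),
        # A binary column accepts None for rows that have no image.
        "image": pa.array(columns["image"], type=pa.binary()),
    })
    lance.write_dataset(table, uri)
```

With its default mode, `lance.write_dataset` creates a new dataset and raises an error if one already exists at `uri`, so a real `save` method would need to decide how to handle re-runs.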
@facebook-github-bot added the CLA Signed label Jun 9, 2025