Skip to content

feat: allow mime_type to be guessed for ByteStream #9573

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged

Conversation

kanenorman
Copy link
Contributor

Related Issues

Proposed Changes:

  • Add guess_mime_type parameter to ByteStream
  • Refactor FileTypeRouter
  • Add guess_mime_type parameter to haystack.components.converters.utils.get_bytestream_from_source

How did you test it?

  • Added unit test for ByteStream
  • Added unit test for get_bytestream_from_source
  • Did not change test for FileTypeRouter

Notes for the reviewer

This is a breaking change so would like feedback 😄

Checklist

  • I have read the contributors guidelines and the code of conduct
  • I have updated the related issue with new insights and changes
  • I added unit tests and updated the docstrings
  • I've used one of the conventional commit types for my PR title: fix:, feat:, build:, chore:, ci:, docs:, style:, refactor:, perf:, test: and added ! in case the PR includes breaking changes.
  • I documented my code
  • I ran pre-commit hooks and fixed any issue

@kanenorman kanenorman requested a review from a team as a code owner June 30, 2025 22:47
@kanenorman kanenorman requested review from julian-risch and removed request for a team June 30, 2025 22:47
@github-actions github-actions bot added topic:tests type:documentation Improvements on the docs labels Jun 30, 2025
Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @kanenorman Thank you for opening this pull request!
I left a few comments and also saw that three unit tests are failing here:

FAILED test/components/converters/test_multi_file_converter.py::TestMultiFileConverter::test_run_error_handling - haystack.core.errors.PipelineRuntimeError: The following component failed to run:
Component name: 'router'
Component type: 'FileTypeRouter'
Error: [Errno 2] No such file or directory: '/home/runner/work/haystack/haystack/test/test_files/non_existent.txt'
FAILED test/components/routers/test_file_router.py::TestFileTypeRouter::test_unlisted_extensions - FileNotFoundError: [Errno 2] No such file or directory: '/home/runner/work/haystack/haystack/test/test_files/audio/ignored.mp3'
FAILED test/components/routers/test_file_router.py::TestFileTypeRouter::test_no_extension - FileNotFoundError: [Errno 2] No such file or directory: '/home/runner/work/haystack/haystack/test/test_files/txt/doc_2'

Please let me know if you need help looking into those. Other than need the pull request only needs a release note and we describe how you can generate one here. Please let me know in case you need help with that.
I was wondering why you wrote in the pull request description that it is a breaking change? With guess_mime_type being False as default and with FileTypeRouter using guess_mime_type True, the behavior remains the same, no?

Other than that the pull request looks quite good to me already! 🙂

@@ -8,20 +8,24 @@
from haystack.dataclasses import ByteStream


def get_bytestream_from_source(source: Union[str, Path, ByteStream]) -> ByteStream:
def get_bytestream_from_source(
source: Union[str, Path, ByteStream], guess_mime_type: Optional[bool] = False
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that we have False as a default value, there is no reason to make this Optional. Let's change the type to bool

Suggested change
source: Union[str, Path, ByteStream], guess_mime_type: Optional[bool] = False
source: Union[str, Path, ByteStream], guess_mime_type: bool = False

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and in from_file_path you also use bool instead of Optional[bool]

@@ -31,3 +33,38 @@ def test_normalize_metadata_list_of_wrong_size():
def test_normalize_metadata_other_type():
with pytest.raises(ValueError, match="meta must be either None, a dictionary or a list of dictionaries."):
normalize_metadata(({"a": 1},), sources_count=1)


def test_get_bytestream_from_path_object(tmp_path):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In addition to the test cases here, I suggest we add a test case for FileTypeRouter in test/components/routers/test_file_router.py checking that the parameter additional_mimetypes still works as expected. We're doing mimetypes.add_type(mime, ext) there and need to check that when get_bytestream_from_source it's taking into account these additional mimetypes, for example: {"application/vnd.openxmlformats-officedocument.wordprocessingml.document": ".docx"}.

@kanenorman
Copy link
Contributor Author

Thanks for the feedback @julian-risch. I’ll get the changes implemented shortly.

I was wondering why you wrote in the pull request description that it is a breaking change? With guess_mime_type being False by default and FileTypeRouter using guess_mime_type=True, the behavior remains the same, no?

You’re absolutely right. I misspoke. Thanks for pointing that out! 😄

Regarding the failing unit test, I’ve identified the issue and will push a fix.

Before proceeding, I reviewed the latest haystack-ai source code (v2.15.1) and noticed that FileTypeRouter raises a FileNotFoundError only when a non-existent file is passed with the meta parameter. Without meta, the component does not raise an error and still classifies the file based on its extension.

Here’s a minimal example:

python-version: 3.11.3
mac-os: Sequoia 15
haystack-ai: 2.15.1

from haystack.components.routers import FileTypeRouter

router = FileTypeRouter(mime_types=[r'text/plain'])

# No meta - does not raise error
router.run(sources=["non_existent.txt"])
# → {'text/plain': [PosixPath('non_existent.txt')]}

# With meta - raises FileNotFoundError
router.run(sources=["non_existent.txt"], meta={"spam": "eggs"})
# → FileNotFoundError: [Errno 2] No such file or directory: 'non_existent.txt'

This feels inconsistent. Should metadata change how we treat missing files? Should the component:

  • Maintain existing behavior?
  • Skip missing files?
  • Raise an error in all cases?
  • Still classify the path even if the file doesn't exist?

Happy to align with whatever behavior is intended, just wanted to clarify before committing a fix.

@julian-risch
Copy link
Member

Hi @kanenorman thanks for your patience! We discussed in the team and concluded that for now we would like to keep the inconsistency and not fix it in this PR. That makes this PR a bit more complex but doesn't break any current behavior.
We'll then open a separate issue on the inconsistency and plan to have a deprecation warning in the 2.16.0 release, and consistent behavior in 2.17.0.

@kanenorman kanenorman requested a review from a team as a code owner July 7, 2025 15:43
@kanenorman kanenorman requested review from dfokina and removed request for a team July 7, 2025 15:43
@coveralls
Copy link
Collaborator

coveralls commented Jul 11, 2025

Pull Request Test Coverage Report for Build 16437756318

Warning: This coverage report may be inaccurate.

This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.

Details

  • 0 of 0 changed or added relevant lines in 0 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.01%) to 90.464%

Totals Coverage Status
Change from base Build 16421645889: 0.01%
Covered Lines: 12077
Relevant Lines: 13350

💛 - Coveralls

Copy link
Member

@julian-risch julian-risch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me! 👍 Congratulations on your first pull request merged into Haystack!
The pull request now doesn't introduce any breaking changes, which is great. How you handled windows/unix mimetype differences in the test makes sense to me.

@julian-risch julian-risch changed the title feat: bytestream guess mime type feat: allow mime_type to be guessed for ByteStream Jul 22, 2025
@julian-risch julian-risch enabled auto-merge (squash) July 22, 2025 07:38
@julian-risch julian-risch merged commit 9420492 into deepset-ai:main Jul 22, 2025
19 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
topic:tests type:documentation Improvements on the docs
Projects
None yet
Development

Successfully merging this pull request may close these issues.

feat: allow mime_type to be guessed for ByteStream
3 participants