-
Notifications
You must be signed in to change notification settings - Fork 2.3k
feat: allow mime_type to be guessed for ByteStream #9573
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: allow mime_type to be guessed for ByteStream #9573
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello @kanenorman Thank you for opening this pull request!
I left a few comments and also saw that three unit tests are failing here:
FAILED test/components/converters/test_multi_file_converter.py::TestMultiFileConverter::test_run_error_handling - haystack.core.errors.PipelineRuntimeError: The following component failed to run:
Component name: 'router'
Component type: 'FileTypeRouter'
Error: [Errno 2] No such file or directory: '/home/runner/work/haystack/haystack/test/test_files/non_existent.txt'
FAILED test/components/routers/test_file_router.py::TestFileTypeRouter::test_unlisted_extensions - FileNotFoundError: [Errno 2] No such file or directory: '/home/runner/work/haystack/haystack/test/test_files/audio/ignored.mp3'
FAILED test/components/routers/test_file_router.py::TestFileTypeRouter::test_no_extension - FileNotFoundError: [Errno 2] No such file or directory: '/home/runner/work/haystack/haystack/test/test_files/txt/doc_2'
Please let me know if you need help looking into those. Other than need the pull request only needs a release note and we describe how you can generate one here. Please let me know in case you need help with that.
I was wondering why you wrote in the pull request description that it is a breaking change? With guess_mime_type
being False as default and with FileTypeRouter using guess_mime_type
True, the behavior remains the same, no?
Other than that the pull request looks quite good to me already! 🙂
@@ -8,20 +8,24 @@ | |||
from haystack.dataclasses import ByteStream | |||
|
|||
|
|||
def get_bytestream_from_source(source: Union[str, Path, ByteStream]) -> ByteStream: | |||
def get_bytestream_from_source( | |||
source: Union[str, Path, ByteStream], guess_mime_type: Optional[bool] = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that we have False
as a default value, there is no reason to make this Optional. Let's change the type to bool
source: Union[str, Path, ByteStream], guess_mime_type: Optional[bool] = False | |
source: Union[str, Path, ByteStream], guess_mime_type: bool = False |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
and in from_file_path
you also use bool
instead of Optional[bool]
@@ -31,3 +33,38 @@ def test_normalize_metadata_list_of_wrong_size(): | |||
def test_normalize_metadata_other_type(): | |||
with pytest.raises(ValueError, match="meta must be either None, a dictionary or a list of dictionaries."): | |||
normalize_metadata(({"a": 1},), sources_count=1) | |||
|
|||
|
|||
def test_get_bytestream_from_path_object(tmp_path): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In addition to the test cases here, I suggest we add a test case for FileTypeRouter in test/components/routers/test_file_router.py
checking that the parameter additional_mimetypes
still works as expected. We're doing mimetypes.add_type(mime, ext)
there and need to check that when get_bytestream_from_source
it's taking into account these additional mimetypes, for example: {"application/vnd.openxmlformats-officedocument.wordprocessingml.document": ".docx"}
.
Thanks for the feedback @julian-risch. I’ll get the changes implemented shortly.
You’re absolutely right. I misspoke. Thanks for pointing that out! 😄 Regarding the failing unit test, I’ve identified the issue and will push a fix. Before proceeding, I reviewed the latest haystack-ai source code (v2.15.1) and noticed that Here’s a minimal example: python-version: 3.11.3 from haystack.components.routers import FileTypeRouter
router = FileTypeRouter(mime_types=[r'text/plain'])
# No meta - does not raise error
router.run(sources=["non_existent.txt"])
# → {'text/plain': [PosixPath('non_existent.txt')]}
# With meta - raises FileNotFoundError
router.run(sources=["non_existent.txt"], meta={"spam": "eggs"})
# → FileNotFoundError: [Errno 2] No such file or directory: 'non_existent.txt' This feels inconsistent. Should metadata change how we treat missing files? Should the component:
Happy to align with whatever behavior is intended, just wanted to clarify before committing a fix. |
Hi @kanenorman thanks for your patience! We discussed in the team and concluded that for now we would like to keep the inconsistency and not fix it in this PR. That makes this PR a bit more complex but doesn't break any current behavior. |
Pull Request Test Coverage Report for Build 16437756318Warning: This coverage report may be inaccurate.This pull request's base commit is no longer the HEAD commit of its target branch. This means it includes changes from outside the original pull request, including, potentially, unrelated coverage changes.
Details
💛 - Coveralls |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me! 👍 Congratulations on your first pull request merged into Haystack!
The pull request now doesn't introduce any breaking changes, which is great. How you handled windows/unix mimetype differences in the test makes sense to me.
Related Issues
Proposed Changes:
guess_mime_type
parameter toByteStream
FileTypeRouter
guess_mime_type
parameter tohaystack.components.converters.utils.get_bytestream_from_source
How did you test it?
ByteStream
get_bytestream_from_source
FileTypeRouter
Notes for the reviewer
This is a breaking change so would like feedback 😄
Checklist
fix:
,feat:
,build:
,chore:
,ci:
,docs:
,style:
,refactor:
,perf:
,test:
and added!
in case the PR includes breaking changes.