Skip to content

a few tweaks, fix test & a couple bugs #29

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Nov 21, 2024
Merged

Conversation

jkwatson
Copy link
Collaborator

No description provided.

@@ -60,7 +60,8 @@ def __init__(
api_base=api_base,
messages_to_prompt=messages_to_prompt,
completion_to_prompt=completion_to_prompt,
default_headers=default_headers)
default_headers=default_headers,
context=context)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this fix is actually on main, I think

@@ -101,6 +95,7 @@ def index_file(self, file_path: str, file_id: str):

for chunk, embedding in zip(chunks, embeddings):
chunk.embedding = embedding
chunk.metadata["file_name"] = os.path.basename(file_path)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure this is true unless we save with the original file name to s3. I thought we saved with the UUID

@@ -113,6 +108,7 @@ def _documents_in_file(self, reader: BaseReader, file_path: str, file_id: str) -

for i, document in enumerate(documents):
# Update the document metadata
document.id_ = file_id
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No need to couple them, right? Is there a reason why we need them to match?

@@ -124,6 +120,7 @@ def _chunks_in_document(self, document: Document) -> List[BaseNode]:

for j, chunk in enumerate(chunks):
chunk.metadata["file_id"] = document.metadata["file_id"]
chunk.metadata["document_id"] = document.metadata["file_id"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure why? Maybe document_id on both sides?

@jkwatson jkwatson merged commit 743d5f4 into cm/indexer Nov 21, 2024
1 check passed
@jkwatson jkwatson deleted the jw/cm/indexer_patch branch November 21, 2024 17:56
conradocloudera added a commit that referenced this pull request Nov 21, 2024
* Provide Indexer to index files

* fix imports for local dev

* skip tests that require same qdrant client

* pass document id within the test

* fix the other test with the race condition

* fix monkey patch

* a few tweaks, fix test & a couple bugs (#29)

* resolve mypy

* docx support

* make ruff happy

---------

Co-authored-by: John Watson <jkwatson@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants