Provide Indexer to index files #27
Merged
Commits (12)
cb09531 Provide Indexer to index files (conradocloudera)
5aeed84 fix imports for local dev (conradocloudera)
2cfc7ca skip tests that require same qdrant client (conradocloudera)
5071e39 pass document id within the test (conradocloudera)
e108d7e fix the other test with the race condition (conradocloudera)
81e5e53 fix monkey patch (conradocloudera)
743d5f4 a few tweaks, fix test & a couple bugs (#29) (jkwatson)
6604420 Merge branch 'main' into cm/indexer (conradocloudera)
0561acb Merge branch 'cm/indexer' of github.com:cloudera/CML_AMP_RAG_Studio i… (conradocloudera)
02b7383 resolve mypy (conradocloudera)
ce44205 docx support (conradocloudera)
60a7cbf make ruff happy (conradocloudera)
@@ -0,0 +1,37 @@
#
# CLOUDERA APPLIED MACHINE LEARNING PROTOTYPE (AMP)
# (C) Cloudera, Inc. 2024
# All rights reserved.
#
# Applicable Open Source License: Apache 2.0
#
# NOTE: Cloudera open source products are modular software products
# made up of hundreds of individual components, each of which was
# individually copyrighted. Each Cloudera open source product is a
# collective work under U.S. Copyright Law. Your license to use the
# collective work is as provided in your written agreement with
# Cloudera. Used apart from the collective work, this file is
# licensed for your use pursuant to the open source license
# identified above.
#
# This code is provided to you pursuant a written agreement with
# (i) Cloudera, Inc. or (ii) a third-party authorized to distribute
# this code. If you do not have a written agreement with Cloudera nor
# with an authorized and properly licensed third party, you do not
# have any rights to access nor to use this code.
#
# Absent a written agreement with Cloudera, Inc. ("Cloudera") to the
# contrary, A) CLOUDERA PROVIDES THIS CODE TO YOU WITHOUT WARRANTIES OF ANY
# KIND; (B) CLOUDERA DISCLAIMS ANY AND ALL EXPRESS AND IMPLIED
# WARRANTIES WITH RESPECT TO THIS CODE, INCLUDING BUT NOT LIMITED TO
# IMPLIED WARRANTIES OF TITLE, NON-INFRINGEMENT, MERCHANTABILITY AND
# FITNESS FOR A PARTICULAR PURPOSE; (C) CLOUDERA IS NOT LIABLE TO YOU,
# AND WILL NOT DEFEND, INDEMNIFY, NOR HOLD YOU HARMLESS FOR ANY CLAIMS
# ARISING FROM OR RELATED TO THE CODE; AND (D)WITH RESPECT TO YOUR EXERCISE
# OF ANY RIGHTS GRANTED TO YOU FOR THE CODE, CLOUDERA IS NOT LIABLE FOR ANY
# DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, PUNITIVE OR
# CONSEQUENTIAL DAMAGES INCLUDING, BUT NOT LIMITED TO, DAMAGES
# RELATED TO LOST REVENUE, LOST PROFITS, LOSS OF INCOME, LOSS OF
# BUSINESS ADVANTAGE OR UNAVAILABILITY, OR LOSS OR CORRUPTION OF
# DATA.
#
@@ -0,0 +1,131 @@
from dataclasses import dataclass
import logging
import os
from typing import Dict, List, Type

from .readers.pdf import PDFReader
from .readers.nop import NopReader
from llama_index.core.readers.base import BaseReader
from llama_index.core.node_parser import SentenceSplitter
from llama_index.core.schema import Document
from llama_index.core.base.embeddings.base import BaseEmbedding
from ...services.vector_store import VectorStore
from llama_index.core.node_parser.interface import BaseNode

logger = logging.getLogger(__name__)

# Maps a file extension to the reader class that parses it.
READERS: Dict[str, Type[BaseReader]] = {
    ".pdf": PDFReader,
    ".txt": NopReader,
    ".md": NopReader,
}
# Extensions whose documents are split into chunks before embedding.
CHUNKABLE_FILE_EXTENSIONS = set(
    [
        ".pdf",
        ".txt",
        ".md",
    ]
)


@dataclass
class NotSupportedFileExtensionError(Exception):
    file_extension: str


class Indexer:
    def __init__(
        self,
        data_source_id: int,
        splitter: SentenceSplitter,
        embedding_model: BaseEmbedding,
        chunks_vector_store: VectorStore,
    ):
        self.data_source_id = data_source_id
        self.splitter = splitter
        self.embedding_model = embedding_model
        self.chunks_vector_store = chunks_vector_store

    def index_file(self, file_path: str, file_id: str):
        logger.debug(f"Indexing file: {file_path}")

        # Pick the reader from the file extension; reject unknown extensions.
        file_extension = os.path.splitext(file_path)[1]
        reader_cls = READERS.get(file_extension)
        if not reader_cls:
            raise NotSupportedFileExtensionError(file_extension)

        reader = reader_cls()

        logger.debug(f"Parsing file: {file_path}")

        documents = self._documents_in_file(reader, file_path, file_id)
        if file_extension in CHUNKABLE_FILE_EXTENSIONS:
            logger.debug(f"Chunking file: {file_path}")
            chunks = [chunk for document in documents for chunk in self._chunks_in_document(document)]
        else:
            # Non-chunkable files are indexed one document at a time.
            chunks = documents

        # Embed all chunk texts in a single batch call.
        texts = [chunk.text for chunk in chunks]
        logger.debug(f"Embedding {len(texts)} chunks")
        embeddings = self.embedding_model.get_text_embedding_batch(texts)

        for chunk, embedding in zip(chunks, embeddings):
            chunk.embedding = embedding

        logger.debug(f"Adding {len(chunks)} chunks to vector store")
        chunks_vector_store = self.chunks_vector_store.access_vector_store()
        chunks_vector_store.add(chunks)

        logger.debug(f"Indexing file: {file_path} completed")

    def _documents_in_file(self, reader: BaseReader, file_path: str, file_id: str) -> List[Document]:
        documents = reader.load_data(file_path)

        for i, document in enumerate(documents):
            # Update the document metadata
            document.metadata["file_id"] = file_id
            document.metadata["document_part_number"] = i
            document.metadata["data_source_id"] = self.data_source_id

        return documents

    def _chunks_in_document(self, document: Document) -> List[BaseNode]:
        chunks = self.splitter.get_nodes_from_documents([document])

        # Copy the parent document's identifiers onto each chunk and number the chunks.
        for j, chunk in enumerate(chunks):
            chunk.metadata["file_id"] = document.metadata["file_id"]
            chunk.metadata["document_part_number"] = document.metadata["document_part_number"]
            chunk.metadata["chunk_number"] = j
            chunk.metadata["data_source_id"] = document.metadata["data_source_id"]

        return chunks
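The control flow of `Indexer.index_file` (extension lookup, metadata tagging, chunk numbering) can be tried without llama_index or a vector store. What follows is only a minimal sketch under stated assumptions, not the PR's code: `FakeDoc`, `UnsupportedExtension`, `read_plain`, `FAKE_READERS`, `split_fixed`, and `index_text_file` are hypothetical stand-ins, the fixed-width splitter merely takes the place of `SentenceSplitter`, and the embedding and vector-store steps are left out.

```python
import os
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class FakeDoc:
    """Hypothetical stand-in for a llama_index Document/BaseNode."""
    text: str
    metadata: Dict[str, object] = field(default_factory=dict)


class UnsupportedExtension(Exception):
    """Hypothetical stand-in for NotSupportedFileExtensionError."""


def read_plain(path: str) -> List[FakeDoc]:
    # Like NopReader in the diff: the whole file becomes a single document.
    with open(path, "r", encoding="utf-8") as f:
        return [FakeDoc(text=f.read())]


# Extension -> reader registry, analogous to READERS in the diff.
FAKE_READERS: Dict[str, Callable[[str], List[FakeDoc]]] = {
    ".txt": read_plain,
    ".md": read_plain,
}


def split_fixed(doc: FakeDoc, width: int) -> List[FakeDoc]:
    # Naive fixed-width splitter (an assumption; the PR uses SentenceSplitter).
    pieces = [doc.text[i:i + width] for i in range(0, len(doc.text), width)]
    return [FakeDoc(text=p, metadata=dict(doc.metadata)) for p in pieces]


def index_text_file(path: str, file_id: str, data_source_id: int, width: int = 10) -> List[FakeDoc]:
    # 1. Choose a reader from the file extension, rejecting unknown ones.
    ext = os.path.splitext(path)[1]
    reader = FAKE_READERS.get(ext)
    if reader is None:
        raise UnsupportedExtension(ext)

    chunks: List[FakeDoc] = []
    for part, doc in enumerate(reader(path)):
        # 2. Tag each document with its origin.
        doc.metadata.update(
            file_id=file_id,
            document_part_number=part,
            data_source_id=data_source_id,
        )
        # 3. Split it and number the chunks within the document.
        for j, chunk in enumerate(split_fixed(doc, width)):
            chunk.metadata["chunk_number"] = j
            chunks.append(chunk)
    return chunks
```

Each returned chunk carries `file_id`, `document_part_number`, `chunk_number`, and `data_source_id`, the same four metadata keys the diff writes in `_chunks_in_document`.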
@@ -0,0 +1,46 @@
from typing import List
from llama_index.core.readers.base import BaseReader
from llama_index.core.schema import Document


class NopReader(BaseReader):
    def load_data(self, file_path: str) -> List[Document]:
        with open(file_path, "r") as f:
            return [Document(text=f.read())]
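`NopReader` opens the file in text mode without an `encoding` argument, so decoding follows Python's default text encoding, which has historically depended on the platform locale. One possible variant, shown here only as a sketch and not as part of this PR, pins the encoding explicitly. `PlainDocument` and `Utf8TextReader` are hypothetical names; `PlainDocument` stands in for llama_index's `Document` so the example runs with the standard library alone.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PlainDocument:
    """Hypothetical stand-in for llama_index.core.schema.Document."""
    text: str


class Utf8TextReader:
    """Hypothetical NopReader variant that pins the text encoding."""

    def __init__(self, encoding: str = "utf-8") -> None:
        self.encoding = encoding

    def load_data(self, file_path: str) -> List[PlainDocument]:
        # An explicit encoding makes the same bytes decode the same way
        # on every machine, regardless of locale settings.
        with open(file_path, "r", encoding=self.encoding) as f:
            return [PlainDocument(text=f.read())]
```

As with `NopReader`, the whole file is returned as a single document; any splitting is left to the caller.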