Is it possible to concatenate two document content vectors into one for better search? #26264
Hello, @NaumovEvgeniy! I'm here to help you with any issues or questions you have about LangChain. Let's tackle this together!

Yes, it is possible to combine several vectors for one document to get better search in LangChain. Rather than literally concatenating the vectors, the example below uses MultiVectorRetriever to create and use multiple vectors per document:

from langchain.storage import InMemoryByteStore
from langchain_community.document_loaders import TextLoader
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.retrievers.multi_vector import MultiVectorRetriever
import uuid
# Load and split documents
loaders = [
    TextLoader("paul_graham_essay.txt"),
    TextLoader("state_of_the_union.txt"),
]
docs = []
for loader in loaders:
    docs.extend(loader.load())
text_splitter = RecursiveCharacterTextSplitter(chunk_size=10000)
docs = text_splitter.split_documents(docs)
# Initialize vector store and retriever
vectorstore = Chroma(collection_name="full_documents", embedding_function=OpenAIEmbeddings())
store = InMemoryByteStore()
id_key = "doc_id"
retriever = MultiVectorRetriever(vectorstore=vectorstore, byte_store=store, id_key=id_key)
# Generate document IDs
doc_ids = [str(uuid.uuid4()) for _ in docs]
# Split documents into smaller chunks and associate with parent document
child_text_splitter = RecursiveCharacterTextSplitter(chunk_size=400)
sub_docs = []
for i, doc in enumerate(docs):
    _id = doc_ids[i]
    _sub_docs = child_text_splitter.split_documents([doc])
    for _doc in _sub_docs:
        _doc.metadata[id_key] = _id
    sub_docs.extend(_sub_docs)
# Index documents in vector store and document store
retriever.vectorstore.add_documents(sub_docs)
retriever.docstore.mset(list(zip(doc_ids, docs)))
# Example search
retrieved_docs = retriever.invoke("justice breyer")
print(len(retrieved_docs[0].page_content))

This code demonstrates how to split documents into smaller chunks, embed them, and associate each chunk with its parent document for improved retrieval [1].
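To make the parent/child association easier to see, here is a minimal, dependency-free sketch of the same idea. It is an illustration only, not LangChain's implementation: the toy documents, the `split_words`, `score`, and `retrieve` helpers, and the word-overlap scoring (a stand-in for embedding similarity) are all assumptions made for this example.

```python
import uuid

# Hypothetical toy "parent" documents standing in for the large chunks.
parents = [
    "Justice Breyer served on the Supreme Court for many years.",
    "Startups grow quickly when founders talk to their users.",
]


def split_words(text, size):
    """Split text into child chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


# Document store: parent id -> full parent text (plays the role of the byte store).
docstore = {str(uuid.uuid4()): text for text in parents}

# Child index: (child text, parent id) pairs (plays the role of the vector store).
child_index = []
for parent_id, text in docstore.items():
    for child in split_words(text, 4):
        child_index.append((child, parent_id))


def score(query, text):
    """Count shared lowercase words; a crude stand-in for vector similarity."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve(query, k=2):
    """Search the small chunks, then return the de-duplicated parent texts."""
    ranked = sorted(child_index, key=lambda item: score(query, item[0]), reverse=True)
    parent_ids = []
    for child, parent_id in ranked[:k]:
        if score(query, child) > 0 and parent_id not in parent_ids:
            parent_ids.append(parent_id)
    return [docstore[pid] for pid in parent_ids]


results = retrieve("justice breyer")
print(results)
```

The search runs over the small chunks, but the caller gets back the full parent text, which is why the length printed in the LangChain example is that of the large document rather than of a 400-character chunk.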
Description
I split the document into four parts to create embeddings. Does LangChain have a mechanism to concatenate the resulting vectors? An example of the split document is in this repo: https://github.com/NaumovEvgeniy/langchain-concat-documents
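For context on what "concatenating vectors" would mean in practice, here is a small sketch in plain Python comparing two ways of merging per-chunk embeddings by hand. This is not a LangChain API: the helper names and the tiny three-dimensional vectors are hypothetical, and real embeddings would come from an embedding model. Literal concatenation changes the dimensionality, so the result can no longer be compared with a query embedded by the same model, whereas averaging (mean pooling) keeps the original dimension.

```python
import math


def mean_pool(vectors):
    """Average equally sized vectors element-wise; the dimension is unchanged."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def concatenate(vectors):
    """Join vectors end to end; the dimension grows with the number of chunks."""
    return [x for v in vectors for x in v]


def cosine(a, b):
    """Cosine similarity between two vectors of the same length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Hypothetical embeddings for the four parts of one document.
chunk_vectors = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.0],
    [0.0, 0.0, 1.0],
]
# Hypothetical query embedding with the same dimension as each chunk vector.
query_vector = [1.0, 1.0, 0.0]

pooled = mean_pool(chunk_vectors)
joined = concatenate(chunk_vectors)

print(len(pooled), len(joined))
print(cosine(pooled, query_vector))
```

Whether a pooled vector actually improves search depends on the data; a single averaged vector can blur distinct topics, which is one reason to keep multiple vectors per document instead.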
System Info
langchain==0.2.16
langchain-community==0.2.16
langchain-core==0.2.38
langchain-text-splitters==0.2.4
platform(linux)
Python 3.8.19