Using AzureSearch with embeddings.embed_documents instead of embeddings.embed_query #28964

AsmaaMHadir · 2024-12-30T00:05:03Z

AsmaaMHadir
Dec 30, 2024

I have an Azure search vector store defined which uses AzureOpenAIEmbeddings to imbed incoming documents:

    embeddings = AzureOpenAIEmbeddings(
        azure_endpoint=os.getenv("AZURE_OPENAI_EMBEDDINGS_DEPLOYMENT"),
        api_version=os.getenv("AZURE_EMBEDDINGS_API_VERSION"),
        api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    )
    
    #TODO: inspect 
    embedding_function = embeddings.embed_query

vector_store = AzureSearch(
        azure_search_endpoint=os.getenv("VECTOR_STORE_ADDRESS"),
        azure_search_key=os.getenv("VECTOR_STORE_PASSWORD"),
        index_name=INDEX_NAME,
        embedding_function=embedding_function,
        additional_search_client_options={"retry_total": 4},
        fields=fields
    )

such that batch is a list of Document objects.
then I use this script to index a my batches:

def sync_index_batch(batch):
    """
    Worker function to index a batch synchronously. Initializes its own vector store.
    """
    try:
        logging.info(f"Indexing batch with {len(batch)} documents")

        asyncio.run(vector_store.aadd_documents(documents=batch))  # Run async code synchronously
        logging.info(f"Successfully indexed a batch of {len(batch)} documents")
        return True
    except Exception as e:
        logging.error(f"Error occurred while indexing batch: {str(e)}")
        return False

def index_documents(docs, batch_size=10):
    """
    Index documents in parallel using multiprocessing.Pool.
    
    :param docs: List of documents to be indexed.
    :param batch_size: Number of documents per batch.

    """

    logging.info(f"Starting to index {len(docs)} documents")
    
    # Create batches of documents
    batches = [docs[i:i + batch_size] for i in range(0, len(docs), batch_size)]
    logging.info(f"Created {len(batches)} batches for parallel indexing")
    
    # Use multiprocessing to index the batches
    with Pool(processes=cpu_count()) as pool:
        results = pool.map(sync_index_batch, batches)
    
    # Check if all batches were processed successfully
    if all(results):
        logging.info(f"Successfully indexed {len(docs)} documents in {len(batches)} batches")
        return True
    else:
        logging.error(f"Some batches failed to index")
        raise Exception("Failed to index some batches")

Instead of using embed_query I would like to use OpenAIEmbeddings' embed_documents to embed my batch in lesser time. When I replace embeddings.embed_query with embeddings.embed_documents I encounter the error:

INFO:httpx:HTTP Request: POST https://xxxxxx.openai.azure.com/openai/deployments/text-embedding-3-small/embeddings?api-version=2023-05-15 "HTTP/1.1 503 Service Unavailable"
ERROR:root:Error occurred while indexing batch: Error code: 503 - {'error': {'code': 'InternalServerError', 'message': 'The service is temporarily unable to process your request. Please try again later.'}}

My ultimate goal is to batch-process my documents faster than I currently do. Can you help me incorporate embed_documents correctly? are there any other langchain functionalities I can use to speed up indexing for large documents?

@dosu · 2024-12-30T00:06:54Z

dosubot[bot]
bot Dec 30, 2024

To batch-process documents faster using embed_documents with AzureOpenAIEmbeddings, you can utilize the add_texts method, which supports batching if the embedding function is an instance of the Embeddings class. Here's how you can incorporate it:

def add_texts(
    self,
    texts: Iterable[str],
    metadatas: Optional[List[dict]] = None,
    *,
    keys: Optional[List[str]] = None,
    **kwargs: Any,
) -> List[str]:
    """Add texts data to an existing index."""
    if isinstance(self.embedding_function, Embeddings):
        try:
            embeddings = self.embedding_function.embed_documents(list(texts))
        except NotImplementedError:
            embeddings = [self.embedding_function.embed_query(x) for x in texts]
    else:
        embeddings = [self.embedding_function(x) for x in texts]

    if len(embeddings) == 0:
        logger.debug("Nothing to insert, skipping.")
        return []

    if (not keys) and ("ids" in kwargs) and (len(kwargs["ids"]) == len(embeddings)):
        keys = kwargs["ids"]

    return self.add_embeddings(zip(texts, embeddings), metadatas, keys=keys)

Additionally, Langchain provides asynchronous methods such as aadd_texts and aembed_documents for non-blocking operations, which can further speed up the indexing process for large documents:

async def aadd_texts(
    self,
    texts: Iterable[str],
    metadatas: Optional[List[dict]] = None,
    *,
    keys: Optional[List[str]] = None,
    **kwargs: Any,
) -> List[str]:
    if isinstance(self.embedding_function, Embeddings):
        try:
            embeddings = await self.embedding_function.aembed_documents(list(texts))
        except NotImplementedError:
            embeddings = [
                await self.embedding_function.aembed_query(x) for x in texts
            ]
    else:
        embeddings = [self.embedding_function(x) for x in texts]

    if len(embeddings) == 0:
        logger.debug("Nothing to insert, skipping.")
        return []

    if (not keys) and ("ids" in kwargs) and (len(kwargs["ids"]) == len(embeddings)):
        keys = kwargs["ids"]

    return await self.aadd_embeddings(zip(texts, embeddings), metadatas, keys=keys)

Moreover, the AzureOpenAIEmbeddings class allows you to specify a batch_size parameter, which can be adjusted to optimize the number of texts sent to the API in each batch. This can significantly reduce the number of API calls and improve processing time for large numbers of short texts. You can set up the AzureOpenAI embeddings as follows:

from langchain_openai import AzureOpenAIEmbeddings

embeddings = AzureOpenAIEmbeddings(
    model="text-embedding-3-large",
    # dimensions: Optional[int] = None, # Can specify dimensions with new text-embedding-3 models
    # azure_endpoint="https://<your-endpoint>.openai.azure.com/", If not provided, will read env variable AZURE_OPENAI_ENDPOINT
    # api_key=... # Can provide an API key directly. If missing read env variable AZURE_OPENAI_API_KEY
    # openai_api_version=..., # If not provided, will read env variable AZURE_OPENAI_API_VERSION
)

By using these methods and adjusting the batch size, you can efficiently batch-process and index documents in AzureSearch using Langchain's embedding functionalities ^[1]^[2]^[3].

To continue talking to Dosu, mention @dosu.

^{Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other}

0 replies

AsmaaMHadir · 2024-12-30T00:14:48Z

AsmaaMHadir
Dec 30, 2024
Author

@dosu

My question is how to I correctly pass aembed_documents to the AzureSearch instance?

When I make my embeddings function to be embeddings.aembed_documents I get this error:

coroutine 'OpenAIEmbeddings.aembed_documents' was never awaited
  vector_search_dimensions=len(embedding_function("Text"))

5 replies

dosubot[bot] bot Dec 30, 2024

To correctly pass aembed_documents to the AzureSearch instance, you need to ensure that you are using the await keyword when calling the asynchronous method aembed_documents. This is necessary because aembed_documents is a coroutine and must be awaited in an asynchronous context.

Here's an example of how you can use aembed_documents correctly in an asynchronous function:

async def add_texts_to_azure_search(azure_search_instance, texts):
    if isinstance(azure_search_instance.embedding_function, Embeddings):
        try:
            embeddings = await azure_search_instance.embedding_function.aembed_documents(list(texts))
        except NotImplementedError:
            embeddings = [await azure_search_instance.embedding_function.aembed_query(x) for x in texts]
    else:
        embeddings = [azure_search_instance.embedding_function(x) for x in texts]

    if len(embeddings) == 0:
        logger.debug("Nothing to insert, skipping.")
        return []

    return await azure_search_instance.aadd_embeddings(zip(texts, embeddings))

Make sure that the function containing this code is defined as async and is called within an asynchronous context. Additionally, ensure that your environment is correctly set up for using Azure endpoints with OpenAI, as described in the LangChain documentation ^[1]^[2].

To continue talking to Dosu, mention @dosu.

^{Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other}

AsmaaMHadir Dec 30, 2024
Author

@dosu

How about embedding a list of Document objects not a list of text strings?

dosubot[bot] bot Dec 30, 2024

To embed a list of Document objects, you need to extract the text content from each Document and then pass these texts to the embed_documents method. The Document objects in LangChain typically have a page_content attribute that contains the text to be embedded. Here's how you can do it:

Extract the text from each Document object.
Pass the extracted texts to the embed_documents method.

Here's an example:

from langchain_core.documents import Document

# Assuming `documents` is a list of Document objects
documents = [
    Document(page_content="This is the first document."),
    Document(page_content="This is the second document."),
    # Add more documents as needed
]

# Extract text content from each Document
texts = [doc.page_content for doc in documents]

# Use the embed_documents method
embeddings = embeddings.embed_documents(texts)

This approach ensures that you are correctly handling the Document objects by focusing on their text content, which is what the embedding functions require ^[1].

To continue talking to Dosu, mention @dosu.

^{Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other}

AsmaaMHadir Dec 30, 2024
Author

@dosu

I don't want to embed texts because I want to also send the document's metadata

dosubot[bot] bot Dec 30, 2024

To embed a list of Document objects along with their metadata, you can use the add_texts method of the AzureSearch class. This method allows you to pass both the text content and the metadata of each document. Here's how you can do it:

Extract the text content and metadata from each Document object.
Use the add_texts method to add these to the AzureSearch instance.

Here's an example:

from langchain_core.documents import Document

# Assuming `documents` is a list of Document objects
documents = [
    Document(page_content="This is the first document.", metadata={"author": "Author 1"}),
    Document(page_content="This is the second document.", metadata={"author": "Author 2"}),
    # Add more documents as needed
]

# Extract text content and metadata from each Document
texts = [doc.page_content for doc in documents]
metadatas = [doc.metadata for doc in documents]

# Add texts and metadata to the AzureSearch instance
vector_store.add_texts(texts, metadatas)

This approach ensures that both the text content and the associated metadata of each document are embedded and stored in the AzureSearch index ^[1].

To continue talking to Dosu, mention @dosu.

^{Help Dosu learn! Give it feedback: Great Response | Irrelevant Answer | Incorrect Sources | Too Verbose | Hallucination | Other}

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Using AzureSearch with embeddings.embed_documents instead of embeddings.embed_query #28964

Uh oh!

{{title}}

Uh oh!

Replies: 2 comments 5 replies

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Uh oh!

{{title}}

Uh oh!

Select a reply

Uh oh!

Using AzureSearch with embeddings.embed_documents instead of embeddings.embed_query #28964

Uh oh!

AsmaaMHadir Dec 30, 2024

Replies: 2 comments · 5 replies

Uh oh!

dosubot[bot] bot Dec 30, 2024

Uh oh!

AsmaaMHadir Dec 30, 2024 Author

Uh oh!

dosubot[bot] bot Dec 30, 2024

Uh oh!

AsmaaMHadir Dec 30, 2024 Author

Uh oh!

dosubot[bot] bot Dec 30, 2024

Uh oh!

AsmaaMHadir Dec 30, 2024 Author

Uh oh!

dosubot[bot] bot Dec 30, 2024

AsmaaMHadir
Dec 30, 2024

Replies: 2 comments 5 replies

dosubot[bot]
bot Dec 30, 2024

AsmaaMHadir
Dec 30, 2024
Author

AsmaaMHadir Dec 30, 2024
Author

AsmaaMHadir Dec 30, 2024
Author