Description
A few observations and suggestions to improve the performance and efficiency of the current agentic RAG system (v1):
Embedding Retrieval:
In the EmbeddingRetriever class, we are fetching all the embeddings from the database and then performing similarity calculations in the application code. This can be slow, especially if we have a large number of embeddings.
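Consider running the similarity search inside PostgreSQL instead. The pgvector extension provides a vector data type with distance operators and approximate nearest-neighbour indexes (HNSW, IVFFlat) that can be created on the embedding column. The built-in cube extension also offers distance operators, but it is limited to 100 dimensions by default, so it is not a practical fit for these embeddings.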
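As a rough illustration of the database-side search, here is a minimal sketch assuming the pgvector extension and a psycopg2-style driver. The table name `questions`, its columns, and the index name are hypothetical and not taken from the current code.

```python
from typing import Sequence, Tuple

# Hypothetical schema: questions(id, question, embedding vector(N)).
# Assumption: pgvector is installed. Its HNSW/IVFFlat indexes on the
# `vector` type support up to 2,000 dimensions, so a wider column would
# need shortened embeddings (or pgvector's halfvec type) to be indexed.
CREATE_INDEX_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE INDEX IF NOT EXISTS questions_embedding_idx
    ON questions USING hnsw (embedding vector_cosine_ops);
"""

# `<=>` is pgvector's cosine-distance operator; smaller means more similar.
# `%s` placeholders follow the psycopg2 parameter style.
TOP_K_SQL = """
SELECT id, question, 1 - (embedding <=> %s::vector) AS similarity
FROM questions
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Format floats in pgvector's text input form, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def top_k_params(embedding: Sequence[float], k: int = 5) -> Tuple[str, str, int]:
    """Build the parameter tuple for TOP_K_SQL."""
    literal = to_vector_literal(embedding)
    return (literal, literal, k)
```

With a psycopg2 cursor this would be called as `cursor.execute(TOP_K_SQL, top_k_params(query_embedding))`, so only the top k rows leave the database instead of every stored embedding.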
Embedding Dimensionality:
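We are using the text-embedding-3-large model for generating embeddings, which produces 3072-dimensional embeddings by default (the API's dimensions parameter can shorten them, for example to 1024). Working with high-dimensional embeddings is expensive in storage, index size, and per-query distance computation.
Evaluate whether shorter embeddings provide sufficient accuracy while reducing the computational overhead, either by requesting fewer dimensions from text-embedding-3-large or by switching to a smaller model such as text-embedding-3-small (1536 dimensions by default). text-embedding-ada-002 also produces 1536 dimensions, but it is an older model rather than a smaller successor.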
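To make the cost difference concrete, the sketch below estimates raw float32 storage at a few candidate dimensionalities. The row count of 100,000 is an illustrative assumption, and the figure ignores per-row and index overhead.

```python
def embedding_storage_bytes(num_rows: int, dims: int, bytes_per_value: int = 4) -> int:
    """Raw storage for num_rows embeddings of `dims` float32 values each."""
    return num_rows * dims * bytes_per_value


# Hypothetical corpus size of 100,000 questions.
for dims in (3072, 1536, 1024):
    megabytes = embedding_storage_bytes(100_000, dims) / 1e6
    print(f"{dims} dims: {megabytes:.1f} MB")
# prints:
# 3072 dims: 1228.8 MB
# 1536 dims: 614.4 MB
# 1024 dims: 409.6 MB
```

Brute-force distance computation scales the same way, so halving the dimensionality roughly halves both the storage and the per-query arithmetic. Whether retrieval accuracy holds up still has to be measured on our own data.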
Batch Processing:
In the process_file_from_s3 method of the DBops class, we are processing the CSV file and generating embeddings for all questions at once. This holds the entire file and every embedding in memory at the same time, which is memory-intensive and can slow down the Lambda function or push it toward its timeout.
Consider processing the CSV file in smaller batches and updating the database incrementally. This caps the memory footprint at one batch and lets partial progress be saved if a later batch fails.
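The batching idea could look roughly like the sketch below. It is not the existing process_file_from_s3 implementation: the `question` column name and the `embed_batch` and `save_batch` callables are hypothetical stand-ins for the real embedding call and database write.

```python
import csv
from itertools import islice
from typing import Callable, Iterable, Iterator, List, TextIO


def iter_batches(rows: Iterable[dict], batch_size: int) -> Iterator[List[dict]]:
    """Yield consecutive lists of at most batch_size rows."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    iterator = iter(rows)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def process_csv_in_batches(
    csv_stream: TextIO,
    embed_batch: Callable[[List[str]], List[List[float]]],
    save_batch: Callable[[list], None],
    batch_size: int = 100,
) -> int:
    """Embed and store questions one batch at a time; return the row count."""
    reader = csv.DictReader(csv_stream)  # reads rows lazily from the stream
    total = 0
    for batch in iter_batches(reader, batch_size):
        questions = [row["question"] for row in batch]  # assumed column name
        embeddings = embed_batch(questions)
        save_batch(list(zip(questions, embeddings)))  # one DB write per batch
        total += len(batch)
    return total
```

Only one batch of rows and embeddings is held in memory at a time, and each `save_batch` call can be committed independently.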
Caching:
Implement caching for frequently accessed embeddings and generated responses. Repeated or near-identical questions can then skip the embedding call and the database lookup, which reduces load and improves response times.
We can use an in-memory store such as Redis, either self-hosted or managed through AWS ElastiCache, to store and retrieve cached data efficiently.
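As a minimal sketch of the caching pattern, the class below uses an in-process dictionary as a stand-in for Redis or ElastiCache. The class name, the key normalisation, and the `embed_fn` callable are all assumptions for illustration.

```python
import hashlib
from typing import Callable, Dict, List


class EmbeddingCache:
    """Cache embeddings by a hash of the normalised input text.

    The dict is a stand-in: a shared deployment would replace it with
    Redis GET/SET calls so the cache survives across Lambda invocations.
    """

    def __init__(self, embed_fn: Callable[[str], List[float]]) -> None:
        self._embed_fn = embed_fn
        self._store: Dict[str, List[float]] = {}
        self.hits = 0
        self.misses = 0

    @staticmethod
    def _key(text: str) -> str:
        # Lowercase and collapse whitespace so trivial variants share a key.
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get(self, text: str) -> List[float]:
        key = self._key(text)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        self.misses += 1
        value = self._embed_fn(text)  # only called on a cache miss
        self._store[key] = value
        return value
```

Keying on a hash of the normalised text keeps keys short and of fixed length. A real cache would also need an expiry policy so that entries do not outlive a change of embedding model.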
Asynchronous Processing:
Some tasks, such as embedding generation or database updates, do not need to finish before the Lambda function responds and can be performed asynchronously.
Consider using AWS services like SQS (Simple Queue Service) or Step Functions to decouple these time-consuming tasks from the main Lambda function, so that the request path only enqueues the work and a separate worker processes it.
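One possible shape for the SQS variant is sketched below. The message format and function names are hypothetical. The client is passed in as a parameter so the sketch has no AWS dependency; in a Lambda it would be `boto3.client("sqs")`.

```python
import json
from typing import Any, Callable, Dict


def enqueue_embedding_job(sqs_client: Any, queue_url: str, bucket: str, key: str) -> Dict[str, Any]:
    """Producer: queue a small job description instead of doing the work inline."""
    body = json.dumps({"task": "embed_csv", "bucket": bucket, "key": key})
    # send_message(QueueUrl=..., MessageBody=...) is the boto3 SQS client call.
    return sqs_client.send_message(QueueUrl=queue_url, MessageBody=body)


def handle_sqs_event(event: Dict[str, Any], process_file: Callable[[str, str], None]) -> int:
    """Consumer: handler body for a Lambda triggered by the queue.

    SQS-triggered Lambda events carry messages under "Records", with the
    message text in each record's "body" field.
    """
    processed = 0
    for record in event.get("Records", []):
        job = json.loads(record["body"])
        process_file(job["bucket"], job["key"])  # hypothetical worker function
        processed += 1
    return processed
```

The request-facing function returns as soon as the message is queued, and the consumer Lambda can be sized and retried independently of it.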