Allow same source files with different collection name using SQLRecordManager #21704

dellsekar · 2024-05-15T09:03:28Z

Checked other resources

I added a very descriptive title to this question.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.

Commit to Help

I commit to help with one of those options 👆

Example Code

from langchain.indexes import SQLRecordManager, index
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import S3FileLoader
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_postgres import PGVector

# EMPLOYEE_NAME, CONNECTION_STRING and my_s3_bucket are defined elsewhere.


def setup_vector_store_pg(embeddings, connection_string):
    # One PGVector collection per employee
    vectorstore = PGVector(
        embeddings=embeddings,
        collection_name=EMPLOYEE_NAME,
        connection=connection_string,
    )
    return vectorstore


# s3_files comes from a different function
def process_documents_pg(s3_files):
    embeddings = HuggingFaceEmbeddings(
        model_name="sentence-transformers/all-MiniLM-L12-v2",
        model_kwargs={"device": "cpu"},
        encode_kwargs={"normalize_embeddings": False},
    )
    text_splitter = RecursiveCharacterTextSplitter(
        length_function=len, chunk_size=512, chunk_overlap=16
    )

    for obj in s3_files:
        loader = S3FileLoader(
            my_s3_bucket,
            obj["Key"],
            verify=True,
            endpoint_url="https://my_s3_storage:9021",
            aws_access_key_id="my-id",
            aws_secret_access_key="my-pass",
        )

        # Load the S3 object, split it into chunks, then index the chunks
        document = loader.load()
        docs = text_splitter.split_documents(document)
        connection_string = CONNECTION_STRING
        load_result = ingest_documents_pgvector(docs, embeddings, connection_string)
        print(f"Result: {load_result}")


def ingest_documents_pgvector(docs, embeddings, connection_string):
    vectorstore = setup_vector_store_pg(embeddings, connection_string)
    record_manager = SQLRecordManager(namespace=f"pgvector/EMPLOYEE_NAME", db_url=connection_string)
    record_manager.create_schema()

    # cleanup takes a string mode, here "full"
    load = index(docs, record_manager, vectorstore, cleanup="full", source_id_key="source")
    return load

Description

To avoid duplicate entries in the PGVector database, I have started using the SQLRecordManager module from the indexes library. My goal is to avoid storing the same file content multiple times within one collection, while still allowing the same file (source) and its content to exist under different collection names.

In my use case, the collection name represents an employee's name, so I create one collection per employee. Employees will be using the same S3 bucket and loading identical files to test different LLM responses in a RAG approach.

However, with the current implementation, entries are being deleted from the table if the file name (source in the cmetadata column) is the same. I want to keep the same files if the collection names are different.

Is it possible to achieve this? If so, what changes do I need to make?

System Info

System Information

OS: Linux
OS Version: #1 SMP Thu Dec 7 03:06:13 EST 2023
Python Version: 3.12.3 (main, May 14 2024, 07:34:56) [GCC 12.2.0]

Package Information

langchain_core: 0.1.45
langchain: 0.1.16
langchain_community: 0.0.34
langsmith: 0.1.57
langchain_cohere: 0.1.3
langchain_postgres: 0.0.3
langchain_text_splitters: 0.0.1
langgraph: 0.0.32

Packages not installed (Not Necessarily a Problem)

The following packages were not found:

langserve

@dosu · 2024-05-15T09:04:43Z

dosubot[bot]
bot May 15, 2024

To achieve your goal of storing the same file content multiple times under different collection names without creating duplicates, you should leverage the namespace attribute in the SQLRecordManager. This attribute allows the same key to exist in multiple namespaces, enabling you to store the same file content under different employee names as separate collections.

Here's how you can adjust your implementation:

Ensure each collection name (employee name in your case) is treated as a separate namespace when you insert or update records using the SQLRecordManager.
Instantiate or use different instances of SQLRecordManager with distinct namespace values for each collection name. This ensures that the same key can be associated with different namespaces without conflict.

For example:

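from langchain.indexes import SQLRecordManager

# Assuming `engine` is your SQLAlchemy engine
record_manager_employee1 = SQLRecordManager(namespace="employee1", engine=engine)
record_manager_employee2 = SQLRecordManager(namespace="employee2", engine=engine)

# Create the record-manager table if it does not exist yet
record_manager_employee1.create_schema()
record_manager_employee2.create_schema()

# Upsert the same key into different collections (namespaces)
key = "same_file_key"
record_manager_employee1.update([key])
record_manager_employee2.update([key])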

Replace "employee1" and "employee2" with the actual employee names to store the same file content under these names as separate collections. This approach allows you to maintain the same files across different collections without creating duplicates, aligning with your requirements.
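As a rough illustration of the idea, the sketch below models namespace-scoped record keeping with an in-memory SQLite table. It is a toy model, not LangChain's implementation: the `records` table, `make_store`, and `upsert` are hypothetical names, and the uniqueness rule on `(namespace, key)` is an assumption made for the example. The keys mirror the file names used in this thread.

```python
import sqlite3


def make_store() -> sqlite3.Connection:
    """Create an in-memory table where each (namespace, key) pair is unique."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE records ("
        " namespace TEXT NOT NULL,"
        " key TEXT NOT NULL,"
        " UNIQUE (namespace, key))"
    )
    return conn


def upsert(conn: sqlite3.Connection, namespace: str, key: str) -> bool:
    """Insert the record; return True if a new row was written, False if it already existed."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO records (namespace, key) VALUES (?, ?)",
        (namespace, key),
    )
    return cur.rowcount == 1


conn = make_store()
results = [
    upsert(conn, "employee_1", "S3:/data/demo.txt"),  # first write for employee_1
    upsert(conn, "employee_1", "S3:/data/demo.txt"),  # same namespace, same key
    upsert(conn, "employee_2", "S3:/data/demo.txt"),  # same key, different namespace
]
row_count = conn.execute("SELECT COUNT(*) FROM records").fetchone()[0]
```

Under this model, the repeated write for `employee_1` is skipped while the write for `employee_2` goes through, because uniqueness is scoped to the namespace rather than to the key alone.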

About Dosu
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

To continue the conversation, mention @dosu.

3 replies

dellsekar May 16, 2024
Author

@dosu The above suggestion did not help. We already follow it: we instantiate a SQLRecordManager for every employee and call update. Let me explain my requirement in a table for clarity.

Source (S3 Bucket)	Destination (PG DB)	Namespace/employee name	File name (Source)	Expected Result	Current status
s3_1	pg_1	employee_1	S3:/data/demo.txt	Insert	Working
s3_1	pg_1	employee_1	S3:/data/demo.txt	Skip	Working
s3_1	pg_1	employee_2	S3:/data/demo.txt	Insert	Not working
s3_1	pg_1	employee_1	S3:/data/demo123.txt	Insert	Working
s3_2	pg_1	employee_1	S3:/data2/demo.txt	Insert	Working
s3_2	pg_1	employee_2	S3:/data2/demo.txt	Insert	Not working

dosubot[bot] bot May 16, 2024

To address the issue where inserting the same file content under different employee names (namespaces) is not working as expected, it's important to ensure that your implementation correctly differentiates between namespaces for each employee. The SQLRecordManager is designed to allow the same key to exist in multiple namespaces, enabling the scenario you're aiming for.

However, from the table you've provided, it seems like the insert operation for the same file under a different namespace (employee name) is not functioning as intended. This suggests there might be a misconfiguration or misunderstanding in how namespaces are being applied in your implementation.

Here are a few steps to troubleshoot and resolve this issue:

Verify Namespace Usage: Ensure that each time you instantiate the SQLRecordManager for a different employee, you are indeed using a unique namespace corresponding to that employee's name. This is crucial for the database to differentiate between the same file content associated with different employees.

Check Implementation Details: Review the instantiation of SQLRecordManager objects to confirm that the namespace parameter is correctly set for each employee. It should look something like this:

record_manager_employee1 = SQLRecordManager(namespace="employee_1", db_url=connection_string)
record_manager_employee2 = SQLRecordManager(namespace="employee_2", db_url=connection_string)

Database Constraints: Ensure there are no additional constraints or triggers in your database that might be inadvertently affecting the insert operations across different namespaces.
Debugging: Add logging around the points where you instantiate the SQLRecordManager and perform insert operations. This can help identify if the correct namespace is being used and if the insert operation is being called as expected.
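The namespace check and the logging step could be combined in a small helper, sketched below. `namespace_for` and the `indexing` logger name are hypothetical; the `pgvector/<name>` pattern is taken from the example code in this thread. One thing worth ruling out while checking: an f-string only interpolates names wrapped in braces, so `f"pgvector/EMPLOYEE_NAME"` would produce the same literal text for every employee.

```python
import logging

logger = logging.getLogger("indexing")  # hypothetical logger name


def namespace_for(employee_name: str) -> str:
    """Build a per-employee namespace following a 'pgvector/<name>' convention."""
    if not employee_name:
        raise ValueError("employee_name must be non-empty")
    # Braces are required for interpolation inside an f-string.
    namespace = f"pgvector/{employee_name}"
    logger.info("Using record-manager namespace %r", namespace)
    return namespace


ns_1 = namespace_for("employee_1")
ns_2 = namespace_for("employee_2")
```

The returned string can then be passed as the `namespace` argument when constructing the record manager for that employee, and the log line shows which namespace each run actually used.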

If after these checks you're still facing issues, it might be helpful to directly inspect the SQL queries being generated to ensure they correctly reflect the intended operations for different namespaces. This could uncover any discrepancies between the expected behavior and what's actually being executed against the database.
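One way to see the generated SQL, assuming the record manager and vector store issue their queries through SQLAlchemy (SQLRecordManager accepts a SQLAlchemy engine or a `db_url`), is to raise the log level of SQLAlchemy's `sqlalchemy.engine` logger before running the indexing code. `enable_sql_logging` is a hypothetical helper name used only for this sketch.

```python
import logging


def enable_sql_logging(level: int = logging.INFO) -> logging.Logger:
    """Turn on statement logging for SQLAlchemy engines via the standard logging module."""
    logging.basicConfig()  # attach a default handler if none is configured yet
    sql_logger = logging.getLogger("sqlalchemy.engine")
    sql_logger.setLevel(level)
    return sql_logger


sql_logger = enable_sql_logging()
```

With this in place, each statement sent to the database should appear in the log, which makes it possible to check which namespace and collection a given insert or delete actually targets.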


GhimBoon Mar 19, 2025

I'm seeing this issue occur again. If I recall correctly, it was fixed a few months ago, but it is happening again in the latest LangChain version.

Are there any new requirements for using the indexing SQLRecordManager @dosu?
