[Question]: Can metadata be given a value so that they can be distinguished when indexing from URLs? #8405

Ma-Fukudama · 2023-09-05T01:49:19Z

Ma-Fukudama
Sep 5, 2023

Question Validation

I have searched both the documentation and discord for an answer.

Question

I am creating indexes from files using SimpleDirectoryReader and from URLs using SimpleWebPageReader.
For documents loaded from files, the file name is registered in the metadata, but for documents loaded from URLs, the URL is not.
Later, when I delete data from the index, I expect to delete it by node ID, but the URL documents have only a node ID in docstore.json, so I cannot tell them apart.
Is there a way to register the URL, like a file name, when creating an index from a URL?

@dosu-bot · 2023-09-05T01:54:59Z

dosubot[bot]
bot Sep 5, 2023

🤖

Hello,

Yes, there is a way to associate a URL with a document when creating an index from the URL in LlamaIndex. Each TextNode exposes a ref_doc_id, which is the ID of the source document the node was derived from. ref_doc_id is a read-only property derived from the node's source relationship, so you set it by pointing that relationship at the URL rather than by passing ref_doc_id to the constructor. You can then use the URL to identify the document when deleting data from the index.

Here is an example of how you can do this:

from llama_index.schema import NodeRelationship, RelatedNodeInfo, TextNode
from llama_index.storage.docstore import SimpleDocumentStore

# Create a TextNode whose source relationship points at the URL;
# node.ref_doc_id will then return "http://example.com"
node = TextNode(
    id_="node1",
    text="This is a test document",
    relationships={NodeRelationship.SOURCE: RelatedNodeInfo(node_id="http://example.com")},
)

# Add the node to the document store
docstore = SimpleDocumentStore()
docstore.add_documents([node])

When you want to delete the document, you can use the delete_ref_doc method with the URL as the ref_doc_id:

# Delete the document with the URL as the ref_doc_id
docstore.delete_ref_doc("http://example.com")

This deletes the document and all of its associated nodes from the document store.

LlamaIndex exposes ref_doc_id on the TextNode object as a read-only property. It returns the reference document ID, which could be the URL of the document.

Here is the relevant code snippet from the BaseNode class, which is the parent class of TextNode:

@property
def ref_doc_id(self) -> Optional[str]:
    """Deprecated: Get ref doc id."""
    source_node = self.source_node
    if source_node is None:
        return None
    return source_node.node_id

This property method ref_doc_id retrieves the node_id of the source_node, which could be the URL of the document.
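
To make that relationship concrete, here is a minimal stand-in written with the standard library only. `ToyNode` and its string-keyed `relationships` dict are simplified assumptions for illustration, not the real llama_index classes; the sketch only mimics how the property above derives ref_doc_id from the source node.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class ToyNode:
    """Simplified stand-in for a node; not the real llama_index class."""

    node_id: str
    text: str = ""
    # Maps a relationship name (here only "source") to the related node's ID.
    relationships: Dict[str, str] = field(default_factory=dict)

    @property
    def ref_doc_id(self) -> Optional[str]:
        # Mirrors the property above: no source relationship means no ref_doc_id.
        return self.relationships.get("source")


orphan = ToyNode(node_id="n0", text="no parent document")
chunk = ToyNode(
    node_id="n1",
    text="a chunk of the page",
    relationships={"source": "http://example.com"},
)
```

In this toy model, `chunk.ref_doc_id` is the URL because the source relationship points at it, while `orphan.ref_doc_id` is None.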

ref_doc_id is also used by the DocumentSummaryIndex class: when adding nodes to the index, it raises a ValueError if a node's ref_doc_id is None. This indicates that ref_doc_id is a crucial part of the indexing process.

I hope this helps! Let me know if you have any other questions.

Best,
Dosu

Sources

This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research.

Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant.

If you want to continue the conversation, start your reply with @dosu-bot.

Ma-Fukudama · 2023-09-05T02:57:49Z

Ma-Fukudama
Sep 5, 2023
Author

@dosu-beta

When creating a document using a file, the file name is registered in the metadata.
Can I register file names in metadata using docstore.add_documents([node])?

The sources I'm indexing right now are:

text_list = ['~/paul_graham_essay.txt']

index = ([])

filename_fn = lambda filename: {'file_name': filename}

# automatically sets the metadata of each document according to filename_fn
file_documents = SimpleDirectoryReader(input_files=text_list,  file_metadata=filename_fn).load_data()

# index = ListIndex.from_documents(documents, service_context=service_context)

llm = OpenAI(model="gpt-4", temperature=0, max_tokens=2048)
service_context = ServiceContext.from_defaults(llm=llm)


url_documents = SimpleWebPageReader(html_to_text=True).load_data(
    [
        "http://paulgraham.com/worked.html"
    ]
)

documents = file_documents + url_documents

index = VectorStoreIndex.from_documents(documents, service_context=service_context)

index.storage_context.persist(persist_dir="index")

@dosu-bot · 2023-09-05T03:06:14Z

dosubot[bot]
bot Sep 5, 2023

🤖

Hello,

To register file names in metadata using docstore.add_documents([node]) in the LlamaIndex framework, you need to create a Document or TextNode object with the desired metadata and then add it to the document store. The metadata is a dictionary where you can include the file name or any other information you want to associate with the document.

Here is an example of how you can do it:

from llama_index.schema import Document, TextNode
from llama_index.storage.docstore import SimpleDocumentStore

# Create a Document or TextNode with metadata including the file name
doc = Document(text="hello world", id_="d1", metadata={"file_name": "file1.txt", "foo": "bar"})
node = TextNode(text="my node", id_="d2", metadata={"file_name": "file2.txt", "node": "info"})

# Create a SimpleDocumentStore and add the documents
docstore = SimpleDocumentStore()
docstore.add_documents([doc, node])

In this example, the metadata dictionary includes the file name for each document. You can retrieve the document later using its id and access its metadata (including the file name).

Regarding your current code, the SimpleDirectoryReader class has a parameter file_metadata which is a function that takes in a filename and returns a dictionary of metadata for the Document. This parameter is optional and its default value is None.

In the load_data method, if file_metadata is not None, it is used to generate metadata for each input file. This metadata is then passed to the file reader when loading data from the file.

Here is the relevant code:

def load_data(self) -> List[Document]:
    """Load data from the input directory.

    Returns:
        List[Document]: A list of documents.
    """
    documents = []
    for input_file in self.input_files:
        metadata: Optional[dict] = None
        if self.file_metadata is not None:
            metadata = self.file_metadata(str(input_file))

        file_suffix = input_file.suffix.lower()
        if (
            file_suffix in self.supported_suffix
            or file_suffix in self.file_extractor
        ):
            # use file readers
            if file_suffix not in self.file_extractor:
                # instantiate file reader if not already
                reader_cls = DEFAULT_FILE_READER_CLS[file_suffix]
                self.file_extractor[file_suffix] = reader_cls()
            reader = self.file_extractor[file_suffix]
            docs = reader.load_data(input_file, extra_info=metadata)

            # iterate over docs if needed
            if self.filename_as_id:
                for i, doc in enumerate(docs):
                    doc.id_ = f"{str(input_file)}_part_{i}"

            documents.extend(docs)
        else:
            # do standard read
            with open(
                input_file, "r", errors=self.errors, encoding=self.encoding
            ) as f:
                data = f.read()

            doc = Document(text=data, metadata=metadata or {})
            if self.filename_as_id:
                doc.id_ = str(input_file)

            documents.append(doc)

    return documents

In this code, self.file_metadata(str(input_file)) is used to generate metadata for each input file. This metadata is then passed as extra_info to the load_data method of the file reader. If file_metadata is None, then the metadata for the Document is an empty dictionary.
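
As a sketch of what a richer file_metadata callback could look like, the helper below builds a small dict from the path it receives. The name `describe_file` and the extra `file_path` and `file_type` keys are hypothetical; the only thing taken from the code above is that the callback is called with the file path as a single string and returns a dict.

```python
from pathlib import Path
from typing import Dict


def describe_file(filename: str) -> Dict[str, str]:
    """Build a metadata dict from a file path (hypothetical helper)."""
    path = Path(filename)
    return {
        "file_name": path.name,  # last path component only
        "file_path": filename,   # the path exactly as it was passed in
        "file_type": path.suffix.lower().lstrip("."),  # extension without the dot
    }


essay_metadata = describe_file("/data/paul_graham_essay.txt")
```

Under these assumptions it could be passed as `file_metadata=describe_file` in place of the lambda used earlier in the thread.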

I hope this helps! Let me know if you have any other questions.

Ma-Fukudama · 2023-09-05T06:45:40Z

Ma-Fukudama
Sep 5, 2023
Author

@dosu-bot

I don't know where to add it.
Where should I add it in the source below?

text_list = ['~/paul_graham_essay.txt']

index = ([])

filename_fn = lambda filename: {'file_name': filename}

# automatically sets the metadata of each document according to filename_fn
file_documents = SimpleDirectoryReader(input_files=text_list,  file_metadata=filename_fn).load_data()

# index = ListIndex.from_documents(documents, service_context=service_context)

llm = OpenAI(model="gpt-4", temperature=0, max_tokens=2048)
service_context = ServiceContext.from_defaults(llm=llm)


url_documents = SimpleWebPageReader(html_to_text=True).load_data(
    [
        "http://paulgraham.com/worked.html"
    ]
)

documents = file_documents + url_documents

index = VectorStoreIndex.from_documents(documents, service_context=service_context)

index.storage_context.persist(persist_dir="index")

@dosu-bot · 2023-09-05T06:55:26Z

dosubot[bot]
bot Sep 5, 2023

🤖

Hello,

From your code, it seems like you're already on the right track with adding metadata to your documents. You're correctly using the file_metadata parameter of the SimpleDirectoryReader to add metadata to your documents based on the filename.

Here's how it works:

filename_fn = lambda filename: {'file_name': filename}
file_documents = SimpleDirectoryReader(input_files=text_list,  file_metadata=filename_fn).load_data()

In this code, filename_fn is a function that takes a filename as input and returns a dictionary with the filename as metadata. This function is then passed to the SimpleDirectoryReader as the file_metadata parameter. The SimpleDirectoryReader will call this function with each filename to generate the metadata for each document.

However, you are not adding metadata to the documents loaded from URLs. The SimpleWebPageReader does not currently offer a mechanism like file_metadata for this, so to add metadata to these documents you might need to extend the SimpleWebPageReader class.

Here's an example of how you could do this:

class CustomWebPageReader(SimpleWebPageReader):
    def __init__(self, html_to_text=False, url_metadata=None):
        super().__init__(html_to_text)
        self.url_metadata = url_metadata

    def load_data(self, urls):
        documents = super().load_data(urls)
        if self.url_metadata is not None:
            for doc, url in zip(documents, urls):
                metadata = self.url_metadata(url)
                doc.metadata.update(metadata)
        return documents

url_metadata_fn = lambda url: {'url': url}
url_documents = CustomWebPageReader(html_to_text=True, url_metadata=url_metadata_fn).load_data(
    [
        "http://paulgraham.com/worked.html"
    ]
)

In this code, CustomWebPageReader is a subclass of SimpleWebPageReader that adds a url_metadata parameter. This parameter works in the same way as the file_metadata parameter of the SimpleDirectoryReader. The load_data method of the CustomWebPageReader calls this function with each URL to generate the metadata for each document.
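
If subclassing feels heavy, the same pairing of documents and URLs can be sketched as a plain function. This is only an illustration under stated assumptions: `attach_url_metadata` and `FakeDocument` are hypothetical names, the helper relies on each document exposing a mutable `metadata` dict, and it assumes `documents[i]` was loaded from `urls[i]`, which is the same assumption the zip() in CustomWebPageReader makes.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Sequence


def attach_url_metadata(
    documents: Sequence[Any],
    urls: Sequence[str],
    url_metadata: Callable[[str], Dict[str, Any]],
) -> List[Any]:
    """Merge url_metadata(url) into the metadata dict of the matching document."""
    if len(documents) != len(urls):
        # zip() would silently drop the extras, so fail loudly instead.
        raise ValueError(
            f"got {len(documents)} documents for {len(urls)} URLs; cannot pair them"
        )
    for doc, url in zip(documents, urls):
        doc.metadata.update(url_metadata(url))
    return list(documents)


@dataclass
class FakeDocument:
    """Stand-in with just the attributes the helper touches."""

    text: str
    metadata: Dict[str, Any] = field(default_factory=dict)


docs = [FakeDocument("page one"), FakeDocument("page two")]
urls = ["http://example.com/a.html", "http://example.com/b.html"]
attach_url_metadata(docs, urls, lambda url: {"url": url})
```

The explicit length check is a design choice: if a page fails to load and the reader returns fewer documents than URLs, a mismatched pairing would attach the wrong URL to a document.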

I hope this helps! Let me know if you have any other questions.

Ma-Fukudama · 2023-09-06T10:49:49Z

Ma-Fukudama
Sep 6, 2023
Author

@dosu-bot

Please tell me how to create a VectorStoreIndex using the source below and exclude one file in the index.

import os, sys, json,logging, openai
from dotenv import load_dotenv
from llama_index import SimpleDirectoryReader, StorageContext, VectorStoreIndex, LLMPredictor, Document, load_index_from_storage, ServiceContext, SimpleWebPageReader
from llama_index.llms import OpenAI
from llama_index.langchain_helpers.agents import create_llama_chat_agent, IndexToolConfig
from llama_index.schema import TextNode

load_dotenv()
openai.api_key = os.environ["OPENAI_API_KEY"] 

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG, force=True)

index = ([])
text_list = ['/data/paul_graham_essay.txt']
filename_fn = lambda filename: {'file_name': filename}

# automatically sets the metadata of each document according to filename_fn
file_documents = SimpleDirectoryReader(input_files=text_list,  file_metadata=filename_fn).load_data()

# LlamaIndex configuration
llm = OpenAI(model="gpt-4", temperature=0, max_tokens=2048)
service_context = ServiceContext.from_defaults(llm=llm)

class CustomWebPageReader(SimpleWebPageReader):
    def __init__(self, html_to_text=False, url_metadata=None):
        super().__init__(html_to_text)
        self.url_metadata = url_metadata

    def load_data(self, urls):
        documents = super().load_data(urls)
        if self.url_metadata is not None:
            for doc, url in zip(documents, urls):
                metadata = self.url_metadata(url)
                doc.metadata.update(metadata)
        return documents

url_metadata_fn = lambda url: {'file_name': url}
url_documents = CustomWebPageReader(html_to_text=True, url_metadata=url_metadata_fn).load_data(
    [
            'x.html'
    ]
)

documents = file_documents + url_documents

index = VectorStoreIndex.from_documents(documents, service_context=service_context)

index.storage_context.persist(persist_dir="index")

@dosu-bot · 2023-09-06T10:58:11Z

dosubot[bot]
bot Sep 6, 2023

🤖

Hello,

To exclude a specific file from the LlamaIndex after it has been created, you can use the delete_ref_doc method from the KVDocumentStore class. This method deletes a document and its nodes from the index using the document's reference ID.

Here is an example of how you can use it:

docstore = KVDocumentStore(kvstore=your_kvstore)  # Initialize your document store
docstore.delete_ref_doc(ref_doc_id="your_file_id")

In this example, replace "your_file_id" with the ID of the file you want to exclude from the index. Please note that this will only work if the file was added to the index as a document. If the file was added in another way, you may need to use a different method to remove it.

The document reference ID in the LlamaIndex framework can be generated based on the filename. This is controlled by the filename_as_id parameter in the SimpleDirectoryReader class. If filename_as_id is set to True, the filename will be used as the document ID.

This is evident in the load_data method of the SimpleDirectoryReader class. If filename_as_id is True, the document ID is set to the filename of the input file. If the file reader returns multiple documents for a single file (for example, if the file is a multi-page PDF), the document ID is set to the filename followed by _part_ and the index of the document.

Here is the relevant code:

if self.filename_as_id:
    for i, doc in enumerate(docs):
        doc.id_ = f"{str(input_file)}_part_{i}"

and

if self.filename_as_id:
    doc.id_ = str(input_file)
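
The naming rule in the two snippets above can be captured in a small helper, shown here only to illustrate which IDs to expect when filename_as_id is True. `derive_doc_ids` and its parameters are hypothetical and not part of LlamaIndex.

```python
from typing import List


def derive_doc_ids(input_file: str, num_docs: int, used_file_reader: bool) -> List[str]:
    """Reproduce the ID naming rule from the two snippets above (illustrative only)."""
    if used_file_reader:
        # A dedicated file reader may return several documents per file,
        # and every one of them gets a _part_<i> suffix.
        return [f"{input_file}_part_{i}" for i in range(num_docs)]
    # The plain-read branch always yields exactly one document per file.
    return [input_file]


pdf_ids = derive_doc_ids("/data/report.pdf", 2, used_file_reader=True)
plain_ids = derive_doc_ids("/data/plain_file", 1, used_file_reader=False)
```

These are the values you would later pass as ref_doc_id when deleting, so a file loaded through a dedicated reader needs one delete call per part.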

If filename_as_id is False, the document ID is not derived from the filename; each document instead keeps its automatically generated ID (a random UUID by default).

I hope this helps! Let me know if you have any other questions.

Best,
Dosu

Ma-Fukudama · 2023-09-06T11:10:46Z

Ma-Fukudama
Sep 6, 2023
Author

dosu-bot
I do not understand.
Could you please specifically add it to the sources below?

import os, sys, json,logging, openai
from dotenv import load_dotenv
from llama_index import SimpleDirectoryReader, StorageContext, VectorStoreIndex, LLMPredictor, Document, load_index_from_storage, ServiceContext, SimpleWebPageReader
from llama_index.llms import OpenAI
from llama_index.langchain_helpers.agents import create_llama_chat_agent, IndexToolConfig
from llama_index.schema import TextNode

load_dotenv()
openai.api_key = os.environ["OPENAI_API_KEY"] 

logging.basicConfig(stream=sys.stdout, level=logging.DEBUG, force=True)

index = ([])
text_list = ['/data/paul_graham_essay.txt']
filename_fn = lambda filename: {'file_name': filename}

# automatically sets the metadata of each document according to filename_fn
file_documents = SimpleDirectoryReader(input_files=text_list,  file_metadata=filename_fn).load_data()

# LlamaIndex configuration
llm = OpenAI(model="gpt-4", temperature=0, max_tokens=2048)
service_context = ServiceContext.from_defaults(llm=llm)

class CustomWebPageReader(SimpleWebPageReader):
    def __init__(self, html_to_text=False, url_metadata=None):
        super().__init__(html_to_text)
        self.url_metadata = url_metadata

    def load_data(self, urls):
        documents = super().load_data(urls)
        if self.url_metadata is not None:
            for doc, url in zip(documents, urls):
                metadata = self.url_metadata(url)
                doc.metadata.update(metadata)
        return documents

url_metadata_fn = lambda url: {'file_name': url}
url_documents = CustomWebPageReader(html_to_text=True, url_metadata=url_metadata_fn).load_data(
    [
            'x.html'
    ]
)

documents = file_documents + url_documents

index = VectorStoreIndex.from_documents(documents, service_context=service_context)

index.storage_context.persist(persist_dir="index")

@dosu-bot · 2023-09-06T11:19:27Z

dosubot[bot]
bot Sep 6, 2023

🤖

Hello,

To exclude a specific file from the LlamaIndex after it has been created, you can indeed use the delete_ref_doc method from the BaseIndex class. This method deletes a document and its nodes from the index using the document's reference ID (ref_doc_id), which could be the file name or any unique identifier for the file.

Here is how you can modify your code to exclude a specific file:

# Assume that the file you want to exclude has a reference ID of `file_to_exclude`
index.delete_ref_doc(ref_doc_id='file_to_exclude', delete_from_docstore=True)

In this example, delete_from_docstore=True means that the document will also be deleted from the document store, not just the index. If you want to keep the document in the document store, you can set this parameter to False.

Please note that this operation cannot be undone. Once a file is excluded from the index, you will need to add it back to the index manually if you want to include it again.

Here's how the delete_ref_doc method works:

It first retrieves the reference document information (ref_doc_info) from the document store using the get_ref_doc_info method. If ref_doc_info is None, it logs a warning and returns, as there is nothing to delete.
Otherwise, it calls the delete_nodes method with the node IDs from ref_doc_info, passing delete_from_docstore=False so that this step removes the nodes from the index only.
If the delete_from_docstore flag is True (it is False by default), it then deletes the reference document from the document store using the docstore's delete_ref_doc method.

Here's the code snippet for the delete_ref_doc method:

def delete_ref_doc(
    self, ref_doc_id: str, delete_from_docstore: bool = False, **delete_kwargs: Any
) -> None:
    """Delete a document and it's nodes by using ref_doc_id."""
    ref_doc_info = self.docstore.get_ref_doc_info(ref_doc_id)
    if ref_doc_info is None:
        logger.warning(f"ref_doc_id {ref_doc_id} not found, nothing deleted.")
        return

    self.delete_nodes(
        ref_doc_info.node_ids,
        delete_from_docstore=False,
        **delete_kwargs,
    )

    if delete_from_docstore:
        self.docstore.delete_ref_doc(ref_doc_id, raise_error=False)
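
The effect of the delete_from_docstore flag can be seen in a toy model of the method above. `ToyIndex` and its three dicts are assumptions made for illustration; the real BaseIndex delegates to an index structure and a docstore rather than to plain dicts.

```python
from typing import Dict, List


class ToyIndex:
    """Toy model of the two-stage deletion above; not the real BaseIndex."""

    def __init__(self) -> None:
        self.index_nodes: Dict[str, str] = {}         # node_id -> text held by the index
        self.docstore_nodes: Dict[str, str] = {}      # node_id -> text held by the docstore
        self.ref_doc_info: Dict[str, List[str]] = {}  # ref_doc_id -> node_ids

    def add(self, ref_doc_id: str, chunks: Dict[str, str]) -> None:
        self.index_nodes.update(chunks)
        self.docstore_nodes.update(chunks)
        self.ref_doc_info.setdefault(ref_doc_id, []).extend(chunks)

    def delete_ref_doc(self, ref_doc_id: str, delete_from_docstore: bool = False) -> None:
        node_ids = self.ref_doc_info.get(ref_doc_id)
        if node_ids is None:
            return  # unknown ID: nothing to delete
        # Stage 1: always drop the nodes from the index itself.
        for node_id in node_ids:
            self.index_nodes.pop(node_id, None)
        # Stage 2: only touch the docstore when explicitly asked to.
        if delete_from_docstore:
            for node_id in node_ids:
                self.docstore_nodes.pop(node_id, None)
            del self.ref_doc_info[ref_doc_id]


index = ToyIndex()
index.add("doc-a", {"n1": "first chunk", "n2": "second chunk"})
index.add("doc-b", {"n3": "other document"})

index.delete_ref_doc("doc-a")  # removes from the index only
after_index_only = (sorted(index.index_nodes), sorted(index.docstore_nodes))

index.delete_ref_doc("doc-a", delete_from_docstore=True)  # docstore as well
after_full = (sorted(index.index_nodes), sorted(index.docstore_nodes))
```

After the first call the nodes of doc-a are gone from the index but still present in the docstore; only the second call, with the flag set, clears them from both.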

I hope this helps! Let me know if you have any other questions.

Sources

llama_index/indices/base.py

Ma-Fukudama · 2023-09-06T11:29:21Z

Ma-Fukudama
Sep 6, 2023
Author

dosu-bot
The json file is as below, but which one corresponds to ref_doc_id?

"docstore/ref_doc_info": {"83c7b926-2100-445a-a86d-2f30700d7725": {"node_ids": ["631ac936-392b-4cdb-881c-e689d8f7f0fc", "31079ece-f8ef-43d1-b69f-0f18567e53b8", "1c410e07-43bd-46f3-992e-e9d12b6990bb"], "metadata": {"file_name": "http://www.xxxx.html"}}}

1 reply

dosubot[bot] bot Oct 24, 2023

🤖

Hello,

Based on the information provided, the ref_doc_id in the LlamaIndex framework corresponds to the node_id of the source_node in the BaseNode class. This is derived from the ref_doc_id property in the BaseNode class. Here's the relevant code snippet:

@property
def ref_doc_id(self) -> Optional[str]:
    """Deprecated: Get ref doc id."""
    source_node = self.source_node
    if source_node is None:
        return None
    return source_node.node_id

In this code, source_node is a property that extracts the source node from the relationships field of the BaseNode class. The node_id of this source_node is then returned as the ref_doc_id.

In the context of your JSON file, the key directly under "docstore/ref_doc_info" (here "83c7b926-2100-445a-a86d-2f30700d7725") is the ref_doc_id. The node_ids listed under it are the IDs of the nodes that were derived from that document, and each of those nodes reports this key as its ref_doc_id.

Please note that the actual mapping might depend on how you're using the LlamaIndex framework and how your data is structured. If you have more specific details about your use case or if there's a specific part of the code you're working with, I'd be happy to help further.

Best,
Dosu

Sources

llama_index/schema.py

@dosu-bot · 2023-09-06T11:37:50Z

dosubot[bot]
bot Sep 6, 2023

🤖

Hello,

In the provided JSON content for the LlamaIndex data framework, the ref_doc_id corresponds to the UUID directly under the "docstore/ref_doc_info" key. In your case, the ref_doc_id would be "83c7b926-2100-445a-a86d-2f30700d7725".

This is inferred from the structure of your JSON content, where each ref_doc_id is a unique identifier for a document in the LlamaIndex data framework. This ref_doc_id is then associated with a list of node_ids and metadata for that specific document.

Please note that the provided context does not directly mention a ref_doc_id field in the LlamaIndex data framework. The context provided refers to a "source_id" field, which is populated by a ref_doc_id in a different context (converting documents to JSON). However, in the context of your JSON content, the ref_doc_id appears to be the UUID under the "docstore/ref_doc_info" key.
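
Since the file_name metadata is stored next to each ref_doc_id, the lookup from a URL back to its ref_doc_id can be sketched with the json module alone. `find_ref_doc_ids` is a hypothetical helper, and it assumes the persisted JSON has the "docstore/ref_doc_info" layout shown in the question.

```python
import json
from typing import Any, Dict, List


def find_ref_doc_ids(docstore: Dict[str, Any], file_name: str) -> List[str]:
    """Return every ref_doc_id whose metadata["file_name"] equals file_name."""
    ref_doc_info = docstore.get("docstore/ref_doc_info", {})
    return [
        ref_doc_id
        for ref_doc_id, info in ref_doc_info.items()
        if info.get("metadata", {}).get("file_name") == file_name
    ]


# The fragment posted in the question, wrapped in braces so it parses as JSON.
sample = json.loads(
    '{"docstore/ref_doc_info": {"83c7b926-2100-445a-a86d-2f30700d7725": '
    '{"node_ids": ["631ac936-392b-4cdb-881c-e689d8f7f0fc", '
    '"31079ece-f8ef-43d1-b69f-0f18567e53b8", '
    '"1c410e07-43bd-46f3-992e-e9d12b6990bb"], '
    '"metadata": {"file_name": "http://www.xxxx.html"}}}}'
)

matches = find_ref_doc_ids(sample, "http://www.xxxx.html")
```

With a persisted index you would load the docstore JSON file from the persist directory with json.load instead of using the inline sample, and pass each returned ID to index.delete_ref_doc.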

I hope this helps! If you have any other questions, feel free to ask.

Best,
Dosu

Ma-Fukudama · 2023-09-06T23:41:41Z

Ma-Fukudama
Sep 6, 2023
Author

@dosu-bot
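docstore.json contains "docstore/metadata", "docstore/data", and "docstore/ref_doc_info". Is it correct to use the ref_doc_id from "docstore/ref_doc_info" when calling index.delete_ref_doc?

Also, I believe that entries with "page_label" in their metadata come from a single file that was split into multiple parts. If I call index.delete_ref_doc with any one of those ref_doc_ids, will all the parts of the split file be deleted?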

@dosu-bot · 2023-09-06T23:47:04Z

dosubot[bot]
bot Sep 6, 2023

🤖

Yes, you are correct. To delete a specific document from the LlamaIndex data framework using the index.delete_ref_doc method, you should use the ref_doc_id from "docstore/ref_doc_info". This is because the delete_ref_doc method uses the ref_doc_id to identify and delete the document and its nodes from the index. Here's the relevant part of the method:

def delete_ref_doc(
        self, ref_doc_id: str, delete_from_docstore: bool = False, **delete_kwargs: Any
    ) -> None:
        """Delete a document and it's nodes by using ref_doc_id."""
        ref_doc_info = self.docstore.get_ref_doc_info(ref_doc_id)
        if ref_doc_info is None:
            logger.warning(f"ref_doc_id {ref_doc_id} not found, nothing deleted.")
            return

        self.delete_nodes(
            ref_doc_info.node_ids,
            delete_from_docstore=False,
            **delete_kwargs,
        )

        if delete_from_docstore:
            self.docstore.delete_ref_doc(ref_doc_id, raise_error=False)

As for your second question, the "page_label" in the metadata does indicate that a single file has been split into multiple parts. However, when you use index.delete_ref_doc with the ref_doc_id of one of these split files, only the specified part (the document with the given ref_doc_id) will be deleted, not the entire file. This is because the delete_ref_doc method does not consider the "page_label" when deleting documents. It only uses the ref_doc_id to identify the document to be deleted.
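
To remove every part of a split file, one option is to collect all ref_doc_ids that share the same file_name metadata and delete them one by one. The sketch below assumes each part carries the same "file_name" value, as set up with file_metadata earlier in the thread, and that the index object exposes delete_ref_doc(ref_doc_id, delete_from_docstore=...) as quoted above. `delete_all_parts` and `RecordingIndex` are hypothetical names; the latter is a test double that only records calls.

```python
from typing import Any, Dict, List


def delete_all_parts(
    index: Any, ref_doc_info: Dict[str, Dict[str, Any]], file_name: str
) -> List[str]:
    """Call index.delete_ref_doc for every document that came from file_name."""
    targets = [
        ref_doc_id
        for ref_doc_id, info in ref_doc_info.items()
        if info.get("metadata", {}).get("file_name") == file_name
    ]
    for ref_doc_id in targets:
        index.delete_ref_doc(ref_doc_id, delete_from_docstore=True)
    return targets


class RecordingIndex:
    """Test double that records calls instead of deleting anything."""

    def __init__(self) -> None:
        self.deleted: List[str] = []

    def delete_ref_doc(self, ref_doc_id: str, delete_from_docstore: bool = False) -> None:
        self.deleted.append(ref_doc_id)


pages = {
    "id-page-1": {"node_ids": ["a"], "metadata": {"file_name": "report.pdf", "page_label": "1"}},
    "id-page-2": {"node_ids": ["b"], "metadata": {"file_name": "report.pdf", "page_label": "2"}},
    "id-other": {"node_ids": ["c"], "metadata": {"file_name": "notes.txt"}},
}
recorder = RecordingIndex()
removed = delete_all_parts(recorder, pages, "report.pdf")
```

The `pages` mapping has the same shape as the "docstore/ref_doc_info" section of the persisted JSON, so that section can be passed in directly.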

I hope this answers your questions. If you have any more questions or need further clarification, feel free to ask.

Sources

llama_index/indices/base.py

Ma-Fukudama · 2023-09-07T00:14:39Z

Ma-Fukudama
Sep 7, 2023
Author

@dosu-bot
If I use the KVSTORE that was introduced earlier, can files that were split into multiple parts also be deleted?

docstore = KVDocumentStore(kvstore=your_kvstore)  # Initialize your document store
docstore.delete_ref_doc(ref_doc_id="your_file_id")

@dosu-bot · 2023-09-07T00:19:42Z

dosubot[bot]
bot Sep 7, 2023

🤖

Hello,

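The delete_ref_doc method from the KVDocumentStore class in the LlamaIndex data framework deletes every node associated with the specified ref_doc_id. Note that if a file was split into several documents at load time (for example, one per page), each part has its own ref_doc_id and must be deleted with its own call.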

Here's how it works:

def delete_ref_doc(self, ref_doc_id: str, raise_error: bool = True) -> None:
    """Delete a ref_doc and all it's associated nodes."""
    ref_doc_info = self.get_ref_doc_info(ref_doc_id)
    if ref_doc_info is None:
        if raise_error:
            raise ValueError(f"ref_doc_id {ref_doc_id} not found.")
        else:
            return

    for doc_id in ref_doc_info.node_ids:
        self.delete_document(doc_id, raise_error=False, remove_ref_doc_node=False)

    self._kvstore.delete(ref_doc_id, collection=self._metadata_collection)
    self._kvstore.delete(ref_doc_id, collection=self._ref_doc_collection)

In this method, it first retrieves the RefDocInfo for the given ref_doc_id. If it exists, it iterates over all the node_ids associated with the ref_doc_id and deletes each document using the delete_document method. Finally, it deletes the ref_doc_id from the metadata_collection and ref_doc_collection. Therefore, a single call removes all the nodes that belong to the specified ref_doc_id.

You can find this method in the llama_index/storage/docstore/keyval_docstore.py file.
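
The scope of a single call can be illustrated with a toy store that keeps three dict "collections", loosely mirroring the data, metadata, and ref_doc_info sections of the persisted docstore. `ToyDocstore` is a simplified assumption, not the real KVDocumentStore.

```python
from typing import Dict, List


class ToyDocstore:
    """Three dict collections standing in for the key-value store (illustrative)."""

    def __init__(self) -> None:
        self.data: Dict[str, str] = {}                 # node_id -> node content
        self.metadata: Dict[str, Dict[str, str]] = {}  # node_id or ref_doc_id -> metadata
        self.ref_doc_info: Dict[str, List[str]] = {}   # ref_doc_id -> node_ids

    def add_node(self, ref_doc_id: str, node_id: str, text: str) -> None:
        self.data[node_id] = text
        self.metadata[node_id] = {"ref_doc_id": ref_doc_id}
        self.ref_doc_info.setdefault(ref_doc_id, []).append(node_id)

    def delete_ref_doc(self, ref_doc_id: str, raise_error: bool = True) -> None:
        node_ids = self.ref_doc_info.get(ref_doc_id)
        if node_ids is None:
            if raise_error:
                raise ValueError(f"ref_doc_id {ref_doc_id} not found.")
            return
        # Remove every node that belongs to this reference document.
        for node_id in node_ids:
            self.data.pop(node_id, None)
            self.metadata.pop(node_id, None)
        # Then remove the bookkeeping entries for the document itself.
        self.metadata.pop(ref_doc_id, None)
        del self.ref_doc_info[ref_doc_id]


store = ToyDocstore()
store.add_node("page-1", "n1", "chunk one of page 1")
store.add_node("page-1", "n2", "chunk two of page 1")
store.add_node("page-2", "n3", "only chunk of page 2")
store.delete_ref_doc("page-1")
```

Both chunks of page-1 disappear in one call, while page-2, which has its own ref_doc_id, is left untouched.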

I hope this clarifies your question. If you have any more questions, feel free to ask.

Best,
Dosu

Sources

llama_index/storage/docstore/keyval_docstore.py

Ma-Fukudama · 2023-09-07T00:26:23Z

Ma-Fukudama
Sep 7, 2023
Author

@dosu-bot
Can KVDocumentStore and VectorStoreIndex coexist?
Please tell me specifically how to create a KVDocumentStore.

@dosu-bot · 2023-09-07T00:33:04Z

dosubot[bot]
bot Sep 7, 2023

🤖

Hello,

Yes, the KVDocumentStore can coexist with the VectorStoreIndex in the LlamaIndex data framework. The KVDocumentStore is a type of document store that is used to store Node objects, and each node is assigned an ID. The same docstore can be reused across index structures, which allows you to reuse the same storage for multiple index structures.

Here is an example of how to create and use a KVDocumentStore:

from llama_index.storage.docstore.keyval_docstore import KVDocumentStore
from llama_index.storage.kvstore import SimpleKVStore
from llama_index.schema import TextNode

# BaseKVStore is abstract, so use a concrete implementation
# such as the in-memory SimpleKVStore
kvstore = SimpleKVStore()

# Create a KVDocumentStore
docstore = KVDocumentStore(kvstore)

# BaseNode is abstract as well, so build concrete TextNode objects
nodes = [TextNode(text="first node"), TextNode(text="second node")]

# Add documents to the store
docstore.add_documents(nodes)

# Get a document from the store
doc_id = nodes[0].node_id
document = docstore.get_document(doc_id)

# Check if a document exists
exists = docstore.document_exists(doc_id)

# Delete a document from the store
docstore.delete_document(doc_id)

In the above example, we first create a KVDocumentStore by passing a kvstore instance of BaseKVStore to the KVDocumentStore constructor. Then, we add documents to the store using the add_documents method. We can get a document from the store using the get_document method and check if a document exists using the document_exists method. Finally, we delete a document from the store using the delete_document method.

As for the VectorStoreIndex, it is an index that is built on top of an existing vector store. It is a class that inherits from the BaseIndex class and is used to manage and manipulate data in the vector store. The specific roles of the VectorStoreIndex include initialization, building index, insertion, deletion, and retrieval. You can find more details in the source code.

I hope this helps! If you have any other questions, feel free to ask.

Best,
Dosu

Ma-Fukudama · 2023-09-07T00:37:01Z

Ma-Fukudama
Sep 7, 2023
Author

@dosu-bot
When creating documents with the Kvstore, how can I store the file name in the metadata?
Until now I have been using SimpleDirectoryReader.

@dosu-bot · 2023-09-07T00:47:34Z

dosubot[bot]
bot Sep 7, 2023

🤖

Hello,

To store the file name in the metadata when creating documents with the KVDocumentStore in the LlamaIndex data framework, you can modify the add_documents method in the KVDocumentStore class.

Here is the modified add_documents method:

def add_documents(
    self, nodes: Sequence[BaseNode], file_name: str, allow_update: bool = True
) -> None:
    """Add a document to the store.
    Args:
        nodes (Sequence[BaseNode]): nodes to add
        file_name (str): name of the file
        allow_update (bool): allow update of docstore from document

    """
    for node in nodes:
        # NOTE: doc could already exist in the store, but we overwrite it
        if not allow_update and self.document_exists(node.node_id):
            raise ValueError(
                f"node_id {node.node_id} already exists. "
                "Set allow_update to True to overwrite."
            )
        node_key = node.node_id
        data = doc_to_json(node)
        self._kvstore.put(node_key, data, collection=self._node_collection)

        # update doc_collection if needed
        metadata = {"doc_hash": node.hash, "file_name": file_name}
        if isinstance(node, TextNode) and node.ref_doc_id is not None:
            ref_doc_info = self.get_ref_doc_info(node.ref_doc_id) or RefDocInfo()
            ref_doc_info.node_ids.append(node.node_id)
            if not ref_doc_info.metadata:
                ref_doc_info.metadata = node.metadata or {}
            self._kvstore.put(
                node.ref_doc_id,
                ref_doc_info.to_dict(),
                collection=self._ref_doc_collection,
            )

            # update metadata with map
            metadata["ref_doc_id"] = node.ref_doc_id
            self._kvstore.put(
                node_key, metadata, collection=self._metadata_collection
            )
        else:
            self._kvstore.put(
                node_key, metadata, collection=self._metadata_collection
            )

In this modified method, file_name is an additional argument that you need to provide when calling add_documents. The file_name is then stored in the metadata dictionary for each node.
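
To make the bookkeeping concrete, here is a minimal, self-contained sketch of the same idea in plain Python. It does not use LlamaIndex: ToyNode, ToyDocStore, add_nodes, and source_name are hypothetical names invented for illustration, and the three dictionaries only loosely mirror the node, metadata, and ref-doc collections used in the method above.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ToyNode:
    """Hypothetical stand-in for a text chunk; not LlamaIndex's TextNode."""
    node_id: str
    text: str
    ref_doc_id: str  # ID of the source document (a file path or a URL)


@dataclass
class ToyDocStore:
    """Hypothetical in-memory store with node, metadata, and ref-doc maps."""
    nodes: Dict[str, ToyNode] = field(default_factory=dict)
    metadata: Dict[str, dict] = field(default_factory=dict)
    ref_docs: Dict[str, List[str]] = field(default_factory=dict)

    def add_nodes(self, nodes: List[ToyNode], source_name: str) -> None:
        for node in nodes:
            self.nodes[node.node_id] = node
            # Record the source name next to each node's metadata entry.
            self.metadata[node.node_id] = {
                "file_name": source_name,
                "ref_doc_id": node.ref_doc_id,
            }
            # Track which nodes were split from which source document.
            node_ids = self.ref_docs.setdefault(node.ref_doc_id, [])
            if node.node_id not in node_ids:
                node_ids.append(node.node_id)

    def delete_ref_doc(self, ref_doc_id: str) -> None:
        # Remove every chunk that was split from the same source document.
        for node_id in self.ref_docs.pop(ref_doc_id, []):
            self.nodes.pop(node_id, None)
            self.metadata.pop(node_id, None)


store = ToyDocStore()
url = "http://example.com"
store.add_nodes(
    [ToyNode("n1", "first chunk", url), ToyNode("n2", "second chunk", url)],
    source_name=url,
)
store.add_nodes([ToyNode("n3", "file chunk", "report.txt")], source_name="report.txt")

store.delete_ref_doc(url)
remaining = sorted(store.nodes)
```

In this sketch the URL plays the same role as a file name: it is stored in each node's metadata entry and doubles as the key under which all of that source's chunks can be deleted together.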

As for the structure of the BaseNode class used in the add_documents method of the KVDocumentStore class, I wasn't able to find specific information within the repository. If you could provide more details about the BaseNode class or its usage, I might be able to give a more accurate answer.

I hope this helps! Let me know if you have any other questions.

Sources

llama_index/storage/docstore/keyval_docstore.py

logan-markewich · 2023-09-07T17:45:09Z

logan-markewich
Sep 7, 2023
Maintainer

@Ma-Fukudama this thread is pretty long 😅 Is the root issue solved or no? Can you re-iterate what the problem is?

Ma-Fukudama · 2023-09-08T01:12:12Z

Ma-Fukudama
Sep 8, 2023
Author

@logan-markewich
I want documents created from a URL to carry a name in their metadata, just like documents created from a file (solved here).
I also want a deletion to remove all of the split chunks that belong to a file or URL.
