[Question]: Can metadata be given a value so that they can be distinguished when indexing from URLs? #8405
Replies: 21 comments 1 reply
-
🤖 Hello,

Yes, there is a way to register the URL like a file name when creating an index from the URL in LlamaIndex. You can do this by using the URL as the ref_doc_id of each node. ref_doc_id is a property derived from the node's source relationship (see the snippet further down), so it is set through relationships rather than passed to the constructor. Here is an example of how you can do this:

from llama_index.schema import NodeRelationship, RelatedNodeInfo, TextNode
from llama_index.storage.docstore import SimpleDocumentStore
# Create a TextNode whose source node ID (and therefore its ref_doc_id) is the URL
source = RelatedNodeInfo(node_id="http://example.com")
node = TextNode(id_="node1", text="This is a test document", relationships={NodeRelationship.SOURCE: source})
# Add the node to the document store
docstore = SimpleDocumentStore()
docstore.add_documents([node])

When you want to delete the document, you can use the delete_ref_doc method:

# Delete the document with the URL as the ref_doc_id
docstore.delete_ref_doc("http://example.com")

This will delete the document and all its associated nodes from the document store.

LlamaIndex does support the ref_doc_id property. Here is the relevant code snippet:

@property
def ref_doc_id(self) -> Optional[str]:
    """Deprecated: Get ref doc id."""
    source_node = self.source_node
    if source_node is None:
        return None
    return source_node.node_id

This property returns the node_id of the node's source node, or None if the node has no source node.

I hope this helps! Let me know if you have any other questions.

Best,
This response is meant to be useful and save you time. It is not meant to be a precise solution, but rather a starting point for your own research. Help me be more useful! Please leave a 👍 if this is helpful and 👎 if it is irrelevant. If you want to continue the conversation, start your reply with @dosu-bot.
-
@dosu-beta When creating a document using a file, the file name is registered in the metadata. The sources I'm indexing right now are:
-
🤖 Hello,

To register file names in metadata, you can pass a metadata dictionary when creating a Document or TextNode. Here is an example of how you can do it:

from llama_index.schema import Document, TextNode
from llama_index.storage.docstore import SimpleDocumentStore
# Create a Document or TextNode with metadata including the file name
doc = Document(text="hello world", id_="d1", metadata={"file_name": "file1.txt", "foo": "bar"})
node = TextNode(text="my node", id_="d2", metadata={"file_name": "file2.txt", "node": "info"})
# Create a SimpleDocumentStore and add the documents
docstore = SimpleDocumentStore()
docstore.add_documents([doc, node])

In this example, the file name is stored under the file_name key of each object's metadata.

In SimpleDirectoryReader, the file_metadata function is called with each file path, and its result is attached to the loaded documents as metadata. Here is the relevant code:

def load_data(self) -> List[Document]:
    """Load data from the input directory.

    Returns:
        List[Document]: A list of documents.
    """
    documents = []
    for input_file in self.input_files:
        metadata: Optional[dict] = None
        if self.file_metadata is not None:
            metadata = self.file_metadata(str(input_file))
        file_suffix = input_file.suffix.lower()
        if (
            file_suffix in self.supported_suffix
            or file_suffix in self.file_extractor
        ):
            # use file readers
            if file_suffix not in self.file_extractor:
                # instantiate file reader if not already
                reader_cls = DEFAULT_FILE_READER_CLS[file_suffix]
                self.file_extractor[file_suffix] = reader_cls()
            reader = self.file_extractor[file_suffix]
            docs = reader.load_data(input_file, extra_info=metadata)
            # iterate over docs if needed
            if self.filename_as_id:
                for i, doc in enumerate(docs):
                    doc.id_ = f"{str(input_file)}_part_{i}"
            documents.extend(docs)
        else:
            # do standard read
            with open(
                input_file, "r", errors=self.errors, encoding=self.encoding
            ) as f:
                data = f.read()
            doc = Document(text=data, metadata=metadata or {})
            if self.filename_as_id:
                doc.id_ = str(input_file)
            documents.append(doc)
    return documents

In this code, file_metadata is called with the file path as a string, and the dictionary it returns becomes the metadata of every document loaded from that file.

I hope this helps! Let me know if you have any other questions.
-
I don't know where to add it.
-
🤖 Hello,

From your code, it seems like you're already on the right track with adding metadata to your documents. You're correctly using the file_metadata parameter of SimpleDirectoryReader. Here's how it works:

filename_fn = lambda filename: {'file_name': filename}
file_documents = SimpleDirectoryReader(input_files=text_list, file_metadata=filename_fn).load_data()

In this code, filename_fn receives each file path and returns a dictionary that is stored as the metadata of the documents loaded from that file.

However, it seems like you're not adding metadata to the documents loaded from URLs. If you want to add metadata to these documents as well, you can do so by subclassing SimpleWebPageReader. Here's an example of how you could do this:

from llama_index import SimpleWebPageReader

class CustomWebPageReader(SimpleWebPageReader):
    def __init__(self, html_to_text=False, url_metadata=None):
        super().__init__(html_to_text)
        # Optional callable that maps a URL to a metadata dictionary
        self.url_metadata = url_metadata

    def load_data(self, urls):
        documents = super().load_data(urls)
        if self.url_metadata is not None:
            # Assumes the reader returns one document per URL, in the same order
            for doc, url in zip(documents, urls):
                metadata = self.url_metadata(url)
                doc.metadata.update(metadata)
        return documents

url_metadata_fn = lambda url: {'url': url}
url_documents = CustomWebPageReader(html_to_text=True, url_metadata=url_metadata_fn).load_data(
    [
        "http://paulgraham.com/worked.html"
    ]
)

In this code, url_metadata_fn receives each URL and returns a dictionary that is merged into the metadata of the document loaded from that URL.

I hope this helps! Let me know if you have any other questions.
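As a rough illustration of the metadata-function pattern described above, here is a minimal standard-library sketch that tags documents from files and from URLs so they can be told apart later. The names `file_metadata_fn` and `url_metadata_fn`, and the `source_type` and `host` keys, are assumptions made for this example and are not part of LlamaIndex; plain dictionaries stand in for loaded documents.

```python
from urllib.parse import urlparse


def file_metadata_fn(path: str) -> dict:
    """Metadata for a document loaded from a local file (hypothetical helper)."""
    return {"file_name": path, "source_type": "file"}


def url_metadata_fn(url: str) -> dict:
    """Metadata for a document loaded from a URL (hypothetical helper)."""
    return {"url": url, "source_type": "url", "host": urlparse(url).netloc}


# Plain dicts stand in for loaded documents; each one carries its own metadata.
documents = [
    {"text": "file contents", "metadata": file_metadata_fn("notes/file1.txt")},
    {"text": "page contents", "metadata": url_metadata_fn("http://paulgraham.com/worked.html")},
]

# Later, documents that came from a URL can be selected by their metadata.
from_urls = [d for d in documents if d["metadata"]["source_type"] == "url"]
```

Keeping a shared key such as `source_type` in both functions is one possible design choice: it lets a single filter distinguish the two kinds of source without inspecting which other keys are present.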
-
Please tell me how to create a VectorStoreIndex using the source below and exclude one file in the index.
-
🤖 Hello,

To exclude a specific file from the LlamaIndex after it has been created, you can use the delete_ref_doc method. Here is an example of how you can use it:

docstore = KVDocumentStore(kvstore=your_kvstore)  # Initialize your document store
docstore.delete_ref_doc(ref_doc_id="your_file_id")

In this example, replace your_kvstore with your key-value store and your_file_id with the reference ID of the file you want to exclude.

The document reference ID in the LlamaIndex framework can be generated based on the filename. This is controlled by the filename_as_id parameter. This is evident in the load_data method of SimpleDirectoryReader. Here is the relevant code:

if self.filename_as_id:
    for i, doc in enumerate(docs):
        doc.id_ = f"{str(input_file)}_part_{i}"

and

if self.filename_as_id:
    doc.id_ = str(input_file)

If filename_as_id is True, the document ID is the file path, with a _part_{i} suffix when a file reader returns the file as a list of documents.

I hope this helps! Let me know if you have any other questions.

Best,
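To make the ID scheme concrete, the following sketch reproduces the two naming rules from the snippets above in plain Python, without LlamaIndex. `ids_for_file` and `id_belongs_to_file` are hypothetical helpers written for this example; they only mirror the string format shown above.

```python
import re
from typing import List


def ids_for_file(path: str, num_docs: int, used_reader: bool) -> List[str]:
    """Mirror the ID rules quoted above for filename_as_id=True."""
    if used_reader:
        # A file reader may return several documents per file: one ID per part.
        return [f"{path}_part_{i}" for i in range(num_docs)]
    # Standard read: one document whose ID is the file path itself.
    return [path]


def id_belongs_to_file(doc_id: str, path: str) -> bool:
    """True if doc_id is the path itself or the path plus a _part_<n> suffix."""
    return re.fullmatch(re.escape(path) + r"(_part_\d+)?", doc_id) is not None


pdf_ids = ids_for_file("docs/report.pdf", num_docs=3, used_reader=True)
txt_ids = ids_for_file("docs/notes.txt", num_docs=1, used_reader=False)
```

A matcher like `id_belongs_to_file` is one way to gather every ID that belongs to a file before deleting them, since a split file has no single shared ID under this scheme.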
-
dosu-bot
-
🤖 Hello,

To exclude a specific file from the LlamaIndex after it has been created, you can indeed use the delete_ref_doc method. Here is how you can modify your code to exclude a specific file:

# Assume that the file you want to exclude has a reference ID of `file_to_exclude`
index.delete_ref_doc(ref_doc_id='file_to_exclude', delete_from_docstore=True)

In this example, file_to_exclude is the reference ID of the file you want to exclude, and delete_from_docstore=True also removes the document from the document store.

Please note that this operation cannot be undone. Once a file is excluded from the index, you will need to add it back to the index manually if you want to include it again.

Here's how the delete_ref_doc method works: it looks up the ref_doc_info for the given ref_doc_id, logs a warning and returns if none is found, deletes the associated nodes from the index, and, if delete_from_docstore is True, deletes the document from the docstore as well.

Here's the code snippet for the delete_ref_doc method:

def delete_ref_doc(
    self, ref_doc_id: str, delete_from_docstore: bool = False, **delete_kwargs: Any
) -> None:
    """Delete a document and it's nodes by using ref_doc_id."""
    ref_doc_info = self.docstore.get_ref_doc_info(ref_doc_id)
    if ref_doc_info is None:
        logger.warning(f"ref_doc_id {ref_doc_id} not found, nothing deleted.")
        return
    self.delete_nodes(
        ref_doc_info.node_ids,
        delete_from_docstore=False,
        **delete_kwargs,
    )
    if delete_from_docstore:
        self.docstore.delete_ref_doc(ref_doc_id, raise_error=False)

I hope this helps! Let me know if you have any other questions.
-
dosu-bot
-
🤖 Hello, In the provided JSON content for the LlamaIndex data framework, the This is inferred from the structure of your JSON content, where each Please note that the provided context does not directly mention a I hope this helps! If you have any other questions, feel free to ask. Best, Sources
-
@dosu-bot Also, I believe that entries with "page_label" in their metadata come from a single file that has been split into multiple parts. If I specify the ref_doc_id of any one of them and call index.delete_ref_doc, will all the split parts of that file be deleted?
-
🤖 Yes, you are correct. To delete a specific document from the LlamaIndex data framework using the delete_ref_doc method, you pass that document's ref_doc_id:

def delete_ref_doc(
    self, ref_doc_id: str, delete_from_docstore: bool = False, **delete_kwargs: Any
) -> None:
    """Delete a document and it's nodes by using ref_doc_id."""
    ref_doc_info = self.docstore.get_ref_doc_info(ref_doc_id)
    if ref_doc_info is None:
        logger.warning(f"ref_doc_id {ref_doc_id} not found, nothing deleted.")
        return
    self.delete_nodes(
        ref_doc_info.node_ids,
        delete_from_docstore=False,
        **delete_kwargs,
    )
    if delete_from_docstore:
        self.docstore.delete_ref_doc(ref_doc_id, raise_error=False)

As for your second question, the "page_label" in the metadata does indicate that a single file has been split into multiple parts. However, when you use index.delete_ref_doc, it deletes the nodes recorded under the ref_doc_id you pass, as the code above shows.

I hope this answers your questions. If you have any more questions or need further clarification, feel free to ask.
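If each split part turns out to have its own ref_doc_id, one way to remove a whole file is to collect every ref_doc_id whose metadata names that file and delete them one by one. The sketch below shows only the selection step in plain Python. The `ref_docs` dictionary is a made-up stand-in for whatever ref_doc_id-to-metadata mapping your index exposes, and `ref_doc_ids_for_file` is a hypothetical helper, not a LlamaIndex function.

```python
from typing import Dict, List


def ref_doc_ids_for_file(ref_docs: Dict[str, dict], file_name: str) -> List[str]:
    """Return every ref_doc_id whose metadata points at the given file."""
    return [
        ref_doc_id
        for ref_doc_id, metadata in ref_docs.items()
        if metadata.get("file_name") == file_name
    ]


# Made-up data: two pages of one PDF plus an unrelated text file.
ref_docs = {
    "id-1": {"file_name": "report.pdf", "page_label": "1"},
    "id-2": {"file_name": "report.pdf", "page_label": "2"},
    "id-3": {"file_name": "notes.txt"},
}

to_delete = ref_doc_ids_for_file(ref_docs, "report.pdf")
```

Each returned ID could then be passed to index.delete_ref_doc with delete_from_docstore=True, the method quoted above.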
-
@dosu-bot
-
🤖 Hello,

The delete_ref_doc method of the document store deletes a ref_doc and all of its associated nodes. Here's how it works:

def delete_ref_doc(self, ref_doc_id: str, raise_error: bool = True) -> None:
    """Delete a ref_doc and all it's associated nodes."""
    ref_doc_info = self.get_ref_doc_info(ref_doc_id)
    if ref_doc_info is None:
        if raise_error:
            raise ValueError(f"ref_doc_id {ref_doc_id} not found.")
        else:
            return
    for doc_id in ref_doc_info.node_ids:
        self.delete_document(doc_id, raise_error=False, remove_ref_doc_node=False)
    self._kvstore.delete(ref_doc_id, collection=self._metadata_collection)
    self._kvstore.delete(ref_doc_id, collection=self._ref_doc_collection)

In this method, it first retrieves the ref_doc_info for the given ref_doc_id. If none is found, it either raises a ValueError or returns, depending on raise_error. Otherwise it deletes every node listed in ref_doc_info.node_ids and then removes the ref_doc_id entry from the metadata collection and the ref_doc collection.

You can find this method in the llama_index/storage/docstore/keyval_docstore.py file.

I hope this clarifies your question. If you have any more questions, feel free to ask.

Best,
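The cascade can be seen in a toy model. The class below is a deliberately simplified in-memory stand-in written for this example, not LlamaIndex's KVDocumentStore; it keeps only the two pieces of state that matter here, the nodes and the list of node IDs recorded under each ref_doc_id.

```python
from typing import Dict, List


class TinyDocstore:
    """Toy in-memory docstore (hypothetical, for illustration only)."""

    def __init__(self) -> None:
        self.nodes: Dict[str, str] = {}            # node_id -> text
        self.ref_docs: Dict[str, List[str]] = {}   # ref_doc_id -> node_ids

    def add_node(self, node_id: str, text: str, ref_doc_id: str) -> None:
        self.nodes[node_id] = text
        self.ref_docs.setdefault(ref_doc_id, []).append(node_id)

    def delete_ref_doc(self, ref_doc_id: str, raise_error: bool = True) -> None:
        node_ids = self.ref_docs.get(ref_doc_id)
        if node_ids is None:
            if raise_error:
                raise ValueError(f"ref_doc_id {ref_doc_id} not found.")
            return
        # Remove every node recorded under this ref_doc_id, then the entry itself.
        for node_id in node_ids:
            self.nodes.pop(node_id, None)
        del self.ref_docs[ref_doc_id]


store = TinyDocstore()
store.add_node("n1", "first chunk", ref_doc_id="http://example.com")
store.add_node("n2", "second chunk", ref_doc_id="http://example.com")
store.add_node("n3", "another document", ref_doc_id="file1.txt")

store.delete_ref_doc("http://example.com")
remaining = sorted(store.nodes)
```

After the call, only the node that belongs to the other ref_doc_id is left, which is the behavior the quoted method describes for the real store.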
-
@dosu-bot
-
🤖 Hello,

Yes, the KVDocumentStore can coexist with the VectorStoreIndex in the LlamaIndex data framework. The KVDocumentStore is a type of document store that is used to store Node objects, and each node is assigned an ID. The same docstore can be reused across multiple index structures.

Here is an example of how to create and use a KVDocumentStore:

from llama_index.storage.docstore.keyval_docstore import KVDocumentStore
from llama_index.storage.kvstore import SimpleKVStore
from llama_index.schema import TextNode
# BaseKVStore is abstract, so use a concrete implementation such as SimpleKVStore
kvstore = SimpleKVStore()
# Create a KVDocumentStore
docstore = KVDocumentStore(kvstore)
# BaseNode is abstract as well, so build concrete TextNode objects
nodes = [TextNode(text="first node"), TextNode(text="second node")]
# Add documents to the store
docstore.add_documents(nodes)
# Get a document from the store
doc_id = nodes[0].node_id
document = docstore.get_document(doc_id)
# Check if a document exists
exists = docstore.document_exists(doc_id)
# Delete a document from the store
docstore.delete_document(doc_id)

In the above example, we first create a KVDocumentStore by passing a kvstore instance (here a SimpleKVStore, which implements BaseKVStore) to the KVDocumentStore constructor. Then, we add documents to the store using the add_documents method. We can get a document from the store using the get_document method and check if a document exists using the document_exists method. Finally, we delete a document from the store using the delete_document method.

As for the VectorStoreIndex, it is an index that is built on top of an existing vector store. It is a class that inherits from the BaseIndex class and is used to manage and manipulate data in the vector store. The specific roles of the VectorStoreIndex include initialization, building the index, insertion, deletion, and retrieval. You can find more details in the source code.

I hope this helps! If you have any other questions, feel free to ask.

Best,
-
@dosu-bot
-
🤖 Hello,

To store the file name in the metadata when creating documents with the KVDocumentStore in the LlamaIndex data framework, you can modify the add_documents method. Here is the modified add_documents method:

def add_documents(
    self, nodes: Sequence[BaseNode], file_name: str, allow_update: bool = True
) -> None:
    """Add a document to the store.

    Args:
        nodes (Sequence[BaseNode]): nodes to add
        file_name (str): name of the file
        allow_update (bool): allow update of docstore from document
    """
    for node in nodes:
        # NOTE: doc could already exist in the store, but we overwrite it
        if not allow_update and self.document_exists(node.node_id):
            raise ValueError(
                f"node_id {node.node_id} already exists. "
                "Set allow_update to True to overwrite."
            )
        node_key = node.node_id
        data = doc_to_json(node)
        self._kvstore.put(node_key, data, collection=self._node_collection)
        # update doc_collection if needed
        metadata = {"doc_hash": node.hash, "file_name": file_name}
        if isinstance(node, TextNode) and node.ref_doc_id is not None:
            ref_doc_info = self.get_ref_doc_info(node.ref_doc_id) or RefDocInfo()
            ref_doc_info.node_ids.append(node.node_id)
            if not ref_doc_info.metadata:
                ref_doc_info.metadata = node.metadata or {}
            self._kvstore.put(
                node.ref_doc_id,
                ref_doc_info.to_dict(),
                collection=self._ref_doc_collection,
            )
            # update metadata with map
            metadata["ref_doc_id"] = node.ref_doc_id
            self._kvstore.put(
                node_key, metadata, collection=self._metadata_collection
            )
        else:
            self._kvstore.put(
                node_key, metadata, collection=self._metadata_collection
            )

In this modified method, file_name is passed as an additional argument and stored, alongside doc_hash, in the metadata entry written for every node.

I hope this helps! Let me know if you have any other questions.
-
@Ma-Fukudama this thread is pretty long 😅 Is the root issue solved or no? Can you re-iterate what the problem is?
-
@logan-markewich
-
Question Validation
Question
I am creating indexes from files using SimpleDirectoryReader and from URLs using SimpleWebPageReader.
The file name is registered in the metadata for documents created from files, but nothing equivalent is registered for documents created from URLs.
Later, when deleting data from the index, I expect to delete by node ID, but since documents from URLs have only a node ID in docstore.json, I cannot tell them apart.
Is there a way to register the URL like a file name when creating an index from the URL?