Building a solution in which nodes do not contain text data, but only a reference to this data #13755
Hey there, @m-borkowski! I'm here to help you with any questions or issues you have while waiting for a human maintainer. Let's tackle this together!

To implement a solution in LlamaIndex where nodes only contain embeddings and metadata with references to the original text, you can use the `TextNode` class with an empty `text` field. Here is an example of how you can create such nodes:

```python
from llama_index.core.schema import TextNode, MetadataMode

# Create a TextNode with an embedding and metadata, but no text
node = TextNode(
    text="",
    embedding=[0.1, 0.2, 0.3],  # Example embedding
    metadata={"source": "original_text_reference"},
)

# Access the embedding and metadata
embedding = node.get_embedding()
metadata = node.metadata

# Print the node's content (which will be empty in this case)
print(node.get_content(metadata_mode=MetadataMode.NONE))
```

Regarding vector database integration, LlamaIndex supports integration with ChromaVectorStore, LanceDBVectorStore, TiDBVectorStore, and DeepLakeVectorStore, all of which can store and query nodes based on their embeddings and metadata.
Here is an example of how you can use DeepLakeVectorStore with LlamaIndex:

```python
from llama_index.vector_stores.deeplake import DeepLakeVectorStore
from llama_index.core.schema import TextNode, NodeRelationship, RelatedNodeInfo
from llama_index.core.vector_stores.types import VectorStoreQuery

# Create a DeepLakeVectorStore instance
deeplake_store = DeepLakeVectorStore(
    dataset_path="your_dataset_path",
    overwrite=True,
)

# Create a TextNode with an embedding, metadata, and a relationship
# pointing back to the original text
node = TextNode(
    text="",
    embedding=[0.1, 0.2, 0.3],  # Example embedding
    metadata={"source": "original_text_reference"},
    relationships={
        NodeRelationship.SOURCE: RelatedNodeInfo(node_id="original_text_id")
    },
)

# Add the node to the DeepLakeVectorStore
deeplake_store.add([node])

# Query the store
query_embedding = [0.1, 0.2, 0.3]
result = deeplake_store.query(
    VectorStoreQuery(query_embedding=query_embedding, similarity_top_k=1)
)

# Access the result
for res_node in result.nodes:
    print(res_node.get_content())
    print(res_node.metadata)
```

This approach ensures that nodes contain only embeddings and metadata, with references to the original text stored in the metadata. ChromaVectorStore, LanceDBVectorStore, TiDBVectorStore, and DeepLakeVectorStore are all suitable vector database integrations for this approach [1][2][3][4][5].
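Since the nodes themselves carry no text, a retrieval step usually needs to resolve the `source` reference back to the original text after querying. Here is a minimal sketch of that hydration step; `fetch_original_text` is a hypothetical helper, and the assumption that the reference is an HTTP URL is illustrative only — substitute whatever backend actually holds your text:

```python
import requests
from llama_index.core.vector_stores.types import VectorStoreQuery


def fetch_original_text(reference: str) -> str:
    # Hypothetical helper: the reference is assumed to be an HTTP URL here;
    # swap in an S3 client, database lookup, etc. for your real backend.
    response = requests.get(reference, timeout=10)
    response.raise_for_status()
    return response.text


# Query the store, then hydrate each hit with its original text
result = deeplake_store.query(
    VectorStoreQuery(query_embedding=[0.1, 0.2, 0.3], similarity_top_k=3)
)
for node in result.nodes:
    reference = node.metadata.get("source")
    if reference is not None:
        node.set_content(fetch_original_text(reference))
```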
To ensure efficient and reliable data retrieval when integrating external APIs within a custom reader, you should combine session-level retries with backoff, request timeouts, and structured error handling. Here is an example of how these best practices can be applied to the `OpenAlexReader`:

```python
import logging
from typing import List

import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

from llama_index.core.readers.base import BaseReader
from llama_index.core.schema import Document

logger = logging.getLogger(__name__)
logger.setLevel(logging.ERROR)


class OpenAlexReader(BaseReader):
    def __init__(self, email) -> None:
        self.email = email
        # Retry transient failures (rate limits, server errors) with backoff
        self.session = requests.Session()
        retries = Retry(
            total=5,
            backoff_factor=1,
            status_forcelist=[429, 500, 502, 503, 504],
        )
        self.session.mount("https://", HTTPAdapter(max_retries=retries))

    def _search_openalex(self, query, fields):
        base_url = "https://api.openalex.org/works?search="
        fields_param = f"&select={fields}"
        email_param = f"&mailto={self.email}"
        full_url = base_url + query + fields_param + email_param
        try:
            # Time out slow requests and surface HTTP/API errors explicitly
            response = self.session.get(full_url, timeout=10)
            response.raise_for_status()
            data = response.json()
            if "error" in data:
                raise ValueError(f"API returned error: {data['error']}")
            return data
        except requests.exceptions.HTTPError as http_error:
            logger.error(f"HTTP error occurred: {http_error}")
        except requests.exceptions.RequestException as request_error:
            logger.error(f"Error occurred: {request_error}")
        except ValueError as value_error:
            logger.error(value_error)
        return None

    def _fulltext_search_openalex(self, query, fields):
        base_url = "https://api.openalex.org/works?filter=fulltext.search:"
        fields_param = f"&select={fields}"
        email_param = f"&mailto={self.email}"
        full_url = base_url + query + fields_param + email_param
        try:
            response = self.session.get(full_url, timeout=10)
            response.raise_for_status()
            data = response.json()
            if "error" in data:
                raise ValueError(f"API returned error: {data['error']}")
            return data
        except requests.exceptions.HTTPError as http_error:
            logger.error(f"HTTP error occurred: {http_error}")
        except requests.exceptions.RequestException as request_error:
            logger.error(f"Error occurred: {request_error}")
        except ValueError as value_error:
            logger.error(value_error)
        return None

    def _invert_abstract(self, inv_index):
        # OpenAlex returns abstracts as an inverted index; rebuild the text
        if inv_index is not None:
            l_inv = [(w, p) for w, pos in inv_index.items() for p in pos]
            return " ".join(x[0] for x in sorted(l_inv, key=lambda x: x[1]))
        return None

    def load_data(self, query: str, full_text=False, fields=None) -> List[Document]:
        if fields is None:
            fields = "title,abstract_inverted_index,publication_year,keywords,authorships,primary_location"

        if full_text:
            works = self._fulltext_search_openalex(query, fields)
        else:
            works = self._search_openalex(query, fields)

        if works is None:
            return []

        documents = []
        for work in works["results"]:
            if work["abstract_inverted_index"] is not None:
                abstract = self._invert_abstract(work["abstract_inverted_index"])
            else:
                abstract = None

            title = work.get("title", None)
            text = None
            if abstract and title:
                text = title + " " + abstract
            elif not abstract:
                text = title

            try:
                primary_location = work["primary_location"]["source"]["display_name"]
            except (KeyError, TypeError):
                primary_location = None

            metadata = {
                "title": work.get("title", None),
                "keywords": work.get("keywords", None),
                "primary_location": primary_location,
                "publication_year": work.get("publication_year", None),
                "authorships": [
                    item["author"]["display_name"] for item in work["authorships"]
                ],
            }

            documents.append(Document(text=text, extra_info=metadata))

        return documents
```

By following these practices, you can ensure efficient and reliable data retrieval when integrating external APIs within the LlamaIndex framework.
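For example, this reader could be used like so (the email address and query are placeholders):

```python
# Provide a contact email, which OpenAlex uses for its polite pool
reader = OpenAlexReader(email="you@example.com")

# Fetch works matching a query and inspect the resulting Documents
documents = reader.load_data("retrieval augmented generation")
for doc in documents:
    print(doc.metadata["title"])
```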
LlamaIndex also handles caching of node content using the `IngestionCache` class:

```python
from llama_index.legacy.ingestion import IngestionCache
from llama_index.legacy.schema import TextNode

# Create an IngestionCache instance
cache = IngestionCache()

# Create a TextNode with an embedding and metadata
node = TextNode(
    text="",
    embedding=[0.1, 0.2, 0.3],  # Example embedding
    metadata={"source": "original_text_reference"},
)

# Store the node in the cache
cache.put("node_key", [node])

# Retrieve the node from the cache
cached_nodes = cache.get("node_key")
for cached_node in cached_nodes:
    print(cached_node.get_content())
    print(cached_node.metadata)
```

By using caching, you can reduce the number of API calls and improve the performance of your solution [11][12][13][14].
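In practice, the cache is typically attached to an `IngestionPipeline` rather than driven by hand. A minimal sketch, assuming the current `llama_index.core` package layout:

```python
from llama_index.core import Document
from llama_index.core.ingestion import IngestionCache, IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

# Repeated runs over the same documents reuse cached transformation
# outputs instead of recomputing them
pipeline = IngestionPipeline(
    transformations=[SentenceSplitter()],
    cache=IngestionCache(),
)
nodes = pipeline.run(documents=[Document(text="Some text to ingest.")])
```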
**Context**

Hello team,
I'm working on a solution where I don't store the raw text in the vector database, but only a link to that text.
I noticed that standard LlamaIndex pipelines work on nodes that have raw text in them.
I would like to avoid this and work on nodes that only contain an embedding, along with metadata that holds references to the original text.
I was thinking about introducing a custom class that inherits from `BaseNode` and implements a `get_content` method that retrieves this data on the fly (using some API calls), as sketched below.
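A minimal sketch of what I have in mind, with a simple in-memory memo so repeated `get_content` calls don't refetch (the `text_url` metadata key and the HTTP fetch are just illustrative placeholders):

```python
from typing import Dict

import requests
from llama_index.core.schema import MetadataMode, TextNode

# Simple module-level memo: node_id -> fetched text
_TEXT_CACHE: Dict[str, str] = {}


class LazyTextNode(TextNode):
    """Illustrative only: stores no text, fetches it lazily on get_content().

    Assumes the original text is reachable at metadata["text_url"] (a made-up
    key); fetched text is memoised per node id, so repeated get_content()
    calls during an agent run hit the network only once.
    """

    def get_content(self, metadata_mode: MetadataMode = MetadataMode.NONE) -> str:
        if self.node_id not in _TEXT_CACHE:
            response = requests.get(self.metadata["text_url"], timeout=10)
            response.raise_for_status()
            _TEXT_CACHE[self.node_id] = response.text
        self.text = _TEXT_CACHE[self.node_id]
        return super().get_content(metadata_mode=metadata_mode)


# Usage: the first call fetches over HTTP, later calls reuse the memo
node = LazyTextNode(
    text="",
    embedding=[0.1, 0.2, 0.3],
    metadata={"text_url": "https://example.com/texts/123.txt"},
)
print(node.get_content())
```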
**Question**

Could you tell me whether you have already encountered such a problem, and whether you already have a solution for this approach?
I wonder whether such a solution would be compatible with the LlamaIndex framework.
My main concern is the performance cost if `get_content` is called multiple times while the agent is running.
Additionally, do you know of a vector DB integration which supports the proposed approach (keeping only references with the vector instead of text + vector)?