Building a solution in which nodes do not contain text data, but only a reference to this data #13755
Hey there, @m-borkowski! I'm here to help you with any questions or issues you have while waiting for a human maintainer. Let's tackle this together!

To implement a solution in LlamaIndex where nodes only contain embeddings and metadata with references to the original text, you can use the `TextNode` class with an empty `text` field. Here is an example of how you can create such nodes:

```python
from llama_index.core.schema import TextNode, MetadataMode

# Create a TextNode with an embedding and metadata, but no text
node = TextNode(
    text="",
    embedding=[0.1, 0.2, 0.3],  # Example embedding
    metadata={"source": "original_text_reference"},
)

# Access the embedding and metadata
embedding = node.get_embedding()
metadata = node.metadata

# Print the node's content (which will be empty in this case)
print(node.get_content(metadata_mode=MetadataMode.NONE))
```

Regarding vector database integration, LlamaIndex supports integration with ChromaVectorStore, LanceDBVectorStore, TiDBVectorStore, and DeepLakeVectorStore, all of which can store and query nodes based on their embeddings and metadata.
Here is an example of how you can use DeepLakeVectorStore with LlamaIndex:

```python
from llama_index.vector_stores.deeplake import DeepLakeVectorStore
from llama_index.core.schema import TextNode, NodeRelationship, RelatedNodeInfo
from llama_index.core.vector_stores.types import VectorStoreQuery

# Create a DeepLakeVectorStore instance
deeplake_store = DeepLakeVectorStore(
    dataset_path="your_dataset_path",
    overwrite=True,
)

# Create a TextNode with an embedding, metadata, and a relationship
# pointing back to the original text
node = TextNode(
    text="",
    embedding=[0.1, 0.2, 0.3],  # Example embedding
    metadata={"source": "original_text_reference"},
    relationships={
        NodeRelationship.SOURCE: RelatedNodeInfo(node_id="original_text_id")
    },
)

# Add the node to the DeepLakeVectorStore
deeplake_store.add([node])

# Query the store
query_embedding = [0.1, 0.2, 0.3]
result = deeplake_store.query(
    VectorStoreQuery(query_embedding=query_embedding, similarity_top_k=1)
)

# Access the result
for res_node in result.nodes:
    print(res_node.get_content())
    print(res_node.metadata)
```

This approach ensures that nodes contain only embeddings and metadata, with references to the original text stored in the metadata. ChromaVectorStore, LanceDBVectorStore, TiDBVectorStore, and DeepLakeVectorStore are all suitable vector database integrations for this approach [1][2][3][4][5].
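Since the nodes themselves carry no text, a retrieval step usually needs to resolve the `source` reference back to the original text after querying. Here is a minimal sketch of that hydration step; `fetch_original_text` is a hypothetical helper, and the assumption that the reference is an HTTP URL is illustrative only — substitute whatever backend actually holds your text:

```python
import requests
from llama_index.core.vector_stores.types import VectorStoreQuery


def fetch_original_text(reference: str) -> str:
    # Hypothetical helper: the reference is assumed to be an HTTP URL here;
    # swap in an S3 client, database lookup, etc. for your real backend.
    response = requests.get(reference, timeout=10)
    response.raise_for_status()
    return response.text


# Query the store, then hydrate each hit with its original text
result = deeplake_store.query(
    VectorStoreQuery(query_embedding=[0.1, 0.2, 0.3], similarity_top_k=3)
)
for node in result.nodes:
    reference = node.metadata.get("source")
    if reference is not None:
        node.set_content(fetch_original_text(reference))
```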
To ensure efficient and reliable data retrieval when integrating external APIs within a custom reader, you should combine session-level retries with backoff, request timeouts, and structured error handling. Here is an example of how these best practices can be applied to the `OpenAlexReader`:

```python
import logging
from typing import List

import requests
from requests.adapters import HTTPAdapter
from requests.packages.urllib3.util.retry import Retry

from llama_index.core.readers.base import BaseReader
from llama_index.core.schema import Document

logger = logging.getLogger(__name__)
logger.setLevel(logging.ERROR)


class OpenAlexReader(BaseReader):
    def __init__(self, email) -> None:
        self.email = email
        # Retry transient failures (rate limits, server errors) with backoff
        self.session = requests.Session()
        retries = Retry(
            total=5,
            backoff_factor=1,
            status_forcelist=[429, 500, 502, 503, 504],
        )
        self.session.mount("https://", HTTPAdapter(max_retries=retries))

    def _search_openalex(self, query, fields):
        base_url = "https://api.openalex.org/works?search="
        fields_param = f"&select={fields}"
        email_param = f"&mailto={self.email}"
        full_url = base_url + query + fields_param + email_param
        try:
            # Time out slow requests and surface HTTP/API errors explicitly
            response = self.session.get(full_url, timeout=10)
            response.raise_for_status()
            data = response.json()
            if "error" in data:
                raise ValueError(f"API returned error: {data['error']}")
            return data
        except requests.exceptions.HTTPError as http_error:
            logger.error(f"HTTP error occurred: {http_error}")
        except requests.exceptions.RequestException as request_error:
            logger.error(f"Error occurred: {request_error}")
        except ValueError as value_error:
            logger.error(value_error)
        return None

    def _fulltext_search_openalex(self, query, fields):
        base_url = "https://api.openalex.org/works?filter=fulltext.search:"
        fields_param = f"&select={fields}"
        email_param = f"&mailto={self.email}"
        full_url = base_url + query + fields_param + email_param
        try:
            response = self.session.get(full_url, timeout=10)
            response.raise_for_status()
            data = response.json()
            if "error" in data:
                raise ValueError(f"API returned error: {data['error']}")
            return data
        except requests.exceptions.HTTPError as http_error:
            logger.error(f"HTTP error occurred: {http_error}")
        except requests.exceptions.RequestException as request_error:
            logger.error(f"Error occurred: {request_error}")
        except ValueError as value_error:
            logger.error(value_error)
        return None

    def _invert_abstract(self, inv_index):
        # OpenAlex returns abstracts as an inverted index; rebuild the text
        if inv_index is not None:
            l_inv = [(w, p) for w, pos in inv_index.items() for p in pos]
            return " ".join(x[0] for x in sorted(l_inv, key=lambda x: x[1]))
        return None

    def load_data(self, query: str, full_text=False, fields=None) -> List[Document]:
        if fields is None:
            fields = "title,abstract_inverted_index,publication_year,keywords,authorships,primary_location"

        if full_text:
            works = self._fulltext_search_openalex(query, fields)
        else:
            works = self._search_openalex(query, fields)

        if works is None:
            return []

        documents = []
        for work in works["results"]:
            if work["abstract_inverted_index"] is not None:
                abstract = self._invert_abstract(work["abstract_inverted_index"])
            else:
                abstract = None

            title = work.get("title", None)
            text = None
            if abstract and title:
                text = title + " " + abstract
            elif not abstract:
                text = title

            try:
                primary_location = work["primary_location"]["source"]["display_name"]
            except (KeyError, TypeError):
                primary_location = None

            metadata = {
                "title": work.get("title", None),
                "keywords": work.get("keywords", None),
                "primary_location": primary_location,
                "publication_year": work.get("publication_year", None),
                "authorships": [
                    item["author"]["display_name"] for item in work["authorships"]
                ],
            }

            documents.append(Document(text=text, extra_info=metadata))

        return documents
```

By following these practices, you can ensure efficient and reliable data retrieval when integrating external APIs within the LlamaIndex framework.
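For example, this reader could be used like so (the email address and query are placeholders):

```python
# Provide a contact email, which OpenAlex uses for its polite pool
reader = OpenAlexReader(email="you@example.com")

# Fetch works matching a query and inspect the resulting Documents
documents = reader.load_data("retrieval augmented generation")
for doc in documents:
    print(doc.metadata["title"])
```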
LlamaIndex also handles caching of node content using the `IngestionCache` class:

```python
from llama_index.legacy.ingestion import IngestionCache
from llama_index.legacy.schema import TextNode

# Create an IngestionCache instance
cache = IngestionCache()

# Create a TextNode with an embedding and metadata
node = TextNode(
    text="",
    embedding=[0.1, 0.2, 0.3],  # Example embedding
    metadata={"source": "original_text_reference"},
)

# Store the node in the cache
cache.put("node_key", [node])

# Retrieve the node from the cache
cached_nodes = cache.get("node_key")
for cached_node in cached_nodes:
    print(cached_node.get_content())
    print(cached_node.metadata)
```

By using caching, you can reduce the number of API calls and improve the performance of your solution [11][12][13][14].
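In practice, the cache is typically attached to an `IngestionPipeline` rather than driven by hand. A minimal sketch, assuming the current `llama_index.core` package layout:

```python
from llama_index.core import Document
from llama_index.core.ingestion import IngestionCache, IngestionPipeline
from llama_index.core.node_parser import SentenceSplitter

# Repeated runs over the same documents reuse cached transformation
# outputs instead of recomputing them
pipeline = IngestionPipeline(
    transformations=[SentenceSplitter()],
    cache=IngestionCache(),
)
nodes = pipeline.run(documents=[Document(text="Some text to ingest.")])
```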
**Context**

Hello team,
I'm working on a solution where I don't store the raw text in the vector database, but only a link to that text.
I noticed that standard LlamaIndex pipelines work on nodes that have raw text in them.
I would like to avoid this and work on nodes that only contain an embedding, along with metadata that holds references to the original text.
I was thinking about introducing a custom class that inherits from `BaseNode` and implements a `get_content` method that retrieves this data on the fly (using some API calls), as sketched below.
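A minimal sketch of what I have in mind, with a simple in-memory memo so repeated `get_content` calls don't refetch (the `text_url` metadata key and the HTTP fetch are just illustrative placeholders):

```python
from typing import Dict

import requests
from llama_index.core.schema import MetadataMode, TextNode

# Simple module-level memo: node_id -> fetched text
_TEXT_CACHE: Dict[str, str] = {}


class LazyTextNode(TextNode):
    """Illustrative only: stores no text, fetches it lazily on get_content().

    Assumes the original text is reachable at metadata["text_url"] (a made-up
    key); fetched text is memoised per node id, so repeated get_content()
    calls during an agent run hit the network only once.
    """

    def get_content(self, metadata_mode: MetadataMode = MetadataMode.NONE) -> str:
        if self.node_id not in _TEXT_CACHE:
            response = requests.get(self.metadata["text_url"], timeout=10)
            response.raise_for_status()
            _TEXT_CACHE[self.node_id] = response.text
        self.text = _TEXT_CACHE[self.node_id]
        return super().get_content(metadata_mode=metadata_mode)


# Usage: the first call fetches over HTTP, later calls reuse the memo
node = LazyTextNode(
    text="",
    embedding=[0.1, 0.2, 0.3],
    metadata={"text_url": "https://example.com/texts/123.txt"},
)
print(node.get_content())
```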
**Question**

Could you tell me whether you have already encountered such a problem, and whether you already have a solution for this approach?
I wonder whether such a solution would be compatible with the LlamaIndex framework.
My main concern is the performance cost if `get_content` is called multiple times while the agent is running.
Additionally, do you know of a vector DB integration which supports the proposed approach (keeping only references with the vector instead of text + vector)?