Feature Request: Add JSON Support #1816

LouisLiuNova · 2025-01-14T13:22:52Z

Currently, the project supports plaintext well for embedding and retrieval. However, adding JSON support would significantly enhance its usability by enabling users to integrate the RAG framework more seamlessly into modern workflows where JSON is the standard format for data interchange. Will LightRAG add support for structured data like JSON?

nickjfrench · 2025-01-21T16:08:27Z

Interested to learn more about this as I was planning to insert JSON. How does it behave currently when you pass a JSON file in?

LouisLiuNova · 2025-01-29T12:02:39Z

> Interested to learn more about this as I was planning to insert JSON. How does it behave currently when you pass a JSON file in?

I plan to use this for storing JSON-formatted data, but I haven’t extracted the semantic data from my dataset yet. Once I complete that, I’ll provide more detailed information.

nickjfrench · 2025-01-30T13:59:03Z

> I plan to use this for storing JSON-formatted data, but I haven’t extracted the semantic data from my dataset yet. Once I complete that, I’ll provide more detailed information.

I tried uploading a somewhat large JSON structure; the result was passable but not great. I think the lack of full sentences and connective "fluff" meant the LightRAG processing calls could not understand the context of the information. For example, the nesting of objects should almost be the edges in the graph (at least that's how I see it in my data structure), but it failed to output it like that.
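As a minimal sketch of the "nesting as edges" idea, the function below walks a parsed JSON value and yields one triple per nesting step. The name `nesting_to_edges`, the `has`/`contains` relation labels, and the sample record are all hypothetical; nothing here is LightRAG's own extraction logic.

```python
from typing import Any, Iterator, Tuple

Edge = Tuple[str, str, str]  # (parent, relation, child)


def nesting_to_edges(node: Any, parent: str = "root") -> Iterator[Edge]:
    """Yield one (parent, relation, child) triple per nesting step or leaf value."""
    if isinstance(node, dict):
        for key, value in node.items():
            if isinstance(value, (dict, list)):
                # A nested container becomes its own node, linked to its parent.
                child = f"{parent}.{key}"
                yield (parent, "has", child)
                yield from nesting_to_edges(value, child)
            else:
                # A scalar becomes a leaf; the key is used as the relation name.
                yield (parent, key, str(value))
    elif isinstance(node, list):
        for index, item in enumerate(node):
            if isinstance(item, (dict, list)):
                child = f"{parent}[{index}]"
                yield (parent, "has", child)
                yield from nesting_to_edges(item, child)
            else:
                yield (parent, "contains", str(item))


# Hypothetical sample data, for illustration only.
sample = {"owner": {"name": "Ada", "pets": ["cat", "dog"]}}
for edge in nesting_to_edges(sample):
    print(edge)
```

Triples like these could be compared against the graph LightRAG actually produces, to see which nesting relationships were lost.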

LouisLiuNova · 2025-02-01T07:09:08Z

> I tried uploading a somewhat large JSON structure; the result was passable but not great. I think the lack of full sentences and connective "fluff" meant the LightRAG processing calls could not understand the context of the information. For example, the nesting of objects should almost be the edges in the graph (at least that's how I see it in my data structure), but it failed to output it like that.

I anticipated this scenario, as LightRAG is primarily designed for processing and storing natural language. My plan is to rewrite the JSON data as complete sentences for storage, rather than inputting the structured data directly.

nickjfrench · 2025-02-03T18:50:37Z

> I anticipated this scenario, as LightRAG is primarily designed for natural language processing and storage. My plan is to refactor the JSON data into a complete sentence for storage, rather than directly inputting structured data.

If you don't mind sharing what you end up doing, I'd love to see what you found was the best method.

I had brainstormed doing a synthesis pass to create those single-fact sentences ("owner has fact"). But I wondered whether it would be best practice to pass one sentence in at a time, or whether I could pass the entire converted JSON-as-sentences text in at once, and how that would link to the embedded unstructured data chunks.
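One way such a synthesis pass could look, as a rough sketch: walk the parsed JSON and emit one "owner has fact" sentence per leaf value. The function name `fact_sentences`, the sentence templates, and the sample record are assumptions made for illustration, not something taken from LightRAG.

```python
from typing import Any, Iterator


def fact_sentences(node: Any, subject: str) -> Iterator[str]:
    """Yield one short 'subject has fact' sentence per leaf value."""
    if isinstance(node, dict):
        for key, value in node.items():
            label = key.replace("_", " ")
            if isinstance(value, (dict, list)):
                # Nested containers get a more specific subject.
                yield from fact_sentences(value, f"the {label} of {subject}")
            else:
                yield f"{subject} has {label}: {value}."
    elif isinstance(node, list):
        for item in node:
            if isinstance(item, (dict, list)):
                yield from fact_sentences(item, subject)
            else:
                yield f"{subject} has item: {item}."


# Hypothetical sample record, for illustration only.
record = {"name": "Ada", "address": {"city": "London"}, "tags": ["admin", "dev"]}
sentences = list(fact_sentences(record, "the owner"))
print("\n".join(sentences))
```

The resulting list can either be joined into one document or kept as separate strings, which is exactly the choice the question above is about.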

LouisLiuNova · 2025-02-05T15:03:50Z

I plan to implement this solution and will evaluate its outcomes in the next phase.

LouisLiuNova · 2025-02-11T15:36:40Z

> If you don't mind sharing what you end up doing, I'd love to see what you found was the best method.
>
> I had brainstormed doing a synthesis pass to create those single-fact sentences ("owner has fact"). But I wondered whether it would be best practice to pass one sentence in at a time, or whether I could pass the entire converted JSON-as-sentences text in at once, and how that would link to the embedded unstructured data chunks.

Thank you for sharing your thoughts! I'd be happy to share my approach and findings so far, though I am still working through some challenges. Here's an outline of what I have done and the issues I encountered:

Currently, I have designed a template to convert my custom pydantic.BaseModel into a narrative string for insertion into the LightRAG framework. This approach is intended to facilitate entity extraction and allow for structured data to be processed more effectively. While the initial step of converting the model to a narrative string was successful, I noticed that the extracted entities were incomplete and did not meet my expectations. As a result, I have raised an issue to seek further insights from the community: #749

Below is the template I used to convert a CVE descriptor model into a paragraph:

def convert_model_to_string(model: RAGModel) -> str:
    """
    Convert a RAGModel object to a narrative string for insertion into LightRAG.
    """
    # Join the function descriptions outside the f-string: before Python 3.12,
    # an f-string expression cannot contain a backslash such as "\n".
    functions = "\n".join(
        f"{func.function_name}: {func.general_purpose}"
        for func in model.desc.funcs_desc.functional_desc
    )
    cve = model.cve_meta.cve_number
    return f"""
    {cve} is a vulnerability with title {model.cve_meta.title}. There are several functions related to this vulnerability:
    {functions}
    {cve} is related to the following CWEs:{model.cve_meta.weaknesses}
    {cve} is caused by the following flaws:{model.desc.sec_desc.description}: {model.desc.sec_desc.vulnerability_cause_details}
    {cve} is fixed by deploying the following patch methods:{model.desc.sec_desc.patch_details}
    """

While the narrative string generation works as intended, the entity extraction results are not complete. For example, key terms such as enhance_image, channel_map, and PixelInfo were not extracted as entities, which diminishes the utility of the knowledge graph. Given this, I am uncertain if the issue lies in the configuration of the RAG pipeline, the entity recognition logic, or something else.
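To make "incomplete" measurable, a small check can compare the terms expected in the graph against the entity names that were actually extracted. This is only a sketch: `missing_entities` is a hypothetical helper, the `extracted` list stands in for whatever the pipeline returned, and the normalization (trimming quotes, ignoring case) is an assumption about how names might differ.

```python
from typing import Iterable, List


def missing_entities(expected: Iterable[str], extracted: Iterable[str]) -> List[str]:
    """Return the expected terms that do not appear among the extracted entity names."""
    # Normalize extracted names so quoting and case differences do not count as misses.
    normalized = {name.strip().strip('"').casefold() for name in extracted}
    return [term for term in expected if term.casefold() not in normalized]


# Terms named in the comment above; the extracted list is made up for illustration.
expected = ["enhance_image", "channel_map", "PixelInfo"]
extracted = ['"ENHANCE_IMAGE"', "ImageMagick"]
print(missing_entities(expected, extracted))
```

Running the same check after each prompt or configuration change gives a simple way to tell whether coverage is improving.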

Questions I Am Exploring:

1. Should single-fact sentences (e.g., "owner has fact") be generated and passed one at a time, each as an isolated document or chunk, or is it better to pass the entire narrative string as a single input? In other words: single insertion or incremental insertion?
2. How can I ensure that all relevant terms are captured during the entity extraction step? Are there specific optimizations or configurations I should consider?
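The two insertion strategies in the first question can be sketched side by side so they are easy to compare on the same data. The `insert` parameter is an assumption: any callable that accepts one string, for example a bound insert method of whichever RAG instance is under test. Both helper names are hypothetical.

```python
from typing import Callable, Iterable, List


def insert_as_one_document(insert: Callable[[str], None], sentences: Iterable[str]) -> int:
    """Single insertion: join every sentence into one document and insert it once."""
    insert("\n".join(sentences))
    return 1


def insert_one_fact_at_a_time(insert: Callable[[str], None], sentences: Iterable[str]) -> int:
    """Incremental insertion: insert each sentence as its own document."""
    count = 0
    for sentence in sentences:
        insert(sentence)
        count += 1
    return count


# A list stands in for the real index so the sketch runs without any RAG setup.
store: List[str] = []
facts = ["the owner has name: Ada.", "the owner has role: admin."]
print(insert_as_one_document(store.append, facts), "document inserted")
print(insert_one_fact_at_a_time(store.append, facts), "documents inserted")
```

Neither strategy is claimed to be better here; the sketch only makes it cheap to run both against the same data and compare the extracted entities.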

I hope this provides some clarity on my approach. I would greatly appreciate any insights or recommendations from collaborators to address these challenges. If you'd like, I can share more details as I iterate further, and we can work on this together.

nickjfrench · 2025-02-11T19:58:47Z

Thank you for this. That's an interesting approach to the narrative string. Have you experimented with having an LLM generate it dynamically from the output of the model_dump method on the BaseModel class? My data structure is still going to change often, so I want to avoid declaring it manually like that.

> While the initial step of converting the model to a narrative string was successful, I noticed that the extracted entities were incomplete and did not meet my expectations.

I found similar results with LightRAG. Have you tried passing the same input into an alternative like GraphRAG (it has had a lot of optimization updates since LightRAG's release)?

> Questions I Am Exploring:

These are two questions I would also like to understand. I think the current approach in LightRAG works well for large bodies of text, like the book "A Christmas Carol", but when I'm passing in highly tailored data incrementally, would I be better off just getting an LLM to generate graph queries without the RAG index pipeline?

Maybe an external (from LightRAG) evaluator or guardrail would work better, but that doesn't seem like the most efficient approach.

LouisLiuNova · 2025-02-12T03:56:23Z

> Thank you for this. That's an interesting approach to the narrative string. Have you experimented with having an LLM generate it dynamically from the output of the model_dump method on the BaseModel class? My data structure is still going to change often, so I want to avoid declaring it manually like that.

That's an interesting point. In my initial implementation, I manually created a template and populated it with the relevant fields. If your data models are highly dynamic, leveraging an LLM could indeed be a practical solution. Specifically, you could send a request to the LLM to generate a narrative paragraph based on the model_dump output. However, do note that the LLM's responses can vary significantly, which might lead to inconsistencies in how the graph is generated or interpreted.
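A rough sketch of that LLM-based route, under stated assumptions: `record` is a plain dict such as the output of `model_dump()`, and `complete` is a hypothetical callable that sends a prompt to whichever LLM is in use and returns its text. Serializing with sorted keys keeps the prompt identical for identical data, which removes one source of variation, although the model's output can still differ between calls.

```python
import json
from typing import Any, Callable, Mapping

# Hypothetical instruction text; adjust it to the data domain.
PROMPT_TEMPLATE = (
    "Rewrite the following JSON record as a short narrative paragraph.\n"
    "Mention every field by name and do not add facts that are not in the record.\n\n"
    "Record:\n{payload}"
)


def build_narrative_prompt(record: Mapping[str, Any]) -> str:
    """Serialize the record deterministically and embed it in the instruction."""
    payload = json.dumps(record, indent=2, sort_keys=True, default=str)
    return PROMPT_TEMPLATE.format(payload=payload)


def narrate(record: Mapping[str, Any], complete: Callable[[str], str]) -> str:
    """Ask the LLM (via the caller-supplied `complete`) for a narrative paragraph."""
    return complete(build_narrative_prompt(record)).strip()


# A stub stands in for the real LLM call so the sketch runs offline.
def fake_llm(prompt: str) -> str:
    return "  (narrative paragraph returned by the model)  "


print(narrate({"title": "Example record", "severity": "high"}, fake_llm))
```

Because the template is generic, it does not need to change when fields are added to or removed from the model.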

> I found similar results with LightRAG. Have you tried passing the same input into an alternative like GraphRAG (it has had a lot of optimizing updates since LightRAG's release)?

I’ll take another look at the latest version of GraphRAG. I did try it previously, but the results didn’t quite meet my expectations, which is why I shifted to this approach 😊. That said, I’ll revisit GraphRAG to see if the updates address the earlier issues.

> Would I be better off just getting an LLM to generate Graph queries without the RAG index pipeline?

That's a valid idea. It might be worth exploring whether LLM-generated Graph queries can work effectively without relying on the RAG index pipeline. However, before diving into this, I’d like to confirm with the LightRAG collaborators whether my results are expected and correct. If not, I’ll consider alternative methods or approaches to refine the workflow.

bzImage · 2025-03-10T16:14:26Z

I also have a bunch of technical cybersecurity data that I need to process, and I have it in structured JSON format.

As stated in #962.

> I’ll take another look at the latest version of GraphRAG. I did try it previously, but the results didn’t quite meet my expectations, which is why I shifted to this approach 😊. That said, I’ll revisit GraphRAG to see if the updates address the earlier issues.

Did you make any progress with GraphRAG? I tried it months ago, but it was buggy and seemed to work only with a story-like text corpus (it is tailored to process the book, and it breaks if you feed it anything else).

bzImage · 2025-03-12T18:26:51Z

Would it work if we changed the "entity_extraction" prompt in prompts.py to reflect that it is being fed a structured JSON document? Something like this:

Source JSON data:

{
  "text": "bla bla bla bla..",
  "entities": ["joe", "martha", "pedro"]
}

And modify the entity extraction prompt to something like:

"""You will be given a JSON structure in which the "text" key contains the relevant document, from which you will extract the entities listed under the "entities" key.

"""

Would that work?
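A minimal sketch of the proposal above, kept independent of LightRAG's internals: it builds the JSON payload, wraps it in a hinted prompt, and includes a plain substring check for which listed entities actually occur in the text. The prompt wording and both function names are hypothetical; how such a prompt would be wired into the entity extraction step is not shown and would need to be checked against the library's source.

```python
import json
from typing import List

# Hypothetical prompt wording, modeled on the proposal above.
ENTITY_HINT_PROMPT = (
    'You will be given a JSON object. The "text" key holds the document, and the '
    '"entities" key lists the entities to extract from it.\n'
    "Only report entities from that list that actually appear in the text.\n\n"
    "Input:\n{payload}"
)


def build_hinted_prompt(text: str, entities: List[str]) -> str:
    """Embed the document and its entity hints in the prompt as one JSON object."""
    payload = json.dumps({"text": text, "entities": entities}, ensure_ascii=False)
    return ENTITY_HINT_PROMPT.format(payload=payload)


def entities_present(text: str, entities: List[str]) -> List[str]:
    """Cheap sanity check: which hinted entities occur in the text at all?"""
    lowered = text.casefold()
    return [entity for entity in entities if entity.casefold() in lowered]


doc = "joe met martha at the office"
hints = ["joe", "martha", "pedro"]
print(build_hinted_prompt(doc, hints))
print(entities_present(doc, hints))
```

The substring check is useful as a baseline: any hinted entity it finds but the LLM extraction misses points to a prompt or pipeline problem, not to the data.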

LouisLiuNova · 2025-03-14T14:06:16Z

> I also have a bunch of technical cybersecurity data that I need to process, and I have it in structured JSON format.
>
> As stated in #962.
>
> > I’ll take another look at the latest version of GraphRAG. I did try it previously, but the results didn’t quite meet my expectations, which is why I shifted to this approach 😊. That said, I’ll revisit GraphRAG to see if the updates address the earlier issues.
>
> Did you make any progress with GraphRAG? I tried it months ago, but it was buggy and seemed to work only with a story-like text corpus (it is tailored to process the book, and it breaks if you feed it anything else).

Unfortunately, it is still on my to-do list. I will explore an alternative solution in a few days and may share some thoughts then.
