Outputting multiple records with structured_llm? #25736

owenffff · 2024-08-25T17:16:11Z

Checked other resources

I added a very descriptive title to this question.
I searched the LangChain documentation with the integrated search.
I used the GitHub search to find a similar question and didn't find it.

Commit to Help

I commit to help with one of those options 👆

Example Code

import os
from typing import Optional

import pandas as pd
from langchain.prompts import PromptTemplate
from pydantic import BaseModel, Field


class Quotation(BaseModel):
    date: str = Field(..., description="Date of the quotation")
    quote_number: str = Field(..., description="Unique identifier for the quote")
    currency: str = Field(..., description="Currency used in the quotation")
    quote_name: str = Field(..., description="Name or title of the quotation")
    description: str = Field(..., description="Description of the service")
    quantity: int = Field(..., description="Quantity of the item")
    one_time_charge: Optional[float] = Field(None, description="One-time charge of the item")
    monthly_charge: Optional[float] = Field(None, description="Monthly charge of the item")
    monthly_recurring_service_tax: Optional[float] = Field(None, description="Monthly recurring service tax")
    total_price_otc_mrc: Optional[float] = Field(None, description="Total price (One-Time Charge + Monthly Recurring Charge)")


quotation_prompt_template = PromptTemplate(
    input_variables=["content"],
    template="""
    Given the following markdown table content, extract and list only the selected service items, including their quantities, any associated charges, and additional quotation details. Ignore rows where the quantity or charges are "N/A" or "Not Applicable."

    Content:

    {content}

    """
)



# `llm` is a chat model instance created elsewhere (its definition is not shown in this post)
structured_llm = llm.with_structured_output(Quotation)

all_results_quo = []
folder_path = r"C:\Users\oliang005\Downloads\Quotation_output_all_tables"

for filename in os.listdir(folder_path):
    if filename.endswith(".txt"):
        file_path = os.path.join(folder_path, filename)
        
        with open(file_path, "r") as file:
            content = file.read()
        
        quo_prompt = quotation_prompt_template.format(content=content)
        
        quo_output = structured_llm.invoke(quo_prompt)
        
        output = quo_output.dict()
        output = {'filename': filename[:-4] + ".pdf" , **output}
        all_results_quo.append(output)


df_quo = pd.DataFrame(all_results_quo)
df_quo

Description

Hi,

I'm currently experimenting with structured_llm, using it to extract information from a number of quotation contracts.

When I use structured_llm with a Pydantic model, I find that if a single document contains two or more records, it does not output all of them. By default, structured_llm with a Pydantic model returns a single instance of that model, which limits the output to one record per query.

Are there any workarounds that let structured_llm capture more than one record from a single document? Thanks!

System Info

langchain version = 0.2.12

@owenffff · 2024-08-25T17:18:28Z

dosubot[bot]
bot Aug 25, 2024

Hey there, @owenffff! I'm here to assist you with any bugs, questions, or contributions you need help with. Let's tackle this together!

To capture and output multiple records from a single document, wrap the record model in a container model whose field is a list of records, then pass the container model to the with_structured_output method. Here is an example of how to do this:

from typing import List, Optional
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.pydantic_v1 import BaseModel, Field

class Quotation(BaseModel):
    date: str = Field(..., description="Date of the quotation")
    quote_number: str = Field(..., description="Unique identifier for the quote")
    currency: str = Field(..., description="Currency used in the quotation")
    quote_name: str = Field(..., description="Name or title of the quotation")
    description: str = Field(..., description="Description of the service")
    quantity: int = Field(..., description="Quantity of the item")
    one_time_charge: Optional[float] = Field(None, description="One-time charge of the item")
    monthly_charge: Optional[float] = Field(None, description="Monthly charge of the item")
    monthly_recurring_service_tax: Optional[float] = Field(None, description="Monthly recurring service tax")
    total_price_otc_mrc: Optional[float] = Field(None, description="Total price (One-Time Charge + Monthly Recurring Charge)")

class Document(BaseModel):
    quotations: List[Quotation] = Field(..., description="List of quotations mentioned in the document")

llm = ChatOpenAI(model="gpt-3.5-turbo-0125", temperature=0)
prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are an extraction algorithm. Please extract every possible instance of quotation information."),
        ("human", "{input}")
    ]
)
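structured_llm = llm.with_structured_output(Document)
# Pipe the prompt into the structured model so that {input} is filled in before the call
chain = prompt | structured_llm

result = chain.invoke({"input": "Your document content here"})
print(result)
# -> Document(quotations=[Quotation(...), Quotation(...), ...])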

This example demonstrates how to define a Pydantic model to capture multiple records (in this case, multiple quotations) from a single document and use LangChain's with_structured_output to extract and output the structured data ^[1]^[2]^[3].
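Because the wrapper model returns a list, the loop from the question would need to add one DataFrame row per quotation instead of one row per file. The following is a minimal sketch of that step, not part of the original answer. It assumes the model output has already been converted with .dict() into a dict of the shape {"quotations": [...]}; the helper name flatten_quotations and the sample values are hypothetical.

```python
from typing import Any, Dict, List


def flatten_quotations(filename: str, output: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn one wrapper-model dict into one row per quotation.

    `output` is assumed to look like {"quotations": [{...}, {...}]}, i.e. the
    result of calling .dict() on the Document wrapper model.
    """
    # Mirror the question's convention of reporting the source PDF name
    pdf_name = filename[:-4] + ".pdf" if filename.endswith(".txt") else filename
    return [{"filename": pdf_name, **record} for record in output.get("quotations", [])]


# Hypothetical output for a file that contains two quotation rows
sample_output = {
    "quotations": [
        {"quote_number": "Q-001", "description": "Internet access", "quantity": 2},
        {"quote_number": "Q-001", "description": "Managed router", "quantity": 1},
    ]
}

all_results_quo: List[Dict[str, Any]] = []
# extend (not append) so each quotation becomes its own row
all_results_quo.extend(flatten_quotations("quote_a.txt", sample_output))
```

In the original loop, this would likely replace the all_results_quo.append(output) line with all_results_quo.extend(flatten_quotations(filename, quo_output.dict())), after which pd.DataFrame(all_results_quo) should produce one row per extracted quotation.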
