
John Snow Labs Releases LangTest 2.6.0: De-biasing Data Augmentation, Structured Output Evaluation, Med Halt Confidence Tests, Expanded QA & Summarization Support, and Enhanced Security

Released by @chakravarthik27 on 11 Mar 05:24 · commit c528cba

📢 Highlights

We are excited to introduce LangTest 2.6.0, bringing you a suite of improvements designed to streamline model evaluation and enhance overall performance:

  • 🛠 De-biasing Data Augmentation:
    We’ve integrated de-biasing techniques into our data augmentation process, ensuring more equitable and representative model assessments.

  • 🔄 Evaluation with Structured Outputs:
    LangTest now supports structured output APIs for OpenAI, Ollama, and Azure-OpenAI, offering greater flexibility and precision when processing model responses.

  • 🏥 Confidence Testing with Med Halt Tests:
    Introducing med halt tests for confidence evaluation, enabling more robust insights into your LLMs’ reliability under diverse conditions.

  • 📖 Expanded Task Support for JSL LLM Models:
    QA and Summarization tasks are now fully supported for JSL LLM models, enhancing their capabilities for real-world applications.

  • 🔒Security Enhancements:
    Critical vulnerabilities and security issues have been addressed, reinforcing LangTest's overall stability and safety.

  • 🐛 Resolved Bugs:
    We’ve fixed issues with templatic augmentation to ensure consistent, accurate, and reliable outputs across your workflows.

🔥 Key Enhancements

🛠 De-biasing Data Augmentation

Open In Colab

We’ve integrated de-biasing techniques into our data augmentation process, ensuring more equitable and representative model assessments.

Key Features:

  • Eliminates biases in training data to improve model fairness.
  • Enhances diversity in augmented datasets for better generalization.

How it works:
Load the dataset:

from datasets import load_dataset

dataset = load_dataset("RealTimeData/bbc_news_alltime", "2024-12", split="train")

# take a 500-row sample of the dataset
df = dataset.to_pandas()
sample = df.sample(500)

# keep only shorter articles to avoid context-overflow errors
sample = sample[sample['content'].apply(lambda x: len(x) < 1000)]
# set up the de-biasing processor
import pandas as pd

from langtest.augmentation.debias import DebiasTextProcessing

processing = DebiasTextProcessing(
    model="gpt-4o-mini",
    hub="openai",
    model_kwargs={
        "temperature": 0,
    },
)

processing.initialize(
    input_dataset=sample,
    output_dataset=pd.DataFrame({}),
    text_column="content",
)

output, reason = processing.apply_bias_correction(bias_tolerance_level=2)

output.head()
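
Since output.head() above behaves like a pandas DataFrame, the corrected data can be saved for reuse and the accompanying reason object inspected. This is a minimal sketch; the exact structure of reason is an assumption rather than a documented return type.

# minimal sketch: persist the de-biased sample for later augmentation runs
output.to_csv("debiased_bbc_sample.csv", index=False)

# `reason` explains which rows were flagged and why; its exact structure
# is an assumption here, so we simply print it for inspection
print(reason)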


🔄Evaluation with Structured Outputs

Open In Colab

LangTest now supports structured output APIs for OpenAI, Ollama, and Azure-OpenAI, offering greater flexibility and precision when processing model responses.

Key Features:

  • Supports structured LLM outputs for better parsing and analysis.
  • Integrates effortlessly with OpenAI, Ollama, and Azure-OpenAI.

How it works:

Pydantic Model Setup:

from pydantic import BaseModel
from langtest import Harness

class Answer(BaseModel):
    
    class Rationale(BaseModel):
        """Explanation for an answer. why the answer is correct or incorrect with a valid reasons, a score, and a summary."""
        reason: str
        score: float
        summary: str

    answer: bool
    rationale: Rationale

    def __eq__(self, other: 'Answer') -> bool:
        return self.answer == other.answer
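
Because Answer is a plain Pydantic model, you can check what a conforming structured response looks like by validating a hand-written JSON payload against it before wiring it into the harness; the JSON below is a made-up example (Pydantic v2 API), not actual model output.

# illustrative only: validate a hypothetical structured response against the schema
raw = '{"answer": true, "rationale": {"reason": "The passage states it explicitly.", "score": 0.9, "summary": "Directly supported."}}'
parsed = Answer.model_validate_json(raw)
print(parsed.answer, parsed.rationale.score)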

Harness Setup:

harness = Harness(
    task='question-answering',
    model={
        'model': 'llama3.1',
        'hub': 'ollama',
        'type': 'chat',
        'output_schema': Answer,
    },
    data={
        "data_source": "BoolQ",
        "split": "test-tiny",
    },
    config={
        "tests": {
            "defaults": {
                "min_pass_rate": 0.5,
            },
            "robustness": {
                "uppercase": {
                    "min_pass_rate": 0.8,
                },
                "add_ocr_typo": {
                    "min_pass_rate": 0.8,
                },
                "add_tabs": {
                    "min_pass_rate": 0.8,
                }
            }
        }
    }
)

harness.generate().run().report()
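
The same schema-driven setup works with the other supported hubs. As a minimal sketch (the model name, pass rates, and trimmed config are illustrative assumptions), swapping the Ollama model for the OpenAI hub looks like this, provided OPENAI_API_KEY is set in the environment:

# hypothetical variant: same structured-output schema, OpenAI hub instead of Ollama
harness_openai = Harness(
    task='question-answering',
    model={
        'model': 'gpt-4o-mini',   # illustrative model choice
        'hub': 'openai',
        'type': 'chat',
        'output_schema': Answer,
    },
    data={"data_source": "BoolQ", "split": "test-tiny"},
    config={
        "tests": {
            "defaults": {"min_pass_rate": 0.5},
            "robustness": {"uppercase": {"min_pass_rate": 0.8}},
        }
    },
)

harness_openai.generate().run().report()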


🏥 Confidence Testing with Med Halt Tests

Open In Colab

Gain deeper insights into your LLMs’ robustness and reliability under diverse conditions with our upgraded Med Halt tests. This release focuses on refining confidence assessments in LLMs.

Key Features:

  • Identifies and prevents overconfident, incorrect responses in critical scenarios.
  • Strengthens confidence evaluation through the targeted tests below (an illustrative perturbation sketch follows the list).

Available tests:

  • FCT (False Confidence Test): Detects when a model is overly confident in incorrect answers by swapping answer options and including a "None of the Above" option.
  • FQT (Fake Questions Test): Evaluates the model's ability to handle questions presented out of their original context by exchanging contextual information.
  • NOTA (None of the Above) Test: Assesses whether the model can recognize insufficient information by replacing the correct answer with a "None of the Above" option.
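
To make the option perturbation concrete, here is a purely illustrative sketch in plain Python of what a NOTA-style rewrite of a multiple-choice item looks like; this is not LangTest's internal representation, just the idea behind the test.

# illustrative only: not LangTest internals
original = {
    "question": "Which vitamin deficiency causes scurvy?",
    "options": {"A": "Vitamin A", "B": "Vitamin C", "C": "Vitamin D", "D": "Vitamin K"},
    "answer": "B",
}

# NOTA perturbation: the correct option is replaced with "None of the Above",
# so a well-calibrated model should now select that option instead
nota_item = {
    "question": original["question"],
    "options": {**original["options"], original["answer"]: "None of the Above"},
    "expected": original["answer"],
}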

How it works:

from langtest import Harness 


harness = Harness(
    task="question-answering",
    model={
        "model": "phi4-mini",
        "hub": "ollama",
        "type": "chat"
        # "model": "gpt-4o-mini",
        # "hub": "openai",
    },
    data={
        "data_source": "MMLU",
        "split": "clinical",
    },
    config={
        "model_parameters": {
            "user_prompt": (
                    "You are a knowledgeable AI Assistant. Please provide the best possible choice (A or B or C or D) from the options"
                    "to the following MCQ question with the given options. Note: only provide the choice and don't given any explanations\n"
                    "Question:\n{question}\n"
                    "Options:\n{options}\n"
                    "Correct Choice (A or B or C or D): "
                    
            )
        },
        "tests": {
            
            "defaults": {
                "min_pass_rate": 0.75,

            },
            "clinical": {
                "nota": {"min_pass_rate": 0.75},
            }
        },
        "evaluation": {
            "metric": "llm_eval",
            "model": "gpt-4o-mini",
            "hub": "openai",
        }
    }
)

Generate and Execute the test cases:

harness.generate().run()

Inspect the generated results and the report:

harness.generated_results()


harness.report()


📖 QA and Summarization Support for JSL LLM Models

Open In Colab

JSL LLM models now support both Question Answering (QA) and Summarization tasks, making testing more practical for real-world scenarios.

Key Features:

  • Tests the model's ability to deliver clear and accurate answers.
  • Evaluates the model's skill in creating concise summaries from longer texts.

How it works:

Pipeline Setup:

# Spark NLP for Healthcare setup; the exact import paths below are an
# assumption and may vary with your sparknlp_jsl version
import os

from pyspark.ml import Pipeline
from sparknlp.base import MultiDocumentAssembler
from sparknlp_jsl.annotator import MedicalQuestionAnswering

document_assembler = MultiDocumentAssembler()\
    .setInputCols("question", "context")\
    .setOutputCols("document_question", "document_context")

med_qa = MedicalQuestionAnswering.pretrained("clinical_notes_qa_base_onnx", "en", "clinical/models")\
    .setInputCols(["document_question", "document_context"])\
    .setCustomPrompt(("You are an AI bot specializing in providing accurate and concise answers to questions"
                      ". You will be presented with a medical question and multiple-choice answer options."
                      " Your task is to choose the correct answer.\nQuestion: {question}\nOptions: {options}\n Answer:"))\
    .setOutputCol("answer")

pipeline = Pipeline(stages=[document_assembler, med_qa])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

# the llm_eval metric below uses an OpenAI judge model
os.environ["OPENAI_API_KEY"] = "<API KEY>"

Harness Setup:

from langtest import Harness 

harness = Harness(
    task="question-answering",
    model={
        "model": model,
        "hub": "johnsnowlabs",
    },
    data={
        "data_source": "PubMedQA",
        "subset": "pqaa",
        "split": "test",
    },
    config={  
        "tests": {
            "defaults": {
                "min_pass_rate": 0.5,
            },
            "robustness": {
                "uppercase": {
                    "min_pass_rate": 0.5,
                },
                "lowercase": {
                    "min_pass_rate": 0.5,
                },
                "add_ocr_typo": {
                    "min_pass_rate": 0.5,
                },
                "add_slangs": {
                    "min_pass_rate": 0.5,
                }
            }
        },
        "evaluation": {
            "metric": "llm_eval",
            "model": "gpt-4o-mini",
            "hub": "openai"
        }
    }
)

Generate, run, and report on the test cases:

harness.generate().run().report()

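Summarization is configured the same way. Below is a minimal sketch, where summarizer_model is a placeholder for a fitted JSL pipeline that produces summaries, and the XSum dataset and split names are illustrative assumptions:

# minimal sketch for the summarization task; `summarizer_model` is a placeholder
summ_harness = Harness(
    task="summarization",
    model={
        "model": summarizer_model,
        "hub": "johnsnowlabs",
    },
    data={
        "data_source": "XSum",
        "split": "test-tiny",
    },
    config={
        "tests": {
            "defaults": {"min_pass_rate": 0.5},
            "robustness": {
                "uppercase": {"min_pass_rate": 0.5},
                "add_ocr_typo": {"min_pass_rate": 0.5},
            }
        },
        "evaluation": {
            "metric": "llm_eval",
            "model": "gpt-4o-mini",
            "hub": "openai",
        }
    }
)

summ_harness.generate().run().report()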

🔒 Security Enhancements

Critical vulnerabilities and security issues have been resolved, reinforcing the overall stability and safety of our platform. In this update, we upgraded dependencies to fix vulnerabilities, ensuring a more secure and reliable environment for our users.

🐛 Fixes

⚡ Enhancements

What's Changed

Full Changelog: 2.5.0...2.6.0