
John Snow Labs Releases LangTest 2.6.0: De-biasing Data Augmentation, Structured Output Evaluation, Med Halt Confidence Tests, Expanded QA & Summarization Support, and Enhanced Security

Released by @chakravarthik27 on 11 Mar 05:24 · commit c528cba

📢 Highlights

We are excited to introduce LangTest 2.6.0, bringing you a suite of improvements designed to streamline model evaluation and enhance overall performance:

  • 🛠 De-biasing Data Augmentation:
    We’ve integrated de-biasing techniques into our data augmentation process, ensuring more equitable and representative model assessments.

  • 🔄 Evaluation with Structured Outputs:
    LangTest now supports structured output APIs for OpenAI, Ollama, and Azure-OpenAI, offering greater flexibility and precision when processing model responses.

  • 🏥 Confidence Testing with Med Halt Tests:
    Introducing med halt tests for confidence evaluation, enabling more robust insights into your LLMs’ reliability under diverse conditions.

  • 📖 Expanded Task Support for JSL LLM Models:
    QA and Summarization tasks are now fully supported for JSL LLM models, enhancing their capabilities for real-world applications.

  • 🔒Security Enhancements:
    Critical vulnerabilities and security issues have been addressed, reinforcing LangTest's overall stability and safety.

  • 🐛 Resolved Bugs:
    We’ve fixed issues with templatic augmentation to ensure consistent, accurate, and reliable outputs across your workflows.

🔥 Key Enhancements

🛠 De-biasing Data Augmentation

Open In Colab

We’ve integrated de-biasing techniques into our data augmentation process, ensuring more equitable and representative model assessments.

Key Features:

  • Eliminates biases in training data to improve model fairness.
  • Enhances diversity in augmented datasets for better generalization.

How it works:
Load the dataset:

from datasets import load_dataset

dataset = load_dataset("RealTimeData/bbc_news_alltime", "2024-12", split="train")

# take a 500-row sample of the dataset
df = dataset.to_pandas()
sample = df.sample(500)

# keep only shorter articles to avoid context-overflow errors
sample = sample[sample['content'].apply(lambda x: len(x) < 1000)]
# set up the de-biasing processor
import pandas as pd

from langtest.augmentation.debias import DebiasTextProcessing

processing = DebiasTextProcessing(
    model="gpt-4o-mini",
    hub="openai",
    model_kwargs={
        "temperature": 0,
    },
)

processing.initialize(
    input_dataset=sample,
    output_dataset=pd.DataFrame({}),
    text_column="content",
)

output, reason = processing.apply_bias_correction(bias_tolerance_level=2)

output.head()
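
Since output.head() above behaves like a pandas DataFrame, the corrected data can be saved for reuse and the accompanying reason object inspected. This is a minimal sketch; the exact structure of reason is an assumption rather than a documented return type.

# minimal sketch: persist the de-biased sample for later augmentation runs
output.to_csv("debiased_bbc_sample.csv", index=False)

# `reason` explains which rows were flagged and why; its exact structure
# is an assumption here, so we simply print it for inspection
print(reason)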


🔄Evaluation with Structured Outputs

Open In Colab

LangTest now supports structured output APIs for OpenAI, Ollama, and Azure-OpenAI, offering greater flexibility and precision when processing model responses.

Key Features:

  • Supports structured LLM outputs for better parsing and analysis.
  • Integrates effortlessly with OpenAI, Ollama, and Azure-OpenAI.

How it works:

Pydantic Model Setup:

from pydantic import BaseModel
from langtest import Harness

class Answer(BaseModel):
    
    class Rationale(BaseModel):
        """Explanation for an answer. why the answer is correct or incorrect with a valid reasons, a score, and a summary."""
        reason: str
        score: float
        summary: str

    answer: bool
    rationale: Rationale

    def __eq__(self, other: 'Answer') -> bool:
        return self.answer == other.answer
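
Because Answer is a plain Pydantic model, you can check what a conforming structured response looks like by validating a hand-written JSON payload against it before wiring it into the harness; the JSON below is a made-up example (Pydantic v2 API), not actual model output.

# illustrative only: validate a hypothetical structured response against the schema
raw = '{"answer": true, "rationale": {"reason": "The passage states it explicitly.", "score": 0.9, "summary": "Directly supported."}}'
parsed = Answer.model_validate_json(raw)
print(parsed.answer, parsed.rationale.score)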

Harness Setup:

harness = Harness(
    task='question-answering',
    model={
        'model': 'llama3.1',
        'hub': 'ollama',
        'type': 'chat',
        'output_schema': Answer,
    },
    data={
        "data_source": "BoolQ",
        "split": "test-tiny",
    },
    config={
        "tests": {
            "defaults": {
                "min_pass_rate": 0.5,
            },
            "robustness": {
                "uppercase": {
                    "min_pass_rate": 0.8,
                },
                "add_ocr_typo": {
                    "min_pass_rate": 0.8,
                },
                "add_tabs": {
                    "min_pass_rate": 0.8,
                }
            }
        }
    }
)

harness.generate().run().report()
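
The same schema-driven setup works with the other supported hubs. As a minimal sketch (the model name, pass rates, and trimmed config are illustrative assumptions), swapping the Ollama model for the OpenAI hub looks like this, provided OPENAI_API_KEY is set in the environment:

# hypothetical variant: same structured-output schema, OpenAI hub instead of Ollama
harness_openai = Harness(
    task='question-answering',
    model={
        'model': 'gpt-4o-mini',   # illustrative model choice
        'hub': 'openai',
        'type': 'chat',
        'output_schema': Answer,
    },
    data={"data_source": "BoolQ", "split": "test-tiny"},
    config={
        "tests": {
            "defaults": {"min_pass_rate": 0.5},
            "robustness": {"uppercase": {"min_pass_rate": 0.8}},
        }
    },
)

harness_openai.generate().run().report()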


🏥 Confidence Testing with Med Halt Tests

Open In Colab

Gain deeper insights into your LLMs’ robustness and reliability under diverse conditions with our upgraded Med Halt tests. This release focuses on refining confidence assessments in LLMs.

Key Features:

  • Identifies and prevents overconfident, incorrect responses in critical scenarios.
  • Strengthens confidence evaluation through the targeted tests below (an illustrative perturbation sketch follows the list).

Available tests:

  • FCT (False Confidence Test): Detects when a model is overly confident in incorrect answers by swapping answer options and including a "None of the Above" option.
  • FQT (Fake Questions Test): Evaluates the model's ability to handle questions presented out of their original context by exchanging contextual information.
  • NOTA (None of the Above) Test: Assesses whether the model can recognize insufficient information by replacing the correct answer with a "None of the Above" option.
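
To make the option perturbation concrete, here is a purely illustrative sketch in plain Python of what a NOTA-style rewrite of a multiple-choice item looks like; this is not LangTest's internal representation, just the idea behind the test.

# illustrative only: not LangTest internals
original = {
    "question": "Which vitamin deficiency causes scurvy?",
    "options": {"A": "Vitamin A", "B": "Vitamin C", "C": "Vitamin D", "D": "Vitamin K"},
    "answer": "B",
}

# NOTA perturbation: the correct option is replaced with "None of the Above",
# so a well-calibrated model should now select that option instead
nota_item = {
    "question": original["question"],
    "options": {**original["options"], original["answer"]: "None of the Above"},
    "expected": original["answer"],
}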

How it works:

from langtest import Harness 


harness = Harness(
    task="question-answering",
    model={
        "model": "phi4-mini",
        "hub": "ollama",
        "type": "chat"
        # "model": "gpt-4o-mini",
        # "hub": "openai",
    },
    data={
        "data_source": "MMLU",
        "split": "clinical",
    },
    config={
        "model_parameters": {
            "user_prompt": (
                    "You are a knowledgeable AI Assistant. Please provide the best possible choice (A or B or C or D) from the options"
                    "to the following MCQ question with the given options. Note: only provide the choice and don't given any explanations\n"
                    "Question:\n{question}\n"
                    "Options:\n{options}\n"
                    "Correct Choice (A or B or C or D): "
                    
            )
        },
        "tests": {
            
            "defaults": {
                "min_pass_rate": 0.75,

            },
            "clinical": {
                "nota": {"min_pass_rate": 0.75},
            }
        },
        "evaluation": {
            "metric": "llm_eval",
            "model": "gpt-4o-mini",
            "hub": "openai",
        }
    }
)

Generate and Execute the test cases:

harness.generate().run()

Inspect the generated results and the report:

harness.generated_results()


harness.report()


📖 QA and Summarization Support for JSL LLM Models

Open In Colab

JSL LLM models now support both Question Answering (QA) and Summarization tasks, making testing more practical for real-world scenarios.

Key Features:

  • Tests the model's ability to deliver clear and accurate answers.
  • Evaluates the model's skill in creating concise summaries from longer texts.

How it works:

Pipeline Setup:

# Spark NLP for Healthcare setup; the exact import paths below are an
# assumption and may vary with your sparknlp_jsl version
import os

from pyspark.ml import Pipeline
from sparknlp.base import MultiDocumentAssembler
from sparknlp_jsl.annotator import MedicalQuestionAnswering

document_assembler = MultiDocumentAssembler()\
    .setInputCols("question", "context")\
    .setOutputCols("document_question", "document_context")

med_qa = MedicalQuestionAnswering.pretrained("clinical_notes_qa_base_onnx", "en", "clinical/models")\
    .setInputCols(["document_question", "document_context"])\
    .setCustomPrompt(("You are an AI bot specializing in providing accurate and concise answers to questions"
                      ". You will be presented with a medical question and multiple-choice answer options."
                      " Your task is to choose the correct answer.\nQuestion: {question}\nOptions: {options}\n Answer:"))\
    .setOutputCol("answer")

pipeline = Pipeline(stages=[document_assembler, med_qa])

empty_data = spark.createDataFrame([[""]]).toDF("text")

model = pipeline.fit(empty_data)

# the llm_eval metric below uses an OpenAI judge model
os.environ["OPENAI_API_KEY"] = "<API KEY>"

Harness Setup:

from langtest import Harness 

harness = Harness(
    task="question-answering",
    model={
        "model": model,
        "hub": "johnsnowlabs",
    },
    data={
        "data_source": "PubMedQA",
        "subset": "pqaa",
        "split": "test",
    },
    config={  
        "tests": {
            "defaults": {
                "min_pass_rate": 0.5,
            },
            "robustness": {
                "uppercase": {
                    "min_pass_rate": 0.5,
                },
                "lowercase": {
                    "min_pass_rate": 0.5,
                },
                "add_ocr_typo": {
                    "min_pass_rate": 0.5,
                },
                "add_slangs": {
                    "min_pass_rate": 0.5,
                }
            }
        },
        "evaluation": {
            "metric": "llm_eval",
            "model": "gpt-4o-mini",
            "hub": "openai"
        }
    }
)

Generate, run, and report on the test cases:

harness.generate().run().report()

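Summarization is configured the same way. Below is a minimal sketch, where summarizer_model is a placeholder for a fitted JSL pipeline that produces summaries, and the XSum dataset and split names are illustrative assumptions:

# minimal sketch for the summarization task; `summarizer_model` is a placeholder
summ_harness = Harness(
    task="summarization",
    model={
        "model": summarizer_model,
        "hub": "johnsnowlabs",
    },
    data={
        "data_source": "XSum",
        "split": "test-tiny",
    },
    config={
        "tests": {
            "defaults": {"min_pass_rate": 0.5},
            "robustness": {
                "uppercase": {"min_pass_rate": 0.5},
                "add_ocr_typo": {"min_pass_rate": 0.5},
            }
        },
        "evaluation": {
            "metric": "llm_eval",
            "model": "gpt-4o-mini",
            "hub": "openai",
        }
    }
)

summ_harness.generate().run().report()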

🔒 Security Enhancements

Critical vulnerabilities and security issues have been resolved, reinforcing the overall stability and safety of our platform. In this update, we upgraded dependencies to fix vulnerabilities, ensuring a more secure and reliable environment for our users.

🐛 Fixes

⚡ Enhancements

What's Changed

Full Changelog: 2.5.0...2.6.0