John Snow Labs Releases LangTest 2.6.0: De-biasing Data Augmentation, Structured Output Evaluation, Med Halt Confidence Tests, Expanded QA & Summarization Support, and Enhanced Security
📢 Highlights
We are excited to introduce the latest LangTest release, bringing you a suite of improvements designed to streamline model evaluation and enhance overall performance:
- 🛠 De-biasing Data Augmentation: We’ve integrated de-biasing techniques into our data augmentation process, ensuring more equitable and representative model assessments.
- 🔄 Evaluation with Structured Outputs: LangTest now supports structured output APIs for OpenAI, Azure-OpenAI, and Ollama, offering greater flexibility and precision when processing model responses.
- 🏥 Confidence Testing with Med Halt Tests: Introducing Med Halt tests for confidence evaluation, enabling more robust insights into your LLMs’ reliability under diverse conditions.
- 📖 Expanded Task Support for JSL LLM Models: QA and Summarization tasks are now fully supported for JSL LLM models, enhancing their capabilities for real-world applications.
- 🔒 Security Enhancements: Critical vulnerabilities and security issues have been addressed, reinforcing LangTest’s overall stability and safety.
- 🐛 Resolved Bugs: We’ve fixed issues with templatic augmentation to ensure consistent, accurate, and reliable outputs across your workflows.
🔥 Key Enhancements
🛠 De-biasing Data Augmentation
We’ve integrated de-biasing techniques into our data augmentation process, ensuring more equitable and representative model assessments.
Key Features:
- Eliminates biases in training data to improve model fairness.
- Enhances diversity in augmented datasets for better generalization.
How it works:
To load the dataset:
from datasets import load_dataset
import pandas as pd

dataset = load_dataset("RealTimeData/bbc_news_alltime", "2024-12", split="train")

# sample the dataset down to 500 rows
df = dataset.to_pandas()
sample = df.sample(500)

# drop long articles to avoid context-overflow errors
sample = sample[sample['content'].apply(lambda x: len(x) < 1000)]

# let's set up the de-biasing
from langtest.augmentation.debias import DebiasTextProcessing

processing = DebiasTextProcessing(
    model="gpt-4o-mini",
    hub="openai",
    model_kwargs={
        "temperature": 0,
    },
)

processing.initialize(
    input_dataset=sample,
    output_dataset=pd.DataFrame({}),
    text_column="content",
)

output, reason = processing.apply_bias_correction(bias_tolerance_level=2)
output.head()
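Assuming the corrected output comes back as a pandas DataFrame (it is initialized from pd.DataFrame({}) above and supports .head()), it can be persisted for reuse in downstream augmentation or training; a minimal follow-up sketch (the file name is illustrative):
# Persist the de-biased sample for later use; the file name is an illustrative choice.
output.to_csv("debiased_bbc_sample.csv", index=False)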
🔄 Evaluation with Structured Outputs
LangTest now supports structured output APIs for OpenAI, Ollama, and Azure-OpenAI, offering greater flexibility and precision when processing model responses.
Key Features:
- Supports structured LLM outputs for better parsing and analysis.
- Integrates effortlessly with OpenAI, Ollama, and Azure-OpenAI.
How it works:
Pydantic Model Setup:
from pydantic import BaseModel
from langtest import Harness

class Answer(BaseModel):
    class Rationale(BaseModel):
        """Explanation for an answer: why it is correct or incorrect, with a reason, a score, and a summary."""
        reason: str
        score: float
        summary: str

    answer: bool
    rationale: Rationale

    def __eq__(self, other: 'Answer') -> bool:
        # Two answers are considered equal when their boolean `answer` fields match; the rationale is ignored.
        return self.answer == other.answer
Harness Setup:
harness = Harness(
    task='question-answering',
    model={
        'model': 'llama3.1',
        'hub': 'ollama',
        'type': 'chat',
        'output_schema': Answer,
    },
    data={
        "data_source": "BoolQ",
        "split": "test-tiny",
    },
    config={
        "tests": {
            "defaults": {
                "min_pass_rate": 0.5,
            },
            "robustness": {
                "uppercase": {
                    "min_pass_rate": 0.8,
                },
                "add_ocr_typo": {
                    "min_pass_rate": 0.8,
                },
                "add_tabs": {
                    "min_pass_rate": 0.8,
                },
            },
        },
    },
)
harness.generate().run().report()
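The same `Answer` schema can also be pointed at the other supported hubs; a minimal sketch using the OpenAI hub (the gpt-4o-mini model name here is illustrative, and the BoolQ setup is reused from above):
# A minimal sketch reusing the Answer schema with the OpenAI hub; model name is illustrative.
harness_openai = Harness(
    task='question-answering',
    model={
        'model': 'gpt-4o-mini',
        'hub': 'openai',
        'type': 'chat',
        'output_schema': Answer,
    },
    data={"data_source": "BoolQ", "split": "test-tiny"},
    config={
        "tests": {
            "defaults": {"min_pass_rate": 0.5},
            "robustness": {"uppercase": {"min_pass_rate": 0.8}},
        }
    },
)
harness_openai.generate().run().report()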
🏥 Confidence Testing with Med Halt Tests
Gain deeper insights into your LLMs’ robustness and reliability under diverse conditions with our upgraded Med Halt tests. This release focuses on refining confidence assessments in LLMs.
Key Features:
- Identifies and prevents overconfident, incorrect responses in critical scenarios.
- Enhances confidence evaluation with targeted hallucination tests.
| Test Name | Description |
|---|---|
| FCT (False Confidence Test) | Detects when an AI model is overly confident in incorrect answers by swapping answer options and including a "None of the Above" option. |
| FQT (Fake Questions Test) | Evaluates the model's ability to handle questions presented out of their original context by exchanging contextual information. |
| NOTA Test | Assesses whether the model can recognize insufficient information by replacing the correct answer with a "None of the Above" option. |
How it works:
from langtest import Harness
harness = Harness(
    task="question-answering",
    model={
        "model": "phi4-mini",
        "hub": "ollama",
        "type": "chat",
        # "model": "gpt-4o-mini",
        # "hub": "openai",
    },
    data={
        "data_source": "MMLU",
        "split": "clinical",
    },
    config={
        "model_parameters": {
            "user_prompt": (
                "You are a knowledgeable AI Assistant. Please provide the best possible choice (A or B or C or D) from the options "
                "to the following MCQ question with the given options. Note: only provide the choice and don't give any explanations.\n"
                "Question:\n{question}\n"
                "Options:\n{options}\n"
                "Correct Choice (A or B or C or D): "
            )
        },
        "tests": {
            "defaults": {
                "min_pass_rate": 0.75,
            },
            "clinical": {
                "nota": {"min_pass_rate": 0.75},
            },
        },
        "evaluation": {
            "metric": "llm_eval",
            "model": "gpt-4o-mini",
            "hub": "openai",
        },
    },
)
Generate and Execute the test cases:
harness.generate().run()
Report:
harness.generated_results()
harness.report()
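The example above enables only the NOTA test. Assuming the FCT and FQT variants are exposed under similarly named keys in the clinical test category (the `fct` and `fqt` keys below are an assumption for illustration, not confirmed API), a broader test config could look like this sketch:
# Hedged sketch: "fct" and "fqt" key names are assumptions based on the test names above.
clinical_tests = {
    "tests": {
        "defaults": {"min_pass_rate": 0.75},
        "clinical": {
            "nota": {"min_pass_rate": 0.75},  # None of the Above test (shown in the example above)
            "fct": {"min_pass_rate": 0.75},   # False Confidence Test (assumed key)
            "fqt": {"min_pass_rate": 0.75},   # Fake Questions Test (assumed key)
        },
    },
}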
📖 QA and Summarization Support for JSL LLM Models
JSL LLM models now support both Question Answering (QA) and Summarization tasks, making testing more practical in real-world scenarios.
Key Features:
- Tests the model's ability to deliver clear and accurate answers.
- Evaluates the model's skill in creating concise summaries from longer texts.
How it works:
Pipeline Setup:
# imports assume the standard Spark NLP / Spark NLP for Healthcare module layout
from pyspark.ml import Pipeline
from sparknlp.base import MultiDocumentAssembler
from sparknlp_jsl.annotator import MedicalQuestionAnswering

document_assembler = MultiDocumentAssembler()\
    .setInputCols("question", "context")\
    .setOutputCols("document_question", "document_context")

med_qa = MedicalQuestionAnswering().pretrained("clinical_notes_qa_base_onnx", "en", "clinical/models")\
    .setInputCols(["document_question", "document_context"])\
    .setCustomPrompt(("You are an AI bot specializing in providing accurate and concise answers to questions"
                      ". You will be presented with a medical question and multiple-choice answer options."
                      " Your task is to choose the correct answer.\nQuestion: {question}\nOptions: {options}\n Answer:"))\
    .setOutputCol("answer")

pipeline = Pipeline(stages=[document_assembler, med_qa])
empty_data = spark.createDataFrame([[""]]).toDF("text")
model = pipeline.fit(empty_data)
import os
# for evaluation
os.environ["OPENAI_API_KEY"] = "<API KEY>"
Harness Setup:
from langtest import Harness
harness = Harness(
    task="question-answering",
    model={
        "model": model,
        "hub": "johnsnowlabs",
    },
    data={
        "data_source": "PubMedQA",
        "subset": "pqaa",
        "split": "test",
    },
    config={
        "tests": {
            "defaults": {
                "min_pass_rate": 0.5,
            },
            "robustness": {
                "uppercase": {
                    "min_pass_rate": 0.5,
                },
                "lowercase": {
                    "min_pass_rate": 0.5,
                },
                "add_ocr_typo": {
                    "min_pass_rate": 0.5,
                },
                "add_slangs": {
                    "min_pass_rate": 0.5,
                },
            },
        },
        "evaluation": {
            "metric": "llm_eval",
            "model": "gpt-4o-mini",
            "hub": "openai",
        },
    },
)
Generate and run the test cases:
harness.generate().run().report()
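The example above exercises the QA task. For the newly supported Summarization task, a minimal sketch could look like the following; note that the MedicalSummarizer annotator, the summarizer_clinical_jsl pretrained model, and the XSum dataset wiring are assumptions for illustration rather than part of this release's example:
# Hedged sketch of a summarization harness for a JSL pipeline.
# The MedicalSummarizer annotator, the "summarizer_clinical_jsl" pretrained model,
# and the XSum dataset choice are illustrative assumptions.
from pyspark.ml import Pipeline
from sparknlp.base import DocumentAssembler
from sparknlp_jsl.annotator import MedicalSummarizer
from langtest import Harness

document_assembler = DocumentAssembler()\
    .setInputCol("text")\
    .setOutputCol("document")

summarizer = MedicalSummarizer.pretrained("summarizer_clinical_jsl", "en", "clinical/models")\
    .setInputCols(["document"])\
    .setOutputCol("summary")

sum_pipeline = Pipeline(stages=[document_assembler, summarizer])
sum_model = sum_pipeline.fit(spark.createDataFrame([[""]]).toDF("text"))

sum_harness = Harness(
    task="summarization",
    model={"model": sum_model, "hub": "johnsnowlabs"},
    data={"data_source": "XSum", "split": "test-tiny"},
    config={
        "tests": {
            "defaults": {"min_pass_rate": 0.5},
            "robustness": {"uppercase": {"min_pass_rate": 0.5}},
        }
    },
)
sum_harness.generate().run().report()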
🔒 Security Enhancements
Critical vulnerabilities and security issues have been resolved, reinforcing the overall stability and safety of our platform. In this update, we upgraded dependencies to fix vulnerabilities, ensuring a more secure and reliable environment for our users.
🐛 Fixes
- fix: better handling of extra model params in Harness by @chakravarthik27 in #1183
- fixes: resolving the bugs 2_6_0rc versions by @chakravarthik27 in #1182
- Fix vulnerabilities and security issues by @chakravarthik27 in #1160
- fix(bug): update model handling in OpenAI and AzureOpenAI configurations by @chakravarthik27 in #1178
⚡ Enhancements
- vulnerabilities and security issues by @chakravarthik27 in #1161
- chore: update certifi, idna, zipp versions and add extras in poetry.lock by @chakravarthik27 in #1162
- updated the openai dependencies by @chakravarthik27 in #1172
- feat: add support for generating templates using Ollama provider by @chakravarthik27 in #1180
What's Changed
- website updates for public view by @chakravarthik27 in #1158
- Fix vulnerabilities and security issues by @chakravarthik27 in #1160
- vulnerabilities and security issues by @chakravarthik27 in #1161
- chore: update certifi, idna, zipp versions and add extras in poetry.lock by @chakravarthik27 in #1162
- Update the Medical_Dataset NB by @chakravarthik27 in #1169
- Feature/data augmentation for de biasing by @chakravarthik27 in #1164
- updated the openai dependencies by @chakravarthik27 in #1172
- feat: enhance model handling with additional info and output schema s… by @chakravarthik27 in #1168
- feat: add support for question answering model in JSL model handler by @chakravarthik27 in #1174
- fix(bug): update model handling in OpenAI and AzureOpenAI configurations by @chakravarthik27 in #1178
- Feature/add integration to deepseek by @chakravarthik27 in #1176
- Feature/implement med halt tests for robust model evaluation by @chakravarthik27 in #1170
- feat: add support for generating templates using Ollama provider by @chakravarthik27 in #1180
- fixes: resolving the bugs 2_6_0rc versions by @chakravarthik27 in #1182
- fix: better handling of extra model params in Harness by @chakravarthik27 in #1183
- chore: update version to 2.6.0 by @chakravarthik27 in #1185
- Release/2.6.0 by @chakravarthik27 in #1184
Full Changelog: 2.5.0...2.6.0