.Net: Summarization and translation evaluation examples with Filters #6262

Merged
merged 9 commits into microsoft:main from quality-check-examples on May 17, 2024

Conversation

dmytrostruk (Member)

Motivation and Context

This example demonstrates how to perform quality checks on LLM results for tasks such as text summarization and translation, using Semantic Kernel Filters.

Metrics used in this example:

  • [BERTScore](https://github.com/Tiiiger/bert_score) - leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity.
  • [BLEU](https://en.wikipedia.org/wiki/BLEU) (BiLingual Evaluation Understudy) - evaluates the quality of text that has been machine-translated from one natural language to another (see the standard formulation after this list).
  • [METEOR](https://en.wikipedia.org/wiki/METEOR) (Metric for Evaluation of Translation with Explicit ORdering) - evaluates the similarity between the generated summary and the reference summary, taking grammar and semantics into account.
  • [COMET](https://unbabel.github.io/COMET) (Crosslingual Optimized Metric for Evaluation of Translation) - an open-source framework used to train machine translation metrics that achieve high levels of correlation with different types of human judgments.
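For reference, BLEU combines modified n-gram precisions with a brevity penalty; the textbook formulation (general background, not something introduced by this PR) is:

$$
\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right), \qquad
BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}
$$

where $p_n$ is the modified n-gram precision, $w_n$ the n-gram weights (typically uniform with $N = 4$), $c$ the candidate length, and $r$ the reference length.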

In this example, SK Filters call a dedicated server that is responsible for task evaluation using the metrics described above. If the evaluation score for a specific metric doesn't meet the configured threshold, an exception is thrown with the evaluation details.
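A minimal sketch of what such a filter can look like is shown below. This is illustrative only, not the exact code added in this PR: the `http://localhost:8080/bert/score` endpoint, the `input` argument name, and the numeric JSON response are assumptions about the evaluation server.

```csharp
using System.Net.Http.Json;
using Microsoft.SemanticKernel;

// Sketch: a function invocation filter that forwards the LLM output to an
// evaluation server and rejects the result when the score is below a threshold.
public sealed class EvaluationFilter : IFunctionInvocationFilter
{
    private readonly HttpClient _httpClient;
    private readonly double _threshold;

    public EvaluationFilter(HttpClient httpClient, double threshold)
    {
        _httpClient = httpClient;
        _threshold = threshold;
    }

    public async Task OnFunctionInvocationAsync(
        FunctionInvocationContext context,
        Func<FunctionInvocationContext, Task> next)
    {
        // Let the summarization/translation function run first.
        await next(context);

        var sourceText = context.Arguments["input"]?.ToString(); // assumed argument name
        var generatedText = context.Result.GetValue<string>();

        // Hypothetical endpoint exposed by the dedicated evaluation server.
        var response = await _httpClient.PostAsJsonAsync(
            "http://localhost:8080/bert/score",
            new { sources = new[] { sourceText }, predictions = new[] { generatedText } });

        response.EnsureSuccessStatusCode();

        // Assumes the server returns a single numeric score as the JSON body.
        var score = await response.Content.ReadFromJsonAsync<double>();

        if (score < _threshold)
        {
            throw new KernelException(
                $"Evaluation score {score} is below the configured threshold {_threshold}.");
        }
    }
}
```

A filter like this would typically be registered with the kernel's service collection, e.g. `kernelBuilder.Services.AddSingleton<IFunctionInvocationFilter>(new EvaluationFilter(httpClient, 0.85));`, so that it runs for every function invocation.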

The [Hugging Face Evaluate Metric](https://github.com/huggingface/evaluate) library is used to evaluate summarization and translation results.

Contribution Checklist

- [x] The code builds clean without any errors or warnings
- [x] The PR follows the [SK Contribution Guidelines](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md) and the [pre-submission formatting script](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md#development-scripts) raises no violations
- [x] All unit tests pass, and I have added new tests where possible
- [x] I didn't break anyone 😄

@dmytrostruk self-assigned this on May 15, 2024
@dmytrostruk requested a review from a team as a code owner on May 15, 2024 05:02
@markwallace-microsoft added the .NET (Issue or Pull requests regarding .NET code) and documentation labels on May 15, 2024
@dmytrostruk added this pull request to the merge queue on May 17, 2024
Merged via the queue into microsoft:main with commit 51af5ee on May 17, 2024
15 checks passed
@dmytrostruk deleted the quality-check-examples branch on May 17, 2024 14:30
johnoliver pushed a commit to johnoliver/semantic-kernel that referenced this pull request Jun 5, 2024
johnoliver pushed a commit to johnoliver/semantic-kernel that referenced this pull request Jun 5, 2024
LudoCorporateShark pushed a commit to LudoCorporateShark/semantic-kernel that referenced this pull request Aug 25, 2024
Labels
documentation, .NET (Issue or Pull requests regarding .NET code)
3 participants