.Net: Summarization and translation evaluation examples with Filters #6262

Merged
merged 9 commits into microsoft:main from quality-check-examples on May 17, 2024

Conversation

dmytrostruk (Member)

Motivation and Context

This example demonstrates how to perform quality checks on LLM results for tasks such as text summarization and translation, using Semantic Kernel Filters.

Metrics used in this example:

  • [BERTScore](https://github.com/Tiiiger/bert_score) - leverages the pre-trained contextual embeddings from BERT and matches words in candidate and reference sentences by cosine similarity.
  • [BLEU](https://en.wikipedia.org/wiki/BLEU) (BiLingual Evaluation Understudy) - evaluates the quality of text that has been machine-translated from one natural language to another (see the standard formulation after this list).
  • [METEOR](https://en.wikipedia.org/wiki/METEOR) (Metric for Evaluation of Translation with Explicit ORdering) - evaluates the similarity between the generated summary and the reference summary, taking grammar and semantics into account.
  • [COMET](https://unbabel.github.io/COMET) (Crosslingual Optimized Metric for Evaluation of Translation) - an open-source framework used to train machine translation metrics that achieve high levels of correlation with different types of human judgments.
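For reference, BLEU combines modified n-gram precisions with a brevity penalty; the textbook formulation (general background, not something introduced by this PR) is:

$$
\text{BLEU} = BP \cdot \exp\left(\sum_{n=1}^{N} w_n \log p_n\right), \qquad
BP = \begin{cases} 1 & \text{if } c > r \\ e^{\,1 - r/c} & \text{if } c \le r \end{cases}
$$

where $p_n$ is the modified n-gram precision, $w_n$ the n-gram weights (typically uniform with $N = 4$), $c$ the candidate length, and $r$ the reference length.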

In this example, SK Filters call a dedicated server that is responsible for task evaluation using the metrics described above. If the evaluation score for a specific metric doesn't meet the configured threshold, an exception is thrown with the evaluation details.
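A minimal sketch of what such a filter can look like is shown below. This is illustrative only, not the exact code added in this PR: the `http://localhost:8080/bert/score` endpoint, the `input` argument name, and the numeric JSON response are assumptions about the evaluation server.

```csharp
using System.Net.Http.Json;
using Microsoft.SemanticKernel;

// Sketch: a function invocation filter that forwards the LLM output to an
// evaluation server and rejects the result when the score is below a threshold.
public sealed class EvaluationFilter : IFunctionInvocationFilter
{
    private readonly HttpClient _httpClient;
    private readonly double _threshold;

    public EvaluationFilter(HttpClient httpClient, double threshold)
    {
        _httpClient = httpClient;
        _threshold = threshold;
    }

    public async Task OnFunctionInvocationAsync(
        FunctionInvocationContext context,
        Func<FunctionInvocationContext, Task> next)
    {
        // Let the summarization/translation function run first.
        await next(context);

        var sourceText = context.Arguments["input"]?.ToString(); // assumed argument name
        var generatedText = context.Result.GetValue<string>();

        // Hypothetical endpoint exposed by the dedicated evaluation server.
        var response = await _httpClient.PostAsJsonAsync(
            "http://localhost:8080/bert/score",
            new { sources = new[] { sourceText }, predictions = new[] { generatedText } });

        response.EnsureSuccessStatusCode();

        // Assumes the server returns a single numeric score as the JSON body.
        var score = await response.Content.ReadFromJsonAsync<double>();

        if (score < _threshold)
        {
            throw new KernelException(
                $"Evaluation score {score} is below the configured threshold {_threshold}.");
        }
    }
}
```

A filter like this would typically be registered with the kernel's service collection, e.g. `kernelBuilder.Services.AddSingleton<IFunctionInvocationFilter>(new EvaluationFilter(httpClient, 0.85));`, so that it runs for every function invocation.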

The [Hugging Face Evaluate Metric](https://github.com/huggingface/evaluate) library is used to evaluate summarization and translation results.

Contribution Checklist

- [x] The code builds clean without any errors or warnings
- [x] The PR follows the [SK Contribution Guidelines](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md) and the [pre-submission formatting script](https://github.com/microsoft/semantic-kernel/blob/main/CONTRIBUTING.md#development-scripts) raises no violations
- [x] All unit tests pass, and I have added new tests where possible
- [x] I didn't break anyone 😄

@dmytrostruk self-assigned this on May 15, 2024
@dmytrostruk requested a review from a team as a code owner on May 15, 2024 05:02
@markwallace-microsoft added the .NET (Issue or Pull requests regarding .NET code) and documentation labels on May 15, 2024
@dmytrostruk added this pull request to the merge queue on May 17, 2024
Merged via the queue into microsoft:main with commit 51af5ee on May 17, 2024
15 checks passed
@dmytrostruk deleted the quality-check-examples branch on May 17, 2024 14:30
johnoliver pushed a commit to johnoliver/semantic-kernel that referenced this pull request Jun 5, 2024
johnoliver pushed a commit to johnoliver/semantic-kernel that referenced this pull request Jun 5, 2024
LudoCorporateShark pushed a commit to LudoCorporateShark/semantic-kernel that referenced this pull request Aug 25, 2024
Labels
documentation, .NET (Issue or Pull requests regarding .NET code)
3 participants