Skip to content

Feature: multilingual machine translation #943

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 79 commits into from
Apr 10, 2025

Conversation

SnowMasaya
Copy link
Contributor

@SnowMasaya SnowMasaya commented Oct 9, 2024

This revision enables translation support for a streamlined the use case. Some items such
as support for multiple target languages in a single run have been omitted in favor of a
more straight forward user experience and reduced ambiguity in the scope of results.

Translator classes have been implemented using the Configurable pattern and the plugin
loader. This introduced a new paradigm of providing configuration for a list of instances
with specific configuration required at runtime where previous Configurable class
configuration has been for all instances of a specific class or module. The processing and
attribute names used to create this instance list may evolve further.

Usage

Translation function is configured in the run section of a configuration see the doc
page in the PR for details.

New default configuration values for run.target_lang and run.translators are in the
updated documentation and allow for backwards compatible configuration with existing runs.

There are still some existing TODO: comments and notes about location that may need
further testing before landing this upstream. Most noteworthy are comments still in the
code of the atkgen probe that require further scrutiny to validate the attack technique
is applied correctly.

It may be appropriate to gate this functionality as experimental for initial release, this
would required some additional guard code to ensure limited impact to report formats and
internal state.

Example

python -m garak -m huggingface.Model --config hf_RigoChat_gpu.yml -p lmrc --report_prefix RigoChat-21fde039

hf_RigoChat_gpu.yml:

run:
  target_lang: "es"
  translators:
    - language: es,en
      model_type: local
      model_name: facebook/m2m100_418M
      hf_args:
        device: cuda
    - language: en,es
      model_type: local
      model_name: facebook/m2m100_418M
      hf_args:
        device: cuda
plugins:
  generators:
    huggingface:
      Model:
        name: IIC/RigoChat-7b-v2
        hf_args:
          device: cuda
          trust_remote_code: true
          torch_dtype: float16

Copy link
Contributor

github-actions bot commented Oct 9, 2024

DCO Assistant Lite bot All contributors have signed the DCO ✍️ ✅

- Integrated support for multiple translation services including local and external APIs.
  - local: Huggingface model uses for translation
  - deepl:DeepL uses for translation
  - nim: NIM uses for translation
- Implemented utility functions for language detection and text processing.

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
- Addd translation function for base probe class
- prompts and triggers translate by base class method
- attempt_descr translation

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
- Translation handling for detector keywords and substrings, triggers.

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
- Added support for specifying translation services directly from the CLI.
- Implemented options to set  target languages for translation.

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
- Added new dependencies required for enhanced translation features.

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
- Added detailed explanations of the translationn method
- Included examples of how translation services are configured and utilized within the codebase.

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
@SnowMasaya SnowMasaya force-pushed the feature/multilingual branch from b781a6f to 6bb7da3 Compare October 9, 2024 01:41
@SnowMasaya
Copy link
Contributor Author

SnowMasaya commented Oct 9, 2024

I have read the DCO Document and I hereby sign the DCO

Copy link
Collaborator

@jmartin-tech jmartin-tech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review, testing is still in progress.

The test failures in macOS look to be an incomplete dependency requirement that may need to be reworked or removed. A default installation should not require install of an external library. Hence the dependency on pyenchant here may be problematic.

Traceback:
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/importlib/__init__.py:90: in import_module
    return _bootstrap._gcd_import(name[level:], package, level)
tests/detectors/test_detectors_riskywords.py:8: in <module>
    import garak.detectors.base
garak/detectors/__init__.py:1: in <module>
    from .base import *
garak/detectors/base.py:17: in <module>
    from garak.translator import SimpleTranslator, LocalTranslator, is_english
garak/translator.py:15: in <module>
    import enchant
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/enchant/__init__.py:81: in <module>
    from enchant import _enchant as _e
/Library/Frameworks/Python.framework/Versions/3.12/lib/python3.12/site-packages/enchant/_enchant.py:[157](https://github.com/leondz/garak/actions/runs/11246708414/job/31296143918?pr=943#step:5:158): in <module>
    raise ImportError(msg)
E   ImportError: The 'enchant' C library was not found and maybe needs to be installed.
E   See  https://pyenchant.github.io/pyenchant/install.html
E   for details

update remove punctuation
update english judge
add translate function
add logging translate result
add Reverse translate for hf detector and snowball probes

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
check translator instance
remove translate function
reset config

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
check translator instance
add reverse translator
add test reverse translator

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
remove argument
using generator_option_file

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
add load translator instance

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
check storage size
set up each instance for each test

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
remove pyenchant
Using nltk instead of pyenchant

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
update how to use translation function

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
github-actions bot added a commit that referenced this pull request Oct 23, 2024
SnowMasaya and others added 3 commits October 23, 2024 14:13
Signed-off-by: SnowGushiGit <snow.akogi.pgel@gmail.com>
Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
Copy link
Collaborator

@leondz leondz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you, this is a significant amount of work.

This is for translation-based multilingual, rather than general multilingual support, so we should be careful to separate these two functions. The ability to select which languages garak uses is distinct from how that language is achieved (in prompts, detectors, and so on).

There are some refactorings needed to get the PR to a sustainable state. We may need a couple more hooks, either injected into base classes or added there. It might be simplest to add these hooks to the harness. Happy to set up a call or instant message chat to discuss this. Given the breadth of the changes, it is likely to be beneficial to discuss plans and get good alignment while doing the rest of the changes.

Copy link
Collaborator

@jmartin-tech jmartin-tech left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@SnowMasaya thank you, this is a significant benefit for the project.

Adding a number of my pending comments here, please be aware these are likely similar to many @leondz made, although possibly from different reasoning. Happy to continue iterating.

@leondz
Copy link
Collaborator

leondz commented Oct 30, 2024

Adding a number of my pending comments here, please be aware these are likely similar to many @leondz made, although possibly from different reasoning. Happy to continue iterating.

Indeed, sorry for the dupe review, there was quite a lot of code so I reviewed directly instead of processing @jmartin-tech 's comments as well

add mean judge for reverse translation
change translation model size

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
translate trigger words

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
add reverse translation
remove trigger translation
fix test code

Signed-off-by: Masaya Ogushi <mogushi@nvidia.com>
* mark private functions more clearly
* remove translator from detectors
* clarify runtime service actions as distinct method
* remove guards on `self.translator` as existing on all probes
* lazy load local `transformers` classes
* reduce setup in detector translation tests

Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
Copy link
Collaborator

@leondz leondz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We're converging. Skipping over comments on some of the tests, for now, until the core patterns are completely settled.

]
# non_none_offsets = [k for k,v in enumerate(attempt.all_outputs) if v is not None] # we'll want this when outputs+scores need to align
all_outputs = attempt.all_outputs
if attempt.bcp47 != "*" and attempt.bcp47 != self.bcp47:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's one example of the confusion I guess. Propose then:

Suggested change
if attempt.bcp47 != "*" and attempt.bcp47 != self.bcp47:
if attempt.prompt_bcp47 != "*" and attempt.prompt_bcp47 != self.bcp47:

# non_none_offsets = [k for k,v in enumerate(attempt.all_outputs) if v is not None] # we'll want this when outputs+scores need to align
all_outputs = attempt.all_outputs
if attempt.bcp47 != "*" and attempt.bcp47 != self.bcp47:
all_outputs = attempt.reverse_translator_outputs
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, that's fair. If users want to override detector selection that's up to them.

the idea here is that the probe is responsible for properly handling reverse translation requirements and detectors will utilize the Attempt they are presented.

Thanks. In this case, it's not right that detector code is referencing reverse_translator_outputs - the probe is responsible for properly handling requirements. Or is that responsibility actually split between probe and detector?

@@ -67,8 +67,18 @@ def detect(self, attempt: Attempt) -> List[float]:
import torch

premise = attempt.prompt
all_outputs = attempt.all_outputs
if attempt.bcp47 != "*" and attempt.bcp47 != self.bcp47:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I like this. Then the test (in many places) would be if _use_translation_entries(attempt):. Where would this function sit? Maybe in garak.attempt ?

Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
* rename translate_prompts -> translate
* extract translate_descr into goodside.Tag as only usage

Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
* add missing SPDX
* prefer more specific object resolution in message
* adjust and reduce excess log entries

Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
@leondz
Copy link
Collaborator

leondz commented Apr 3, 2025

lgtm

Todos for after landing:

Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
The `-` character is used for sub-dialect selection.

* use `,` for language pairs
* add dialect override for riva `es` for latest service
* avoid changing the parsed language & store a private overload
* sanity check remote service with a simple translation request
* add supported langauge validation to `m2m100` models from model card

Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
This pattern needs to be reworked, any service supported dialect should be held
first and if a more general value is configured then the fallback should be used.

Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
* only attempt to test a translator on first initialization
* ensure client is set on each call in `RivaTranslator` as serialization clears the client
* correct param syntax for initial test in `DeeplTranslator`

This may need further iteration as the `riva.client.Auth` object hold the tcp channel for the client
and is likely re-established excessively in this code when multiprocessing serializes the core process
object that is currerntly not consumed in the worker process.

Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
This is a hack and should be reworked!

Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
* `TreeSearchProbe` should not attempt any reverse translation

Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
When processing a list of outputs `None` may be in the list
do not strip this value from the list and guard internal calls

Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
This should become a more explicit test in a future iteration

Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
Signed-off-by: Jeffrey Martin <jemartin@nvidia.com>
Copy link
Collaborator

@leondz leondz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks great to me. rocket emoji etc etc. this is super cool

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

trivia point: the Marian author was one of the participants in the demon paper!

@@ -105,6 +122,10 @@ class DeeplTranslator(Translator):
"en"
]
# fmt: on
# Applied when a service only supports regions specific codes
bcp47_overrides = {
"en": "en-US",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm disdainful of not making English English default English. I guess empire instincts die hard.

@jmartin-tech jmartin-tech changed the title Feature/multilingual Feature: multilingual machine translation Apr 10, 2025
@jmartin-tech jmartin-tech merged commit b087184 into NVIDIA:main Apr 10, 2025
9 checks passed
@github-actions github-actions bot locked and limited conversation to collaborators Apr 10, 2025
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants