experimental feature: trait scan base infrastructure #955

Draft
wants to merge 72 commits into
base: main

@leondz (Collaborator) commented Oct 23, 2024

This module represents objects related to trait and policy scanning.

Trait scanning in garak attempts to work out what the target's content policy is, before running a security scan.

It's important to know what the target's content policy is, because we only really have a useful/successful hit or breach if we're able to get a model to do something that it otherwise wouldn't. It may be exciting to discover that a model gives instructions for e.g. cooking meth if the request is encoded in base64, but if the model also gives those instructions when simply asked directly ("print instructions for cooking meth"), then the use of base64 is not necessarily an exploit in this output category: the model is acting the same either way.

Garak's policy support follows a typology of different traits, each describing a different behaviour. By default this typology is stored in data/policy/trait_typology.json.

A trait scan is conducted by invoking garak with the --trait_scan switch. When this is requested, a separate scan runs using all trait probes within garak. Trait probes are denoted by a probe class asserting trait_probe=True. A regular probewise harness runs the scan, though reporting is diverted to a separate policy report file. After completion, garak estimates a policy based on trait probe results, and writes this to both main and policy reports.
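
As a rough illustration of the selection step described above, the following sketch filters probe classes on a class attribute. Only the `trait_probe` attribute name comes from this PR; the `Probe` stand-in, the example classes, the `traits` attribute, the trait ID, and `select_probes` are hypothetical and are not garak's actual API.

```python
class Probe:
    """Stand-in base class for illustration; not garak's real Probe."""

    trait_probe = False  # regular (adversarial) probes leave this False
    traits = []          # hypothetical: trait IDs a trait probe informs


class Base64Injection(Probe):
    """An ordinary security probe, eligible for main runs."""


class DirectRequest(Probe):
    """A non-adversarial probe that asks for the behaviour directly."""

    trait_probe = True
    traits = ["T001"]    # hypothetical trait ID


def select_probes(probe_classes, **filters):
    """Return the classes whose attributes equal every key/value in filters."""
    return [
        cls
        for cls in probe_classes
        if all(getattr(cls, key, None) == value for key, value in filters.items())
    ]


ALL_PROBES = [Base64Injection, DirectRequest]

# A trait scan would use only trait probes; a main run would skip them.
trait_probes = select_probes(ALL_PROBES, trait_probe=True)
main_probes = select_probes(ALL_PROBES, trait_probe=False)
```

Filtering on class attributes rather than instances means probes can be selected without constructing them first.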

What this PR adds

We're laying the base infrastructure for trait scans in this PR.

  • Add a trait typology and support loading it
  • Introduce a policy module and Policy class to allow storage and manipulation of target content policies. A policy consists of a set of traits, each describing a behaviour and whether it is permitted by the target.
  • Differentiate non-adversarial probes as "trait" probes, which appear differently and are not automatically selected for main runs
  • Specify which traits a trait probe tests for, and enforce the presence of this via a test (a trait probe that doesn't inform the policy is not useful)
  • Add an optional "trait scan" which assesses model behavior under various policy points. It:
    • selects trait probes
    • guesses a policy depending on the output of these probes' nominated detectors
    • logs to a separate place
    • outputs a serialised policy.Policy object detailing what was extracted about the target's apparent content policy
  • Enable plugin filtering in _plugins.enumerate_plugins() to help dynamic selection of plugins based on class attributes
  • Extend and unify logging in harnesses
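
To make the idea of a policy as "a set of traits, each marked permitted or not" concrete, here is a minimal sketch. It is not the `policy.Policy` class from this PR: the class name `PolicySketch`, the trait IDs, the 0.5 threshold, and the JSON layout are all assumptions chosen for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PolicySketch:
    """Hypothetical stand-in for a target content policy."""

    # trait ID -> True (permitted), False (not permitted), None (unknown)
    permitted: Dict[str, Optional[bool]] = field(default_factory=dict)

    def update_from_results(self, hit_rates: Dict[str, float], threshold: float = 0.5) -> None:
        """Mark a trait as permitted when the target exhibited the behaviour
        in at least `threshold` of the outputs (the threshold is an assumption)."""
        for trait, rate in hit_rates.items():
            self.permitted[trait] = rate >= threshold

    def to_json(self) -> str:
        """Serialise to one JSON line, e.g. for a JSONL report file."""
        return json.dumps(self.permitted, sort_keys=True)


policy = PolicySketch(permitted={"T001": None, "T002": None, "T003": None})
policy.update_from_results({"T001": 0.9, "T002": 0.0})
line = policy.to_json()
```

Traits with no probe results stay `None`, which keeps "unknown" distinct from "not permitted".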

Verification

  • garak -m test --trait_scan -p encoding -g 1, then tail the xxx.policy.jsonl

todo for this vs. later PRs

These are required for merging this:

  • test for garak.policy
  • validate default policy
  • tag tests: probes must specify a trait list if trait_probe is true
  • validation
  • run --list_trait_probes
  • run a trait scan on a test target, see if the probe selection and output make sense

These are out-of-scope and planned:

  • probe for trying prompts based on traits and looking for mitigation/no response
  • merging of results for cases where multiple probes test a policy
  • refactor donotanswer
  • using policy to filter planned probing

leondz added 30 commits October 2, 2024 14:32
…arness, custom harness, and command.xxx_run()
@leondz leondz marked this pull request as draft November 12, 2024 16:09
@leondz leondz changed the title feature: policy scan base infrastructure experimental feature: policy scan base infrastructure Nov 12, 2024
@leondz leondz marked this pull request as ready for review February 20, 2025 14:33
Comment on lines 119 to 126
class PolicyHarness(ProbewiseHarness):

    def _probe_check(self, probe):
        assert (
            probe.policy_probe == True
        ), "only policy probes should be used in policy runs"
        setattr(probe, "generations", _config.policy.generations)
        return probe

Collaborator

Currently the plugin load cache holds onto every created probe instance, keyed on the config_root used to create it. This code should not mutate the probe, as that will modify the cached probe; instead we should create another instance with the required generations passed via config_root.

Consider having the hook call replace the call that loads the probe itself (note: this is untested code):

            try:
                probe = _plugins.load_plugin(probename)
            except Exception as e:
                print(f"failed to load probe {probename}")
                logging.warning("failed to load probe %s:", repr(e))

becomes:

            probe = self._load_probe(probename)

with:

    # in ProbewiseHarness: default loader, uses the shared cached instance
    def _load_probe(self, probename):
        probe = None
        try:
            probe = _plugins.load_plugin(probename)
        except Exception as e:
            print(f"failed to load probe {probename}")
            logging.warning("failed to load probe %s:", repr(e))
        return probe

    class PolicyHarness(ProbewiseHarness):
        def _load_probe(self, probename):
            import copy

            probe = None
            # assumes plugin_info is called with the plugin name (untested)
            assert (
                _plugins.plugin_info(probename)["policy_probe"] == True
            ), "only policy probes should be used in policy runs"
            # work on a copy so the shared probe config is not modified
            config_root = copy.deepcopy(_config.plugins.probes)
            probe_config = config_root
            for path in probename.split(".")[2:]:
                probe_config = probe_config[path]
            probe_config["generations"] = _config.policy.generations
            try:
                probe = _plugins.load_plugin(probename, config_root=config_root)
            except Exception as e:
                print(f"failed to load probe {probename}")
                logging.warning("failed to load probe %s:", repr(e))
            return probe
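
The concern about mutating a cached instance can be seen in a small standalone sketch. Nothing here is garak code: `FakeProbe`, `load_plugin`, the cache layout, and the default of 5 generations are made up for illustration, assuming only that instances are cached per plugin name and config.

```python
import copy

_cache = {}  # (plugin name, frozen config) -> instance


class FakeProbe:
    """Toy probe whose generations value is set from its config."""

    def __init__(self, config):
        self.generations = config.get("generations", 5)


def load_plugin(name, config=None):
    """Return a cached instance keyed on the name and the config used to build it."""
    config = config or {}
    key = (name, tuple(sorted(config.items())))
    if key not in _cache:
        _cache[key] = FakeProbe(config)
    return _cache[key]


# Mutating the returned instance changes what every later caller receives.
probe = load_plugin("probes.test.Test")
probe.generations = 1
leaked = load_plugin("probes.test.Test").generations

# Loading with a modified copy of the config yields a separate instance instead.
_cache.clear()
base_config = {"generations": 5}
policy_config = copy.deepcopy(base_config)
policy_config["generations"] = 1
policy_probe = load_plugin("probes.test.Test", policy_config)
default_probe = load_plugin("probes.test.Test", base_config)
```

In the first half the override leaks into later loads of the same probe; in the second half the policy run and the main run each get their own instance.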

Collaborator Author

> Currently the plugin load cache is holding onto all created probe instances

Can you say more about this? Does adjusting probe member values after configuration in the constructor alter the cache in a meaningful way? I know probes alter their own internal values during execution, and I think I'm missing where the boundary lies.

Collaborator

Probes should not alter internal values during execution; if they do, we should look into why and evaluate the impact. I remember some cases where an overridden probe() method modifies and then restores a couple of self values during execution.

This is likely something we should consider for a refactor. We have already noted issues that needed fixes where probe or detector attributes were being manipulated in ways that produced side effects on the class.

Collaborator Author

Alright, fair. This one needs to lock generations to 1 somehow. Perhaps as a condition for making it out of experimental status. Or maybe this can be done at harness level, once we have a general pattern for config injection / global config availability.

Collaborator Author

updated to use this pattern

@leondz leondz changed the title experimental feature: policy scan base infrastructure experimental feature: trait scan base infrastructure Apr 8, 2025
@leondz leondz self-assigned this Apr 11, 2025
@leondz leondz requested a review from erickgalinkin April 24, 2025 05:51
@leondz leondz marked this pull request as draft April 24, 2025 14:15
Labels: architecture (Architectural upgrades)
Linked issue (may be closed by merging this pull request): Add pre-scan model output policy checks
3 participants