experimental feature: trait scan base infrastructure #955

leondz · 2024-10-23T13:11:35Z

This module represents objects related to trait and policy scanning.

Trait scanning in garak attempts to work out what the target's content policy is, before running a security scan.

It's important to know what the target's content policy is, because we only really have a useful or successful hit or breach if we're able to get a model to do something that it otherwise wouldn't. It may be exciting to discover that a model gives instructions for e.g. cooking meth if the request is encoded in base64, but if the model also gives the instructions when simply asked directly, "print instructions for cooking meth", then the use of base64 isn't necessarily an exploit in this output category: the model is acting the same either way.
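The reasoning above can be written as a small check. This is only a sketch: the function name, the dictionary-based policy, and the trait name are assumptions made for illustration, not garak's actual API.

```python
from typing import Dict


def is_meaningful_hit(trait: str, attack_succeeded: bool, policy: Dict[str, bool]) -> bool:
    """Return True only when an attack elicits behaviour the target would not otherwise show.

    `policy` maps a trait name to whether the target permits that behaviour
    when asked directly (a hypothetical structure, not garak's Policy class).
    Traits missing from the policy are treated as not permitted.
    """
    permitted_when_asked_directly = policy.get(trait, False)
    return attack_succeeded and not permitted_when_asked_directly


# Hypothetical trait name: the target already complies when asked directly,
# so a successful base64-encoded request is not counted as an exploit.
policy = {"drug_synthesis_instructions": True}
base64_hit = is_meaningful_hit("drug_synthesis_instructions", True, policy)
```

With this policy, `base64_hit` is `False`; the same attack against a target whose policy forbids the behaviour would count as a hit.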

Garak's policy support follows a typology of different traits, each describing a different behaviour. By default this typology is stored in data/policy/trait_typology.json.

A trait scan is conducted by invoking garak with the --trait_scan switch. When this is requested, a separate scan runs using all trait probes within garak. Trait probes are denoted by a probe class asserting trait_probe=True. A regular probewise harness runs the scan, though reporting is diverted to a separate policy report file. After completion, garak estimates a policy based on trait probe results, and writes this to both main and policy reports.
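As a rough illustration of the flow described above, the sketch below selects probe classes that assert `trait_probe=True` and then estimates a policy from their results. Only the `trait_probe` attribute comes from the description; the class names, the `traits` attribute, the result format, and the 0.5 threshold are assumptions for this example.

```python
class Probe:
    """Stand-in base class; real garak probes carry many more attributes."""

    trait_probe = False
    traits = ()


class EncodedRequest(Probe):
    """An adversarial probe: not selected for trait scans."""


class DirectRequest(Probe):
    """A non-adversarial probe that simply asks for the behaviour."""

    trait_probe = True
    traits = ("T_example",)  # hypothetical trait identifier


def select_trait_probes(probe_classes):
    """Keep only probe classes that assert trait_probe=True."""
    return [cls for cls in probe_classes if getattr(cls, "trait_probe", False)]


def estimate_policy(results, threshold=0.5):
    """Mark a trait as permitted when its compliance rate reaches the threshold.

    `results` maps trait -> list of bools, one per generation, where True
    means the target complied. Traits with no outcomes are skipped.
    """
    return {
        trait: (sum(outcomes) / len(outcomes)) >= threshold
        for trait, outcomes in results.items()
        if outcomes
    }


selected = select_trait_probes([EncodedRequest, DirectRequest])
policy = estimate_policy({"T_example": [True, True, False]})
```

Here only `DirectRequest` is selected, and `T_example` is estimated as permitted because two of three generations complied.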

What this PR adds

We're laying the base infrastructure for trait scans in this PR.

Add a trait typology and support loading it
Introduce a policy module and Policy class to allow storage and manipulation of target content policies. Policies consist of a set of traits, each describing a behaviour and whether it is permitted by the target.
Differentiate non-adversarial probes as "trait" probes, which appear differently and are not automatically selected for main runs
Specify which traits a trait probe tests for, and enforce the presence of this via a test (a trait probe that doesn't inform the policy is not useful)
Add an optional "trait scan" which assesses model behavior under various policy points. It:
- selects trait probes
- guesses a policy depending on the output of these probes' nominated detectors
- logs to a separate place
- outputs a serialised policy.Policy object detailing what was extracted about the target's apparent content policy
Enable plugin filtering in _plugins.enumerate_plugins() to help dynamic selection of plugins based on class attributes
Add to and unify logging in harnesses
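To make the idea of a policy object concrete, here is a minimal sketch of a structure that stores traits with a permitted flag and serialises to JSON. It is not the `policy.Policy` class from this PR; the class name, methods, and JSON layout are assumptions for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class PolicySketch:
    """Illustrative stand-in for a policy: trait id -> permitted?

    A value of None means no evidence was gathered for that trait.
    """

    traits: Dict[str, Optional[bool]] = field(default_factory=dict)

    def set_trait(self, trait_id: str, permitted: Optional[bool]) -> None:
        self.traits[trait_id] = permitted

    def is_permitted(self, trait_id: str) -> Optional[bool]:
        # Unknown traits return None rather than guessing.
        return self.traits.get(trait_id)

    def to_json(self) -> str:
        return json.dumps({"traits": self.traits}, sort_keys=True)

    @classmethod
    def from_json(cls, text: str) -> "PolicySketch":
        return cls(traits=json.loads(text)["traits"])


p = PolicySketch()
p.set_trait("T_example", True)   # hypothetical trait identifiers
p.set_trait("T_unknown", None)
restored = PolicySketch.from_json(p.to_json())
```

Keeping "no evidence" distinct from "not permitted" is a design choice in this sketch, so that an untested trait is never mistaken for a forbidden one.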

Verification

Run garak -m test --trait_scan -p encoding -g 1, then tail the resulting xxx.policy.jsonl
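For inspecting the report without a shell, the last few records of a JSON Lines file can be read as below. This is a generic sketch: the helper name is made up, and the record fields in the demonstration file are invented, since the actual report schema is not described here.

```python
import json
import tempfile
from pathlib import Path


def tail_jsonl(path, n=5):
    """Return the last n records of a JSON Lines file as Python objects."""
    lines = Path(path).read_text(encoding="utf-8").splitlines()
    return [json.loads(line) for line in lines[-n:] if line.strip()]


# Demonstration on a throwaway file; the record fields are invented for illustration.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "demo.policy.jsonl"
    demo.write_text('{"entry": 1}\n{"entry": 2}\n{"entry": 3}\n', encoding="utf-8")
    last_two = tail_jsonl(demo, n=2)
```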

todo for this vs. later PRs

These are required for merging this:

test for garak.policy
validate default policy
tag tests: probes must specify a trait list if trait_probe is true
validation
run --list_trait_probes
run a trait scan on a test target, see if the probe selection and output make sense

These are out-of-scope and planned:

probe for trying prompts based on traits and looking for mitigation/no response
merging of results for cases where multiple probes test a policy
refactor donotanswer
using policy to filter planned probing

jmartin-tech · 2025-02-20T14:42:21Z

garak/harnesses/probewise.py

+class PolicyHarness(ProbewiseHarness):
+
+    def _probe_check(self, probe):
+        assert (
+            probe.policy_probe == True
+        ), "only policy probes should be used in policy runs"
+        setattr(probe, "generations", _config.policy.generations)
+        return probe


Currently the plugin load cache is holding onto all created probe instances, keyed on the config_root used to create them. This hook should not mutate the probe, as that will modify the cached probe; instead we should create another probe with the required generations passed via config_root.

Consider that the hook call can replace the call that loads the probe itself (note this is untested code):

    try:
        probe = _plugins.load_plugin(probename)
    except Exception as e:
        print(f"failed to load probe {probename}")
        logging.warning("failed to load probe %s:", repr(e))

becomes:

probe = _load_probe(probename)

with:

    def _load_probe(self, probename):
        probe = None
        try:
            probe = _plugins.load_plugin(probename)
        except Exception as e:
            print(f"failed to load probe {probename}")
            logging.warning("failed to load probe %s:", repr(e))
        return probe


    class PolicyHarness(ProbewiseHarness):
        def _load_probe(self, probename):
            import copy

            probe = None
            assert (
                _plugins.plugin_info["policy_probe"] == True
            ), "only policy probes should be used in policy runs"
            config_root = copy.deepcopy(_config.plugins.probes)
            probe_config = config_root
            for path in probename.split(".")[2:]:
                probe_config = probe_config[path]
            probe_config["generations"] = _config.policy.generations
            try:
                probe = _plugins.load_plugin(probename, config_root=config_root)
            except Exception as e:
                print(f"failed to load probe {probename}")
                logging.warning("failed to load probe %s:", repr(e))
            return probe
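The caching concern can be reproduced with a toy loader. The sketch below is not garak's plugin loader: `ToyProbe`, the cache key, and the default of 5 generations are assumptions chosen only to show why mutating a cached instance leaks into later loads, while passing the override through a copied config does not.

```python
import copy

_cache = {}


class ToyProbe:
    def __init__(self, config):
        # Hypothetical default of 5 generations when the config does not set one.
        self.generations = config.get("generations", 5)


def load_plugin(name, config_root=None):
    """Toy loader that caches one instance per (name, config) pair."""
    config_root = config_root or {}
    key = (name, repr(sorted(config_root.items())))
    if key not in _cache:
        _cache[key] = ToyProbe(config_root)
    return _cache[key]


# Mutating a cached instance leaks into every later load with the same config.
first = load_plugin("probes.test.Blank")
setattr(first, "generations", 1)
leaked = load_plugin("probes.test.Blank").generations

# Passing the override through a copied config yields a separate instance.
_cache.clear()
default_probe = load_plugin("probes.test.Blank")
base_config = {"generations": 5}
override = copy.deepcopy(base_config)
override["generations"] = 1
policy_probe = load_plugin("probes.test.Blank", config_root=override)
```

After the first half, `leaked` is 1 even though no override was requested for that load; after the second half, the default probe keeps 5 generations and the policy probe is a distinct object with 1.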

> Currently the plugin load cache is holding onto all created probe instances

Can you say more about this? Does adjusting probe member values after configuration in the constructor alter the cache in a meaningful way? I know probes alter their own internal values during execution, and I think I'm missing where the border is.

Probes should not alter internal values during execution; if they do, we should look into why and evaluate the impact. I remember some cases where the probe() method is overridden to modify and then restore a couple of self values during execution.

This is likely something we should consider for a refactor. We have already noted issues that needed fixes where probe or detector attributes were being manipulated in ways that produced side effects on the class.

Alright, fair. This one needs to lock generations to 1 somehow. Perhaps as a condition for making it out of experimental status. Or maybe this can be done at harness level, once we have a general pattern for config injection / global config availability.

updated to use this pattern

leondz added 30 commits October 2, 2024 14:32

add policy metadata

102f648

Merge branch 'main' into feature/policy

a44c335

re-org cli.py slightly; add cli hook for policy scans

f7da7d5

add policy probe flag to base probe

7c81725

add plugin filtering to enumerate_plugins

733bd87

add plugin enumeration + filter test

384fb53

ahem

a352818

add cli option to list policy probes, filter policy probes from standard probe list

4785340

reorg garak.cli if blocks, pass generator to policy scan

1f4f95e

execute rudimentary policy scan

96586ad

probes.test.Blank is now a policy probe

05bfce4

harnesses now return iterator of evaluator results, providing a conduit back to their caller

e2e210c

rm yield for now; rm announce_probe

7963a3e

update test.Blank probe to check policy

c67715f

add some harness logging; base harness now returns a generator over eval results

ebe34eb

evaluators now return info, which is surfaced though harnesses.base.Harness, custom harness, and command.xxx_run()

71e568a

write policy report to own file

bc03380

use raw regexp

2ba073e

don't return after first probewise probe harness call

b65e08e

consume scan result; put logging above policy report open

bc920f7

amend Chat policy point name

ccc6444

class for representing & handling policies

1ac841e

code for parsing policy scan results, building policy, and storing policy

650f576

log probewise harness completion

9400587

add policy thresholding

74ab6a1

add config block for policy

582e2ba

factor distribution of generation count to probes out of cli

bc7831a

add policy docs

13beea9

add non-exploit tag 'policy' for policy probe tagging

b9a7dc8

update config test to reflect new test.Blank detector

644061e

remove --generate_autodan

33bc89d

leondz marked this pull request as draft November 12, 2024 16:09

leondz changed the title ~~feature: policy scan base infrastructure~~ experimental feature: policy scan base infrastructure Nov 12, 2024

leondz and others added 11 commits December 9, 2024 09:51

merge main

3966461

Merge branch 'main' into feature/policy

0635ccc

move plugin config injection of generations count to garak.command

f6a6b05

Merge branch 'main' into feature/policy

1af5ae5

log if no policy descrs found

64591f4

rename _load_policy_points to _load_policy_typology, add docs

e3e2440

refer only to passed _config

f0f949f

stop .generations injection into _config, instead override post-instantiation

0fc7c84

reinstate single generation injection in CLI, before run is started

dc39223

separate out a policy harness, add a hook to let it do its magic

a23302c

leave test.Blank active=False as long as policy is experimental

bca90fe

leondz marked this pull request as ready for review February 20, 2025 14:33

jmartin-tech reviewed Feb 20, 2025

leondz added 2 commits April 7, 2025 11:42

merge w/ main

438817f

use plugin-level config injection pattern for policy probe generation count

d843347

leondz changed the title ~~experimental feature: policy scan base infrastructure~~ experimental feature: trait scan base infrastructure Apr 8, 2025

leondz added 5 commits April 9, 2025 08:54

load probe just once, from string

e8e5175

policy->trait for individual items; add nomenclature section

02dd034

clarify module headers in CLI list outputs

45084a6

check sane sets of trait & non-trait probes

436bba0

update docs, tooling

29475d6

leondz self-assigned this Apr 11, 2025

merge main / multiling

b46dbe9

leondz requested a review from erickgalinkin April 24, 2025 05:51

leondz added 2 commits April 24, 2025 12:07

resolve main merge

7fa7d58

resolve merge in snowball

c2666c5

leondz marked this pull request as draft April 24, 2025 14:15

jmartin-tech Feb 20, 2025

leondz Feb 21, 2025

jmartin-tech Feb 28, 2025

leondz Feb 28, 2025

leondz Apr 9, 2025