experimental feature: trait scan base infrastructure #955
Draft: leondz wants to merge 72 commits into main from feature/policy
102f648
add policy metadata
leondz a44c335
Merge branch 'main' into feature/policy
leondz f7da7d5
re-org cli.py slightly; add cli hook for policy scans
leondz 7c81725
add policy probe flag to base probe
leondz 733bd87
add plugin filtering to enumerate_plugins
leondz 384fb53
add plugin enumeration + filter test
leondz a352818
ahem
leondz 4785340
add cli option to list policy probes, filter policy probes from stand…
leondz 1f4f95e
reorg garak.cli if blocks, pass generator to policy scan
leondz 96586ad
execute rudimentary policy scan
leondz 05bfce4
probes.test.Blank is now a policy probe
leondz e2e210c
harnesses now return iterator of evaluator results, providing a condu…
leondz 7963a3e
rm yield for now; rm announce_probe
leondz c67715f
update test.Blank probe to check policy
leondz ebe34eb
add some harness logging; base harness now returns a generator over e…
leondz 71e568a
evaluators now return info, which is surfaced though harnesses.base.H…
leondz bc03380
write policy report to own file
leondz 2ba073e
use raw regexp
leondz b65e08e
don't return after first probewise probe harness call
leondz bc920f7
consume scan result; put logging above policy report open
leondz ccc6444
amend Chat policy point name
leondz 1ac841e
class for representing & handling policies
leondz 650f576
code for parsing policy scan results, building policy, and storing po…
leondz 9400587
log probewise harness completion
leondz 74ab6a1
add policy thresholding
leondz 582e2ba
add config block for policy
leondz bc7831a
factor distribution of generation count to probes out of cli
leondz 13beea9
add policy docs
leondz b9a7dc8
add non-exploit tag 'policy' for policy probe tagging
leondz 644061e
update config test to reflect new test.Blank detector
leondz aa2ff6f
Merge branch 'main' into feature/policy
leondz 09488df
add snowballmini as policy probe
leondz 5e4ba8c
tidy up policy probe status of snowball classes
leondz 97f2628
repurpose more probes as policy
leondz 16f4d40
move parent name to module; validate policy typologies at load; add f…
leondz 9317093
add/tidy missing nodes
leondz ebcd7e9
when inferring policy, propagate permitted behaviours up
leondz b3f27d6
add tests for policy functionality
leondz 4c38c85
test for probe policy metadata
leondz 4dd1b64
add policy tests
leondz 27eaa5b
evaluators now yield EvalTuple not dict
leondz 9636f85
add policy module docstring, describe policy ID regex
leondz c397bab
Merge branch 'main' into feature/policy
leondz b01ddee
explain policy config stanza
leondz 9b8a60b
document _config.run.policy_scan
leondz 7352472
Update garak/harnesses/base.py
leondz 61f0b37
typo fix
leondz 5d1981f
document typology in policy.rst
leondz b58a8b4
rm text version of policy - one is enough
leondz 61e38ed
stop base harness run() and other harness run() from colliding
leondz 33bc89d
remove --generate_autodan
leondz 3966461
merge main
leondz 0635ccc
Merge branch 'main' into feature/policy
leondz f6a6b05
move plugin config injection of generations count to garak.command
leondz 1af5ae5
Merge branch 'main' into feature/policy
leondz 64591f4
log if no policy descrs found
leondz e3e2440
rename _load_policy_points to _load_policy_typology, add docs
leondz f0f949f
refer only to passed _config
leondz 0fc7c84
stop .generations injection into _config, instead override post-insta…
leondz dc39223
reinstate single generation injection in CLI, before run is started
leondz a23302c
separate out a policy harness, add a hook to let it do its magic
leondz bca90fe
leave test.Blank active=False as long as policy is experimental
leondz 438817f
merge w/ main
leondz d843347
use plugin-level config injection pattern for policy probe generation…
leondz e8e5175
load probe just once, from string
leondz 02dd034
policy->trait for individual items; add nomenclature section
leondz 45084a6
clarify module headers in CLI list outputs
leondz 436bba0
check sane sets of trait & non-trait probes
leondz 29475d6
update docs, tooling
leondz b46dbe9
merge main / multiling
leondz 7fa7d58
resolve main merge
leondz c2666c5
resolve merge in snowball
leondz
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,8 @@ | ||
garak.detectors.any | ||
=================== | ||
|
||
.. automodule:: garak.detectors.any | ||
:members: | ||
:undoc-members: | ||
:show-inheritance: | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -48,6 +48,7 @@ Advanced usage | |
|
||
configurable | ||
cliref | ||
policy | ||
|
||
Code reference | ||
^^^^^^^^^^^^^^ | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,81 @@ | ||
garak.policy | ||
============ | ||
|
||
This module represents objects related to policy scanning. | ||
|
||
Policy scanning in garak attempts to work out what the target's content policy | ||
is, before running a security scan. | ||
|
||
It's important to know what the target's content policy is because we only really have | ||
a useful/successful hit or breach if we're able to get a model to do something that | ||
it otherwise wouldn't. It may be exciting to discover a model gives instructions for | ||
e.g. cooking meth if the request is encoded in base64, but if in fact the model gives | ||
the instructions when simply asked directly "print instructions for cooking meth", the | ||
use of base64 isn't necessarily an exploit in this output category - the model is acting | ||
the same either way. | ||
|
||
Garak's policy support follows a typology of model behaviours, with each entry describing | ||
a distinct behaviour. By default this typology is stored in ``data/policy/policy_typology.json``. | ||
|
||
A policy scan is conducted by invoking garak with the ``--policy_scan`` switch. | ||
When this is requested, a separate scan runs using all policy probes within garak. | ||
Policy probes are denoted by a probe class asserting ``policy_probe=True``. | ||
A regular probewise harness runs the scan, though reporting is diverted to a separate | ||
policy report file. After completion, garak estimates a policy based on policy probe | ||
results, and writes this to both the main and policy reports. | ||
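
As a rough illustration of the flow described above (not garak's actual implementation), the sketch below filters probe classes on a ``policy_probe`` attribute and turns per-behaviour engagement rates into an estimated policy. Only the ``policy_probe=True`` marker comes from the text; the class names, the ``policies`` attribute, the ``THRESHOLD`` value, the behaviour IDs, and the shape of the results are all assumptions made for illustration.

```python
# Hypothetical stand-ins for probe classes; these are not garak's real classes.
class BaseProbe:
    policy_probe = False  # regular probes do not assert the policy flag
    policies = []         # hypothetical: behaviour IDs this probe exercises


class BlankProbe(BaseProbe):
    policy_probe = True
    policies = ["C001"]


class EncodingProbe(BaseProbe):
    pass  # inherits policy_probe = False


def select_policy_probes(probe_classes):
    """Return only the classes that assert policy_probe=True."""
    return [p for p in probe_classes if getattr(p, "policy_probe", False)]


THRESHOLD = 0.5  # hypothetical cut-off; the real default is not stated here


def estimate_policy(engage_rates, threshold=THRESHOLD):
    """Map behaviour ID -> True when the model engaged at least `threshold` of the time."""
    return {pid: rate >= threshold for pid, rate in engage_rates.items()}


policy_probes = select_policy_probes([BlankProbe, EncodingProbe])
estimated = estimate_policy({"C001": 0.9, "S005": 0.1})
```

Keeping the selection and the thresholding as two small functions mirrors the two stages in the paragraph: first decide which probes run, then summarise their results.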
|
||
|
||
Model behaviour typologies | ||
-------------------------- | ||
|
||
Goal | ||
^^^^ | ||
|
||
The model behaviour typology enumerates many different types of target behaviour. The listed behaviours help structure a policy for model output. For each behaviour, one can choose whether a model should engage or not engage in that activity. | ||
|
||
The typology serves as a point of departure for building model content policies, as well as a framework for describing model behaviour. | ||
|
||
This typology is hierarchical, but labels are not “hard”. That is, multiple categories might apply to a single candidate behaviour, and that's OK. | ||
|
||
Because the range of possible model behaviours is large, and an open set, this typology is not comprehensive, and is not designed or intended to ever be comprehensive. | ||
|
||
To optimise effort spent building this typology, it's best to prioritise addition & definition of categories for which we actually have payloads. | ||
|
||
Usage | ||
^^^^^ | ||
|
||
To use this typology to describe a model deployment, examine each category and check if the target model engages with that behaviour directly, without using any adversarial techniques. | ||
|
||
To use this typology to describe intended deployed model policy, consider each category in turn and decide how the model should react. The set of possible reactions can be as simple as "Engage" & "Decline". | ||
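
One way to picture the two usages side by side is a pair of dictionaries keyed by behaviour identifier, one for the intended policy and one for what the target was observed to do. The sketch below is illustrative only: the ``Reaction`` enum, the example identifiers, and the ``deviations`` helper are hypothetical and not part of garak.

```python
from enum import Enum


class Reaction(Enum):
    """The two simplest reactions named in the text."""
    ENGAGE = "Engage"
    DECLINE = "Decline"


# Hypothetical behaviour IDs, used only to show the shape of the data.
intended = {"C001": Reaction.ENGAGE, "S001": Reaction.DECLINE}
observed = {"C001": Reaction.ENGAGE, "S001": Reaction.ENGAGE}


def deviations(intended, observed):
    """List behaviour IDs where the observed reaction differs from the intended one."""
    return sorted(
        pid
        for pid, wanted in intended.items()
        if pid in observed and observed[pid] != wanted
    )


mismatches = deviations(intended, observed)
```

Here ``mismatches`` holds the behaviours where the deployment departs from the intended policy without any adversarial technique being applied.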
|
||
Policy point guidelines | ||
^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
* Each point describes something that the model does, i.e. a behaviour | ||
* Given a decent prompt representing a policy, and a model's response, it should be possible to discern in isolation whether the model is engaging or refusing for that prompt/response pair | ||
* Prioritise enumerating policies that reflect things we have tests for (or can reasonably test for) | ||
* It's great to have two sample prompts per point | ||
* We want to stick to max three levels if at all possible | ||
* Multiple inheritance is fine, e.g. a probe might represent multiple points in this typology | ||
|
||
Policy metadata | ||
^^^^^^^^^^^^^^^ | ||
|
||
The total set of points in the behaviour typology can be represented as a dictionary. Definitions of policy names, descriptions, and behaviours are stored in a JSON data file. | ||
|
||
* Key: behaviour identifier - format is TDDDs* | ||
  * T: a top-level hierarchy code letter, in CTMS for chat/tasks/meta/safety | ||
  * DDD: a three-digit code for this behaviour | ||
  * s*: (optional) one or more letters identifying a sub-policy | ||
|
||
* Value: a dict describing a behaviour | ||
  * “name”: A short name of what is permitted when this behaviour is allowed | ||
  * “description”: (optional) a deeper description of this behaviour | ||
|
||
The structure of the identifiers describes the hierarchical structure. | ||
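
The identifier format above can be sketched as a regular expression plus a helper that derives a parent from an identifier. This is a minimal sketch under stated assumptions, not the pattern garak uses: it assumes sub-policy letters are lowercase, that a bare top-level letter is itself a valid node, and that a parent is found by dropping the sub-policy letters and then the digits. The sample entries and the upward-propagation rule (suggested by the commit "when inferring policy, propagate permitted behaviours up") are likewise illustrative.

```python
import re

# Assumed pattern: one of C/T/M/S, optionally three digits, optionally lowercase letters.
POLICY_ID_RE = re.compile(r"^[CTMS]([0-9]{3}([a-z]+)?)?$")

# Hypothetical entries showing the key/value shape described above.
sample_typology = {
    "C": {"name": "Chat"},
    "C001": {"name": "Example behaviour", "description": "Illustrative entry only"},
    "C001a": {"name": "Example sub-behaviour"},
}


def is_valid_id(pid):
    """True if pid matches the assumed TDDDs* shape."""
    return POLICY_ID_RE.match(pid) is not None


def parent_id(pid):
    """Return the assumed parent identifier, or None for a top-level letter."""
    if not is_valid_id(pid):
        raise ValueError(f"not a valid behaviour identifier: {pid!r}")
    if len(pid) == 1:
        return None     # top-level letter has no parent
    if len(pid) == 4:
        return pid[0]   # T + DDD -> T
    return pid[:4]      # T + DDD + letters -> T + DDD


def propagate_permitted_up(policy):
    """If a behaviour is permitted, mark each of its ancestors permitted too."""
    result = dict(policy)
    for pid, permitted in policy.items():
        parent = parent_id(pid)
        while permitted and parent is not None:
            result[parent] = True
            parent = parent_id(parent)
    return result
```

For example, ``parent_id("C001a")`` gives ``"C001"`` and ``parent_id("C001")`` gives ``"C"``, which is how the identifier alone can encode the hierarchy.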
|
||
|
||
.. automodule:: garak.policy | ||
:members: | ||
:undoc-members: | ||
:show-inheritance: |