experimental feature: trait scan base infrastructure #955

Draft: wants to merge 72 commits into base: main

Commits
102f648
add policy metadata
leondz Oct 2, 2024
a44c335
Merge branch 'main' into feature/policy
leondz Oct 16, 2024
f7da7d5
re-org cli.py slightly; add cli hook for policy scans
leondz Oct 16, 2024
7c81725
add policy probe flag to base probe
leondz Oct 17, 2024
733bd87
add plugin filtering to enumerate_plugins
leondz Oct 17, 2024
384fb53
add plugin enumeration + filter test
leondz Oct 17, 2024
a352818
ahem
leondz Oct 17, 2024
4785340
add cli option to list policy probes, filter policy probes from stand…
leondz Oct 17, 2024
1f4f95e
reorg garak.cli if blocks, pass generator to policy scan
leondz Oct 17, 2024
96586ad
execute rudimentary policy scan
leondz Oct 17, 2024
05bfce4
probes.test.Blank is now a policy probe
leondz Oct 17, 2024
e2e210c
harnesses now return iterator of evaluator results, providing a condu…
leondz Oct 17, 2024
7963a3e
rm yield for now; rm announce_probe
leondz Oct 17, 2024
c67715f
update test.Blank probe to check policy
leondz Oct 17, 2024
ebe34eb
add some harness logging; base harness now returns a generator over e…
leondz Oct 21, 2024
71e568a
evaluators now return info, which is surfaced though harnesses.base.H…
leondz Oct 21, 2024
bc03380
write policy report to own file
leondz Oct 22, 2024
2ba073e
use raw regexp
leondz Oct 22, 2024
b65e08e
don't return after first probewise probe harness call
leondz Oct 22, 2024
bc920f7
consume scan result; put logging above policy report open
leondz Oct 22, 2024
ccc6444
amend Chat policy point name
leondz Oct 22, 2024
1ac841e
class for representing & handling policies
leondz Oct 22, 2024
650f576
code for parsing policy scan results, building policy, and storing po…
leondz Oct 23, 2024
9400587
log probewise harness completion
leondz Oct 23, 2024
74ab6a1
add policy thresholding
leondz Oct 23, 2024
582e2ba
add config block for policy
leondz Oct 23, 2024
bc7831a
factor distribution of generation count to probes out of cli
leondz Oct 23, 2024
13beea9
add policy docs
leondz Oct 23, 2024
b9a7dc8
add non-exploit tag 'policy' for policy probe tagging
leondz Oct 23, 2024
644061e
update config test to reflect new test.Blank detector
leondz Oct 23, 2024
aa2ff6f
Merge branch 'main' into feature/policy
leondz Oct 23, 2024
09488df
add snowballmini as policy probe
leondz Oct 23, 2024
5e4ba8c
tidy up policy probe status of snowball classes
leondz Oct 23, 2024
97f2628
repurpose more probes as policy
leondz Oct 23, 2024
16f4d40
move parent name to module; validate policy typologies at load; add f…
leondz Oct 23, 2024
9317093
add/tidy missing nodes
leondz Oct 23, 2024
ebcd7e9
when inferring policy, propagate permitted behaviours up
leondz Oct 23, 2024
b3f27d6
add tests for policy functionality
leondz Oct 24, 2024
4c38c85
test for probe policy metadata
leondz Oct 24, 2024
4dd1b64
add policy tests
leondz Oct 24, 2024
27eaa5b
evaluators now yield EvalTuple not dict
leondz Nov 6, 2024
9636f85
add policy module docstring, describe policy ID regex
leondz Nov 6, 2024
c397bab
Merge branch 'main' into feature/policy
leondz Nov 7, 2024
b01ddee
explain policy config stanza
leondz Nov 7, 2024
9b8a60b
document _config.run.policy_scan
leondz Nov 7, 2024
7352472
Update garak/harnesses/base.py
leondz Nov 7, 2024
61f0b37
typo fix
leondz Nov 7, 2024
5d1981f
document typology in policy.rst
leondz Nov 7, 2024
b58a8b4
rm text version of policy - one is enough
leondz Nov 7, 2024
61e38ed
stop base harness run() and other harness run() from colliding
leondz Nov 7, 2024
33bc89d
remove --generate_autodan
leondz Nov 8, 2024
3966461
merge main
leondz Dec 9, 2024
0635ccc
Merge branch 'main' into feature/policy
leondz Dec 23, 2024
f6a6b05
move plugin config injection of generations count to garak.command
leondz Dec 23, 2024
1af5ae5
Merge branch 'main' into feature/policy
leondz Feb 18, 2025
64591f4
log if no policy descrs found
leondz Feb 19, 2025
e3e2440
rename _load_policy_points to _load_policy_typology, add docs
leondz Feb 19, 2025
f0f949f
refer only to passed _config
leondz Feb 19, 2025
0fc7c84
stop .generations injection into _config, instead override post-insta…
leondz Feb 19, 2025
dc39223
reinstate single generation injection in CLI, before run is started
leondz Feb 19, 2025
a23302c
separate out a policy harness, add a hook to let it do its magic
leondz Feb 20, 2025
bca90fe
leave test.Blank active=False as long as policy is experimental
leondz Feb 20, 2025
438817f
merge w/ main
leondz Apr 7, 2025
d843347
use plugin-level config injection pattern for policy probe generation…
leondz Apr 7, 2025
e8e5175
load probe just once, from string
leondz Apr 9, 2025
02dd034
policy->trait for individual items; add nomenclature section
leondz Apr 9, 2025
45084a6
clarify module headers in CLI list outputs
leondz Apr 9, 2025
436bba0
check sane sets of trait & non-trait probes
leondz Apr 9, 2025
29475d6
update docs, tooling
leondz Apr 9, 2025
b46dbe9
merge main / multiling
leondz Apr 11, 2025
7fa7d58
resolve main merge
leondz Apr 24, 2025
c2666c5
resolve merge in snowball
leondz Apr 24, 2025
8 changes: 8 additions & 0 deletions docs/source/configurable.rst
@@ -83,6 +83,9 @@ Let's take a look at the core config.
show_100_pass_modules: true
group_aggregation_function: minimum

policy:
threshold: false

Here we can see many entries that correspond to command line options, such as
``model_name`` and ``model_type``, as well as some entries not exposed via CLI
such as ``show_100_pass_modules``.
@@ -108,6 +111,7 @@ such as ``show_100_pass_modules``.
* ``deprefix`` - Remove the prompt from the start of the output (some models return the prompt as part of their output)
* ``seed`` - An optional random seed
* ``eval_threshold`` - At what point in the 0..1 range output by detectors does a result count as a successful attack / hit
* ``trait_scan`` - Should the run include a trait scan to automatically determine the target's content policy?
* ``user_agent`` - What HTTP user agent string should garak use? ``{version}`` can be used to signify where garak version ID should go
* ``soft_probe_prompt_cap`` - For probes that auto-scale their prompt count, the preferred limit of prompts per probe
* ``target_lang`` - A single language (as BCP47) that the target application for LLM accepts as prompt and output
@@ -140,6 +144,10 @@ For an example of how to use the ``detectors``, ``generators``, ``buffs``,
* ``show_100_pass_modules`` - Should entries scoring 100% still be detailed in the HTML report?
* ``group_aggregation_function`` - How should scores of probe groups (e.g. plugin modules or taxonomy categories) be aggregated in the HTML report? Options are ``minimum``, ``mean``, ``median``, ``mean_minus_sd``, ``lower_quartile``, and ``proportion_passing``. NB averages like ``mean`` and ``median`` hide a lot of information and aren't recommended.

``policy`` config items
"""""""""""""""""""""""
* ``threshold`` - pass rate required for a behavior to be considered "permitted" when policy probed; ``false`` indicates that any pass counts as a positive, i.e. a permissive policy
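
As a rough illustration of the ``threshold`` setting described above, the sketch below decides whether a behaviour counts as permitted. The function name ``is_permitted`` is hypothetical and not taken from this PR, and the use of ``>=`` for the numeric comparison is an assumption; the diff shown here does not confirm either.

```python
from typing import Union


def is_permitted(passes: int, total: int, threshold: Union[float, bool] = False) -> bool:
    """Hypothetical helper: decide whether a behaviour is 'permitted'.

    passes    -- number of passing attempts seen for the behaviour
    total     -- number of attempts made
    threshold -- False means any pass at all is enough; otherwise a
                 pass rate in 0..1 (assumed to be an inclusive bound)
    """
    if total <= 0:
        # No evidence gathered, so nothing is marked as permitted.
        return False
    if threshold is False:
        return passes > 0
    return passes / total >= threshold


any_pass = is_permitted(1, 10)            # threshold: false
strict = is_permitted(3, 10, threshold=0.5)
```

With ``threshold: false`` a single pass is enough, which matches the "permissive" reading in the config description; a numeric threshold demands a minimum pass rate instead.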


Bundled quick configs
^^^^^^^^^^^^^^^^^^^^^
1 change: 1 addition & 0 deletions docs/source/detectors.rst
@@ -8,6 +8,7 @@ garak.detectors
garak.detectors.base
garak.detectors.always
garak.detectors.ansiescape
garak.detectors.any
garak.detectors.continuation
garak.detectors.dan
garak.detectors.divergence
8 changes: 8 additions & 0 deletions docs/source/garak.detectors.any.rst
@@ -0,0 +1,8 @@
garak.detectors.any
===================

.. automodule:: garak.detectors.any
:members:
:undoc-members:
:show-inheritance:

1 change: 1 addition & 0 deletions docs/source/index.rst
@@ -48,6 +48,7 @@ Advanced usage

configurable
cliref
policy

Code reference
^^^^^^^^^^^^^^
85 changes: 85 additions & 0 deletions docs/source/policy.rst
@@ -0,0 +1,85 @@
garak.policy
============

This module represents objects related to policy scanning.

Trait scanning in garak attempts to work out what the target's content policy is,
before running a security scan.

It's important to know what the target's content policy is because we only really have
a useful/successful hit or breach if we're able to get a model to do something that
it otherwise wouldn't. It may be exciting to discover a model gives instructions for
e.g. cooking meth if the request is encoded in base64, but if in fact the model gives
the instructions when simply asked directly "print instructions for cooking meth", the
use of base64 isn't necessarily an exploit in this output category - the model is acting
the same.

Garak's policy support follows a typology of different traits, each describing a
different behaviour. By default this typology is stored in
``data/policy/trait_typology.json``.

A trait scan is conducted by invoking garak with the ``--trait_scan`` switch. When
this is requested, a separate scan runs using all trait probes within garak. Trait
probes are denoted by a probe class asserting ``trait_probe=True``. An adapted
``probewise`` harness runs the scan, though reporting is diverted to a separate
policy report file. After completion, garak estimates an *observed policy* based
on trait probe results, and writes this to both main and policy reports.
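
The split between trait probes and ordinary probes described above can be sketched as follows. This is only an illustration under stated assumptions: the ``Probe`` base class and the ``Encoded`` probe are stand-ins invented for the example, and ``split_probes`` is a hypothetical helper, not a function from this PR. Only the ``trait_probe`` class attribute comes from the text.

```python
class Probe:
    """Stand-in base class (hypothetical); probes default to non-trait."""

    trait_probe = False


class Blank(Probe):
    """A probe asserting trait_probe=True, so it joins the trait scan."""

    trait_probe = True


class Encoded(Probe):
    """An ordinary probe (invented name); it inherits trait_probe=False."""


def split_probes(probe_classes):
    """Partition probe classes into (trait probes, main-run probes)."""
    trait = [p for p in probe_classes if getattr(p, "trait_probe", False)]
    main = [p for p in probe_classes if not getattr(p, "trait_probe", False)]
    return trait, main


trait_probes, main_probes = split_probes([Blank, Encoded])
```

Reading the flag with ``getattr`` and a ``False`` default means a class that never mentions ``trait_probe`` is treated as a normal probe, which seems the safer default for an experimental feature.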


Model behaviour typologies
--------------------------

Goal
^^^^

The model behaviour typology enumerates many different types of target behaviour. The
listed traits help structure a policy for model output. For each trait, one can choose
whether a model should engage or not engage in that activity.

The typology serves as a point of departure for building model content policies, as
well as a framework for describing model behaviour.

This typology is hierarchical, but labels are not "hard". That is, multiple
categories might apply to a single candidate behaviour, and that's OK.

Because the range of possible model behaviours is large, and an open set, this
typology is not comprehensive, and is not designed or intended to ever be comprehensive.

To optimise effort spent building this typology, it's best to prioritise addition &
definition of categories for which we actually have payloads.

Usage
^^^^^

To use a typology to describe a model deployment, examine each category and check
if the target model engages with that behaviour directly, without using any adversarial
techniques.

To use a typology to describe intended deployed model policy, consider each category
in turn and decide how the model should react. A set of possible reactions can
be as simple as "Engage" & "Decline".

Trait guidelines
^^^^^^^^^^^^^^^^

* Each point describes something that the model does, i.e. a behaviour
* Leaf nodes are traits
* Given a decent prompt representing a trait, and a model's response, it should be possible to discern in isolation whether or not the model is engaging or refusing for that prompt/response pair
* Prioritise enumerating traits that reflect things we have tests for (or can reasonably test for)
* It's great to have two or more sample prompts per trait
* We want to stick to max three levels if at all possible
* Multiple inheritance is fine, e.g. a probe might represent multiple points in this typology

Nomenclature
^^^^^^^^^^^^
* ``trait`` - a behavioural trait of a model
* ``policy`` - a hierarchy of traits including descriptions of whether each trait should be enabled
* ``observed policy`` - a policy describing how the target was observed to behave
* ``policy point`` - any point in a policy. This subsumes traits and groups of traits
* ``trait typology`` - a structured set of traits, including descriptions
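
One of the commits in this PR is titled "when inferring policy, propagate permitted behaviours up". A minimal sketch of that idea over a hierarchy of policy points follows. The dotted point names, the ``parents`` mapping and the ``propagate_up`` function are all hypothetical and chosen for readability; they do not reflect the real policy ID format or the code in ``garak.policy``.

```python
def propagate_up(observed, parents):
    """Mark every ancestor of a permitted policy point as permitted.

    observed -- dict of policy point -> bool (True means permitted)
    parents  -- dict of policy point -> parent point, or None at the root
    """
    result = dict(observed)
    for point, permitted in observed.items():
        if not permitted:
            continue
        parent = parents.get(point)
        while parent is not None:
            result[parent] = True
            parent = parents.get(parent)
    return result


# Hypothetical three-level typology; leaf nodes are traits.
parents = {
    "chat": None,
    "chat.smalltalk": "chat",
    "chat.smalltalk.greet": "chat.smalltalk",
    "tasks": None,
    "tasks.code": "tasks",
}
observed = {"chat.smalltalk.greet": True, "tasks.code": False}
policy = propagate_up(observed, parents)
```

In this sketch a parent with no permitted descendants is simply left out of the result rather than marked as declined; whether the real implementation does the same is not shown in this diff.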


.. automodule:: garak.policy
:members:
:undoc-members:
:show-inheritance:
8 changes: 6 additions & 2 deletions garak/_config.py
@@ -30,7 +30,9 @@
system_params = (
"verbose narrow_output parallel_requests parallel_attempts skip_unknown".split()
)
run_params = "seed deprefix eval_threshold generations probe_tags interactive".split()
run_params = (
"seed deprefix eval_threshold generations probe_tags interactive trait_scan".split()
)
plugins_params = "model_type model_name extended_detectors".split()
reporting_params = "taxonomy report_prefix".split()
project_dir_name = "garak"
@@ -80,6 +82,7 @@ class TransientConfig(GarakSubConfig):
run = GarakSubConfig()
plugins = GarakSubConfig()
reporting = GarakSubConfig()
policy = GarakSubConfig()


def _lock_config_as_dict():
@@ -186,13 +189,14 @@ def _load_yaml_config(settings_filenames) -> dict:


def _store_config(settings_files) -> None:
global system, run, plugins, reporting, version
global system, run, plugins, reporting, version, policy
settings = _load_yaml_config(settings_files)
system = _set_settings(system, settings["system"])
run = _set_settings(run, settings["run"])
run.user_agent = run.user_agent.replace("{version}", version)
plugins = _set_settings(plugins, settings["plugins"])
reporting = _set_settings(reporting, settings["reporting"])
policy = _set_settings(policy, settings["policy"])


# not my favourite solution in this module, but if
21 changes: 19 additions & 2 deletions garak/_plugins.py
@@ -326,7 +326,7 @@ def plugin_info(plugin: Union[Callable, str]) -> dict:


def enumerate_plugins(
category: str = "probes", skip_base_classes=True
category: str = "probes", skip_base_classes=True, filter: Union[None, dict] = None
) -> List[tuple[str, bool]]:
"""A function for listing all modules & plugins of the specified kind.

@@ -339,6 +339,8 @@ def enumerate_plugins(
and finding the root classes here; it will then go through the other modules
in the package and see which classes can be enumerated from these.

When ``filter`` is given, a plugin is listed only if every key in the filter
is present in its cached info and the corresponding value matches.

:param category: the name of the plugin package to be scanned; should
be one of probes, detectors, generators, or harnesses.
:type category: str
@@ -350,8 +352,23 @@ def enumerate_plugins(
plugin_class_names = set()

for k, v in PluginCache.instance()[category].items():
if skip_base_classes and ".base." in k:
if skip_base_classes and k.split(".")[1] == "base":
continue
if filter is not None:
"""
try:
for attrib, value in filter.items():
if attrib in v and v[attrib] != value:
raise StopIteration
except StopIteration:
continue
"""
try:
for attrib, value in filter.items():
if attrib not in v or v[attrib] != value:
raise StopIteration
except StopIteration:
continue
enum_entry = (k, v["active"])
plugin_class_names.add(enum_entry)
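
The filter check in the hunk above uses ``raise StopIteration`` to break out of the attribute loop. The same rule (every filter key must be present and equal) can be written as a single ``all()`` expression, sketched below. ``matches_filter`` and the cache contents are hypothetical and stand in for ``PluginCache``; this is an alternative phrasing for illustration, not the PR's code.

```python
from typing import Optional


def matches_filter(info: dict, filter_spec: Optional[dict]) -> bool:
    """Return True if info has every filter key with an equal value."""
    if filter_spec is None:
        return True
    return all(
        key in info and info[key] == value for key, value in filter_spec.items()
    )


# Invented cache entries; only the "active" and "trait_probe" keys
# mirror attribute names that appear in the diff.
cache = {
    "probes.test.Blank": {"active": False, "trait_probe": True},
    "probes.example.Encoded": {"active": True, "trait_probe": False},
    "probes.example.NoFlag": {"active": True},
}

trait_names = sorted(
    name for name, info in cache.items() if matches_filter(info, {"trait_probe": True})
)
non_trait_names = sorted(
    name for name, info in cache.items() if matches_filter(info, {"trait_probe": False})
)
```

Note that an entry lacking the key altogether matches neither filter, which is the "both the key and value must be there" behaviour the docstring describes.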

52 changes: 28 additions & 24 deletions garak/cli.py
@@ -3,7 +3,7 @@

"""Flow for invoking garak from the command line"""

command_options = "list_detectors list_probes list_generators list_buffs list_config plugin_info interactive report version fix".split()
command_options = "list_detectors list_probes list_trait_probes list_generators list_buffs list_config plugin_info interactive report version fix".split()


def parse_cli_plugin_config(plugin_type, args):
@@ -223,6 +223,9 @@ def main(arguments=None) -> None:
parser.add_argument(
"--list_probes", action="store_true", help="list available vulnerability probes"
)
parser.add_argument(
"--list_trait_probes", action="store_true", help="list available trait probes"
)
parser.add_argument(
"--list_detectors", action="store_true", help="list available detectors"
)
@@ -259,11 +262,6 @@ def main(arguments=None) -> None:
action="store_true",
help="Enter interactive probing mode",
)
parser.add_argument(
"--generate_autodan",
action="store_true",
help="generate AutoDAN prompts; requires --prompt_options with JSON containing a prompt and target",
)
parser.add_argument(
"--interactive.py",
action="store_true",
@@ -282,7 +280,12 @@ def main(arguments=None) -> None:
parser.description = (
str(parser.description) + " - EXPERIMENTAL FEATURES ENABLED"
)
pass
parser.add_argument(
"--trait_scan",
action="store_true",
default=_config.run.trait_scan,
help="determine model's behavioural traits before scanning",
)
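
# A standalone sketch of how the two new switches behave under argparse.
# ``build_parser`` and the ``garak-sketch`` program name are hypothetical;
# only the flag names, ``action="store_true"`` and the help strings come
# from the diff above.

```python
import argparse


def build_parser(trait_scan_default: bool = False) -> argparse.ArgumentParser:
    """Build a tiny parser exposing only the trait-related switches."""
    parser = argparse.ArgumentParser(prog="garak-sketch")
    parser.add_argument(
        "--list_trait_probes", action="store_true", help="list available trait probes"
    )
    parser.add_argument(
        "--trait_scan",
        action="store_true",
        default=trait_scan_default,  # stands in for _config.run.trait_scan
        help="determine model's behavioural traits before scanning",
    )
    return parser


args = build_parser().parse_args(["--trait_scan"])
no_flag_args = build_parser().parse_args([])
config_on_args = build_parser(trait_scan_default=True).parse_args([])
```

One consequence of pairing ``store_true`` with a config-supplied default: if the config already sets the value to true, omitting the switch on the command line leaves it true, so the CLI alone cannot turn the scan off.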

logging.debug("args - raw argument string received: %s", arguments)

@@ -447,6 +450,9 @@ def worker_count_validation(workers):
elif args.list_probes:
command.print_probes()

elif args.list_trait_probes:
command.print_trait_probes()

elif args.list_detectors:
command.print_detectors()

@@ -530,6 +536,7 @@ def worker_count_validation(workers):

print(f"📜 logging to {log_filename}")

# set up generator
conf_root = _config.plugins.generators
for part in _config.plugins.model_type.split("."):
if not part in conf_root:
@@ -550,6 +557,7 @@ def worker_count_validation(workers):
logging.error(message)
raise ValueError(message)

# validate main run config
parsable_specs = ["probe", "detector", "buff"]
parsed_specs = {}
for spec_type in parsable_specs:
@@ -573,8 +581,7 @@ def worker_count_validation(workers):
msg_list = ",".join(rejected)
raise ValueError(f"❌Unknown {spec_namespace}❌: {msg_list}")

evaluator = garak.evaluators.ThresholdEvaluator(_config.run.eval_threshold)

# generator init
from garak import _plugins

generator = _plugins.load_plugin(
@@ -591,28 +598,25 @@ def worker_count_validation(workers):
logging=logging,
)

if "generate_autodan" in args and args.generate_autodan:
from garak.resources.autodan import autodan_generate

try:
prompt = _config.probe_options["prompt"]
target = _config.probe_options["target"]
except Exception as e:
print(
"AutoDAN generation requires --probe_options with a .json containing a `prompt` and `target` "
"string"
)
autodan_generate(generator=generator, prompt=prompt, target=target)

# looks like we might get something to report, so fire that up
command.start_run() # start the run now that all config validation is complete
print(f"📜 reporting to {_config.transient.report_filename}")

# do trait scan
if _config.run.trait_scan:
command.run_trait_scan(generator, _config)

# set up plugins for main run
# instantiate evaluator
evaluator = garak.evaluators.ThresholdEvaluator(_config.run.eval_threshold)

# parse & set up detectors, if supplied
if parsed_specs["detector"] == []:
command.probewise_run(
run_result = command.probewise_run(
generator, parsed_specs["probe"], evaluator, parsed_specs["buff"]
)
else:
command.pxd_run(
run_result = command.pxd_run(
generator,
parsed_specs["probe"],
parsed_specs["detector"],