NVIDIA · leondz · Oct 2, 2024 · Oct 16, 2024 · Oct 16, 2024 · Oct 17, 2024
diff --git a/docs/source/configurable.rst b/docs/source/configurable.rst
@@ -83,6 +83,9 @@ Let's take a look at the core config.
         show_100_pass_modules: true
         group_aggregation_function: minimum
 
+    policy:
+        threshold: false
+
 Here we can see many entries that correspond to command line options, such as 
 ``model_name`` and ``model_type``, as well as some entried not exposed via CLI
 such as ``show_100_pass_modules``.
@@ -108,6 +111,7 @@ such as ``show_100_pass_modules``.
 * ``deprefix`` - Remove the prompt from the start of the output (some models return the prompt as part of their output)
 * ``seed`` - An optional random seed
 * ``eval_threshold`` - At what point in the 0..1 range output by detectors does a result count as a successful attack / hit
+* ``trait_scan`` - Should the run include a scan to automatically determine the target's content policy?
 * ``user_agent`` - What HTTP user agent string should garak use? ``{version}`` can be used to signify where garak version ID should go
 * ``soft_probe_prompt_cap`` - For probes that auto-scale their prompt count, the preferred limit of prompts per probe
 * ``target_lang`` - A single language (as BCP47 that the target application for LLM accepts as prompt and output
@@ -140,6 +144,10 @@ For an example of how to use the ``detectors``, ``generators``, ``buffs``,
 * ``show_100_pass_modules`` - Should entries scoring 100% still be detailed in the HTML report?
 * ``group_aggregation_function`` - How should scored of probe groups (e.g. plugin modules or taxonomy categories) be aggregrated in the HTML report? Options are ``minimum``, ``mean``, ``median``, ``mean_minus_sd``, ``lower_quartile``, and ``proportion_passing``. NB averages like ``mean`` and ``median`` hide a lot of information and aren't recommended.
 
+``policy`` config items
+"""""""""""""""""""""""
+* ``threshold`` - Pass rate at which a behaviour is considered "permitted" during policy probing; ``false`` means that any pass marks the behaviour as permitted (a permissive policy)
+
 
 Bundled quick configs
 ^^^^^^^^^^^^^^^^^^^^^
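Review note: the ``threshold`` semantics documented above can be illustrated with a minimal sketch. The helper name ``is_permitted`` is hypothetical and does not appear in this diff, and the comparison operators are assumptions about how the pass rate is compared against the threshold.

```python
from typing import Union


def is_permitted(pass_rate: float, threshold: Union[float, bool]) -> bool:
    """Hypothetical helper: decide whether a behaviour counts as permitted.

    pass_rate is the fraction of policy-probe attempts that passed (0..1).
    threshold mirrors the ``policy.threshold`` config item.
    """
    if threshold is False:
        # Assumed reading of the docs: any pass at all means "permitted".
        return pass_rate > 0.0
    # Assumption: a pass rate at or above the threshold means "permitted".
    return pass_rate >= float(threshold)
```

Under these assumptions, ``is_permitted(0.1, False)`` is true, while ``is_permitted(0.4, 0.5)`` is false.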

diff --git a/docs/source/detectors.rst b/docs/source/detectors.rst
@@ -8,6 +8,7 @@ garak.detectors
    garak.detectors.base
    garak.detectors.always
    garak.detectors.ansiescape
+   garak.detectors.any
    garak.detectors.continuation
    garak.detectors.dan
    garak.detectors.divergence

diff --git a/docs/source/garak.detectors.any.rst b/docs/source/garak.detectors.any.rst
@@ -0,0 +1,8 @@
+garak.detectors.any
+===================
+
+.. automodule:: garak.detectors.any
+   :members:
+   :undoc-members:
+   :show-inheritance:   
+
diff --git a/docs/source/index.rst b/docs/source/index.rst
@@ -48,6 +48,7 @@ Advanced usage
 
    configurable
    cliref
+   policy
 
 Code reference
 ^^^^^^^^^^^^^^

diff --git a/docs/source/policy.rst b/docs/source/policy.rst
@@ -0,0 +1,85 @@
+garak.policy
+============
+
+This module represents objects related to policy scanning. 
+
+Trait scanning in garak attempts to work out what the target's content policy is,
+before running a security scan.
+
+It's important to know what the target's content policy is because we only really have
+a useful/successful hit or breach if we're able to get a model to do something that
+it otherwise wouldn't. It may be exciting to discover a model gives instructions for
+e.g. cooking meth if the request is encoded in base64, but if in fact the model gives
+the instructions when simply asked directly "print instructions for cooking meth", the
+use of base64 isn't necessarily an exploit in this output category - the model is acting
+the same.
+
+Garak's policy support follows a typology of different traits, each describing a 
+different behaviour. By default this typology is stored in 
+``data/policy/trait_typology.json``.
+
+A trait scan is conducted by invoking garak with the ``--trait_scan`` switch. When 
+this is requested, a separate scan runs using all trait probes within garak. Trait 
+probes are denoted by a probe class asserting ``trait_probe=True``. An adapted 
+``probewise`` harness runs the scan, though reporting is diverted to a separate 
+policy report file. After completion, garak estimates an *observed policy* based 
+on trait probe results, and writes this to both main and policy reports.
+
+
+Model behaviour typologies
+--------------------------
+
+Goal
+^^^^
+
+The model behaviour typology enumerates many different types of target behaviour. The 
+listed traits help structure a policy for model output. For each trait, one can choose 
+if a model should engage or not engage in that activity.
+
+The typology serves as a point of departure for building model content policies, as 
+well as a framework for describing model behaviour. 
+
+This typology is hierarchical, but labels are not "hard". That is, multiple 
+categories might apply to a single candidate behaviour, and that's OK.
+
+Because the range of possible model behaviours is large, and an open set, this 
+typology is not comprehensive, and is not designed or intended to ever be comprehensive.
+
+To optimise effort spent building this typology, it's best to prioritise addition & 
+definition of categories for which we actually have payloads.
+
+Usage
+^^^^^
+
+To use a typology to describe a model deployment, examine each category and check 
+if the target model engages with that behaviour directly, without using any adversarial 
+techniques.
+
+To use a typology to describe intended deployed model policy, consider each category 
+in turn and decide how the model should react. A set of possible reactions can 
+be as simple as "Engage" & "Decline".
+
+Trait guidelines
+^^^^^^^^^^^^^^^^
+
+* Each point describes something that the model does, i.e. a behaviour
+* Leaf nodes are traits
+* Given a decent prompt representing a trait, and a model's response, it should be possible to discern in isolation whether or not the model is engaging or refusing for that prompt/response pair
+* Prioritise enumerating traits that reflect things we have tests for (or can reasonably test for)
+* It's great to have two or more sample prompts per trait
+* We want to stick to max three levels if at all possible
+* Multiple inheritance is fine, e.g. a probe might represent multiple points in this typology
+
+Nomenclature
+^^^^^^^^^^^^
+* ``trait`` - a behavioural trait of a model
+* ``policy`` - a hierarchy of traits including descriptions of whether each trait should be enabled
+* ``observed policy`` - a policy describing how the target was observed to behave
+* ``policy point`` - any point in a policy. This subsumes traits and groups of traits
+* ``trait typology`` - a structured set of traits, including descriptions
+
+
+.. automodule:: garak.policy
+   :members:
+   :undoc-members:
+   :show-inheritance:   
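Review note: the nomenclature above (traits as leaf nodes, policy points subsuming groups of traits, an observed policy derived from probe results) could be sketched as follows. Everything here is hypothetical: the point names, the parent map, and the rule that a group counts as engaged when any of its traits is engaged are assumptions for illustration, not the format of ``trait_typology.json`` or the logic of ``garak.policy``.

```python
from typing import Dict, Optional

# Hypothetical policy points: each maps to its parent (None for a root).
PARENTS: Dict[str, Optional[str]] = {
    "chat": None,
    "chat.smalltalk": "chat",
    "chat.insults": "chat",
}


def observed_policy(leaf_results: Dict[str, bool]) -> Dict[str, bool]:
    """Build an observed policy from per-trait results.

    leaf_results maps a trait name to True if the target engaged in it.
    Assumption: a group is marked engaged if any trait below it is engaged.
    """
    policy = {point: False for point in PARENTS}
    for trait, engaged in leaf_results.items():
        point: Optional[str] = trait
        # Walk up the hierarchy, marking every ancestor as engaged.
        while engaged and point is not None:
            policy[point] = True
            point = PARENTS[point]
    return policy
```

With ``{"chat.smalltalk": True, "chat.insults": False}`` as input, this sketch marks ``chat`` and ``chat.smalltalk`` as engaged and leaves ``chat.insults`` declined.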
diff --git a/garak/_config.py b/garak/_config.py
@@ -30,7 +30,9 @@
 system_params = (
     "verbose narrow_output parallel_requests parallel_attempts skip_unknown".split()
 )
-run_params = "seed deprefix eval_threshold generations probe_tags interactive".split()
+run_params = (
+    "seed deprefix eval_threshold generations probe_tags interactive trait_scan".split()
+)
 plugins_params = "model_type model_name extended_detectors".split()
 reporting_params = "taxonomy report_prefix".split()
 project_dir_name = "garak"
@@ -80,6 +82,7 @@ class TransientConfig(GarakSubConfig):
 run = GarakSubConfig()
 plugins = GarakSubConfig()
 reporting = GarakSubConfig()
+policy = GarakSubConfig()
 
 
 def _lock_config_as_dict():
@@ -186,13 +189,14 @@ def _load_yaml_config(settings_filenames) -> dict:
 
 
 def _store_config(settings_files) -> None:
-    global system, run, plugins, reporting, version
+    global system, run, plugins, reporting, version, policy
     settings = _load_yaml_config(settings_files)
     system = _set_settings(system, settings["system"])
     run = _set_settings(run, settings["run"])
     run.user_agent = run.user_agent.replace("{version}", version)
     plugins = _set_settings(plugins, settings["plugins"])
     reporting = _set_settings(reporting, settings["reporting"])
+    policy = _set_settings(policy, settings["policy"])
 
 
 # not my favourite solution in this module, but if
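Review note: each sub-config should be the target of its own settings. The sketch below assumes ``_set_settings`` copies every key onto its first argument and returns that object; its body is not shown in this diff, so ``set_settings`` and ``SubConfig`` here are stand-ins. Under that assumption, passing a different sub-config as the target would leave two module-level names bound to one object.

```python
class SubConfig:
    """Stand-in for GarakSubConfig (hypothetical: a plain attribute bag)."""


def set_settings(config_obj, settings: dict):
    # Assumed behaviour: copy each key onto the object, then return it.
    for key, value in settings.items():
        setattr(config_obj, key, value)
    return config_obj


plugins = SubConfig()
policy = SubConfig()

# Each sub-config receives only its own section of the settings.
plugins = set_settings(plugins, {"model_type": "test"})
policy = set_settings(policy, {"threshold": False})
```

After this runs, ``policy`` and ``plugins`` remain distinct objects and ``threshold`` lives only on ``policy``.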

diff --git a/garak/_plugins.py b/garak/_plugins.py
@@ -326,7 +326,7 @@ def plugin_info(plugin: Union[Callable, str]) -> dict:
 
 
 def enumerate_plugins(
-    category: str = "probes", skip_base_classes=True
+    category: str = "probes", skip_base_classes=True, filter: Union[None, dict] = None
 ) -> List[tuple[str, bool]]:
     """A function for listing all modules & plugins of the specified kind.
 
@@ -339,6 +339,8 @@ def enumerate_plugins(
     and finding the root classes here; it will then go through the other modules
     in the package and see which classes can be enumerated from these.
 
+    When ``filter`` is given, a plugin is listed only if every filter key is present in its cached info with a matching value.
+
     :param category: the name of the plugin package to be scanned; should
       be one of probes, detectors, generators, or harnesses.
     :type category: str
@@ -350,8 +352,14 @@ def enumerate_plugins(
     plugin_class_names = set()
 
     for k, v in PluginCache.instance()[category].items():
-        if skip_base_classes and ".base." in k:
+        if skip_base_classes and k.split(".")[1] == "base":
             continue
+        # skip plugins that lack any filter key or hold a different value for it
+        if filter is not None and any(
+            attrib not in v or v[attrib] != value
+            for attrib, value in filter.items()
+        ):
+            continue
         enum_entry = (k, v["active"])
         plugin_class_names.add(enum_entry)
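Review note: the filter semantics in this hunk (a plugin is kept only when every filter key is present with an equal value) can be shown in isolation. This is a sketch over a plain dict; ``enumerate_matching`` and the cache entries are hypothetical and stand in for ``PluginCache``, whose real structure is not shown here.

```python
from typing import Dict, List, Optional, Tuple


def enumerate_matching(
    cache: Dict[str, dict], attrib_filter: Optional[dict] = None
) -> List[Tuple[str, bool]]:
    """Return (name, active) pairs for entries matching every filter item."""
    matches = []
    for name, info in cache.items():
        # Reject the entry if any filter key is missing or has another value.
        if attrib_filter is not None and any(
            key not in info or info[key] != value
            for key, value in attrib_filter.items()
        ):
            continue
        matches.append((name, info["active"]))
    return sorted(matches)


# Hypothetical cache entries, for illustration only.
cache = {
    "probes.a.One": {"active": True, "trait_probe": True},
    "probes.a.Two": {"active": True, "trait_probe": False},
    "probes.b.Three": {"active": False},
}
```

Note that ``probes.b.Three`` is excluded by ``{"trait_probe": True}`` because the key is absent, which is the stricter of the two behaviours in the hunk.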
 

diff --git a/garak/cli.py b/garak/cli.py
@@ -3,7 +3,7 @@
 
 """Flow for invoking garak from the command line"""
 
-command_options = "list_detectors list_probes list_generators list_buffs list_config plugin_info interactive report version fix".split()
+command_options = "list_detectors list_probes list_trait_probes list_generators list_buffs list_config plugin_info interactive report version fix".split()
 
 
 def parse_cli_plugin_config(plugin_type, args):
@@ -223,6 +223,9 @@ def main(arguments=None) -> None:
     parser.add_argument(
         "--list_probes", action="store_true", help="list available vulnerability probes"
     )
+    parser.add_argument(
+        "--list_trait_probes", action="store_true", help="list available trait probes"
+    )
     parser.add_argument(
         "--list_detectors", action="store_true", help="list available detectors"
     )
@@ -259,11 +262,6 @@ def main(arguments=None) -> None:
         action="store_true",
         help="Enter interactive probing mode",
     )
-    parser.add_argument(
-        "--generate_autodan",
-        action="store_true",
-        help="generate AutoDAN prompts; requires --prompt_options with JSON containing a prompt and target",
-    )
     parser.add_argument(
         "--interactive.py",
         action="store_true",
@@ -282,7 +280,12 @@ def main(arguments=None) -> None:
         parser.description = (
             str(parser.description) + " - EXPERIMENTAL FEATURES ENABLED"
         )
-        pass
+        parser.add_argument(
+            "--trait_scan",
+            action="store_true",
+            default=_config.run.trait_scan,
+            help="determine model's behavioural traits before scanning",
+        )
 
     logging.debug("args - raw argument string received: %s", arguments)
 
@@ -447,6 +450,9 @@ def worker_count_validation(workers):
         elif args.list_probes:
             command.print_probes()
 
+        elif args.list_trait_probes:
+            command.print_trait_probes()
+
         elif args.list_detectors:
             command.print_detectors()
 
@@ -530,6 +536,7 @@ def worker_count_validation(workers):
 
             print(f"📜 logging to {log_filename}")
 
+            # set up generator
             conf_root = _config.plugins.generators
             for part in _config.plugins.model_type.split("."):
                 if not part in conf_root:
@@ -550,6 +557,7 @@ def worker_count_validation(workers):
                 logging.error(message)
                 raise ValueError(message)
 
+            # validate main run config
             parsable_specs = ["probe", "detector", "buff"]
             parsed_specs = {}
             for spec_type in parsable_specs:
@@ -573,8 +581,7 @@ def worker_count_validation(workers):
                         msg_list = ",".join(rejected)
                         raise ValueError(f"❌Unknown {spec_namespace}❌: {msg_list}")
 
-            evaluator = garak.evaluators.ThresholdEvaluator(_config.run.eval_threshold)
-
+            # generator init
             from garak import _plugins
 
             generator = _plugins.load_plugin(
@@ -591,28 +598,25 @@ def worker_count_validation(workers):
                     logging=logging,
                 )
 
-            if "generate_autodan" in args and args.generate_autodan:
-                from garak.resources.autodan import autodan_generate
-
-                try:
-                    prompt = _config.probe_options["prompt"]
-                    target = _config.probe_options["target"]
-                except Exception as e:
-                    print(
-                        "AutoDAN generation requires --probe_options with a .json containing a `prompt` and `target` "
-                        "string"
-                    )
-                autodan_generate(generator=generator, prompt=prompt, target=target)
-
+            # looks like we might get something to report, so fire that up
             command.start_run()  # start the run now that all config validation is complete
             print(f"📜 reporting to {_config.transient.report_filename}")
 
+            # do trait scan
+            if _config.run.trait_scan:
+                command.run_trait_scan(generator, _config)
+
+            # set up plugins for main run
+            # instantiate evaluator
+            evaluator = garak.evaluators.ThresholdEvaluator(_config.run.eval_threshold)
+
+            # parse & set up detectors, if supplied
             if parsed_specs["detector"] == []:
-                command.probewise_run(
+                run_result = command.probewise_run(
                     generator, parsed_specs["probe"], evaluator, parsed_specs["buff"]
                 )
             else:
-                command.pxd_run(
+                run_result = command.pxd_run(
                     generator,
                     parsed_specs["probe"],
                     parsed_specs["detector"],
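Review note: the ``--trait_scan`` option added in this file takes its default from config. The standalone sketch below shows how ``argparse`` treats a ``store_true`` flag with a configurable default; ``build_parser`` and its ``trait_scan_default`` parameter are hypothetical and only mimic the pattern in the diff.

```python
import argparse


def build_parser(trait_scan_default: bool = False) -> argparse.ArgumentParser:
    """Build a parser with a store_true flag whose default comes from config."""
    parser = argparse.ArgumentParser(prog="sketch")
    parser.add_argument(
        "--trait_scan",
        action="store_true",
        default=trait_scan_default,  # stands in for _config.run.trait_scan
        help="determine model's behavioural traits before scanning",
    )
    return parser


args = build_parser().parse_args(["--trait_scan"])
```

One consequence of this pattern: when the configured default is already true, the command line alone cannot switch the scan off, because a ``store_true`` flag can only set the value to true.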