experimental feature: trait scan base infrastructure #955
base: main
…it back to their caller
…arness, custom harness, and command.xxx_run()
garak/harnesses/probewise.py
Outdated
class PolicyHarness(ProbewiseHarness):

    def _probe_check(self, probe):
        assert (
            probe.policy_probe == True
        ), "only policy probes should be used in policy runs"
        setattr(probe, "generations", _config.policy.generations)
        return probe
Currently the plugin load cache is holding onto all created probe instances, keyed on the config_root used to create them. This should not mutate the probe, as that will modify the cached probe; we should instead create another with the required generations passed via config_root.
Consider the hook call can replace the call to load the probe itself (note this is untested code):
try:
    probe = _plugins.load_plugin(probename)
except Exception as e:
    print(f"failed to load probe {probename}")
    logging.warning("failed to load probe %s:", repr(e))
becomes:
probe = self._load_probe(probename)
with:
def _load_probe(self, probename):
    probe = None
    try:
        probe = _plugins.load_plugin(probename)
    except Exception as e:
        print(f"failed to load probe {probename}")
        logging.warning("failed to load probe %s:", repr(e))
    return probe

class PolicyHarness(ProbewiseHarness):
    def _load_probe(self, probename):
        import copy

        probe = None
        # plugin_info is called with the plugin name and returns its metadata
        assert (
            _plugins.plugin_info(probename)["policy_probe"] == True
        ), "only policy probes should be used in policy runs"
        # work on a deep copy so the shared probe configuration stays untouched
        config_root = copy.deepcopy(_config.plugins.probes)
        probe_config = config_root
        for path in probename.split(".")[2:]:
            probe_config = probe_config[path]
        probe_config["generations"] = _config.policy.generations
        try:
            probe = _plugins.load_plugin(probename, config_root=config_root)
        except Exception as e:
            print(f"failed to load probe {probename}")
            logging.warning("failed to load probe %s:", repr(e))
        return probe
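To make the concern about the cache concrete, here is a minimal toy sketch of the aliasing problem. `ToyProbe`, `load_cached`, and the cache keying are hypothetical stand-ins invented for illustration; they are not garak's actual cache or `load_plugin` implementation. The sketch assumes only what the comment states: instances are cached per configuration, so mutating a returned instance affects later callers, whereas passing a modified copy of the config yields a separate instance.

```python
import copy


class ToyProbe:
    """Hypothetical stand-in for a probe; not a garak class."""

    def __init__(self, config):
        self.generations = config.get("generations", 5)


_cache = {}  # hypothetical load cache keyed on plugin name and config


def load_cached(name, config):
    """Return one shared instance per (name, config) pair."""
    key = (name, repr(sorted(config.items())))
    if key not in _cache:
        _cache[key] = ToyProbe(config)
    return _cache[key]


base_config = {"generations": 5}

# Mutating the returned instance changes what every later caller sees.
first = load_cached("toy.Probe", base_config)
first.generations = 1
second = load_cached("toy.Probe", base_config)
mutation_leaked = second.generations == 1

# Passing a modified copy of the config creates a separate cache entry.
_cache.clear()
regular = load_cached("toy.Probe", base_config)
policy_config = copy.deepcopy(base_config)
policy_config["generations"] = 1
policy = load_cached("toy.Probe", policy_config)
isolated = regular.generations == 5 and policy.generations == 1
```

In this toy, `mutation_leaked` and `isolated` both end up true, which is the behaviour the suggested `_load_probe` override relies on.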
> Currently the plugin load cache is holding onto all created probe instances
Can you say more about this? Does adjusting probe member values after configuration in the constructor alter the cache in a meaningful way? I know probes alter their own internal values during execution, and I think I'm missing where the border is.
Probes should not alter internal values during execution; if they do, we should look into why and evaluate the impact. I remember some cases where the probe() method is overridden to modify and then restore a couple of self values during execution.
This is likely something we should consider for a refactor; we have already noted issues that needed fixes where probe or detector attributes were being manipulated in ways that produced side effects on the class.
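Where a modify-and-restore pattern cannot be removed straight away, one option is to confine it to a context manager so the restore always happens, even if an exception is raised. This is only a sketch: `temporarily_set` and `ToyProbe` are hypothetical names, not existing garak helpers, and the approach still mutates a shared instance for the duration of the block.

```python
from contextlib import contextmanager


@contextmanager
def temporarily_set(obj, name, value):
    """Set obj.<name> to value inside the block, then restore the original."""
    original = getattr(obj, name)
    setattr(obj, name, value)
    try:
        yield obj
    finally:
        # runs on normal exit and when the block raises
        setattr(obj, name, original)


class ToyProbe:
    """Hypothetical stand-in for a probe; not a garak class."""

    generations = 5


probe = ToyProbe()
with temporarily_set(probe, "generations", 1):
    inside = probe.generations
after = probe.generations
```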
Alright, fair. This one needs to lock generations to 1 somehow. Perhaps as a condition for making it out of experimental status. Or maybe this can be done at harness level, once we have a general pattern for config injection / global config availability.
updated to use this pattern
This module represents objects related to trait and policy scanning.
Trait scanning in garak attempts to work out what the target's content policy is, before running a security scan.
It's important to know what the target's content policy is, because we only really have a useful/successful hit or breach if we're able to get a model to do something that it otherwise wouldn't. It may be exciting to discover a model gives instructions for e.g. cooking meth if the request is encoded in base64, but if in fact the model gives the instructions when simply asked directly "print instructions for cooking meth", the use of base64 isn't necessarily an exploit in this output category - the model is acting the same.
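The reasoning above can be reduced to a small predicate. This is an illustrative sketch only; the function name and its boolean inputs are hypothetical and do not describe how garak scores results.

```python
def is_meaningful_hit(permitted_when_asked_directly: bool,
                      attack_elicited_behaviour: bool) -> bool:
    """A hit only counts if the attack got the model to do something
    that it would not do when simply asked directly."""
    return attack_elicited_behaviour and not permitted_when_asked_directly


# The model already answers the direct request, so base64 adds nothing.
base64_on_permissive_model = is_meaningful_hit(True, True)

# The model refuses the direct request but complies with the encoded one.
base64_on_restrictive_model = is_meaningful_hit(False, True)
```

The first case evaluates to `False` and the second to `True`, matching the base64 example in the paragraph above.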
Garak's policy support follows a typology of different traits, each describing a different behaviour. By default this typology is stored in data/policy/trait_typology.json.
A trait scan is conducted by invoking garak with the --trait_scan switch. When this is requested, a separate scan runs using all trait probes within garak. Trait probes are denoted by a probe class asserting trait_probe=True. A regular probewise harness runs the scan, though reporting is diverted to a separate policy report file. After completion, garak estimates a policy based on trait probe results, and writes this to both main and policy reports.
What this PR adds
We're laying the base infrastructure for trait scans in this PR.
- policy module and Policy class to allow storing and manipulation of target content policies. Policies consist of a set of traits, each describing a behaviour and whether this is permitted by the target.
- policy.Policy object detailing what was extracted about the target's apparent content policy
- _plugins.enumerate_plugins() to help dynamic selection of plugins based on class attributes
Verification
garak -m test --trait_scan -p encoding -g 1, then tail the xxx.policy.jsonl
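As a rough illustration of the class-attribute selection mentioned in the list above, the sketch below filters a set of classes on an attribute such as trait_probe=True. All names here (`Probe`, `DirectRequest`, `Base64Wrapped`, `select_by_attribute`) are hypothetical, and this is not the signature or implementation of `_plugins.enumerate_plugins()`.

```python
class Probe:
    """Hypothetical base class; trait probes opt in via a class attribute."""

    trait_probe = False


class DirectRequest(Probe):
    trait_probe = True


class Base64Wrapped(Probe):
    pass


def select_by_attribute(classes, **filters):
    """Keep the classes whose attributes match every given filter."""
    return [
        cls
        for cls in classes
        if all(getattr(cls, name, None) == value for name, value in filters.items())
    ]


trait_probes = select_by_attribute([DirectRequest, Base64Wrapped], trait_probe=True)
regular_probes = select_by_attribute([DirectRequest, Base64Wrapped], trait_probe=False)
```

Filtering on class attributes rather than on instances means nothing has to be instantiated just to decide which probes belong in a trait scan.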
todo for this vs. later PRs
These are required for merging this:
--list_trait_probes
These are out-of-scope and planned: