1 | 1 | Reporting
2 | 2 | =========
3 | 3 |
| 4 | +By default, ``garak`` outputs:
4 | 5 |
| 6 | +* a JSONL file, with the name ``garak.<uuid>.report.jsonl``, that stores progress and outcomes from a scan
| 7 | +* an HTML report summarising scores
| 8 | +* a JSONL hit log, describing all the attempts from the run that were scored successful
5 | 9 |
6 | | -By default, ``garak`` outputs a JSONL file, with the name ``garak.<uuid>.report.jsonl``, that stores outcomes from a scan.
| 10 | +Report JSONL |
| 11 | +------------ |
| 12 | + |
| 13 | +The report JSONL file consists of JSON rows, one per line. Each row has an ``entry_type`` field.
| 14 | +The remaining fields depend on the entry type.
| 15 | +Attempt-type entries have ``uuid`` and ``status`` fields.
| 16 | +Status can be 0 (not sent to target), 1 (with target response but not evaluated), or 2 (with response and evaluation). |
| 17 | +Eval-type entries are added after each probe/detector pair completes, and list the results used to compute the score. |
| 18 | + |
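As an illustration of the row layout described above, the following is a minimal sketch that tallies rows by ``entry_type`` and attempt rows by ``status``. The field names ``entry_type``, ``uuid`` and ``status`` come from the text; the literal values ``"attempt"`` and ``"eval"``, the helper name ``summarise_report`` and the sample rows are assumptions made for this example, and real report rows carry more fields than shown.

```python
import json
from collections import Counter

# Status codes as described in the documentation text.
STATUS_LABELS = {
    0: "not sent to target",
    1: "target response received, not evaluated",
    2: "target response received and evaluated",
}


def summarise_report(lines):
    """Count rows per entry_type, and attempt rows per status.

    `lines` is any iterable of JSONL strings, e.g. an open file handle.
    """
    entry_types = Counter()
    attempt_statuses = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        entry_type = row.get("entry_type")
        entry_types[entry_type] += 1
        # Assumption: attempt rows are marked with entry_type == "attempt".
        if entry_type == "attempt":
            attempt_statuses[row.get("status")] += 1
    return entry_types, attempt_statuses


# Hypothetical sample rows, reduced to the fields named in the text.
sample = [
    '{"entry_type": "attempt", "uuid": "a1", "status": 1}',
    '{"entry_type": "attempt", "uuid": "a1", "status": 2}',
    '{"entry_type": "attempt", "uuid": "b2", "status": 0}',
    '{"entry_type": "eval"}',
]

entry_types, attempt_statuses = summarise_report(sample)
for status in sorted(attempt_statuses):
    label = STATUS_LABELS.get(status, "unknown")
    print(f"status {status} ({label}): {attempt_statuses[status]}")
```

To run this against a real report, pass an open file handle for the ``garak.<uuid>.report.jsonl`` file in place of ``sample``.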
| 19 | +Report HTML |
| 20 | +----------- |
| 21 | + |
| 22 | +The report HTML presents core items from the run. |
| 23 | +Runs are broken down into: |
| 24 | + |
| 25 | +1. modules/taxonomy entries |
| 26 | +2. probes within those categories |
| 27 | +3. detectors for each probe |
| 28 | + |
| 29 | +Results are given as both absolute and relative scores.
| 30 | +Relative scores are expressed as a Z-score computed against a set of other recently tested models and systems.
| 31 | +For Z-scores, 0 is average, negative is worse, positive is better. |
| 32 | +Both absolute and relative scores are placed into one of five grades, ranging from 1 (worst) to 5 (best). |
| 33 | +This scale follows the NORAD DEFCON categorisation (with less dire consequences). |
| 34 | +Bounds for these categories are developed over many runs. |
| 35 | +The absolute scores are only alarmist or reassuring for very poor or very good Z-scores. |
| 36 | +The relative scores assume the middle 10% is average, the bottom 15% is terrible, and the top 15% is great. |
| 37 | + |
| 38 | +DEFCON scores are aggregated using a minimum, to avoid obscuring important failures. |
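The relative grading and the minimum-based aggregation described above can be sketched as follows. This is not the ``garak`` implementation: it assumes Z-scores follow a standard normal distribution, and it assumes that grades 2 and 4 fill the bands between the stated bottom 15%, middle 10% and top 15%. The function names ``zscore_to_grade`` and ``aggregate_grades`` are hypothetical.

```python
from statistics import NormalDist


def zscore_to_grade(z):
    """Map a Z-score to a grade from 1 (worst) to 5 (best).

    Assumed bands, by percentile under a standard normal distribution:
    bottom 15% -> 1, middle 10% (45% to 55%) -> 3, top 15% -> 5,
    with grades 2 and 4 covering the gaps in between.
    """
    percentile = NormalDist().cdf(z)  # standard normal CDF
    if percentile < 0.15:
        return 1
    if percentile < 0.45:
        return 2
    if percentile <= 0.55:
        return 3
    if percentile <= 0.85:
        return 4
    return 5


def aggregate_grades(grades):
    """Aggregate with a minimum, so one severe failure is never averaged away."""
    return min(grades)


# Hypothetical Z-scores for the detectors of one probe.
z_scores = [0.0, 2.0, -2.0, 0.5]
grades = [zscore_to_grade(z) for z in z_scores]
print(grades, "->", aggregate_grades(grades))
```

Taking the minimum rather than a mean means a single grade of 1 sets the aggregate to 1, however well the other detectors score.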