This is the repository for our group project in the KAIST AI-Based Software Engineering course (https://coinse.github.io/teaching/2024/cs454/).
Kanghyeon Kim, Seungwoo Lee, and Ayhan Suleymanzade (all members contributed equally to this work).
Generative Flow Networks (GFlowNets) are amortized sampling techniques designed to learn a distribution over discrete objects that is proportional to their associated rewards. In this work, we explore the application of GFlowNets to property-based testing (PBT), focusing on the diversity of test inputs. We provide a comparative analysis of the diversity of inputs generated by GFlowNets and by Reinforcement Learning (RL)-based approaches, particularly RLCheck. Experiments span synthetic tasks such as Binary Search Tree (BST) generation and real-world tasks such as Maven POM XML generation. Our results reveal that while RLCheck achieves superior performance on specific diversity metrics, our investigation highlights opportunities to improve GFlowNets' capabilities, paving the way for more diverse and effective test input generation.
Property-based testing (PBT) is a powerful technique for testing programs that expect highly structured inputs, and a variety of methods have been used to generate such inputs. Reinforcement learning (RL) is often applied to find inputs that maximize a reward. However, RL-based methods tend to generate only high-reward inputs, limiting diversity.
To address this issue, Reddy et al. [1] proposed RLCheck, an on-policy RL approach to generating diverse inputs for property-based testing. RLCheck assigns high rewards to "unique" valid inputs and uses Monte Carlo control for sampling. However, this approach is still prone to overfitting to a certain region of the valid-input space, since RL itself is not designed for sampling. Moreover, to incentivize uniqueness, all previously generated inputs must be stored, which requires a large amount of space and time.
We believe that GFlowNets (GFN) [2], a family of probabilistic techniques that sample objects with probability proportional to their rewards, are better suited to provide the guidance in RLCheck. As illustrated in the figure below, while RL only aims to find the global maximum, GFN samples diverse inputs, including ones with low rewards. Moreover, since GFN itself is designed to sample diverse inputs, there is no need to keep track of all the inputs generated so far. That is, we can generate diverse inputs in a more time- and space-efficient manner.
Based on the work of Reddy et al. [1], we aim to find out whether replacing the RL component with GFN actually leads to better performance for property-based testing. First, we start with a synthetic task whose goal is generating valid Binary Search Trees (BSTs). The existing work conducted an experiment with RL that uses the set of previously generated inputs to assign diverse rewards; we expand on this with three additional experiments: 1) GFN with uniform rewards for all valid BSTs, 2) GFN using the set of previously generated inputs with diverse rewards, and 3) RL with uniform rewards for all valid BSTs. We also plan to compare our method against other baselines on four real-world Java benchmarks, used in the original evaluation of the state-of-the-art tool Zest [3], in terms of valid test-input generation ability.
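As a toy, self-contained illustration of this difference (our own sketch, not code from this repository): a greedy RL-style policy collapses onto the single highest-reward input, while reward-proportional sampling still visits low-reward inputs.

```python
import random

# Four candidate inputs with hypothetical rewards (illustrative values only).
rewards = {"A": 1.0, "B": 1.0, "C": 4.0, "D": 2.0}

# A greedy policy always returns the argmax: input "C", every time.
greedy_choice = max(rewards, key=rewards.get)

# A GFlowNet-style sampler draws each input with probability proportional
# to its reward, so the low-reward inputs A, B, and D still appear.
total = sum(rewards.values())
probs = {x: r / total for x, r in rewards.items()}  # C is sampled 50% of the time

sample = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
```

The key point is that the sampling distribution keeps every valid input reachable, which is exactly the diversity property we want for test generation.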
This project uses Python 3.11.3.
To clone the project, run the following commands:
```shell
git clone https://github.com/tfoseel/gfn-check
```
Please create a conda environment and install the requirements:

```shell
conda create -n gfn-check python=3.11
conda activate gfn-check
python3 -m pip install -r requirements.txt
```
We have four tasks. The structure of each task resembles that of the Python implementation of the BST task in Reddy et al. [1].
To run the fuzzer for each task, use the following command:

```shell
python -m [TASK].fuzz [--trials TRIALS] [--model MODEL] [--depth DEPTH]
                      [--value_range VALUE_RANGE] [--state_abstraction STATE_ABSTRACTION]
                      [--local_search_steps LOCAL_SEARCH_STEPS] [--epsilon EPSILON]
                      [--embedding_dim EMBEDDING_DIM] [--hidden_dim HIDDEN_DIM]
                      [--verbose]
```
Here `[TASK]` is `BST`, `POM`, `Student`, or `ANT`. Below is a detailed explanation of the command-line arguments supported by the `fuzz.py` script.
`--trials TRIALS`
- Type: int
- Default: 10000
- Description: Number of trials (iterations) to run during the fuzzing process.
`--model MODEL`
- Type: str
- Default: "TB"
- Description: Specifies the generation/fuzzing model to use. Options typically include:
  - RL (Reinforcement Learning)
  - FM (Flow Matching)
  - TB (Trajectory Balance)
  - DB (Detailed Balance)
  - LS (Local Search)
`--depth DEPTH`
- Type: int
- Default: 3
- Description: Maximum depth of the tree (or XML structure) being generated.
`--value_range VALUE_RANGE`
- Type: int
- Default: 10
- Description: Range of values used when generating tree nodes or tags (if applicable).
`--state_abstraction STATE_ABSTRACTION`
- Type: str
- Default: "left_right_tree"
- Description: State abstraction function to apply. Common options:
  - random: No abstraction; random selection.
  - sequence: Sequence-based n-gram abstraction.
  - tree: Hierarchical tree-based abstraction.
  - left_right_tree: Specialized tree-based abstraction that considers left/right structure.
`--local_search_steps LOCAL_SEARCH_STEPS`
- Type: int
- Default: 5
- Description: Number of local search steps to perform when using the LS (Local Search) model.
`--epsilon EPSILON`
- Type: float
- Default: 0.25
- Description: Epsilon value used for exploration in various models.
`--embedding_dim EMBEDDING_DIM`
- Type: int
- Default: 128
- Description: Dimension of the embedding used in neural models.
`--hidden_dim HIDDEN_DIM`
- Type: int
- Default: 128
- Description: Dimension of the hidden layers used in neural models.
`--verbose`
- Action: store_true
- Default: True
- Description: Prints detailed information about each trial, including generated structures and validity.
```shell
# Run fuzzing with RL model, sequence-based abstraction, depth of 4, and 10000 trials
python -m [TASK].fuzz --model RL --state_abstraction sequence --depth 4 --trials 10000 --verbose
```
(Replace `[TASK]` with `BST`, `POM`, `Student`, or `ANT`.)
To implement a fuzzer using random, RL, or GFN, go to the `generators` folder of each task and implement the two functions `select()` and `reward()` in the `Oracle` class. The input parameters should be the same across all Oracles so that they can be tested together in `fuzz.py`. (See `BST/fuzz.py` for an example.)
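As a minimal sketch of this interface (our own example; the exact parameter names and signatures used in this repository may differ, so check `BST/fuzz.py` before copying), a random-choice Oracle could look like:

```python
import random

class Oracle:
    """Random-baseline Oracle sketch. A learned Oracle (RL or GFN)
    would replace the bodies of select() and reward() with calls
    into its policy and reward models."""

    def select(self, state, choices):
        # Given the (abstracted) generation state, pick the next choice.
        # The random baseline ignores the state entirely.
        return random.choice(choices)

    def reward(self, obj, is_valid):
        # Score a finished object, e.g. 1.0 for a valid input and 0.0 otherwise.
        return 1.0 if is_valid else 0.0
```

Keeping both functions behind the same signature is what lets `fuzz.py` swap Oracles without any other changes.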
To modify the current XML tasks or create a new one, you need to write a `schema_name.xsd` and a `config.json`. For now, `config.json` has two fields: `tags` and `texts`. The fuzzer selects a tag and a text to place inside the tag; texts are placeholders.
If you want to introduce XML attributes, you can specify the list of available attributes in `config.json` and modify the `generate_*()` function in `fuzz.py`.
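As a hedged sketch of what such a generator might look like (the `attributes` field, the function name `generate_xml`, and the config values below are all hypothetical; the repository's `generate_*()` functions may differ):

```python
import random
import xml.etree.ElementTree as ET

# Hypothetical config mirroring the "tags"/"texts" fields described above,
# plus an optional "attributes" extension mapping tags to attribute names.
CONFIG = {
    "tags": ["info", "name", "age"],
    "texts": ["Text1", "Text2"],
    "attributes": {"info": ["id"]},
}

def generate_xml(depth, config=CONFIG):
    tag = random.choice(config["tags"])
    # Attach any attributes declared for this tag (placeholder values).
    attrs = {a: "v" for a in config.get("attributes", {}).get(tag, [])}
    elem = ET.Element(tag, attrs)
    if depth <= 0:
        # Leaf node: fill in a placeholder text.
        elem.text = random.choice(config["texts"])
    else:
        # Internal node: recurse to build one or two children.
        for _ in range(random.randint(1, 2)):
            elem.append(generate_xml(depth - 1, config))
    return elem
```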
When working with complex inputs, such as partially constructed XML documents, it can be useful to represent the current "state" (e.g., the sequence of XML tags and tokens chosen so far) in a more compact, generalized form. These representations, or abstractions, allow the learning algorithm to more easily identify patterns and regularities, improving decision-making and scaling to larger inputs.
The code snippet includes several abstraction functions that transform the raw state (a list of symbols, tags, indices, or structural markers) into simpler or more structured forms.
A typical choice sequence is represented as follows:
```python
[1, True, 2, True, 5, False, False, True, 4, False, False, True, 3, False, False]  # BST
['info', 2, 'name', 0, 'Steve', 'age', 0, 10]  # XML
```
In RL, the state abstraction function produces a summarized choice sequence, which is used as the key for the Q-table. You can modify the state abstraction function, as long as it returns a summarization of the state that can be used as a dictionary key.
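For example, a sequence-style n-gram abstraction could keep only the last few choices (our own sketch; the repository's `sequence` abstraction may differ in detail). The returned tuple is hashable, so it works directly as a Q-table key:

```python
def ngram_abstraction(state, n=3):
    # Summarize the state by its last n choices; a tuple is hashable
    # and can therefore be used as a dictionary (Q-table) key.
    return tuple(state[-n:])

state = [1, True, 2, True, 5, False, False, True, 4]
key = ngram_abstraction(state)  # (False, True, 4)
q_table = {key: 0.0}
```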
The framework ensures that all generated XML documents are valid by enforcing constraints defined in task-specific `.xsd` files and configurations. By customizing these files, you can easily adapt the system to different XML tasks.
Each task folder contains a `.xsd` file that defines the structure and rules for valid XML documents. For example, a schema may specify the required elements, their attributes, and the hierarchical relationships between elements. By modifying the `.xsd` file, you can define a new XML task or adjust the constraints of an existing one.
For instance, in the case of Ant-like XML tasks, you might define new tags or restrict certain element combinations. The generator will then adhere to these rules when creating XML documents.
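For illustration, a minimal schema along these lines might look like the following (the element names here are ours, not taken from the repository's schemas):

```xml
<!-- Hypothetical minimal schema: a <project> root containing one required <name> -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="project">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```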
In addition to the `.xsd` file, each task folder includes a `config.json` file that specifies the available tags, attributes, and placeholder texts for leaf nodes. By modifying `config.json`, you can control:
- Tags: A list of XML tags that the fuzzer can select when constructing a document.
- Texts: A list of placeholder texts that can appear as content within tags.
Here is an example of a `config.json` file:

```json
{
  "tags": [
    "activation",
    // ...
  ],
  "texts": [
    "Text1",
    // ...
  ]
}
```
Implement a `.py` file that contains a class representing the task and a function named `generate_TASK()` that generates the tree structure recursively, following the pattern of the `.py` files in the other folders. You can reuse the validation function that utilizes your own `.xsd` file, or implement your own validity-check logic.
In the former case, if your schema files use a specific namespace, validation can fail due to namespace issues, in which case none of the methods will generate any valid inputs. To fix this, edit the root tag in the function so that it matches the namespace. For example:
```python
import xml.etree.ElementTree as ET

# Base case: the simplest form of a valid pom.xml
tag = ET.Element('project', {
    'xmlns': 'YOUR NAMESPACE SHOULD GO HERE'
})
```