This is the repository for our group project in the KAIST AI-Based Software Engineering course (https://coinse.github.io/teaching/2024/cs454/).
Kanghyeon Kim, Seungwoo Lee, and Ayhan Suleymanzade (all members contributed equally to this work).
Generative Flow Networks (GFlowNets) are amortized sampling techniques designed to learn a distribution over discrete objects that is proportional to their associated rewards. In this work, we explore the application of GFlowNets to property-based testing (PBT), focusing on the diversity of test inputs. We provide a comparative analysis of the diversity of inputs generated by GFlowNets and by Reinforcement Learning (RL)-based approaches, particularly RLCheck. Experiments span synthetic tasks such as Binary Search Tree (BST) generation and real-world tasks such as Maven POM XML generation. Our results reveal that while RLCheck achieves superior performance on specific diversity metrics, our investigation highlights opportunities to improve GFlowNets' capabilities, paving the way for more diverse and effective test input generation.
Property-based testing (PBT) is a powerful technique for testing programs that expect highly structured inputs, and a variety of methods have been used to generate such inputs. Reinforcement learning (RL) is often applied to find inputs that maximize a reward. However, RL-based methods tend to generate only high-reward inputs, limiting diversity.
To address this issue, Reddy et al. [1] proposed RLCheck, an on-policy RL approach to generating diverse inputs for property-based testing. RLCheck assigns high rewards to "unique" valid inputs and uses Monte Carlo control for sampling. However, this approach is still prone to overfitting to a certain region of the valid-input space, since RL itself is not designed for sampling. Moreover, to incentivize uniqueness, all previously generated inputs must be stored, which requires a large amount of space and time.
We believe that GFlowNets (GFN) [2], a family of probabilistic techniques that sample objects with probability proportional to their rewards, are better suited to provide the guidance in RLCheck. As illustrated in the figure below, while RL only aims to find the global maximum, GFN samples diverse inputs, including ones with low rewards. Moreover, since GFN itself is designed to sample diverse inputs, there is no need to keep track of all the inputs generated so far. That is, we can generate diverse inputs in a more time- and space-efficient manner.
Based on the work of Reddy et al. [1], we aim to find out whether replacing the RL component with GFN actually leads to better performance for property-based testing. First, we start with a synthetic task whose goal is generating valid Binary Search Trees (BSTs). The existing work conducted an experiment with RL that uses the set of previously generated inputs to assign diverse rewards; we expand on this with three additional experiments: 1) GFN with uniform rewards for all valid BSTs, 2) GFN using the set of previously generated inputs with diverse rewards, and 3) RL with uniform rewards for all valid BSTs. We also plan to compare our method against other baselines on four real-world Java benchmarks, used in the original evaluation of the state-of-the-art tool Zest [3], in terms of valid test-input generation ability.
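As a toy, self-contained illustration of this difference (our own sketch, not code from this repository): a greedy RL-style policy collapses onto the single highest-reward input, while reward-proportional sampling still visits low-reward inputs.

```python
import random

# Four candidate inputs with hypothetical rewards (illustrative values only).
rewards = {"A": 1.0, "B": 1.0, "C": 4.0, "D": 2.0}

# A greedy policy always returns the argmax: input "C", every time.
greedy_choice = max(rewards, key=rewards.get)

# A GFlowNet-style sampler draws each input with probability proportional
# to its reward, so the low-reward inputs A, B, and D still appear.
total = sum(rewards.values())
probs = {x: r / total for x, r in rewards.items()}  # C is sampled 50% of the time

sample = random.choices(list(probs), weights=list(probs.values()), k=1)[0]
```

The key point is that the sampling distribution keeps every valid input reachable, which is exactly the diversity property we want for test generation.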
This project uses Python 3.11.3.
To clone the project, run the following commands:
```shell
git clone https://github.com/tfoseel/gfn-check
```
Please create a conda environment and install the requirements:

```shell
conda create -n gfn-check python=3.11
conda activate gfn-check
python3 -m pip install -r requirements.txt
```
We have four tasks. The structure of each task resembles that of the Python implementation of the BST task in Reddy et al. [1].
To run the fuzzer for each task, use the following command:

```shell
python -m [TASK].fuzz [--trials TRIALS] [--model MODEL] [--depth DEPTH]
                      [--value_range VALUE_RANGE] [--state_abstraction STATE_ABSTRACTION]
                      [--local_search_steps LOCAL_SEARCH_STEPS] [--epsilon EPSILON]
                      [--embedding_dim EMBEDDING_DIM] [--hidden_dim HIDDEN_DIM]
                      [--verbose]
```
Here `[TASK]` is `BST`, `POM`, `Student`, or `ANT`. Below is a detailed explanation of the command-line arguments supported by the `fuzz.py` script.
`--trials TRIALS`
- Type: int
- Default: 10000
- Description: Number of trials (iterations) to run during the fuzzing process.
`--model MODEL`
- Type: str
- Default: "TB"
- Description: Specifies the generation/fuzzing model to use. Options typically include:
  - RL (Reinforcement Learning)
  - FM (Flow Matching)
  - TB (Trajectory Balance)
  - DB (Detailed Balance)
  - LS (Local Search)
`--depth DEPTH`
- Type: int
- Default: 3
- Description: Maximum depth of the tree (or XML structure) being generated.
`--value_range VALUE_RANGE`
- Type: int
- Default: 10
- Description: Range of values used when generating tree nodes or tags (if applicable).
`--state_abstraction STATE_ABSTRACTION`
- Type: str
- Default: "left_right_tree"
- Description: State abstraction function to apply. Common options:
  - random: No abstraction; random selection.
  - sequence: Sequence-based n-gram abstraction.
  - tree: Hierarchical tree-based abstraction.
  - left_right_tree: Specialized tree-based abstraction that considers left/right structure.
`--local_search_steps LOCAL_SEARCH_STEPS`
- Type: int
- Default: 5
- Description: Number of local search steps to perform when using the LS (Local Search) model.
`--epsilon EPSILON`
- Type: float
- Default: 0.25
- Description: Epsilon value used for exploration in various models.
`--embedding_dim EMBEDDING_DIM`
- Type: int
- Default: 128
- Description: Dimension of the embedding used in neural models.
`--hidden_dim HIDDEN_DIM`
- Type: int
- Default: 128
- Description: Dimension of the hidden layers used in neural models.
`--verbose`
- Action: store_true
- Default: True
- Description: Prints detailed information about each trial, including generated structures and validity.
```shell
# Run fuzzing with RL model, sequence-based abstraction, depth of 4, and 10000 trials
python -m [TASK].fuzz --model RL --state_abstraction sequence --depth 4 --trials 10000 --verbose
```
(Replace `[TASK]` with `BST`, `POM`, `Student`, or `ANT`.)
To implement a fuzzer using random, RL, or GFN, go to the `generators` folder of each task and implement the two functions `select()` and `reward()` in the `Oracle` class. The input parameters should be the same across all Oracles so that they can be tested together in `fuzz.py`. (See `BST/fuzz.py` for an example.)
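As a minimal sketch of this interface (our own example; the exact parameter names and signatures used in this repository may differ, so check `BST/fuzz.py` before copying), a random-choice Oracle could look like:

```python
import random

class Oracle:
    """Random-baseline Oracle sketch. A learned Oracle (RL or GFN)
    would replace the bodies of select() and reward() with calls
    into its policy and reward models."""

    def select(self, state, choices):
        # Given the (abstracted) generation state, pick the next choice.
        # The random baseline ignores the state entirely.
        return random.choice(choices)

    def reward(self, obj, is_valid):
        # Score a finished object, e.g. 1.0 for a valid input and 0.0 otherwise.
        return 1.0 if is_valid else 0.0
```

Keeping both functions behind the same signature is what lets `fuzz.py` swap Oracles without any other changes.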
To modify the current XML tasks or create a new one, you need to write a `schema_name.xsd` and a `config.json`. For now, `config.json` has two fields: `tags` and `texts`. The fuzzer selects a tag and a text to place inside the tag; texts are placeholders.
If you want to introduce XML attributes, you can specify the list of available attributes in `config.json` and modify the `generate_*()` function in `fuzz.py`.
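As a hedged sketch of what such a generator might look like (the `attributes` field, the function name `generate_xml`, and the config values below are all hypothetical; the repository's `generate_*()` functions may differ):

```python
import random
import xml.etree.ElementTree as ET

# Hypothetical config mirroring the "tags"/"texts" fields described above,
# plus an optional "attributes" extension mapping tags to attribute names.
CONFIG = {
    "tags": ["info", "name", "age"],
    "texts": ["Text1", "Text2"],
    "attributes": {"info": ["id"]},
}

def generate_xml(depth, config=CONFIG):
    tag = random.choice(config["tags"])
    # Attach any attributes declared for this tag (placeholder values).
    attrs = {a: "v" for a in config.get("attributes", {}).get(tag, [])}
    elem = ET.Element(tag, attrs)
    if depth <= 0:
        # Leaf node: fill in a placeholder text.
        elem.text = random.choice(config["texts"])
    else:
        # Internal node: recurse to build one or two children.
        for _ in range(random.randint(1, 2)):
            elem.append(generate_xml(depth - 1, config))
    return elem
```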
When working with complex inputs, such as partially constructed XML documents, it can be useful to represent the current "state" (e.g., the sequence of XML tags and tokens chosen so far) in a more compact, generalized form. These representations, or abstractions, allow the learning algorithm to more easily identify patterns and regularities, improving decision-making and scaling to larger inputs.
The code snippet includes several abstraction functions that transform the raw state (a list of symbols, tags, indices, or structural markers) into simpler or more structured forms.
A typical choice sequence is represented as follows:
```python
[1, True, 2, True, 5, False, False, True, 4, False, False, True, 3, False, False]  # BST
['info', 2, 'name', 0, 'Steve', 'age', 0, 10]  # XML
```
In RL, the state abstraction function produces a summarized choice sequence, which is used as the key for the Q-table. You can modify the state abstraction function, as long as it returns a summarization of the state that can be used as a dictionary key.
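For example, a sequence-style n-gram abstraction could keep only the last few choices (our own sketch; the repository's `sequence` abstraction may differ in detail). The returned tuple is hashable, so it works directly as a Q-table key:

```python
def ngram_abstraction(state, n=3):
    # Summarize the state by its last n choices; a tuple is hashable
    # and can therefore be used as a dictionary (Q-table) key.
    return tuple(state[-n:])

state = [1, True, 2, True, 5, False, False, True, 4]
key = ngram_abstraction(state)  # (False, True, 4)
q_table = {key: 0.0}
```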
The framework ensures that all generated XML documents are valid by enforcing constraints defined in task-specific `.xsd` files and configurations. By customizing these files, you can easily adapt the system to different XML tasks.
Each task folder contains a `.xsd` file that defines the structure and rules for valid XML documents. For example, a schema may specify the required elements, their attributes, and the hierarchical relationships between elements. By modifying the `.xsd` file, you can define a new XML task or adjust the constraints of an existing one.
For instance, in the case of Ant-like XML tasks, you might define new tags or restrict certain element combinations. The generator will then adhere to these rules when creating XML documents.
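For illustration, a minimal schema along these lines might look like the following (the element names here are ours, not taken from the repository's schemas):

```xml
<!-- Hypothetical minimal schema: a <project> root containing one required <name> -->
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xs:element name="project">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="name" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```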
In addition to the `.xsd` file, each task folder includes a `config.json` file that specifies the available tags, attributes, and placeholder texts for leaf nodes. By modifying `config.json`, you can control:
- Tags: A list of XML tags that the fuzzer can select when constructing a document.
- Texts: A list of placeholder texts that can appear as content within tags.
Here is an example of a `config.json` file:

```json
{
  "tags": [
    "activation",
    // ...
  ],
  "texts": [
    "Text1",
    // ...
  ]
}
```
Implement a `.py` file that contains a class representing the task and a function named `generate_TASK()` that generates the tree structure recursively, following the pattern of the `.py` files in the other folders. You can reuse the validation function that utilizes your own `.xsd` file, or implement your own validity-check logic.
In the former case, if your schema files use a specific namespace, validation can fail due to namespace issues, in which case none of the methods will generate any valid inputs. To fix this, edit the root tag in the function so that it matches the namespace. For example:
```python
import xml.etree.ElementTree as ET

# Base case: the simplest form of a valid pom.xml
tag = ET.Element('project', {
    'xmlns': 'YOUR NAMESPACE SHOULD GO HERE'
})
```