Commit 277afbd

Merge pull request #47 from IntelLabs/nhasabni/cpp_support
Full support for C++ programs
2 parents e7e0e44 + 845a29c commit 277afbd

18 files changed: +391 -94 lines changed

CMakeLists.txt

Lines changed: 6 additions & 0 deletions
@@ -55,6 +55,12 @@ if(NOT EXISTS ${PROJECT_SOURCE_DIR}/src/tree-sitter/tree-sitter-php)
   )
 endif()
 
+if(NOT EXISTS ${PROJECT_SOURCE_DIR}/src/tree-sitter/tree-sitter-cpp)
+  execute_process(
+    COMMAND git clone https://github.com/tree-sitter/tree-sitter-cpp.git ${PROJECT_SOURCE_DIR}/src/tree-sitter/tree-sitter-cpp
+  )
+endif()
+
 get_filename_component(TREE_SITTER_INCLUDE src/tree-sitter/tree-sitter/lib/include ABSOLUTE)
 
 enable_testing()

README.md

Lines changed: 21 additions & 18 deletions
@@ -39,8 +39,6 @@ More details can be found in our MAPS paper (https://arxiv.org/abs/2011.03616).
 - `scripts`: Scripts for pattern mining and scanning for anomalies
 - `quick_start`: Scripts to run quick start tests
 - `github`: Scripts and data for downloading GitHub repos.
-It also contains pre-processed training data containing patterns mined from
-6000 GitHub repositories using C as their primary language.
 - `tests`: unit tests
 
 ## Install
@@ -69,14 +67,13 @@ $ cmake .
 $ make -j
 $ make test
 ```
-All tests in `make test` should pass, but currently tests for Verilog are failing because of a version mismatch issue.
-Verilog support is WIP.
+All tests in `make test` should pass.
 
 ## Using ControlFlag
 
 ### Quick start
 
-#### Using patterns obtained from 6000 GitHub repos to scan repository of your choice
+#### Using patterns obtained from several GitHub repos to scan repository of your choice
 
 Download the training data for the language of interest depending on the memory constraints of your device. Note, however, that using smaller datasets may lead to reduced accuracy in the results ControlFlag produces and possibly an increase in the number of false positives it generates.
 
@@ -85,6 +82,9 @@ Language | Dataset name | Size on disk | Memory requirements | Direct link | gdo
 C | Small | ~100MB | ~400MB | [link](https://drive.google.com/file/d/1gvUyRXq1SeZD9g3i__RaamYAMo_QaQIb/view?usp=sharing) | 1gvUyRXq1SeZD9g3i__RaamYAMo_QaQIb | 2825f209aba0430993f7a21e74d99889
 C | Medium | ~450MB | ~1.3GB | [link](https://drive.google.com/file/d/1zsCFJAKlZlSAWKPfBcVGcQNlFB5Gtwo3/view?usp=sharing) | 1zsCFJAKlZlSAWKPfBcVGcQNlFB5Gtwo3 | aab2427edebe9ed4acab75c3c6227f24
 C | Large | ~9GB | ~13GB | [link](https://drive.google.com/file/d/1-jzs3zrKU541hwChaciXSk8zrnMN1mYc/view?usp=sharing) | 1-jzs3zrKU541hwChaciXSk8zrnMN1mYc | 1ba954d9716765d44917445d3abf8e85
+C++ | Small | ~200MB | ~500MB | [link](https://drive.google.com/file/d/1ZD9J7vyT61T1D4rsedVXgFi0CrVb5BJl/view?usp=sharing) | 1ZD9J7vyT61T1D4rsedVXgFi0CrVb5BJl | f954486e20961f0838ac08e5d4dbf312
+C++ | Medium | ~500MB | ~1.3GB | [link](https://drive.google.com/file/d/1Pj3bQN3nwy84F5o1w05T1Gz8b4hGuPUr/view?usp=sharing) | 1Pj3bQN3nwy84F5o1w05T1Gz8b4hGuPUr | a5c18ea1cdbe354b93aabf9ecaa5b07a
+C++ | Large | ~1.2GB | ~3GB | [link](https://drive.google.com/file/d/14iNcH3plw3EYnYfX63LntPtyr8Pwo2IP/view?usp=sharing) | 14iNcH3plw3EYnYfX63LntPtyr8Pwo2IP | 4f5ffc1ab942eaba399cafd5be8bb45f
 PHP | Small | ~120MB | ~1GB | [Link](https://drive.google.com/file/d/1zUnBHMXPIXmlrCfWze8nNoMEQnc0W2K5/view?usp=sharing) | 1zUnBHMXPIXmlrCfWze8nNoMEQnc0W2K5 | 5a1cc4c24a20de7dad1b9f40661d517a
 
 ```
@@ -96,16 +96,22 @@ $ tar -zxf <tgz_file>
 To scan C code of your choice, use below command:
 
 ```
-$ scripts/scan_for_anomalies.sh -d <directory_to_be_scanned_for_anomalies> -t <training_data>.ts -o <output_directory_to_store_log_files>
+$ scripts/scan_for_anomalies.sh -d <directory_to_be_scanned_for_anomalies> -t <training_data>.ts -o <output_directory_to_store_log_files> -l 1
+```
+
+To scan C++ code of your choice, use below command:
+
+```
+$ scripts/scan_for_anomalies.sh -d <directory_to_be_scanned_for_anomalies> -t <training_data>.ts -o <output_directory_to_store_log_files> -l 4
 ```
 
 Once the run is complete (which could take some time depending on your system and the
-number of C programs in your repository,) refer to [the section below to
+number of programs from your repository that can be scanned by ControlFlag,) refer to [the section below to
 understand scan output](#understanding-scan-output).
 
 #### Mining patterns from a small repo and applying them to another small repo
 
-In this test, we will mine patterns from
+In this test for C language programs, we will mine patterns from
 [Glb-director](https://github.com/github/glb-director.git) project of GitHub and
 apply them to flag anomalies in GitHub's [brubeck](https://github.com/github/brubeck.git) project.
 
@@ -143,9 +149,9 @@ statements that appear in C programs.*
 
 If you want to use your own repository for mining patterns, jump to Step 1.2.
 
-1.1 __Downloading Top-100 GitHub repos for C language__
+1.1 __Downloading GitHub repos for C language having more than 100 stars__
 
-Steps below show how to download Top-100 GitHub repos for C language
+Steps below show how to download GitHub repos for C language that have more than 100 stars
 (`c100.txt`) and generate training data. `training_repo_dir` is a directory
 where the command below will clone all the repos.
 
@@ -165,7 +171,8 @@ place of <training_repo_dir>.
 Usage: ./mine_patterns.sh -d <directory_to_mine_patterns_from> -o <output_file_to_store_training_data>
 Optional:
 [-n number_of_processes_to_use_for_mining] (default: num_cpus_on_system)
-[-l source_language_number] (default: 1 (C), supported: 1 (C), 2 (Verilog), 3 (PHP)
+[-l source_language_number] (default: 1 (C), supported: 1 (C), 2 (Verilog), 3 (PHP), 4 (C++)
+[-g github_repo_id] (default: 0) A unique identifier for GitHub repository, if any
 ```
 
 We use it as:
@@ -178,7 +185,7 @@ found in the specified GitHub repos and their AST (abstract syntax tree) represe
 You can view this file as a text file, if
 you want.
 
-## Evaluation (or scanning for anomalies in C code from test repo)
+## Evaluation (or scanning for anomalies)
 
 We can run `scan_for_anomalies.sh` script to scan target directory of interest.
 Its usage is as below.
@@ -189,14 +196,10 @@ Optional:
 [-n max_number_of_results_for_autocorrect] (default: 5)
 [-j number_of_scanning_threads] (default: num_cpus_on_systems)
 [-o output_log_dir] (default: /tmp)
-[-l source_language_number] (default: 1 (C), supported: 1 (C), 2 (Verilog), 3 (PHP))
+[-l source_language_number] (default: 1 (C), supported: 1 (C), 2 (Verilog), 3 (PHP), 4 (C++))
 [-a anomaly_threshold] (default: 3.0)
 ```
 
-```
-scripts/scan_for_anomalies.sh -d <test_directory> -t <training_data_file> -o <output_log_dir>
-```
-
 As a part of scanning for anomalies, ControlFlag also suggests possible
 corrections in case a conditional expression is flagged as an anomaly. `25` is the
 `max_cost` for the correction -- how close should the suggested correction be to
@@ -218,7 +221,7 @@ $ grep "Potential anomaly" <output_log_dir>/thread_*.log
 A sample anomaly report looks like below:
 ```
 Level:<ONE or TWO> Expression: <AST_for_anomalous_expression>
-Source file and line number: <C code with line number having the anomaly>
+Source file and line number: <Source code expression with line number having the anomaly>
 Potential anomaly
 Did you mean ...
 ```

quick_start/test3_cpp.sh

Lines changed: 7 additions & 0 deletions
@@ -0,0 +1,7 @@
+#!/bin/sh
+QUICK_START_DIR=`dirname $0`
+git clone https://github.com/google/sentencepiece.git
+${QUICK_START_DIR}/../scripts/mine_patterns.sh -d sentencepiece -o sentencepiece_training_data.ts -l 4
+git clone https://github.com/google/xrtl.git
+mkdir test3_scan_output
+${QUICK_START_DIR}/../scripts/scan_for_anomalies.sh -d xrtl/ -t sentencepiece_training_data.ts -o test3_scan_output -l 4

scripts/mine_patterns.sh

Lines changed: 20 additions & 13 deletions
@@ -10,8 +10,8 @@ function print_usage() {
   else
     echo "[-n number_of_processes_to_use_for_mining] (default: num_cpus_on_system)"
   fi
-  echo "[-l source_language_number] (default: 1 (C), supported: 1 (C), 2 (Verilog), 3 (PHP)"
-
+  echo "[-l source_language_number] (default: 1 (C), supported: 1 (C), 2 (Verilog), 3 (PHP), 4 (C++)"
+  echo "[-g github_repo_id] (default: 0) A unique identifier for GitHub repository, if any"
   exit
 }
 
@@ -23,14 +23,16 @@ else
   NUM_MINER_PROCS=`nproc`
 fi
 LANGUAGE=1
+REPO_ID=0
 
-while getopts d:o:n:l: flag
+while getopts d:o:n:l:g: flag
 do
   case "${flag}" in
     d) TRAIN_DIR=${OPTARG};;
     o) OUTPUT_FILE=${OPTARG};;
     n) NUM_MINER_PROCS=${OPTARG};;
     l) LANGUAGE=${OPTARG};;
+    g) REPO_ID=${OPTARG};;
   esac
 done
 
@@ -41,15 +43,15 @@ then
   print_usage $0
 fi
 
-if (( ${LANGUAGE} < 1 || ${LANGUAGE} > 3 ));
+if (( ${LANGUAGE} < 1 || ${LANGUAGE} > 4 ));
 then
-  echo "ERROR: Only 1 (C), 2 (Verilog) or 3 (PHP) are supported languages; received ${LANGUAGE}"
+  echo "ERROR: Only 1 (C), 2 (Verilog), 3 (PHP), and 4 (C++) are supported languages; received ${LANGUAGE}"
   print_usage $0
 fi
 
 if [ -f "${OUTPUT_FILE}" ]
 then
-  echo "ERROR: Output file exists. We don't want to over-write it."
+  echo "ERROR: Output file ${OUTPUT_FILE} exists. We don't want to over-write it."
   print_usage $0
 fi
 
@@ -58,36 +60,41 @@ FILE_LIST=${TMP_DIR}/file_list.txt
 if [ "${LANGUAGE}" = "1" ];
 then
   find "${TRAIN_DIR}" -iname "*.c" -o -iname "*.h" -type f > ${FILE_LIST}
+elif [ "${LANGUAGE}" = "2" ];
+then
+  find "${TRAIN_DIR}" -iname "*.v" -o -iname "*.vh" -type f > ${FILE_LIST}
 elif [ "${LANGUAGE}" = "3" ];
 then
   find "${TRAIN_DIR}" -iname "*.php" -type f | fgrep -v "/vendor/" > ${FILE_LIST}
-else
-  find "${TRAIN_DIR}" -iname "*.v" -o -iname "*.vh" -type f > ${FILE_LIST}
+elif [ "${LANGUAGE}" = "4" ];
+then
+  find "${TRAIN_DIR}" -iname "*.cpp" -o -iname "*.cc" -o -iname "*.cxx" -o -iname "*.h" -o -iname "*.hpp" -o -iname "*.hxx" -type f > ${FILE_LIST}
 fi
 
 SCRIPTS_DIR=`dirname $0`
 function dump_code_blocks() {
-  id=`echo "$1" | cut -d ':' -f 1`
-  f=`echo "$1" | cut -d ':' -f 2-`
+  id=${REPO_ID}
+  f=$1
   ${SCRIPTS_DIR}/../bin/cf_dump_code_blocks -f "$f" -t 100 -g ${id} -l ${LANGUAGE} >> $2
 }
 export -f dump_code_blocks
 export LANGUAGE
 export SCRIPTS_DIR
+export REPO_ID
 
 if ! command -v parallel &> /dev/null
 then
   echo "GNU Parallel does not exist. Invoking serial dump.."
-  for id_f in `cat $FILE_LIST`;
+  for f in `cat $FILE_LIST`;
   do
-    dump_code_blocks ${id_f} ${OUTPUT_FILE}
+    dump_code_blocks ${f} ${OUTPUT_FILE}
   done
 
 else
 
   echo "GNU Parallel exists. Invoking parallel dump.."
   cat ${FILE_LIST} | parallel --eta --bar --progress \
-    -I% -j0 dump_code_blocks % ${TMP_DIR}/proc_{%}.log
+    -I% -j ${NUM_MINER_PROCS} dump_code_blocks % ${TMP_DIR}/proc_{%}.log
 
   for i in `seq 1 $NUM_MINER_PROCS`;
   do
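
The option handling this diff adds to `mine_patterns.sh` (a new `-g` flag that defaults to `0`, and a language range extended to 4) can be sketched in isolation. The function name `parse_mining_args` is hypothetical, and the `case`-based range check is a POSIX stand-in for the script's `(( ... ))` arithmetic test, not the script's own code.

```shell
#!/bin/sh
# Hypothetical helper mirroring the option handling in mine_patterns.sh.
# Sets LANGUAGE and REPO_ID; returns 1 on an unknown flag or an
# unsupported language number.
parse_mining_args() {
  LANGUAGE=1   # default: C
  REPO_ID=0    # default: no GitHub repository id
  OPTIND=1     # reset so the function can be called more than once
  while getopts d:o:n:l:g: flag; do
    case "${flag}" in
      d) TRAIN_DIR=${OPTARG} ;;
      o) OUTPUT_FILE=${OPTARG} ;;
      n) NUM_MINER_PROCS=${OPTARG} ;;
      l) LANGUAGE=${OPTARG} ;;
      g) REPO_ID=${OPTARG} ;;
      *) return 1 ;;
    esac
  done
  # Supported: 1 (C), 2 (Verilog), 3 (PHP), 4 (C++)
  case "${LANGUAGE}" in
    1|2|3|4) return 0 ;;
    *) echo "ERROR: unsupported language ${LANGUAGE}" >&2; return 1 ;;
  esac
}

parse_mining_args -d src -o out.ts -l 4 -g 42 &&
  echo "language=${LANGUAGE} repo_id=${REPO_ID}"
```

Resetting `OPTIND` inside the function is what allows repeated calls in the same shell; the real script parses its arguments once at top level and does not need it.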

scripts/scan_for_anomalies.sh

Lines changed: 9 additions & 5 deletions
@@ -14,7 +14,7 @@ function print_usage() {
   fi
   echo " [-o output_log_dir] (default: /tmp)"
   echo " [-a anomaly_threshold] (default: 3.0)"
-  echo " [-l source_language_number] (default: 1 (C), supported: 1 (C), 2 (Verilog), 3 (PHP)"
+  echo " [-l source_language_number] (default: 1 (C), supported: 1 (C), 2 (Verilog), 3 (PHP), 4 (C++)"
 
   exit
 }
@@ -58,21 +58,25 @@ then
   print_usage $0
 fi
 
-if (( ${LANGUAGE} < 1 || ${LANGUAGE} > 3 ));
+if (( ${LANGUAGE} < 1 || ${LANGUAGE} > 4 ));
 then
-  echo "ERROR: Only 1 (C), 2 (Verilog) or 3 (PHP) are supported languages; received ${LANGUAGE}"
+  echo "ERROR: Only 1 (C), 2 (Verilog), 3 (PHP), and 4 (C++) are supported languages; received ${LANGUAGE}"
   print_usage $0
 fi
 
 SCAN_FILE_LIST=`mktemp`
 if [ "${LANGUAGE}" = "1" ];
 then
   find "${SCAN_DIR}" -iname "*.c" -o -iname "*.h" -type f > ${SCAN_FILE_LIST}
+elif [ "${LANGUAGE}" = "2" ];
+then
+  find "${SCAN_DIR}" -iname "*.v" -o -iname "*.vh" -type f > ${SCAN_FILE_LIST}
 elif [ "${LANGUAGE}" = "3" ];
 then
   find "${SCAN_DIR}" -iname "*.php" -type f | fgrep -v "/vendor/" > ${SCAN_FILE_LIST}
-else
-  find "${SCAN_DIR}" -iname "*.v" -o -iname "*.vh" -type f > ${SCAN_FILE_LIST}
+elif [ "${LANGUAGE}" = "4" ];
+then
+  find "${SCAN_DIR}" -iname "*.cpp" -o -iname "*.cc" -o -iname "*.cxx" -o -iname "*.h" -o -iname "*.hpp" -o -iname "*.hxx" -type f > ${SCAN_FILE_LIST}
 fi
 
 SCRIPTS_DIR=`dirname $0`
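
One caveat about the `find` expressions in this script, offered as a sketch rather than as something the commit changes: in `find`, the implicit AND binds more tightly than `-o`, so a trailing `-type f` applies only to the last `-iname` pattern. A directory whose name ends in `.cpp` would therefore still be listed. Grouping the name tests avoids this. The helper name `list_cpp_files` and the demo file names are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: list C++ sources under a directory, regular files only.
# The parentheses make -type f apply to every pattern, not just the last one.
list_cpp_files() {
  find "$1" -type f \( -iname "*.cpp" -o -iname "*.cc" -o -iname "*.cxx" \
    -o -iname "*.h" -o -iname "*.hpp" -o -iname "*.hxx" \)
}

# Small demonstration in a throwaway directory.
demo_dir=$(mktemp -d)
mkdir "${demo_dir}/looks_like_source.cpp"   # a directory, not a file
touch "${demo_dir}/main.cc" "${demo_dir}/util.hpp" "${demo_dir}/notes.txt"
list_cpp_files "${demo_dir}" | sort          # lists main.cc and util.hpp only
rm -rf "${demo_dir}"
```

The same grouping would apply to the C and Verilog branches, which share the ungrouped `-o ... -type f` shape.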

src/CMakeLists.txt

Lines changed: 1 addition & 0 deletions
@@ -27,6 +27,7 @@ set(COMMON_LINK_LIBRARIES
   tree-sitter
   tree-sitter-c
   tree-sitter-php
+  tree-sitter-cpp
   tree-sitter-verilog
   pthread
 )

src/cf_dump_code_blocks.cpp

Lines changed: 6 additions & 3 deletions
@@ -93,7 +93,7 @@ int handle_command_args(int argc, char* argv[], CFDumpArgs& command_args) {
               << std::endl
               << " [-l source_language_number] (default: "
               << LANGUAGE_C << ")"
-              << ", supported: 1 (C), 2 (Verilog), 3 (PHP)"
+              << ", supported: 1 (C), 2 (Verilog), 3 (PHP), 4 (C++)"
               << std::endl;
   };
 
@@ -104,9 +104,9 @@ int handle_command_args(int argc, char* argv[], CFDumpArgs& command_args) {
       case 'l':
         command_args.source_language_ = VerifyLanguage(atoi(optarg)); break;
       case 'g':
-        command_args.github_contributor_id_ = std::strtoul(argv[3], NULL, 10);
+        command_args.github_contributor_id_ = std::strtoul(optarg, NULL, 10);
         break;
-      case 't': command_args.level_ = VerifyTreeLevel(atoi(argv[2])); break;
+      case 't': command_args.level_ = VerifyTreeLevel(atoi(optarg)); break;
       default: print_usage(); return EXIT_FAILURE;
     }
   }
@@ -137,6 +137,9 @@ int main(int argc, char* argv[]) {
     case LANGUAGE_PHP:
       DumpCodeBlocksFromSourceFile<LANGUAGE_PHP>(command_args);
       break;
+    case LANGUAGE_CPP:
+      DumpCodeBlocksFromSourceFile<LANGUAGE_CPP>(command_args);
+      break;
     default:
       throw cf_unexpected_situation("Unsupported language:" +
                                     std::to_string(LanguageToInt(
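
The `-g` and `-t` change in this file replaces fixed `argv` positions with `optarg`. With the invocation used in `mine_patterns.sh` (`-f "$f" -t 100 -g ${id} -l ${LANGUAGE}`), `argv[2]` is the file path rather than the tree level, so a positional lookup reads the wrong argument. The same pitfall can be shown in shell, the language of the project's scripts; both function names below are hypothetical.

```shell
#!/bin/sh
# Fragile: assumes the level is always the second argument.
level_by_position() {
  printf '%s\n' "$2"
}

# Robust: let getopts hand over the argument that belongs to -t,
# wherever -t appears on the command line.
level_by_optarg() {
  level=""
  OPTIND=1
  while getopts f:t:g:l: flag; do
    case "${flag}" in
      t) level=${OPTARG} ;;
      *) ;;  # other flags are irrelevant to this sketch
    esac
  done
  printf '%s\n' "${level}"
}

# Same flag order as the call in mine_patterns.sh:
level_by_position -f foo.cpp -t 100 -g 0 -l 4   # prints "foo.cpp"
level_by_optarg   -f foo.cpp -t 100 -g 0 -l 4   # prints "100"
```

In the C++ code, `optarg` plays the role of `${OPTARG}` here: it always points at the argument of the option currently being processed.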

src/cf_file_scanner.cpp

Lines changed: 7 additions & 2 deletions
@@ -24,6 +24,7 @@
 #include <string>
 #include <thread> // NOLINT [build/c++11]
 
+#include "exception.h"
 #include "train_and_scan_util.h"
 #include "trie.h"
 
@@ -54,7 +55,7 @@ static int handle_command_args(int argc, char* argv[], FileScannerArgs& args) {
               << " [-a anomaly_threshold] (default: 3.0)"
               << std::endl
               << " [-l source_language_number] (default: 1 (C), "
-              << "supported: 1 (C), 2 (Verilog), 3(PHP))"
+              << "supported: 1 (C), 2 (Verilog), 3 (PHP), 4 (C++))"
               << std::endl
               << " [-v log_level ] (default: 0, "
               << "{ERROR, 0}, {INFO, 1}, {DEBUG, 2})"
@@ -159,7 +160,11 @@ int main(int argc, char* argv[]) {
       break;
     case LANGUAGE_PHP:
       status = train_and_scan_util.ScanFile<LANGUAGE_PHP>(eval_file,
-          log_file);
+          log_file);
+      break;
+    case LANGUAGE_CPP:
+      status = train_and_scan_util.ScanFile<LANGUAGE_CPP>(eval_file,
+          log_file);
       break;
     default:
       throw cf_unexpected_situation("Unsupported language:" +

src/common_util.cpp

Lines changed: 11 additions & 3 deletions
@@ -23,6 +23,7 @@
 #include <sstream>
 
 #include "parser.h"
+#include "exception.h"
 #include "common_util.h"
 
 template <Language L>
@@ -91,8 +92,8 @@ void CollectCodeBlocksOfInterest<LANGUAGE_VERILOG>(const TSNode& node,
   }
 }
 
-// For C and PHP language,
-// we are looking for control structures such as if statements.
+// For C, C++, and PHP language, we are looking for control structures
+// such as if statements.
 template <Language L>
 void CollectCodeBlocksOfInterest(const TSNode& node,
     code_blocks_t& code_blocks) {
@@ -126,17 +127,24 @@ ManagedTSTree GetTSTree<LANGUAGE_VERILOG>(const std::string&, bool);
 template
 ManagedTSTree GetTSTree<LANGUAGE_PHP>(const std::string&, bool);
 template
+ManagedTSTree GetTSTree<LANGUAGE_CPP>(const std::string&, bool);
+template
 ManagedTSTree GetTSTree<LANGUAGE_C>(const std::string&, std::string&);
 template
 ManagedTSTree GetTSTree<LANGUAGE_VERILOG>(const std::string&, std::string&);
 template
 ManagedTSTree GetTSTree<LANGUAGE_PHP>(const std::string&, std::string&);
 template
+ManagedTSTree GetTSTree<LANGUAGE_CPP>(const std::string&, std::string&);
+template
 void CollectCodeBlocksOfInterest<LANGUAGE_C>(const ManagedTSTree&,
     code_blocks_t&);
 template
 void CollectCodeBlocksOfInterest<LANGUAGE_VERILOG>(const ManagedTSTree &,
     code_blocks_t&);
 template
 void CollectCodeBlocksOfInterest<LANGUAGE_PHP>(const ManagedTSTree &,
-    code_blocks_t&);
+    code_blocks_t&);
+template
+void CollectCodeBlocksOfInterest<LANGUAGE_CPP>(const ManagedTSTree &,
+    code_blocks_t&);
