Ref: https://github.com/parasol-aser/deepsim/tree/master/dataset
Google Code Jam questions that we use: Question-GCJ
Other questions (e.g. Code Jam, Kick Start, Hash Code): this website
Code fragments: 1665
True clone pairs: 274959
False clone pairs: 1110321
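As a quick sanity check on the figures above, the true and false pair counts should together cover every unordered pair of distinct fragments, i.e. n(n-1)/2 pairs for n fragments. The sketch below assumes pairs are unordered and that no fragment is paired with itself; the helper name `total_pairs` is only illustrative.

```python
from math import comb


def total_pairs(n_fragments: int) -> int:
    """Number of unordered pairs of distinct fragments (n choose 2)."""
    return comb(n_fragments, 2)


# Figures listed above for the GCJ dataset.
gcj_fragments = 1665
gcj_true, gcj_false = 274959, 1110321

# True and false pairs together should account for every pair.
print(total_pairs(gcj_fragments))                          # 1385280
print(total_pairs(gcj_fragments) == gcj_true + gcj_false)  # True
```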
Ref: https://developer.ibm.com/exchanges/data/all/project-codenet/
Code fragments: 75000
True clone pairs: 11212500
False clone pairs: 2801250000
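A minimal sketch of how such pair counts can be derived from per-fragment problem labels, assuming (this is not stated here) that a true clone pair is two fragments solving the same problem and a false pair is two fragments solving different problems. The function name `count_clone_pairs` and the 250 x 300 split are hypothetical; that split is just one grouping that happens to reproduce the figures listed above and is not confirmed by the source.

```python
from collections import Counter
from math import comb


def count_clone_pairs(labels):
    """Return (true_pairs, false_pairs) for a list of problem labels.

    Assumption: two fragments form a true clone pair exactly when
    they carry the same problem label.
    """
    group_sizes = Counter(labels).values()
    true_pairs = sum(comb(size, 2) for size in group_sizes)
    false_pairs = comb(len(labels), 2) - true_pairs
    return true_pairs, false_pairs


# Toy example: 3 solutions to problem A, 2 solutions to problem B.
print(count_clone_pairs(["A", "A", "A", "B", "B"]))  # (4, 6)

# Hypothetical split: 250 problems with 300 solutions each (75000 fragments).
hypothetical_labels = [p for p in range(250) for _ in range(300)]
print(count_clone_pairs(hypothetical_labels))  # (11212500, 2801250000)
```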
Ref: https://github.com/microsoft/CodeBERT/tree/master
CodeBERT is a pre-trained model for programming languages. It is a multilingual model pre-trained on natural language/programming language (NL-PL) pairs in 6 programming languages (Python, Java, JavaScript, PHP, Ruby, Go).
GraphCodeBERT is a pre-trained model for programming languages that also considers the inherent structure of code, i.e. data flow. It is a multilingual model pre-trained on NL-PL pairs in the same 6 programming languages (Python, Java, JavaScript, PHP, Ruby, Go).
We will focus on this for clone detection