The dataset is a synthetically generated collection of code vulnerabilities, produced with attributed prompts designed to capture a wide range of real-world security patterns. It contains 47,520 individual functions, evenly split between vulnerable and non-vulnerable samples. The functions vary in type, length, and structural characteristics, ensuring diversity across programming constructs and vulnerability categories. The dataset aims to provide a balanced and comprehensive resource for evaluating and training models in vulnerability detection and secure code analysis.
The dataset is located within the data/ directory, which contains all the generated samples and related metadata. The configuration settings and parameters used during the dataset generation process are stored in the file scripts/config/v2_config.json. This configuration file provides a detailed record of the prompt structure, generation rules, and attribute definitions applied, allowing full reproducibility and transparency of the dataset creation process.
The required Python 3 dependencies can be installed using the requirements.txt file.
- Run
`pip3 install -r requirements.txt`
The prompt configuration is expected to be in JSON format.
Refer to src/dataset_generation/config/v2_config.json for the configuration that was used to generate the prompts for the report.
It should contain the following three keys:
```json
{
  "query_intro": ...,
  "requirements": ...,
  "params": ...
}
```
- `query_intro` is the introductory sentence of the query.
- `requirements` holds a list of strings that may contain placeholders for attribute dimensions, enclosed in curly braces (e.g., `{cwe}`).
- `params` defines the attribute dimensions and their corresponding attribute values. It is a dictionary mapping each attribute dimension (e.g., `cwe`) to yet another dictionary, which comprises the following information:
  - `mode`: Either `"product"` or `"sampled"`.
    - `"product"`: The attribute dimension will be expanded when constructing the Cartesian product.
    - `"sampled"`: Sample a fixed number of attribute values for the attribute dimension. Specify `n_samples` to define the number of samples to draw.
  - `dependency` (optional): Can be used to define dependent attribute dimensions, whose options differ depending on the value of some other attribute dimension (e.g., function names depend on the purpose of the function; in that case, `dependency` would be set to `purpose`).
  - `choices`: Defines a list of possible attribute values for the placeholders corresponding to the attribute dimension. For dependent attribute dimensions, instead provide a dictionary mapping the attribute values of the depended-upon attribute dimension to lists of possible values. Again, `func_name` in the `v2_config.json` provides an example.
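As an illustration, a minimal configuration following the structure above might look like this (all attribute values here are hypothetical examples, not taken from `v2_config.json`):

```json
{
  "query_intro": "Generate a C function with the following properties:",
  "requirements": [
    "The function must contain an instance of {cwe}.",
    "The function serves the purpose of {purpose} and is named {func_name}."
  ],
  "params": {
    "cwe": {"mode": "product", "choices": ["CWE-79", "CWE-89"]},
    "purpose": {"mode": "sampled", "n_samples": 1, "choices": ["parsing", "logging"]},
    "func_name": {
      "mode": "product",
      "dependency": "purpose",
      "choices": {
        "parsing": ["parse_input"],
        "logging": ["write_log"]
      }
    }
  }
}
```

Here `cwe` is fully expanded, one `purpose` is drawn at random, and `func_name` is resolved from whichever `purpose` value was chosen.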
- Use `batch_generator.py` to generate a JSONL file that can be handed over to the OpenAI Batch API. The script will also create a corresponding CSV file into which the resulting responses can be integrated later.
- Use `batch_api_handler.py` to access the Batch API. Batches can be scheduled via `-m create`. The results can be retrieved and integrated into said CSV file with `-m integrate`.
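The expansion semantics of `"product"`, `"sampled"`, and `dependency` described above can be sketched as follows. This is a simplified illustration, not the actual implementation in `batch_generator.py`; the `params` dictionary and its attribute values are hypothetical:

```python
import itertools
import random

# Hypothetical params fragment mirroring the config structure described above.
params = {
    "cwe": {"mode": "product", "choices": ["CWE-79", "CWE-89"]},
    "length": {"mode": "sampled", "n_samples": 1, "choices": ["short", "long"]},
    "purpose": {"mode": "product", "choices": ["parsing", "logging"]},
    "func_name": {
        "mode": "product",
        "dependency": "purpose",
        "choices": {"parsing": ["parse_input"], "logging": ["write_log"]},
    },
}

def expand(params, seed=0):
    """Expand 'product' dimensions via Cartesian product; draw n_samples for 'sampled' ones."""
    rng = random.Random(seed)
    # Resolve independent dimensions first; dependent ones are filled in per combination.
    independent = {k: v for k, v in params.items() if "dependency" not in v}
    dims, options = [], []
    for name, spec in independent.items():
        dims.append(name)
        if spec["mode"] == "product":
            options.append(spec["choices"])
        else:  # "sampled": draw a fixed number of values for this dimension
            options.append(rng.sample(spec["choices"], spec["n_samples"]))
    combos = []
    for values in itertools.product(*options):
        combo = dict(zip(dims, values))
        # Dependent dimensions take their options from the depended-upon value
        # (simplified here to the first option for brevity).
        for name, spec in params.items():
            if "dependency" in spec:
                combo[name] = spec["choices"][combo[spec["dependency"]]][0]
        combos.append(combo)
    return combos

combos = expand(params)
print(len(combos))  # 2 CWEs x 1 sampled length x 2 purposes -> 4 combinations
```

Each resulting combination can then be substituted into the `requirements` placeholders to form one prompt for the JSONL batch file.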