SyntVul

SyntVul is a synthetically generated collection of code vulnerabilities, produced with attributed prompts designed to capture a wide range of real-world security patterns. It contains 47,520 individual functions, evenly split between vulnerable and non-vulnerable samples (50% each). The functions vary in type, length, and structural characteristics, ensuring diversity across programming constructs and vulnerability categories. The dataset is intended as a balanced, comprehensive resource for evaluating and training models for vulnerability detection and secure code analysis.

Dataset

The dataset is located in the data/ directory, which contains all generated samples and related metadata. The configuration settings and parameters used during dataset generation are stored in scripts/config/v2_config.json. This file records the prompt structure, generation rules, and attribute definitions that were applied, making the dataset creation process fully reproducible and transparent.
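As a quick sanity check, a dataset in this shape can be loaded and its label balance verified roughly as follows. The file name data/samples.csv and the column name vulnerable are assumptions about the layout, not documented fields:

import pandas as pd

# Hypothetical file and column names; adjust to the actual files in data/.
df = pd.read_csv("data/samples.csv")

print(len(df))                          # expected: 47,520 functions in total
print(df["vulnerable"].value_counts())  # expected: an even 50/50 label split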

Dataset Generation

Setting Up a Virtual Environment

The required Python 3 dependencies are listed in requirements.txt. To install them into a fresh virtual environment:

  • Run python3 -m venv .venv and activate it with source .venv/bin/activate.
  • Run pip install -r requirements.txt.

Prompt Configuration

The prompt configuration is expected to be in JSON format. Refer to src/dataset_generation/config/v2_config.json for the configuration used to generate the prompts for the accompanying report. It should contain the following three keys:

{
    "query_intro": ...,
    "requirements": ...,
    "params": ...
}
  • query_intro is the introductory sentence of the query
  • requirements holds a list of strings that may contain placeholders for attribute dimensions, enclosed in curly braces (e.g., {cwe})
  • params defines the attribute dimensions and their corresponding attribute values. It is a dictionary mapping the attribute dimensions (e.g., cwe) to yet another dictionary, which comprises the following information:
    • mode: Either "product" or "sampled"
      • "product": The attribute dimension will be expanded when constructing the Cartesian product
      • "sampled": Sample a fixed amount of attribute values for the attribute dimension. Specify n_samples to define the number of samples to draw.
    • dependency (optional): Can be used to define dependent attribute dimensions. These will have different options, depending on the value of some other attribute dimension (e.g., function names are dependent on the purpose of the function. In that case, dependency would be set to purpose.)
    • choices: Defines a list of possible attribute values for the placeholders corresponding to the attribute dimension. For dependent attribute dimensions, instead provide a dictionary mapping the attribute values of the depended-upon dimension to lists of possible values; func_name in v2_config.json provides an example. A sketch of how these settings expand into prompts follows after this list.
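To make the params semantics concrete, below is a minimal sketch of how such a configuration could be expanded into prompts. The toy config values and the build_prompts helper are illustrative assumptions, not the repository's actual generator code; only the key names (query_intro, requirements, params, mode, n_samples, dependency, choices) follow the format described above.

import itertools
import random

# Illustrative toy config; values are made up, key names follow the format above.
config = {
    "query_intro": "Write a C function with the following properties:",
    "requirements": [
        "It should contain a vulnerability of type {cwe}.",
        "Its purpose is {purpose} and it is named {func_name}.",
    ],
    "params": {
        "cwe": {"mode": "product", "choices": ["CWE-79", "CWE-89"]},
        "purpose": {"mode": "sampled", "n_samples": 1,
                    "choices": ["parsing input", "logging events"]},
        # Dependent dimension: options vary with the value chosen for "purpose".
        "func_name": {"mode": "product", "dependency": "purpose",
                      "choices": {"parsing input": ["parse_input"],
                                  "logging events": ["log_event"]}},
    },
}

def build_prompts(cfg):
    params = cfg["params"]

    def options(spec, values):
        # Dependent dimensions look up their choices via the parent's value.
        choices = (spec["choices"][values[spec["dependency"]]]
                   if "dependency" in spec else spec["choices"])
        if spec["mode"] == "sampled":
            return random.sample(choices, spec["n_samples"])
        return choices  # "product": keep all values for the Cartesian product

    # Expand independent dimensions first, then dependent ones per combination.
    indep = {d: s for d, s in params.items() if "dependency" not in s}
    dep = {d: s for d, s in params.items() if "dependency" in s}
    for combo in itertools.product(*(options(s, {}) for s in indep.values())):
        values = dict(zip(indep, combo))
        for dep_combo in itertools.product(
                *(options(s, values) for s in dep.values())):
            values.update(zip(dep, dep_combo))
            reqs = " ".join(r.format(**values) for r in cfg["requirements"])
            yield cfg["query_intro"] + " " + reqs

for prompt in build_prompts(config):
    print(prompt)

Each "product" dimension multiplies the number of generated prompts, while "sampled" dimensions keep the total bounded by drawing a fixed-size subset of their values.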

Data Generation via Batch API

  • Use batch_generator.py to generate a JSONL file that can be handed over to the OpenAI Batch API (a sketch of the request format follows below). The script also creates a corresponding CSV file into which the resulting responses can be integrated later.
  • Use batch_api_handler.py to access the Batch API. Batches can be scheduled via -m create; the results can be retrieved and integrated into that CSV file with -m integrate.
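For reference, each line of the JSONL file handed to the OpenAI Batch API is a self-contained request object. The sketch below writes one such line; the custom_id scheme, model name, and prompt text are illustrative assumptions rather than what batch_generator.py actually emits.

import json

# One request per line, following the OpenAI Batch API input format.
# custom_id, model, and the prompt content are illustrative assumptions.
request = {
    "custom_id": "sample-0001",
    "method": "POST",
    "url": "/v1/chat/completions",
    "body": {
        "model": "gpt-4o-mini",
        "messages": [{"role": "user", "content": "Write a C function that ..."}],
    },
}

with open("batch_input.jsonl", "w") as f:
    f.write(json.dumps(request) + "\n")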
