This repository contains the replication package for the 2025 DIMVA paper Quantifying and Mitigating the Impact of Obfuscations on Machine-Learning-Based Decompilation Improvement.
The preprocessed datasets are available here. Some scripts in this repository assume that the data directory is unpacked alongside this repository, i.e., into this repository's parent directory.
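For example, assuming the repository is checked out under a directory named replication/ and the unpacked data directory is named data/ (both names are illustrative; use whatever names the checkout and download actually produce), the expected layout is:

parent-directory/
├── replication/   (this repository: DIRTY/, VarBERT/, HexT5/, dedup/, ...)
└── data/          (the unpacked preprocessed datasets)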
The experiments are replicated across three different models: DIRTY, VarBERT, and HexT5.
To reproduce our experiments on each model, follow the instructions under the respective directory.
If you would like to generate your own dataset from scratch, follow the instructions for our automatic compilation-and-obfuscation tool, GHCC-obfuscator. It can be obtained by running
git clone https://github.com/squaresLab/ghcc-obfuscator
Then follow the instructions in that repository to build and run the tool.
Once the binaries are built, decompile them using DIRTY's decompilation framework. To do this, check out the README at DIRTY/dataset-gen. Note that this step requires a copy of IDA Pro.
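As a rough sketch, the invocation looks something like the following; the script name and flag names are assumptions based on our reading of DIRTY's dataset-gen README, which remains the authoritative reference.

cd DIRTY/dataset-gen
# flag names and paths below are assumptions; see the dataset-gen README for the exact interface
python3 generate.py --ida /path/to/idat64 -b /path/to/obfuscated_binaries -o /path/to/decompiler_output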
Preprocess the code with DIRTY's preprocessing scripts. For experiment 1, use DIRTY/dirty/utils/preprocess_exp1; for experiments 3 and 4, use DIRTY/dirty/utils/preprocess (an analogous invocation is sketched after the experiment-1 commands below). (Experiment 2 uses a subset of the data from experiment 1.)
For instance, for experiment 1, use
cd DIRTY/dirty/
python -m utils.preprocess_exp1 /path/to/decompiler_output /path/to/decompiler_output/fnames.txt data/exp1/unobfuscated data/binaries-metadata.json --max=1500000 --unobfuscated-partition --target-size=2482980 --workers 40
python -m utils.preprocess_exp1 /path/to/decompiler_output /path/to/decompiler_output/fnames.txt data/exp1/obfuscated /path/to/binaries-metadata.json --max=1500000 --obfuscated-partition --target-size=2482980 --workers 40 --based-on data/exp1/unobfuscated/
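For experiments 3 and 4, the invocation of utils.preprocess (run from DIRTY/dirty/) is analogous; the command below is only a sketch that mirrors the experiment-1 arguments (the output path data/exp34 is illustrative, and the flags are assumed to carry over from preprocess_exp1; check python -m utils.preprocess --help for the exact interface).

# output path and flags are illustrative; consult the script's --help
python -m utils.preprocess /path/to/decompiler_output /path/to/decompiler_output/fnames.txt data/exp34 --max=1500000 --workers 40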
The HexT5 and VarBERT folders include scripts to convert data from the DIRTY format to the formats that these models expect.
In addition, scripts in all three repositories use deduplication clusters. The ones we generated are in dedup/; you can generate your own with deduplicate.py.
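A hypothetical invocation is shown below; the arguments are illustrative only and not necessarily the script's actual interface, so consult deduplicate.py itself for its real arguments.

# arguments are illustrative; see deduplicate.py for the actual interface
python deduplicate.py /path/to/decompiler_output dedup/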