This repository contains the replication package for the 2025 DIMVA paper Quantifying and Mitigating the Impact of Obfuscations on Machine-Learning-Based Decompilation Improvement.
The preprocessed datasets are available here. Some scripts in this repository assume that the data directory is unpacked alongside this repository, i.e., into this repository's parent directory.
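For example, assuming the repository is checked out under a directory named replication/ and the unpacked data directory is named data/ (both names are illustrative; use whatever names the checkout and download actually produce), the expected layout is:

parent-directory/
├── replication/   (this repository: DIRTY/, VarBERT/, HexT5/, dedup/, ...)
└── data/          (the unpacked preprocessed datasets)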
The experiments are replicated across three different models: DIRTY, VarBERT, and HexT5.
To reproduce our experiments on each model, follow the instructions under the respective directory.
If you would like to generate your own dataset from scratch, follow the instructions for our automatic compilation-and-obfuscation tool, GHCC-obfuscator. It can be obtained by running
git clone https://github.com/squaresLab/ghcc-obfuscator
Then follow the instructions in that repository to build and run the tool.
Once the binaries are built, decompile them using DIRTY's decompilation framework. To do this, check out the README at DIRTY/dataset-gen. Note that this step requires a copy of IDA Pro.
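As a rough sketch, the invocation looks something like the following; the script name and flag names are assumptions based on our reading of DIRTY's dataset-gen README, which remains the authoritative reference.

cd DIRTY/dataset-gen
# flag names and paths below are assumptions; see the dataset-gen README for the exact interface
python3 generate.py --ida /path/to/idat64 -b /path/to/obfuscated_binaries -o /path/to/decompiler_output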
Preprocess the code with DIRTY's preprocessing scripts. For experiment 1, use DIRTY/dirty/utils/preprocess_exp1; for experiments 3 and 4, use DIRTY/dirty/utils/preprocess (an analogous invocation is sketched after the experiment-1 commands below). (Experiment 2 uses a subset of the data from experiment 1.)
For instance, for experiment 1, use
cd DIRTY/dirty/
python -m utils.preprocess_exp1 /path/to/decompiler_output /path/to/decompiler_output/fnames.txt data/exp1/unobfuscated data/binaries-metadata.json --max=1500000 --unobfuscated-partition --target-size=2482980 --workers 40
python -m utils.preprocess_exp1 /path/to/decompiler_output /path/to/decompiler_output/fnames.txt data/exp1/obfuscated /path/to/binaries-metadata.json --max=1500000 --obfuscated-partition --target-size=2482980 --workers 40 --based-on data/exp1/unobfuscated/
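For experiments 3 and 4, the invocation of utils.preprocess (run from DIRTY/dirty/) is analogous; the command below is only a sketch that mirrors the experiment-1 arguments (the output path data/exp34 is illustrative, and the flags are assumed to carry over from preprocess_exp1; check python -m utils.preprocess --help for the exact interface).

# output path and flags are illustrative; consult the script's --help
python -m utils.preprocess /path/to/decompiler_output /path/to/decompiler_output/fnames.txt data/exp34 --max=1500000 --workers 40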
The HexT5 and VarBERT folders include scripts to convert data from the DIRTY format to the formats that these models expect.
In addition, scripts in all three repositories use deduplication clusters. The ones we generated are in dedup/; you can generate your own with deduplicate.py.
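A hypothetical invocation is shown below; the arguments are illustrative only and not necessarily the script's actual interface, so consult deduplicate.py itself for its real arguments.

# arguments are illustrative; see deduplicate.py for the actual interface
python deduplicate.py /path/to/decompiler_output dedup/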