Replication package for the 2025 DIMVA paper "Quantifying and Mitigating the Impact of Obfuscations on Machine-Learning-Based Decompilation Improvement"

squaresLab/ML-Decompilation-Obfuscation

Introduction

This repository contains the replication package for the 2025 DIMVA paper Quantifying and Mitigating the Impact of Obfuscations on Machine-Learning-Based Decompilation Improvement.

The preprocessed datasets are available here. Some scripts in this repository assume that the data directory is unpacked into the same directory that this repository is in.
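The layout assumption above can be checked before running any scripts. The following is a minimal sketch, not part of the repository: the helper name `has_sibling` is hypothetical, and the name of the unpacked data directory is not stated here, so it is passed in as an argument.

```shell
#!/bin/sh
# Sketch only: has_sibling is a hypothetical helper, not a script from this repository.
set -eu

# has_sibling REPO_DIR NAME
# Succeeds if a directory called NAME exists next to REPO_DIR,
# i.e. inside the directory that contains the repository checkout.
has_sibling() {
  hs_parent="$(cd "$1/.." && pwd)"
  [ -d "$hs_parent/$2" ]
}

# Example usage, run from the repository root ("data" is an assumed name):
#   has_sibling "$(pwd)" data || echo "data directory not found next to the repository" >&2
```

Defining the check as a function keeps it side-effect free, so it can be sourced from other setup scripts.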

The experiments are replicated across three different models: DIRTY, VarBERT, and HexT5.

To reproduce our experiments on each model, follow the instructions under the respective directory.

Generating a New Dataset

If you would like to generate your own dataset from scratch, use our automatic compilation-and-obfuscation tool, GHCC-obfuscator. It can be obtained by running

git clone https://github.com/squaresLab/ghcc-obfuscator

Then follow the instructions in that repository to build and run the tool.

Once the binaries are built, decompile them using DIRTY's decompilation framework. To do this, check out the README at DIRTY/dataset-gen. Note that this implicitly requires a copy of IDA Pro.

Preprocess the code with DIRTY's preprocessing scripts. For experiment 1, use DIRTY/dirty/utils/preprocess_exp1. For experiments 3 and 4, use DIRTY/dirty/utils/preprocess. (Experiment 2 uses a subset of the data from experiment 1.)

For instance, for experiment 1, use

cd DIRTY/dirty/
python -m utils.preprocess_exp1 /path/to/decompiler_output /path/to/decompiler_output/fnames.txt data/exp1/unobfuscated data/binaries-metadata.json --max=1500000 --unobfuscated-partition --target-size=2482980 --workers 40

python -m utils.preprocess_exp1 /path/to/decompiler_output /path/to/decompiler_output/fnames.txt data/exp1/obfuscated /path/to/binaries-metadata.json --max=1500000 --obfuscated-partition --target-size=2482980 --workers 40 --based-on data/exp1/unobfuscated/
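The two experiment 1 invocations share most of their arguments, so a small wrapper can keep them in sync. The sketch below is illustrative only: the variable names (`DECOMPILER_OUT`, `METADATA`, `OUT_ROOT`, `DRY_RUN`) and the helper functions are hypothetical, and it assumes both partitions read the same metadata file. All flags are copied from the commands above. By default it only prints the commands; set `DRY_RUN=0` and run it from `DIRTY/dirty/` to execute them.

```shell
#!/bin/sh
# Sketch only: variable and function names are hypothetical, not part of the repository.
set -eu

DECOMPILER_OUT="${DECOMPILER_OUT:-/path/to/decompiler_output}"
METADATA="${METADATA:-/path/to/binaries-metadata.json}"   # assumed identical for both partitions
OUT_ROOT="${OUT_ROOT:-data/exp1}"
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# preprocess PARTITION [EXTRA_ARGS...]
# PARTITION is "unobfuscated" or "obfuscated"; extra arguments are appended as given.
preprocess() {
  partition="$1"
  shift
  run python -m utils.preprocess_exp1 \
    "$DECOMPILER_OUT" "$DECOMPILER_OUT/fnames.txt" \
    "$OUT_ROOT/$partition" "$METADATA" \
    --max=1500000 "--${partition}-partition" \
    --target-size=2482980 --workers 40 "$@"
}

# The obfuscated partition is built on top of the unobfuscated one, so order matters.
preprocess unobfuscated
preprocess obfuscated --based-on "$OUT_ROOT/unobfuscated/"
```

Reviewing the dry-run output before executing is a cheap way to catch a mistyped path, since the real preprocessing run uses 40 workers and can take a long time.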

The HexT5 and VarBERT folders include scripts to convert data from the DIRTY format to the formats that these models expect.

In addition, scripts in all three repositories use deduplication clusters. The ones we generated are in dedup/; you can generate your own with deduplicate.py.
