This repository contains the code for preparing and processing the data and for replicating the results of the paper "Iterative Multilingual Spectral Attribute Erasure".
# Create a new conda environment with Python:
conda create -n msal python=3.10
# activate the environment
conda activate msal
# Install the required packages:
pip install -r requirements.txt
# Install spacy
conda install -c conda-forge spacy
# Install PyTorch with CUDA 11.8:
conda install pytorch torchvision torchaudio pytorch-cuda=11.8 -c pytorch -c nvidia
# Or with CUDA 12.1:
conda install pytorch torchvision torchaudio pytorch-cuda=12.1 -c pytorch -c nvidia
# Or install the CUDA 11.8 wheel with pip:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu118
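As a quick optional check (not part of the repository's scripts), you can confirm that the installed PyTorch build can see a GPU before launching any batch jobs:
# Optional sanity check: confirm the CUDA-enabled PyTorch build detects a GPU.
import torch
print(torch.__version__)
print("CUDA available:", torch.cuda.is_available())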
# Alternatively, create the environment using the file:
conda env create -f environment.yml
./requirements_spacy.sh
Before running any scripts, you need to configure the environment variables and paths in the configuration files:
- Open batch_jobs/_experiment_configuration.sh and update the path variables to match your directory structure.
- Modify the batch job scripts in csd3_gpu.sh (or similar files) to match your computing environment:
  - Update the environment name
  - Adjust the sbatch parameters (CPU/GPU requirements, memory, time limits)
  - Change queue names and partition settings as needed for your cluster
This configuration step is crucial to ensure all scripts run correctly in your specific computing environment.
Download the SEFair dataset and save it in the directory data/mSEFair/raw using this link:
https://drive.google.com/drive/folders/19_H3CxVU6wObDj1Z1ZLbyr_B2rYSIbiP?usp=sharing
Encode the data
# Encode the text
./batch_jobs/data/mSEFair_encode.sh
# Our method: IMSAE
./batch_jobs/mSEFair_compute_sal.sh
./batch_jobs/mSEFair_evaluation_sal.sh
# Baseline: INLP
./batch_jobs/mSEFair_compute_inlp.sh
./batch_jobs/mSEFair_evaluation_inlp.sh
# Baseline: SentenceDebias
./batch_jobs/mSEFair_compute_sentencedebias.sh
./batch_jobs/mSEFair_evaluation_sentencedebias.sh
# Export the results to Overleaf tables
./batch_jobs/print-results/mSEFair_export_results.sh
During dataset processing, we generate two mapping files for the SEFair dataset:
- helpfulness2index.txt: Maps helpfulness labels to numerical indices
- reputation2index.txt: Maps reputation labels to numerical indices
These files are stored in the directory data/${experiment}/raw/${target_language}/.
Each file uses a simple dictionary-like format in which each line maps a text label to its corresponding numerical index. For example:
helpful 1
helpless 0
This mapping allows for efficient encoding of categorical labels during model training and evaluation.
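If you need to read these mappings outside the provided scripts, here is a minimal sketch. The helper name load_label2index and the "en" language code are illustrative, and it assumes the whitespace-separated "label index" format shown above:
# Minimal sketch: load a *2index.txt mapping file into a dict.
# Assumes one "label index" pair per line, separated by whitespace.
def load_label2index(path):
    label2index = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            label, index = line.rsplit(maxsplit=1)
            label2index[label] = int(index)
    return label2index

target_language = "en"  # placeholder language code
helpfulness2index = load_label2index(f"data/mSEFair/raw/{target_language}/helpfulness2index.txt")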
This section contains scripts for processing and preparing the multilingual BiasBios dataset for bias analysis. Follow these steps to process and encode the raw data.
First, process the raw text files:
./batch_jobs/data/mBiasBios_create_dataset.sh
The processed outputs are saved in language-specific directories:
data/multilingualBiasBios/raw/{LANGUAGE}/post-processing/
For example, German data is stored in data/multilingualBiasBios/raw/de/post-processing/. You can access datasets for other languages by replacing de with the desired language code.
Each language directory contains the following files:
- train/dev/test.pickle: Dataset splits containing processed examples
- gender2index.txt: Mapping of gender labels to numerical indices
- profession2index.txt: Mapping of profession labels to numerical indices
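As a quick way to confirm that a language directory was produced as expected, here is a small sketch. It assumes the splits are stored as train.pickle, dev.pickle, and test.pickle in the layout described above; the language code is illustrative:
from pathlib import Path

lang = "de"  # any processed language code
base = Path("data/multilingualBiasBios/raw") / lang / "post-processing"
expected = ["train.pickle", "dev.pickle", "test.pickle",
            "gender2index.txt", "profession2index.txt"]
for name in expected:
    status = "found" if (base / name).exists() else "missing"
    print(f"{name}: {status}")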
Each data point in the pickle files contains the following fields:
{
'g': 'm', # Gender ('m' or 'f')
'p': 'journalist', # Profession
'text': '...', # Full biography text
'start': 73, # Position where the hard text starts
'hard_text': '...', # Text without profession mention
'text_without_gender': '...', # Text with gender pronouns replaced with '_'
'hard_text_tokenized': '...' # Tokenized version of hard_text
}
For example, a data point from the German split looks like this:
{
'g': 'm',
'p': 'journalist',
'text': 'Frank Hofmann, Philosoph und Theologe, ist leidenschaftlicher Journalist. Seine Erfahrungen sammelte er bei so unterschiedlichen Medien wie auto motor und sport, stern und Men's Health. Zuletzt verantwortete er die inhaltliche Arbeit des Vereins Andere Zeiten in Hamburg und erweiterte das publizistische Angebot um neue Medienkanäle, Magazine und Events.',
'start': 73,
'hard_text': 'Seine Erfahrungen sammelte er bei so unterschiedlichen Medien wie auto motor und sport, stern und Men's Health. Zuletzt verantwortete er die inhaltliche Arbeit des Vereins Andere Zeiten in Hamburg und erweiterte das publizistische Angebot um neue Medienkanäle, Magazine und Events.',
'text_without_gender': '_ Erfahrungen sammelte _ bei so unterschiedlichen Medien wie auto motor und sport, stern und Men's Health. Zuletzt verantwortete _ die inhaltliche Arbeit des Vereins Andere Zeiten in Hamburg und erweiterte das publizistische Angebot um neue Medienkanäle, Magazine und Events.',
'hard_text_tokenized': 'Seine Erfahrungen sammelte er bei so unterschiedlichen Medien wie auto motor und sport , stern und Men 's Health . Zuletzt verantwortete er die inhaltliche Arbeit des Vereins Andere Zeiten in Hamburg und erweiterte das publizistische Angebot um neue Medienkanäle , Magazine und Events .'
}
Encode the processed text using different language models:
./batch_jobs/data/mBiasBios_encode.sh
For the English training set, we recommend encoding by split due to its large size:
# After running the encoding script for English
python src/data/mBiasBios_combine_encoded_representation.py
Perform final post-processing on the encoded data:
./batch_jobs/data/mBiasBios_post_process_data.sh
After completing these steps, you can use the processed and encoded data for bias analysis and model training with the language model representations of your choice (Llama, Mistral, or MBERT).
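If you want to inspect the processed data directly, here is a minimal sketch. It assumes each split pickle deserializes to a list of dictionaries with the fields documented above; the language code is illustrative:
import pickle
from pathlib import Path

lang = "de"  # illustrative language code
split_path = Path("data/multilingualBiasBios/raw") / lang / "post-processing" / "train.pickle"

with open(split_path, "rb") as f:
    train = pickle.load(f)

example = train[0]
print(example["p"], example["g"])   # profession and gender labels
print(example["hard_text"][:100])   # biography text without the profession mention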
After creating the model representations and storing them in data/multilingualBiasBios/{Language-Model-Name}, you can run the following experiments:
# Run IMSAE computations
./batch_jobs/mBiasBios_compute_sal.sh
# Evaluate IMSAE performance
./batch_jobs/mBiasBios_evaluation_sal.sh
# Compute INLP results
./batch_jobs/mBiasBios_compute_inlp.sh
# Evaluate INLP performance
./batch_jobs/mBiasBios_evaluation_inlp.sh
# Compute SentenceDebias results
./batch_jobs/mBiasBios_compute_sentencedebias.sh
# Evaluate SentenceDebias performance
./batch_jobs/mBiasBios_evaluation_sentencedebias.sh
Generate formatted tables for publication:
# Export results to LaTeX tables for Overleaf
./batch_jobs/print-results/mBiasBios_export_results.sh
The scripts in batch_jobs/print-results/ export experiment results into LaTeX tables with different grouping methods.
# Export results in row-wise format
./batch_jobs/print-results/mBiasBios_export_results.sh
This section contains scripts for processing, analyzing, and evaluating bias mitigation methods on multilingual hate speech datasets.
Download the dataset from Google Cloud:
[Link to be added]
Convert the language names in the code from full names to ISO language codes:
# Change this:
languages = ['English', 'Italian', 'Polish', 'Portuguese', 'Spanish']
# To this:
languages = ['en', 'it', 'pl', 'pt', 'es']
Run the preprocessing notebook to prepare the data for analysis:
preprocessing.ipynb
Encode the preprocessed text using language models:
./batch_jobs/data/mHateSpeech_encode.sh
# Compute IMSAE results
./batch_jobs/mHateSpeech_compute_sal.sh
# Evaluate IMSAE performance
./batch_jobs/mHateSpeech_evaluation_sal.sh
# Compute INLP results
./batch_jobs/mHateSpeech_compute_inlp.sh
# Evaluate INLP performance
./batch_jobs/mHateSpeech_evaluation_inlp.sh
# Compute SentenceDebias results
./batch_jobs/mHateSpeech_compute_sentencedebias.sh
# Evaluate SentenceDebias performance
./batch_jobs/mHateSpeech_evaluation_sentencedebias.sh
Generate formatted tables for publication:
./batch_jobs/print-results/mHateSpeech_export_results.sh
After running the encoding and experiment scripts, the results are organized in a structured format that supports analysis and comparison of the different bias mitigation methods across the five languages (English, Italian, Polish, Portuguese, and Spanish).