
Quantifying Language Confusion

This is the official repository for our paper Large Language Models are Easily Confused: A Quantitative Metric, Security Implications and Typological Analysis.

Data and Results

Please refer to Zenodo for the datasets, language graphs, and results:

DATA includes the following:

i) Raw Language Graphs;

ii) Language Similarities calculated from the Language Graphs;

iii) MTEI: results of the multilingual inversion attack experiments and the language confusion entropy calculated from them;

iv) LCB: files from the Language Confusion Benchmark and the language confusion entropy calculated from them.

Results contains the aggregated results used for further analysis:

i) inversion_language_confusion: results from MTEI

ii) prompting_language_confusion: results from LCB

Installation

  1. Clone the repository to your local machine:
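
For example, assuming the GitHub path siebeniris/QuantifyingLanguageConfusion for this repository:

git clone https://github.com/siebeniris/QuantifyingLanguageConfusion.git
cd QuantifyingLanguageConfusion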

  2. Create a new conda environment:

conda create -n envlc python==3.12

conda activate envlc

  3. Install PyTorch and the packages from requirements:

pip3 install torch torchvision torchaudio

pip install -r requirements.txt

  4. Specifics
  • To tokenize Japanese, after installing fugashi[unidic], download the UniDic dictionary:

python -m unidic download
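
To check that the Japanese setup works, here is a minimal sketch using fugashi's Tagger; it assumes fugashi[unidic] is installed and the UniDic dictionary has been downloaded as above, and the sample sentence is illustrative only:

from fugashi import Tagger

# Tokenize a Japanese sentence with the UniDic dictionary downloaded above.
tagger = Tagger()
text = "大規模言語モデルは混乱しやすい。"  # illustrative sentence
print([word.surface for word in tagger(text)])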

Language Confusion Analysis

The analysis code is in src/analysis_language_confusion.
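
For orientation only, the sketch below shows one simple way to quantify confusion as the Shannon entropy of the distribution of languages detected in a model's outputs. It is an illustrative approximation with an assumed input format (a list of language codes) and a hypothetical function name, not the paper's exact language confusion entropy; see the paper and the code in src/analysis_language_confusion for the actual metric.

import math
from collections import Counter

def output_language_entropy(detected_languages):
    # Shannon entropy (in bits) of the empirical distribution of language
    # labels detected in a model's outputs. Higher entropy means the
    # outputs are spread over more languages.
    counts = Counter(detected_languages)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

# Example: outputs expected in German, with some drifting into English and French.
print(output_language_entropy(["de", "de", "en", "de", "fr"]))  # ~1.37 bits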
