Quantifying Language Confusion

This is the official repository for our paper "Large Language Models are Easily Confused: A Quantitative Metric, Security Implications and Typological Analysis".

Data and Results

Please refer to Zenodo for the datasets, language graphs, and results.

The data include the following datasets:

i) Raw language graphs;

ii) Language similarities calculated from the language graphs;

iii) MTEI: files from the multilingual inversion attack experiments, and the language confusion entropy calculated from them;

iv) LCB: files from the language confusion benchmark, and the language confusion entropy calculated from them.

The results are aggregated outputs for further analysis:

i) inversion_language_confusion: results from MTEI

ii) prompting_language_confusion: results from LCB
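
The exact definition of language confusion entropy is given in the paper. As a rough illustration only, the sketch below computes plain Shannon entropy over the distribution of languages detected in a set of model outputs. The function name language_entropy and the input format (a list of language codes) are assumptions made for this example and are not the repository's API.

```python
import math
from collections import Counter


def language_entropy(detected_langs):
    """Shannon entropy (in bits) of the empirical language distribution.

    detected_langs: hypothetical input, a list of language codes,
    one per generated line or token (an assumption of this sketch).
    """
    counts = Counter(detected_langs)
    total = sum(counts.values())
    if total == 0:
        # No observations: treat the entropy as zero.
        return 0.0
    # H = sum_l p(l) * log2(1 / p(l)), with p(l) = count(l) / total
    return sum((c / total) * math.log2(total / c) for c in counts.values())


if __name__ == "__main__":
    # All outputs in one language: no confusion under this measure.
    print(language_entropy(["en", "en", "en", "en"]))
    # Outputs split evenly between two languages.
    print(language_entropy(["en", "de", "en", "de"]))
```

Under this simplified measure, outputs in a single language give an entropy of zero, and the value grows as the outputs spread more evenly across more languages.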

Installation

Download the repository (siebeniris/QuantifyingLanguageConfusion) to your local machine.

Create a new conda environment:

conda create -n envlc python=3.12

conda activate envlc

Install PyTorch and the packages listed in requirements.txt:

pip3 install torch torchvision torchaudio

pip install -r requirements.txt

Specifics

To tokenize Japanese, download the UniDic dictionary after installing fugashi[unidic]:

python -m unidic download

Language Confusion Analysis

The analysis code is in src/analysis_language_confusion.
