# Multilingual Neural Machine Translation (NMT) Dataset for In-context Learning, Finetuning, and Baseline Model Development
- Problem Statement
- Dataset
- Installation
- Configuration
- File Structure
- Usage
- Pending Tasks
- Cite This Work
- References
## Problem Statement

Multilingual Neural Machine Translation (NMT) enables a single model to translate between multiple source and target languages. Traditional approaches use encoder-decoder architectures, while recent work explores Large Language Models (LLMs) for Multilingual Machine Translation (MMT). This project investigates:

- Performance Comparison: Evaluate encoder-decoder MT models against smaller LLMs trained on the same data with a comparable parameter count.
- Context Role Quantification: Analyze the impact of context (number of tokens) on translation quality for both architectures (see the sketch below).
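For the context-quantification question, the evaluation loop could look roughly like the following. This is a minimal sketch, not the project's code: `translate` is a hypothetical helper wrapping either architecture, and `sacrebleu` is assumed to be installed.

```python
# Sketch: quantify the role of context by sweeping the number of in-context
# examples and scoring each setting with corpus-level BLEU.
# `translate` is a hypothetical callable wrapping either the encoder-decoder
# model or the LLM; sacrebleu is assumed to be installed.
import sacrebleu

def context_sweep(translate, src_sentences, ref_sentences,
                  context_sizes=(0, 1, 4, 8)):
    """Return {num_in_context_examples: BLEU score}."""
    scores = {}
    for k in context_sizes:
        hypotheses = translate(src_sentences, num_context_examples=k)
        scores[k] = sacrebleu.corpus_bleu(hypotheses, [ref_sentences]).score
    return scores
```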
## Dataset

The dataset provided includes:

- One-to-One translations
- One-to-Many translations
- Many-to-One translations
- MT Dataset: Contains the data necessary for training and evaluation across the translation scenarios above (an illustrative record layout follows this list).
- Google Drive Link: MT Dataset and Results
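The actual files live in the Drive folder above; purely as an illustration (the records and field names below are hypothetical, not the dataset's real schema), the three scenarios can be pictured as:

```python
# Hypothetical examples of the three translation scenarios (not the real schema),
# using the eng/hin/mar languages from the experiment configuration.
one_to_one = {"src_lang": "eng", "tgt_lang": "hin",
              "src": "How are you?", "tgt": "आप कैसे हैं?"}

one_to_many = {"src_lang": "eng", "src": "How are you?",
               "tgt": {"hin": "आप कैसे हैं?", "mar": "तू कसा आहेस?"}}

many_to_one = {"tgt_lang": "eng", "tgt": "How are you?",
               "src": {"hin": "आप कैसे हैं?", "mar": "तू कसा आहेस?"}}
```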
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/sujaykumarmag/iasnlp.git
  cd iasnlp
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
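As a quick post-install sanity check, the checkpoints named in the configuration below can be resolved once. This assumes `requirements.txt` provides `torch` and `transformers` (the exact package list is in the repository):

```python
# Post-install sanity check: confirm the core libraries import and the
# Hugging Face checkpoints referenced in the configuration can be resolved.
import torch
from transformers import AutoTokenizer

for name in ("google/mt5-base", "facebook/xglm-564M"):
    tok = AutoTokenizer.from_pretrained(name)
    print(name, type(tok).__name__)

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```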
## Configuration

Configuration parameters are saved in `runs/args.yaml` for each experiment:
```yaml
batchsize: 10
cross_lang: false
direction_order: null
disorder: false
ex_lang: null
experiment: icl
ice_num: 8
lang_order: null
lang_pair: eng-hin
lora_alpha: 16
lora_dropout: 0.05
lr: 0.001
many2one: false
model_name: facebook/xglm-564M
model_type: enc_dec
multi: eng hin mar
numepochs: 5
one2many: false
oracle: false
output_dir: runs/
prompt_template: </E></X>=</Y>
repeat: false
retriever: random
reverse_direction: false
run_all_icl: true
seed: 43
tokenizer_name: google/mt5-base
```
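For reference, a saved configuration can be loaded and the ICL prompt expanded from `prompt_template` roughly as follows. This is a minimal sketch, not the project's prompt-building code: PyYAML is assumed to be available, and `</E>`, `</X>`, `</Y>` are read here as the example, source, and target slots of the template.

```python
# Sketch: load an experiment's saved arguments and expand the ICL prompt
# template, where </E> is filled with in-context example pairs, </X> with
# the source sentence, and </Y> is left for the model to generate.
import yaml  # PyYAML, assumed to be available

with open("runs/args.yaml") as f:
    args = yaml.safe_load(f)

def build_prompt(examples, src_sentence, template=args["prompt_template"]):
    """Fill </E></X>=</Y> with up to ice_num example pairs and the query."""
    demo = "\n".join(
        template.replace("</E>", "").replace("</X>", s).replace("</Y>", t)
        for s, t in examples[: args["ice_num"]]
    )
    return (template.replace("</E>", demo + "\n")
                    .replace("</X>", src_sentence)
                    .replace("</Y>", ""))

# e.g. build_prompt([("Hello.", "नमस्ते।")], "How are you?")
# -> "Hello.=नमस्ते।\nHow are you?="
```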
## File Structure

```
root_directory/
├── src/
│   ├── baselines.py
│   ├── dataset.py
│   ├── decoder_only.py
│   ├── icl.py
│   ├── utils.py
│   ├── mbart/                   # from Hugging Face
│   │   ├── configuration_mbart.py
│   │   ├── modeling_mbart.py
│   │   └── tokenization_mbart.py
│   ├── xlnet/                   # from Hugging Face
│   │   ├── configuration_xlnet.py
│   │   ├── modeling_xlnet.py
│   │   └── tokenization_xlnet.py
│   └── training/
│       ├── normal_train.py
│       └── training.py
├── finetuning/                  # all notebooks were run on Kaggle
├── runs/
│   ├── exp1
│   ├── exp2
│   ├── exp3
│   ├── exp4
│   ├── exp5
│   ├── exp6
│   ├── exp7
│   ├── exp8
│   └── exp9
├── train.py                     # entry point for the program
├── trainall.bash
└── notebooks/
    └── train.ipynb
```
## Usage

Usage of the code is demonstrated in the video `code.mp4`. `train.py` is the entry point for the program (see the file structure above).
## Pending Tasks

- Include finetuning code
- Enhance documentation with more detailed explanations (report in `IASNLP_Project_report.pdf`)
- Add support for GPU training (MPS for Mac and CUDA for NVIDIA)
- Research on the SSA attention method
## References

- Massively Multilingual Neural Machine Translation
- Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis
## Cite This Work

```bibtex
@misc{iasnlp_project,
  author       = {Abhinav P. M. and SujayKumar Reddy M. and Oswald C.},
  title        = {In-context Learning (ICL), Finetuning and Baseline Model Development for Neural Machine Translation},
  year         = {2024},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/sujaykumarmag/iasnlp}},
}
```