# Multilingual Neural Machine Translation (NMT) Dataset for In-context Learning, Finetuning, and Baseline Model Development
- Problem Statement
- Dataset
- Installation
- Configuration
- File Structure
- Usage
- Pending Tasks
- Cite This Work
- References
## Problem Statement

Multilingual Neural Machine Translation (NMT) enables a single model to translate between multiple source and target languages. Traditional approaches use encoder-decoder architectures, while recent work explores Large Language Models (LLMs) for Multilingual Machine Translation (MMT). This project investigates:

- Performance Comparison: Evaluate encoder-decoder MT models against smaller LLMs trained on the same data with a comparable parameter count.
- Context Role Quantification: Analyze the impact of context (number of tokens) on translation quality for both architectures (see the sketch below).
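For the context-quantification question, the evaluation loop could look roughly like the following. This is a minimal sketch, not the project's code: `translate` is a hypothetical helper wrapping either architecture, and `sacrebleu` is assumed to be installed.

```python
# Sketch: quantify the role of context by sweeping the number of in-context
# examples and scoring each setting with corpus-level BLEU.
# `translate` is a hypothetical callable wrapping either the encoder-decoder
# model or the LLM; sacrebleu is assumed to be installed.
import sacrebleu

def context_sweep(translate, src_sentences, ref_sentences,
                  context_sizes=(0, 1, 4, 8)):
    """Return {num_in_context_examples: BLEU score}."""
    scores = {}
    for k in context_sizes:
        hypotheses = translate(src_sentences, num_context_examples=k)
        scores[k] = sacrebleu.corpus_bleu(hypotheses, [ref_sentences]).score
    return scores
```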
## Dataset

The dataset provided includes:

- One-to-One translations
- One-to-Many translations
- Many-to-One translations
- MT Dataset: Contains the data necessary for training and evaluation across the translation scenarios above (an illustrative record layout follows this list).
- Google Drive Link: MT Dataset and Results
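The actual files live in the Drive folder above; purely as an illustration (the records and field names below are hypothetical, not the dataset's real schema), the three scenarios can be pictured as:

```python
# Hypothetical examples of the three translation scenarios (not the real schema),
# using the eng/hin/mar languages from the experiment configuration.
one_to_one = {"src_lang": "eng", "tgt_lang": "hin",
              "src": "How are you?", "tgt": "आप कैसे हैं?"}

one_to_many = {"src_lang": "eng", "src": "How are you?",
               "tgt": {"hin": "आप कैसे हैं?", "mar": "तू कसा आहेस?"}}

many_to_one = {"tgt_lang": "eng", "tgt": "How are you?",
               "src": {"hin": "आप कैसे हैं?", "mar": "तू कसा आहेस?"}}
```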
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/sujaykumarmag/iasnlp.git
  cd iasnlp
  ```

- Create and activate a virtual environment:

  ```bash
  python -m venv venv
  source venv/bin/activate  # On Windows, use `venv\Scripts\activate`
  ```

- Install the required packages:

  ```bash
  pip install -r requirements.txt
  ```
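As a quick post-install sanity check, the checkpoints named in the configuration below can be resolved once. This assumes `requirements.txt` provides `torch` and `transformers` (the exact package list is in the repository):

```python
# Post-install sanity check: confirm the core libraries import and the
# Hugging Face checkpoints referenced in the configuration can be resolved.
import torch
from transformers import AutoTokenizer

for name in ("google/mt5-base", "facebook/xglm-564M"):
    tok = AutoTokenizer.from_pretrained(name)
    print(name, type(tok).__name__)

print("torch", torch.__version__, "| CUDA available:", torch.cuda.is_available())
```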
## Configuration

Configuration parameters are saved in `runs/args.yaml` for each experiment:
```yaml
batchsize: 10
cross_lang: false
direction_order: null
disorder: false
ex_lang: null
experiment: icl
ice_num: 8
lang_order: null
lang_pair: eng-hin
lora_alpha: 16
lora_dropout: 0.05
lr: 0.001
many2one: false
model_name: facebook/xglm-564M
model_type: enc_dec
multi: eng hin mar
numepochs: 5
one2many: false
oracle: false
output_dir: runs/
prompt_template: </E></X>=</Y>
repeat: false
retriever: random
reverse_direction: false
run_all_icl: true
seed: 43
tokenizer_name: google/mt5-base
```
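For reference, a saved configuration can be loaded and the ICL prompt expanded from `prompt_template` roughly as follows. This is a minimal sketch, not the project's prompt-building code: PyYAML is assumed to be available, and `</E>`, `</X>`, `</Y>` are read here as the example, source, and target slots of the template.

```python
# Sketch: load an experiment's saved arguments and expand the ICL prompt
# template, where </E> is filled with in-context example pairs, </X> with
# the source sentence, and </Y> is left for the model to generate.
import yaml  # PyYAML, assumed to be available

with open("runs/args.yaml") as f:
    args = yaml.safe_load(f)

def build_prompt(examples, src_sentence, template=args["prompt_template"]):
    """Fill </E></X>=</Y> with up to ice_num example pairs and the query."""
    demo = "\n".join(
        template.replace("</E>", "").replace("</X>", s).replace("</Y>", t)
        for s, t in examples[: args["ice_num"]]
    )
    return (template.replace("</E>", demo + "\n")
                    .replace("</X>", src_sentence)
                    .replace("</Y>", ""))

# e.g. build_prompt([("Hello.", "नमस्ते।")], "How are you?")
# -> "Hello.=नमस्ते।\nHow are you?="
```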
## File Structure

```
root_directory/
├── src/
│   ├── baselines.py
│   ├── dataset.py
│   ├── decoder_only.py
│   ├── icl.py
│   ├── utils.py
│   ├── mbart/                   # from Hugging Face
│   │   ├── configuration_mbart.py
│   │   ├── modeling_mbart.py
│   │   └── tokenization_mbart.py
│   ├── xlnet/                   # from Hugging Face
│   │   ├── configuration_xlnet.py
│   │   ├── modeling_xlnet.py
│   │   └── tokenization_xlnet.py
│   └── training/
│       ├── normal_train.py
│       └── training.py
├── finetuning/                  # all notebooks were run on Kaggle
├── runs/
│   ├── exp1
│   ├── exp2
│   ├── exp3
│   ├── exp4
│   ├── exp5
│   ├── exp6
│   ├── exp7
│   ├── exp8
│   └── exp9
├── train.py                     # entry point for the program
├── trainall.bash
└── notebooks/
    └── train.ipynb
```
## Usage

Usage of the code is demonstrated in the video `code.mp4`. `train.py` is the entry point for the program (see the file structure above).
## Pending Tasks

- Include finetuning code
- Enhance documentation with more detailed explanations (report in `IASNLP_Project_report.pdf`)
- Add support for GPU training (MPS for Mac and CUDA for NVIDIA)
- Research on the SSA attention method
## References

- Massively Multilingual Neural Machine Translation
- Multilingual Machine Translation with Large Language Models: Empirical Results and Analysis
## Cite This Work

```bibtex
@misc{iasnlp_project,
  author       = {Abhinav P. M. and SujayKumar Reddy M. and Oswald C.},
  title        = {In-context Learning (ICL), Finetuning and Baseline Model Development for Neural Machine Translation},
  year         = {2024},
  publisher    = {GitHub},
  journal      = {GitHub repository},
  howpublished = {\url{https://github.com/sujaykumarmag/iasnlp}},
}
```