
Repo for the NAACL 2025 paper "Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation"


SapienzaNLP/sava


Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation



A repository containing the original code and models for the paper:

Luca Moroni, Giovanni Puccetti, Pere-Lluís Huguet Cabot, Andrei Stefan Bejgu, Alessio Miaschi, Edoardo Barba, Felice Dell’Orletta, Andrea Esuli, Roberto Navigli. Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation, in Findings of NAACL 2025.

Usage

This repository is divided into four parts: adaptation, embedding analysis, train, and evaluation.

Each part is implemented and documented in the respective folder of this repository.

  • The Adaptation part contains the code to reproduce the adaptation of English LLMs to a given tokenizer.
  • The Embedding Analysis part contains the script used to analyze the embedding structure of the adapted models.
  • The Train folder contains the code and references to the library used to train the adapted models.
  • The Evaluation folder contains the code and references to the dataset and libraries used to evaluate the adapted models during the subsequent stage of training.
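
As a quick illustration of the metric named in the paper title, token fertility is commonly measured as the average number of tokens a tokenizer produces per word. The snippet below is a minimal sketch of that measurement and is not taken from this repository: the function names `token_fertility` and `toy_chunk_tokenizer` are hypothetical, words are assumed to be whitespace-separated, and the toy tokenizer only stands in for a real subword tokenizer.

```python
from typing import Callable, Iterable, List


def token_fertility(texts: Iterable[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens produced per whitespace-separated word.

    Assumption: words are delimited by whitespace, which is a simplification.
    """
    n_words = 0
    n_tokens = 0
    for text in texts:
        n_words += len(text.split())
        n_tokens += len(tokenize(text))
    if n_words == 0:
        raise ValueError("no words found in the input texts")
    return n_tokens / n_words


def toy_chunk_tokenizer(text: str, size: int = 3) -> List[str]:
    """Hypothetical stand-in tokenizer: cuts each word into chunks of `size` characters."""
    return [
        word[i:i + size]
        for word in text.split()
        for i in range(0, len(word), size)
    ]


if __name__ == "__main__":
    sample = ["la casa grande"]
    # Lower fertility means fewer tokens per word, hence shorter sequences.
    print(token_fertility(sample, toy_chunk_tokenizer))
```

Any callable that maps a string to a list of tokens can be passed as `tokenize`, so the same function could be used to compare an original and an adapted tokenizer on the same Italian text.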

Cite this work

If you use any part of this work, please consider citing the paper as follows:

@inproceedings{moroni2025optimizing,
  title={Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation},
  author={Moroni, Luca and Puccetti, Giovanni and Cabot, Pere-Llu{\'\i}s Huguet and Bejgu, Andrei Stefan and Miaschi, Alessio and Barba, Edoardo and Dell’Orletta, Felice and Esuli, Andrea and Navigli, Roberto},
  booktitle={Findings of the Association for Computational Linguistics: NAACL 2025},
  pages={6646--6660},
  year={2025}
}

🪪 License

The data and software are licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0.

Acknowledgements

We gratefully acknowledge the support of Future AI Research (PNRR MUR project PE0000013-FAIR). This work was partially financed by the European Union - NextGenerationEU through the Italian Ministry of University and Research under PNRR - PRIN 2022 (2022EPTPJ9) "WEMB: Word Embeddings from Cognitive Linguistics to Language Engineering and back" and by the PNRR project ITSERR (CUP B53C22001770006). We acknowledge the support of the ISCRA project TRAVEL (HP10CY9V7K) for awarding access to the LEONARDO supercomputer, owned by the EuroHPC Joint Undertaking and hosted by CINECA (Italy), and we thank Giuseppe Fiameni for his support.
