Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation

A repository containing the original code and models for the paper:
Luca Moroni, Giovanni Puccetti, Pere-Lluís Huguet Cabot, Andrei Stefan Bejgu, Alessio Miaschi, Edoardo Barba, Felice Dell’Orletta, Andrea Esuli, Roberto Navigli. Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation, in Findings of NAACL 2025.
This repository is divided into four parts: adaptation, embedding analysis, train, and evaluation.
Each part is implemented and documented in the respective folder of this repository.
- The Adaptation folder contains the code to reproduce the adaptation of English LLMs to a given target tokenizer (a sketch of one such strategy follows this list).
- The Embedding Analysis folder contains the scripts used to analyze the embedding structure of the adapted models (an example of such an analysis is sketched below).
- The Train folder contains the code and a reference to the library used to train the adapted models.
- The Evaluation folder contains the code and references to the datasets and libraries used to evaluate the adapted models during the continued-training stage (a token-fertility check is also sketched below).
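To make the adaptation step concrete, below is a minimal, hypothetical sketch of one common vocabulary-adaptation strategy: initializing each token of the new tokenizer as the mean of the source model's embeddings for its sub-tokens, in the spirit of Fast Vocabulary Transfer. The model and tokenizer names are placeholders, and the repository's adaptation code may implement different or additional strategies.

```python
# Hypothetical FVT-style sketch: each token of the target tokenizer is
# initialized as the mean of the source model's embeddings for its
# sub-tokens under the source tokenizer. Names below are placeholders,
# not the repository's actual checkpoints.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

source_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")  # placeholder
source_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")           # placeholder
target_tok = AutoTokenizer.from_pretrained("path/to/italian-tokenizer")           # placeholder

src_emb = source_model.get_input_embeddings().weight.detach()
new_emb = torch.empty(len(target_tok), src_emb.shape[1])

for token, idx in target_tok.get_vocab().items():
    # Decompose the new token into source sub-tokens and average their embeddings.
    text = target_tok.convert_tokens_to_string([token])
    sub_ids = source_tok(text, add_special_tokens=False).input_ids
    if sub_ids:
        new_emb[idx] = src_emb[sub_ids].mean(dim=0)
    else:
        new_emb[idx] = src_emb.mean(dim=0)  # fall back to the global mean

# Swap vocabulary and input embeddings on the source model.
# (Untied output embeddings would need the same treatment.)
source_model.resize_token_embeddings(len(target_tok))
source_model.get_input_embeddings().weight.data.copy_(new_emb)
```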
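For the embedding analysis, one typical question is how far the adapted embeddings drift from the original ones for the tokens that the two vocabularies share. The following is an illustrative sketch of such a check, not the repository's actual script; the checkpoint paths are placeholders.

```python
# Illustrative sketch: cosine similarity between original and adapted input
# embeddings for tokens shared by the two vocabularies. Only the embedding
# matrices are needed; full models are loaded here for brevity.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

orig_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # placeholder
adpt_tok = AutoTokenizer.from_pretrained("path/to/adapted-model")      # placeholder
orig_emb = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1").get_input_embeddings().weight
adpt_emb = AutoModelForCausalLM.from_pretrained("path/to/adapted-model").get_input_embeddings().weight

shared = set(orig_tok.get_vocab()) & set(adpt_tok.get_vocab())
orig_ids = [orig_tok.get_vocab()[t] for t in shared]
adpt_ids = [adpt_tok.get_vocab()[t] for t in shared]

with torch.no_grad():
    sims = torch.nn.functional.cosine_similarity(orig_emb[orig_ids], adpt_emb[adpt_ids], dim=-1)
print(f"{len(shared)} shared tokens, mean cosine similarity {sims.mean():.3f}")
```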
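Finally, the metric in the paper's title, token fertility, is the average number of tokens a tokenizer produces per word: the lower the fertility on Italian text, the shorter the sequences and the cheaper the inference. Below is a minimal sketch of how it can be measured with Hugging Face tokenizers; the checkpoint names are placeholders, and the repository's evaluation scripts may compute it differently.

```python
# Minimal sketch: compare the token fertility (tokens per whitespace word)
# of an original English-centric tokenizer and an adapted one on Italian text.
from transformers import AutoTokenizer

def fertility(tokenizer, texts):
    """Average number of tokens produced per whitespace-separated word."""
    n_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

texts = [
    "La volpe marrone salta velocemente sopra il cane pigro.",
    "L'adattamento del vocabolario riduce il numero di token per parola.",
]

original = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # placeholder
adapted = AutoTokenizer.from_pretrained("path/to/adapted-tokenizer")   # placeholder

print(f"original fertility: {fertility(original, texts):.2f}")
print(f"adapted fertility:  {fertility(adapted, texts):.2f}")
```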
If you use any part of this work, please consider citing the paper as follows:
```bibtex
@inproceedings{moroni2025optimizing,
  title={Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation},
  author={Moroni, Luca and Puccetti, Giovanni and Cabot, Pere-Llu{\'\i}s Huguet and Bejgu, Andrei Stefan and Miaschi, Alessio and Barba, Edoardo and Dell'Orletta, Felice and Esuli, Andrea and Navigli, Roberto},
  booktitle={Findings of the Association for Computational Linguistics: NAACL 2025},
  pages={6646--6660},
  year={2025}
}
```
The data and software are licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0.
We gratefully acknowledge the support of Future AI Research (PNRR MUR project PE0000013-FAIR). Partially financed by the European Union - NextGenerationEU through the Italian Ministry of University and Research under PNRR - PRIN 2022 (2022EPTPJ9) "WEMB: Word Embeddings from Cognitive Linguistics to Language Engineering and back" and by the PNRR project ITSERR (CUP B53C22001770006). We acknowledge the support of the ISCRA project TRAVEL (HP10CY9V7K) for awarding access to the LEONARDO supercomputer, owned by the EuroHPC Joint Undertaking, hosted by CINECA (Italy) and thank Giuseppe Fiameni for his support.