Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation

A repository containing the original code and models for the paper:
Luca Moroni, Giovanni Puccetti, Pere-Lluís Huguet Cabot, Andrei Stefan Bejgu, Alessio Miaschi, Edoardo Barba, Felice Dell’Orletta, Andrea Esuli, Roberto Navigli. Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation, in Findings of NAACL 2025.
This repository is divided into four parts: adaptation, embedding analysis, train, and evaluation.
Each part is implemented and documented in the respective folder of this repository.
- The Adaptation folder contains the code to reproduce the adaptation of English LLMs to a given target tokenizer (a sketch of one such strategy follows this list).
- The Embedding Analysis folder contains the scripts used to analyze the embedding structure of the adapted models (an example of such an analysis is sketched below).
- The Train folder contains the code and a reference to the library used to train the adapted models.
- The Evaluation folder contains the code and references to the datasets and libraries used to evaluate the adapted models during the continued-training stage (a token-fertility check is also sketched below).
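To make the adaptation step concrete, below is a minimal, hypothetical sketch of one common vocabulary-adaptation strategy: initializing each token of the new tokenizer as the mean of the source model's embeddings for its sub-tokens, in the spirit of Fast Vocabulary Transfer. The model and tokenizer names are placeholders, and the repository's adaptation code may implement different or additional strategies.

```python
# Hypothetical FVT-style sketch: each token of the target tokenizer is
# initialized as the mean of the source model's embeddings for its
# sub-tokens under the source tokenizer. Names below are placeholders,
# not the repository's actual checkpoints.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

source_model = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1")  # placeholder
source_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")           # placeholder
target_tok = AutoTokenizer.from_pretrained("path/to/italian-tokenizer")           # placeholder

src_emb = source_model.get_input_embeddings().weight.detach()
new_emb = torch.empty(len(target_tok), src_emb.shape[1])

for token, idx in target_tok.get_vocab().items():
    # Decompose the new token into source sub-tokens and average their embeddings.
    text = target_tok.convert_tokens_to_string([token])
    sub_ids = source_tok(text, add_special_tokens=False).input_ids
    if sub_ids:
        new_emb[idx] = src_emb[sub_ids].mean(dim=0)
    else:
        new_emb[idx] = src_emb.mean(dim=0)  # fall back to the global mean

# Swap vocabulary and input embeddings on the source model.
# (Untied output embeddings would need the same treatment.)
source_model.resize_token_embeddings(len(target_tok))
source_model.get_input_embeddings().weight.data.copy_(new_emb)
```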
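For the embedding analysis, one typical question is how far the adapted embeddings drift from the original ones for the tokens that the two vocabularies share. The following is an illustrative sketch of such a check, not the repository's actual script; the checkpoint paths are placeholders.

```python
# Illustrative sketch: cosine similarity between original and adapted input
# embeddings for tokens shared by the two vocabularies. Only the embedding
# matrices are needed; full models are loaded here for brevity.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

orig_tok = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # placeholder
adpt_tok = AutoTokenizer.from_pretrained("path/to/adapted-model")      # placeholder
orig_emb = AutoModelForCausalLM.from_pretrained("mistralai/Mistral-7B-v0.1").get_input_embeddings().weight
adpt_emb = AutoModelForCausalLM.from_pretrained("path/to/adapted-model").get_input_embeddings().weight

shared = set(orig_tok.get_vocab()) & set(adpt_tok.get_vocab())
orig_ids = [orig_tok.get_vocab()[t] for t in shared]
adpt_ids = [adpt_tok.get_vocab()[t] for t in shared]

with torch.no_grad():
    sims = torch.nn.functional.cosine_similarity(orig_emb[orig_ids], adpt_emb[adpt_ids], dim=-1)
print(f"{len(shared)} shared tokens, mean cosine similarity {sims.mean():.3f}")
```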
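Finally, the metric in the paper's title, token fertility, is the average number of tokens a tokenizer produces per word: the lower the fertility on Italian text, the shorter the sequences and the cheaper the inference. Below is a minimal sketch of how it can be measured with Hugging Face tokenizers; the checkpoint names are placeholders, and the repository's evaluation scripts may compute it differently.

```python
# Minimal sketch: compare the token fertility (tokens per whitespace word)
# of an original English-centric tokenizer and an adapted one on Italian text.
from transformers import AutoTokenizer

def fertility(tokenizer, texts):
    """Average number of tokens produced per whitespace-separated word."""
    n_tokens = sum(len(tokenizer.tokenize(t)) for t in texts)
    n_words = sum(len(t.split()) for t in texts)
    return n_tokens / n_words

texts = [
    "La volpe marrone salta velocemente sopra il cane pigro.",
    "L'adattamento del vocabolario riduce il numero di token per parola.",
]

original = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-v0.1")  # placeholder
adapted = AutoTokenizer.from_pretrained("path/to/adapted-tokenizer")   # placeholder

print(f"original fertility: {fertility(original, texts):.2f}")
print(f"adapted fertility:  {fertility(adapted, texts):.2f}")
```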
If you use any part of this work, please consider citing the paper as follows:
```bibtex
@inproceedings{moroni2025optimizing,
  title={Optimizing LLMs for Italian: Reducing Token Fertility and Enhancing Efficiency Through Vocabulary Adaptation},
  author={Moroni, Luca and Puccetti, Giovanni and Cabot, Pere-Llu{\'\i}s Huguet and Bejgu, Andrei Stefan and Miaschi, Alessio and Barba, Edoardo and Dell'Orletta, Felice and Esuli, Andrea and Navigli, Roberto},
  booktitle={Findings of the Association for Computational Linguistics: NAACL 2025},
  pages={6646--6660},
  year={2025}
}
```
The data and software are licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0.
We gratefully acknowledge the support of Future AI Research (PNRR MUR project PE0000013-FAIR). Partially financed by the European Union - NextGenerationEU through the Italian Ministry of University and Research under PNRR - PRIN 2022 (2022EPTPJ9) "WEMB: Word Embeddings from Cognitive Linguistics to Language Engineering and back" and by the PNRR project ITSERR (CUP B53C22001770006). We acknowledge the support of the ISCRA project TRAVEL (HP10CY9V7K) for awarding access to the LEONARDO supercomputer, owned by the EuroHPC Joint Undertaking, hosted by CINECA (Italy) and thank Giuseppe Fiameni for his support.