In this paper, we propose MTGRec, which leverages Multi-identifier item Tokenization to augment token sequence data for Generative Recommender pre-training. Specifically, our approach makes two key contributions: multi-identifier item tokenization and curriculum recommender pre-training. For multi-identifier item tokenization, we adopt the Residual-Quantized Variational AutoEncoder (RQ-VAE) as the backbone of item tokenizers and consider model checkpoints from adjacent epochs as semantically relevant tokenizers. This enables us to associate each item with multiple identifiers and tokenize a single item interaction sequence into several token sequences as different data groups. For curriculum recommender pre-training, we design a data curriculum scheme through data influence estimation. During recommender pre-training, we dynamically adjust the sampling probability of each data group according to the influence of the data from each item tokenizer, where the influence estimation is achieved via first-order gradient approximation. Finally, we fine-tune the pre-trained model using a single item identifier to ensure accurate item identification during recommendation.
torch==2.4.1+cu124
transformers==4.45.2
accelerate==1.0.1
Train RQ-VAE and generate item semantic IDs:
cd tokenizer
bash run.sh
Pre-train recommender:
bash pretrain.sh
Finetune recommender:
bash finetune.sh