Working prototype for a Small Concept Model (SCM) based on Meta's Large Concept Model (LCM), with a custom embedding decoder for vec-to-text conversion.
At the root of this project, you can find:
train_inversion.ipynb, a notebook that trains an embedding inversion model based on prefix tuning (Figure 1). By default, it trains a PreNet to invert paraphrase-multilingual-MiniLM-L12-v2 sentence-level embeddings.
Figure 1. Scheme of the architecture of the embedding inversion model.
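To make the prefix-tuning idea concrete, here is a minimal sketch: a trainable PreNet maps a frozen sentence embedding to a short sequence of prefix vectors, which are prepended to the input embeddings of a frozen decoder LM that is then asked to reconstruct the original sentence. The decoder choice (GPT-2), dimensions, prefix length, and names below are illustrative assumptions, not the repo's actual configuration.

```python
import torch
import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class PreNet(nn.Module):
    """Maps one sentence embedding to a sequence of prefix embeddings."""
    def __init__(self, emb_dim=384, lm_dim=768, prefix_len=8):
        super().__init__()
        self.prefix_len, self.lm_dim = prefix_len, lm_dim
        self.proj = nn.Sequential(
            nn.Linear(emb_dim, lm_dim * prefix_len),
            nn.Tanh(),
            nn.Linear(lm_dim * prefix_len, lm_dim * prefix_len),
        )

    def forward(self, sent_emb):                           # (batch, emb_dim)
        prefix = self.proj(sent_emb)                       # (batch, prefix_len * lm_dim)
        return prefix.view(-1, self.prefix_len, self.lm_dim)

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
lm = GPT2LMHeadModel.from_pretrained("gpt2")
for p in lm.parameters():                                  # the decoder stays frozen
    p.requires_grad = False

prenet = PreNet()
sent_emb = torch.randn(2, 384)                             # stand-in for MiniLM sentence embeddings
batch = tokenizer(["a sentence.", "another one."], return_tensors="pt", padding=True)

tok_emb = lm.transformer.wte(batch["input_ids"])           # (batch, seq, lm_dim)
inputs = torch.cat([prenet(sent_emb), tok_emb], dim=1)     # prefix + token embeddings

# Labels: ignore the prefix (and padding) positions, reconstruct the sentence tokens.
labels = batch["input_ids"].masked_fill(batch["attention_mask"] == 0, -100)
labels = torch.cat([torch.full((2, prenet.prefix_len), -100), labels], dim=1)
attn = torch.cat([torch.ones(2, prenet.prefix_len, dtype=torch.long),
                  batch["attention_mask"]], dim=1)

loss = lm(inputs_embeds=inputs, attention_mask=attn, labels=labels).loss
loss.backward()                                            # gradients only reach the PreNet
```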
train_scm.ipynb, a notebook that trains the actual autoregressive Small Concept Model (SCM), a decoder-only transformer inspired by Meta's BaseLCM and designed for next-embedding prediction (Figure 2).
For a more faithful, straightforward reproduction of BaseLCM, take a look at this implementation.
Figure 2. High-level scheme of the Small Concept Model (SCM).
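As a rough sketch of what next-embedding prediction looks like in code, the model below runs a causally masked transformer over a sequence of sentence embeddings and regresses the embedding at position t+1 from positions up to t with an MSE loss. Dimensions, layer counts, and names are illustrative assumptions rather than the repo's hyperparameters.

```python
import torch
import torch.nn as nn

class SmallConceptModel(nn.Module):
    def __init__(self, emb_dim=384, d_model=512, n_heads=8, n_layers=6, max_len=16):
        super().__init__()
        self.in_proj = nn.Linear(emb_dim, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           batch_first=True, norm_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.out_proj = nn.Linear(d_model, emb_dim)

    def forward(self, embs):                                   # (batch, seq, emb_dim)
        seq = embs.size(1)
        pos = torch.arange(seq, device=embs.device)
        h = self.in_proj(embs) + self.pos_emb(pos)
        # Causal (additive) mask so position t only attends to positions <= t.
        causal = torch.triu(torch.full((seq, seq), float("-inf"), device=embs.device), diagonal=1)
        h = self.blocks(h, mask=causal)
        return self.out_proj(h)                                # predicted next embeddings

model = SmallConceptModel()
seqs = torch.randn(4, 16, 384)         # 4 training sequences of 16 sentence embeddings each
pred = model(seqs[:, :-1])             # predict embeddings 2..16 from embeddings 1..15
loss = nn.functional.mse_loss(pred, seqs[:, 1:])
loss.backward()
```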
inference_test.ipynb, a notebook where you can run inference using pretrained weights. You can test both the embedding inversion model (trained on 1 million sentences from BookCorpus) and the SCM (trained on 100k sequences of 16 sentences each from BookCorpus).
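A minimal sketch of the inference flow, assuming the SmallConceptModel sketched above: the prompt sentences are encoded with the same SentenceTransformer, the SCM autoregressively appends predicted embeddings, and each generated embedding is finally passed through the inversion model to recover text. The decode_embedding helper mentioned in the final comment is a hypothetical stand-in for the repo's actual decoder.

```python
import torch
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")
scm = SmallConceptModel()              # e.g. the sketch above, with pretrained weights loaded
scm.eval()

prompt = ["He opened the door.", "The room was dark."]

with torch.no_grad():
    embs = torch.tensor(encoder.encode(prompt)).unsqueeze(0)   # (1, n_sentences, 384)
    for _ in range(3):                                          # generate 3 more "concepts"
        next_emb = scm(embs)[:, -1:]                            # last position = predicted next embedding
        embs = torch.cat([embs, next_emb], dim=1)

# Each generated embedding would then go through the inversion model to recover text,
# e.g. sentences = [decode_embedding(e) for e in embs[0, len(prompt):]]
```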
Inside the ./small_concept_model folder, you can find all the modules used to build the models, load and pre-process datasets, and create the full pipeline.
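For illustration, the data preparation these modules perform amounts to something like the sketch below: split raw text into sentences and encode each one into a fixed-size embedding, yielding (16, 384) sequences for SCM training. The function name and chunking choices are illustrative assumptions, not the repo's actual API.

```python
import torch
from nltk.tokenize import sent_tokenize          # requires nltk.download("punkt")
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

def text_to_embedding_sequences(text, seq_len=16):
    """Turn a document into non-overlapping chunks of `seq_len` sentence embeddings."""
    sentences = sent_tokenize(text)
    embs = torch.tensor(encoder.encode(sentences))   # (n_sentences, 384)
    n_chunks = embs.size(0) // seq_len
    return embs[: n_chunks * seq_len].view(n_chunks, seq_len, -1)
```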
This repo also includes a Streamlit app for interacting with the pretrained models and generating text. To use it, run the following from the root of the project:
streamlit run app.py


