SCALAR is a part-of-speech tagger for source code identifiers. It supports two model types:
- DistilBERT-based model with CRF layer (Recommended: faster, more accurate)
- Legacy Gradient Boosting model (for compatibility)
Make sure you have python3.12
installed. Then:
git clone https://github.com/SCANL/scanl_tagger.git
cd scanl_tagger
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt
You can run SCALAR in multiple ways:
python main --mode run --model_type lm_based # DistilBERT (recommended)
python main --mode run --model_type tree_based # Legacy model
Then query like:
http://127.0.0.1:8080/GetValue/FUNCTION
Supports context types:
- FUNCTION
- CLASS
- ATTRIBUTE
- DECLARATION
- PARAMETER
You can retrain either model (default parameters are currently hardcoded):
python main --mode train --model_type lm_based
python main --mode train --model_type tree_based
Metric | Score |
---|---|
Macro F1 | 0.9032 |
Token Accuracy | 0.9223 |
Identifier Accuracy | 0.8291 |
Label | Precision | Recall | F1 | Support |
---|---|---|---|---|
CJ | 0.88 | 0.88 | 0.88 | 8 |
D | 0.98 | 0.96 | 0.97 | 52 |
DT | 0.95 | 0.93 | 0.94 | 45 |
N | 0.94 | 0.94 | 0.94 | 418 |
NM | 0.91 | 0.93 | 0.92 | 440 |
NPL | 0.97 | 0.97 | 0.97 | 79 |
P | 0.94 | 0.92 | 0.93 | 79 |
PRE | 0.79 | 0.79 | 0.79 | 68 |
V | 0.89 | 0.84 | 0.86 | 110 |
VM | 0.79 | 0.85 | 0.81 | 13 |
Inference Performance:
- Identifiers/sec: 225.8
Metric | Score |
---|---|
Accuracy | 0.8216 |
Balanced Accuracy | 0.9160 |
Weighted Recall | 0.8216 |
Weighted Precision | 0.8245 |
Weighted F1 | 0.8220 |
Inference Time | 249.05s |
Inference Performance:
- Identifiers/sec: 8.6
Tag | Meaning | Examples |
---|---|---|
N | Noun | user , Data , Array |
DT | Determiner | this , that , those |
CJ | Conjunction | and , or , but |
P | Preposition | with , for , in |
NPL | Plural Noun | elements , indices |
NM | Noun Modifier (adjective-like) | max , total , employee |
V | Verb | get , set , delete |
VM | Verb Modifier (adverb-like) | quickly , deeply |
D | Digit | 1 , 2 , 10 , 0xAF |
PRE | Preamble / Prefix | m , b , GL , p |
For the legacy server, you can also use Docker:
docker compose pull
docker compose up
- Kebab case is not supported (e.g.,
do-something-cool
). - Feature and position tokens (e.g.,
@pos_0
) are inserted automatically. - Internally uses WordNet for lexical features.
- Input must be parsed into identifier tokens. We recommend srcML but any AST-based parser works.
Please cite:
@inproceedings{newman2025scalar,
author = {Christian Newman and Brandon Scholten and Sophia Testa and others},
title = {SCALAR: A Part-of-speech Tagger for Identifiers},
booktitle = {ICPC Tool Demonstrations Track},
year = {2025}
}
@article{newman2021ensemble,
title={An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags},
author={Newman, Christian and Decker, Michael and AlSuhaibani, Reem and others},
journal={IEEE Transactions on Software Engineering},
year={2021},
doi={10.1109/TSE.2021.3098242}
}
You can find the most recent SCALAR training dataset here
Please open an issue if you encounter problems!