Skip to content

SCANL/scanl_tagger

Repository files navigation

SCALAR Part-of-Speech Tagger for Identifiers

SCALAR is a part-of-speech tagger for source code identifiers. It supports two model types:

  • DistilBERT-based model with CRF layer (Recommended: faster, more accurate)
  • Legacy Gradient Boosting model (for compatibility)

Installation

Make sure you have python3.12 installed. Then:

git clone https://github.com/SCANL/scanl_tagger.git
cd scanl_tagger
python -m venv venv
source venv/bin/activate
pip install -r requirements.txt

Usage

You can run SCALAR in multiple ways:

CLI (with DistilBERT or GradientBoosting model)

python main --mode run --model_type lm_based         # DistilBERT (recommended)
python main --mode run --model_type tree_based       # Legacy model

Then query like:

http://127.0.0.1:8080/GetValue/FUNCTION

Supports context types:

  • FUNCTION
  • CLASS
  • ATTRIBUTE
  • DECLARATION
  • PARAMETER

Training

You can retrain either model (default parameters are currently hardcoded):

python main --mode train --model_type lm_based
python main --mode train --model_type tree_based

Evaluation Results

DistilBERT (LM-Based Model) — Recommended

Metric Score
Macro F1 0.9032
Token Accuracy 0.9223
Identifier Accuracy 0.8291
Label Precision Recall F1 Support
CJ 0.88 0.88 0.88 8
D 0.98 0.96 0.97 52
DT 0.95 0.93 0.94 45
N 0.94 0.94 0.94 418
NM 0.91 0.93 0.92 440
NPL 0.97 0.97 0.97 79
P 0.94 0.92 0.93 79
PRE 0.79 0.79 0.79 68
V 0.89 0.84 0.86 110
VM 0.79 0.85 0.81 13

Inference Performance:

  • Identifiers/sec: 225.8

Gradient Boost Model (Legacy)

Metric Score
Accuracy 0.8216
Balanced Accuracy 0.9160
Weighted Recall 0.8216
Weighted Precision 0.8245
Weighted F1 0.8220
Inference Time 249.05s

Inference Performance:

  • Identifiers/sec: 8.6

Supported Tagset

Tag Meaning Examples
N Noun user, Data, Array
DT Determiner this, that, those
CJ Conjunction and, or, but
P Preposition with, for, in
NPL Plural Noun elements, indices
NM Noun Modifier (adjective-like) max, total, employee
V Verb get, set, delete
VM Verb Modifier (adverb-like) quickly, deeply
D Digit 1, 2, 10, 0xAF
PRE Preamble / Prefix m, b, GL, p

Docker Support (Legacy only)

For the legacy server, you can also use Docker:

docker compose pull
docker compose up

Notes

  • Kebab case is not supported (e.g., do-something-cool).
  • Feature and position tokens (e.g., @pos_0) are inserted automatically.
  • Internally uses WordNet for lexical features.
  • Input must be parsed into identifier tokens. We recommend srcML but any AST-based parser works.

Citations

Please cite:

@inproceedings{newman2025scalar,
  author    = {Christian Newman and Brandon Scholten and Sophia Testa and others},
  title     = {SCALAR: A Part-of-speech Tagger for Identifiers},
  booktitle = {ICPC Tool Demonstrations Track},
  year      = {2025}
}

@article{newman2021ensemble,
  title={An Ensemble Approach for Annotating Source Code Identifiers with Part-of-speech Tags},
  author={Newman, Christian and Decker, Michael and AlSuhaibani, Reem and others},
  journal={IEEE Transactions on Software Engineering},
  year={2021},
  doi={10.1109/TSE.2021.3098242}
}

Training Data

You can find the most recent SCALAR training dataset here


More from SCANL


Trouble?

Please open an issue if you encounter problems!

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 5