An implementation of the paper: Text Level Graph Neural Network for Text Classification (https://arxiv.org/pdf/1910.02356.pdf)
- Dynamic edge weights instead of static edge weights
- All documents are built from one big shared graph instead of every document having its own structure
- Public edge sharing, achieved by computing edge statistics during dataset construction and masking during training (a mechanism roughly described in the paper but without much further detail); see the sketch after this list
- Flexible argument controls and early stopping (see the early-stopping sketch after the usage commands)
- Detailed explanations of intermediate operations
- The number of parameters in this model is close to the number reported in the paper
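
The dynamic edge weights and the public edge can be realised as a single shared, trainable lookup table over edge ids. Below is a minimal sketch of that idea, assuming edge ids and frequencies are computed during dataset construction and low-frequency edges are masked to a reserved public id; the names (SharedEdgeWeights, PUBLIC_EDGE_ID) are illustrative and not taken from train.py.

```python
# Minimal sketch (not the repository's exact code): one trainable weight per
# known (word_i, word_j) edge, with low-frequency edges masked to a shared
# "public" edge. PUBLIC_EDGE_ID and all other names are illustrative.
import torch
import torch.nn as nn

PUBLIC_EDGE_ID = 0  # reserved id for the shared public edge


class SharedEdgeWeights(nn.Module):
    def __init__(self, num_edges: int):
        super().__init__()
        # one trainable scalar per edge id assigned during dataset construction,
        # plus index 0 for the public edge
        self.weights = nn.Embedding(num_edges + 1, 1)

    def forward(self, edge_ids: torch.Tensor, freq_mask: torch.Tensor) -> torch.Tensor:
        # edge_ids: (batch, num_neighbours) long tensor of precomputed edge ids
        # freq_mask: bool tensor, True where the edge occurred at least
        #            min_freq times in the training corpus
        ids = torch.where(freq_mask, edge_ids, torch.full_like(edge_ids, PUBLIC_EDGE_ID))
        return self.weights(ids).squeeze(-1)
```

In this sketch every edge seen fewer than min_freq times shares one trainable weight, which keeps the parameter count small while still letting frequent edges learn their own weights.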
 
```
+---embeddings\
|             +---glove.6B.50d.txt
|             +---glove.6B.100d.txt
|             +---glove.6B.200d.txt
|             +---glove.6B.300d.txt
+---train.py
+---r52-test-all-terms.txt
+---r52-train-all-terms.txt
+---r8-test-all-terms.txt
+---r8-train-all-terms.txt
```
Since the original links DO NOT work anymore, I provide the original links and the corresponding dataset files in this repository for anyone who is also looking for the R8 and R52 datasets:
- https://www.cs.umb.edu/~smimarog/textmining/datasets/r8-train-all-terms.txt => r8-train-all-terms.txt
- https://www.cs.umb.edu/~smimarog/textmining/datasets/r8-test-all-terms.txt => r8-test-all-terms.txt
- https://www.cs.umb.edu/~smimarog/textmining/datasets/r52-train-all-terms.txt => r52-train-all-terms.txt
- https://www.cs.umb.edu/~smimarog/textmining/datasets/r52-test-all-terms.txt => r52-test-all-terms.txt
- Python 3.7.4
- PyTorch 1.5.1 + CUDA 10.1
- Pandas 1.0.5
- NumPy 1.19.0
 
Successfully run on RTX 2070, RTX 2080 Ti, and RTX 3090. However, memory consumption is quite large, so an RTX 2070 requires a smaller batch size, a shorter MAX_LENGTH, or a smaller embedding_size.
- Linux:
  - OMP_NUM_THREADS=1 python train.py --cuda=0 --embedding_size=300 --p=3 --min_freq=2 --max_length=70 --dropout=0 --epoch=300
- Windows:
  - python train.py --cuda=0 --embedding_size=300 --p=3 --min_freq=2 --max_length=70 --dropout=0 --epoch=300
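
The early stopping mentioned in the feature list follows the usual pattern of tracking the best validation accuracy; a minimal sketch is shown below. The function and variable names are illustrative and may differ from train.py.

```python
# Minimal early-stopping sketch based on validation accuracy; all names here
# are illustrative, not taken from train.py.
import torch

def fit_with_early_stopping(model, train_one_epoch, evaluate, num_epochs=300, patience=10):
    # train_one_epoch and evaluate are caller-supplied callables; evaluate
    # returns the validation accuracy for the current epoch
    best_val_acc, stale_epochs = 0.0, 0
    for epoch in range(num_epochs):
        train_one_epoch(model)
        val_acc = evaluate(model)
        if val_acc > best_val_acc:
            best_val_acc, stale_epochs = val_acc, 0
            torch.save(model.state_dict(), "best_model.pt")  # keep the best checkpoint
        else:
            stale_epochs += 1
            if stale_epochs >= patience:
                break  # no improvement for `patience` consecutive epochs
    return best_val_acc
```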
 
 
I only tested the model on the R8 dataset and was unable to achieve the figure reported in the paper despite trying some hyperparameter tuning. The closest run I could get is:
| Train Accuracy | Validation Accuracy | Test Accuracy | 
|---|---|---|
| 99.91% | 95.7% | 96.2% | 
with embedding_size=300, p=3, 70 <= max_length <= 150, and dropout=0.
As the experiment settings in the paper are not clearly stated, I assumed they also used a learning rate decay mechanism. I also added a warm-up mechanism to pretrain the model, but the model actually converged quite fast and does not even need the warm-up technique.
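
A schedule of that kind (linear warm-up followed by decay) can be expressed with a standard PyTorch LambdaLR. The sketch below uses illustrative values and a placeholder model, and is not necessarily the exact schedule used in train.py.

```python
# Minimal sketch of linear warm-up followed by exponential decay; the values
# and the placeholder model are illustrative only.
import torch

model = torch.nn.Linear(300, 8)  # placeholder model
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

warmup_epochs, decay_rate = 5, 0.98

def lr_lambda(epoch: int) -> float:
    if epoch < warmup_epochs:
        return (epoch + 1) / warmup_epochs        # linear warm-up
    return decay_rate ** (epoch - warmup_epochs)  # exponential decay afterwards

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

for epoch in range(300):
    # ... run one training epoch here ...
    scheduler.step()
```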