Language Modeling with Transformers

Kaggle Competition Link: https://www.kaggle.com/competitions/nlp-243-fall-24-hw-3-language-modeling

This repository contains a project focused on language modeling with a Transformer-based architecture. The model combines attention mechanisms, SentencePiece tokenization, and FastText embeddings to predict the next token in a sequence, and its performance is evaluated with perplexity.

Overview

Language modeling is a fundamental task in natural language processing, aiming to predict the next token in a sequence based on the previous tokens. This project explores various architectures and tokenization strategies to improve the accuracy and efficiency of the model. Key highlights include:

  • Transformer-based architecture
  • Attention mechanisms
  • SentencePiece tokenization (a minimal usage sketch follows this list)
  • FastText embeddings
  • Perplexity evaluation
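As a rough illustration of the tokenization step, the sketch below shows how a SentencePiece model could be trained and applied. The file path, model prefix, and vocabulary size are placeholders, not the values used in this repository.

import sentencepiece as spm

# Train a subword model on the raw training text (path and vocab size are illustrative).
spm.SentencePieceTrainer.train(
    input="data/train.txt",    # hypothetical one-sentence-per-line training file
    model_prefix="lm_sp",      # writes lm_sp.model and lm_sp.vocab
    vocab_size=8000,
    model_type="unigram",
)

# Load the trained model and tokenize a sentence into subword pieces and ids.
sp = spm.SentencePieceProcessor(model_file="lm_sp.model")
print(sp.encode("NLP 243 is the best", out_type=str))   # subword pieces
print(sp.encode("NLP 243 is the best", out_type=int))   # corresponding ids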

How It Works

Given a sequence of tokens, the model predicts the probability distribution of the next token. The predictions are evaluated using perplexity, which measures how well a probabilistic model predicts a sample.

Example sequence:

<s> NLP 243 is the best </s>

Predictions:

- p(NLP | <s>)
- p(243 | <s>, NLP)
- p(is | <s>, NLP, 243)
- p(the | <s>, NLP, 243, is)
- p(best | <s>, NLP, 243, is, the)
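To make the factorization concrete, the short sketch below multiplies per-step probabilities into a sentence probability via the chain rule. The probability values are made up purely for illustration.

import math

# Made-up conditional probabilities for the example sentence above.
step_probs = [
    0.20,  # p(NLP | <s>)
    0.35,  # p(243 | <s>, NLP)
    0.50,  # p(is | <s>, NLP, 243)
    0.40,  # p(the | <s>, NLP, 243, is)
    0.30,  # p(best | <s>, NLP, 243, is, the)
]

# Chain rule: the sentence probability is the product of the conditionals.
log_prob = sum(math.log(p) for p in step_probs)
print(f"log p(sentence) = {log_prob:.4f}")
print(f"p(sentence)     = {math.exp(log_prob):.6f}")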

Requirements

  • Python 3.8+
  • Install the required libraries using:
    pip install -r requirements.txt

Running the Code

To train the model and generate predictions, run the following command:

python run.py submission.csv

Ensure that your input data is correctly formatted and placed in the appropriate directory.

Model Evaluation

The performance of the language model is evaluated using perplexity:

exp(-1/T * sum(log p(t_i | t_<i)))

Where:

  • T: Total number of tokens in the sentence
  • p(t_i | t_<i): Predicted probability of token t_i given the previous tokens
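A minimal sketch of that formula, assuming the model has already produced a log-probability for each token given its prefix (the values below are made up):

import math

def perplexity(token_log_probs):
    # exp(-1/T * sum(log p(t_i | t_<i)))
    T = len(token_log_probs)
    return math.exp(-sum(token_log_probs) / T)

log_probs = [math.log(p) for p in (0.20, 0.35, 0.50, 0.40, 0.30)]
print(round(perplexity(log_probs), 3))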

Submission Format

Your final predictions should be saved in a CSV file with the following format:

ID,ppl
2,2.134
5,5.230
6,1.120

  • ID: Sentence ID
  • ppl: Perplexity value
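One simple way to produce a file in that format, assuming per-sentence perplexities have already been computed (the variable names and values here are illustrative):

import csv

# Hypothetical perplexities keyed by sentence ID.
results = {2: 2.134, 5: 5.230, 6: 1.120}

with open("submission.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["ID", "ppl"])
    for sent_id, ppl in sorted(results.items()):
        writer.writerow([sent_id, round(ppl, 3)])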

Techniques Explored

The project explores various techniques to improve the model's performance:

  • Recurrent Neural Networks (RNN)
  • Long Short-Term Memory (LSTM)
  • Gated Recurrent Units (GRU)
  • Transformers
  • Attention mechanisms (a minimal self-attention sketch follows this list)
  • Different tokenization strategies
  • Pretrained embeddings (Word2Vec, GloVe)
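As an illustration of the attention mechanism listed above, and not the exact module used in this repository, the sketch below implements a single-head causal self-attention layer, assuming PyTorch is available:

import math
import torch
import torch.nn as nn

class CausalSelfAttention(nn.Module):
    # Single-head scaled dot-product attention with a causal mask (illustrative only).
    def __init__(self, d_model):
        super().__init__()
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x):  # x: (batch, seq_len, d_model)
        q, k, v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
        scores = q @ k.transpose(-2, -1) / math.sqrt(x.size(-1))
        # Mask future positions so each token attends only to itself and the past.
        seq_len = x.size(1)
        mask = torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool, device=x.device), diagonal=1)
        scores = scores.masked_fill(mask, float("-inf"))
        return torch.softmax(scores, dim=-1) @ v

# Quick shape check on random input.
attn = CausalSelfAttention(d_model=64)
print(attn(torch.randn(2, 10, 64)).shape)  # torch.Size([2, 10, 64])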

Limitations

External libraries such as HuggingFace Transformers or Keras were not used for model implementation or training. The focus was on building core components from scratch.
