taabishhh/LLM_Preprocessing
CS441_Fall2024

Class repository for CS441 Cloud Computing, taught at the University of Illinois Chicago in Fall 2024.

Name: Taabish Sutriwala
Email: [email protected]
UIN: 673379837


Homework 1

Overview

This project implements a Byte Pair Encoding (BPE) tokenization approach along with a Word2Vec model to generate word embeddings from a text corpus. The implementation leverages Apache Hadoop for distributed processing and includes evaluation metrics for optimal dimensionality of embeddings.

Project Structure

The project is structured as follows:

  • MainApp.scala: The entry point of the application, orchestrating the execution of the BPE tokenization, Word2Vec training, and evaluation of embeddings.
  • BPEEncoding.scala: Contains functions for encoding text using BPE.
  • BPEMapReduceJob.scala: Manages the MapReduce job configuration and execution for BPE tokenization.
  • BPETokenMapper.scala: Implements the mapper for the MapReduce job to perform tokenization.
  • BPETokenReducer.scala: Implements the reducer to aggregate token counts and save vocabulary statistics.
  • DimensionalityEvaluator.scala: Evaluates different dimensions of embeddings based on analogy and similarity tasks.
  • Word2VecEmbedding.scala: Manages the training of the Word2Vec model and saving similar words to a CSV file.
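To make the tokenization step concrete, here is a minimal, self-contained sketch of one BPE merge iteration — count adjacent symbol pairs across the word-frequency table, pick the most frequent pair, and merge it everywhere. The object and method names (BpeSketch, pairCounts, merge) are illustrative and not taken from BPEEncoding.scala:

```scala
// Illustrative sketch of one BPE merge iteration; names are hypothetical,
// not the actual API of BPEEncoding.scala.
object BpeSketch {
  // Count adjacent symbol pairs across the word-frequency map.
  def pairCounts(words: Map[Vector[String], Int]): Map[(String, String), Int] =
    words.toSeq
      .flatMap { case (syms, freq) =>
        syms.sliding(2).collect { case Seq(a, b) => ((a, b), freq) }
      }
      .groupMapReduce(_._1)(_._2)(_ + _)

  // Greedily replace each left-to-right occurrence of `pair` with its concatenation.
  def merge(words: Map[Vector[String], Int], pair: (String, String)): Map[Vector[String], Int] = {
    def mergeOne(syms: Vector[String]): Vector[String] =
      syms.foldLeft(Vector.empty[String]) { (acc, s) =>
        if (acc.nonEmpty && acc.last == pair._1 && s == pair._2)
          acc.init :+ (pair._1 + pair._2)
        else acc :+ s
      }
    words.toSeq
      .map { case (syms, freq) => (mergeOne(syms), freq) }
      .groupMapReduce(_._1)(_._2)(_ + _) // re-sum in case two words collapse to one key
  }
}
```

Repeating these two steps until the vocabulary reaches the desired size yields the BPE merge table; the MapReduce job's role is to distribute the pair-counting phase over the corpus.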

Requirements

  • Apache Hadoop
  • Scala (version 3.5.0)
  • Java 11

Scala and Project Version

  • Scala Version: 3.5.0
  • Project Version: 0.1.0-SNAPSHOT

Dependencies

The project's library dependencies are declared in build.sbt.
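The dependency list is not reproduced above. Assuming the Word2Vec model is trained with deeplearning4j and the MapReduce jobs use the Hadoop client libraries, the build.sbt entries would look roughly like this (the artifact versions are placeholder assumptions, not values taken from this repository):

```scala
// Hypothetical dependency section for build.sbt; versions are illustrative
// assumptions, not the versions pinned in this project.
libraryDependencies ++= Seq(
  "org.apache.hadoop"  % "hadoop-client"        % "3.3.6",      // MapReduce jobs
  "org.deeplearning4j" % "deeplearning4j-nlp"   % "1.0.0-M2.1", // Word2Vec training
  "org.nd4j"           % "nd4j-native-platform" % "1.0.0-M2.1", // ND4J backend for DL4J
  "org.slf4j"          % "slf4j-simple"         % "2.0.13"      // logging
)
```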

Setup Instructions

  1. Clone the Repository:

    git clone <repository-url>
    cd <repository-directory>
  2. Build the Project: Ensure that you have all necessary dependencies specified in your build.sbt or equivalent file. Use the following command to build the project:

    sbt package
  3. Run the Program: Execute MainApp with the following command, replacing <input-path> and <output-path> with your respective paths (e.g. hadoop jar target/scala-3.5.0/HW1.jar MainApp src/resources/input src/resources/output):

    hadoop jar target/scala-3.5.0/<your-jar-name>.jar MainApp <input-path> <output-path>
    • Input Path: Path to the text corpus for BPE tokenization.
    • Output Path: Path where results will be stored.
  4. View Results: After execution, the output will include:

    • Vocabulary Statistics: Available in vocabulary_stats.csv within the output directory.
    • Similar Words: Found in similar_words.csv.
    • Dimensionality Evaluation: Results logged in dimension_evaluation.csv.

Results and Evaluation

The results of the program execution will provide insights into:

  • Token Frequencies: An aggregated view of token occurrences in the input text.
  • Similar Words: A list of semantically similar words for each word in the vocabulary.
  • Dimensionality Analysis: Evaluation metrics including analogy accuracy and similarity scores based on different embedding dimensions (e.g., 50, 100, 150, 200).
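As a rough sketch of the metrics behind this analysis — cosine similarity, and the "a is to b as c is to ?" analogy test answered by the word closest to b − a + c — assuming embeddings are available as plain vectors (EmbeddingEval and its methods are illustrative names, not the project's API):

```scala
// Hedged sketch of the similarity and analogy metrics an evaluator might
// compute; not the actual code in DimensionalityEvaluator.scala.
object EmbeddingEval {
  // Cosine similarity between two embedding vectors of equal length.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot  = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    dot / norm
  }

  // "a is to b as c is to ?": pick the word whose vector is closest to b - a + c,
  // excluding the three query words themselves.
  def analogy(emb: Map[String, Array[Double]], a: String, b: String, c: String): String = {
    val target = emb(a).indices.map(i => emb(b)(i) - emb(a)(i) + emb(c)(i)).toArray
    (emb.keySet -- Set(a, b, c)).maxBy(w => cosine(target, emb(w)))
  }
}
```

Analogy accuracy at a given dimension is then the fraction of test analogies where the top-ranked word matches the expected answer.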

Link to Video Demonstration

Link to Video

Limitations

  • The performance of the Word2Vec model is sensitive to hyperparameters such as window size and minimum word frequency. The choice of parameters in this implementation may not be optimal for all datasets.
  • The input corpus should be pre-processed adequately to remove noise and irrelevant characters for better results.
  • The current implementation does not handle out-of-vocabulary (OOV) words in the analogy evaluation.

Conclusion

This project showcases an efficient approach to text tokenization and word embedding generation using BPE and Word2Vec. The use of Hadoop enables scalable processing of large text corpora. Further optimizations and explorations can be conducted to enhance the model's performance based on the specific application requirements.
