taabishhh/LLM_Preprocessing
CS441_Fall2024

Class repository for CS441 Cloud Computing, taught at the University of Illinois Chicago in Fall 2024.

Name: Taabish Sutriwala
Email: [email protected]
UIN: 673379837


Homework 1

Overview

This project implements a Byte Pair Encoding (BPE) tokenization approach along with a Word2Vec model to generate word embeddings from a text corpus. The implementation leverages Apache Hadoop for distributed processing and includes evaluation metrics for optimal dimensionality of embeddings.

Project Structure

The project is structured as follows:

  • MainApp.scala: The entry point of the application, orchestrating the execution of the BPE tokenization, Word2Vec training, and evaluation of embeddings.
  • BPEEncoding.scala: Contains functions for encoding text using BPE.
  • BPEMapReduceJob.scala: Manages the MapReduce job configuration and execution for BPE tokenization.
  • BPETokenMapper.scala: Implements the mapper for the MapReduce job to perform tokenization.
  • BPETokenReducer.scala: Implements the reducer to aggregate token counts and save vocabulary statistics.
  • DimensionalityEvaluator.scala: Evaluates different dimensions of embeddings based on analogy and similarity tasks.
  • Word2VecEmbedding.scala: Manages the training of the Word2Vec model and saving similar words to a CSV file.
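To make the tokenization step concrete, here is a minimal, self-contained sketch of one BPE merge iteration — count adjacent symbol pairs across the word-frequency table, pick the most frequent pair, and merge it everywhere. The object and method names (BpeSketch, pairCounts, merge) are illustrative and not taken from BPEEncoding.scala:

```scala
// Illustrative sketch of one BPE merge iteration; names are hypothetical,
// not the actual API of BPEEncoding.scala.
object BpeSketch {
  // Count adjacent symbol pairs across the word-frequency map.
  def pairCounts(words: Map[Vector[String], Int]): Map[(String, String), Int] =
    words.toSeq
      .flatMap { case (syms, freq) =>
        syms.sliding(2).collect { case Seq(a, b) => ((a, b), freq) }
      }
      .groupMapReduce(_._1)(_._2)(_ + _)

  // Greedily replace each left-to-right occurrence of `pair` with its concatenation.
  def merge(words: Map[Vector[String], Int], pair: (String, String)): Map[Vector[String], Int] = {
    def mergeOne(syms: Vector[String]): Vector[String] =
      syms.foldLeft(Vector.empty[String]) { (acc, s) =>
        if (acc.nonEmpty && acc.last == pair._1 && s == pair._2)
          acc.init :+ (pair._1 + pair._2)
        else acc :+ s
      }
    words.toSeq
      .map { case (syms, freq) => (mergeOne(syms), freq) }
      .groupMapReduce(_._1)(_._2)(_ + _) // re-sum in case two words collapse to one key
  }
}
```

Repeating these two steps until the vocabulary reaches the desired size yields the BPE merge table; the MapReduce job's role is to distribute the pair-counting phase over the corpus.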

Requirements

  • Apache Hadoop
  • Scala (version 3.5.0)
  • Java 11

Scala and Project Version

  • Scala Version: 3.5.0
  • Project Version: 0.1.0-SNAPSHOT

Dependencies

The project's library dependencies are declared in build.sbt.
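The dependency list is not reproduced above. Assuming the Word2Vec model is trained with deeplearning4j and the MapReduce jobs use the Hadoop client libraries, the build.sbt entries would look roughly like this (the artifact versions are placeholder assumptions, not values taken from this repository):

```scala
// Hypothetical dependency section for build.sbt; versions are illustrative
// assumptions, not the versions pinned in this project.
libraryDependencies ++= Seq(
  "org.apache.hadoop"  % "hadoop-client"        % "3.3.6",      // MapReduce jobs
  "org.deeplearning4j" % "deeplearning4j-nlp"   % "1.0.0-M2.1", // Word2Vec training
  "org.nd4j"           % "nd4j-native-platform" % "1.0.0-M2.1", // ND4J backend for DL4J
  "org.slf4j"          % "slf4j-simple"         % "2.0.13"      // logging
)
```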

Setup Instructions

  1. Clone the Repository:

    git clone <repository-url>
    cd <repository-directory>
  2. Build the Project: Ensure that you have all necessary dependencies specified in your build.sbt or equivalent file. Use the following command to build the project:

    sbt package
  3. Run the Program: Execute MainApp with the following command, replacing <input-path> and <output-path> with your respective paths (e.g. hadoop jar target/scala-3.5.0/HW1.jar MainApp src/resources/input src/resources/output):

    hadoop jar target/scala-3.5.0/<your-jar-name>.jar MainApp <input-path> <output-path>
    • Input Path: Path to the text corpus for BPE tokenization.
    • Output Path: Path where results will be stored.
  4. View Results: After execution, the output will include:

    • Vocabulary Statistics: Available in vocabulary_stats.csv within the output directory.
    • Similar Words: Found in similar_words.csv.
    • Dimensionality Evaluation: Results logged in dimension_evaluation.csv.

Results and Evaluation

The results of the program execution will provide insights into:

  • Token Frequencies: An aggregated view of token occurrences in the input text.
  • Similar Words: A list of semantically similar words for each word in the vocabulary.
  • Dimensionality Analysis: Evaluation metrics including analogy accuracy and similarity scores based on different embedding dimensions (e.g., 50, 100, 150, 200).
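As a rough sketch of the metrics behind this analysis — cosine similarity, and the "a is to b as c is to ?" analogy test answered by the word closest to b − a + c — assuming embeddings are available as plain vectors (EmbeddingEval and its methods are illustrative names, not the project's API):

```scala
// Hedged sketch of the similarity and analogy metrics an evaluator might
// compute; not the actual code in DimensionalityEvaluator.scala.
object EmbeddingEval {
  // Cosine similarity between two embedding vectors of equal length.
  def cosine(a: Array[Double], b: Array[Double]): Double = {
    val dot  = a.zip(b).map { case (x, y) => x * y }.sum
    val norm = math.sqrt(a.map(x => x * x).sum) * math.sqrt(b.map(x => x * x).sum)
    dot / norm
  }

  // "a is to b as c is to ?": pick the word whose vector is closest to b - a + c,
  // excluding the three query words themselves.
  def analogy(emb: Map[String, Array[Double]], a: String, b: String, c: String): String = {
    val target = emb(a).indices.map(i => emb(b)(i) - emb(a)(i) + emb(c)(i)).toArray
    (emb.keySet -- Set(a, b, c)).maxBy(w => cosine(target, emb(w)))
  }
}
```

Analogy accuracy at a given dimension is then the fraction of test analogies where the top-ranked word matches the expected answer.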

Link to Video Demonstration

Link to Video

Limitations

  • The performance of the Word2Vec model is sensitive to hyperparameters such as window size and minimum word frequency. The choice of parameters in this implementation may not be optimal for all datasets.
  • The input corpus should be pre-processed adequately to remove noise and irrelevant characters for better results.
  • The current implementation does not handle out-of-vocabulary (OOV) words in the analogy evaluation.

Conclusion

This project showcases an efficient approach to text tokenization and word embedding generation using BPE and Word2Vec. The use of Hadoop enables scalable processing of large text corpora. Further optimizations and explorations can be conducted to enhance the model's performance based on the specific application requirements.
