A Python-based ETL (Extract, Transform, Load) project that processes the IMDb dataset and loads it into a MySQL database. The pipeline reads the large IMDb files in chunks so that memory usage stays bounded.
- Python
- Pandas
- MySQL / mysql-connector-python
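As an illustration of the chunked processing mentioned above, here is a minimal sketch using Pandas. The helper name `iter_chunks` and the chunk size are assumptions made for this example, not names taken from the project. IMDb files are tab-separated and mark missing values with `\N`.

```python
import csv
import io

import pandas as pd


def iter_chunks(source, chunksize=100_000):
    """Yield DataFrames of at most `chunksize` rows from an IMDb-style TSV.

    `source` may be a path (gzip is inferred from a .gz suffix) or a file-like object.
    """
    reader = pd.read_csv(
        source,
        sep="\t",
        na_values="\\N",          # IMDb writes \N for missing values
        keep_default_na=False,    # do not treat strings such as "NA" as missing
        quoting=csv.QUOTE_NONE,   # quote characters in titles are plain text
        dtype=str,                # keep raw strings; cast types in the transform step
        chunksize=chunksize,
    )
    for chunk in reader:
        yield chunk


# Tiny in-memory demo; a real run would pass a path such as "title.basics.tsv.gz".
sample = (
    "tconst\ttitleType\tstartYear\n"
    "tt0000001\tshort\t1894\n"
    "tt0000002\tshort\t\\N\n"
    "tt0000003\tmovie\t1895\n"
)
sizes = [len(chunk) for chunk in iter_chunks(io.StringIO(sample), chunksize=2)]
print(sizes)  # [2, 1]
```

Only one chunk is held in memory at a time, so peak memory depends on `chunksize` rather than on the size of the file.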
Follow these steps to set up and run the project on your local machine.
- Python 3.x: Ensure you have a recent version of Python installed.
- MySQL Server: A running MySQL instance is required to host the database.
- IMDb Dataset: No manual download is needed. The download step below fetches the required files (title.basics.tsv.gz, title.ratings.tsv.gz, name.basics.tsv.gz).
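For orientation, a download of the three files listed above might look roughly like the sketch below. This is not the project's download_data.py. The base URL is the public IMDb datasets host at the time of writing, and the `data` directory and function names are assumptions for this example.

```python
import shutil
import urllib.request
from pathlib import Path

# Assumed location of the public IMDb dataset files.
BASE_URL = "https://datasets.imdbws.com/"
FILES = ["title.basics.tsv.gz", "title.ratings.tsv.gz", "name.basics.tsv.gz"]


def dataset_url(filename, base_url=BASE_URL):
    """Return the full URL for one dataset file."""
    return base_url.rstrip("/") + "/" + filename


def download(filename, dest_dir="data", base_url=BASE_URL):
    """Stream one file to dest_dir, skipping files that already exist."""
    dest = Path(dest_dir) / filename
    if dest.exists():
        return dest
    dest.parent.mkdir(parents=True, exist_ok=True)
    with urllib.request.urlopen(dataset_url(filename, base_url)) as resp:
        with open(dest, "wb") as out:
            shutil.copyfileobj(resp, out)  # copy in blocks, not all at once
    return dest
```

Calling `download(name)` for each entry in `FILES` would fetch all three archives. Streaming with `shutil.copyfileobj` avoids holding a whole archive in memory.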
- Clone the repository
    git clone https://github.com/Gireeshs02/imdb-data-pipeline.git
    cd imdb-data-pipeline
- Set up a Virtual Environment
    # py is the Windows launcher; on macOS/Linux use python3 instead
    py -m venv venv
    # On Windows
    venv\Scripts\activate
    # On macOS/Linux
    source venv/bin/activate
- Install Dependencies
    pip install -r requirements.txt
- Configure the .env file with your MySQL credentials
    DB_USER = user_name
    DB_PASSWORD = your_password
    DB_HOST = localhost
    DB_PORT = port_number
    DB_NAME = database_name
- Set up MySQL database
    Load the schema file into your database from the shell (you will be prompted for the password):
    mysql -u root -p your_database_name < schema.sql
- Download the Data
    py download_data.py
- Run the Project
    py main.py
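To show how the configuration and load steps above could fit together, here is a hedged sketch. It assumes the `.env` values have already been loaded into the process environment (for example by a package such as python-dotenv). The function names and the `title_ratings` table used in the usage note are hypothetical, since the contents of schema.sql and main.py are not shown here.

```python
import os

import pandas as pd


def db_config(env=os.environ):
    """Build keyword arguments for mysql.connector.connect() from DB_* variables."""
    return {
        "user": env["DB_USER"],
        "password": env["DB_PASSWORD"],
        "host": env.get("DB_HOST", "localhost"),
        "port": int(env.get("DB_PORT", "3306")),  # 3306 is MySQL's default port
        "database": env["DB_NAME"],
    }


def to_rows(chunk, columns):
    """Convert a DataFrame chunk to tuples, mapping missing values to None (SQL NULL)."""
    return [
        tuple(None if pd.isna(value) else value for value in row)
        for row in chunk[columns].itertuples(index=False, name=None)
    ]


def insert_chunk(cursor, table, columns, rows):
    """Insert one chunk with a single parameterized executemany() call."""
    placeholders = ", ".join(["%s"] * len(columns))
    sql = f"INSERT INTO {table} ({', '.join(columns)}) VALUES ({placeholders})"
    cursor.executemany(sql, rows)
    return sql
```

With mysql-connector-python, a connection could then be opened with `mysql.connector.connect(**db_config())`, and each chunk written with `insert_chunk(conn.cursor(), "title_ratings", columns, to_rows(chunk, columns))` followed by `conn.commit()`. Values are passed as parameters and never formatted into the SQL string. The table and column names are interpolated directly, so they should come only from trusted code.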
Contributions are welcome! Feel free to open issues, submit pull requests, or suggest improvements.
This project is licensed under the MIT License - see the LICENSE file for details.