Airflow Project is my first project using Apache Airflow; it automates web scraping, data processing, and storage of book data from Amazon in a PostgreSQL database. This repository includes two main files:
- `docker-compose.yaml`: Docker Compose configuration that sets up the necessary services, including Airflow and PostgreSQL.
- `dag/dag.py`: The Airflow DAG (Directed Acyclic Graph) that defines the workflow for scraping, processing, and storing book data.
- Web Scraping: Collects information about books from Amazon, including title, author, rating, price, and availability (a rough sketch of this step is shown after this list).
- Data Processing: Cleans and transforms the scraped data to ensure consistency and quality.
- Data Storage: Inserts the processed data into a PostgreSQL database for easy querying and analysis.
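As a rough illustration of what the scraping and cleaning steps boil down to, here is a minimal sketch using `requests` and BeautifulSoup. The URL, CSS selectors, and field names below are hypothetical placeholders rather than the exact ones used in `dag/dag.py`, and real-world scraping of Amazon needs appropriate request headers and care around rate limits and terms of service.

```python
import requests
from bs4 import BeautifulSoup

def fetch_books(url: str) -> list[dict]:
    """Fetch a results page and extract basic book fields (illustrative selectors only)."""
    headers = {"User-Agent": "Mozilla/5.0"}  # Amazon tends to reject requests without a browser-like UA
    response = requests.get(url, headers=headers, timeout=30)
    response.raise_for_status()

    soup = BeautifulSoup(response.text, "html.parser")
    books = []
    for item in soup.select("div.s-result-item"):  # hypothetical result-card selector
        title = item.select_one("h2")
        price = item.select_one("span.a-price span.a-offscreen")
        rating = item.select_one("span.a-icon-alt")
        if title is None:
            continue  # skip sponsored or empty cards
        books.append({
            "title": title.get_text(strip=True),
            "price": price.get_text(strip=True) if price else None,
            "rating": rating.get_text(strip=True) if rating else None,
        })
    return books
```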
- Docker: Ensure Docker is installed on your system (see the official Docker installation guide).
- Docker Compose: Install Docker Compose to manage multi-container Docker applications (see the official Docker Compose installation guide).
Clone the repository with `git clone <repo_url>` and change into the project directory with `cd airflow`.

Ensure that the `docker-compose.yaml` file is properly configured for your environment. Key configurations include ports and database connection details.

Run `docker-compose up -d` to start the Airflow and PostgreSQL services in detached mode.
- Access the Airflow web interface at `http://localhost:8080`.
- Log in with the default credentials (if not changed).
- Navigate to Admin > Connections.
- Add a new connection with the following details:
  - Conn Id: `books_connection` (referenced by the DAG; see the sketch after this list)
  - Conn Type: `Postgres`
  - Host: `postgres` (the service name defined in `docker-compose.yaml`)
  - Schema: `postgres`
  - Login: `postgres`
  - Password: `postgres` (or your configured password)
  - Port: `5432`
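Once the connection exists, any task in the DAG can reach PostgreSQL through it. The snippet below is a minimal sketch of that pattern using Airflow's `PostgresHook`; the helper name, column list, and row mapping are illustrative assumptions, not the exact code in `dag/dag.py`.

```python
from airflow.providers.postgres.hooks.postgres import PostgresHook

def insert_books(books: list[dict]) -> None:
    """Illustrative helper: write scraped rows to PostgreSQL via books_connection."""
    hook = PostgresHook(postgres_conn_id="books_connection")  # the Conn Id configured above
    hook.insert_rows(
        table="books",  # created by the DAG's create_table task
        rows=[(b["title"], b["author"], b["rating"], b["price"]) for b in books],
        target_fields=["title", "author", "rating", "price"],
    )
```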
Ensure that the dag.py file is placed inside the dag/ directory. By default, Docker Compose mounts this directory to /opt/airflow/dags inside the Airflow container.
- Go to the Airflow web interface at `http://localhost:8080`.
- Locate the DAG named `fetch_and_store_books_data`.
- Toggle the DAG from Off to On to activate it.
- The DAG is scheduled to run daily based on the defined schedule interval.
- Use the Airflow web interface to monitor the status of each task within the DAG.
- View logs and check for any issues during execution.
airflow/
├── docker-compose.yaml
└── dag/
└── dag.py
- `docker-compose.yaml`: Configures Docker Compose to run the Airflow and PostgreSQL services.
- `dag/dag.py`: Defines the Airflow DAG responsible for scraping and storing book data from Amazon.
The fetch_and_store_books_data DAG includes the following tasks:
- fetch_and_transform_data: Scrapes book data from Amazon, processes it, and transforms it into a structured format.
- create_table: Creates the `books` table in PostgreSQL if it doesn't already exist.
- insert_data_into_post: Inserts the processed book data into the `books` table in PostgreSQL.
Task dependencies are set as follows: fetch_and_transform_data → create_table → insert_data_into_post.
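To make that structure concrete, here is a minimal sketch of how a DAG with these three tasks and this dependency chain can be wired up. It is an assumption-laden illustration, not the actual contents of `dag/dag.py`: the operator choices (`PythonOperator`, `PostgresOperator`), the column names and types, the `@daily` schedule, and the start date are placeholders consistent with the description above.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.providers.postgres.operators.postgres import PostgresOperator

# Placeholder callables; the real scraping and insert logic lives in dag/dag.py,
# which likely passes the scraped data between tasks (e.g., via XCom).
def fetch_and_transform_data(**context):
    ...

def insert_data_into_post(**context):
    ...

with DAG(
    dag_id="fetch_and_store_books_data",
    start_date=datetime(2024, 1, 1),
    schedule_interval="@daily",  # runs once a day, as described above
    catchup=False,
) as dag:
    fetch = PythonOperator(
        task_id="fetch_and_transform_data",
        python_callable=fetch_and_transform_data,
    )
    create_table = PostgresOperator(
        task_id="create_table",
        postgres_conn_id="books_connection",
        sql="""
            CREATE TABLE IF NOT EXISTS books (
                title TEXT,
                author TEXT,
                rating TEXT,
                price TEXT
            );
        """,
    )
    insert = PythonOperator(
        task_id="insert_data_into_post",
        python_callable=insert_data_into_post,
    )

    # Matches the dependency chain described above.
    fetch >> create_table >> insert
```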
As this is my first Airflow project, I welcome any feedback, suggestions, or contributions. Feel free to open an issue or submit a pull request.
This project is licensed under the MIT License.
If you have any questions or need further assistance, please open an issue on GitHub or contact me directly.
Thank you for checking out my first Airflow project!