This project implements a full ETL pipeline for real estate data using Apache Airflow and dbt (Data Build Tool), orchestrated with Astronomer Cosmos. The pipeline scrapes listings from the batdongsan.com website, processes them with Python and dbt, and loads the results into multiple storage destinations: PostgreSQL, AWS S3, and AWS Redshift.
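At a high level, the Airflow DAG ties these stages together. The following is a minimal orchestration sketch rather than the project's actual DAG: the task ids, callables, and schedule are illustrative assumptions (Airflow 2.x style imports), and the dbt/Cosmos step is represented by a placeholder task.

```python
# Minimal orchestration sketch (illustrative only; Airflow 2.x style).
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


# Placeholder callables -- the real logic lives in web_scraping/, transformation/,
# load_to_postgresql/, load_to_s3/ and load_to_redshift/.
def scrape_listings(): ...
def transform_listings(): ...
def load_postgres(): ...
def upload_s3(): ...
def copy_redshift(): ...


with DAG(
    dag_id="real_estate_etl",        # assumed DAG id
    start_date=datetime(2024, 1, 1),
    schedule="@daily",               # assumed schedule
    catchup=False,
) as dag:
    scrape = PythonOperator(task_id="scrape", python_callable=scrape_listings)
    transform = PythonOperator(task_id="transform", python_callable=transform_listings)
    to_postgres = PythonOperator(task_id="load_to_postgres", python_callable=load_postgres)
    to_s3 = PythonOperator(task_id="upload_to_s3", python_callable=upload_s3)
    to_redshift = PythonOperator(task_id="copy_to_redshift", python_callable=copy_redshift)

    # In the real project the dbt models and tests run through Astronomer Cosmos
    # (e.g. a DbtTaskGroup); the transform placeholder stands in for that step here.
    scrape >> transform >> [to_postgres, to_s3]
    to_s3 >> to_redshift
```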
├── airflow_dag/ # Contains the Airflow DAG for orchestrating the ETL workflow
├── data_stage/ # Stores raw and processed data outputs
├── load_to_postgresql/ # Handles PostgreSQL table creation and data loading
├── load_to_redshift/ # Loads data from S3 to Redshift
├── load_to_s3/ # Uploads processed data to AWS S3
├── transformation/ # Contains transformation logic for cleaning and processing data
├── web_scraping/ # Contains Scrapy web scraping logic to extract data
├── real_estate/ # dbt project (data models and data quality tests)
├── .env # Environment configuration (ignored in Git)
├── docker-compose.yaml # Docker Compose configuration
├── requirements.txt # Python dependencies for the project
├── README.md # Project overview and instructions
├── .gitignore # Git ignore file
- Apache Airflow: Orchestrates and manages the ETL pipeline.
- Scrapy: Framework for scraping real estate listings from batdongsan.com (a minimal spider sketch follows this list).
- PostgreSQL: Stores processed data for further analysis.
- AWS S3: Stores raw and processed data.
- AWS Redshift: Data warehouse for high-performance analytics (see the S3-to-Redshift loading sketch after this list).
- Astronomer Cosmos + dbt: Data modeling and data quality tests, run as Airflow tasks.
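The `web_scraping/` module is built on Scrapy. Below is a minimal spider sketch for illustration only: the start URL, CSS selectors, and item fields are assumptions about batdongsan.com's markup, not the project's actual spider.

```python
# Illustrative Scrapy spider -- URL, selectors and fields are placeholder assumptions.
import scrapy


class RealEstateSpider(scrapy.Spider):
    name = "real_estate"
    start_urls = ["https://batdongsan.com/nha-dat-ban"]  # assumed listing URL

    def parse(self, response):
        # Placeholder selectors; adjust to the site's real HTML structure.
        for card in response.css("div.listing-card"):
            yield {
                "title": card.css("h3 a::text").get(),
                "price": card.css(".price::text").get(),
                "area": card.css(".area::text").get(),
                "url": response.urljoin(card.css("h3 a::attr(href)").get(default="")),
            }
```

A run such as `scrapy crawl real_estate -O data_stage/raw_listings.json` would dump the scraped items to a file for the transformation step to pick up (output path assumed).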
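On the loading side, processed files are uploaded to S3 and then copied into Redshift. The sketch below shows one common pattern using boto3 and psycopg2; the bucket, object key, table, IAM role, and connection parameters are placeholder assumptions, not the project's real configuration.

```python
# Illustrative S3 upload + Redshift COPY -- all names and credentials are placeholders.
import boto3
import psycopg2

BUCKET = "real-estate-data-stage"                               # assumed bucket
KEY = "processed/listings.csv"                                  # assumed object key
IAM_ROLE = "arn:aws:iam::123456789012:role/redshift-copy-role"  # placeholder role


def upload_processed_file(local_path: str) -> None:
    """Upload a processed CSV from data_stage/ to S3."""
    s3 = boto3.client("s3")  # credentials are read from the environment / .env
    s3.upload_file(local_path, BUCKET, KEY)


def copy_into_redshift() -> None:
    """Load the uploaded file into Redshift with a COPY statement."""
    conn = psycopg2.connect(
        host="example-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
        port=5439,
        dbname="real_estate",
        user="awsuser",
        password="change-me",
    )
    with conn, conn.cursor() as cur:
        cur.execute(
            f"""
            COPY public.listings
            FROM 's3://{BUCKET}/{KEY}'
            IAM_ROLE '{IAM_ROLE}'
            CSV IGNOREHEADER 1;
            """
        )
    conn.close()
```

COPY from S3 lets Redshift ingest the file in parallel, which is why it is generally preferred over row-by-row INSERTs for bulk loads.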
- Clone the repository:
  `git clone <repository_url>`
- Start the project using Docker Compose:
  `docker-compose up`
- Access the Airflow web UI to monitor DAGs and tasks (with the default Airflow Docker setup, the UI is typically served at http://localhost:8080).

