To explore Microsoft Azure services and build ETL pipelines to analyze the data and produce dashboards in Power BI.
- Azure Data Factory (ADF) for ingesting data from the source.
- Azure Databricks (Apache Spark) for scalable data transformations.
- Azure Data Lake Storage Gen2 (ADLS Gen2) for cost-effective data storage.
- Azure Synapse Analytics as the data warehouse for large-scale SQL queries.
- Power BI for data visualization and reporting.
- Ingestion → Azure Data Factory → Raw Data (ADLS Gen2)
- Transformation → Azure Databricks → Transformed Data (ADLS Gen2)
- Analytics → Azure Synapse Analytics
- Visualization → Power BI
Data source: The dataset used is the Olympic Data from Kaggle, which contains details about athletes, coaches, medals, teams, and gender participation in the 2021 Tokyo Olympics. https://www.kaggle.com/datasets/arjunprasadsarkhel/2021-olympics-in-tokyo
Extract Data (Ingestion)
- Azure Data Factory (ADF) is used to extract raw data from a GitHub repository.
- The dataset includes five CSV files (athletes, coaches, medals, teams, entries_gender).
- ADF moves the raw data into Azure Data Lake Storage Gen2 (ADLS Gen2) for further processing.
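In ADF this step is a Copy activity with an HTTP source and an ADLS Gen2 sink. Purely as an illustration of what the ingestion does, here is a minimal Python sketch that pulls the same CSV files over HTTP and lands them in the raw-data folder; the storage account, container, folder, repository URL, and credential are placeholders.

```python
# Ingestion sketch: download the five CSVs from a GitHub repo (raw URLs)
# and land them in the raw-data folder of an ADLS Gen2 container.
# Account, container, folder, repo URL, and key below are placeholders.
import requests
from azure.storage.filedatalake import DataLakeServiceClient

ACCOUNT_URL = "https://<storage-account>.dfs.core.windows.net"
CONTAINER = "tokyo-olympic-data"      # assumed container name
RAW_FOLDER = "raw-data"               # assumed raw-data folder
REPO_RAW = "https://raw.githubusercontent.com/<user>/<repo>/main/data"  # assumed

FILES = ["athletes.csv", "coaches.csv", "medals.csv", "teams.csv", "entries_gender.csv"]

service = DataLakeServiceClient(account_url=ACCOUNT_URL, credential="<account-key>")
fs = service.get_file_system_client(CONTAINER)

for name in FILES:
    resp = requests.get(f"{REPO_RAW}/{name}", timeout=30)
    resp.raise_for_status()
    file_client = fs.get_file_client(f"{RAW_FOLDER}/{name}")
    # upload_data writes the downloaded CSV bytes to the lake, replacing any existing file
    file_client.upload_data(resp.content, overwrite=True)
    print(f"Uploaded {name} to {CONTAINER}/{RAW_FOLDER}")
```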
Transform Data
- Azure Databricks (Apache Spark) is used to clean and transform the raw data.
- Transformation steps include:
- Converting incorrect data types (e.g., strings to integers).
- Removing duplicates and renaming columns.
- Computing aggregated statistics such as top countries by gold medals.
- The transformed data is written back to Azure Data Lake Storage (ADLS Gen2).
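As an illustration of one such aggregation, a minimal PySpark sketch is shown below; the mount point, file name, and column names (Team_Country, Gold) are assumptions about the medals file.

```python
# Aggregation sketch in Databricks (PySpark): top countries by gold medals.
# `spark` is the SparkSession provided by the Databricks notebook.
# Mount point, file name, and column names are assumptions.
from pyspark.sql.functions import col

medals = (spark.read
          .option("header", True)
          .csv("/mnt/tokyo-olympics/raw-data/medals.csv"))

top_gold = (medals
            .dropDuplicates()
            .withColumn("Gold", col("Gold").cast("int"))   # string -> integer
            .orderBy(col("Gold").desc())                    # rank countries by gold medals
            .select("Team_Country", "Gold"))

top_gold.show(10)
```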
Load and Analysis
- Azure Synapse Analytics is used to query transformed data.
- SQL-based queries and the built-in chart view are used to gain insights from the dataset.
- The processed data is visualized using Power BI.
Dashboard and Reporting
- A dashboard is created to visualize key insights, such as:
- Top countries with the most medals.
- Gender participation by discipline.
- Distribution of medals across different sports.
Implementation Steps
- Create an Azure Data Factory instance to manage the data pipelines.
- Create a Storage Account and a Data Lake Storage Gen2 container to store the files.
- Set up a Data Pipeline in ADF:
- Source: GitHub (HTTP connector)
- Destination: ADLS Gen2 (Raw Data Folder)
- Pipeline Execution: Runs successfully and stores the raw data.
- Used ADF source and sink connections to load the raw data files from the GitHub source via the HTTP connector into the Data Lake raw_data folder.
- Create an Azure Databricks Workspace and configure a Spark cluster.
- Mount the Data Lake Storage to Databricks using Service Principal authentication: an app registration was created with a client secret and given the required role assignment (Storage Blob Data Contributor via IAM access control), and the storage container was then mounted onto Databricks (see the sketch below).
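A minimal sketch of the mount call from a Databricks notebook, with the tenant ID, client ID, secret, storage account, container, and mount point as placeholders:

```python
# Databricks mount sketch using a service principal (app registration + client secret).
# `dbutils` is provided by the Databricks runtime; all identifiers below are placeholders.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<application-client-id>",
    "fs.azure.account.oauth2.client.secret": "<client-secret>",
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<tenant-id>/oauth2/token",
}

dbutils.fs.mount(
    source="abfss://tokyo-olympic-data@<storage-account>.dfs.core.windows.net/",
    mount_point="/mnt/tokyo-olympics",
    extra_configs=configs,
)
```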
- Read the data using Apache Spark (PySpark).
- Print the schema to verify column types.
- Convert data types where necessary (e.g., medal counts from string to integer).
- Filter and group data where needed for insights.
- Write the transformed data back to ADLS Gen2 (Transformed Data folder).
- Store the output in CSV format with partitioning (see the sketch below).
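A sketch of the read, schema check, type fixes, and partitioned write-back, using one of the five files as an example; the paths, column names, and partition column are assumptions.

```python
# Read a raw CSV from the mounted lake, verify the schema, fix types,
# and write the result back to the transformed-data folder.
# Paths, columns, and the partition column (Discipline) are assumptions.
from pyspark.sql.functions import col

entries = (spark.read
           .option("header", True)
           .option("inferSchema", True)
           .csv("/mnt/tokyo-olympics/raw-data/entries_gender.csv"))

entries.printSchema()                      # verify the inferred column types

entries_clean = (entries
                 .withColumn("Female", col("Female").cast("int"))
                 .withColumn("Male", col("Male").cast("int"))
                 .withColumn("Total", col("Total").cast("int"))
                 .dropDuplicates())

(entries_clean.write
 .mode("overwrite")
 .option("header", True)
 .partitionBy("Discipline")                # assumed partition column
 .csv("/mnt/tokyo-olympics/transformed-data/entries_gender"))
```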
- Create a Synapse Analytics Workspace.
- Load the transformed data into Synapse Tables.
- Run SQL queries to extract insights, such as:
- Ranking countries by medal count.
- Finding the top athlete in each discipline.
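As an illustration, the country-ranking query might look like the sketch below; it is shown running from Python against the Synapse SQL endpoint via pyodbc, and the server, database, credentials, and table/column names are placeholders rather than the project's actual objects.

```python
# Query the Synapse SQL endpoint from Python and rank countries by medal count.
# Server, database, credentials, and table/column names are placeholders.
import pyodbc

conn = pyodbc.connect(
    "DRIVER={ODBC Driver 18 for SQL Server};"
    "SERVER=<synapse-workspace>.sql.azuresynapse.net;"
    "DATABASE=olympics;UID=<user>;PWD=<password>"
)

query = """
SELECT Team_Country,
       Gold, Silver, Bronze,
       RANK() OVER (ORDER BY Gold DESC, Silver DESC, Bronze DESC) AS medal_rank
FROM dbo.medals
ORDER BY medal_rank;
"""

for row in conn.cursor().execute(query):
    print(row.medal_rank, row.Team_Country)
```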
- Connect Power BI to Synapse Analytics.
- Create dashboards to visualize key insights.
- Publish Reports for stakeholders.
This project demonstrates a real-world data engineering workflow in Azure Cloud, integrating multiple services for data ingestion, transformation, storage, analysis, and visualization.
