- Question 1: Market Analyst Task
- Question 2: Revenue Analysis
- Question 3: Gaming Category Percentage
- Question 4: Paid vs Free Apps Recommendation
- Question 5: Tracking Price Changes
- Question 6: Restoring Correct Data
- Question 7: Correlation Investigation
- Question 8: Cleaning Genres Column
- Question 9: Duration vs Fetch Time
This case study aims to clean and analyze the Google Play Store dataset for various market analysis tasks. Below are several key tasks performed during the process, including cleaning the data, modifying data types, and preparing the dataset for further analysis.
We removed the unnecessary column Unnamed: 0
using the following SQL command:
# Droping a column
alter table Playstore
drop column `Unnamed: 0`;
To clean rows with invalid data in the Reviews
column and change its datatype to INTEGER
:
DELETE FROM playstore WHERE reviews LIKE '%M';
ALTER TABLE playstore
MODIFY COLUMN reviews INTEGER;
The Installs
column had symbols like +
and ,
that were removed. We also converted the column from TEXT
to DOUBLE
for numerical analysis:
UPDATE playstore
SET installs = REPLACE(REPLACE(installs, '+', ''), ',', '');
ALTER TABLE playstore
MODIFY COLUMN installs DOUBLE;
The Price
column was cleaned by removing the dollar sign and converting the column to INTEGER
:
UPDATE playstore
SET Price = REPLACE(Price, '$', '');
ALTER TABLE playstore
MODIFY COLUMN Price INTEGER;
We removed rows with NULL
values across multiple columns:
WITH temp AS (
SELECT `index`
FROM playstore
WHERE App IS NULL OR Category IS NULL OR Rating IS NULL
OR Reviews IS NULL OR Size IS NULL OR Installs IS NULL
OR Type IS NULL OR Price IS NULL OR `Content Rating` IS NULL
OR Genres IS NULL OR `Last Updated` IS NULL
OR `Current Ver` IS NULL OR `Android Ver` IS NULL
)
DELETE FROM playstore WHERE `index` IN (SELECT `index` FROM temp);
alter table playstore
change `Content Rating` `Content_Rating` text,
change `Current Ver` `Current_Ver` text,
change `Android Ver` `Android_Ver` text
change `Last Updated` `Last_Updated` text
;
The Last_Updated
column was converted into the DATETIME
format for proper time-series analysis:
UPDATE playstore
SET Last_Updated = STR_TO_DATE(`Last_Updated`, '%d-%b-%y');
ALTER TABLE playstore
MODIFY COLUMN Last_Updated DATETIME;
Question 1. You're working as a market analyst for a mobile app development company. Your task is to identify the most promising categories (TOP 5) for launching new free apps based on their average ratings.
select Category,round(avg(rating),2) as 'avg_rating' from playstore
group by Category order by avg_rating desc limit 5;
Question 2. As a business strategist for a mobile app company, your objective is to pinpoint the three categories that generate the most revenue from paid apps. This calculation is based on the product of the app price and its number of installations.
select Category,round(avg(revenue),2) 'avg_revenue' from (
select *,(installs*Price) as 'Revenue' from playstore
) t group by Category order by avg_revenue desc;
Question 3. As a data analyst for a gaming company, you're tasked with calculating the percentage of games within each category. This information will help the company understand the distribution of gaming apps across different categories.
select *, (total_apps/(select count(*) from playstore))*100 as 'category_percentage' from (
select Category,count(*) 'total_apps' from playstore
group by Category
)t order by Category_percentage desc;
Question 4. As a data analyst at a mobile app-focused market research firm you’ll recommend whether the company should develop paid or free apps for each category based on the ratings of that category.
with t1 as (
select Category,round(avg(Rating),2) 'avg_rating_free' from playstore where Type = 'Free'
group by Category
),
t2 as (
select Category,round(avg(Rating),2) 'avg_rating_paid' from playstore where Type = 'Paid'
group by category
)
select t1.category,avg_rating_free,avg_rating_paid,
case
when avg_rating_free > avg_rating_paid then 'made_free_apps'
else 'made_paid_apps'
end as 'suggestion'
from t1
join t2
on t1.Category = t2.Category;
Question 5. As a database administrator, your databases have been hacked and hackers are changing the price of certain apps in the database. While it is taking some time for the IT team to neutralize the hack, you, as a responsible manager, don't want your data to be changed without tracking. You decide to implement a measure where any changes in the price can be recorded. The goal is to log the changes made to the Price
field by the hackers.
To tackle this problem, we will use SQL triggers that will log any changes to the Price
field of the playstore
dataset into a separate table for future auditing.
-
Create a Table for Logging Price Changes
We'll create a table calledpricechagelog
to store information on each price change, including the old price, new price, app name, the type of operation (in this case, an update), and the timestamp of the change.CREATE TABLE pricechagelog ( app VARCHAR(255), old_price DECIMAL(10, 2), new_price DECIMAL(10, 2), operation_type VARCHAR(255), operation_date TIMESTAMP
-
Make a Duplicate Database Table for Experimentation
Since we don’t want to change the originalplaystore
dataset, we create a duplicate table calledplay
and insert all data fromplaystore
.CREATE TABLE play LIKE playstore; INSERT INTO play SELECT * FROM playstore;
-
Create the Trigger for Price Changes
We create an AFTER UPDATE trigger on theplay
table that automatically logs any changes to thePrice
field into thepricechagelog
table.DELIMITER // CREATE TRIGGER price_change_log AFTER UPDATE ON play FOR EACH ROW BEGIN INSERT INTO pricechagelog(app, old_price, new_price, operation_type, operation_date) VALUES (NEW.app, OLD.price, NEW.price, 'update', CURRENT_TIMESTAMP); END; // DELIMITER ;
-
Testing the Trigger
To test the trigger, we perform an update on thePric
e field of theplay
table and check if the change is logged correctly in thepricechagelog
table.UPDATE play t1 SET price = 40 WHERE `index` = 0; SELECT * FROM play;
Question 6. Your IT team has neutralized the threat; however, hackers have made some changes in the prices. Since you had measures in place to log the changes, you can now restore the correct data into the database.
SET sql_safe_updates = 0;
-- Dropping the price change log trigger
DROP TRIGGER price_change_log;
-- Updating the play table with the correct prices from pricechagelog
UPDATE play t1
JOIN pricechagelog t2
ON t1.`App` = t2.`App`
SET t1.price = t2.old_price;
-- Selecting all data from the play table to verify the update
SELECT * FROM play;
Question 7. as a data person you are assigned the task of investigating the correlation between two numeric factors: app ratings and the quantity of reviews.
-
Calculate the averages (means) of ratings and reviews
SET @x = (SELECT ROUND(AVG(rating), 2) FROM playstore); SET @y = (SELECT ROUND(AVG(reviews), 2) FROM playstore);
-
Step 2: Create a temporary table to calculate (x - avg(x)) and (y - avg(y)) along with their squares
WITH temp AS ( SELECT *, ROUND(rat * rat, 2) AS 'sqrt_x', ROUND(rev * rev, 2) AS 'sqrt_y' FROM ( SELECT rating, @x, reviews, @y, ROUND((rating - @x), 2) AS rat, ROUND((reviews - @y), 2) AS rev FROM playstore ) t1
-
Step 3: Calculate the numerator and denominator for Pearson Correlation
SELECT @numerator := ROUND(SUM(rat * rev), 2), @deno_1 := ROUND(SUM(sqrt_x), 2), @deno_2 := ROUND(SUM(sqrt_y), 2) FROM temp;
-
Step 4: Compute the Pearson correlation coefficient
SELECT ROUND(@numerator / SQRT(@deno_1 * @deno_2), 2) AS correlation;
observation: The Pearson correlation coefficient between app ratings and the quantity of reviews is **0.07**.
This indicates a very weak positive correlation, meaning that there is a slight tendency for apps with more reviews to have slightly higher ratings, but the relationship is not strong.
Question 8. Your boss noticed that some rows in the Genres
column had multiple genres, which was causing issues when developing the recommender system. You were tasked with cleaning the Genres
column by splitting it into two separate columns: genre_1
and genre_2
. Rows with only one genre will have the genre_2
column left blank.
-
Add New Genre Columns We first added two new columns,
genre_1
andgenre_2
, to theplaystore
table.ALTER TABLE playstore ADD COLUMN genre_1 VARCHAR(255) AFTER Genres, ADD COLUMN genre_2 VARCHAR(255) AFTER genre_1;
-
Step 2: Populate genre_1 with the First Genre
UPDATE playstore t1 SET genre_1 = SUBSTRING_INDEX(Genres, ';', 1);
-
Step 3: Populate genre_2 with the Second Genre or Leave it Blank
UPDATE playstore t1 SET genre_2 = CASE WHEN Genres = SUBSTRING_INDEX(Genres, ';', -1) THEN NULL ELSE SUBSTRING_INDEX(Genres, ';', -1) END;
In MySQL Workbench, Duration Time and Fetch Time refer to different phases of executing a query:
- Definition: This represents the total time it takes for MySQL to process the query and prepare the results, including query parsing, optimization, and execution on the server side.
- Purpose: It indicates how long the server took to generate the results, but it does not include the time required to retrieve the results from the server.
- Example: If you run a
SELECT
query that takes 1 second to search the database and prepare the result, the Duration Time will be 1 second.
- Definition: This is the time spent fetching the result set from the MySQL server to the MySQL Workbench client (the actual transfer of data from the server to your local machine).
- Purpose: Fetch Time indicates how long it takes to retrieve and transfer the result rows, particularly important when dealing with large datasets.
- Example: After MySQL has executed the query, if it takes 0.5 seconds to send the data to your machine, the Fetch Time will be 0.5 seconds.
- Duration Time: The time taken by MySQL to execute and prepare the query results on the server.
- Fetch Time: The time taken to transfer the result set from the server to the client (MySQL Workbench).
Both metrics together help you understand how long a query takes from execution to result retrieval.