diff --git a/.ipynb_checkpoints/student-checkpoint.ipynb b/.ipynb_checkpoints/student-checkpoint.ipynb new file mode 100644 index 00000000..8db372a4 --- /dev/null +++ b/.ipynb_checkpoints/student-checkpoint.ipynb @@ -0,0 +1,1748 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Final Project Submission\n", + "\n", + "Please fill out:\n", + "* Student name: David Munyiri\n", + "* Student pace: part time\n", + "* Scheduled project review date/time: 27/07/2025 23:59:59\n", + "* Instructor name: Fidelis Wanalwenge\n", + "* Blog post URL:\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "# ================================================\n", + "# Aviation Safety Risk Analysis Report\n", + "# ================================================\n", + "\n", + "## Introduction\n", + "### This notebook analyzes aviation accident data to provide recommendations for selecting the safest aircraft models for business, commercial, or personal purposes.\n", + "\n", + "### Key objectives:\n", + "- Clean and prepare the data\n", + "- Compute safety risk metrics (Fatality, Severe Injury, Damage Severity)\n", + "- Calculate a weighted Risk Score\n", + "- Identify aircraft models with best safety records\n", + "- Provide data exports for Tableau visualization\n", + "#\n", + "# -------------------------------" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data Exploration" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "C:\\Users\\david.munyiri\\AppData\\Local\\anaconda3\\envs\\learn-env\\lib\\site-packages\\IPython\\core\\interactiveshell.py:3145: DtypeWarning: Columns (6,7,28) have mixed types.Specify dtype option on import or set low_memory=False.\n", + " has_raised = await self.run_ast_nodes(code_ast.body, cell_name,\n" + ] + } + ], + "source": [ + "#Load the data into a 
pandas DataFrame\n", + "import pandas as pd\n", + "import seaborn as sns\n", + "import matplotlib.pyplot as plt\n", + "\n", + "Aviation_df = pd.read_csv(\"data/Aviation_Data.csv\")" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(90348, 31)" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#Check the size of the Aviation raw data\n", + "Aviation_df.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Index(['Event.Id', 'Investigation.Type', 'Accident.Number', 'Event.Date',\n", + " 'Location', 'Country', 'Latitude', 'Longitude', 'Airport.Code',\n", + " 'Airport.Name', 'Injury.Severity', 'Aircraft.damage',\n", + " 'Aircraft.Category', 'Registration.Number', 'Make', 'Model',\n", + " 'Amateur.Built', 'Number.of.Engines', 'Engine.Type', 'FAR.Description',\n", + " 'Schedule', 'Purpose.of.flight', 'Air.carrier', 'Total.Fatal.Injuries',\n", + " 'Total.Serious.Injuries', 'Total.Minor.Injuries', 'Total.Uninjured',\n", + " 'Weather.Condition', 'Broad.phase.of.flight', 'Report.Status',\n", + " 'Publication.Date'],\n", + " dtype='object')" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#View all the columns of the raw data\n", + "Aviation_df.columns" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 90348 entries, 0 to 90347\n", + "Data columns (total 31 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 Event.Id 88889 non-null object \n", + " 1 Investigation.Type 90348 non-null object \n", + " 2 Accident.Number 88889 non-null object \n", + " 3 Event.Date 88889 non-null object \n", + " 4 
Location 88837 non-null object \n", + " 5 Country 88663 non-null object \n", + " 6 Latitude 34382 non-null object \n", + " 7 Longitude 34373 non-null object \n", + " 8 Airport.Code 50249 non-null object \n", + " 9 Airport.Name 52790 non-null object \n", + " 10 Injury.Severity 87889 non-null object \n", + " 11 Aircraft.damage 85695 non-null object \n", + " 12 Aircraft.Category 32287 non-null object \n", + " 13 Registration.Number 87572 non-null object \n", + " 14 Make 88826 non-null object \n", + " 15 Model 88797 non-null object \n", + " 16 Amateur.Built 88787 non-null object \n", + " 17 Number.of.Engines 82805 non-null float64\n", + " 18 Engine.Type 81812 non-null object \n", + " 19 FAR.Description 32023 non-null object \n", + " 20 Schedule 12582 non-null object \n", + " 21 Purpose.of.flight 82697 non-null object \n", + " 22 Air.carrier 16648 non-null object \n", + " 23 Total.Fatal.Injuries 77488 non-null float64\n", + " 24 Total.Serious.Injuries 76379 non-null float64\n", + " 25 Total.Minor.Injuries 76956 non-null float64\n", + " 26 Total.Uninjured 82977 non-null float64\n", + " 27 Weather.Condition 84397 non-null object \n", + " 28 Broad.phase.of.flight 61724 non-null object \n", + " 29 Report.Status 82508 non-null object \n", + " 30 Publication.Date 73659 non-null object \n", + "dtypes: float64(5), object(26)\n", + "memory usage: 21.4+ MB\n" + ] + } + ], + "source": [ + "#Get information on the data types and content in different columns\n", + "Aviation_df.info()" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Event.Id Investigation.Type Accident.Number Event.Date \\\n", + "0 20001218X45444 Accident SEA87LA080 1948-10-24 \n", + "1 20001218X45447 Accident LAX94LA336 1962-07-19 \n", + "2 20061025X01555 Accident NYC07LA005 1974-08-30 \n", + "3 20001218X45448 Accident LAX96LA321 1977-06-19 \n", + "4 20041105X01764 Accident CHI79FA064 1979-08-02 \n", + "\n", + " Location Country Latitude Longitude Airport.Code \\\n", + "0 MOOSE CREEK, ID United States NaN NaN NaN \n", + "1 BRIDGEPORT, CA United States NaN NaN NaN \n", + "2 Saltville, VA United States 36.9222 -81.8781 NaN \n", + "3 EUREKA, CA United States NaN NaN NaN \n", + "4 Canton, OH United States NaN NaN NaN \n", + "\n", + " Airport.Name ... Purpose.of.flight Air.carrier Total.Fatal.Injuries \\\n", + "0 NaN ... Personal NaN 2.0 \n", + "1 NaN ... Personal NaN 4.0 \n", + "2 NaN ... Personal NaN 3.0 \n", + "3 NaN ... Personal NaN 2.0 \n", + "4 NaN ... Personal NaN 1.0 \n", + "\n", + " Total.Serious.Injuries Total.Minor.Injuries Total.Uninjured \\\n", + "0 0.0 0.0 0.0 \n", + "1 0.0 0.0 0.0 \n", + "2 NaN NaN NaN \n", + "3 0.0 0.0 0.0 \n", + "4 2.0 NaN 0.0 \n", + "\n", + " Weather.Condition Broad.phase.of.flight Report.Status Publication.Date \n", + "0 UNK Cruise Probable Cause NaN \n", + "1 UNK Unknown Probable Cause 19-09-1996 \n", + "2 IMC Cruise Probable Cause 26-02-2007 \n", + "3 IMC Cruise Probable Cause 12-09-2000 \n", + "4 VMC Approach Probable Cause 16-04-1980 \n", + "\n", + "[5 rows x 31 columns]" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#View a snapshot of the raw data\n", + "Aviation_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Make Model Aircraft.Category Engine.Type Injury.Severity \\\n", + "count 88826 88797 32287 81812 87889 \n", + "unique 8237 12318 15 13 109 \n", + "top Cessna 152 Airplane Reciprocating Non-Fatal \n", + "freq 22227 2367 27617 69530 67357 \n", + "\n", + " Aircraft.damage \n", + "count 85695 \n", + "unique 4 \n", + "top Substantial \n", + "freq 64148 " + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#View statistics of columns of interest \n", + "Aviation_df[['Make', 'Model','Aircraft.Category', 'Engine.Type', 'Injury.Severity','Aircraft.damage']].describe()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data Cleaning\n", + "\n", + "Based on a quick exploration, the dataset appears to contain records of accidents and incidents involving various aircraft types, with **airplanes** being the most frequent category.\n", + "\n", + "Our analysis focuses on accident records and removes rows that are missing:\n", + "\n", + "- Make, Model, Aircraft Category\n", + "- Injury counts (fatal, serious, minor, uninjured)\n", + "\n", + "which are critical to our eventual recommendation. 
This cleaning process ensures that the dataset remains relevant, consistent, and ready for further analysis.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "# Filter only 'Accident' type investigations\n", + "accidents_df = Aviation_df[Aviation_df['Investigation.Type'] == 'Accident'].copy()\n", + "\n", + "# Standardize Make and Model columns before grouping\n", + "accidents_df['Make'] = accidents_df['Make'].str.lower().str.strip()\n", + "accidents_df['Model'] = accidents_df['Model'].str.lower().str.strip()\n", + "\n", + "# Rebuild combined make_model field\n", + "accidents_df['make_model'] = accidents_df['Make'] + ' ' + accidents_df['Model']\n", + "\n", + "# Define critical columns to keep\n", + "critical_columns = [\n", + " 'Make', 'Model', 'Aircraft.Category',\n", + " 'Total.Fatal.Injuries', 'Total.Serious.Injuries',\n", + " 'Total.Minor.Injuries', 'Total.Uninjured'\n", + "]\n", + "\n", + "# Drop rows with missing critical values\n", + "accidents_df.dropna(subset=critical_columns, inplace=True)\n", + "\n", + "# Fill in missing aircraft damage field\n", + "accidents_df['Aircraft.damage'] = accidents_df['Aircraft.damage'].fillna('Unknown')\n", + "\n", + "# Convert injuries to numeric\n", + "injury_cols = ['Total.Fatal.Injuries', 'Total.Serious.Injuries', 'Total.Minor.Injuries', 'Total.Uninjured']\n", + "for col in injury_cols:\n", + " accidents_df[col] = pd.to_numeric(accidents_df[col], errors='coerce').fillna(0)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Aggregate Accident Statistics by Aircraft Make and Model" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "#Define the columns that the data will be grouped by\n", + "grouped_df = accidents_df.groupby(['make_model'])\n", + "\n", + "#Total risk factor counts\n", + "model_summary_df = grouped_df.agg(\n", + " total_accidents=('Model', 'count'),\n", + " 
total_fatalities=('Total.Fatal.Injuries', 'sum'),\n", + " total_serious=('Total.Serious.Injuries', 'sum'),\n", + " total_minor=('Total.Minor.Injuries', 'sum'),\n", + " total_uninjured=('Total.Uninjured', 'sum'),\n", + " total_destroyed=('Aircraft.damage', lambda x: (x == 'Destroyed').sum())\n", + ").reset_index()\n", + "\n", + "model_summary_df['make_model'] = model_summary_df['make_model'].str.lower().str.strip()\n", + "\n", + "# Total people onboard\n", + "model_summary_df['total_people'] = (\n", + " model_summary_df['total_fatalities'] +\n", + " model_summary_df['total_serious'] +\n", + " model_summary_df['total_minor'] +\n", + " model_summary_df['total_uninjured']\n", + ")\n", + "\n", + "# Filter for valid data\n", + "model_summary_df = model_summary_df[\n", + " (model_summary_df['total_people'] > 0) &\n", + " (model_summary_df['total_accidents'] >= 10)\n", + "]\n", + "\n", + "# Add a combined Make_Model label for easier charting\n", + "# model_summary_df['make_model'] = model_summary_df['Make'] + ' ' + model_summary_df['Model']\n" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "total_fatalities 0\n", + "total_serious 0\n", + "total_minor 0\n", + "total_destroyed 0\n", + "total_accidents 0\n", + "dtype: int64\n", + "Empty DataFrame\n", + "Columns: [make_model, total_accidents, total_fatalities, total_serious, total_minor, total_uninjured, total_destroyed, total_people]\n", + "Index: []\n" + ] + } + ], + "source": [ + "# Check for missing values in critical columns\n", + "print(model_summary_df[['total_fatalities', 'total_serious', 'total_minor', 'total_destroyed', 'total_accidents']].isnull().sum())\n", + "\n", + "# Look at models with very few accidents or zero values in critical columns\n", + "print(model_summary_df[model_summary_df['total_accidents'] < 10])\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Compute Risk 
Indexes\n", + "\n", + "Based on the available dataset, we derive indexes that help us estimate and assign a safety evaluation to each aircraft model.\n", + "\n", + "- **Fatality Index** = Fatalities / Total People Onboard\n", + "- **Injury Index** = (Serious + Minor Injuries) / Total People Onboard\n", + "- **Damage Severity Index** = Destroyed Aircraft / Total Accidents\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Index(['make_model', 'total_accidents', 'total_fatalities', 'total_serious',\n", + " 'total_minor', 'total_uninjured', 'total_destroyed', 'total_people',\n", + " 'fatality_index', 'injury_index', 'damage_severity_index'],\n", + " dtype='object')\n" + ] + } + ], + "source": [ + "# Define fatality index\n", + "model_summary_df['fatality_index'] = model_summary_df['total_fatalities'] / model_summary_df['total_people']\n", + "\n", + "#Define injury index\n", + "model_summary_df['injury_index'] = (\n", + " model_summary_df['total_serious'] + model_summary_df['total_minor']\n", + ") / model_summary_df['total_people']\n", + "\n", + "#Define damage severity index\n", + "model_summary_df['damage_severity_index'] = model_summary_df['total_destroyed'] / model_summary_df['total_accidents']\n", + "print(model_summary_df.columns)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Calculate Weighted Risk Score\n", + "## Define weights for each index (update these anytime to change importance or client priority/preference)\n", + "\n", + "- **Fatality Index** = 0.5\n", + "- **Injury Index** = 0.2\n", + "- **Damage Severity Index** = 0.3\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " make_model total_accidents total_fatalities total_serious \\\n", + "7488 vans rv4 15 9.0 2.0 \n", + "7489 vans rv6 14 6.0 6.0 \n", + "7491 vans rv7 11 4.0 6.0 \n", + "7495 vans rv8 14 5.0 1.0 \n", + "7750 yakovlev yak 52 11 10.0 3.0 \n", + "\n", + " total_minor total_uninjured total_destroyed total_people \\\n", + "7488 5.0 5.0 3 21.0 \n", + "7489 8.0 2.0 4 22.0 \n", + "7491 1.0 3.0 3 14.0 \n", + "7495 2.0 10.0 3 18.0 \n", + "7750 2.0 5.0 1 20.0 \n", + "\n", + " fatality_index injury_index damage_severity_index risk_score \n", + "7488 0.428571 0.333333 0.200000 0.340952 \n", + "7489 0.272727 0.636364 0.285714 0.349351 \n", + "7491 0.285714 0.500000 0.272727 0.324675 \n", + "7495 0.277778 0.166667 0.214286 0.236508 \n", + "7750 0.500000 0.250000 0.090909 0.327273 " + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Define the weight of each index in the risk score\n", + "WEIGHTS = {\n", + " 'fatality_index': 0.5,\n", + " 'damage_severity_index': 0.3,\n", + " 'injury_index': 0.2\n", + "}\n", + "# Compute the risk score as the weighted sum of the fatality, damage severity and injury indices\n", + "model_summary_df['risk_score'] = (\n", + " model_summary_df['fatality_index'] * WEIGHTS['fatality_index'] +\n", + " model_summary_df['damage_severity_index'] * WEIGHTS['damage_severity_index'] +\n", + " model_summary_df['injury_index'] * WEIGHTS['injury_index']\n", + ")\n", + "model_summary_df.tail()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(431, 12)" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model_summary_df.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(431, 12)" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + 
"model_summary_df_cleaned = model_summary_df.dropna(subset=['risk_score'])\n", + "model_summary_df_cleaned.shape" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Visualize Risk Index Distributions" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "plt.figure(figsize=(15, 8))\n", + "sns.heatmap(\n", + " model_summary_df[['fatality_index', 'damage_severity_index', 'injury_index', 'risk_score']].corr(),\n", + " annot=True,\n", + " cmap='coolwarm',\n", + " fmt='.2f'\n", + ")\n", + "plt.title(\"Correlation Matrix of Risk Indexes\")\n", + "plt.tight_layout()\n", + "plt.show()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "# Scatter plot: Damage Severity Index vs Risk Score\n", + "plt.figure(figsize=(8, 6))\n", + "sns.scatterplot(data=model_summary_df, x='damage_severity_index', y='risk_score', color='orange')\n", + "plt.title('Damage Severity Index vs Risk Score')\n", + "plt.xlabel('Damage Severity Index')\n", + "plt.ylabel('Risk Score')\n", + "plt.tight_layout()\n", + "plt.show()\n", + "\n", + "# Scatter plot: Fatality Index vs Risk Score\n", + "plt.figure(figsize=(8, 6))\n", + "sns.scatterplot(data=model_summary_df, x='fatality_index', y='risk_score', color='red')\n", + "plt.title('Fatality Index vs Risk Score')\n", + "plt.xlabel('Fatality Index')\n", + "plt.ylabel('Risk Score')\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "plt.figure(figsize=(10, 8)) # Larger figure so the legend fits\n", + "\n", + "# Create the scatter plot\n", + "scatter = sns.scatterplot(\n", + " data=model_summary_df,\n", + " x='fatality_index',\n", + " y='risk_score',\n", + " size='total_people',\n", + " hue='risk_score',\n", + " palette='coolwarm',\n", + " sizes=(30, 200),\n", + " alpha=0.7\n", + ")\n", + "\n", + "# Add reference lines\n", + "plt.axhline(0.3, linestyle='--', color='gray', alpha=0.5)\n", + "plt.axvline(0.2, linestyle='--', color='gray', alpha=0.5)\n", + "\n", + "# Customize titles and labels\n", + "plt.title(\"Aircraft Risk Profile\\n(Bubble Size Represents Total People Involved)\", pad=20, fontsize=14)\n", + "plt.xlabel(\"Fatality Index (Fatalities/Total People)\", fontsize=12)\n", + "plt.ylabel(\"Composite Risk Score\", fontsize=12)\n", + "\n", + "\n", + "\n", + "# Place the legend inside the plot, near the bottom-right corner\n", + "plt.legend(\n", + " bbox_to_anchor=(0.80, 0.0), # Anchor point in axes coordinates\n", + " loc='lower left', # The legend's lower-left corner sits on the anchor\n", + " borderaxespad=0.5,\n", + " frameon=True,\n", + " title='Risk Score'\n", + " )\n", + "# Add tight_layout\n", + "plt.tight_layout() \n", + "\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " make_model total_accidents total_people \\\n", + "1428 boeing 757 16 1810.0 \n", + "4759 maule m-5-210c 12 26.0 \n", + "3010 embraer emb-145lr 10 426.0 \n", + "1387 boeing 737 7h4 14 1655.0 \n", + "1460 boeing 777 17 2422.0 \n", + "3202 evektor-aerotechnik as sportstar 20 27.0 \n", + "1938 cessna 180j 25 42.0 \n", + "1930 cessna 180a 14 30.0 \n", + "4047 hughes 269a 20 33.0 \n", + "283 air tractor at 602 12 13.0 \n", + "\n", + " fatality_index damage_severity_index injury_index risk_score \n", + "1428 0.0000 0.0000 0.0050 0.0010 \n", + "4759 0.0000 0.0000 0.0385 0.0077 \n", + "3010 0.0000 0.0000 0.0399 0.0080 \n", + "1387 0.0006 0.0000 0.0894 0.0182 \n", + "1460 0.0000 0.0588 0.0087 0.0194 \n", + "3202 0.0000 0.0000 0.1111 0.0222 \n", + "1938 0.0000 0.0000 0.1190 0.0238 \n", + "1930 0.0333 0.0000 0.0667 0.0300 \n", + "4047 0.0000 0.0000 0.1515 0.0303 \n", + "283 0.0000 0.0000 0.1538 0.0308 " + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Filter out models with zero risk score\n", + "filtered_df = model_summary_df[model_summary_df['risk_score'] > 0]\n", + "\n", + "# Sort top 10 lowest non-zero risk models\n", + "top_10 = filtered_df.sort_values('risk_score').head(10)\n", + "\n", + "plt.figure(figsize=(10, 6))\n", + "sns.barplot(\n", + " data=top_10,\n", + " x='risk_score',\n", + " y='make_model',\n", + " hue='make_model',\n", + " dodge=False\n", + ")\n", + "plt.title(\"Top 10 Lowest-Risk Aircraft Models\")\n", + "plt.xlabel(\"Risk Score\")\n", + "plt.ylabel(\"make_model\")\n", + "plt.tight_layout()\n", + "plt.show()\n", + "\n", + "# Display summary table with key stats\n", + "top_10[[\n", + " 'make_model', 'total_accidents', 'total_people',\n", + " 'fatality_index', 'damage_severity_index', 'injury_index', 'risk_score'\n", + "]].round(4)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "# Remove models with 
zero risk score as indication of limited data for analysis\n", + "model_summary_df = model_summary_df[model_summary_df['risk_score'] > 0]\n", + "\n", + "# Exclude models with a zero fatality, damage or injury index from the respective charts\n", + "fatality_filtered = model_summary_df[model_summary_df['fatality_index'] > 0]\n", + "damage_filtered = model_summary_df[model_summary_df['damage_severity_index'] > 0]\n", + "injury_filtered = model_summary_df[model_summary_df['injury_index'] > 0]\n" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "# Sort and select top 10\n", + "lowest_fatality = fatality_filtered.sort_values('fatality_index').head(10)\n", + "\n", + "# Plot\n", + "plt.figure(figsize=(10,6))\n", + "sns.barplot(data=lowest_fatality, x='fatality_index', y='make_model', palette='Greens_r')\n", + "plt.title(\"Top 10 Models with Lowest Non-Zero Fatality Rates\")\n", + "plt.xlabel(\"Fatality Index\")\n", + "plt.ylabel(\"Aircraft Make-Model\")\n", + "plt.tight_layout()\n", + "plt.show()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "# Top 10 models with lowest non-zero Damage Severity Index\n", + "damage_filtered = model_summary_df[model_summary_df['damage_severity_index'] > 0]\n", + "lowest_damage = damage_filtered.sort_values('damage_severity_index').head(10)\n", + "\n", + "plt.figure(figsize=(10,6))\n", + "sns.barplot(data=lowest_damage, x='damage_severity_index', y='make_model', palette='Purples_r')\n", + "plt.title(\"Top 10 Models with Lowest Non-Zero Damage Severity Index\")\n", + "plt.xlabel(\"Damage Severity Index\")\n", + "plt.ylabel(\"Aircraft Make-Model\")\n", + "plt.tight_layout()\n", + "plt.show()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "# Top 10 models with lowest non-zero Damage Severity Index\n", + "damage_filtered = model_summary_df[model_summary_df['damage_severity_index'] > 0]\n", + "lowest_damage = damage_filtered.sort_values('damage_severity_index').head(10)\n", + "\n", + "plt.figure(figsize=(10,6))\n", + "sns.barplot(data=lowest_damage, x='damage_severity_index', y='make_model', palette='Oranges_r')\n", + "plt.title(\"Top 10 Models with Lowest Non-Zero Damage Severity Index\")\n", + "plt.xlabel(\"Damage Severity Index\")\n", + "plt.ylabel(\"Aircraft Make-Model\")\n", + "plt.tight_layout()\n", + "plt.show()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "# Top 10 models with lowest non-zero Injury Index\n", + "lowest_injury = injury_filtered.sort_values('injury_index').head(10)\n", + "\n", + "plt.figure(figsize=(10,6))\n", + "sns.barplot(data=lowest_injury, x='injury_index', y='make_model', palette='Blues_r')\n", + "plt.title(\"Top 10 Models with Lowest Non-Zero Severe Injury Index\")\n", + "plt.xlabel(\"Severe Injury Index\")\n", + "plt.ylabel(\"Aircraft Make-Model\")\n", + "plt.tight_layout()\n", + "plt.show()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Data Analysis\n", + "## Recommend the aircraft with the lowest fatality, injury, damage and overall risk, i.e. the ones that intersect across all the metrics." + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Models appearing in top 30 for all 3 metrics:\n", + "{'cessna 180a', 'cessna 195a', 'boeing 747'}\n", + "\n", + "✅ Top 10 Models by Combined Safety Rank:\n", + " make_model fatality_index injury_index \\\n", + "1428 boeing 757 0.000000 0.004972 \n", + "4759 maule m-5-210c 0.000000 0.038462 \n", + "3010 embraer emb-145lr 0.000000 0.039906 \n", + "1930 cessna 180a 0.033333 0.066667 \n", + "3202 evektor-aerotechnik as sportstar 0.000000 0.111111 \n", + "1387 boeing 737 7h4 0.000604 0.089426 \n", + "1938 cessna 180j 0.000000 0.119048 \n", + "1981 cessna 195a 0.038462 0.076923 \n", + "5632 piper pa 28-161 0.032258 0.096774 \n", + "1460 boeing 777 0.000000 0.008671 \n", + "\n", + " damage_severity_index risk_score combined_rank \n", + "1428 0.000000 0.000994 10.0 \n", + "4759 0.000000 0.007692 19.0 \n", + "3010 0.000000 0.007981 23.0 \n", + "1930 0.000000 0.030000 78.0 \n", + "3202 0.000000 0.022222 80.0 \n", + "1387 0.000000 0.018187 90.0 \n", + "1938 0.000000 0.023810 91.0 \n", + "1981 0.000000 
0.034615 95.0 \n", + "5632 0.000000 0.035484 115.0 \n", + "1460 0.058824 0.019381 117.0 \n" + ] + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Widen the comparison window to the top 30 models per metric\n", + "top_n = 30\n", + "top_fatality = fatality_filtered.sort_values('fatality_index').head(top_n)['make_model']\n", + "top_injury = injury_filtered.sort_values('injury_index').head(top_n)['make_model']\n", + "top_risk = model_summary_df.sort_values('risk_score').head(top_n)['make_model']\n", + "top_damage = damage_filtered.sort_values('damage_severity_index').head(top_n)['make_model']\n", + "\n", + "# Intersection of the fatality, injury, and overall risk lists\n", + "common_models = set(top_fatality) & set(top_injury) & set(top_risk)\n", + "\n", + "if common_models:\n", + "    print(f\"✅ Models appearing in top {top_n} for all 3 metrics:\")\n", + "    print(common_models)\n", + "else:\n", + "    print(f\"❌ No exact overlap in top {top_n}. Computing combined ranking...\")\n", + "\n", + "# Rank every model on each metric (a lower index gives a better rank)\n", + "model_summary_df['rank_fatality'] = model_summary_df['fatality_index'].rank(method='min')\n", + "model_summary_df['rank_injury'] = model_summary_df['injury_index'].rank(method='min')\n", + "model_summary_df['rank_risk'] = model_summary_df['risk_score'].rank(method='min')\n", + "model_summary_df['rank_damage'] = model_summary_df['damage_severity_index'].rank(method='min')\n", + "\n", + "# Compute combined rank across 4 metrics\n", + "model_summary_df['combined_rank'] = (\n", + "    model_summary_df['rank_fatality'] +\n", + "    model_summary_df['rank_injury'] +\n", + "    model_summary_df['rank_risk'] +\n", + "    model_summary_df['rank_damage']\n", + ")\n", + "\n", + "# Sort by combined rank\n", + "combined_top = model_summary_df.sort_values('combined_rank').head(10)\n", + "print(\"\\n✅ Top 10 Models by Combined Safety Rank:\")\n", + "print(combined_top[['make_model', 'fatality_index', 'injury_index', 'damage_severity_index', 'risk_score', 'combined_rank']])\n", + "\n", + "# Optional: Venn diagram (uses damage severity, not risk score, as the third set)\n", + "from matplotlib_venn import venn3\n", + "\n", + "plt.figure(figsize=(8,6))\n", +
"venn3([set(top_fatality), set(top_injury), set(top_damage)],\n", + "      set_labels=('Top Fatality', 'Top Severe Injury', 'Top Damage'))\n", + "plt.title(f\"Overlap of Top {top_n} Models Across Fatality, Injury, and Damage Metrics\")\n", + "plt.show()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Export completed successfully!\n" + ] + }, + { + "data": { + "text/html": [ + "
\n", + "
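<table border=\"1\" class=\"dataframe\">\n", + "  <thead>\n", +
"    <tr><th></th><th>make_model</th><th>total_accidents</th><th>total_people</th><th>fatality_index</th><th>injury_index</th><th>damage_severity_index</th><th>risk_score</th><th>combined_rank</th></tr>\n", + "  </thead>\n", + "  <tbody>\n", +
"    <tr><th>32</th><td>aero commander 100</td><td>13</td><td>21.0</td><td>0.095238</td><td>0.285714</td><td>0.076923</td><td>0.127839</td><td>690.0</td></tr>\n", +
"    <tr><th>71</th><td>aero commander s2r</td><td>18</td><td>18.0</td><td>0.222222</td><td>0.222222</td><td>0.388889</td><td>0.272222</td><td>1254.0</td></tr>\n", +
"    <tr><th>98</th><td>aeronca 11ac</td><td>29</td><td>50.0</td><td>0.140000</td><td>0.340000</td><td>0.068966</td><td>0.158690</td><td>853.0</td></tr>\n", +
"    <tr><th>101</th><td>aeronca 15ac</td><td>10</td><td>12.0</td><td>0.083333</td><td>0.083333</td><td>0.000000</td><td>0.058333</td><td>185.0</td></tr>\n", +
"    <tr><th>110</th><td>aeronca 7ac</td><td>85</td><td>129.0</td><td>0.162791</td><td>0.286822</td><td>0.082353</td><td>0.163466</td><td>865.0</td></tr>\n", +
"    <tr><th>113</th><td>aeronca 7bcm</td><td>14</td><td>17.0</td><td>0.058824</td><td>0.470588</td><td>0.071429</td><td>0.144958</td><td>773.0</td></tr>\n", +
"    <tr><th>115</th><td>aeronca 7ccm</td><td>10</td><td>16.0</td><td>0.000000</td><td>0.312500</td><td>0.000000</td><td>0.062500</td><td>374.0</td></tr>\n", +
"    <tr><th>163</th><td>aerospatiale as350</td><td>15</td><td>32.0</td><td>0.281250</td><td>0.093750</td><td>0.066667</td><td>0.179375</td><td>715.0</td></tr>\n", +
"    <tr><th>233</th><td>agusta a109</td><td>11</td><td>30.0</td><td>0.633333</td><td>0.166667</td><td>0.545455</td><td>0.513636</td><td>1420.0</td></tr>\n", +
"    <tr><th>283</th><td>air tractor at 602</td><td>12</td><td>13.0</td><td>0.000000</td><td>0.153846</td><td>0.000000</td><td>0.030769</td><td>128.0</td></tr>\n", + "  </tbody>\n", + "</table>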
\n", + "
" + ], + "text/plain": [ + " make_model total_accidents total_people fatality_index \\\n", + "32 aero commander 100 13 21.0 0.095238 \n", + "71 aero commander s2r 18 18.0 0.222222 \n", + "98 aeronca 11ac 29 50.0 0.140000 \n", + "101 aeronca 15ac 10 12.0 0.083333 \n", + "110 aeronca 7ac 85 129.0 0.162791 \n", + "113 aeronca 7bcm 14 17.0 0.058824 \n", + "115 aeronca 7ccm 10 16.0 0.000000 \n", + "163 aerospatiale as350 15 32.0 0.281250 \n", + "233 agusta a109 11 30.0 0.633333 \n", + "283 air tractor at 602 12 13.0 0.000000 \n", + "\n", + " injury_index damage_severity_index risk_score combined_rank \n", + "32 0.285714 0.076923 0.127839 690.0 \n", + "71 0.222222 0.388889 0.272222 1254.0 \n", + "98 0.340000 0.068966 0.158690 853.0 \n", + "101 0.083333 0.000000 0.058333 185.0 \n", + "110 0.286822 0.082353 0.163466 865.0 \n", + "113 0.470588 0.071429 0.144958 773.0 \n", + "115 0.312500 0.000000 0.062500 374.0 \n", + "163 0.093750 0.066667 0.179375 715.0 \n", + "233 0.166667 0.545455 0.513636 1420.0 \n", + "283 0.153846 0.000000 0.030769 128.0 " + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Ensure make_model column exists and is normalized\n", + "if 'make_model' not in model_summary_df.columns:\n", + " model_summary_df['make_model'] = (model_summary_df['Make'] + ' ' + model_summary_df['Model']).str.strip().str.lower()\n", + "else:\n", + " model_summary_df['make_model'] = model_summary_df['make_model'].str.strip().str.lower()\n", + "\n", + "# Define columns for export\n", + "export_cols = [\n", + " 'make_model',\n", + " 'total_accidents', 'total_people',\n", + " 'fatality_index', 'injury_index', 'damage_severity_index', 'risk_score', 'combined_rank'\n", + "]\n", + "\n", + "# Check if all columns exist\n", + "missing_cols = [col for col in export_cols if col not in model_summary_df.columns]\n", + "if missing_cols:\n", + " print(f\"⚠️ Missing columns: {missing_cols}\")\n", + "else:\n", + " # Export CSV and 
Excel\n", + "    model_summary_df[export_cols].to_csv('Aviation_Safety_Tableau.csv', index=False)\n", + "    model_summary_df[export_cols].to_excel('Aviation_Safety_Tableau.xlsx', index=False)\n", + "    print(\"✅ Export completed successfully!\")\n", + "\n", + "# Show a preview of exported data\n", + "model_summary_df[export_cols].head(10)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# **Aircraft Safety Analysis – Recommended Models**\n", + "\n", + "Based on the computed safety indices (**Fatality Index**, **Injury Index**, **Damage Severity Index**) and the overall **Risk Score**, the analysis yields the following insights:\n", + "\n", + "### **Insights**\n", + "1. **Models with the lowest risk scores** tend to have fewer accidents and lower fatality ratios.\n", + "2. **Purpose-of-flight patterns** show that some of these safer models are commonly used for **personal purposes**.\n", + "3. **Engine configurations** (type and number) may indicate suitability for specific operations.\n", + "\n", + "---\n", + "\n", + "## ✅ Recommendations for Client:\n", + "- **Personal Use:** For private operations, prioritize single-engine piston types with historically low injury rates.\n", + "- **Top 10 Models:** Consider the top 10 models illustrated in the bar graph; the Boeing 757 is the safest evaluated model to invest in.
" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python (learn-env)", + "language": "python", + "name": "learn-env" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.8.5" + } + }, + "nbformat": 4, + "nbformat_minor": 2 +} diff --git a/CONTRIBUTING.md b/CONTRIBUTING.md deleted file mode 100644 index dd12a2a4..00000000 --- a/CONTRIBUTING.md +++ /dev/null @@ -1,37 +0,0 @@ -# Contributing to Learn.co Curriculum - -We're really exited that you're about to contribute to the [open curriculum](https://learn.co/content-license) on [Learn.co](https://learn.co). If this is your first time contributing, please continue reading to learn how to make the most meaningful and useful impact possible. - -## Raising an Issue to Encourage a Contribution - -If you notice a problem with the curriculum that you believe needs improvement -but you're unable to make the change yourself, you should raise a Github issue -containing a clear description of the problem. Include relevant snippets of -the content and/or screenshots if applicable. Curriculum owners regularly review -issue lists and your issue will be prioritized and addressed as appropriate. - -## Submitting a Pull Request to Suggest an Improvement - -If you see an opportunity for improvement and can make the change yourself go -ahead and use a typical git workflow to make it happen: - -* Fork this curriculum repository -* Make the change on your fork, with descriptive commits in the standard format -* Open a Pull Request against this repo - -A curriculum owner will review your change and approve or comment on it in due -course. - -# Why Contribute? - -Curriculum on Learn is publicly and freely available under Learn's -[Educational Content License](https://learn.co/content-license). 
By -embracing an open-source contribution model, our goal is for the curriculum -on Learn to become, in time, the best educational content the world has -ever seen. - -We need help from the community of Learners to maintain and improve the -educational content. Everything from fixing typos, to correcting -out-dated information, to improving exposition, to adding better examples, -to fixing tests—all contributions to making the curriculum more effective are -welcome. diff --git a/LICENSE.md b/LICENSE.md index b31af39e..86321741 100644 --- a/LICENSE.md +++ b/LICENSE.md @@ -1,7 +1,5 @@ -# Learn.co Educational Content License +# Open Source Project -Copyright (c) 2015 Flatiron School, Inc +Copyright (c) 2025 , David Munyiri -The Flatiron School, Inc. owns this Educational Content. However, the Flatiron School supports the development and availability of educational materials in the public domain. Therefore, the Flatiron School grants Users of the Flatiron Educational Content set forth in this repository certain rights to reuse, build upon and share such Educational Content subject to the terms of the Educational Content License set forth [here](http://learn.co/content-license) (http://learn.co/content-license). You must read carefully the terms and conditions contained in the Educational Content License as such terms govern access to and use of the Educational Content. - -Flatiron School is willing to allow you access to and use of the Educational Content only on the condition that you accept all of the terms and conditions contained in the Educational Content License set forth [here](http://learn.co/content-license) (http://learn.co/content-license). By accessing and/or using the Educational Content, you are agreeing to all of the terms and conditions contained in the Educational Content License. If you do not agree to any or all of the terms of the Educational Content License, you are prohibited from accessing, reviewing or using in any way the Educational Content. 
+The material is open source and is open for use or interpretation by anyone. diff --git a/.gitignore b/Python.gitignore similarity index 65% rename from .gitignore rename to Python.gitignore index 68bc17f9..cb0f8dc8 100644 --- a/.gitignore +++ b/Python.gitignore @@ -1,6 +1,6 @@ # Byte-compiled / optimized / DLL files __pycache__/ -*.py[cod] +*.py[codz] *$py.class # C extensions @@ -46,7 +46,7 @@ htmlcov/ nosetests.xml coverage.xml *.cover -*.py,cover +*.py.cover .hypothesis/ .pytest_cache/ cover/ @@ -94,20 +94,35 @@ ipython_config.py # install all needed dependencies. #Pipfile.lock +# UV +# Similar to Pipfile.lock, it is generally recommended to include uv.lock in version control. +# This is especially recommended for binary packages to ensure reproducibility, and is more +# commonly ignored for libraries. +#uv.lock + # poetry # Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control. # This is especially recommended for binary packages to ensure reproducibility, and is more # commonly ignored for libraries. # https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control #poetry.lock +#poetry.toml # pdm # Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control. +# pdm recommends including project-wide configuration in pdm.toml, but excluding .pdm-python. +# https://pdm-project.org/en/latest/usage/project/#working-with-version-control #pdm.lock -# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it -# in version control. -# https://pdm.fming.dev/#use-with-ide -.pdm.toml +#pdm.toml +.pdm-python +.pdm-build/ + +# pixi +# Similar to Pipfile.lock, it is generally recommended to include pixi.lock in version control. +#pixi.lock +# Pixi creates a virtual environment in the .pixi directory, just like venv module creates one +# in the .venv directory. It is recommended not to include this directory in version control. 
+.pixi # PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm __pypackages__/ @@ -121,6 +136,7 @@ celerybeat.pid # Environments .env +.envrc .venv env/ venv/ @@ -158,3 +174,30 @@ cython_debug/ # and can be added to the global gitignore or merged into this file. For a more nuclear # option (not recommended) you can uncomment the following to ignore the entire idea folder. #.idea/ + +# Abstra +# Abstra is an AI-powered process automation framework. +# Ignore directories containing user credentials, local state, and settings. +# Learn more at https://abstra.io/docs +.abstra/ + +# Visual Studio Code +# Visual Studio Code specific template is maintained in a separate VisualStudioCode.gitignore +# that can be found at https://github.com/github/gitignore/blob/main/Global/VisualStudioCode.gitignore +# and can be added to the global gitignore or merged into this file. However, if you prefer, +# you could uncomment the following to ignore the entire vscode folder +# .vscode/ + +# Ruff stuff: +.ruff_cache/ + +# PyPI configuration file +.pypirc + +# Marimo +marimo/_static/ +marimo/_lsp/ +__marimo__/ + +# Streamlit +.streamlit/secrets.toml diff --git a/README.md b/README.md index f37c298c..b7556788 100644 --- a/README.md +++ b/README.md @@ -1,289 +1,96 @@ -# Phase 1 Project Description -You've made it all the way through the first phase of this course - take a minute to celebrate your awesomeness! +# Safety-Based Aircraft Investment Analysis -Now you will put your new skills to use with a large end-of-Phase project! +## 📌 Overview +This project analyzes aviation accident data to provide insights and recommendations for making **data-driven, safety-based aircraft investment decisions**. Using Python, the analysis evaluates accident trends, computes risk metrics, and ranks aircraft models to guide strategic investments in the aviation sector. 
-In this project description, we will cover: +--- -* [***Project Overview:***](#project-overview) the project goal, audience, and dataset -* [***Deliverables:***](#deliverables) the specific items you are required to produce for this project -* [***Grading:***](#grading) how your project will be scored -* [***Getting Started:***](#getting-started) guidance for how to begin your first project +## 👤 Author Information +- **Student Name:** David Munyiri +- **Student Pace:** Part-time +- **Instructor:** Fidelis Wanalwenge +- **Project Review Date:** 27/07/2025 -## Project Overview +--- -For this project, you will use data cleaning, imputation, analysis, and visualization to generate insights for a business stakeholder. +## 📂 Project Structure +- **student.ipynb**: Main Jupyter notebook containing the analysis. +- **data/Aviation_Data.csv**: Aviation dataset used for analysis. +- **outputs/** *(optional)*: Visualizations and processed data files. -### Business Problem +--- -Your company is expanding in to new industries to diversify its portfolio. Specifically, they are interested in purchasing and operating airplanes for commercial and private enterprises, but do not know anything about the potential risks of aircraft. You are charged with determining which aircraft are the lowest risk for the company to start this new business endeavor. You must then translate your findings into actionable insights that the head of the new aviation division can use to help decide which aircraft to purchase. 
+## 🛠️ Technologies Used +- **Python 3** +- **Libraries:** + - `pandas` - Data cleaning and manipulation + - `matplotlib` - Visualization + - `seaborn` - Advanced plotting -### The Data +--- -In the `data` folder is a [dataset](https://www.kaggle.com/datasets/khsamaha/aviation-accident-database-synopses) from the National Transportation Safety Board that includes aviation accident data from 1962 to 2023 about civil aviation accidents and selected incidents in the United States and international waters. +## ✅ Analysis Workflow +1. **Data Acquisition** + - Import aviation accident data into a pandas DataFrame. -It is up to you to decide what data to use, how to deal with missing values, how to aggregate the data, and how to visualize it in an interactive dashboard. +2. **Data Cleaning & Preparation** + - Handle missing values + - Create combined fields (e.g., `make_model`) -### Key Points +3. **Exploratory Data Analysis (EDA)** + - Accident frequency by aircraft model + - Severity patterns and contributing factors + - Correlation analysis -* **Your analysis should yield three concrete business recommendations.** The key idea behind dealing with missing values, aggregating and visualizaing data is to help your organization make data driven decisions. You will relate your findings to business intelligence by making recommendations for how the business should move forward with the new aviation opportunity. +4. **Safety Index Computation** + Calculate key safety metrics: + - Fatality Index + - Injury Index + - Damage Severity Index + Calculate a composite **Risk Score** for each aircraft model. + Weighted Risk Score: `(fatality_index * 0.5) + (damage_severity_index * 0.3) + (injury_index * 0.2)` -* **Communicating about your work well is extremely important.** Your ability to provide value to an organization - or to land a job there - is directly reliant on your ability to communicate with them about what you have done and why it is valuable. 
Create a storyline your audience (the head of the aviation division) can follow by walking them through the steps of your process, highlighting the most important points and skipping over the rest. +5. **Investment Recommendations** + - Identify aircraft models with **high safety ratings** for investment. -* **Use plenty of visualizations.** Visualizations are invaluable for exploring your data and making your findings accessible to a non-technical audience. Spotlight visuals in your presentation, but only ones that relate directly to your recommendations. Simple visuals are usually best (e.g. bar charts and line graphs), and don't forget to format them well (e.g. labels, titles). +--- -## Deliverables +## 📊 Key Visualizations +- **Correlation Heatmap of the various indices by Model** (Heat Map) +- **Damage, injury and fatality vs. risk score to validate correlation weighting** (Scatter Plot) +- **Safety indices vs. aircraft models for further analysis** (Bar Graph) +- **Safety Score Rankings** (Table/Chart) -There are three deliverables for this project: +--- -* A **non-technical presentation** -* A **Jupyter Notebook** -* A **GitHub repository** -* An **Interactive Dashboard** +## ▶️ How to Run +1. Clone this repository or download the files (https://github.com/dkamash/dsc-phase-1-Aircraft-Investment.git). +2. Install required dependencies (the notebook also imports `matplotlib_venn` and writes an `.xlsx` file): + ```bash + pip install pandas matplotlib seaborn matplotlib-venn openpyxl + ``` +3. Place `Aviation_Data.csv` in the `data/` directory. +4. Launch Jupyter Notebook: + ```bash + jupyter notebook student.ipynb + ``` +5. Run all cells to reproduce the analysis and visualizations. -### Non-Technical Presentation +--- -The non-technical presentation is a slide deck presenting your analysis to business stakeholders. +## ✅ Deliverables +- Complete aviation safety risk analysis. +- Visual reports supporting investment recommendations. +- Ranked list of safest aircraft models. 
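The Weighted Risk Score stated in the Analysis Workflow can be illustrated with a minimal sketch. The weights (0.5, 0.3, 0.2) come from the README; the index values are copied from the notebook's printed tables. The helper name `weighted_risk_score` and the list-of-dicts layout are assumptions made for illustration only and are not taken from the notebook, which works on a pandas DataFrame.

```python
# Weights as stated in the README's "Weighted Risk Score" line.
WEIGHTS = {
    "fatality_index": 0.5,
    "damage_severity_index": 0.3,
    "injury_index": 0.2,
}


def weighted_risk_score(indices):
    """Return the weighted sum of the three safety indices for one model."""
    return sum(weight * indices[name] for name, weight in WEIGHTS.items())


# Index values copied from the notebook's printed tables; the helper name
# and this list layout are illustrative assumptions, not the notebook's code.
models = [
    {"make_model": "boeing 777", "fatality_index": 0.0,
     "injury_index": 0.008671, "damage_severity_index": 0.058824},
    {"make_model": "aero commander s2r", "fatality_index": 0.222222,
     "injury_index": 0.222222, "damage_severity_index": 0.388889},
    {"make_model": "cessna 180a", "fatality_index": 0.033333,
     "injury_index": 0.066667, "damage_severity_index": 0.0},
]

scores = {m["make_model"]: weighted_risk_score(m) for m in models}

# Lower scores indicate a better safety record.
for name, score in sorted(scores.items(), key=lambda item: item[1]):
    print(f"{name}: {score:.6f}")
```

For these three rows the sketch gives values that agree, to six decimal places, with the `risk_score` column printed in the notebook (0.019381, 0.030000, and 0.272222), which suggests the notebook applies the same weights.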
-* ***Non-technical*** does not mean that you should avoid mentioning the technologies or techniques that you used, it means that you should explain any mentions of these technologies and avoid assuming that your audience is already familiar with them. -* ***Business stakeholders*** means that the audience for your presentation is the business, not the class or teacher. Do not assume that they are already familiar with the specific business problem. +--- -The presentation describes the project ***goals, data, methods, and results***. It must include at least ***three visualizations*** which correspond to ***three business recommendations***. +## 📝 : Tableau Public Visuals: +*https://public.tableau.com/app/profile/david.munyiri/viz/Tableau_17536476127480/Dashboard1?publish=yes* -We recommend that you follow this structure, although the slide titles should be specific to your project: +--- -1. Beginning - * Overview - * Business Understanding -2. Middle - * Data Understanding - * Data Analysis -3. End - * Recommendations - * Next Steps - * Thank You - * This slide should include a prompt for questions as well as your contact information (name and LinkedIn profile) - -You will give a live presentation of your slides and submit them in PDF format on Canvas. The slides should also be present in the GitHub repository you submit with a file name of `presentation.pdf`. - -The graded elements of the presentation are: - -* Presentation Content -* Slide Style -* Presentation Delivery and Answers to Questions - -See the [Grading](#grading) section for further explanation of these elements. 
- -For further reading on creating professional presentations, check out: - -* [Presentation Content](https://github.com/learn-co-curriculum/dsc-project-presentation-content) -* [Slide Style](https://github.com/learn-co-curriculum/dsc-project-slide-design) - -### Jupyter Notebook - -The Jupyter Notebook is a notebook that uses Python and Markdown to present your analysis to a data science audience. - -* ***Python and Markdown*** means that you need to construct an integrated `.ipynb` file with Markdown (headings, paragraphs, links, lists, etc.) and Python code to create a well-organized, skim-able document. - * The notebook kernel should be restarted and all cells run before submission, to ensure that all code is runnable in order. - * Markdown should be used to frame the project with a clear introduction and conclusion, as well as introducing each of the required elements. -* ***Data science audience*** means that you can assume basic data science proficiency in the person reading your notebook. This differs from the non-technical presentation. - -Along with the presentation, the notebook also describes the project ***goals, data, methods, and results***. - -You will submit the notebook in PDF format on Canvas as well as in `.ipynb` format in your GitHub repository. - -The graded elements for the Jupyter Notebook are: - -* Business Understanding -* Data Understanding -* Data Preparation -* Data Analysis -* Code Quality - -See the [Grading](#grading) section for further explanation of these elements. - -### GitHub Repository - -The GitHub repository is the cloud-hosted directory containing all of your project files as well as their version history. - -This repository link will be the project link that you include on your resume, LinkedIn, etc. for prospective employers to view your work. Note that we typically recommend that 3 links are highlighted (out of 5 projects) so don't stress too much about getting this one to be perfect! 
There will also be time after graduation for cosmetic touch-ups. - -A professional GitHub repository has: - -1. `README.md` - * A file called `README.md` at the root of the repository directory, written in Markdown; this is what is rendered when someone visits the link to your repository in the browser - * This file contains these sections: - * Overview - * Business Understanding - * Include stakeholder and key business questions - * Data Understanding and Analysis - * Source of data - * Description of data - * Three visualizations (the same visualizations presented in the slides and notebook) - * Conclusion - * Summary of conclusions including three relevant findings -2. Commit history - * Progression of updates throughout the project time period, not just immediately before the deadline - * Clear commit messages - * Commits from all team members (if a group project) -3. Organization - * Clear folder structure - * Clear names of files and folders - * Easily-located notebook and presentation linked in the README -4. Notebook(s) - * Clearly-indicated final notebook that runs without errors - * Exploratory/working notebooks (can contain errors, redundant code, etc.) from all team members (if a group project) -5. 
`.gitignore` - * A file called `.gitignore` at the root of the repository directory instructs Git to ignore large, unnecessary, or private files - * Because it starts with a `.`, you will need to type `ls -a` in the terminal in order to see that it is there - * GitHub maintains a [Python .gitignore](https://github.com/github/gitignore/blob/master/Python.gitignore) that may be a useful starting point for your version of this file - * To tell Git to ignore more files, just add a new line to `.gitignore` for each new file name - * Consider adding `.DS_Store` if you are using a Mac computer, as well as project-specific file names - * If you are running into an error message because you forgot to add something to `.gitignore` and it is too large to be pushed to GitHub [this blog post](https://medium.com/analytics-vidhya/tutorial-removing-large-files-from-git-78dbf4cf83a?sk=c3763d466c7f2528008c3777192dfb95)(friend link) should help you address this - -You wil submit a link to the GitHub repository on Canvas. - -See the [Grading](#grading) section for further explanation of how the GitHub repository will be graded. - -For further reading on creating professional notebooks and `README`s, check out [this reading](https://github.com/learn-co-curriculum/dsc-repo-readability-v2-2). - -### Interactive Dashboard - -The interactive dashboard is a collection of views that allows the viewer to change the views to understand different features in the data. This dashboard will be linked within your GitHub repository README.md file so that users can explore your analysis. Make sure you follow visual best practices that you have learned in this course. Below is an example of what you could produce for this assignment. 
-![tableau dashboard for aviation accidents](https://raw.githubusercontent.com/learn-co-curriculum/dsc-phase-1-project-v3/master/example_dashboard.png) - -## Grading - -***To pass this project, you must pass each project rubric objective.*** The project rubric objectives for Phase 1 are: - -1. Data Communication -2. Authoring Jupyter Notebooks -3. Data Manipulation and Analysis with `pandas` -4. Interactive Data Visualization - -### Data Communication - -Communication is a key "soft skill". In [this survey](https://www.payscale.com/data-packages/job-skills), 46% of hiring managers said that recent college grads were missing this skill. - -Because "communication" can encompass such a wide range of contexts and skills, we will specifically focus our Phase 1 objective on Data Communication. We define Data Communication as: - -> Communicating basic data analysis results to diverse audiences via writing and live presentation - -To further define some of these terms: - -* By "basic data analysis" we mean that you are filtering, sorting, grouping, and/or aggregating the data in order to answer business questions. This project does not involve inferential statistics or machine learning, although descriptive statistics such as measures of central tendency are encouraged. -* By "results" we mean your ***three visualizations and recommendations***. -* By "diverse audiences" we mean that your presentation and notebook are appropriately addressing a business and data science audience, respectively. - -Below are the definitions of each rubric level for this objective. This information is also summarized in the rubric, which is attached to the project submission assignment. 
- -#### Exceeds Objective -Creates and describes appropriate visualizations for given business questions, where each visualization fulfills all elements of the checklist - -> This "checklist" refers to the Data Visualization checklist within the larger Phase 1 Project Checklist - -#### Meets Objective (Passing Bar) -Creates and describes appropriate visualizations for given business questions - -> This objective can be met even if all checklist elements are not fulfilled. For example, if there is some illegible text in one of your visualizations, you can still meet this objective - -#### Approaching Objective -Creates visualizations that are not related to the business questions, or uses an inappropriate type of visualization - -> Even if you create very compelling visualizations, you cannot pass this objective if the visualizations are not related to the business questions - -> An example of an inappropriate type of visualization would be using a line graph to show the correlation between two independent variables, when a scatter plot would be more appropriate - -#### Does Not Meet Objective -Does not submit the required number of visualizations - -### Authoring Jupyter Notebooks - -According to [Kaggle's 2020 State of Data Science and Machine Learning Survey](https://www.kaggle.com/kaggle-survey-2020), 74.1% of data scientists use a Jupyter development environment, which is more than twice the percentage of the next-most-popular IDE, Visual Studio Code. Jupyter Notebooks allow for reproducible, skim-able code documents for a data science audience. Comfort and skill with authoring Jupyter Notebooks will prepare you for job interviews, take-home challenges, and on-the-job tasks as a data scientist. - -The key feature that distinguishes *authoring Jupyter Notebooks* from simply *writing Python code* is the fact that Markdown cells are integrated into the notebook along with the Python cells in a notebook. 
You have seen examples of this throughout the curriculum, but now it's time for you to practice this yourself! - -Below are the definitions of each rubric level for this objective. This information is also summarized in the rubric, which is attached to the project submission assignment. - -#### Exceeds Objective -Uses Markdown and code comments to create a well-organized, skim-able document that follows all best practices - -> Refer to the [repository readability reading](https://github.com/learn-co-curriculum/dsc-repo-readability-v2-2) for more tips on best practices - -#### Meets Objective (Passing Bar) -Uses some Markdown to create an organized notebook, with an introduction at the top and a conclusion at the bottom - -#### Approaching Objective -Uses Markdown cells to organize, but either uses only headers and does not provide any explanations or justifications, or uses only plaintext without any headers to segment out sections of the notebook - -> Headers in Markdown are delineated with one or more `#`s at the start of the line. You should have a mixture of headers and plaintext (text where the line does not start with `#`) - -#### Does Not Meet Objective -Does not submit a notebook, or does not use Markdown cells at all to organize the notebook - -### Data Manipulation and Analysis with `pandas` - -`pandas` is a very popular data manipulation library, with over 2 million downloads on Anaconda (`conda install pandas`) and over 19 million downloads on PyPI (`pip install pandas`) at the time of this writing. In our own internal data, we see that the overwhelming majority of Flatiron School DS grads use `pandas` on the job in some capacity. - -Unlike in base Python, where the Zen of Python says "There should be one-- and preferably only one --obvious way to do it", there is often more than one valid way to do something in `pandas`. However there are still more efficient and less efficient ways to use it. 
Specifically, the best `pandas` code is *performant* and *idiomatic*. - -Performant `pandas` code utilizes methods and broadcasting rather than user-defined functions or `for` loops. For example, if you need to strip whitespace from a column containing string data, the best approach would be to use the [`pandas.Series.str.strip` method](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.strip.html) rather than writing your own function or writing a loop. Or if you want to multiply everything in a column by 100, the best approach would be to use broadcasting (e.g. `df["column_name"] * 100`) instead of a function or loop. You can still write your own functions if needed, but only after checking that there isn't a built-in way to do it. - -Idiomatic `pandas` code has variable names that are meaningful words or abbreviations in English, that are related to the purpose of the variables. You can still use `df` as the name of your DataFrame if there is only one main DataFrame you are working with, but as soon as you are merging multiple DataFrames or taking a subset of a DataFrame, you should use meaningful names. For example, `df2` would not be an idiomatic name, but `movies_and_reviews` could be. - -We also recommend that you rename all DataFrame columns so that their meanings are more understandable, although it is fine to have acronyms. For example, `"col1"` would not be an idiomatic name, but `"USD"` could be. - -Below are the definitions of each rubric level for this objective. This information is also summarized in the rubric, which is attached to the project submission assignment. 
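As a brief illustration of the performant and idiomatic style described above, here is a minimal sketch. It assumes a small toy DataFrame; the name `aircraft` and the columns `make`, `fatality_rate`, and `fatality_pct` are hypothetical and are not taken from the project dataset.

```python
import pandas as pd

# Hypothetical toy data; the DataFrame and column names are illustrative only
aircraft = pd.DataFrame({
    "make": ["  Cessna", "Piper  ", " Boeing "],
    "fatality_rate": [0.12, 0.30, 0.05],
})

# Built-in vectorized string method instead of a loop or a custom function
aircraft["make"] = aircraft["make"].str.strip()

# Broadcasting: multiply every value in the column by 100 in one expression
aircraft["fatality_pct"] = aircraft["fatality_rate"] * 100

print(aircraft)
```

Both operations act on the whole column at once, which is usually shorter and faster than iterating over rows, and the descriptive names make the intent clear without extra comments.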
- -#### Exceeds Objective -Uses `pandas` to prepare data and answer business questions in an idiomatic, performant way - -#### Meets Objective (Passing Bar) -Successfully uses `pandas` to prepare data in order to answer business questions - -> This includes projects that _occasionally_ use base Python when `pandas` methods would be more appropriate (such as using `enumerate()` on a DataFrame), or occasionally performs operations that do not appear to have any relevance to the business questions - -#### Approaching Objective -Uses `pandas` to prepare data, but makes significant errors - -> Examples of significant errors include: the result presented does not actually answer the stated question, the code produces errors, the code _consistently_ uses base Python when `pandas` methods would be more appropriate, or the submitted notebook contains significant quantities of code that is unrelated to the presented analysis (such as copy/pasted code from the curriculum or StackOverflow) - -#### Does Not Meet Objective -Unable to prepare data using `pandas` - -> This includes projects that successfully answer the business questions, but do not use `pandas` (e.g. use only base Python, or use some other tool like R, Tableau, or Excel) - -### Interactive Data Visualization - -Tableau is a powerful data analysis tool that allows data to be presented in a manner that allows it to be easily digestible with visualizations and charts to aid in the simplification of the data and its analysis. Tableau contains many customizable features and makes it easy to share in many ways. We recommend you use Tableau for your interactive data visualization now that you have experience with it. - -Here are the definitions of each rubric level for this objective. 
- -#### Exceeds Objective -Creates an easy-to-use dashboard to answer business questions - -#### Meets Objective -Successfully creates a dashboard to answer business questions - -#### Approaching Objective -Creates a dashboard, but it is difficult to use - -#### Does Not Meet Objective -Unable to create a dashboard - -## Getting Started - -Please start by reviewing the contents of this project description. If you have any questions, please ask your instructor ASAP. - -Next, you will need to complete the [***Project Proposal***](#project_proposal) which must be reviewed by your instructor before you can continue with the project. - -Then, you will need to create a GitHub repository. There are three options: -1. Look at the [Phase 1 Project Templates and Examples repo](https://github.com/learn-co-curriculum/dsc-project-template) and follow the directions in the MVP branch. -2. Fork the [Phase 1 Project Repository](https://github.com/learn-co-curriculum/dsc-phase-1-project-v3), clone it locally, and work in the `student.ipynb` file. Make sure to also add and commit a PDF of your presentation to your repository with a file name of `presentation.pdf`. -3. Create a new repository from scratch by going to [github.com/new](https://github.com/new) and copying the data files from one of the above resources into your new repository. This approach will result in the most professional-looking portfolio repository, but can be more complicated to use. So if you are getting stuck with this option, try one of the above options instead. - -## Summary -This project will give you a valuable opportunity to develop your data science skills using real-world data. The end-of-phase projects are a critical part of the program because they give you a chance to bring together all the skills you've learned, apply them to realistic projects for a business stakeholder, practice communication skills, and get feedback to help you improve. You've got this!
+## 📌 License +This project is for **educational purposes only**. diff --git a/example_dashboard.png b/example_dashboard.png deleted file mode 100644 index f46de3dd..00000000 Binary files a/example_dashboard.png and /dev/null differ diff --git a/index.ipynb b/index.ipynb deleted file mode 100644 index 9f73253c..00000000 --- a/index.ipynb +++ /dev/null @@ -1,727 +0,0 @@ -{ - "cells": [ - { - "cell_type": "markdown", - "id": "0cf0d3bf", - "metadata": {}, - "source": [ - "# Phase 1 Project Description" - ] - }, - { - "cell_type": "markdown", - "id": "23ef5c52", - "metadata": {}, - "source": [ - "You've made it all the way through the first phase of this course - take a minute to celebrate your awesomeness!\n", - "\n", - "Now you will put your new skills to use with a large end-of-Phase project!\n", - "\n", - "In this project description, we will cover:\n", - "\n", - "* [***Project Overview:***](#project-overview) the project goal, audience, and dataset\n", - "* [***Deliverables:***](#deliverables) the specific items you are required to produce for this project\n", - "* [***Grading:***](#grading) how your project will be scored\n", - "* [***Getting Started:***](#getting-started) guidance for how to begin your first project" - ] - }, - { - "cell_type": "markdown", - "id": "ff21d421", - "metadata": {}, - "source": [ - "## Project Overview" - ] - }, - { - "cell_type": "markdown", - "id": "b273e0c7", - "metadata": {}, - "source": [ - "For this project, you will use data cleaning, imputation, analysis, and visualization to generate insights for a business stakeholder." - ] - }, - { - "cell_type": "markdown", - "id": "ff346426", - "metadata": {}, - "source": [ - "### Business Problem" - ] - }, - { - "cell_type": "markdown", - "id": "6b1ab930", - "metadata": {}, - "source": [ - "Your company is expanding into new industries to diversify its portfolio.
Specifically, they are interested in purchasing and operating airplanes for commercial and private enterprises, but do not know anything about the potential risks of aircraft. You are charged with determining which aircraft are the lowest risk for the company to start this new business endeavor. You must then translate your findings into actionable insights that the head of the new aviation division can use to help decide which aircraft to purchase." - ] - }, - { - "cell_type": "markdown", - "id": "57270bf5", - "metadata": {}, - "source": [ - "### The Data" - ] - }, - { - "cell_type": "markdown", - "id": "ab7012a4", - "metadata": {}, - "source": [ - "In the `data` folder is a [dataset](https://www.kaggle.com/datasets/khsamaha/aviation-accident-database-synopses) from the National Transportation Safety Board that includes aviation accident data from 1962 to 2023 about civil aviation accidents and selected incidents in the United States and international waters.\n", - "\n", - "It is up to you to decide what data to use, how to deal with missing values, how to aggregate the data, and how to visualize it in an interactive dashboard." - ] - }, - { - "cell_type": "markdown", - "id": "f8d228e6", - "metadata": {}, - "source": [ - "### Key Points" - ] - }, - { - "cell_type": "markdown", - "id": "8216724c", - "metadata": {}, - "source": [ - "* **Your analysis should yield three concrete business recommendations.** The key idea behind dealing with missing values, aggregating, and visualizing data is to help your organization make data-driven decisions. You will relate your findings to business intelligence by making recommendations for how the business should move forward with the new aviation opportunity.\n", - "\n", - "* **Communicating about your work well is extremely important.** Your ability to provide value to an organization - or to land a job there - is directly reliant on your ability to communicate with them about what you have done and why it is valuable.
Create a storyline your audience (the head of the aviation division) can follow by walking them through the steps of your process, highlighting the most important points and skipping over the rest.\n", - "\n", - "* **Use plenty of visualizations.** Visualizations are invaluable for exploring your data and making your findings accessible to a non-technical audience. Spotlight visuals in your presentation, but only ones that relate directly to your recommendations. Simple visuals are usually best (e.g. bar charts and line graphs), and don't forget to format them well (e.g. labels, titles)." - ] - }, - { - "cell_type": "markdown", - "id": "7a1b556f", - "metadata": {}, - "source": [ - "## Deliverables" - ] - }, - { - "cell_type": "markdown", - "id": "c284830f", - "metadata": {}, - "source": [ - "There are four deliverables for this project:\n", - "\n", - "* A **non-technical presentation**\n", - "* A **Jupyter Notebook**\n", - "* A **GitHub repository**\n", - "* An **Interactive Dashboard**" - ] - }, - { - "cell_type": "markdown", - "id": "30f3a6e4", - "metadata": {}, - "source": [ - "### Non-Technical Presentation" - ] - }, - { - "cell_type": "markdown", - "id": "71164c91", - "metadata": {}, - "source": [ - "The non-technical presentation is a slide deck presenting your analysis to business stakeholders.\n", - "\n", - "* ***Non-technical*** does not mean that you should avoid mentioning the technologies or techniques that you used, it means that you should explain any mentions of these technologies and avoid assuming that your audience is already familiar with them.\n", - "* ***Business stakeholders*** means that the audience for your presentation is the business, not the class or teacher. Do not assume that they are already familiar with the specific business problem.\n", - "\n", - "The presentation describes the project ***goals, data, methods, and results***.
It must include at least ***three visualizations*** which correspond to ***three business recommendations***.\n", - "\n", - "We recommend that you follow this structure, although the slide titles should be specific to your project:\n", - "\n", - "1. Beginning\n", - " * Overview\n", - " * Business Understanding\n", - "2. Middle\n", - " * Data Understanding\n", - " * Data Analysis\n", - "3. End\n", - " * Recommendations\n", - " * Next Steps\n", - " * Thank You\n", - " * This slide should include a prompt for questions as well as your contact information (name and LinkedIn profile)\n", - "\n", - "You will give a live presentation of your slides and submit them in PDF format on Canvas. The slides should also be present in the GitHub repository you submit with a file name of `presentation.pdf`.\n", - "\n", - "The graded elements of the presentation are:\n", - "\n", - "* Presentation Content\n", - "* Slide Style\n", - "* Presentation Delivery and Answers to Questions\n", - "\n", - "See the [Grading](#grading) section for further explanation of these elements.\n", - "\n", - "For further reading on creating professional presentations, check out:\n", - "\n", - "* [Presentation Content](https://github.com/learn-co-curriculum/dsc-project-presentation-content)\n", - "* [Slide Style](https://github.com/learn-co-curriculum/dsc-project-slide-design)" - ] - }, - { - "cell_type": "markdown", - "id": "3bd428c3", - "metadata": {}, - "source": [ - "### Jupyter Notebook" - ] - }, - { - "cell_type": "markdown", - "id": "b8d53d3f", - "metadata": {}, - "source": [ - "The Jupyter Notebook is a notebook that uses Python and Markdown to present your analysis to a data science audience.\n", - "\n", - "* ***Python and Markdown*** means that you need to construct an integrated `.ipynb` file with Markdown (headings, paragraphs, links, lists, etc.) 
and Python code to create a well-organized, skim-able document.\n", - " * The notebook kernel should be restarted and all cells run before submission, to ensure that all code is runnable in order.\n", - " * Markdown should be used to frame the project with a clear introduction and conclusion, as well as introducing each of the required elements.\n", - "* ***Data science audience*** means that you can assume basic data science proficiency in the person reading your notebook. This differs from the non-technical presentation.\n", - "\n", - "Along with the presentation, the notebook also describes the project ***goals, data, methods, and results***.\n", - "\n", - "You will submit the notebook in PDF format on Canvas as well as in `.ipynb` format in your GitHub repository.\n", - "\n", - "The graded elements for the Jupyter Notebook are:\n", - "\n", - "* Business Understanding\n", - "* Data Understanding\n", - "* Data Preparation\n", - "* Data Analysis\n", - "* Code Quality\n", - "\n", - "See the [Grading](#grading) section for further explanation of these elements." - ] - }, - { - "cell_type": "markdown", - "id": "d2055de2", - "metadata": {}, - "source": [ - "### GitHub Repository" - ] - }, - { - "cell_type": "markdown", - "id": "76404472", - "metadata": {}, - "source": [ - "The GitHub repository is the cloud-hosted directory containing all of your project files as well as their version history.\n", - "\n", - "This repository link will be the project link that you include on your resume, LinkedIn, etc. for prospective employers to view your work. Note that we typically recommend that 3 links are highlighted (out of 5 projects) so don't stress too much about getting this one to be perfect! There will also be time after graduation for cosmetic touch-ups.\n", - "\n", - "A professional GitHub repository has:\n", - "\n", - "1. 
`README.md`\n", - " * A file called `README.md` at the root of the repository directory, written in Markdown; this is what is rendered when someone visits the link to your repository in the browser\n", - " * This file contains these sections:\n", - " * Overview\n", - " * Business Understanding\n", - " * Include stakeholder and key business questions\n", - " * Data Understanding and Analysis\n", - " * Source of data\n", - " * Description of data\n", - " * Three visualizations (the same visualizations presented in the slides and notebook)\n", - " * Conclusion\n", - " * Summary of conclusions including three relevant findings\n", - "2. Commit history\n", - " * Progression of updates throughout the project time period, not just immediately before the deadline\n", - " * Clear commit messages\n", - " * Commits from all team members (if a group project)\n", - "3. Organization\n", - " * Clear folder structure\n", - " * Clear names of files and folders\n", - " * Easily-located notebook and presentation linked in the README\n", - "4. Notebook(s)\n", - " * Clearly-indicated final notebook that runs without errors\n", - " * Exploratory/working notebooks (can contain errors, redundant code, etc.) from all team members (if a group project)\n", - "5. 
`.gitignore`\n", - " * A file called `.gitignore` at the root of the repository directory instructs Git to ignore large, unnecessary, or private files\n", - " * Because it starts with a `.`, you will need to type `ls -a` in the terminal in order to see that it is there\n", - " * GitHub maintains a [Python .gitignore](https://github.com/github/gitignore/blob/master/Python.gitignore) that may be a useful starting point for your version of this file\n", - " * To tell Git to ignore more files, just add a new line to `.gitignore` for each new file name\n", - " * Consider adding `.DS_Store` if you are using a Mac computer, as well as project-specific file names\n", - " * If you are running into an error message because you forgot to add something to `.gitignore` and it is too large to be pushed to GitHub, [this blog post](https://medium.com/analytics-vidhya/tutorial-removing-large-files-from-git-78dbf4cf83a?sk=c3763d466c7f2528008c3777192dfb95) (friend link) should help you address this\n", - "\n", - "You will submit a link to the GitHub repository on Canvas.\n", - "\n", - "See the [Grading](#grading) section for further explanation of how the GitHub repository will be graded.\n", - "\n", - "For further reading on creating professional notebooks and `README`s, check out [this reading](https://github.com/learn-co-curriculum/dsc-repo-readability-v2-2)." - ] - }, - { - "cell_type": "markdown", - "id": "6bd15d88", - "metadata": {}, - "source": [ - "### Interactive Dashboard" - ] - }, - { - "cell_type": "markdown", - "id": "136adf8e", - "metadata": {}, - "source": [ - "The interactive dashboard is a collection of views that allows the viewer to change the views to understand different features in the data. This dashboard will be linked within your GitHub repository README.md file so that users can explore your analysis. Make sure you follow visual best practices that you have learned in this course.
Below is an example of what you could produce for this assignment.\n", - "![tableau dashboard for aviation accidents](https://raw.githubusercontent.com/learn-co-curriculum/dsc-phase-1-project-v3/master/example_dashboard.png)" - ] - }, - { - "cell_type": "markdown", - "id": "24f0b36b", - "metadata": {}, - "source": [ - "## Grading" - ] - }, - { - "cell_type": "markdown", - "id": "26cbcebc", - "metadata": {}, - "source": [ - "***To pass this project, you must pass each project rubric objective.*** The project rubric objectives for Phase 1 are:\n", - "\n", - "1. Data Communication\n", - "2. Authoring Jupyter Notebooks\n", - "3. Data Manipulation and Analysis with `pandas`\n", - "4. Interactive Data Visualization" - ] - }, - { - "cell_type": "markdown", - "id": "9a28c5bc", - "metadata": {}, - "source": [ - "### Data Communication" - ] - }, - { - "cell_type": "markdown", - "id": "e2d13765", - "metadata": {}, - "source": [ - "Communication is a key \"soft skill\". In [this survey](https://www.payscale.com/data-packages/job-skills), 46% of hiring managers said that recent college grads were missing this skill.\n", - "\n", - "Because \"communication\" can encompass such a wide range of contexts and skills, we will specifically focus our Phase 1 objective on Data Communication. We define Data Communication as:\n", - "\n", - "> Communicating basic data analysis results to diverse audiences via writing and live presentation\n", - "\n", - "To further define some of these terms:\n", - "\n", - "* By \"basic data analysis\" we mean that you are filtering, sorting, grouping, and/or aggregating the data in order to answer business questions. 
This project does not involve inferential statistics or machine learning, although descriptive statistics such as measures of central tendency are encouraged.\n", - "* By \"results\" we mean your ***three visualizations and recommendations***.\n", - "* By \"diverse audiences\" we mean that your presentation and notebook are appropriately addressing a business and data science audience, respectively.\n", - "\n", - "Below are the definitions of each rubric level for this objective. This information is also summarized in the rubric, which is attached to the project submission assignment." - ] - }, - { - "cell_type": "markdown", - "id": "8482a9e0", - "metadata": {}, - "source": [ - "#### Exceeds Objective" - ] - }, - { - "cell_type": "markdown", - "id": "5fb0b6d8", - "metadata": {}, - "source": [ - "Creates and describes appropriate visualizations for given business questions, where each visualization fulfills all elements of the checklist\n", - "\n", - "> This \"checklist\" refers to the Data Visualization checklist within the larger Phase 1 Project Checklist" - ] - }, - { - "cell_type": "markdown", - "id": "cb5b218e", - "metadata": {}, - "source": [ - "#### Meets Objective (Passing Bar)" - ] - }, - { - "cell_type": "markdown", - "id": "513ec39d", - "metadata": {}, - "source": [ - "Creates and describes appropriate visualizations for given business questions\n", - "\n", - "> This objective can be met even if all checklist elements are not fulfilled. 
For example, if there is some illegible text in one of your visualizations, you can still meet this objective" - ] - }, - { - "cell_type": "markdown", - "id": "385b8c3b", - "metadata": {}, - "source": [ - "#### Approaching Objective" - ] - }, - { - "cell_type": "markdown", - "id": "d6408ca4", - "metadata": {}, - "source": [ - "Creates visualizations that are not related to the business questions, or uses an inappropriate type of visualization\n", - "\n", - "> Even if you create very compelling visualizations, you cannot pass this objective if the visualizations are not related to the business questions\n", - "\n", - "> An example of an inappropriate type of visualization would be using a line graph to show the correlation between two independent variables, when a scatter plot would be more appropriate" - ] - }, - { - "cell_type": "markdown", - "id": "2db8f9c7", - "metadata": {}, - "source": [ - "#### Does Not Meet Objective" - ] - }, - { - "cell_type": "markdown", - "id": "c34d6c95", - "metadata": {}, - "source": [ - "Does not submit the required number of visualizations" - ] - }, - { - "cell_type": "markdown", - "id": "91fae056", - "metadata": {}, - "source": [ - "### Authoring Jupyter Notebooks" - ] - }, - { - "cell_type": "markdown", - "id": "83c5aa56", - "metadata": {}, - "source": [ - "According to [Kaggle's 2020 State of Data Science and Machine Learning Survey](https://www.kaggle.com/kaggle-survey-2020), 74.1% of data scientists use a Jupyter development environment, which is more than twice the percentage of the next-most-popular IDE, Visual Studio Code. Jupyter Notebooks allow for reproducible, skim-able code documents for a data science audience. 
Comfort and skill with authoring Jupyter Notebooks will prepare you for job interviews, take-home challenges, and on-the-job tasks as a data scientist.\n", - "\n", - "The key feature that distinguishes *authoring Jupyter Notebooks* from simply *writing Python code* is the fact that Markdown cells are integrated into the notebook along with the Python cells in a notebook. You have seen examples of this throughout the curriculum, but now it's time for you to practice this yourself!\n", - "\n", - "Below are the definitions of each rubric level for this objective. This information is also summarized in the rubric, which is attached to the project submission assignment." - ] - }, - { - "cell_type": "markdown", - "id": "b9a66aa0", - "metadata": {}, - "source": [ - "#### Exceeds Objective" - ] - }, - { - "cell_type": "markdown", - "id": "72a0a5ed", - "metadata": {}, - "source": [ - "Uses Markdown and code comments to create a well-organized, skim-able document that follows all best practices\n", - "\n", - "> Refer to the [repository readability reading](https://github.com/learn-co-curriculum/dsc-repo-readability-v2-2) for more tips on best practices" - ] - }, - { - "cell_type": "markdown", - "id": "da992bca", - "metadata": {}, - "source": [ - "#### Meets Objective (Passing Bar)" - ] - }, - { - "cell_type": "markdown", - "id": "20984b83", - "metadata": {}, - "source": [ - "Uses some Markdown to create an organized notebook, with an introduction at the top and a conclusion at the bottom" - ] - }, - { - "cell_type": "markdown", - "id": "63c0ab76", - "metadata": {}, - "source": [ - "#### Approaching Objective" - ] - }, - { - "cell_type": "markdown", - "id": "54c094f1", - "metadata": {}, - "source": [ - "Uses Markdown cells to organize, but either uses only headers and does not provide any explanations or justifications, or uses only plaintext without any headers to segment out sections of the notebook\n", - "\n", - "> Headers in Markdown are delineated with one or more `#`s 
at the start of the line. You should have a mixture of headers and plaintext (text where the line does not start with `#`)" - ] - }, - { - "cell_type": "markdown", - "id": "d2caa058", - "metadata": {}, - "source": [ - "#### Does Not Meet Objective" - ] - }, - { - "cell_type": "markdown", - "id": "041349f8", - "metadata": {}, - "source": [ - "Does not submit a notebook, or does not use Markdown cells at all to organize the notebook" - ] - }, - { - "cell_type": "markdown", - "id": "f0bd316a", - "metadata": {}, - "source": [ - "### Data Manipulation and Analysis with `pandas`" - ] - }, - { - "cell_type": "markdown", - "id": "c3cadaef", - "metadata": {}, - "source": [ - "`pandas` is a very popular data manipulation library, with over 2 million downloads on Anaconda (`conda install pandas`) and over 19 million downloads on PyPI (`pip install pandas`) at the time of this writing. In our own internal data, we see that the overwhelming majority of Flatiron School DS grads use `pandas` on the job in some capacity.\n", - "\n", - "Unlike in base Python, where the Zen of Python says \"There should be one-- and preferably only one --obvious way to do it\", there is often more than one valid way to do something in `pandas`. However there are still more efficient and less efficient ways to use it. Specifically, the best `pandas` code is *performant* and *idiomatic*.\n", - "\n", - "Performant `pandas` code utilizes methods and broadcasting rather than user-defined functions or `for` loops. For example, if you need to strip whitespace from a column containing string data, the best approach would be to use the [`pandas.Series.str.strip` method](https://pandas.pydata.org/docs/reference/api/pandas.Series.str.strip.html) rather than writing your own function or writing a loop. Or if you want to multiply everything in a column by 100, the best approach would be to use broadcasting (e.g. `df[\"column_name\"] * 100`) instead of a function or loop. 
You can still write your own functions if needed, but only after checking that there isn't a built-in way to do it.\n", - "\n", - "Idiomatic `pandas` code has variable names that are meaningful words or abbreviations in English, that are related to the purpose of the variables. You can still use `df` as the name of your DataFrame if there is only one main DataFrame you are working with, but as soon as you are merging multiple DataFrames or taking a subset of a DataFrame, you should use meaningful names. For example, `df2` would not be an idiomatic name, but `movies_and_reviews` could be.\n", - "\n", - "We also recommend that you rename all DataFrame columns so that their meanings are more understandable, although it is fine to have acronyms. For example, `\"col1\"` would not be an idiomatic name, but `\"USD\"` could be.\n", - "\n", - "Below are the definitions of each rubric level for this objective. This information is also summarized in the rubric, which is attached to the project submission assignment." 
- ] - }, - { - "cell_type": "markdown", - "id": "b3789af2", - "metadata": {}, - "source": [ - "#### Exceeds Objective" - ] - }, - { - "cell_type": "markdown", - "id": "eedaca9d", - "metadata": {}, - "source": [ - "Uses `pandas` to prepare data and answer business questions in an idiomatic, performant way" - ] - }, - { - "cell_type": "markdown", - "id": "c89f285f", - "metadata": {}, - "source": [ - "#### Meets Objective (Passing Bar)" - ] - }, - { - "cell_type": "markdown", - "id": "ce5c9b18", - "metadata": {}, - "source": [ - "Successfully uses `pandas` to prepare data in order to answer business questions\n", - "\n", - "> This includes projects that _occasionally_ use base Python when `pandas` methods would be more appropriate (such as using `enumerate()` on a DataFrame), or occasionally performs operations that do not appear to have any relevance to the business questions" - ] - }, - { - "cell_type": "markdown", - "id": "7d9656ca", - "metadata": {}, - "source": [ - "#### Approaching Objective" - ] - }, - { - "cell_type": "markdown", - "id": "9f3b2074", - "metadata": {}, - "source": [ - "Uses `pandas` to prepare data, but makes significant errors\n", - "\n", - "> Examples of significant errors include: the result presented does not actually answer the stated question, the code produces errors, the code _consistently_ uses base Python when `pandas` methods would be more appropriate, or the submitted notebook contains significant quantities of code that is unrelated to the presented analysis (such as copy/pasted code from the curriculum or StackOverflow)" - ] - }, - { - "cell_type": "markdown", - "id": "3f1b750b", - "metadata": {}, - "source": [ - "#### Does Not Meet Objective" - ] - }, - { - "cell_type": "markdown", - "id": "77c11e1b", - "metadata": {}, - "source": [ - "Unable to prepare data using `pandas`\n", - "\n", - "> This includes projects that successfully answer the business questions, but do not use `pandas` (e.g. 
use only base Python, or use some other tool like R, Tableau, or Excel)" - ] - }, - { - "cell_type": "markdown", - "id": "d49beec3", - "metadata": {}, - "source": [ - "### Interactive Data Visualization" - ] - }, - { - "cell_type": "markdown", - "id": "8998ec0a", - "metadata": {}, - "source": [ - "Tableau is a powerful data analysis tool that allows data to be presented in a manner that allows it to be easily digestible with visualizations and charts to aid in the simplification of the data and its analysis. Tableau contains many customizable features and makes it easy to share in many ways. We recommend you use Tableau for your interactive data visualization now that you have experience with it.\n", - "\n", - "Here are the definitions of each rubric level for this objective." - ] - }, - { - "cell_type": "markdown", - "id": "3c31fa6e", - "metadata": {}, - "source": [ - "#### Exceeds Objective" - ] - }, - { - "cell_type": "markdown", - "id": "af14cc9d", - "metadata": {}, - "source": [ - "Creates an easy to use dashboard to answer business questions" - ] - }, - { - "cell_type": "markdown", - "id": "6b6541d4", - "metadata": {}, - "source": [ - "#### Meets Objective" - ] - }, - { - "cell_type": "markdown", - "id": "9e86f8bc", - "metadata": {}, - "source": [ - "Successfully creates a dashboard to answer business questions" - ] - }, - { - "cell_type": "markdown", - "id": "a50b9933", - "metadata": {}, - "source": [ - "#### Approaching Objective" - ] - }, - { - "cell_type": "markdown", - "id": "e5d08da9", - "metadata": {}, - "source": [ - "Creates a dashboard, but it is difficult to use" - ] - }, - { - "cell_type": "markdown", - "id": "9d1cdd74", - "metadata": {}, - "source": [ - "#### Does Not Meet Objective" - ] - }, - { - "cell_type": "markdown", - "id": "8f0e7d90", - "metadata": {}, - "source": [ - "Unable to create a dashboard" - ] - }, - { - "cell_type": "markdown", - "id": "e55ad567", - "metadata": {}, - "source": [ - "## Getting Started" - ] - }, - { - "cell_type": 
"markdown", - "id": "4aa2dfa0", - "metadata": {}, - "source": [ - "Please start by reviewing the contents of this project description. If you have any questions, please ask your instructor ASAP.\n", - "\n", - "Next, you will need to complete the [***Project Proposal***](#project_proposal) which must be reviewed by your instructor before you can continue with the project.\n", - "\n", - "Then, you will need to create a GitHub repository. There are three options:\n", - "1. Look at the [Phase 1 Project Templates and Examples repo](https://github.com/learn-co-curriculum/dsc-project-template) and follow the directions in the MVP branch.\n", - "2. Fork the [Phase 1 Project Repository](https://github.com/learn-co-curriculum/dsc-phase-1-project-v3), clone it locally, and work in the `student.ipynb` file. Make sure to also add and commit a PDF of your presentation to your repository with a file name of `presentation.pdf`.\n", - "3. Create a new repository from scratch by going to [github.com/new](https://github.com/new) and copying the data files from one of the above resources into your new repository. This approach will result in the most professional-looking portfolio repository, but can be more complicated to use. So if you are getting stuck with this option, try one of the above options instead." - ] - }, - { - "cell_type": "markdown", - "id": "5fc1215d", - "metadata": {}, - "source": [ - "## Summary" - ] - }, - { - "cell_type": "markdown", - "id": "51c4f7fe", - "metadata": {}, - "source": [ - "This project will give you a valuable opportunity to develop your data science skills using real-world data. The end-of-phase projects are a critical part of the program because they give you a chance to bring together all the skills you've learned, apply them to realistic projects for a business stakeholder, practice communication skills, and get feedback to help you improve. You've got this!"
- ] - } - ], - "metadata": { - "kernelspec": { - "display_name": "Python (learn-env)", - "language": "python", - "name": "learn-env" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.9.16" - } - }, - "nbformat": 4, - "nbformat_minor": 5 -} diff --git a/student - Jupyter Notebook.pdf b/student - Jupyter Notebook.pdf new file mode 100644 index 00000000..2bd179bc Binary files /dev/null and b/student - Jupyter Notebook.pdf differ diff --git a/student.ipynb b/student.ipynb index d3bb34af..8db372a4 100644 --- a/student.ipynb +++ b/student.ipynb @@ -7,28 +7,1728 @@ "## Final Project Submission\n", "\n", "Please fill out:\n", - "* Student name: \n", - "* Student pace: self paced / part time / full time\n", - "* Scheduled project review date/time: \n", - "* Instructor name: \n", + "* Student name: David Munyiri\n", + "* Student pace: part time\n", + "* Scheduled project review date/time: 27/07/2025 23:59:59\n", + "* Instructor name: Fidelis Wanalwenge\n", "* Blog post URL:\n" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "\n", + "# ================================================\n", + "# Aviation Safety Risk Analysis Report\n", + "# ================================================\n", + "\n", + "## Introduction\n", + "### This notebook analyzes aviation accident data to provide recommendations for selecting the safest aircraft models for business, commercial, or personal purposes.\n", + "\n", + "### Key objectives:\n", + "- Clean and prepare the data\n", + "- Compute safety risk metrics (Fatality, Severe Injury, Damage Severity)\n", + "- Calculate a weighted Risk Score\n", + "- Identify aircraft models with best safety records\n", + "- Provide data exports for Tableau visualization\n", + "#\n", + "# -------------------------------" + ] + }, + { + 
"cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data Exploration" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "C:\\Users\\david.munyiri\\AppData\\Local\\anaconda3\\envs\\learn-env\\lib\\site-packages\\IPython\\core\\interactiveshell.py:3145: DtypeWarning: Columns (6,7,28) have mixed types.Specify dtype option on import or set low_memory=False.\n", + " has_raised = await self.run_ast_nodes(code_ast.body, cell_name,\n" + ] + } + ], + "source": [ + "#Load the data into a pandas Dataframe\n", + "import pandas as pd\n", + "import seaborn as sns\n", + "import matplotlib.pyplot as plt\n", + "\n", + "Aviation_df = pd.read_csv(\"data/Aviation_Data.csv\")" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(90348, 31)" + ] + }, + "execution_count": 2, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#Check the size of the Aviation raw data\n", + "Aviation_df.shape" + ] + }, { "cell_type": "code", - "execution_count": null, + "execution_count": 3, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "Index(['Event.Id', 'Investigation.Type', 'Accident.Number', 'Event.Date',\n", + " 'Location', 'Country', 'Latitude', 'Longitude', 'Airport.Code',\n", + " 'Airport.Name', 'Injury.Severity', 'Aircraft.damage',\n", + " 'Aircraft.Category', 'Registration.Number', 'Make', 'Model',\n", + " 'Amateur.Built', 'Number.of.Engines', 'Engine.Type', 'FAR.Description',\n", + " 'Schedule', 'Purpose.of.flight', 'Air.carrier', 'Total.Fatal.Injuries',\n", + " 'Total.Serious.Injuries', 'Total.Minor.Injuries', 'Total.Uninjured',\n", + " 'Weather.Condition', 'Broad.phase.of.flight', 'Report.Status',\n", + " 'Publication.Date'],\n", + " dtype='object')" + ] + }, + "execution_count": 3, + "metadata": {}, + "output_type": "execute_result" + } + ], + 
"source": [ + "#View the all the columns of the raw data\n", + "Aviation_df.columns" + ] + }, + { + "cell_type": "code", + "execution_count": 4, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "\n", + "RangeIndex: 90348 entries, 0 to 90347\n", + "Data columns (total 31 columns):\n", + " # Column Non-Null Count Dtype \n", + "--- ------ -------------- ----- \n", + " 0 Event.Id 88889 non-null object \n", + " 1 Investigation.Type 90348 non-null object \n", + " 2 Accident.Number 88889 non-null object \n", + " 3 Event.Date 88889 non-null object \n", + " 4 Location 88837 non-null object \n", + " 5 Country 88663 non-null object \n", + " 6 Latitude 34382 non-null object \n", + " 7 Longitude 34373 non-null object \n", + " 8 Airport.Code 50249 non-null object \n", + " 9 Airport.Name 52790 non-null object \n", + " 10 Injury.Severity 87889 non-null object \n", + " 11 Aircraft.damage 85695 non-null object \n", + " 12 Aircraft.Category 32287 non-null object \n", + " 13 Registration.Number 87572 non-null object \n", + " 14 Make 88826 non-null object \n", + " 15 Model 88797 non-null object \n", + " 16 Amateur.Built 88787 non-null object \n", + " 17 Number.of.Engines 82805 non-null float64\n", + " 18 Engine.Type 81812 non-null object \n", + " 19 FAR.Description 32023 non-null object \n", + " 20 Schedule 12582 non-null object \n", + " 21 Purpose.of.flight 82697 non-null object \n", + " 22 Air.carrier 16648 non-null object \n", + " 23 Total.Fatal.Injuries 77488 non-null float64\n", + " 24 Total.Serious.Injuries 76379 non-null float64\n", + " 25 Total.Minor.Injuries 76956 non-null float64\n", + " 26 Total.Uninjured 82977 non-null float64\n", + " 27 Weather.Condition 84397 non-null object \n", + " 28 Broad.phase.of.flight 61724 non-null object \n", + " 29 Report.Status 82508 non-null object \n", + " 30 Publication.Date 73659 non-null object \n", + "dtypes: float64(5), object(26)\n", + "memory usage: 21.4+ MB\n" + ] 
+ } + ], + "source": [ + "#Get information on the data types and content in different columns\n", + "Aviation_df.info()" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Event.Id Investigation.Type Accident.Number Event.Date \\\n", + "0 20001218X45444 Accident SEA87LA080 1948-10-24 \n", + "1 20001218X45447 Accident LAX94LA336 1962-07-19 \n", + "2 20061025X01555 Accident NYC07LA005 1974-08-30 \n", + "3 20001218X45448 Accident LAX96LA321 1977-06-19 \n", + "4 20041105X01764 Accident CHI79FA064 1979-08-02 \n", + "\n", + " Location Country Latitude Longitude Airport.Code \\\n", + "0 MOOSE CREEK, ID United States NaN NaN NaN \n", + "1 BRIDGEPORT, CA United States NaN NaN NaN \n", + "2 Saltville, VA United States 36.9222 -81.8781 NaN \n", + "3 EUREKA, CA United States NaN NaN NaN \n", + "4 Canton, OH United States NaN NaN NaN \n", + "\n", + " Airport.Name ... Purpose.of.flight Air.carrier Total.Fatal.Injuries \\\n", + "0 NaN ... Personal NaN 2.0 \n", + "1 NaN ... Personal NaN 4.0 \n", + "2 NaN ... Personal NaN 3.0 \n", + "3 NaN ... Personal NaN 2.0 \n", + "4 NaN ... Personal NaN 1.0 \n", + "\n", + " Total.Serious.Injuries Total.Minor.Injuries Total.Uninjured \\\n", + "0 0.0 0.0 0.0 \n", + "1 0.0 0.0 0.0 \n", + "2 NaN NaN NaN \n", + "3 0.0 0.0 0.0 \n", + "4 2.0 NaN 0.0 \n", + "\n", + " Weather.Condition Broad.phase.of.flight Report.Status Publication.Date \n", + "0 UNK Cruise Probable Cause NaN \n", + "1 UNK Unknown Probable Cause 19-09-1996 \n", + "2 IMC Cruise Probable Cause 26-02-2007 \n", + "3 IMC Cruise Probable Cause 12-09-2000 \n", + "4 VMC Approach Probable Cause 16-04-1980 \n", + "\n", + "[5 rows x 31 columns]" + ] + }, + "execution_count": 5, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#View a snapshot of the raw data\n", + "Aviation_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " Make Model Aircraft.Category Engine.Type Injury.Severity \\\n", + "count 88826 88797 32287 81812 87889 \n", + "unique 8237 12318 15 13 109 \n", + "top Cessna 152 Airplane Reciprocating Non-Fatal \n", + "freq 22227 2367 27617 69530 67357 \n", + "\n", + " Aircraft.damage \n", + "count 85695 \n", + "unique 4 \n", + "top Substantial \n", + "freq 64148 " + ] + }, + "execution_count": 6, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "#View statistics of columns of interest \n", + "Aviation_df[['Make', 'Model','Aircraft.Category', 'Engine.Type', 'Injury.Severity','Aircraft.damage']].describe()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Data Cleaning\n", + "\n", + "Based on a quick exploration, the dataset appears to contain records of accidents and incidents involving various aircraft types, with **airplanes** being the most frequent category.\n", + "\n", + "The focus of our analysis will be on accident records and remove rows missing:\n", + "\n", + "- Make, Model, Aircraft Category\n", + "- Injury counts (fatal, serious, minor, uninjured)\n", + "\n", + "which are critical to our eventual recommendation. This cleaning process ensures that the dataset remains relevant, consistent, and ready for further analysis.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 7, "metadata": {}, "outputs": [], "source": [ - "# Your code here - remember to use markdown cells for comments as well!" 
+ "# Filter only 'Accident' type investigations\n", + "accidents_df = Aviation_df[Aviation_df['Investigation.Type'] == 'Accident'].copy()\n", + "\n", + "# Standardize Make and Model columns before grouping\n", + "accidents_df['Make'] = accidents_df['Make'].str.lower().str.strip()\n", + "accidents_df['Model'] = accidents_df['Model'].str.lower().str.strip()\n", + "\n", + "# Rebuild combined make_model field\n", + "accidents_df['make_model'] = accidents_df['Make'] + ' ' + accidents_df['Model']\n", + "\n", + "# Define critical columns to keep\n", + "critical_columns = [\n", + " 'Make', 'Model', 'Aircraft.Category',\n", + " 'Total.Fatal.Injuries', 'Total.Serious.Injuries',\n", + " 'Total.Minor.Injuries', 'Total.Uninjured'\n", + "]\n", + "\n", + "# Drop rows with missing critical values\n", + "accidents_df.dropna(subset=critical_columns, inplace=True)\n", + "\n", + "# Fill in missing aircraft damage field\n", + "accidents_df['Aircraft.damage'] = accidents_df['Aircraft.damage'].fillna('Unknown')\n", + "\n", + "# Convert injuries to numeric\n", + "injury_cols = ['Total.Fatal.Injuries', 'Total.Serious.Injuries', 'Total.Minor.Injuries', 'Total.Uninjured']\n", + "for col in injury_cols:\n", + " accidents_df[col] = pd.to_numeric(accidents_df[col], errors='coerce').fillna(0)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Aggregate Accident Statistics by Aircraft Make and Model" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "#Define the columns that the data will be grouped by\n", + "grouped_df = accidents_df.groupby(['make_model'])\n", + "\n", + "#Total risk factor counts\n", + "model_summary_df = grouped_df.agg(\n", + " total_accidents=('Model', 'count'),\n", + " total_fatalities=('Total.Fatal.Injuries', 'sum'),\n", + " total_serious=('Total.Serious.Injuries', 'sum'),\n", + " total_minor=('Total.Minor.Injuries', 'sum'),\n", + " total_uninjured=('Total.Uninjured', 'sum'),\n", + " 
total_destroyed=('Aircraft.damage', lambda x: (x == 'Destroyed').sum())\n", + ").reset_index()\n", + "\n", + "model_summary_df['make_model'] = model_summary_df['make_model'].str.lower().str.strip()\n", + "\n", + "# Total people onboard\n", + "model_summary_df['total_people'] = (\n", + " model_summary_df['total_fatalities'] +\n", + " model_summary_df['total_serious'] +\n", + " model_summary_df['total_minor'] +\n", + " model_summary_df['total_uninjured']\n", + ")\n", + "\n", + "# Filter for valid data\n", + "model_summary_df = model_summary_df[\n", + " (model_summary_df['total_people'] > 0) &\n", + " (model_summary_df['total_accidents'] >= 10)\n", + "]\n", + "\n", + "# Add a combined Make_Model label for easier charting\n", + "# model_summary_df['make_model'] = model_summary_df['Make'] + ' ' + model_summary_df['Model']\n" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "total_fatalities 0\n", + "total_serious 0\n", + "total_minor 0\n", + "total_destroyed 0\n", + "total_accidents 0\n", + "dtype: int64\n", + "Empty DataFrame\n", + "Columns: [make_model, total_accidents, total_fatalities, total_serious, total_minor, total_uninjured, total_destroyed, total_people]\n", + "Index: []\n" + ] + } + ], + "source": [ + "# Check for missing values in critical columns\n", + "print(model_summary_df[['total_fatalities', 'total_serious', 'total_minor', 'total_destroyed', 'total_accidents']].isnull().sum())\n", + "\n", + "# Look at models with very few accidents or zero values in critical columns\n", + "print(model_summary_df[model_summary_df['total_accidents'] < 10])\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Compute Risk Indexes\n", + "\n", + "Based on the available dataset, we derive indexes that help us estimate and assign a safety evaluation of each aircraft model\n", + "\n", + "- **Fatality Index** = Fatalities / Total People 
Onboard\n", + "- **Injury Index** = (Serious + Minor Injuries) / Total People Onboard\n", + "- **Damage Severity Index** = Destroyed Aircraft / Total Accidents\n" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Index(['make_model', 'total_accidents', 'total_fatalities', 'total_serious',\n", + " 'total_minor', 'total_uninjured', 'total_destroyed', 'total_people',\n", + " 'fatality_index', 'injury_index', 'damage_severity_index'],\n", + " dtype='object')\n" + ] + } + ], + "source": [ + "# Define fatality index\n", + "model_summary_df['fatality_index'] = model_summary_df['total_fatalities'] / model_summary_df['total_people']\n", + "\n", + "#Define injury index\n", + "model_summary_df['injury_index'] = (\n", + " model_summary_df['total_serious'] + model_summary_df['total_minor']\n", + ") / model_summary_df['total_people']\n", + "\n", + "#Define damage severity index\n", + "model_summary_df['damage_severity_index'] = model_summary_df['total_destroyed'] / model_summary_df['total_accidents']\n", + "print(model_summary_df.columns)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Calculate Weighted Risk Score\n", + "## Define weights for each index; update these anytime to change importance or client priority/preference\n", + "\n", + "- **Fatality Index** = 0.5\n", + "- **Injury Index** = 0.2\n", + "- **Damage Severity Index** = 0.3\n", + "\n" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " make_model total_accidents total_fatalities total_serious \\\n", + "7488 vans rv4 15 9.0 2.0 \n", + "7489 vans rv6 14 6.0 6.0 \n", + "7491 vans rv7 11 4.0 6.0 \n", + "7495 vans rv8 14 5.0 1.0 \n", + "7750 yakovlev yak 52 11 10.0 3.0 \n", + "\n", + " total_minor total_uninjured total_destroyed total_people \\\n", + "7488 5.0 5.0 3 21.0 \n", + "7489 8.0 2.0 4 22.0 \n", + "7491 1.0 3.0 3 14.0 \n", + "7495 2.0 10.0 3 18.0 \n", + "7750 2.0 5.0 1 20.0 \n", + "\n", + " fatality_index injury_index damage_severity_index risk_score \n", + "7488 0.428571 0.333333 0.200000 0.340952 \n", + "7489 0.272727 0.636364 0.285714 0.349351 \n", + "7491 0.285714 0.500000 0.272727 0.324675 \n", + "7495 0.277778 0.166667 0.214286 0.236508 \n", + "7750 0.500000 0.250000 0.090909 0.327273 " + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Define damage weights\n", + "WEIGHTS = {\n", + " 'fatality_index': 0.5,\n", + " 'damage_severity_index': 0.3,\n", + " 'injury_index': 0.2\n", + "}\n", + "# Compute Risk score using weighted fatality, damage_severity and Injury indices\n", + "model_summary_df['risk_score'] = (\n", + " model_summary_df['fatality_index'] * WEIGHTS['fatality_index'] +\n", + " model_summary_df['damage_severity_index'] * WEIGHTS['damage_severity_index'] +\n", + " model_summary_df['injury_index'] * WEIGHTS['injury_index']\n", + ")\n", + "model_summary_df.tail()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(431, 12)" + ] + }, + "execution_count": 12, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "model_summary_df.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(431, 12)" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + 
"model_summary_df_cleaned = model_summary_df.dropna(subset=['risk_score'])\n", + "model_summary_df_cleaned.shape" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Visualize Risk Index Distributions" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "plt.figure(figsize=(15, 8))\n", + "sns.heatmap(\n", + " model_summary_df[['fatality_index', 'damage_severity_index', 'injury_index', 'risk_score']].corr(),\n", + " annot=True,\n", + " cmap='coolwarm',\n", + " fmt='.2f'\n", + ")\n", + "plt.title(\"Correlation Matrix of Risk Indexes\")\n", + "plt.tight_layout()\n", + "plt.show()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "# Scatter plot: Damage Severity Index vs Risk Score\n", + "plt.figure(figsize=(8, 6))\n", + "sns.scatterplot(data=model_summary_df, x='damage_severity_index', y='risk_score', color='orange')\n", + "plt.title('Damage Severity Index vs Risk Score')\n", + "plt.xlabel('Damage Severity Index')\n", + "plt.ylabel('Risk Score')\n", + "plt.tight_layout()\n", + "plt.show()\n", + "\n", + "# Scatter plot: Fatality Index vs Risk Score\n", + "plt.figure(figsize=(8, 6))\n", + "sns.scatterplot(data=model_summary_df, x='fatality_index', y='risk_score', color='red')\n", + "plt.title('Fatality Index vs Risk Score')\n", + "plt.xlabel('Fatality Index')\n", + "plt.ylabel('Risk Score')\n", + "plt.tight_layout()\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "plt.figure(figsize=(10, 8)) # Increased figure size\n", + "\n", + "# Create the scatter plot\n", + "scatter = sns.scatterplot(\n", + " data=model_summary_df,\n", + " x='fatality_index',\n", + " y='risk_score',\n", + " size='total_people',\n", + " hue='risk_score',\n", + " palette='coolwarm',\n", + " sizes=(30, 200),\n", + " alpha=0.7\n", + ")\n", + "\n", + "# Add reference lines\n", + "plt.axhline(0.3, linestyle='--', color='gray', alpha=0.5)\n", + "plt.axvline(0.2, linestyle='--', color='gray', alpha=0.5)\n", + "\n", + "# Customize titles and labels\n", + "plt.title(\"Aircraft Risk Profile\\n(Bubble Size Represents Total People Involved)\", pad=20, fontsize=14)\n", + "plt.xlabel(\"Fatality Index (Fatalities/Total People)\", fontsize=12)\n", + "plt.ylabel(\"Composite Risk Score\", fontsize=12)\n", + "\n", + "\n", + "\n", + "# Method 2: If you really want bottom-left inside the plot\n", + "plt.legend(\n", + " bbox_to_anchor=(0.80, 0.0), # Inside bottom-left\n", + " loc='lower left',\n", + " borderaxespad=0.5,\n", + " frameon=True,\n", + " title='Risk Score'\n", + " )\n", + "# Add tight_layout\n", + "plt.tight_layout() \n", + "\n", + "plt.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
" + ], + "text/plain": [ + " make_model total_accidents total_people \\\n", + "1428 boeing 757 16 1810.0 \n", + "4759 maule m-5-210c 12 26.0 \n", + "3010 embraer emb-145lr 10 426.0 \n", + "1387 boeing 737 7h4 14 1655.0 \n", + "1460 boeing 777 17 2422.0 \n", + "3202 evektor-aerotechnik as sportstar 20 27.0 \n", + "1938 cessna 180j 25 42.0 \n", + "1930 cessna 180a 14 30.0 \n", + "4047 hughes 269a 20 33.0 \n", + "283 air tractor at 602 12 13.0 \n", + "\n", + " fatality_index damage_severity_index injury_index risk_score \n", + "1428 0.0000 0.0000 0.0050 0.0010 \n", + "4759 0.0000 0.0000 0.0385 0.0077 \n", + "3010 0.0000 0.0000 0.0399 0.0080 \n", + "1387 0.0006 0.0000 0.0894 0.0182 \n", + "1460 0.0000 0.0588 0.0087 0.0194 \n", + "3202 0.0000 0.0000 0.1111 0.0222 \n", + "1938 0.0000 0.0000 0.1190 0.0238 \n", + "1930 0.0333 0.0000 0.0667 0.0300 \n", + "4047 0.0000 0.0000 0.1515 0.0303 \n", + "283 0.0000 0.0000 0.1538 0.0308 " + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Filter out models with zero risk score\n", + "filtered_df = model_summary_df[model_summary_df['risk_score'] > 0]\n", + "\n", + "# Sort top 10 lowest non-zero risk models\n", + "top_10 = filtered_df.sort_values('risk_score').head(10)\n", + "\n", + "plt.figure(figsize=(10, 6))\n", + "sns.barplot(\n", + " data=top_10,\n", + " x='risk_score',\n", + " y='make_model',\n", + " hue='make_model',\n", + " dodge=False\n", + ")\n", + "plt.title(\"Top 10 Lowest-Risk Aircraft Models\")\n", + "plt.xlabel(\"Risk Score\")\n", + "plt.ylabel(\"make_model\")\n", + "plt.tight_layout()\n", + "plt.show()\n", + "\n", + "# Display summary table with key stats\n", + "top_10[[\n", + " 'make_model', 'total_accidents', 'total_people',\n", + " 'fatality_index', 'damage_severity_index', 'injury_index', 'risk_score'\n", + "]].round(4)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": {}, + "outputs": [], + "source": [ + "# Remove models with 
zero risk score as indication of limited data for analysis\n", + "model_summary_df = model_summary_df[model_summary_df['risk_score'] > 0]\n", + "\n", + "# filter models with zero fatality, damage and injury index for respective charts\n", + "fatality_filtered = model_summary_df[model_summary_df['fatality_index'] > 0]\n", + "damage_filtered = model_summary_df[model_summary_df['damage_severity_index'] > 0]\n", + "injury_filtered = model_summary_df[model_summary_df['injury_index'] > 0]\n" + ] + }, + { + "cell_type": "code", + "execution_count": 19, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "# Sort and select top 10\n", + "lowest_fatality = fatality_filtered.sort_values('fatality_index').head(10)\n", + "\n", + "# Plot\n", + "plt.figure(figsize=(10,6))\n", + "sns.barplot(data=lowest_fatality, x='fatality_index', y='make_model', palette='Greens_r')\n", + "plt.title(\"Top 10 Models with Lowest Non-Zero Fatality Rates\")\n", + "plt.xlabel(\"Fatality Index\")\n", + "plt.ylabel(\"Aircraft Make-Model\")\n", + "plt.tight_layout()\n", + "plt.show()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "# Top 10 models with lowest non-zero Damage Severity Index\n", + "damage_filtered = model_summary_df[model_summary_df['damage_severity_index'] > 0]\n", + "lowest_damage = damage_filtered.sort_values('damage_severity_index').head(10)\n", + "\n", + "plt.figure(figsize=(10,6))\n", + "sns.barplot(data=lowest_damage, x='damage_severity_index', y='make_model', palette='Purples_r')\n", + "plt.title(\"Top 10 Models with Lowest Non-Zero Damage Severity Index\")\n", + "plt.xlabel(\"Damage Severity Index\")\n", + "plt.ylabel(\"Aircraft Make-Model\")\n", + "plt.tight_layout()\n", + "plt.show()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 21, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "# Top 10 models with lowest non-zero Damage Severity Index\n", + "damage_filtered = model_summary_df[model_summary_df['damage_severity_index'] > 0]\n", + "lowest_damage = damage_filtered.sort_values('damage_severity_index').head(10)\n", + "\n", + "plt.figure(figsize=(10,6))\n", + "sns.barplot(data=lowest_damage, x='damage_severity_index', y='make_model', palette='Oranges_r')\n", + "plt.title(\"Top 10 Models with Lowest Non-Zero Damage Severity Index\")\n", + "plt.xlabel(\"Damage Severity Index\")\n", + "plt.ylabel(\"Aircraft Make-Model\")\n", + "plt.tight_layout()\n", + "plt.show()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": {}, + "outputs": [ + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": { + "needs_background": "light" + }, + "output_type": "display_data" + } + ], + "source": [ + "# Top 10 models with lowest non-zero Injury Index\n", + "lowest_injury = injury_filtered.sort_values('injury_index').head(10)\n", + "\n", + "plt.figure(figsize=(10,6))\n", + "sns.barplot(data=lowest_injury, x='injury_index', y='make_model', palette='Blues_r')\n", + "plt.title(\"Top 10 Models with Lowest Non-Zero Severe Injury Index\")\n", + "plt.xlabel(\"Severe Injury Index\")\n", + "plt.ylabel(\"Aircraft Make-Model\")\n", + "plt.tight_layout()\n", + "plt.show()\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Data Analysis\n", + "## Recommend the aircraft with the lowest fatality, injury, damage and overall risk, i.e. the models that intersect across all the metrics." + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Models appearing in top 30 for all 3 metrics:\n", + "{'cessna 180a', 'cessna 195a', 'boeing 747'}\n", + "\n", + "✅ Top 10 Models by Combined Safety Rank:\n", + " make_model fatality_index injury_index \\\n", + "1428 boeing 757 0.000000 0.004972 \n", + "4759 maule m-5-210c 0.000000 0.038462 \n", + "3010 embraer emb-145lr 0.000000 0.039906 \n", + "1930 cessna 180a 0.033333 0.066667 \n", + "3202 evektor-aerotechnik as sportstar 0.000000 0.111111 \n", + "1387 boeing 737 7h4 0.000604 0.089426 \n", + "1938 cessna 180j 0.000000 0.119048 \n", + "1981 cessna 195a 0.038462 0.076923 \n", + "5632 piper pa 28-161 0.032258 0.096774 \n", + "1460 boeing 777 0.000000 0.008671 \n", + "\n", + " damage_severity_index risk_score combined_rank \n", + "1428 0.000000 0.000994 10.0 \n", + "4759 0.000000 0.007692 19.0 \n", + "3010 0.000000 0.007981 23.0 \n", + "1930 0.000000 0.030000 78.0 \n", + "3202 0.000000 0.022222 80.0 \n", + "1387 0.000000 0.018187 90.0 \n", + "1938 0.000000 0.023810 91.0 \n", + "1981 0.000000 
0.034615 95.0 \n", + "5632 0.000000 0.035484 115.0 \n", + "1460 0.058824 0.019381 117.0 \n" + ] + }, + { + "data": { + "image/png": "\n", + "text/plain": [ + "
" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "# Widen the candidate pool to the top 30 models per metric\n", + "top_n = 30\n", + "top_fatality = fatality_filtered.sort_values('fatality_index').head(top_n)['make_model']\n", + "top_injury = injury_filtered.sort_values('injury_index').head(top_n)['make_model']\n", + "top_risk = model_summary_df.sort_values('risk_score').head(top_n)['make_model']\n", + "top_damage = damage_filtered.sort_values('damage_severity_index').head(top_n)['make_model']\n", + "\n", + "# Intersection of the fatality, injury and risk-score lists\n", + "# (top_damage is only used in the Venn diagram further down)\n", + "common_models = set(top_fatality) & set(top_injury) & set(top_risk)\n", + "\n", + "if common_models:\n", + "    print(f\"✅ Models appearing in top {top_n} for all 3 metrics:\")\n", + "    print(common_models)\n", + "else:\n", + "    print(f\"❌ No exact overlap in top {top_n}. Computing combined ranking...\")\n", + "\n", + "# Rank every model on each metric (rank 1 = lowest value; ties share the minimum rank).\n", + "# This runs regardless of the outcome of the overlap check above.\n", + "model_summary_df['rank_fatality'] = model_summary_df['fatality_index'].rank(method='min')\n", + "model_summary_df['rank_injury'] = model_summary_df['injury_index'].rank(method='min')\n", + "model_summary_df['rank_risk'] = model_summary_df['risk_score'].rank(method='min')\n", + "model_summary_df['rank_damage'] = model_summary_df['damage_severity_index'].rank(method='min')\n", + "\n", + "# Sum the four per-metric ranks; a lower total means a safer overall record\n", + "model_summary_df['combined_rank'] = (\n", + "    model_summary_df['rank_fatality'] +\n", + "    model_summary_df['rank_injury'] +\n", + "    model_summary_df['rank_risk'] +\n", + "    model_summary_df['rank_damage']\n", + ")\n", + "\n", + "# Sort by combined rank\n", + "combined_top = model_summary_df.sort_values('combined_rank').head(10)\n", + "print(\"\\n✅ Top 10 Models by Combined Safety Rank:\")\n", + "print(combined_top[['make_model', 'fatality_index', 'injury_index', 'damage_severity_index', 'risk_score', 'combined_rank']])\n", + "\n", + "# Optional: Venn diagram of the fatality, injury and damage top lists\n", + "# (requires the matplotlib-venn package)\n", + "from matplotlib_venn import venn3\n", + "\n", + "plt.figure(figsize=(8,6))\n", + 
"venn3([set(top_fatality), set(top_injury), set(top_damage)],\n", + "      set_labels=('Top Fatality', 'Top Severe Injury', 'Top Damage'))\n", + "plt.title(f\"Overlap of Top {top_n} Models Across Fatality, Injury and Damage Metrics\")\n", + "plt.show()\n" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "✅ Export completed successfully!\n" + ] + }, + { + "data": { + "text/html": [ + "
<div>\n",
+ "<table border=\"1\" class=\"dataframe\">\n",
+ "  <thead>\n",
+ "    <tr><th></th><th>make_model</th><th>total_accidents</th><th>total_people</th><th>fatality_index</th><th>injury_index</th><th>damage_severity_index</th><th>risk_score</th><th>combined_rank</th></tr>\n",
+ "  </thead>\n",
+ "  <tbody>\n",
+ "    <tr><th>32</th><td>aero commander 100</td><td>13</td><td>21.0</td><td>0.095238</td><td>0.285714</td><td>0.076923</td><td>0.127839</td><td>690.0</td></tr>\n",
+ "    <tr><th>71</th><td>aero commander s2r</td><td>18</td><td>18.0</td><td>0.222222</td><td>0.222222</td><td>0.388889</td><td>0.272222</td><td>1254.0</td></tr>\n",
+ "    <tr><th>98</th><td>aeronca 11ac</td><td>29</td><td>50.0</td><td>0.140000</td><td>0.340000</td><td>0.068966</td><td>0.158690</td><td>853.0</td></tr>\n",
+ "    <tr><th>101</th><td>aeronca 15ac</td><td>10</td><td>12.0</td><td>0.083333</td><td>0.083333</td><td>0.000000</td><td>0.058333</td><td>185.0</td></tr>\n",
+ "    <tr><th>110</th><td>aeronca 7ac</td><td>85</td><td>129.0</td><td>0.162791</td><td>0.286822</td><td>0.082353</td><td>0.163466</td><td>865.0</td></tr>\n",
+ "    <tr><th>113</th><td>aeronca 7bcm</td><td>14</td><td>17.0</td><td>0.058824</td><td>0.470588</td><td>0.071429</td><td>0.144958</td><td>773.0</td></tr>\n",
+ "    <tr><th>115</th><td>aeronca 7ccm</td><td>10</td><td>16.0</td><td>0.000000</td><td>0.312500</td><td>0.000000</td><td>0.062500</td><td>374.0</td></tr>\n",
+ "    <tr><th>163</th><td>aerospatiale as350</td><td>15</td><td>32.0</td><td>0.281250</td><td>0.093750</td><td>0.066667</td><td>0.179375</td><td>715.0</td></tr>\n",
+ "    <tr><th>233</th><td>agusta a109</td><td>11</td><td>30.0</td><td>0.633333</td><td>0.166667</td><td>0.545455</td><td>0.513636</td><td>1420.0</td></tr>\n",
+ "    <tr><th>283</th><td>air tractor at 602</td><td>12</td><td>13.0</td><td>0.000000</td><td>0.153846</td><td>0.000000</td><td>0.030769</td><td>128.0</td></tr>\n",
+ "  </tbody>\n",
+ "</table>\n",
+ "</div>
" + ], + "text/plain": [ + " make_model total_accidents total_people fatality_index \\\n", + "32 aero commander 100 13 21.0 0.095238 \n", + "71 aero commander s2r 18 18.0 0.222222 \n", + "98 aeronca 11ac 29 50.0 0.140000 \n", + "101 aeronca 15ac 10 12.0 0.083333 \n", + "110 aeronca 7ac 85 129.0 0.162791 \n", + "113 aeronca 7bcm 14 17.0 0.058824 \n", + "115 aeronca 7ccm 10 16.0 0.000000 \n", + "163 aerospatiale as350 15 32.0 0.281250 \n", + "233 agusta a109 11 30.0 0.633333 \n", + "283 air tractor at 602 12 13.0 0.000000 \n", + "\n", + " injury_index damage_severity_index risk_score combined_rank \n", + "32 0.285714 0.076923 0.127839 690.0 \n", + "71 0.222222 0.388889 0.272222 1254.0 \n", + "98 0.340000 0.068966 0.158690 853.0 \n", + "101 0.083333 0.000000 0.058333 185.0 \n", + "110 0.286822 0.082353 0.163466 865.0 \n", + "113 0.470588 0.071429 0.144958 773.0 \n", + "115 0.312500 0.000000 0.062500 374.0 \n", + "163 0.093750 0.066667 0.179375 715.0 \n", + "233 0.166667 0.545455 0.513636 1420.0 \n", + "283 0.153846 0.000000 0.030769 128.0 " + ] + }, + "execution_count": 24, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Ensure make_model column exists and is normalized\n", + "if 'make_model' not in model_summary_df.columns:\n", + " model_summary_df['make_model'] = (model_summary_df['Make'] + ' ' + model_summary_df['Model']).str.strip().str.lower()\n", + "else:\n", + " model_summary_df['make_model'] = model_summary_df['make_model'].str.strip().str.lower()\n", + "\n", + "# Define columns for export\n", + "export_cols = [\n", + " 'make_model',\n", + " 'total_accidents', 'total_people',\n", + " 'fatality_index', 'injury_index', 'damage_severity_index', 'risk_score', 'combined_rank'\n", + "]\n", + "\n", + "# Check if all columns exist\n", + "missing_cols = [col for col in export_cols if col not in model_summary_df.columns]\n", + "if missing_cols:\n", + " print(f\"⚠️ Missing columns: {missing_cols}\")\n", + "else:\n", + " # Export CSV and 
Excel\n", + "    model_summary_df[export_cols].to_csv('Aviation_Safety_Tableau.csv', index=False)\n", + "    model_summary_df[export_cols].to_excel('Aviation_Safety_Tableau.xlsx', index=False)\n", + "    print(\"✅ Export completed successfully!\")\n", + "\n", + "# Show a preview of exported data\n", + "model_summary_df[export_cols].head(10)\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# **Aircraft Safety Analysis – Recommended Models**\n", + "\n", + "Based on the computed safety indices (**Fatality Index**, **Injury Index**, **Damage Severity Index**) and the overall **Risk Score**, the analysis yields the following insights:\n", + "\n", + "### **Insights**\n", + "1. **Models with the lowest risk scores** tend to have fewer accidents and lower fatality ratios.\n", + "2. **Purpose-of-flight patterns** show that some of these safer models are commonly used for **personal purposes**.\n", + "3. **Engine configurations** (type and number) may indicate suitability for specific operations.\n", + "\n", + "---\n", + "\n", + "## ✅ Recommendations for Client:\n", + "- **Personal Use:** For private operations, prioritize single-engine piston types with historically low injury rates.\n", + "- **Top 10 models:** Choose from the top 10 models illustrated in the bar graph; the Boeing 757 is the safest evaluated model to invest in." ] } ], "metadata": { "kernelspec": { - "display_name": "Python 3", + "display_name": "Python (learn-env)", "language": "python", - "name": "python3" + "name": "learn-env" }, "language_info": { "codemirror_mode": { @@ -40,7 +1740,7 @@ "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", - "version": "3.6.4" + "version": "3.8.5" } }, "nbformat": 4,