Data Science for Road Safety: Data Preparation & Analysis

📄 Project Overview

This project applies the initial, critical phases of the CRISP-DM methodology to analyze fatal traffic accidents in the United States. The primary goal is to prepare the data and engineer features from the Fatality Analysis Reporting System (FARS) dataset published by the NHTSA, building a robust foundation for future predictive modeling.

This repository covers the first three phases of the data science lifecycle:

Business Understanding: Defining the problem of road safety and identifying key analytical goals, such as understanding temporal trends and geographical "black spots".
Data Understanding: Exploring multiple raw data tables to assess their structure, quality, and relevance.
Data Preparation: A deep dive into cleaning, transforming, and engineering features to create a single, analysis-ready dataset.

✨ Analysis Showcase & Methodology

The core of this project lies in its methodical and transparent data preparation pipeline, which transforms complex, multi-table raw data into a clean, feature-rich dataset.

Multi-Source Data Integration:
- Merged more than eight tables from the FARS 2020 dataset (e.g., accident, person, vehicle, weather) into a unified analytical table.
- Integrated and harmonized datasets from three consecutive years (2018-2020) to enable time-series analysis.
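The table merge described above can be sketched as a chain of key-based joins. This is a minimal illustration rather than the project's actual code: the toy values are made up, and the join keys (ST_CASE for the crash, VEH_NO for the vehicle) are assumed from the usual FARS layout.

```r
library(dplyr)

# Toy stand-ins for three FARS tables (all values are invented for illustration).
accident <- data.frame(ST_CASE = c(10001, 10002), FATALS = c(1, 2))
vehicle  <- data.frame(ST_CASE = c(10001, 10002, 10002),
                       VEH_NO  = c(1, 1, 2),
                       MOD_YEAR = c(2012, 2005, 2018))
person   <- data.frame(ST_CASE = c(10001, 10002, 10002, 10002),
                       VEH_NO  = c(1, 1, 2, 2),
                       PER_NO  = c(1, 1, 1, 2),
                       DRINKING = c(0, 1, 0, 9))

# Crash -> vehicles -> persons: each join adds one level of detail.
# Assumed keys: ST_CASE identifies the crash, VEH_NO the vehicle within it.
merged <- accident %>%
  left_join(vehicle, by = "ST_CASE") %>%
  left_join(person, by = c("ST_CASE", "VEH_NO"))

print(merged)
```

The result has one row per person, with crash-level and vehicle-level columns repeated on each row, so crash-level totals such as FATALS should not be summed on the merged table without de-duplicating first.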
Advanced Feature Engineering & Transformation:
- Binary Feature Creation: Converted multiple categorical variables into meaningful binary flags (0/1) to simplify complexity for machine learning models. Examples include:
  - DRINKING & DRUGS: Collapsed from multi-level codes into binary indicators of substance use. "Unknown" and "unreported" cases are treated as no recorded use, applying a presumption of innocence.
  - VEHICLECC & DRIMPAIR: Created binary features to represent the presence of mechanical failure or driver impairment.
  - NIGHT_HOUR: Engineered a binary variable from timestamps to flag accidents occurring at night (6 PM - 6 AM).
- Derived Features: Created new variables from existing ones, such as calculating vehicle AGE from the model year and creating age groups (AGE_GRUP) from driver ages.
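A minimal sketch of these transformations follows. It is not the project's code: the code values (1 = yes, 8/9 = not reported/unknown, 99 = unknown hour), the DRIVER_AGE column name, the DRINKING_BIN output name, the reference year 2020, and the age-group cut points are all assumptions chosen for illustration.

```r
library(dplyr)

# Toy input; values and the DRIVER_AGE column name are hypothetical.
crashes <- data.frame(
  HOUR       = c(2, 14, 18, 23, 99),   # 99 assumed to mean "unknown hour"
  DRINKING   = c(1, 0, 9, 8, 1),       # 1 = yes; 8/9 assumed not reported/unknown
  MOD_YEAR   = c(2010, 2019, 1998, 2020, 2015),
  DRIVER_AGE = c(17, 35, 52, 70, 28)
)

features <- crashes %>%
  mutate(
    # Only an explicit "yes" counts as substance use; unknown codes become 0.
    DRINKING_BIN = if_else(DRINKING == 1, 1L, 0L),
    # Night is 18:00-05:59; unknown hours stay missing instead of guessing.
    NIGHT_HOUR = case_when(
      HOUR > 23             ~ NA_integer_,
      HOUR >= 18 | HOUR < 6 ~ 1L,
      TRUE                  ~ 0L
    ),
    # Vehicle age relative to the crash year (2020 assumed here).
    AGE = 2020 - MOD_YEAR,
    # Illustrative cut points; the project's actual groups may differ.
    AGE_GRUP = cut(DRIVER_AGE,
                   breaks = c(0, 24, 44, 64, Inf),
                   labels = c("<25", "25-44", "45-64", "65+"))
  )

print(features)
```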
Rigorous Data Cleaning and Imputation Strategy:
- Identified and handled thousands of missing (NA) values and out-of-range placeholder codes (e.g., 99, 9999) that result from data merges and reporting gaps.
- Implemented a logical imputation for MOD_YEAR, filling missing values for a given vehicle with the model year of the primary vehicle in the same accident.
- Filtered out irrelevant records, such as those not associated with a motor vehicle (VEH_NO = 0) or those with specific non-causal driver roles (DRIVERRF), to focus the dataset on primary causal factors.
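The cleaning steps above could look roughly like the following sketch. It assumes that 9999 marks an unknown model year and that the "primary vehicle" is the one with VEH_NO == 1; both are assumptions for illustration, and primary_year is a hypothetical helper column.

```r
library(dplyr)

# Toy vehicle table; values are invented for illustration.
vehicles <- data.frame(
  ST_CASE  = c(1, 1, 2, 2, 3),
  VEH_NO   = c(1, 2, 1, 2, 0),
  MOD_YEAR = c(2012, 9999, 2005, 2018, 9999)
)

cleaned <- vehicles %>%
  # Drop records that are not tied to a motor vehicle.
  filter(VEH_NO != 0) %>%
  # Turn the placeholder code into a real missing value.
  mutate(MOD_YEAR = na_if(MOD_YEAR, 9999)) %>%
  group_by(ST_CASE) %>%
  mutate(
    # Model year of the assumed primary vehicle (VEH_NO == 1) in the same crash;
    # NA if the crash has no such vehicle.
    primary_year = MOD_YEAR[VEH_NO == 1][1],
    # Fill gaps with the primary vehicle's model year, keep known values as-is.
    MOD_YEAR = coalesce(MOD_YEAR, primary_year)
  ) %>%
  ungroup() %>%
  select(-primary_year)

print(cleaned)
```

If the primary vehicle's own model year is unknown, the value stays NA, so a second strategy (or row removal) would still be needed for those cases.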
Dimensionality Reduction with PCA:
- Conducted a Principal Component Analysis (PCA) on the final, cleaned numerical variables to identify the main drivers of variance in the data.
- The first three principal components explain approximately 37% of the total variance, with strong loadings from variables such as DRINKING and DRUGS. This supports the relevance of the engineered features and prepares the data for more efficient modeling.
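A PCA of this kind can be sketched with prcomp from the stats package. The data below is randomly generated purely to make the example runnable, so its variance shares will not match the 37% reported above; the choice to center and scale the variables is also an assumption.

```r
# Synthetic numeric variables (random, for illustration only).
set.seed(42)
n <- 200
num_vars <- data.frame(
  DRINKING   = rbinom(n, 1, 0.3),
  DRUGS      = rbinom(n, 1, 0.2),
  NIGHT_HOUR = rbinom(n, 1, 0.5),
  AGE        = rpois(n, 10)
)

# Center and scale so that variables on different ranges are comparable.
pca <- prcomp(num_vars, center = TRUE, scale. = TRUE)

# Share of total variance per component, and the running total.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)
print(round(cum_var, 3))

# Loadings: how strongly each variable contributes to the first three components.
print(round(pca$rotation[, 1:3], 3))
```

Reading cum_var[3] gives the share of variance captured by the first three components, and the rows of pca$rotation with the largest absolute values identify the variables that drive each component.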

💻 Technologies Used

Language: R
Core Libraries:
- Tidyverse (dplyr, readr): For data manipulation, cleaning, and transformation.
- ggplot2: For data visualization (used in the full analysis).
- stats: For conducting the Principal Component Analysis (prcomp).
Environment: RStudio with R Markdown

🏆 Project Outcome: Analysis-Ready Datasets

This project delivers three clean and processed datasets, ready for the next phases of a data science project (modeling and evaluation):

analysis18_20: A time-series table containing accident cases, year, and fatality counts from 2018-2020.
accFacts20: The main feature-rich dataset containing over 25 engineered variables about accident factors (driver, vehicle, environment).
accBpoint20: A geographical dataset containing location information (State, County, Lat/Lon) to identify and map accident "black spots".
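As an illustration of how a table shaped like analysis18_20 could be assembled, the sketch below stacks one accident table per year and summarizes cases and fatalities. The toy data frames (acc18, acc19, acc20), the output name yearly, and the column names CASES and FATALITIES are hypothetical; only ST_CASE and FATALS follow FARS naming.

```r
library(dplyr)

# Toy per-year accident tables (values invented for illustration).
acc18 <- data.frame(ST_CASE = c(1, 2),    FATALS = c(1, 2))
acc19 <- data.frame(ST_CASE = c(1, 2, 3), FATALS = c(1, 1, 3))
acc20 <- data.frame(ST_CASE = 1,          FATALS = 2)

# Stack the years, tagging each row with the year it came from.
yearly <- bind_rows(list(`2018` = acc18, `2019` = acc19, `2020` = acc20),
                    .id = "YEAR") %>%
  mutate(YEAR = as.integer(YEAR)) %>%
  group_by(YEAR) %>%
  summarise(CASES = n(), FATALITIES = sum(FATALS), .groups = "drop")

print(yearly)
```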

🚀 Getting Started

To explore the data preparation process:

Clone the repository.
Ensure you have R and RStudio installed.
Install the required packages listed in the R Markdown file (e.g., tidyverse). The stats package ships with R and does not need to be installed.
Run the ds_against_traffic_accidents.Rmd notebook to follow the entire step-by-step logic, from raw data loading to the final PCA.
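From an R console opened in the repository root, the last two steps might look like the sketch below. It assumes the rmarkdown package is used for rendering (the README only names R Markdown as the environment) and that the notebook is ds_against_traffic_accidents.Rmd. The snippet only reports missing packages and does not install anything itself.

```r
# Notebook file name as it appears in the repository (assumed to be in the working directory).
rmd_file <- "ds_against_traffic_accidents.Rmd"

# Packages assumed to be needed; "stats" is left out because it ships with R.
required <- c("tidyverse", "rmarkdown")
is_installed <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
missing_pkgs <- required[!is_installed]

if (length(missing_pkgs) > 0) {
  # Report what is missing; install with install.packages(missing_pkgs).
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
} else if (file.exists(rmd_file)) {
  # Knit the notebook to its default output format.
  rmarkdown::render(rmd_file)
} else {
  message("Notebook not found in the working directory: ", rmd_file)
}
```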

👤 Author

Antonio Barrera Mora

LinkedIn: https://www.linkedin.com/in/anbamo/
GitHub: @Kamaranis
