Project Documentation – Water Quality Prediction
- Problem Statement Access to safe drinking water is essential for human health. Water sources may contain harmful contaminants such as heavy metals (lead, arsenic, mercury), chemicals (nitrates, chloramine), or biological hazards (bacteria, viruses). These pollutants can cause severe health issues if consumed above safe limits. The challenge is to build a machine learning model that can automatically predict whether water is safe (1) or not safe (0) based on its chemical and biological composition.
- Project Goal • Develop an end-to-end ML pipeline to classify water samples as Safe or Not Safe. • Perform data ingestion, cleaning, transformation, feature engineering, EDA, model training, and evaluation. • Compare different ML algorithms (KNN, Random Forest, etc.) and select the best-performing model. • Provide a deployable solution (e.g., Streamlit/FastAPI app) to predict water safety.
- Data Collection The dataset contains measurements of various water contaminants. Each feature has a known safety threshold. Features (input variables): • Aluminium – dangerous if > 2.8 • Ammonia – dangerous if > 32.5 • Arsenic – dangerous if > 0.01 • Barium – dangerous if > 2 • Cadmium – dangerous if > 0.005 • Chloramine – dangerous if > 4 • Chromium – dangerous if > 0.1 • Copper – dangerous if > 1.3 • Fluoride – dangerous if > 1.5 • Bacteria – dangerous if > 0 • Viruses – dangerous if > 0 • Lead – dangerous if > 0.015 • Nitrates – dangerous if > 10 • Nitrites – dangerous if > 1 • Mercury – dangerous if > 0.002 • Perchlorate – dangerous if > 56 • Radium – dangerous if > 5 • Selenium – dangerous if > 0.5 • Silver – dangerous if > 0.1 • Uranium – dangerous if > 0.3 Target variable: • is_safe (1 = safe, 0 = not safe)
- Data Cleaning • Handle missing values: o Numerical features → filled with mean. o Categorical features (if any) → filled with mode. • Remove duplicates. • Ensure proper data types (float, int).
- Data Transformation • Standardization/Normalization: Scale features to have similar ranges. • Encode categorical features (if applicable). • Outlier detection using domain thresholds.
- Feature Engineering • Create binary features for “above threshold” vs “within safe limit”. Example: • data["aluminium_high"] = (data["aluminium"] > 2.8).astype(int) • Interaction features: combine correlated chemicals (e.g., Nitrates + Nitrites). • Dimensionality reduction (if needed) using PCA.
- Exploratory Data Analysis (EDA) • Distribution plots of each chemical. • Correlation heatmap. • Compare safe vs not safe water samples. • Boxplots to detect outliers.
- Model Training • Split dataset: 70% train, 30% test. • Models to compare: o KNN (K-Nearest Neighbors) o Random Forest Classifier o Logistic Regression o SVM (Support Vector Machine) • Perform GridSearchCV for hyperparameter tuning.
- Model Evaluation Metrics: • Accuracy • Precision, Recall, F1-score • Confusion Matrix • ROC-AUC curve Expected Outcome: • Best model selected automatically. • Accuracy > 90% (depending on dataset quality).
- Deployment (Future Work) • Save the best model using joblib/pickle. • Deploy using Streamlit app or FastAPI. • CI/CD pipeline with GitHub Actions + Docker + Cloud (AWS/GCP/Heroku).