Ridge Regression is a regularized version of Linear Regression that helps prevent overfitting by adding a penalty to large coefficients. It is particularly useful when dealing with multicollinearity (high correlation between independent variables).
Ridge Regression minimizes the following cost function:

$$J(\mathbf{w}) = \sum_{i=1}^{n} (y_i - \mathbf{w}^T x_i)^2 + \lambda \sum_{j=1}^{p} w_j^2$$

where:

- $J(\mathbf{w})$ is the loss function (sum of squared errors with a penalty term).
- $y_i$ is the actual output.
- $\mathbf{w}^T x_i$ is the predicted output.
- $\lambda$ (alpha) is the regularization parameter, controlling the penalty on large coefficients.
- $\sum w_j^2$ is the L2 regularization term, which discourages large weight values.

- The L2 penalty $\lambda \sum w_j^2$ shrinks the magnitude of the coefficients, preventing them from becoming too large.
- It helps when features are highly correlated, reducing overfitting.

- If $\lambda = 0$ → Ridge Regression behaves like Ordinary Least Squares (OLS), since there is no penalty on the coefficients.
- If $\lambda$ is large → coefficients are heavily penalized, leading to smaller values and less overfitting.
- Choosing $\lambda$ → cross-validation is commonly used to find the optimal value (see the sketch below).
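As a minimal sketch of how this looks in practice (assuming scikit-learn, where the regularization strength is exposed as `alpha`, and a small synthetic dataset purely for illustration), cross-validating over candidate `alpha` values might look like this:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Synthetic data: 100 samples, 5 features (illustrative only)
rng = np.random.RandomState(0)
X = rng.randn(100, 5)
y = X @ np.array([1.5, -2.0, 0.0, 0.5, 3.0]) + 0.1 * rng.randn(100)

# Cross-validate over candidate regularization strengths (lambda = alpha here)
grid = GridSearchCV(Ridge(), {"alpha": [0.01, 0.1, 1.0, 10.0]}, cv=5)
grid.fit(X, y)
print("Best alpha:", grid.best_params_["alpha"])
```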
Lasso (Least Absolute Shrinkage and Selection Operator) Regression is a regularized version of Linear Regression that uses an L1 penalty to shrink some coefficients to exactly zero.
- This makes it useful for feature selection and reducing model complexity.
Lasso Regression minimizes the following cost function:

$$J(\mathbf{w}) = \sum_{i=1}^{n} (y_i - \mathbf{w}^T x_i)^2 + \lambda \sum_{j=1}^{p} |w_j|$$

where:

- $J(\mathbf{w})$ is the loss function (sum of squared errors with an L1 penalty).
- $\mathbf{w}^T x_i$ is the predicted output.
- $\lambda$ is the regularization parameter, controlling the strength of the penalty.
- $\sum |w_j|$ is the L1 regularization term, which enforces sparsity.
- The L1 penalty $\lambda \sum |w_j|$ encourages some coefficients to become exactly zero.
- This leads to automatic feature selection, as irrelevant features are eliminated.
- Helps in handling high-dimensional datasets where feature selection is necessary.
- If $\lambda = 0$ → Lasso Regression behaves like Ordinary Least Squares (OLS).
- If $\lambda$ is large → more coefficients shrink to zero, making the model sparse.
- Choosing $\lambda$ → use cross-validation to find the optimal value.
- Lasso Regression is ideal for feature selection and handling sparse datasets.
- The L1 regularization term leads to simpler models with fewer, more relevant features.
- Tuning $\lambda$ is crucial to balancing model sparsity and predictive performance (see the sketch below).
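A minimal sketch (assuming scikit-learn and a synthetic dataset in which only a few features matter, purely for illustration) showing how `LassoCV` picks the regularization strength by cross-validation and zeroes out irrelevant coefficients:

```python
import numpy as np
from sklearn.linear_model import LassoCV

# Synthetic data: only features 0, 3, and 7 influence the target (illustrative only)
rng = np.random.RandomState(0)
X = rng.randn(200, 10)
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + 0.5 * X[:, 7] + 0.1 * rng.randn(200)

lasso = LassoCV(cv=5).fit(X, y)  # chooses the regularization strength by cross-validation
print("Chosen alpha:", lasso.alpha_)
print("Number of zeroed coefficients:", int((lasso.coef_ == 0).sum()))
```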
KNN is a non-parametric, instance-based algorithm that classifies a data point based on the majority class of its k nearest neighbors.
Given a dataset $D = \{(x_i, y_i)\}_{i=1}^{n}$, where $x_i$ is a feature vector and $y_i$ its class label, KNN predicts the class of a test point $x$ as

$$\hat{y} = \arg\max_{c \in C} \sum_{x_i \in N_k(x)} \mathbb{1}(y_i = c)$$

where:

- $C$ is the set of class labels.
- $N_k(x)$ is the set of $k$ nearest neighbors of $x$ based on distance.
- $\mathbb{1}(y_i = c)$ is the indicator function, a counter that adds 1 if the neighbor belongs to class $c$, and 0 otherwise.
A neighbor of a point x is any other data point in the dataset whose distance from x is among the k smallest distances. Each data point in the dataset consists of:
- A feature vector $x_i$ (e.g., age, height, weight).
- A label $y_i$ (for classification) or a numerical value $y_i$ (for regression).

When we receive a new test point $x$, we:

- Compute the distance from $x$ to all data points in the training set.
- Sort the distances in ascending order.
- Select the $k$ closest points as neighbors.
Euclidean Distance (default):

$$d(x, x') = \sqrt{\sum_{i=1}^{n} (x_i - x'_i)^2}$$
Other options:
- Manhattan Distance:
  $$d(x, x') = \sum_{i=1}^{n} |x_i - x'_i|$$
- Minkowski Distance (generalized):
  $$d(x, x') = \left( \sum_{i=1}^{n} |x_i - x'_i|^p \right)^{\frac{1}{p}}$$
- Small $k$: sensitive to noise, may overfit.
- Large $k$: smooths the decision boundary, may underfit.
- Use cross-validation to find the best $k$.
Instead of simple majority voting, neighbors can be weighted by distance:

$$\hat{y} = \arg\max_{c \in C} \sum_{x_i \in N_k(x)} w_i \, \mathbb{1}(y_i = c), \quad w_i = \frac{1}{d(x, x_i)}$$

where $d(x, x_i)$ is the distance between the test point and neighbor $x_i$, so closer neighbors receive larger weights.
Instead of voting, KNN regression averages the target values of the nearest neighbors:

$$\hat{y} = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$$

With distance weighting, this becomes a weighted average in which closer neighbors have more influence:

$$\hat{y} = \frac{\sum_{x_i \in N_k(x)} w_i \, y_i}{\sum_{x_i \in N_k(x)} w_i}, \quad w_i = \frac{1}{d(x, x_i)}$$
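A minimal sketch (assuming scikit-learn and its built-in Iris dataset for illustration) of distance-weighted KNN classification; `p=2` makes the Minkowski metric Euclidean:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# weights="distance" gives closer neighbors more influence than plain majority voting
knn = KNeighborsClassifier(n_neighbors=5, weights="distance", metric="minkowski", p=2)
knn.fit(X_train, y_train)
print("Test accuracy:", knn.score(X_test, y_test))
```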
Decision Trees are used for classification and regression by recursively splitting data into subsets based on feature values. The goal is to create pure nodes where all samples belong to the same category (classification) or have minimal variance (regression).
At each node (a point where the data is split on a feature's value), the best split is chosen using an impurity measure such as Gini impurity or entropy:

$$Gini = 1 - \sum_{i} p_i^2, \qquad Entropy = - \sum_{i} p_i \log_2 p_i$$

where:

- $p_i$ is the probability of class $i$ in the node.
- Lower values mean purer splits.
A split is chosen to maximize Information Gain, the drop from the parent node's impurity to the weighted impurity of its children:

$$IG = I_{parent} - \sum_{j} \frac{n_j}{n} \, I_{child_j}$$
For regression, Decision Trees minimize variance within each node using the Mean Squared Error (MSE):

$$MSE = \frac{1}{n} \sum_{i=1}^{n} (y_i - \hat{y})^2$$

where $\hat{y}$ is the mean target value in the node.
The tree stops growing when:
- A maximum depth is reached.
- Nodes have fewer than a minimum number of samples.
- Further splits do not significantly reduce impurity.
- Pre-pruning: Limits tree depth or minimum samples per node.
- Post-pruning: Removes branches that do not improve generalization.
Feature | Classification | Regression |
---|---|---|
Output | Class labels | Continuous values |
Splitting Metric | Gini, Entropy | MSE |
Prediction Rule | Majority class in leaf | Mean of leaf values |
- Decision Trees are easy to interpret and handle non-linearity well.
- Overfitting is common, so pruning and hyperparameter tuning are crucial.
- They serve as the foundation for powerful models like Random Forests and Gradient Boosting.
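A minimal sketch (assuming scikit-learn and the Iris dataset for illustration) of a pre-pruned classification tree; `criterion` switches between Gini impurity and entropy:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Pre-pruning via max_depth and min_samples_leaf limits overfitting
tree = DecisionTreeClassifier(criterion="gini", max_depth=3, min_samples_leaf=5, random_state=42)
tree.fit(X_train, y_train)
print("Test accuracy:", tree.score(X_test, y_test))
```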
Random Forest is an ensemble learning algorithm that combines multiple Decision Trees to improve accuracy, reduce overfitting, and enhance generalization.
- Bagging (Bootstrap Aggregation): Each tree is trained on a random subset of data sampled with replacement.
- Random Feature Selection: At each split, only a random subset of features is considered.
- Majority Voting (Classification): The final class is chosen based on the most common prediction among trees.
- Averaging (Regression): The final prediction is the average of all tree outputs.
Each tree $b$ is trained on a bootstrap dataset $D_b$, sampled with replacement from the original dataset $D$, where $|D_b| = |D|$ but some samples appear multiple times and others are left out.

At each split, only a random subset of features $F_b \subset F$ is considered, where $F$ is the total set of features.
For classification, the final prediction is the majority vote among trees:

$$\hat{y} = \arg\max_{c} \sum_{b=1}^{B} \mathbb{1}(T_b(x) = c)$$

where:

- $B$ is the number of trees.
- $T_b(x)$ is the class prediction from the $b^{\text{th}}$ tree.
- $c$ is a class label.

For regression, the final prediction is the average of all tree outputs:

$$\hat{y} = \frac{1}{B} \sum_{b=1}^{B} T_b(x)$$
- Reduces Overfitting: Aggregating multiple trees lowers variance.
- Handles High-Dimensional Data: Feature randomness helps in feature selection.
- Scales Well: Parallelizable across multiple processors.
- Robust to Noise: Reduces overfitting by using bootstrap sampling.
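A minimal sketch (assuming scikit-learn and its breast-cancer dataset for illustration); `n_estimators` is the number of trees $B$, and `max_features` controls the random feature subset considered at each split:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# 200 bootstrapped trees, each split restricted to sqrt(|F|) random features
rf = RandomForestClassifier(n_estimators=200, max_features="sqrt", random_state=42)
rf.fit(X_train, y_train)
print("Test accuracy:", rf.score(X_test, y_test))
```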
K-Means is a centroid-based clustering algorithm that partitions a dataset into $K$ clusters.

Given a dataset $\{x_1, x_2, \dots, x_n\}$, K-Means minimizes the within-cluster sum of squared distances:

$$J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \| x_i - \mu_k \|^2$$

where:

- $\mu_k$ is the centroid (mean) of cluster $C_k$.
- $\| x_i - \mu_k \|^2$ is the squared Euclidean distance between a data point and its assigned cluster centroid.

1. Initialize $K$ centroids randomly.
2. Assign each data point to the nearest centroid.
3. Update centroids by computing the mean of the assigned points.
4. Repeat steps 2-3 until the centroids converge or a maximum number of iterations is reached.
- Computationally efficient for large datasets.
- Works well when clusters are well-separated and spherical.
- Requires specifying $K$ beforehand.
- Sensitive to outliers and initial centroid placement.
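A minimal sketch (assuming scikit-learn and synthetic blob data for illustration); `inertia_` is the objective $J$ above:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with 3 well-separated clusters (illustrative only)
X, _ = make_blobs(n_samples=300, centers=3, random_state=42)

kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
labels = kmeans.fit_predict(X)
print("Within-cluster sum of squares:", kmeans.inertia_)
```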
Agglomerative clustering is a hierarchical, bottom-up approach that merges data points into clusters iteratively.
Given a dataset $\{x_1, x_2, \dots, x_n\}$, where each point initially forms its own cluster, the distance between two clusters $A$ and $B$ is defined by a linkage criterion:

- Single Linkage: $$d(A, B) = \min_{x_i \in A, x_j \in B} \| x_i - x_j \|$$ (distance between the closest points in the clusters)
- Complete Linkage: $$d(A, B) = \max_{x_i \in A, x_j \in B} \| x_i - x_j \|$$ (distance between the farthest points in the clusters)
- Average Linkage: $$d(A, B) = \frac{1}{|A||B|} \sum_{x_i \in A} \sum_{x_j \in B} \| x_i - x_j \|$$ (average distance between all pairs in the clusters)
- Treat each data point as its own cluster.
- Compute pairwise distances between all clusters.
- Merge the closest clusters based on linkage criterion.
- Repeat until a single cluster remains or a predefined number of clusters is reached.
- Does not require specifying the number of clusters in advance.
- Can capture complex cluster structures.
- Computationally expensive for large datasets ($O(n^2)$ complexity).
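A minimal sketch (assuming scikit-learn and synthetic blob data for illustration); the `linkage` argument selects one of the criteria above:

```python
from sklearn.cluster import AgglomerativeClustering
from sklearn.datasets import make_blobs

X, _ = make_blobs(n_samples=150, centers=3, random_state=42)

# linkage can be "ward", "complete", "average", or "single"
agg = AgglomerativeClustering(n_clusters=3, linkage="average")
labels = agg.fit_predict(X)
print("Cluster sizes:", [int((labels == k).sum()) for k in range(3)])
```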
An imbalanced dataset occurs when one class significantly outnumbers the other(s), which can lead to biased model performance. For example, in fraud detection, the number of fraudulent transactions is much lower than legitimate ones.
- Bias towards majority class: Models may predict the majority class more often, ignoring the minority class.
- Poor generalization: The model may not learn meaningful patterns for minority class instances.
- Skewed performance metrics: Accuracy may be misleading, as high accuracy can be achieved by always predicting the majority class.
- Random Oversampling: Duplicates minority class samples.
- SMOTE (Synthetic Minority Over-sampling Technique): Generates synthetic data points.
```python
from imblearn.over_sampling import SMOTE

smote = SMOTE()
X_resampled, y_resampled = smote.fit_resample(X, y)
```
- Random Undersampling: Removes majority class samples.
- Cluster-Based Undersampling: Selects representative samples from the majority class.
```python
from imblearn.under_sampling import RandomUnderSampler

undersample = RandomUnderSampler()
X_resampled, y_resampled = undersample.fit_resample(X, y)
```
- Class Weighting: Assigns higher weights to the minority class.
```python
from sklearn.ensemble import RandomForestClassifier

clf = RandomForestClassifier(class_weight='balanced')
clf.fit(X_train, y_train)
```
Anomaly detection aims to identify rare or unusual patterns that do not conform to expected behavior in a dataset.
For normally distributed data, anomalies can be detected using the probability density function (PDF):

$$P(X) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(X - \mu)^2}{2\sigma^2}\right)$$

- $X$: data point
- $\mu$: mean of the distribution
- $\sigma$: standard deviation
- Anomalies occur when $P(X)$ is below a certain threshold (e.g., more than 3 standard deviations from the mean).
Anomalies can also be detected based on distance from neighbors, for example the average distance to the k nearest neighbors:

$$D(X) = \frac{1}{k} \sum_{i=1}^{k} d(X, X_i)$$

- $D(X)$: anomaly score of $X$
- $X_i$: the k-nearest neighbors of $X$
- $d(X, X_i)$: distance metric (e.g., Euclidean)
- Higher distance values indicate anomalies.
Isolation Forest isolates anomalies based on tree partitioning. The anomaly score is:

$$s(X, n) = 2^{-\frac{E(h(X))}{c(n)}}$$

- $E(h(X))$: average path length of $X$ across the trees
- $c(n)$: normalization factor (the average path length for $n$ samples)
- A shorter path length means $X$ is more likely an anomaly.
Method | Approach | Suitable for |
---|---|---|
Gaussian Model | Probability Density | Normally distributed data |
kNN | Distance-Based | Data with clusters |
Isolation Forest | Tree Partitioning | High-dimensional data |
Selecting the right method depends on the dataset's distribution and the type of anomalies present.
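A minimal sketch (assuming scikit-learn and synthetic data with a few injected outliers, purely for illustration) of Isolation Forest in practice:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)
X_normal = rng.randn(200, 2)                            # points from a normal distribution
X_outliers = rng.uniform(low=-6, high=6, size=(10, 2))  # injected anomalies
X_all = np.vstack([X_normal, X_outliers])

iso = IsolationForest(contamination=0.05, random_state=42)
pred = iso.fit_predict(X_all)  # -1 = anomaly, 1 = normal
print("Detected anomalies:", int((pred == -1).sum()))
```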
A Confusion Matrix is a tabular representation of a classification model’s performance.
The F1 Score is the harmonic mean of Precision and Recall, providing a balanced evaluation of the classifier's performance.
The ROC Curve evaluates the performance of a binary classifier across different decision thresholds. It plots the True Positive Rate (TPR) against the False Positive Rate (FPR).
Given a classifier outputting predicted probabilities, a decision threshold $t$ turns each probability into a class prediction; at each threshold, the TPR and FPR are computed as

$$TPR = \frac{TP}{TP + FN}, \qquad FPR = \frac{FP}{FP + TN}$$
The Area Under the ROC Curve (AUC-ROC) quantifies the classifier's performance across all thresholds. A higher AUC-ROC indicates better performance.
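A minimal sketch (assuming scikit-learn, its breast-cancer dataset, and a simple logistic-regression pipeline, all for illustration) computing the confusion matrix, F1 score, and AUC-ROC:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix, f1_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

clf = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
y_prob = clf.predict_proba(X_test)[:, 1]  # probabilities for the positive class

print(confusion_matrix(y_test, y_pred))
print("F1 score:", f1_score(y_test, y_pred))
print("AUC-ROC:", roc_auc_score(y_test, y_prob))
```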
Cross Validation (CV) is a technique to estimate the generalization performance of a machine learning model by partitioning the dataset into multiple subsets for training and validation.
The dataset is divided into K equal-sized folds, where each fold is used once as a validation set while the rest serve as the training set. The model is trained K times, each time on a different combination of training and validation sets. The final cross-validation score is the average validation loss across all K folds.
- For each fold $k$, compute the validation loss $L_k$ on the held-out fold.
- Compute the final cross-validation score by averaging over all $K$ folds:

$$CV = \frac{1}{K} \sum_{k=1}^{K} L_k$$

- Leave-One-Out Cross Validation is a special case where $K = n$, meaning each data point is used once as a validation set while the rest serve as the training set.
- Provides a more reliable estimate of model performance than a single train/test split.
- Reduces variance in performance estimation by averaging multiple training-validation runs.
- Useful for hyperparameter tuning when combined with techniques like Grid Search.
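A minimal sketch (assuming scikit-learn and the Iris dataset for illustration) of K-fold and Leave-One-Out cross-validation:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import LeaveOneOut, cross_val_score

X, y = load_iris(return_X_y=True)
model = LogisticRegression(max_iter=1000)

# 5-fold CV: the score is averaged over 5 train/validation splits
scores = cross_val_score(model, X, y, cv=5)
print("5-fold mean accuracy:", scores.mean())

# Leave-One-Out CV (K = n): more expensive, every point serves once as the validation set
loo_scores = cross_val_score(model, X, y, cv=LeaveOneOut())
print("LOOCV mean accuracy:", loo_scores.mean())
```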
Feature selection is the process of selecting the most relevant features from a dataset to improve model performance, reduce overfitting, and enhance interpretability. Automatic feature selection methods help streamline this process by leveraging statistical and machine learning techniques.
Filter methods evaluate features independently of the model by assessing their relationship with the target variable.
Mutual information measures the dependence between a feature $X$ and the target $Y$:

$$I(X; Y) = \sum_{x} \sum_{y} P(x, y) \log \frac{P(x, y)}{P(x)\,P(y)}$$

where:

- $P(x,y)$: the joint probability of $X=x$ and $Y=y$, i.e., the probability that the feature and target take specific values together.
- $P(x)$: the marginal probability of $X=x$, representing how often a feature value appears in the dataset.
- $P(y)$: the marginal probability of $Y=y$, representing how often a target value appears.
A higher mutual information score indicates a stronger relationship between the feature and the target.
The Pearson correlation coefficient measures the linear relationship between a feature $X$ and the target $Y$:

$$\rho_{X,Y} = \frac{\mathrm{Cov}(X, Y)}{\sigma_X \, \sigma_Y}$$
Features with high absolute correlation are often retained, while highly correlated redundant features may be removed.
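A minimal sketch (assuming scikit-learn and its breast-cancer dataset for illustration) of a filter method that keeps the 10 features with the highest mutual information:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import SelectKBest, mutual_info_classif

X, y = load_breast_cancer(return_X_y=True)

# Score each feature against the target and keep the top 10
selector = SelectKBest(score_func=mutual_info_classif, k=10)
X_selected = selector.fit_transform(X, y)
print("Original features:", X.shape[1], "-> selected:", X_selected.shape[1])
```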
Wrapper methods select features by training a model and evaluating performance using subsets of features.
RFE iteratively removes the least important features by training a model and ranking feature importance:
- Train a model with all features.
- Compute feature importance scores.
- Remove the least important feature(s).
- Repeat until the desired number of features remains.
The optimal subset $S^*$ is the one that minimizes the validation loss:

$$S^* = \arg\min_{S \subseteq F} L_{\text{val}}(f_S)$$

where $F$ is the full feature set and $f_S$ is the model trained on subset $S$.
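A minimal sketch (assuming scikit-learn, its breast-cancer dataset, and a random forest as the ranking estimator, all for illustration) of Recursive Feature Elimination:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

X, y = load_breast_cancer(return_X_y=True)

# Repeatedly drop the least important feature until 5 remain
rfe = RFE(estimator=RandomForestClassifier(n_estimators=100, random_state=42),
          n_features_to_select=5, step=1)
rfe.fit(X, y)
print("Selected feature mask:", rfe.support_)
print("Feature ranking (1 = selected):", rfe.ranking_)
```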
Embedded methods perform feature selection during model training by incorporating regularization techniques.
LASSO regression adds an L1 penalty to the loss function:

$$J(\mathbf{w}) = \sum_{i=1}^{n} (y_i - \mathbf{w}^T x_i)^2 + \lambda \sum_{j} |w_j|$$

where $\lambda$ controls the strength of regularization. Features whose coefficients are shrunk to exactly zero are effectively removed from the model.
Decision trees and ensemble models (e.g., Random Forest, XGBoost) naturally rank feature importance by evaluating the decrease in impurity caused by each feature.
The importance score for feature $j$ is the total impurity decrease contributed by splits on that feature, weighted by the fraction of samples reaching each node and averaged over the trees:

$$\text{Importance}(j) = \sum_{t \,\in\, \text{splits on } j} \frac{n_t}{n} \, \Delta I_t$$

where $n_t$ is the number of samples at node $t$, $n$ is the total number of samples, and $\Delta I_t$ is the impurity reduction at that split.
When applying feature selection, it is essential to:
- Use cross-validation to prevent overfitting.
- Compare different feature selection methods and validate the final subset on a separate test set.
- Consider domain knowledge to avoid discarding useful but weakly correlated features.
Feature selection improves computational efficiency, enhances model interpretability, and can lead to better generalization performance.
Principal Component Analysis is a dimensionality reduction technique that transforms high-dimensional data into fewer dimensions while preserving as much important information as possible.
Imagine taking a photo of a three-dimensional object:
- From some angles, it looks cluttered and unclear.
- But if you rotate the object, you can find the best angle that captures the most details in two dimensions.
PCA does the same thing. It finds the best way to project high-dimensional data into a lower dimension while keeping important patterns.
- PCA finds new axes (directions), called Principal Components, that capture the most variance (spread of data).
- The first Principal Component (PC1) is the direction where the data varies the most.
- The second Principal Component (PC2) is perpendicular to the first and captures the next highest variance.
- Eigenvectors represent the directions of the new axes (principal components).
- Eigenvalues measure the importance (variance captured) by each principal component.
More variance means more information retained.
- We keep only the top $k$ principal components that explain most of the variance.
- This helps reduce noise, speed up computations, and avoid overfitting.
Given a dataset $X$ with $n$ samples and $d$ features:

- Standardize the data (zero mean, unit variance).
- Compute the covariance matrix $\Sigma$:
  $$\Sigma = \frac{1}{n-1} X^T X$$
- Find the eigenvectors and eigenvalues of $\Sigma$.
- Sort the eigenvectors by their eigenvalues (largest to smallest).
- Select the top $k$ eigenvectors to form a transformation matrix $W$.
- Project the data onto the new axes:
  $$Z = X W$$
- When you have high-dimensional data and need to reduce complexity.
- When you want to remove noise and improve efficiency.
- When you want to visualize data in two or three dimensions.
- PCA loses some information (variance).
- PCA assumes linear relationships (not ideal for highly nonlinear data).
To see PCA in action, run the script below: it demonstrates PCA on a sample face-image dataset and visualizes the principal components, i.e., the most important directions in the data.

```bash
python3 ./toolkit/pca.py
```
Grid Search is a systematic procedure to find the optimal hyperparameters for a machine learning model. Suppose we have:
- A dataset $(x_i, y_i)_{i=1}^{n}$, where each $x_i$ represents the features and $y_i$ the corresponding label or target.
- A set of possible hyperparameter values $\Lambda = \{\lambda_1, \lambda_2, \dots, \lambda_m\}$. Each $\lambda_j$ might be a single hyperparameter (e.g., a regularization parameter) or a combination of multiple hyperparameters (e.g., learning rate, number of trees, etc.).

For each hyperparameter combination $\lambda_j \in \Lambda$:

- Train the model $\theta_j$ using $\lambda_j$.
- Evaluate the model's performance using an error or score function $L(f_{\theta_j})$ on a validation set.
- Select the $\lambda_j$ that yields the best (lowest or highest, depending on whether it is a loss or a score) average performance on the validation set.
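A minimal sketch of this procedure (assuming scikit-learn, the Iris dataset, and an SVM classifier purely for illustration) using `GridSearchCV`:

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# The grid of candidate hyperparameter combinations (the set Lambda above)
param_grid = {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}
search = GridSearchCV(SVC(), param_grid, cv=5, scoring="accuracy")
search.fit(X, y)
print("Best hyperparameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```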
One-Hot Encoding is a method to represent categorical variables as binary vectors, ensuring that machine learning models can process them numerically without imposing an ordinal relationship.
Given a categorical variable $X$ with $K$ distinct categories, each category $k$ is mapped to a binary vector $e_k \in \{0, 1\}^K$ that has a 1 in position $k$ and 0 everywhere else, so exactly one element is "hot" per sample.
- Eliminates ordinal relationships in categorical data.
- Ensures compatibility with machine learning models.
- Can increase dimensionality, so alternatives like target encoding or embeddings may be preferable for high-cardinality data.
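A minimal sketch (assuming scikit-learn ≥ 1.2, which uses the `sparse_output` argument, and a toy color feature purely for illustration):

```python
import numpy as np
from sklearn.preprocessing import OneHotEncoder

# One categorical feature with three categories (illustrative only)
colors = np.array([["red"], ["green"], ["blue"], ["green"]])

encoder = OneHotEncoder(sparse_output=False, handle_unknown="ignore")
encoded = encoder.fit_transform(colors)
print(encoder.categories_)
print(encoded)  # one binary column per category, a single 1 per row
```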
A Pipeline in Scikit-Learn is a structured way to automate a machine learning workflow by chaining together multiple processing steps, ensuring a streamlined and reproducible approach to model training and evaluation.
A Pipeline consists of a sequence of transformations $T_1, T_2, \dots, T_m$ followed by a final estimator $f$, where each $T_i$ transforms the data before passing it to the next step, so the overall model computes

$$\hat{y} = f\big(T_m(\dots T_2(T_1(X)) \dots)\big)$$

where $X$ is the input data and $\hat{y}$ the prediction. In Python, for example:
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression

# Define the pipeline
pipeline = Pipeline([
    ('scaler', StandardScaler()),          # Step 1: Standardize features
    ('pca', PCA(n_components=5)),          # Step 2: Reduce dimensionality
    ('classifier', LogisticRegression())   # Step 3: Train model
])

# Fit and predict
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)
```
This provides an efficient way to structure ML workflows, ensuring consistency and ease of use. By encapsulating preprocessing and modeling into a single object, they prevent data leakage and simplify hyperparameter tuning, making them an essential tool in modern machine learning.
Feature scaling is essential in machine learning to ensure all features contribute equally to model training. Scikit-Learn provides various scalers:
Standardizes features by removing the mean and scaling to unit variance:

$$x' = \frac{x - \mu}{\sigma}$$

- $\mu$: mean of the feature
- $\sigma$: standard deviation of the feature
- Suitable for (approximately) normally distributed data.
Scales features to a fixed range $[0, 1]$:

$$x' = \frac{x - x_{\min}}{x_{\max} - x_{\min}}$$
- Retains original data distribution.
- Sensitive to outliers.
Uses the median and IQR (Interquartile Range) to reduce the impact of outliers:

$$x' = \frac{x - \text{median}(x)}{IQR}$$

- IQR = $Q_3 - Q_1$ (75th percentile minus 25th percentile)
- More robust to outliers than StandardScaler.
Scales each sample (row) to unit norm:

$$x' = \frac{x}{\|x\|}$$
- Useful for text classification and sparse data.
- Ensures all samples have the same magnitude.
Scaler | Effect | Outlier Sensitivity |
---|---|---|
StandardScaler | Zero mean, unit variance | High |
MinMaxScaler | Scale to [0,1] | High |
RobustScaler | Median & IQR scaling | Low |
Normalizer | Normalize row-wise | N/A |
Choose the scaler based on data characteristics and sensitivity to outliers.
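A minimal sketch (assuming scikit-learn and a tiny matrix with an outlier, purely for illustration) comparing the four scalers side by side:

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, RobustScaler, StandardScaler

# Small matrix with an outlier (100.0) in the first column (illustrative only)
X = np.array([[1.0, 200.0], [2.0, 300.0], [3.0, 400.0], [100.0, 500.0]])

for scaler in (StandardScaler(), MinMaxScaler(), RobustScaler(), Normalizer()):
    print(type(scaler).__name__)
    print(scaler.fit_transform(X))
```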
Handling missing values is crucial in machine learning. Scikit-Learn provides various imputers to fill missing values effectively.
Fills missing values using a specified strategy:
- Strategies: `mean`, `median`, `most_frequent`, `constant`
- Suitable for numerical and categorical data.
Uses the k-nearest neighbors to impute missing values, e.g., by averaging the feature value over the neighbors:

$$\hat{x}_i = \frac{1}{k} \sum_{j \in N(i)} x_j$$

- $N(i)$: k-nearest neighbors of sample $i$.
- Useful when missing values depend on nearby points.
Predicts missing values using regression models:
- Iteratively estimates missing values based on other features.
- Suitable for complex relationships in data.
Identifies missing values as a separate binary feature:
- Helps models learn patterns in missing data.
Imputer | Method | Suitable For |
---|---|---|
SimpleImputer | Mean/Median/Mode | Basic missing data |
KNNImputer | Nearest neighbor averaging | Data with local dependencies |
IterativeImputer | Predictive modeling | Complex feature relationships |
MissingIndicator | Binary indicator | Feature engineering |
Choose the imputer based on data characteristics and the nature of missing values.
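A minimal sketch (assuming scikit-learn and a tiny array with missing entries, purely for illustration) of the first three imputers; note that `IterativeImputer` still requires the experimental-enable import:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401, required for IterativeImputer
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

# Toy data with missing values (illustrative only)
X = np.array([[1.0, 2.0], [np.nan, 3.0], [7.0, np.nan], [4.0, 5.0]])

print(SimpleImputer(strategy="mean").fit_transform(X))
print(KNNImputer(n_neighbors=2).fit_transform(X))
print(IterativeImputer(random_state=0).fit_transform(X))
```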
MLOps (Machine Learning Operations) is the practice of automating and streamlining the lifecycle of machine learning models, including:

- development
- deployment
- monitoring
- maintenance

It integrates principles from DevOps and applies them to ML workflows.
The goal of training a machine learning model is to minimize a loss function:

$$\theta^* = \arg\min_{\theta} L(\theta)$$

where:

- $L(\theta)$: loss function (e.g., MSE, Cross-Entropy)
- $\theta$: model parameters (weights, biases)

Once trained, an ML model is a function $\hat{y} = f_{\theta}(x)$, where $x$ is the input and $\hat{y}$ the prediction served at inference time.
Deployed models must be monitored over time using metrics such as accuracy, precision, recall, and drift detection:
Drift detection measures changes in the data distribution, for example with the Kullback-Leibler divergence:

$$D_{KL}(P \,\|\, Q) = \sum_{x} P(x) \log \frac{P(x)}{Q(x)}$$

where:

- $D_{KL}$: Kullback-Leibler divergence
- $P(x)$: distribution of the training data
- $Q(x)$: distribution of the incoming data
```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.metrics import accuracy_score
import joblib
import pandas as pd

# Load and split dataset
data = pd.read_csv("your_dataset.csv")
X = data.drop(columns=["target"])
y = data["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Define pipeline
pipeline = Pipeline([
    ("scaler", StandardScaler()),
    ("classifier", RandomForestClassifier(random_state=42))
])

# Hyperparameter tuning
param_grid = {"classifier__n_estimators": [50, 100, 200]}
grid_search = GridSearchCV(pipeline, param_grid, cv=5, scoring="accuracy")
grid_search.fit(X_train, y_train)

# Save the best model
joblib.dump(grid_search.best_estimator_, "best_model.pkl")

# Load model and make predictions
model = joblib.load("best_model.pkl")
y_pred = model.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))

# Monitor data drift (basic example): compare feature means between train and test
def detect_drift(X_train, X_test):
    mean_diff = abs(X_train.mean() - X_test.mean())
    print("Mean Differences:", mean_diff)

detect_drift(X_train, X_test)
```
Stage | Process |
---|---|
Model Training | Train and tune the ML model |
Model Deployment | Save and load the model for inference |
Model Monitoring | Track model performance and detect data drift |
MLOps ensures that machine learning models remain reliable, scalable, and maintainable throughout their lifecycle.