This project develops a predictive model that estimates the probability of customer churn at a telecom company. Using historical customer data, it combines exploratory data analysis (EDA), data preprocessing, and machine learning techniques to build an accurate churn prediction model.
Exploratory Data Analysis (EDA):
- Dataset Overview:
- The project analyzes a dataset of over 70,000 unique customers from an internet service provider.
- Includes customer-specific attributes such as subscription age, average bill amount, service failures, and download/upload speeds.
- Data Visualization:
- Histograms for numeric variables (e.g., `subscription_age`, `bill_avg`) to understand their distribution.
- Correlation heatmaps to identify relationships between features, such as `service_failure_count` and `churn`.
- Bar plots for binary features like `is_tv_subscriber` and `is_movie_package_subscriber` to observe their impact on churn.
- Insights Derived:
- Customers with higher `service_failure_count` show a strong likelihood of churning.
- A longer `remaining_contract` correlates with reduced `churn` likelihood.
- Excessive `download_over_limit` is a potential `churn` driver.
- Feature Importance:
- The analysis highlights key predictors of `churn`, including `service_failure_count`, `remaining_contract`, and `bill_avg`.
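The numbers behind the plots described above can be sketched with pandas alone. This is a minimal illustration, not the notebooks' actual code: the column names come from this README, but the data is synthetic and the relationship between `service_failure_count` and `churn` is an invented assumption.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the real dataset. Column names follow the README;
# the values and the churn relationship are invented for illustration only.
rng = np.random.default_rng(0)
n = 1_000
df = pd.DataFrame({
    "subscription_age": rng.uniform(0, 10, n),
    "bill_avg": rng.gamma(2.0, 10.0, n),
    "service_failure_count": rng.poisson(0.3, n),
    "is_tv_subscriber": rng.integers(0, 2, n),
})
# Assumption: churn becomes more likely as service failures increase.
churn_prob = 1 / (1 + np.exp(-(df["service_failure_count"] - 0.5)))
df["churn"] = (rng.uniform(0, 1, n) < churn_prob).astype(int)

# Distribution summary: the information a histogram of each column conveys.
numeric_summary = df[["subscription_age", "bill_avg"]].describe()

# Correlation of each feature with churn: one column of a correlation heatmap.
churn_corr = df.corr()["churn"].drop("churn").sort_values(ascending=False)

# Churn rate per value of a binary feature: the bar heights of a bar plot.
churn_by_tv = df.groupby("is_tv_subscriber")["churn"].mean()

print(numeric_summary)
print(churn_corr)
print(churn_by_tv)
```

The same three objects can be passed to Matplotlib or Seaborn to draw the histograms, heatmap, and bar plots.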
Data Preprocessing:
- Processing Missing Values:
- Columns like `remaining_contract`, `download_avg`, and `upload_avg` were analyzed.
- `remaining_contract`: Missing values (or zeros) are common among churned customers, possibly because the value is removed when a contract is terminated. This column was dropped due to its high missing rate (52.5%).
- `download_avg` and `upload_avg`: Outliers were identified, and missing values were replaced with the median, which is less sensitive to those outliers than the mean.
- Encoding Categorical Variables:
- Binary variables (`is_tv_subscriber`, `is_movie_package_subscriber`) were already binary, so no additional encoding was needed.
- Normalization of Features:
- Standardization was applied to numerical features to ensure consistency in model input.
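The preprocessing steps above could look roughly like the following. This is a sketch under assumptions: the tiny data frame is invented, and the use of scikit-learn's `SimpleImputer` and `StandardScaler` is one reasonable way to do median imputation and standardization, not necessarily what `preprocessing.ipynb` does.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# Tiny hypothetical frame; column names follow the README, values are invented.
raw = pd.DataFrame({
    "remaining_contract": [np.nan, 0.5, np.nan, 1.2, np.nan, np.nan],
    "download_avg": [10.0, np.nan, 35.5, 8.2, 900.0, 12.1],
    "upload_avg": [1.0, 2.5, np.nan, 0.7, 60.0, 1.4],
    "bill_avg": [20.0, 25.0, 18.0, 40.0, 22.0, 30.0],
    "is_tv_subscriber": [1, 0, 1, 1, 0, 1],
})

# Drop the column with the high missing rate.
features = raw.drop(columns=["remaining_contract"])

# Median imputation: the 900.0 outlier barely affects the fill value.
speed_cols = ["download_avg", "upload_avg"]
imputer = SimpleImputer(strategy="median")
features[speed_cols] = imputer.fit_transform(features[speed_cols])

# Standardize numeric columns to zero mean and unit variance.
# The binary flag is left untouched.
numeric_cols = ["download_avg", "upload_avg", "bill_avg"]
scaler = StandardScaler()
features[numeric_cols] = scaler.fit_transform(features[numeric_cols])

print(features.round(2))
```

In a real pipeline the imputer and scaler would normally be fitted on the training split only and then applied to the test split, to avoid leaking test statistics into training.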
Model Development:
- Models Tried:
- Logistic Regression
- Decision Tree
- Random Forest
- Gradient Boosting
- Gradient Boosting Performance:
- This model demonstrated the best performance among all tested models.
- Key Metrics:
- Precision: 0.83 on average, with better performance on the positive class.
- Recall: 0.83 for both classes, reflecting balanced detection capability.
- F1-Score: Strong average of 0.83, indicating reliability.
- ROC-AUC: High value of 0.91, showing excellent class separation capability.
- Confusion Matrix: (figure not reproduced here)
- Gradient Boosting was chosen for its ability to capture complex patterns and deliver high precision and recall, making it ideal for real-world applications.
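A comparison of the four model families listed above might be set up as follows. This is an illustrative sketch only: it uses synthetic data from `make_classification` and default-ish hyperparameters chosen here as assumptions, so its scores are unrelated to the metrics reported in this README, and the best model on synthetic data need not be Gradient Boosting.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary-classification data standing in for the churn dataset.
X, y = make_classification(
    n_samples=2_000, n_features=8, n_informative=5, random_state=42
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

# Hyperparameters below are assumptions, not the project's tuned values.
models = {
    "Logistic Regression": LogisticRegression(max_iter=1_000),
    "Decision Tree": DecisionTreeClassifier(max_depth=5, random_state=42),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "Gradient Boosting": GradientBoostingClassifier(random_state=42),
}

# Fit each model and score it by ROC-AUC on the held-out split.
auc_scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]
    auc_scores[name] = roc_auc_score(y_test, proba)
    print(f"{name}: ROC-AUC = {auc_scores[name]:.3f}")

# Per-class precision, recall, and F1 for the best-scoring model.
best = max(auc_scores, key=auc_scores.get)
print(f"Best by ROC-AUC: {best}")
print(classification_report(y_test, models[best].predict(X_test)))
```

Scoring on a stratified held-out split keeps the class balance of the test set close to that of the full data, which matters when reporting per-class precision and recall.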
Deployment:
- Dockerized Model:
- The project includes a Dockerized environment for easy deployment and reproducibility across different systems.
- Streamlit App:
- A Streamlit web application is integrated into the project to provide an interactive interface for churn prediction.
- Languages: Python
- Libraries:
- Pandas, NumPy for data manipulation.
- Matplotlib, Seaborn for data visualization.
- Scikit-learn for machine learning.
- Streamlit for building an interactive web application to visualize data and make predictions.
- Tools:
- Jupyter Notebook for development and analysis.
- Git & GitHub for version control and collaboration.
- Docker for containerization.
To set up the environment and run the project, follow these steps:
- Clone the repository:
git clone https://github.com/jamleston/telecom-project
cd telecom-project
- Run the application:
docker-compose up --build
- Access the application: open your browser and follow the application link.
- Alternatively, view the hosted project at https://projectgoit11.streamlit.app/
- Use an input form to simulate customer scenarios by entering individual customer attributes.
- Receive predictions on whether a customer is likely to churn.
- View the impact of feature values on churn predictions in real time.
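The prediction step behind such an input form could be sketched as below. Everything here is hypothetical: the `FEATURES` list, the `predict_churn` helper, and the stand-in model trained on random data are assumptions for illustration, and the real `app.py` and `gradient_boosting_model.pkl` may expect different inputs.

```python
import pickle

import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical feature set; the real model's inputs may differ.
FEATURES = ["subscription_age", "bill_avg", "service_failure_count", "is_tv_subscriber"]

# Stand-in model trained on random data, where churn is defined (by assumption)
# as service_failure_count above 2.5. It is round-tripped through pickle to
# mimic loading a serialized model from disk.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.uniform(0, 5, size=(200, len(FEATURES))), columns=FEATURES)
y = (X["service_failure_count"] > 2.5).astype(int)
model = pickle.loads(pickle.dumps(GradientBoostingClassifier(random_state=0).fit(X, y)))


def predict_churn(model, customer: dict) -> float:
    """Return the predicted churn probability for one customer."""
    # Build a one-row frame with columns in the order used during training.
    row = pd.DataFrame([customer], columns=FEATURES)
    return float(model.predict_proba(row)[0, 1])


base = {"subscription_age": 2.0, "bill_avg": 3.0, "is_tv_subscriber": 1.0}
low = predict_churn(model, {**base, "service_failure_count": 0.0})
high = predict_churn(model, {**base, "service_failure_count": 4.0})
print(f"few failures:  {low:.2f}")
print(f"many failures: {high:.2f}")
```

Changing a single value and re-running the prediction, as done with `service_failure_count` here, is the simplest way to see how one attribute moves the predicted probability.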
├── internet_service_churn.csv # Original dataset
├── preprocessed_dataset.csv # Preprocessed dataset used for modeling
├── analysis/ # Jupyter notebooks for exploratory data analysis
│ ├── analisis_K.ipynb # EDA by Anastasya
│ └── analysis_artem.ipynb # EDA by Artem
├── images/ # Directory for storing visualizations
├── models/ # Model development notebooks
│ ├── model_decision_tree.ipynb # Decision Tree model training
│ ├── model_logistic_regression.ipynb # Logistic Regression model training
│ └── model_RF.ipynb # Random Forest model training
├── analysis_yuli.ipynb # Chosen EDA by Yuli
├── preprocessing.ipynb # Data preprocessing
├── model_GB.ipynb # Gradient Boosting model training
├── gradient_boosting_model.pkl # Serialized Gradient Boosting model
├── app.py # Streamlit application for churn prediction
├── Dockerfile # Docker setup for containerization
├── docker-compose.yml # Docker Compose file for deployment
├── requirements.txt # Python dependencies
└── README.md # Project documentation