Application of Machine Learning Techniques for Customer Churn Prediction in the Banking Sector
Previous studies have primarily focused on comparing predictive models without considering the impact of data preprocessing on model performance. Therefore, this study sets two main objectives. The first objective is to investigate the effect of resampling methods for handling imbalanced data on model effectiveness. The second objective is to compare and evaluate machine learning methods to identify the optimal model for each resampling technique, thereby determining the model that achieves the highest performance.
In the highly competitive banking industry, customer attrition is a major challenge for banks trying to improve retention. While many studies have focused on building and evaluating models to predict customer churn, they often overlook the problem of imbalanced data, which can significantly affect model accuracy.
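A small worked example illustrates why class imbalance matters for evaluation. The 5% churn rate and the trivial "never churns" predictor below are illustrative assumptions, not figures from this study. The sketch shows that accuracy alone can look strong while the model identifies no churners at all.

```python
# Toy labels: 1 = churned, 0 = stayed. The 5% churn rate is an illustrative assumption.
y_true = [1] * 50 + [0] * 950

# A trivial "model" that predicts that no customer ever churns.
y_pred = [0] * len(y_true)


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred, positive=1):
    """Fraction of actual positives that the model identifies."""
    true_pos = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    actual_pos = sum(t == positive for t in y_true)
    return true_pos / actual_pos if actual_pos else 0.0


print(f"accuracy = {accuracy(y_true, y_pred):.3f}")  # accuracy = 0.950
print(f"recall   = {recall(y_true, y_pred):.3f}")    # recall   = 0.000
```

For this reason, churn models are usually judged on recall, precision, or F1 for the minority class in addition to accuracy.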
In this study, following exploratory data analysis (EDA), we apply various techniques to address data imbalance and use a range of machine learning models, including Naïve Bayes, Logistic Regression, Support Vector Machine, Decision Tree, Random Forest, Gradient Boosting, XGBoost, and LightGBM, to predict customer churn using the dataset.
The contribution of this research lies in its comprehensive evaluation and comparison of various techniques for handling imbalanced data in churn prediction models. The study identifies SMOTE-ENN as the most effective method for resampling imbalanced data. Among the models tested, LightGBM (accuracy = 0.979) achieves the highest performance based on the evaluation metrics. Additionally, the research highlights that tree-based machine learning models generally outperform the other models when trained on imbalanced datasets.
Tree-based and ensemble models perform better than regression and probability-based methods when dealing with imbalanced data. SMOTE-ENN substantially improves the performance of the machine learning models.
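The mechanics of SMOTE-ENN can be sketched in a few lines of standard-library Python. This is a simplified illustration, not the implementation used in the study: the toy two-dimensional data, the function names, the choice of k = 3, and the majority-vote variant of Edited Nearest Neighbours are all assumptions made for the example. SMOTE first creates synthetic minority samples by interpolating between a minority point and one of its minority-class neighbours, and ENN then removes samples whose label disagrees with most of their neighbours.

```python
import math
import random


def k_nearest(point, points, k):
    """Return the indices of the k points closest to `point` (Euclidean distance)."""
    order = sorted(range(len(points)), key=lambda i: math.dist(point, points[i]))
    return order[:k]


def smote(X, y, minority=1, k=3, seed=0):
    """Oversample the minority class until both classes have the same size."""
    rng = random.Random(seed)
    minority_pts = [x for x, label in zip(X, y) if label == minority]
    n_new = (len(y) - len(minority_pts)) - len(minority_pts)
    X_new, y_new = list(X), list(y)
    for _ in range(max(n_new, 0)):
        base = rng.choice(minority_pts)
        # The closest match is the base point itself (distance 0), so drop it.
        neighbours = k_nearest(base, minority_pts, k + 1)[1:]
        other = minority_pts[rng.choice(neighbours)]
        gap = rng.random()  # position along the segment from base to neighbour
        X_new.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
        y_new.append(minority)
    return X_new, y_new


def edited_nearest_neighbours(X, y, k=3):
    """Keep only samples whose label matches the majority of their k neighbours."""
    keep = []
    for i, (x, label) in enumerate(zip(X, y)):
        others = [j for j in range(len(X)) if j != i]
        nearest = sorted(others, key=lambda j: math.dist(x, X[j]))[:k]
        votes = sum(y[j] == label for j in nearest)
        if votes > len(nearest) / 2:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Toy data (an assumption for illustration): 40 retained and 8 churned customers.
rng = random.Random(42)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
X += [(rng.gauss(2, 1), rng.gauss(2, 1)) for _ in range(8)]
y = [0] * 40 + [1] * 8

X_sm, y_sm = smote(X, y)
X_res, y_res = edited_nearest_neighbours(X_sm, y_sm)
print("original    :", y.count(0), "stayed,", y.count(1), "churned")
print("after SMOTE :", y_sm.count(0), "stayed,", y_sm.count(1), "churned")
print("after ENN   :", y_res.count(0), "stayed,", y_res.count(1), "churned")
```

In practice, a maintained implementation such as the `SMOTEENN` class in the imbalanced-learn package would normally be preferred, and resampling should be applied to the training split only so that the test set keeps its original class distribution.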
Practitioners can deploy high-performance models, such as XGBoost and LightGBM, combined with effective resampling methods like SMOTE-ENN to predict customer churn in banking, marketing, and human resources.
To further optimize the predictive models in this study, researchers can focus on feature selection, dimensionality reduction, or hyperparameter tuning.
Customer churn reduces revenue and threatens competitive advantage, so businesses need effective retention strategies to maintain sustainable growth. High-performance customer churn prediction models can be an effective solution to address this issue.
Future work should deploy the model on real-world datasets while further optimizing feature selection and hyperparameter tuning, combined with SHAP value analysis to identify the key features that most influence the model’s predictions.
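SHAP values are based on Shapley values from cooperative game theory, and the idea can be shown with an exact computation on a toy model. Everything in the sketch below is hypothetical: the `churn_score` function, its three features (complaints, inactivity flag, tenure in years), and the baseline customer are invented for illustration, and "removing" a feature is approximated by replacing it with its baseline value. The `shap` library uses more efficient, model-specific estimators rather than this brute-force enumeration.

```python
from itertools import combinations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values of each feature of x, relative to a baseline input."""
    n = len(x)

    def value(coalition):
        # Features in the coalition take the customer's value; the rest take the baseline.
        mixed = [x[j] if j in coalition else baseline[j] for j in range(n)]
        return model(mixed)

    phi = []
    for i in range(n):
        rest = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Weight of every coalition of this size that excludes feature i.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(rest, size):
                members = set(subset)
                total += weight * (value(members | {i}) - value(members))
        phi.append(total)
    return phi


def churn_score(f):
    """Hypothetical churn score: f = [complaints, inactive flag, tenure in years]."""
    return 0.4 * f[0] + 0.2 * f[1] + 0.3 * f[0] * f[1] - 0.05 * f[2]


customer = [1, 1, 2]   # one complaint, inactive, two years of tenure
baseline = [0, 0, 5]   # hypothetical "typical" customer

phi = shapley_values(churn_score, customer, baseline)
for name, contribution in zip(["complaints", "inactive", "tenure"], phi):
    print(f"{name:10s} {contribution:+.3f}")
```

The contributions sum to the difference between the customer's score and the baseline score, which is the property that makes such attributions useful for ranking the features behind an individual churn prediction.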