Federated Tree-Based Ensembles with SHAP Explainability and Integrated Feature Selection for Secure Lung Cancer Health Analytics
Federated learning (FL) enables collaborative model training without centralizing raw data, thereby preserving privacy relative to traditional centralized machine learning. This work develops a privacy-preserving and explainable framework for health risk prediction in decentralized, heterogeneous environments, where explainability is achieved through local SHAP analyses at each client combined into an aggregated global SHAP view. The framework addresses the challenge of maintaining data confidentiality while achieving high predictive performance.
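The paper does not specify how the local SHAP analyses are combined into the global view; a minimal sketch, assuming each client shares only its mean absolute SHAP value per feature (no raw data) and the server takes a sample-count-weighted average, could look like this. The function name and the illustrative numbers are hypothetical.

```python
import numpy as np

def aggregate_shap(client_shap, client_sizes):
    """Combine per-client mean |SHAP| vectors into one global importance view.

    client_shap  : list of 1-D arrays, each the mean |SHAP| value per feature
                   computed locally at one client (raw data never leaves it).
    client_sizes : local sample counts, used as aggregation weights.
    """
    weights = np.asarray(client_sizes, dtype=float)
    weights /= weights.sum()                    # normalize to a convex combination
    stacked = np.vstack(client_shap)            # shape: (n_clients, n_features)
    return weights @ stacked                    # weighted mean per feature

# Illustrative values for three clients and four features (not from the paper)
local_importances = [
    np.array([0.30, 0.10, 0.05, 0.20]),
    np.array([0.25, 0.15, 0.10, 0.30]),
    np.array([0.35, 0.05, 0.00, 0.25]),
]
global_view = aggregate_shap(local_importances, client_sizes=[100, 200, 100])
```

Sharing only aggregate attribution statistics keeps the explainability step consistent with the privacy goal, since per-record SHAP values are never transmitted.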
Traditional centralized health analytics solutions compromise data privacy; this paper introduces a federated learning approach that enables collaborative model building without exposing raw data.
The study integrates TreeNetX, a stable ensemble meta-learner, with SHAP-based explainability. Each client first performs local feature selection using Recursive Feature Elimination with Cross-Validation (RFECV), then trains gradient boosting models (XGBoost, LightGBM, CatBoost) on its own data; federated aggregation addresses non-IID client distributions. Experiments on three heterogeneous, tabular lung-cancer datasets from Kaggle validate the model's performance using accuracy, precision, recall, F1-score, AUC, and calibration-related measures.
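The per-client pipeline described above can be sketched as follows. This is a simplified stand-in, not the paper's implementation: scikit-learn's GradientBoostingClassifier substitutes for XGBoost/LightGBM/CatBoost, the synthetic data and the simple probability-averaging step substitute for the Kaggle datasets and TreeNetX's meta-learning aggregation, and the even row split ignores the non-IID structure the paper handles.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_selection import RFECV

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)
# Split rows across three simulated clients (a non-IID split would go here).
client_idx = np.array_split(rng.permutation(len(X)), 3)

client_models = []
for idx in client_idx:
    Xc, yc = X[idx], y[idx]
    # Step 1: local RFECV feature selection, as in the paper.
    selector = RFECV(GradientBoostingClassifier(n_estimators=25,
                                                random_state=0),
                     step=1, cv=3)
    selector.fit(Xc, yc)
    # Step 2: local gradient-boosted training on the selected features only.
    model = GradientBoostingClassifier(n_estimators=25, random_state=0)
    model.fit(Xc[:, selector.support_], yc)
    client_models.append((selector.support_, model))

def federated_predict_proba(X_new):
    """Average each client's class-1 probability (aggregation stand-in)."""
    probs = [m.predict_proba(X_new[:, mask])[:, 1]
             for mask, m in client_models]
    return np.mean(probs, axis=0)

preds = (federated_predict_proba(X) >= 0.5).astype(int)
```

Because only fitted models (or their predictions) are exchanged, no client's raw records leave its site, which is the core privacy property the framework relies on.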
This paper contributes a novel federated ensemble learning framework that enhances predictive accuracy, robustness, and model interpretability, while ensuring data privacy in sensitive healthcare applications.
The federated TreeNetX model achieved 91.17% accuracy and an AUC of 0.9667. It outperformed individual client models in robustness and generalization. SHAP-based analysis provided clinically meaningful insights, enhancing model trustworthiness.
Practitioners should adopt federated ensemble strategies like TreeNetX for privacy-conscious predictive analytics to maintain compliance with data protection regulations while achieving high model performance.
Researchers should further explore hybrid federated architectures that combine explainability and advanced ensemble techniques to optimize interpretability and performance across heterogeneous environments.
The approach promotes ethical, privacy-preserving AI adoption in healthcare and other sensitive fields, contributing to safer and more trustworthy smart systems.
Future studies should investigate scaling the framework to even larger, more diverse client networks, integrating additional explainability tools, and extending applications beyond healthcare to other regulated industries.