Original Research

Open Access Special Issue

Explainable stacking ensemble with feature tokenizer transformers for men's diabetes prediction

  • Vinh Quang Tran1
  • Younsung Choi2,*,†
  • Haewon Byeon2,*,†

1Department of Digital Anti-Aging Healthcare (BK21), Inje University, 50834 Gimhae, Republic of Korea

2Department of AI-Software, Inje University, 50834 Gimhae, Republic of Korea

DOI: 10.22514/jomh.2024.184 Vol. 20, Issue 11, November 2024, pp. 38-56

Submitted: 29 May 2024 Accepted: 22 August 2024

Published: 30 November 2024

(This article belongs to the Special Issue Prediction and management of diabetes for men's health)

*Corresponding Author(s): Younsung Choi E-mail: cys2020@inje.ac.kr
*Corresponding Author(s): Haewon Byeon E-mail: bhwpuma@naver.com

† These authors contributed equally.

Abstract

Diabetes is a leading global health concern: according to the World Health Organization (WHO), millions of deaths are linked to diabetes and its related complications. Early and accurate prediction is crucial for effective management. This study investigates the potential of a stacking ensemble approach for predicting diabetes in men (n = 5598). The ensemble combines a Feature Tokenizer Transformer (FT-Transformer), a deep learning technique, with various machine learning models. SHAP (SHapley Additive exPlanations) is used to enhance model interpretability. Compared with other stacking methods and standalone models, the proposed ensemble with a Random Forest meta-classifier, XGBoost, FT-Transformer and LightGBM achieved superior performance (accuracy: 0.8786, precision: 0.7989, recall: 0.8171, F1-score: 0.8079, Area Under the Curve (AUC): 0.8618). These findings suggest that stacking ensembles with deep learning and explainable artificial intelligence (AI) hold promise for improving diabetes prediction in men, potentially leading to better clinical decision-making and patient outcomes.
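The reported metrics can be related to one another with a short calculation. The sketch below is illustrative only and is not the authors' evaluation code: the function names and the confusion-matrix counts in the usage example are hypothetical, and only the precision (0.7989), recall (0.8171) and F1-score (0.8079) values come from the abstract. It shows how accuracy, precision, recall and F1 follow from confusion-matrix counts, and checks that the reported F1-score is the harmonic mean of the reported precision and recall.

```python
def f1_from(precision_value, recall_value):
    """F1 is the harmonic mean of precision and recall."""
    total = precision_value + recall_value
    if total == 0:
        return 0.0
    return 2 * precision_value * recall_value / total


def classification_metrics(tp, fp, fn, tn):
    """Compute standard binary classification metrics from confusion counts.

    tp, fp, fn, tn are true positives, false positives,
    false negatives and true negatives.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1_from(precision, recall),
    }


if __name__ == "__main__":
    # Consistency check using the precision and recall reported in the abstract.
    print(round(f1_from(0.7989, 0.8171), 4))  # 0.8079

    # Hypothetical counts, chosen only to illustrate the formulas.
    print(classification_metrics(tp=8, fp=2, fn=2, tn=88))
```

AUC is not included because it depends on the ranking of predicted scores rather than on a single set of confusion-matrix counts.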


Keywords

Feature tokenizer; Men’s health; Diabetes; Explainable artificial intelligence


Cite and Share

Vinh Quang Tran, Younsung Choi, Haewon Byeon. Explainable stacking ensemble with feature tokenizer transformers for men's diabetes prediction. Journal of Men's Health. 2024; 20(11): 38-56.

