MITIGATING CLASS IMBALANCE IN TUBERCULOSIS DETECTION: COMBINING SMOTE AND TOMEK LINK WITH MODIFIED FOCAL LOSS AND CLASS WEIGHTING IN A TRANSFER LEARNING FRAMEWORK
Abstract
Tuberculosis (TB) remains a major global health concern, and early detection is crucial for effective treatment. This study addresses class imbalance in machine learning models for TB prediction, which often biases results toward the majority class. An approach combining the hybrid SMOTE-Tomek Links technique, a modified focal loss, and class weighting within a transfer learning framework was developed and applied to a dataset of X-ray images categorized into normal and TB classes. The hybrid SMOTE-Tomek Links method generates synthetic samples for the minority class and removes ambiguous samples near the class boundary, producing a more balanced dataset. The modified focal loss concentrates training on misclassified cases, while class weighting compensates for the remaining disparity between classes. The model was evaluated against benchmark models, including EfficientNetB3, Random Forest, and XGBoost, each with and without SMOTE. The developed model achieved an accuracy of 99.7%, compared with between 92.72% and 99.1% for the benchmark models. These results indicate that the proposed approach improves TB prediction accuracy under class imbalance. The findings offer a framework for medical image classification that could be applied beyond TB detection and could support TB diagnosis and management in clinical settings.
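The three ingredients named in the abstract can be illustrated with a minimal, standard-library-only sketch. It is not the authors' implementation: the abstract does not specify how the focal loss was modified, so the code uses the standard class-weighted binary focal loss, and it assumes resampling is applied to feature vectors rather than raw images. The function names (`smote`, `tomek_link_indices`, `focal_loss`) and the values of `k`, `alpha`, and `gamma` are hypothetical choices made for illustration.

```python
import math
import random


def dist(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic minority samples by interpolating between a
    randomly chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: dist(minority[i], minority[j]))
        j = rng.choice(others[:k])
        gap = rng.random()  # position along the segment between the two samples
        synthetic.append(tuple(
            a + gap * (b - a) for a, b in zip(minority[i], minority[j])
        ))
    return synthetic


def tomek_link_indices(X, y):
    """Return indices of samples that form Tomek links: pairs from opposite
    classes that are each other's nearest neighbour."""
    def nearest(i):
        return min((j for j in range(len(X)) if j != i),
                   key=lambda j: dist(X[i], X[j]))

    nn = [nearest(i) for i in range(len(X))]
    return {i for i in range(len(X)) if nn[nn[i]] == i and y[i] != y[nn[i]]}


def focal_loss(p_tb, label, alpha=0.25, gamma=2.0):
    """Standard class-weighted binary focal loss for one sample.

    p_tb  : predicted probability of the TB (positive) class
    label : 1 for TB, 0 for normal
    alpha : weight on the positive class (1 - alpha on the negative class)
    gamma : focusing parameter; larger values down-weight easy samples more
    """
    eps = 1e-7
    p_tb = min(max(p_tb, eps), 1.0 - eps)  # avoid log(0)
    p_t = p_tb if label == 1 else 1.0 - p_tb
    alpha_t = alpha if label == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)


if __name__ == "__main__":
    # Toy 2-D features: label 0 = normal (majority), label 1 = TB (minority).
    X = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.1), (2.0, 2.0), (5.0, 5.0), (2.05, 2.0)]
    y = [0, 0, 0, 0, 1, 1]
    minority = [x for x, label in zip(X, y) if label == 1]
    new_points = smote(minority, n_new=2, k=1)
    X_res, y_res = X + new_points, y + [1] * len(new_points)
    drop = tomek_link_indices(X_res, y_res)
    X_clean = [x for i, x in enumerate(X_res) if i not in drop]
    print(len(X_res), "samples after SMOTE;", len(X_clean), "after Tomek cleaning")
    print("loss for a confident vs. uncertain TB prediction:",
          focal_loss(0.95, 1), focal_loss(0.55, 1))
```

In this sketch both members of each Tomek link are removed, which is one common convention for the hybrid method; removing only the majority-class member is another. With `gamma=0` the loss reduces to class-weighted cross-entropy, which makes the separate roles of the focusing term and the class weight easy to see.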
Copyright (c) 2025 FUDMA JOURNAL OF SCIENCES

This work is licensed under a Creative Commons Attribution 4.0 International License.