Leakage-Aware and Explainable Machine Learning for Healthcare Claim Fraud Detection Using Imbalanced Medical Insurance Data

Dian Hafidh Zulfikar; Ery Setiyawan Jullev Atmadji; Bagus Satrio Wahyu Poetro

doi:10.56705/z3207345

Authors

Dian Hafidh Zulfikar, Universitas Islam Negeri Raden Intan Lampung
Ery Setiyawan Jullev Atmadji, Politeknik Negeri Jember
Bagus Satrio Wahyu Poetro, Universitas Islam Sultan Agung Semarang

Keywords:

Healthcare Fraud Detection, Medical Insurance Claims, Machine Learning, Imbalanced Classification, Explainable AI, Data Leakage, Health Informatics

Abstract

Healthcare insurance fraud is a critical challenge for health systems because fraudulent claims can cause financial losses, increase administrative burden, and erode trust in healthcare services. This study proposes an explainable machine learning approach for detecting fraudulent healthcare insurance claims in imbalanced medical claim data. The dataset comprises 10,000 healthcare insurance claim records with 20 attributes, including patient information, provider characteristics, claim-related financial variables, medical codes, temporal features, and fraud labels. Fraudulent claims account for only 8.29% of the records, a pronounced class imbalance. Five machine learning models (Logistic Regression, Decision Tree, Random Forest, Extra Trees, and AdaBoost) were evaluated under three imbalance-handling strategies: baseline learning, class weighting, and SMOTE. In addition, two feature scenarios were compared: a full-feature scenario and a leakage-aware scenario that excluded variables likely to be known only after the claim decision, such as claim status and approved amount. The best full-feature model was Logistic Regression without additional imbalance handling, achieving an accuracy of 0.9900, precision of 0.9740, recall of 0.9036, F1-score of 0.9375, ROC-AUC of 0.9989, and PR-AUC of 0.9896. This model correctly detected 150 of the 166 fraudulent claims in the test set. However, the best leakage-aware model achieved a markedly lower F1-score of 0.6983, suggesting that the potentially leaked variables account for much of the full-feature performance. Feature importance analysis showed that claim amount, approved amount, claim submission delay, claim status, and provider-related variables were among the most influential predictors. These findings demonstrate that explainable machine learning can support healthcare claim fraud detection, but that careful attention must be given to class imbalance, data leakage, and the operational deployment context.
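
The reported full-feature metrics can be cross-checked from confusion-matrix counts. The following is a minimal sketch, not the authors' code. Only the 150 detected and 166 total fraudulent test claims come from the abstract; the false-positive count of 4 and the test-set size of 2,000 are assumptions chosen because they are consistent with the reported precision and accuracy, and the class name `ConfusionCounts` is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts with fraud as the positive class."""
    tp: int  # fraudulent claims flagged as fraud
    fp: int  # legitimate claims flagged as fraud
    fn: int  # fraudulent claims missed
    tn: int  # legitimate claims passed

    @property
    def precision(self) -> float:
        return self.tp / (self.tp + self.fp)

    @property
    def recall(self) -> float:
        return self.tp / (self.tp + self.fn)

    @property
    def f1(self) -> float:
        # Harmonic mean of precision and recall, written in count form.
        return 2 * self.tp / (2 * self.tp + self.fp + self.fn)

    @property
    def accuracy(self) -> float:
        total = self.tp + self.fp + self.fn + self.tn
        return (self.tp + self.tn) / total


# From the abstract: 150 of 166 fraudulent test claims were detected.
TP, FRAUD_TOTAL = 150, 166
# ASSUMPTIONS (not stated in the abstract): 4 false positives, 2,000 test rows.
FP, TEST_SIZE = 4, 2000

counts = ConfusionCounts(
    tp=TP,
    fp=FP,
    fn=FRAUD_TOTAL - TP,
    tn=TEST_SIZE - FRAUD_TOTAL - FP,
)

print(f"precision={counts.precision:.4f}")
print(f"recall={counts.recall:.4f}")
print(f"f1={counts.f1:.4f}")
print(f"accuracy={counts.accuracy:.4f}")
```

Under these assumed counts, the four values round to the precision, recall, F1-score, and accuracy quoted in the abstract. ROC-AUC and PR-AUC cannot be recovered this way because they depend on the full ranking of predicted scores rather than on a single decision threshold.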
