Comparative Study of Machine Learning Methods for Disease Classification Based on Natural Language Symptom Descriptions

Ery Setiyawan Jullev Atmadji; Adityo Permana Wibowo; Edi Faizal

doi:10.56705/ijaimi.v3i2.361

Authors

Ery Setiyawan Jullev Atmadji, Politeknik Negeri Jember
Adityo Permana Wibowo, Universitas Teknologi Yogyakarta
Edi Faizal, Universitas Teknologi Digital Indonesia

Keywords:

Natural Language Processing, Disease Classification, Symptom Description, Machine Learning, Support Vector Machine, Naive Bayes, Random Forest, TF-IDF, Text Classification, Telemedicine

Abstract

The growing demand for remote healthcare has increased the importance of efficient disease diagnosis from textual symptom descriptions. This study applies three machine learning models, Multinomial Naive Bayes, Random Forest, and Support Vector Machine (SVM), to classify 24 diseases from natural language symptom inputs. Using a balanced dataset of 1,200 samples and TF-IDF feature extraction, we trained the models and evaluated them by test accuracy and cross-validation. SVM achieved the highest test accuracy, 97.5%, and performed consistently across all disease categories. These findings underscore the potential of classical machine learning approaches to enhance digital diagnostic tools, particularly for early screening in telemedicine applications. Future work could extend this study by integrating deep learning architectures and multilingual capabilities to cover broader and more diverse healthcare scenarios.
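The general idea of classifying symptom text from TF-IDF features can be illustrated with a minimal sketch. It is not the authors' implementation: it covers only one of the three model families (Multinomial Naive Bayes), uses only the Python standard library, and assumes whitespace tokenization, a common smoothed IDF variant, and Laplace smoothing. The four training sentences, the two labels, and all function names are hypothetical and are not drawn from the study's 1,200-sample dataset.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization, no stemming or stop words.
    return text.lower().split()


def fit_idf(docs):
    # Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1.
    n = len(docs)
    df = Counter(term for doc in docs for term in set(tokenize(doc)))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf(doc, idf):
    # L2-normalized TF-IDF vector as a dict; terms unseen in training are ignored.
    tf = Counter(term for term in tokenize(doc) if term in idf)
    vec = {term: count * idf[term] for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {term: w / norm for term, w in vec.items()} if norm else {}


def train_mnb(vectors, labels, vocab, alpha=1.0):
    # Multinomial Naive Bayes over TF-IDF weights with Laplace smoothing (alpha).
    weight_sums = defaultdict(Counter)
    for vec, label in zip(vectors, labels):
        for term, weight in vec.items():
            weight_sums[label][term] += weight
    model = {}
    for label, n_label in Counter(labels).items():
        total = sum(weight_sums[label].values()) + alpha * len(vocab)
        log_likelihood = {
            term: math.log((weight_sums[label][term] + alpha) / total)
            for term in vocab
        }
        model[label] = (math.log(n_label / len(labels)), log_likelihood)
    return model


def predict(model, vec):
    # Return the label with the highest log prior plus weighted log likelihood.
    def score(label):
        log_prior, log_likelihood = model[label]
        return log_prior + sum(w * log_likelihood[t] for t, w in vec.items())
    return max(model, key=score)


# Hypothetical toy data, for illustration only.
train_texts = [
    "itchy red rash on my arms",
    "skin rash with dry itchy patches",
    "high fever and chills at night",
    "fever with body aches and chills",
]
train_labels = ["skin condition", "skin condition", "febrile illness", "febrile illness"]

idf = fit_idf(train_texts)
vocab = sorted(idf)
model = train_mnb([tfidf(t, idf) for t in train_texts], train_labels, vocab)

print(predict(model, tfidf("itchy rash on my legs", idf)))
```

A comparison like the one reported here would fit Random Forest and SVM classifiers on the same TF-IDF features and score each model on a held-out test split and with cross-validation.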
