Machine Learning-Based Clustering of Viruses Using Taxonomic and Genomic Features for Health Informatics Applications

Authors

  • Adityo Permana Wibowo, Universitas Teknologi Yogyakarta
  • Made Leo Radhitya, Institut Bisnis dan Teknologi Indonesia
  • Edi Faizal, Universitas Teknologi Digital Indonesia
  • Ika Arfiani, Universitas Ahmad Dahlan

DOI:

https://doi.org/10.56705/qstvhw47

Keywords:

Virus Clustering, Machine Learning, Taxonomic Features, Genomic Features, Health Informatics, Computational Virology, Pandemic Preparedness

Abstract

Viruses remain a major global public health concern because of their potential to cause outbreaks, epidemics, and pandemics. Rapid organization and analysis of virus-related data are important for computational virology, health informatics, and pandemic preparedness. This study proposes an unsupervised machine learning approach that clusters viruses by taxonomic and genomic characteristics. The dataset consisted of 70 virus records with attributes including family, genus, genome type, strand type, and envelope status. Because the dataset contained no predefined epidemiological labels or risk categories, the analysis was designed as an exploratory clustering task rather than a supervised prediction task. Preprocessing consisted of removing duplicates, handling missing values, standardizing categorical attributes, and transforming selected features with One-Hot Encoding. Three clustering algorithms were evaluated: K-Means, Agglomerative Clustering, and DBSCAN. Clustering performance was assessed using the Silhouette Score, Davies-Bouldin Index, and Calinski-Harabasz Score, and Principal Component Analysis was applied for two-dimensional visualization. K-Means with 10 clusters achieved a Silhouette Score of 0.7725 and a Davies-Bouldin Index of 0.8186. Agglomerative Clustering obtained the highest Silhouette Score, 0.7754, while DBSCAN produced fewer clusters with lower overall performance. Several biologically meaningful groups were identified, including clusters representing Flaviviridae, Coronaviridae, Herpesviridae, Poxviridae, and enveloped RNA viruses. However, a large proportion of records contained unknown values, which influenced the formation of a dominant incomplete-data cluster. These findings indicate that taxonomic and genomic features can support machine learning-based virus grouping, although data completeness remains a critical factor. This study provides an initial computational framework for AI-driven viral data exploration and may serve as a foundation for future viral risk stratification using enriched epidemiological and clinical features.
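
To make the encoding and evaluation steps named in the abstract concrete, the following is a minimal sketch in plain Python of One-Hot Encoding categorical virus attributes and scoring a grouping with the mean silhouette coefficient. It is not the authors' implementation: the abstract does not state which software was used, the six records are hypothetical toy data rather than the study's 70 records, and the names `one_hot_encode`, `silhouette_score`, and `RECORDS` are illustrative assumptions. The cluster labels are assigned by hand here instead of being produced by K-Means, Agglomerative Clustering, or DBSCAN.

```python
from math import sqrt

# Hypothetical toy records (NOT the study's dataset), using three of the
# attribute types mentioned in the abstract.
RECORDS = [
    {"family": "Flaviviridae", "genome_type": "RNA", "envelope": "yes"},
    {"family": "Flaviviridae", "genome_type": "RNA", "envelope": "yes"},
    {"family": "Herpesviridae", "genome_type": "DNA", "envelope": "yes"},
    {"family": "Herpesviridae", "genome_type": "DNA", "envelope": "yes"},
    {"family": "Adenoviridae", "genome_type": "DNA", "envelope": "no"},
    {"family": "Adenoviridae", "genome_type": "DNA", "envelope": "no"},
]


def one_hot_encode(records, columns):
    """Return (feature_names, vectors): one binary feature per (column, value)."""
    features = [
        (col, val)
        for col in columns
        for val in sorted({rec[col] for rec in records})
    ]
    vectors = [
        [1.0 if rec[col] == val else 0.0 for col, val in features]
        for rec in records
    ]
    return features, vectors


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette_score(vectors, labels):
    """Mean silhouette coefficient over all points (singletons contribute 0)."""
    clusters = {}
    for idx, label in enumerate(labels):
        clusters.setdefault(label, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("the silhouette coefficient needs at least two clusters")

    total = 0.0
    for i, label in enumerate(labels):
        own = clusters[label]
        if len(own) == 1:
            continue  # s(i) is defined as 0 for a singleton cluster
        # a: mean distance to the other members of the point's own cluster
        a = sum(euclidean(vectors[i], vectors[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(euclidean(vectors[i], vectors[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != label
        )
        denom = max(a, b)
        total += (b - a) / denom if denom > 0 else 0.0
    return total / len(labels)


feature_names, vectors = one_hot_encode(RECORDS, ["family", "genome_type", "envelope"])

# Hand-assigned labels: one grouping that follows family, one that mixes families.
by_family = silhouette_score(vectors, [0, 0, 1, 1, 2, 2])
mixed = silhouette_score(vectors, [0, 1, 0, 1, 2, 2])

print(len(feature_names), "one-hot features")
print("silhouette, grouped by family:", by_family)
print("silhouette, mixed grouping:", mixed)
```

In this toy setting the family-aligned grouping scores higher than the mixed one, which is the kind of comparison the reported Silhouette Scores summarize. A real analysis would more plausibly rely on an established library and would also need an explicit policy for the unknown values that the abstract identifies as driving the dominant incomplete-data cluster.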

References

[1] P. J. Walker et al., “Changes to virus taxonomy and to the International Code of Virus Classification and Nomenclature ratified by the International Committee on Taxonomy of Viruses (2021),” Arch. Virol., vol. 166, no. 9, pp. 2633–2648, Sep. 2021, doi: 10.1007/s00705-021-05156-1.

[2] Y.-M. Chen et al., “RNA viromes from terrestrial sites across China expand environmental viral diversity,” Nat. Microbiol., vol. 7, no. 8, pp. 1312–1323, Jul. 2022, doi: 10.1038/s41564-022-01180-2.

[3] P. J. Walker et al., “Recent changes to virus taxonomy ratified by the International Committee on Taxonomy of Viruses (2022),” Arch. Virol., vol. 167, no. 11, pp. 2429–2440, Nov. 2022, doi: 10.1007/s00705-022-05516-5.

[4] M. Krupovic et al., “Bacterial Viruses Subcommittee and Archaeal Viruses Subcommittee of the ICTV: update of taxonomy changes in 2021,” Arch. Virol., vol. 166, no. 11, pp. 3239–3244, Nov. 2021, doi: 10.1007/s00705-021-05205-9.

[5] S. Nayfach et al., “Metagenomic compendium of 189,680 DNA viruses from the human gut microbiome,” Nat. Microbiol., vol. 6, no. 7, pp. 960–970, Jun. 2021, doi: 10.1038/s41564-021-00928-6.

[6] P. Simmonds et al., “Changes to virus taxonomy and the ICTV Statutes ratified by the International Committee on Taxonomy of Viruses (2024),” Arch. Virol., vol. 169, no. 11, p. 236, Nov. 2024, doi: 10.1007/s00705-024-06143-y.

[7] D. Turner et al., “Abolishment of morphology-based taxa and change to binomial species names: 2022 taxonomy update of the ICTV bacterial viruses subcommittee,” Arch. Virol., vol. 168, no. 2, p. 74, Feb. 2023, doi: 10.1007/s00705-022-05694-2.

[8] B. E. Dutilh et al., “Perspective on taxonomic classification of uncultivated viruses,” Curr. Opin. Virol., vol. 51, pp. 207–215, Dec. 2021, doi: 10.1016/j.coviro.2021.10.011.

[9] E. V. Koonin, J. H. Kuhn, V. V. Dolja, and M. Krupovic, “Megataxonomy and global ecology of the virosphere,” ISME J., vol. 18, no. 1, p. wrad042, Jan. 2024, doi: 10.1093/ismejo/wrad042.

[10] A. Zielezinski et al., “Ultrafast and accurate sequence alignment and clustering of viral genomes,” Nat. Methods, vol. 22, no. 6, pp. 1191–1194, Jun. 2025, doi: 10.1038/s41592-025-02701-7.

[11] T. S. Postler et al., “Renaming of the genus Flavivirus to Orthoflavivirus and extension of binomial species names within the family Flaviviridae,” Arch. Virol., vol. 168, no. 9, p. 224, Sep. 2023, doi: 10.1007/s00705-023-05835-1.

[12] J. H. Kuhn et al., “2022 taxonomic update of phylum Negarnaviricota (Riboviria: Orthornavirae), including the large orders Bunyavirales and Mononegavirales,” Arch. Virol., vol. 167, no. 12, pp. 2857–2906, Dec. 2022, doi: 10.1007/s00705-022-05546-z.

[13] A. P. Camargo et al., “IMG/VR v4: an expanded database of uncultivated virus genomes within a framework of extensive functional, taxonomic, and ecological metadata,” Nucleic Acids Res., vol. 51, no. D1, pp. D733–D743, Jan. 2023, doi: 10.1093/nar/gkac1037.

[14] A. P. Camargo et al., “Identification of mobile genetic elements with geNomad,” Nat. Biotechnol., vol. 42, no. 8, pp. 1303–1312, Aug. 2024, doi: 10.1038/s41587-023-01953-y.

[15] K. Zheng et al., “VITAP: a high precision tool for DNA and RNA viral classification based on meta-omic data,” Nat. Commun., vol. 16, no. 1, p. 2226, Mar. 2025, doi: 10.1038/s41467-025-57500-7.

[16] J.-Z. Jiang et al., “Virus classification for viral genomic fragments using PhaGCN2,” Brief. Bioinform., vol. 24, no. 1, p. bbac505, Jan. 2023, doi: 10.1093/bib/bbac505.

[17] J. Shang, J. Jiang, and Y. Sun, “Bacteriophage classification for assembled contigs using graph convolutional network,” Bioinformatics, vol. 37, no. Supplement_1, pp. i25–i33, Aug. 2021, doi: 10.1093/bioinformatics/btab293.

[18] Y. Zhu, G. Chen, and Y. Sun, “VirTAXA: enhancing RNA virus taxonomic classification with remote homology search and tree-based validation,” Bioinformatics, vol. 40, no. 10, p. btae575, Oct. 2024, doi: 10.1093/bioinformatics/btae575.

[19] C. Peng, J. Shang, J. Guan, D. Wang, and Y. Sun, “ViraLM: empowering virus discovery through the genome foundation model,” Bioinformatics, vol. 40, no. 12, p. btae704, Nov. 2024, doi: 10.1093/bioinformatics/btae704.

[20] F. Alipour, C. Holmes, Y. Y. Lu, K. A. Hill, and L. Kari, “Leveraging machine learning for taxonomic classification of emerging astroviruses,” Front. Mol. Biosci., vol. 10, p. 1305506, Jan. 2024, doi: 10.3389/fmolb.2023.1305506.

[21] G. Chen, J. Jiang, and Y. Sun, “RNAVirHost: a machine learning–based method for predicting hosts of RNA viruses through viral genomes,” GigaScience, vol. 13, p. giae059, Jan. 2024, doi: 10.1093/gigascience/giae059.

[22] K. S. Azevedo, L. C. De Souza, M. G. F. Coutinho, R. De M. Barbosa, and M. A. C. Fernandes, “Deepvirusclassifier: a deep learning tool for classifying SARS-CoV-2 based on viral subtypes within the coronaviridae family,” BMC Bioinformatics, vol. 25, no. 1, p. 231, Jul. 2024, doi: 10.1186/s12859-024-05754-1.

[23] J. Shang, C. Peng, H. Liao, X. Tang, and Y. Sun, “PhaBOX: a web server for identifying and characterizing phage contigs in metagenomic data,” Bioinforma. Adv., vol. 3, no. 1, p. vbad101, Jan. 2023, doi: 10.1093/bioadv/vbad101.

[24] B. Hegarty et al., “Benchmarking informatics approaches for virus discovery: caution is needed when combining in silico identification methods,” mSystems, vol. 9, no. 3, pp. e01105-23, Mar. 2024, doi: 10.1128/msystems.01105-23.

[25] J. Galeeva, P. Kuzmichenko, A. Manolov, A. Lukashev, and E. Ilina, “Bioinformatics Tools and Approaches for Virus Discovery in Genomic Data: A Systematic Review,” Viruses, vol. 17, no. 12, p. 1538, Nov. 2025, doi: 10.3390/v17121538.

Published

2026-05-24