Optimizing Air Quality Index Classification Using Multiple Machine Learning Models and Oversampling Techniques

Nuwairy El Furqany

doi:10.56705/ijaimi.v3i2.322

Authors

Nuwairy El Furqany, Syiah Kuala University

Keywords:

Air Quality Index, Machine Learning, Classification, Random Oversampling

Abstract

Air quality significantly affects public health, environmental stability, and ecosystem balance. Accurate classification of the Air Quality Index (AQI) is critical for effective monitoring and management. Previous studies often relied on a single machine learning algorithm, which limited classification performance, particularly under class imbalance conditions. This study evaluates multiple machine learning algorithms for AQI classification, including Logistic Regression, Decision Tree, K-Nearest Neighbors, Random Forest, Support Vector Machine, and Naïve Bayes. A random oversampling technique was applied to address the imbalance among AQI categories. The dataset consists of secondary data on pollutant concentrations (PM₁₀, SO₂, CO, O₃, NO₂) and AQI categories collected from five monitoring stations between 2010 and 2023. Model performance was assessed using accuracy, precision, recall, and F1-score. Before applying oversampling, the Random Forest model achieved an accuracy of 97.68%. After applying random oversampling, performance improved to 99.60%, with consistently high precision, recall, and F1-scores across classes. Feature importance analysis revealed that ozone (O₃) was the most influential pollutant, contributing 67.14% to model decision-making. The results demonstrate that combining random oversampling with ensemble-based machine learning substantially enhances AQI classification performance. This approach offers a robust and scalable framework for future air quality monitoring and environmental data analysis applications.
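The random oversampling step described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: the function name random_oversample, the toy pollutant readings, and the category labels "Good", "Moderate", and "Unhealthy" are all assumptions made for illustration. The sketch duplicates randomly chosen minority-class rows until every class has as many rows as the largest class.

```python
import random
from collections import Counter, defaultdict


def random_oversample(rows, labels, seed=0):
    """Duplicate minority-class rows at random until all classes match the largest one.

    Hypothetical helper written for illustration; not taken from the paper.
    """
    rng = random.Random(seed)

    # Group the rows by their class label.
    by_class = defaultdict(list)
    for row, label in zip(rows, labels):
        by_class[label].append(row)

    # Every class is grown to the size of the largest class.
    target = max(len(members) for members in by_class.values())

    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Sample with replacement to fill the gap to the target size.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy readings in the order (PM10, SO2, CO, O3, NO2); values are invented.
rows = [
    (30, 10, 5, 20, 8), (35, 12, 6, 25, 9), (28, 9, 4, 18, 7), (40, 14, 7, 30, 10),
    (70, 20, 12, 80, 15), (75, 22, 13, 85, 16),
    (120, 35, 20, 150, 25),
]
labels = ["Good"] * 4 + ["Moderate"] * 2 + ["Unhealthy"]

balanced_rows, balanced_labels = random_oversample(rows, labels, seed=42)
print(Counter(labels))           # class counts before oversampling
print(Counter(balanced_labels))  # class counts after oversampling
```

Because oversampling only repeats existing rows, it is commonly applied to the training split alone so that duplicated rows do not also appear in the test data. The abstract does not state how the study handled this.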
