Machine Learning Classifiers and Data Synthesis Techniques to Tackle Highly Imbalanced COVID-19 Data

Document Type : Original Article

Authors

1 Department of Computer Engineering, University College of Nabi Akram, Tabriz, Iran.

2 Department of Animal Science, Faculty of Agriculture, University of Yasouj, Yasouj, Iran.

Abstract

The COVID-19 pandemic has highlighted the urgent need for rapid and accurate diagnostic methods. In this study, we evaluate three machine learning models, Random Forest (RF), Logistic Regression (LR), and Decision Tree (DT), for detecting COVID-19. The models are trained on a preprocessed, imbalanced dataset of 5086 negative and 558 positive cases. To address this class imbalance, we apply two data synthesis algorithms, Conditional Tabular Generative Adversarial Network (CTGAN) and Tabular Variational Autoencoder (TVAE). Classifiers trained on the original and on the balanced datasets were evaluated for comparison. RF achieves the highest accuracy, 98.83%, on the CTGAN-balanced dataset. These results support the potential of coupling data synthesis with traditional machine learning for the diagnosis of COVID-19. We hope this research will contribute to current artificial intelligence (AI) efforts to combat the pandemic.
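To make the scale of the imbalance concrete, the following minimal sketch works through the arithmetic implied by the class counts reported above. It is not the authors' code. The function name `imbalance_summary` is hypothetical, and the 1:1 balancing target is an assumption, since the abstract does not state the class ratio of the balanced datasets.

```python
# Class counts reported in the abstract.
NEGATIVE = 5086
POSITIVE = 558


def imbalance_summary(n_negative, n_positive):
    """Summarize class imbalance for a binary dataset (hypothetical helper)."""
    total = n_negative + n_positive
    majority = max(n_negative, n_positive)
    minority = min(n_negative, n_positive)
    return {
        "total": total,
        # How many majority-class rows exist per minority-class row.
        "imbalance_ratio": majority / minority,
        # Accuracy of a trivial model that always predicts the majority class.
        "majority_baseline_accuracy": majority / total,
        # Synthetic minority rows needed for a 1:1 ratio (assumed target).
        "synthetic_rows_for_1_to_1": majority - minority,
    }


summary = imbalance_summary(NEGATIVE, POSITIVE)
for name, value in summary.items():
    print(f"{name}: {value:.4f}" if isinstance(value, float) else f"{name}: {value}")
```

On the original data, a model that always predicts "negative" already scores roughly 90% accuracy, which is why accuracy figures on imbalanced data are usually read alongside per-class metrics such as recall on the positive class.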


 