Omar Mozahim Malallah

Scientific Research

2022

The Effect of Data Splitting Methods on Classification Performance in Wrapper-Based Gene-Selection Model

2022-11

Academic Journal of Nawroz University (Issue: 11) (Volume: 4)

Given the high dimensionality of gene expression datasets, selecting informative genes is key to improving classification performance. Classification outcomes, however, are also affected by the data splitting strategy used for training and testing. This paper therefore investigates the impact of three data splitting methods on the performance of eight well-known classifiers when each is paired with the Cuttlefish Algorithm (CFA) as a wrapper-based gene-selection method. The classification algorithms included in this study are K-Nearest Neighbors (KNN), Logistic Regression (LR), Gaussian Naive Bayes (GNB), Linear Support Vector Machine (SVM-L), Sigmoid Support Vector Machine (SVM-S), Random Forest (RF), Decision Tree (DT), and Linear Discriminant Analysis (LDA). The data splitting methods tested are cross-validation (CV), train-test (TT), and train-validation-test (TVT). The classifiers were evaluated on nine cancer gene expression datasets using several evaluation measures, including accuracy, F1-score, and the Friedman test. Experimental results showed that LDA and SVM-L generally outperformed the other algorithms, whereas RF and DT produced the worst results. On most of the datasets used, the results across all algorithms showed that the train-test splitting method was more accurate than the train-validation-test method, while cross-validation was superior to both. Furthermore, RF and GNB were less affected by the data splitting technique than the other classifiers, whereas LDA was the most affected.
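The three data splitting methods compared in the abstract can be illustrated with a minimal sketch that partitions sample indices. This is not the paper's implementation: the function names, the 20% validation and test fractions, the choice of five folds, and the fixed random seed are all assumptions made for illustration, since the abstract does not state them.

```python
import random


def train_test_split_idx(n, test_frac=0.2, seed=0):
    """Train-test (TT): one shuffled split into train and test indices.

    test_frac and seed are illustrative assumptions, not values from the paper.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(n * test_frac))
    return idx[n_test:], idx[:n_test]


def train_val_test_split_idx(n, val_frac=0.2, test_frac=0.2, seed=0):
    """Train-validation-test (TVT): one shuffled split into three parts.

    The fractions are illustrative assumptions, not values from the paper.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = max(1, round(n * test_frac))
    n_val = max(1, round(n * val_frac))
    test = idx[:n_test]
    val = idx[n_test:n_test + n_val]
    train = idx[n_test + n_val:]
    return train, val, test


def k_fold_idx(n, k=5, seed=0):
    """Cross-validation (CV): yield (train, test) index lists for k folds.

    Every sample appears in exactly one test fold. k=5 is an assumption.
    """
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f_i, fold in enumerate(folds) if f_i != i for j in fold]
        yield train, test


if __name__ == "__main__":
    n_samples = 50  # hypothetical dataset size
    train, test = train_test_split_idx(n_samples)
    print("TT:", len(train), len(test))
    train, val, test = train_val_test_split_idx(n_samples)
    print("TVT:", len(train), len(val), len(test))
    for fold, (train, test) in enumerate(k_fold_idx(n_samples)):
        print("CV fold", fold, len(train), len(test))
```

In a wrapper-based setting, a candidate gene subset would typically be scored by training a classifier on the training indices and measuring it on the held-out part (the test set, the validation set, or the average over folds). How the paper wires these scores into CFA is not described in the abstract.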