A survey of prediction method using feature selection in genomics

Date
2020
Publisher
UMT Lahore
Abstract
Microarrays allow scientists to study the expression of hundreds or thousands of genes in a single experiment, typically with very few samples and biological replicates. Most microarray gene expression studies aim to classify human cancers by comparing gene expression levels in cancer cells with those in normal cells. Large volumes of such data have been deposited in repositories, where data scientists can use them to develop algorithms for efficient classification and gene selection. Because these data sets combine a very large number of genes with few biological samples, they are high-dimensional, for a number of biological and technical reasons. In this setting, feature selection reduces the dimensionality of the data, and therefore the number of features to be analyzed, without affecting prediction capability, which lowers the computational load and shortens analysis time. Two data sets were used, containing samples from three leukemia subclasses: B-cell acute lymphoblastic leukemia (ALL-B), T-cell acute lymphoblastic leukemia (ALL-T) and acute myeloid leukemia (AML). The training data set comprised 30 samples (10 ALL-B, 10 ALL-T and 10 AML). The test data set consisted of 38 samples (19 ALL-B, 8 ALL-T and 11 AML). Metagene factors were extracted and evaluated by the method described by Tamayo et al., 2007. Feature selection is the process of automatically selecting those features in the data that contribute most to the prediction variable or output of interest. The feature selection algorithms used were Information Gain, Gain Ratio, Gini Index, ReliefF, the Fast Correlation-Based Filter, Analysis of Variance (ANOVA) and Chi2. To test the hypothesis that feature selection may improve the prediction efficiency of metagene classification, the method proposed by Tamayo et al., 2007 was modified with a hybrid approach.
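As an illustration of filter-based feature selection, the following minimal sketch scores each gene with a one-way ANOVA F statistic and keeps the top-scoring fraction. It is not the Orange implementation used in the study; the function names, the `fraction` parameter and the toy expression matrix are assumptions made for this example only.

```python
from collections import defaultdict


def anova_f_score(values, labels):
    """One-way ANOVA F statistic for one feature across the class groups."""
    groups = defaultdict(list)
    for value, label in zip(values, labels):
        groups[label].append(value)
    n, k = len(values), len(groups)
    grand_mean = sum(values) / n
    # Variation of the class means around the overall mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Variation of the samples around their own class mean.
    ss_within = sum(
        (v - sum(g) / len(g)) ** 2 for g in groups.values() for v in g
    )
    if ss_within == 0:
        return float("inf") if ss_between > 0 else 0.0
    return (ss_between / (k - 1)) / (ss_within / (n - k))


def select_top_fraction(matrix, labels, fraction):
    """Return the column indices of the best-scoring features, best first."""
    n_features = len(matrix[0])
    scores = [
        anova_f_score([row[j] for row in matrix], labels)
        for j in range(n_features)
    ]
    n_keep = max(1, round(fraction * n_features))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return ranked[:n_keep]


# Toy data (hypothetical): 6 samples x 4 genes, two samples per subclass.
X = [
    [1.0, 5.0, 3.0, 2.0],
    [1.2, 5.1, 2.0, 2.1],
    [5.0, 5.2, 3.1, 2.0],
    [5.1, 4.9, 2.2, 2.2],
    [9.0, 5.0, 2.9, 8.0],
    [9.2, 5.1, 2.1, 8.1],
]
y = ["ALL-B", "ALL-B", "ALL-T", "ALL-T", "AML", "AML"]

selected = select_top_fraction(X, y, 0.5)  # keep half of the genes
```

In this toy matrix only genes 0 and 3 differ between subclasses, so they are the ones retained at a fraction of 0.5. The other algorithms named above rank features by different criteria, but the select-the-top-fraction step would look the same.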
Feature selection was performed in Python using the Orange 3.25 GUI, and the selected data were analyzed using the metagene projection model. Each of the seven feature selection algorithms was applied at four selection levels, i.e., 100%, 75%, 50% and 25% of the total features were selected for metagene prediction. The original data contained 5571 features (genes). At the 100% level, all 5571 features were analyzed for metagene prediction with the SVM algorithm in the R graphical user interface. The results are presented as biplots of model and test samples, hierarchical clusterings of the original and projected data, and heat maps of the original and projected data. To evaluate the prediction ability of the model under the various feature selection settings, two parameters were calculated: the Brier score and the error percentage. ReliefF proved to be the best feature selection method, with the lowest Brier score and the lowest error percentage.
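The two evaluation parameters can be sketched as follows. This is a minimal illustration, assuming the multiclass form of the Brier score (mean squared distance between the predicted class probabilities and the one-hot true class); the abstract does not state which variant was used, and the function names and example probabilities are hypothetical.

```python
def brier_score(probabilities, true_labels, classes):
    """Mean multiclass Brier score over all samples (0 is a perfect score)."""
    total = 0.0
    for probs, truth in zip(probabilities, true_labels):
        # Squared error against the one-hot encoding of the true class.
        total += sum(
            (p - (1.0 if c == truth else 0.0)) ** 2
            for p, c in zip(probs, classes)
        )
    return total / len(true_labels)


def error_percentage(probabilities, true_labels, classes):
    """Percentage of samples whose most probable class is not the true class."""
    wrong = 0
    for probs, truth in zip(probabilities, true_labels):
        best = max(range(len(classes)), key=lambda i: probs[i])
        if classes[best] != truth:
            wrong += 1
    return 100.0 * wrong / len(true_labels)


# Hypothetical predictions for four test samples.
classes = ["ALL-B", "ALL-T", "AML"]
probs = [
    [1.00, 0.00, 0.00],
    [0.25, 0.75, 0.00],
    [0.00, 0.00, 1.00],
    [0.00, 1.00, 0.00],  # confident but wrong: the true class is AML
]
truth = ["ALL-B", "ALL-T", "AML", "AML"]

score = brier_score(probs, truth, classes)       # 0.53125
error = error_percentage(probs, truth, classes)  # 25.0
```

The Brier score penalizes confident wrong predictions more heavily than hesitant ones, which is why it is reported alongside the plain error percentage.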