Discovering Differences in Gender-Related Skeletal Muscle Aging Through the Majority Voting-Based Identification of Differently Expressed Genes
Abdouladeem Dreder1,2, Muhammad Tahir1, Huseyin Seker1 and Muhammad Anwar1
1Bio-Health Informatics Research Group, Faculty of Engineering and Environment,
The University of Northumbria at Newcastle, Newcastle-upon-Tyne, UK
2Biotechnology Research Centre, P. O. Box 30313, Tripoli, Libya
Understanding gene function (GF) is still a significant challenge in systems biology. Several machine learning and computational techniques have previously been used to understand GF. However, these attempts have not produced a comprehensive interpretation of the relationship between genes and differences in both age and gender. Although there are several thousand genes, very few differentially expressed genes play an active role in explaining age and gender differences. The core aim of this study is to uncover new biomarkers that can help distinguish between males and females according to the gene expression levels of skeletal muscle (SM) tissue. In our proposed multi-filter system (MFS), genes are first sorted using three different ranking techniques (t-test, Wilcoxon and Receiver Operating Characteristic (ROC)). Important genes are then selected by majority voting, based on the principle that combining multiple models can improve the generalization of the system. Experiments were conducted on microarray gene expression datasets, and the results indicate a significant increase in classification accuracy compared with existing systems.
Multi-Filter System, Filter Techniques, Micro Array Gene Expression, Skeletal muscle
Sexual dimorphism of skeletal muscle can occur due to age, and many of these age-related changes in skeletal muscle appear to be influenced by gender. For example, the muscle mass of men is larger than that of women, especially for type II fibers, while the proportion of oxidative type I muscle fibers is higher in women. Welle et al. reported that the muscle mass of men is larger than that of women owing to the higher level of testosterone, whose anabolic effect is well known. However, previous studies have failed to identify which genes are responsible for these anabolic effects. The molecular basis of gender differences is still unclear. Because about 50% of the cell mass of the human body is muscle, skeletal muscle is an important subject of study. Several age-related changes in skeletal muscle seem to be influenced by gender, and these changes in gene expression could be responsible for the decline in muscle function. In relation to sex, although many genes show gender-related expression, very few of them help to explain gender differences. Few comparisons of genome-wide gene expression profiles between men and women have been carried out.
Janssen et al. reported that the age-related reduction of skeletal muscle (SM) mass starts in the third decade of life and first appears in the lower-body SM. To find differences between men and women, they used the t-test, Pearson correlation and multiple regression to determine the relationship between age and skeletal muscle. Liu et al. used basic statistical analysis to compare males and females within each age group using gene expression profiles from skeletal muscle tissue. They identified important sex- and age-related gene functional groups using an intensity-based Bayesian moderated t-test and logistic regression. This was the first study to offer global evidence of extensive sex-related changes in the aging process of human skeletal muscle. Although the study showed interesting results, it used genes belonging to the X and Y chromosomes, which can easily discriminate between genders. Experiments were conducted using three groups, namely old women versus old men, young women versus old women, and young men versus old men. The main problem in their study, however, is that important genes were identified using the whole training data. This can lead to poor generalization, because one of the fundamental goals of machine learning is to generalize beyond the samples in the training data.
The main goal of this study is to extend the work reported by Liu et al. by identifying important genes with good generalization ability. To this end, we take the work of Liu et al. as a baseline, propose a new method, and compare the two on two datasets, which are described in the Material and Proposed Method section, in order to show the effectiveness of our approach.
In our proposed approach, which is inspired by ensembles of feature ranking methods for data-intensive applications, genes are first sorted using three different ranking techniques (t-test, Wilcoxon and ROC). Important genes are then selected by majority voting, based on the principle that combining multiple models can improve the generalization of the system. The scope of this paper is the selection of the most reliable genes and the evaluation of the classification power of the selected genes. Experiments were conducted on microarray gene expression datasets, and the results indicate a significant increase in classification accuracy compared with the genes obtained by the previous system. In this study, we applied the proposed technique to two datasets, and our system is able to identify differentially expressed genes for the following three case studies in relation to age and gender differences:
• Young Women versus Old Women
• Young Men versus Old Men
• Old Men versus Old Women
This paper is organized as follows. Section II describes material and the proposed method followed by results and discussion in Section III. Section IV concludes the paper.
II. MATERIAL AND PROPOSED METHOD
A. Micro array gene expression data set
In this study, two microarray gene expression datasets of skeletal muscle tissue are used. The datasets are publicly available in the Gene Expression Omnibus (GEO) database. A total of 58 individuals were involved in this investigation.
Study A included 22 healthy males and females of various ages: 7 males and 7 females were young (20-29 years old), and 4 males and 4 females were old (61-81 years old). Total ribonucleic acid (RNA) was extracted and gene expression profiling was performed using the Affymetrix Human Genome U133 Plus 2 chip. As in previous work, this dataset is divided into three cases: the first case involves 11 females (7 young and 4 old), the second case consists of 11 males (7 young and 4 old), and the last case contains 8 samples (4 old men and 4 old women). Study B included 36 samples, 15 young (7 men, 8 women) and 21 old (10 men and 11 women), with around 55,000 genes.
B. Genes subset selection using Feature ranking techniques
The datasets used in this study have extremely high dimensionality. The first dataset consists of around 55,000 genes with only 22 samples, while the second dataset involves 55,000 genes with 36 samples. This poses a significant challenge to machine learning methods, because there are far more features than samples. To address this problem, it is important to select a small subset of relevant features in order to reduce processing time and avoid overfitting. One possible solution is feature selection using feature ranking methods. In this study, three different filter methods are investigated. These methods are summarized as follows:
- t-test: a statistical hypothesis test in which the test statistic follows a Student's t distribution. It is usually used to evaluate whether the means of two classes are statistically different by computing the variability within, and the difference between, the two classes.
- Entropy: normally used for high-dimensional data to select a suitable number of features using the principle of entropy. In this method, the distance between the probability density functions of the classes is measured by divergence, which means that features with higher divergence are considered more suitable for discriminating between classes.
- Receiver Operating Characteristic (ROC): offers an effective way to characterize classifier sensitivity versus specificity. The curve plots sensitivity against 1-specificity for different values of the threshold, and features are ranked by the area under the ROC curve [20,21].
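The three filter criteria listed above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names are hypothetical, the t-test is taken to be the unequal-variance (Welch) form, and the entropy criterion is assumed to be the symmetric Kullback-Leibler divergence between Gaussians fitted to each class, which is one common reading of a divergence-based ranking.

```python
import math
from statistics import mean, variance


def t_score(a, b):
    """Absolute Welch t statistic between two lists of expression values."""
    se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
    return abs(mean(a) - mean(b)) / se if se > 0 else 0.0


def auc_score(a, b):
    """Area under the ROC curve when the raw expression value is the score.

    Computed as the fraction of (a, b) pairs ordered a > b (ties count half),
    then folded so 0.5 means no separation and 1.0 means perfect separation.
    """
    wins = sum((x > y) + 0.5 * (x == y) for x in a for y in b)
    auc = wins / (len(a) * len(b))
    return max(auc, 1.0 - auc)


def entropy_score(a, b):
    """Symmetric KL divergence between per-class Gaussian fits (assumption)."""
    ma, mb, va, vb = mean(a), mean(b), variance(a), variance(b)
    if va == 0 or vb == 0:
        return 0.0
    return 0.5 * (va / vb + vb / va - 2) + 0.5 * (ma - mb) ** 2 * (1 / va + 1 / vb)


def rank_genes(X, y, score):
    """Return gene indices sorted from highest to lowest score.

    X is a list of samples (each a list of gene values); y holds 0/1 labels.
    """
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row, lab in zip(X, y) if lab == 0]
        b = [row[g] for row, lab in zip(X, y) if lab == 1]
        scores.append(score(a, b))
    return sorted(range(len(scores)), key=lambda g: -scores[g])


# Toy data (made up): gene 0 is noise, gene 1 separates the two classes.
X = [[5.0, 1.0], [4.0, 1.2], [6.0, 0.9], [5.5, 3.0], [4.5, 3.2], [5.2, 2.9]]
y = [0, 0, 0, 1, 1, 1]
rankings = [rank_genes(X, y, s) for s in (t_score, auc_score, entropy_score)]
```

On this toy data all three criteria place gene 1 first; on real microarray data the three rankings generally differ, which is what motivates combining them.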
The selected subset of genes is tested for its generalization power using supervised classification. A k-nearest neighbor (KNN) classifier (k = 1, 3) is used to evaluate system performance, with leave-one-out cross-validation (LOOCV) as the evaluation technique.
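The evaluation protocol described above can be sketched as follows. This is an illustrative sketch only: the function names are hypothetical, and Euclidean distance is an assumption, since the distance measure is not specified in the text.

```python
import math
from collections import Counter


def knn_predict(train_X, train_y, x, k):
    """Predict the label of x by majority vote among its k nearest neighbors.

    Euclidean distance is assumed here.
    """
    order = sorted(range(len(train_X)), key=lambda i: math.dist(train_X[i], x))
    votes = Counter(train_y[i] for i in order[:k])
    return votes.most_common(1)[0][0]


def loocv_accuracy(X, y, k):
    """Leave-one-out cross-validation: each sample is the test set exactly once."""
    correct = 0
    for i in range(len(X)):
        train_X = X[:i] + X[i + 1:]
        train_y = y[:i] + y[i + 1:]
        correct += knn_predict(train_X, train_y, X[i], k) == y[i]
    return correct / len(X)


# Toy data (made up): one feature, two well-separated classes.
X = [[0.0], [0.1], [0.2], [1.0], [1.1], [1.2]]
y = [0, 0, 0, 1, 1, 1]
acc_1nn = loocv_accuracy(X, y, k=1)
acc_3nn = loocv_accuracy(X, y, k=3)
```

With an odd k and two classes the vote cannot tie, which is one practical reason for choosing k = 1 and k = 3.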
D. Proposed System
Figure I shows the framework of the proposed system, which is motivated by the fact that combining multiple models can improve the generalization of the system. We first divide the dataset into T folds using leave-one-out cross-validation. For example, with 20 samples there are 20 folds, each consisting of 19 samples for training and one sample for testing. For each fold, the multi-filter system (MFS) is applied, which includes three different feature ranking filters: t-test, ROC and Entropy. Each filter sorts the genes according to its own ranking criterion. From these sorted genes, a unique subset of N genes is obtained by majority voting, as illustrated in Table I. Assume that there are 10 genes in total and that the objective is to select the top 5 genes. Genes 9 and 10 are selected by all feature ranking techniques, so these are the most important genes. Genes 1, 4 and 5 are selected twice and are thus also considered important by the system. It should be noted that, because of majority voting, genes 2, 3 and 6 are not selected by the system. Finally, KNN is applied to the new subset of genes in order to check its predictive performance.
Fig. I. Proposed Multi-Filter System (MFS).
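The majority voting step can be sketched in a few lines. Table I is not reproduced here, so the three rankings below are hypothetical; they were chosen only to agree with the example in the text (genes 9 and 10 in all three top-5 lists, genes 1, 4 and 5 in two, genes 2, 3 and 6 in one). The function name and the `min_votes` parameter are likewise assumptions.

```python
from collections import Counter


def majority_vote(rankings, top_n, min_votes=2):
    """Keep genes that appear in the top_n of at least min_votes rankings."""
    votes = Counter(g for ranking in rankings for g in ranking[:top_n])
    return sorted(g for g, v in votes.items() if v >= min_votes)


# Hypothetical rankings of 10 genes from three filters (best gene first).
rank_ttest = [9, 10, 1, 4, 2, 3, 5, 6, 7, 8]
rank_roc = [10, 9, 5, 1, 3, 2, 4, 6, 7, 8]
rank_entropy = [9, 10, 4, 5, 6, 1, 2, 3, 7, 8]

selected = majority_vote([rank_ttest, rank_roc, rank_entropy], top_n=5)
print(selected)  # [1, 4, 5, 9, 10]
```

Because voting is applied to the top-N lists, the size of the selected subset can differ from N; with three filters and a two-vote threshold it is at most 7 when N is 5.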
k-nearest neighbor: The main objective of the k-nearest neighbor (k-NN) classifier is to find the set of k objects in the training set that are most similar to the objects in the test group
where a is the feature vector of the xth sample.
III. RESULTS AND DISCUSSION
In this section, we evaluate the performance of the multi-filter system (MFS). The proposed system is also compared with the systems presented in previous studies (Liu et al. and Raue et al.), in which genes are identified for three categories (young males versus old males, young females versus old females, and old males versus old females) from a total of 54,623 genes. In order to have a fair comparison, the same number of genes is selected by MFS and compared with the genes identified in those studies. The evaluation metrics used in this study are classification accuracy, sensitivity and specificity.
The main purpose of the confusion matrix is to display the percentage of correctly classified objects, true positives (TP) and true negatives (TN), and of misclassified objects, false positives (FP) and false negatives (FN), as shown in Table XIII.
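For reference, the three evaluation metrics follow from the confusion matrix counts in the standard way: accuracy = (TP + TN) / (TP + TN + FP + FN), sensitivity = TP / (TP + FN) and specificity = TN / (TN + FP). The sketch below uses hypothetical function names and made-up labels; which class is treated as positive is an assumption, as the text does not state it.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Count TP, TN, FP and FN for a binary problem."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    tn = sum(t != positive and p != positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    return tp, tn, fp, fn


def evaluate(y_true, y_pred, positive=1):
    """Return (accuracy, sensitivity, specificity) as fractions in [0, 1]."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred, positive)
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return accuracy, sensitivity, specificity


# Made-up example with 11 samples (4 positive, 7 negative) and one missed positive.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
accuracy, sensitivity, specificity = evaluate(y_true, y_pred)
```

With leave-one-out evaluation on 11 samples, accuracy can only take values that are multiples of 1/11, so a single misclassified sample gives 10/11, or about 90.9%.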
The first dataset is a microarray dataset that is publicly available for use by the scientific community in the Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/) under accession GSE38718.
The subjects were 22 healthy males and females of various ages: 14 young subjects (7 males and 7 females, 20-29 years old) and 8 old subjects (4 males and 4 females, 61-81 years old). Total RNA was extracted and gene expression profiling was performed using the Affymetrix Human Genome U133 Plus 2 chip.
A. Case Study 1: Young Men versus Old Men
This case study consists of 11 male samples (7 young and 4 old). Table II shows the performance of MFS compared with the genes identified by Liu et al. The best performance, 90.9%, is obtained using the 3NN classifier, while the genes obtained by Liu et al. achieve only 81.8%. This improvement is mainly due to high specificity. Further analysis revealed that, out of 75 genes, only 9 genes are common to both systems. Some new genes are identified that can play an important role in the age differences between young and old males. Some of these new genes are shown in Table III, along with the 9 genes that are selected by both systems.
These new genes can be very useful for biologists in identifying the differences between young and old males. Figures II, III, IV, V, VI and VII show the performance of the system as the number of genes is varied, in comparison with the previous study. The best performance is obtained using 10 or 20 genes; beyond that, there is a 10% drop in performance. This may be due to the selection of some genes that degrade the performance of the system. Future work aims to investigate wrapper techniques to identify these genes.
B. Case Study 2: Old Men versus Old Women
Another objective of this investigation was to examine basal-level gene expression in old men and old women. This case study consists of 8 adults (4 old men versus 4 old women). Table IV shows the performance of MFS compared with the genes identified by Liu et al. The genes selected using MFS achieve a classification accuracy of 100% using both 1NN and 3NN, with high sensitivity and specificity.
C. Case Study 3: Young Female versus Old Female
This case study consists of 11 female samples (7 young and 4 old). Table V shows the performance of MFS compared with the genes identified by Liu et al. Again, the best performance, 91%, is obtained using the 1NN classifier, while the genes identified by Liu et al. achieve only 72.2% accuracy, which indicates the improved generalization ability of the proposed system. We argue that the improvement in performance is mainly due to high specificity, as sensitivity is the same in both systems.
The second dataset consists of the expression levels of 54,623 genes and is publicly available in the Gene Expression Omnibus (GEO) database (http://www.ncbi.nlm.nih.gov/). The subjects were 36 healthy males and females of various ages: 19 females (8 young, 11 old) and 17 males (7 young, 10 old). Total RNA was extracted and gene expression profiling was performed using the Affymetrix Human Genome U133 Plus 2 chip.
Case Study 1: Young male versus Old male
This case study involves 17 male samples (7 young and 10 old). Table VII compares the performance of MFS (on the second dataset) with the genes identified by Raue et al. According to the information in Table 1, the best performance, 88%, is obtained using both the 1NN and 3NN classifiers, whereas the genes obtained by Raue et al. achieve only 82%. There are 39 genes common to both systems. Some new genes are identified that can play an important role in the age differences between young and old males. Table VIII lists the names of some new genes identified by the proposed system, along with some common genes selected by both systems.
Case Study 2: Young Female versus Old Female
This case study involves 19 female samples (8 young and 11 old). Table IX compares the performance of MFS (on the second dataset) with the genes identified by Raue et al. The best performance, 100%, is obtained using the 3NN classifier, while the genes obtained by Raue et al. achieve only 81.8%. This improvement is mainly due to high specificity and sensitivity (approximately 100%). Further analysis showed that, out of 102 genes, only 8 genes are common to both systems. Some new genes are identified that can play an important role in the age differences between young and old females. Table X lists the names of some new genes identified by the proposed system, along with some common genes selected by both systems.
Case Study 3: Old Men versus Old Women
This case study involves 21 old samples (10 men and 11 women). Table XI compares the performance of MFS (on the second dataset) with the genes identified by Raue et al. According to Table 1, the best performance, approximately 95%, is obtained using the 3NN classifier, while the genes obtained by Raue et al. achieve only 76%. Only 9 genes are common to both systems. Some new genes are identified that can play an important role in the gender differences between old men and old women. Table XII lists the names of some new genes identified by the proposed system, along with some common genes selected by both systems.
In this study, a multi-filter system (MFS) is proposed to identify important genes for males and females using skeletal muscle. Genes are first sorted using three different ranking techniques (t-test, Wilcoxon and ROC). Important genes are then selected by majority voting, based on the principle that combining multiple models can improve the generalization of the system.
The proposed system is evaluated on publicly available microarray datasets of gene expression of skeletal muscle tissue. The results indicate that the proposed system yields the best classification performance when compared with a similar number of genes identified in previous studies. Future work aims to improve the performance by identifying more important genes through wrapper feature selection techniques rather than filter-based feature ranking techniques.
[1] S. Welle, R. Tawil, and C. A. Thornton, “Sex-related differences in gene expression in human skeletal muscle”, PLoS One, vol. 3, no. 1, p. e1385, 2008.
[2] D. Liu, M. A. Sartor, G. A. Nader, E. E. Pistilli, L. Tanton, C. Lilly, et al., “Microarray analysis reveals novel features of the muscle aging process in men and women”, The Journals of Gerontology Series A: Biological Sciences and Medical Sciences, vol. 68, no. 9, pp. 1035–1044, 2013.
[3] D. Liu, M. A. Sartor, G. A. Nader, L. Gutmann, M. K. Treutelaar, E. E. Pistilli, H. B. IglayReger, C. F. Burant, E. P. Hoffman, and P. M. Gordon, “Skeletal muscle gene expression in response to resistance exercise: sex specific regulation”, BMC Genomics, vol. 11, no. 1, p. 659, 2010.
[4] G. Sifakis, I. Valavanis, O. Papadodima, and A. A. Chatziioannou, “Identifying Gender Independent Biomarkers Responsible for Human Muscle Aging Using Microarray Data”, Bioinformatics and Bioengineering (BIBE), pp. 1-5, 2013.
[5] S. M. Roth, R. E. Ferrell, D. G. Peters, E. J. Metter, B. F. Hurley, and M. A. Rogers, “Influence of age, sex, and strength training on human muscle gene expression determined by microarray”, Physiological Genomics, vol. 10, pp. 181-190, 2002.
[6] Y. Saeys, I. Inza, and P. Larrañaga, “A review of feature selection techniques in bioinformatics”, Bioinformatics, vol. 23, pp. 2507-2517, 2007.
[7] Y. Su, T. M. Murali, V. Pavlovic, M. Schaffer, and S. Kasif, “RankGene: identification of diagnostic genes based on expression data”, Bioinformatics, vol. 19, pp. 1578-1579, 2003.
[8] K. Murphy, “Machine learning: a probabilistic perspective”, Cambridge, MA: MIT Press, 2012.
[9] N. Thouleimat, D. Hernandez-Lobato, and P. Dupont, “Variance Estimators for t-Test Ranking Influence the Stability and Predictive Performance of Microarray Gene Signatures”, European Conference on Computational Biology, 2010.
[10] S. Şahan, K. Polat, H. Kodaz, and S. Güneş, “A new hybrid method based on fuzzy-artificial immune system and k-nn algorithm for breast cancer diagnosis”, Computers in Biology and Medicine, vol. 37, pp. 415-423, 2007.
[11] M. Visser, M. Pahor, F. Tylavsky, S. B. Kritchevsky, J. A. Cauley, A. B. Newman, B. A. Blunt, and T. B. Harris, “One- and two-year change in body composition as measured by DXA in a population-based cohort of older men and women”, Journal of Applied Physiology, vol. 94, pp. 2368-2374, 2003.
[12] V. A. Hughes, W. R. Frontera, R. Roubenoff, W. J. Evans, and M. A. F. Singh, “Longitudinal changes in body composition in older men and women: role of body weight change and physical activity”, The American Journal of Clinical Nutrition, pp. 473-481, 2002.
[13] I. Janssen, S. B. Heymsfield, Z. Wang, and R. Ross, “Skeletal muscle mass and distribution in 468 men and women aged 18-88 yr”, Journal of Applied Physiology, vol. 89, no. 1, pp. 81-88, 2000.
[14] X. Wu, V. Kumar, J. R. Quinlan, J. Ghosh, Q. Yang, H. Motoda, G. J. McLachlan, A. Ng, B. Liu, and P. S. Yu, “Top 10 algorithms in data mining”, Knowledge and Information Systems, vol. 14, pp. 1-37, 2008.
[15] A. C. Haury, P. Gestraud, and J.-P. Vert, “The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures”, PLoS One, vol. 6, p. e28210, 2011.
[16] W. Altidor, T. M. Khoshgoftaar, J. Van Hulse, and A. Napolitano, “Ensemble Feature Ranking Methods for Data Intensive Computing Applications”, Handbook of Data Intensive Computing, pp. 349-376, 2011.
[17] A. Y. Guo, K. S. Leung, P. M. F. Siu, J. H. Qin, S. K. H. Chow, L. Qin, C. Y. Li, and W. H. Cheung, “Muscle mass, structural and functional investigations of senescence-accelerated mouse P8 (SAMP8)”, Experimental Animals, vol. 64, p. 425, 2015.
[18] R. R. Kalyani, M. Corriere, and L. Ferrucci, “Age-related and disease-related muscle loss: the effect of diabetes, obesity, and other diseases”, The Lancet Diabetes & Endocrinology, vol. 2, pp. 819-829, 2014.
[19] U. Raue, T. A. Trappe, S. T. Estrem, H.-R. Qian, L. M. Helvering, R. C. Smith, and S. Trappe, “Transcriptome signature of resistance exercise adaptations: mixed muscle and fiber type specific profiles in young and old adults”, Journal of Applied Physiology, vol. 112, pp. 1625-1636, 2012.
[20] R. Sharma, R. B. Pachori, and U. R. Acharya, “An integrated index for the identification of focal electroencephalogram signals using discrete wavelet transform and entropy measures”, Entropy, vol. 17, pp. 5218-5240, 2015.
[21] U. R. Acharya, E. Ng, L. W. J. Eugene, K. P. Noronha, L. C. Min, K. P. Nayak, and S. V. Bhandary, “Decision support system for the glaucoma using Gabor transformation”, Biomedical Signal Processing and Control, vol. 15, pp. 18-26, 2015.