About

Many protein exist in natures as oligomers with various different quaternary structural attributes rather than a single chain. Predicting their attributes is an essential task in computational biology for the advancement of the proteomics. However, the existing methods did not consider the integration of heterogeneous coding and the accuracy of subunit categories with low data number. To end this, we proposed a predictive tool which can predicting more than 12 subunit protein oligomers, QUATgo. At the same time, three kinds of sequence coding were used, including dipeptide composition which was first time using to predict protein quaternary structural attributes, protein half-life characteristics and we modified the coding method of the Functional Domain Composition which proposed by the predecessors to solve the problem of large feature vectors. QUATgo solves the problem of insufficient data in a single subunit using a two-stage architecture and uses 10 times cross-validation to test the predictive accuracy of the classifier, the first-stage prediction model uses a random forest algorithm to generate sixteen homologous, heterologous oligomers and monomer respectively. The accuracy of the first-stage classifier is 63.4%. However, the number of training data of the hetero-10mer is insufficient so the training data of the hetero-10mer and the hetero-more than 12mer is regarded as the same category X. If the result of the first stage classifier is class X the sequence will sent to second stage classifier which was constructed with support vector machines, and can the prediction result of the hetero-10mer and hetero-more than 12mer with an accuracy of 97.5%, QUATgo will eventually have 61.4% cross-validation accuracy and 63.4% independent test accuracy. In case study, QUATgo can accurately predicts the variable complex structure of the MERS-CoV ectodomains.