DEsi→ Abstract    
Motivation: RNA interference (RNAi) by small interfering RNAs (siRNAs) has become a powerful tool in the fields of molecular biology and medicine. The success of RNAi gene silencing depends on the specificity of siRNAs for particular mRNA sequences. Some siRNA design guidelines have been established from siRNA sequence analysis, but designing siRNA sequences based only on these limited rules might not be effective. Experimentally validated siRNA databases have been developed over the past few years. Because of this wealth of sequence data, modules that employ machine learning methods can be developed to predict siRNA accuracy and optimize design.

Results: In this study, we created an siRNA design tool, "DEsi", that quickly selects siRNAs with high RNAi activity against a desired mRNA. DEsi combines traditional feature filters, machine learning models and BLAST to optimize siRNA design. The prediction models in DEsi had considerable predictive power, which was validated by statistical analysis. Compared with other siRNA design tools, DEsi can quickly and accurately design siRNAs against desired mRNAs.
siPRED→ Abstract    
Small interfering RNA (siRNA) has been used widely to induce gene silencing in cells. To predict the efficacy of an siRNA with respect to inhibition of its target mRNA, we developed a two-layer system, siPRED, which is based on various characteristic methods in the first layer and fusion mechanisms in the second layer. Characteristic methods were constructed by support vector regression from three categories of characteristics, namely sequence, features, and rules. Fusion mechanisms considered combinations of characteristic methods in different categories and were implemented by support vector regression and neural networks to yield integrated methods. In siPRED, the prediction of siRNA efficacy through integrated methods was better than through any single characteristic method. Moreover, the weighting of each characteristic method in the context of integrated methods was established by genetic algorithms so that the effect of each characteristic method could be revealed. Using a validation dataset, siPRED performed better than other predictive systems that used the scoring method, neural networks, or linear regression. Finally, siPRED can be improved to achieve a correlation coefficient of 0.777 when the threshold of the whole stacking energy is ≥ –34.6 kcal/mol.
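As an illustration of the final evaluation step, the sketch below filters siRNAs by a whole stacking energy threshold and then computes a correlation coefficient between predicted and observed efficacy. It assumes the reported coefficient is a Pearson correlation, which the abstract does not state. The record layout, the function names and the toy values are hypothetical and are not siPRED data.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def correlation_above_threshold(records, threshold=-34.6):
    """Correlate predicted and observed efficacy for siRNAs whose whole
    stacking energy (kcal/mol) is >= threshold.

    Each record is a hypothetical tuple:
    (stacking_energy, predicted_efficacy, observed_efficacy).
    Returns (correlation, number_of_records_kept).
    """
    kept = [r for r in records if r[0] >= threshold]
    predicted = [r[1] for r in kept]
    observed = [r[2] for r in kept]
    return pearson(predicted, observed), len(kept)


# Made-up records for illustration only.
toy_records = [
    (-30.0, 0.1, 0.2),
    (-32.0, 0.5, 0.6),
    (-34.0, 0.9, 1.0),
    (-40.0, 0.9, 0.0),  # below the threshold, so it is excluded
]
r, kept = correlation_above_threshold(toy_records)
```

In this toy input, the one record that falls below the threshold is also the one whose prediction disagrees with the observation, so filtering raises the correlation on the remaining records.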
iPhos→ Abstract    
Phosphorylation is involved in important cellular activities such as transcription, translation, the cell cycle, neuron outgrowth, and signal transduction pathways. These cellular activities are associated with many diseases, including cancer, so research on phosphorylation is very important.

Because experiments in the biochemical laboratory are time-consuming and costly, many researchers have tried to use computational tools to predict protein phosphorylation accurately.

iPhos is a web-based integrated protein phosphorylation prediction system. When the user inputs one or more candidate sequences, iPhos sends them to different predictors such as DISPHOS, GPS, Kinasephos, Netphos, and PPSP. With the help of a support vector machine, iPhos can substantially improve prediction performance.
iStable→ Abstract    
Mutation of a single amino acid residue can cause changes in a protein, which could then lead to a loss of protein function. Predicting protein stability changes can provide several possible candidates for novel protein design. Although many prediction tools are available, the conflicting prediction results from different tools can confuse users.

We proposed an integrated predictor, iStable, constructed by using sequence information and prediction results from different element predictors. In the learning model, iStable adopted a support vector machine as an integrator rather than simply choosing the majority answer given by the element predictors. Furthermore, the role that sequence information played in our model was analyzed, and a window size of 11 was determined. In addition, iStable accepts two different input types: structural and sequential. After training and cross-validation, iStable has better performance than all of the element predictors on several datasets. Under different classifications and conditions for validation, this study has also shown better overall performance in different types of secondary structures, relative solvent accessibility circumstances, and protein memberships in different superfamilies.

The trained and validated version of iStable provides an accurate approach for prediction of protein stability change.
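To make the window-size idea concrete, the sketch below extracts an 11-residue sequence window around a mutation site. It assumes the window is centered on the mutated residue and that positions running past either end of the sequence are padded with a placeholder character; the abstract specifies neither. The function name and the pad symbol are hypothetical.

```python
def mutation_window(sequence, position, size=11, pad="X"):
    """Return a window of `size` residues centered on a 1-based position.

    Positions that fall outside the sequence are filled with `pad`
    (an assumed convention, not taken from the iStable description).
    """
    if size % 2 == 0:
        raise ValueError("window size must be odd to have a center residue")
    if not 1 <= position <= len(sequence):
        raise ValueError("position is outside the sequence")
    half = size // 2
    center = position - 1  # convert to a 0-based index
    residues = []
    for i in range(center - half, center + half + 1):
        residues.append(sequence[i] if 0 <= i < len(sequence) else pad)
    return "".join(residues)


# A toy 20-residue sequence, with the mutation at residue 10 (L).
window = mutation_window("ACDEFGHIKLMNPQRSTVWY", 10)
```

A window like this could then be encoded as features and passed to the integrator together with the outputs of the element predictors.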
iMADS→ Abstract    
MADS-box genes are important transcription factors during floral organ development. The classical classification model of MADS-box genes is based on Arabidopsis thaliana, in which sepals are controlled by classes A and E; petals by classes A, B and E; stamens by classes B, C and E; and carpels by classes C and E. It has been shown that specific classes of MADS-box genes have specific functions. The traditional classification of MADS-box genes relies on phylogenetic analysis or multiple sequence alignment, which requires time-consuming collection of reference sequences. Data collection is therefore a key factor affecting the evaluation of target genes. This study proposed a new method for predicting the class of MADS-box genes based on similarity measures evaluated by the five general BLAST programs. The classification model was constructed using support vector machines trained on 209 MADS-box genes from different plant species and validated with 10 MADS-box genes of Oncidium Gower Ramsey. Furthermore, we constructed a web-based tool, iMADS, which integrates several web tools in order to save time and provide related information about the putative class of a MADS-box gene, expressed tissues in plants, conserved domain search, coiled-coil prediction and evolutionary analysis. The latter three are obtained from the web tools NCBI Conserved Domain Search, COILS and Phylodendron, respectively. iMADS is an information-integrated analytic tool for MADS-box genes. It may reduce the time and money spent by researchers, produce predictions quickly, and present reliable and systematic results to users.
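The organ-to-class relationships of the classical model described above can be written as a small lookup table. This is only an illustrative data structure; the names are hypothetical and it is not part of the iMADS classifier itself.

```python
# Classical floral organ identity model as stated in the abstract:
# each organ is specified by a combination of MADS-box gene classes.
ORGAN_CLASSES = {
    "sepal": frozenset("AE"),
    "petal": frozenset("ABE"),
    "stamen": frozenset("BCE"),
    "carpel": frozenset("CE"),
}


def organ_for_classes(active_classes):
    """Return the organ whose class combination matches exactly, else None."""
    active = frozenset(active_classes)
    for organ, classes in ORGAN_CLASSES.items():
        if classes == active:
            return organ
    return None


petal = organ_for_classes("ABE")  # classes A, B and E together
```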
OncidiumOrchid GenomeBase→ Abstract    
Oncidium 'Gower Ramsey' is a valuable and successful commercial orchid for the floriculture industry in Taiwan. However, no genome reference sequence currently exists for Oncidium orchids, to facilitate the development of molecular biological studies and the breeding of these orchids. In this study, we generated Oncidium cDNA libraries for six different organs, including leaves, pseudobulbs, young inflorescences, inflorescences, flower buds and mature flowers. We utilized 454-pyrosequencing technology to perform high-throughput deep sequencing of the Oncidium transcriptome, yielding more than 0.9 million reads with an average length of 328 bp, for a total of 301 million bases. De novo assembly of the sequences yielded 50,908 contig sequences with an average length of 493 bp from 796,463 reads and 120,219 singletons. The assembled sequences were annotated using BLAST, and a total of 12,757 and 13,931 UniGene transcripts from the Arabidopsis and rice genomes were matched by TBLASTX, respectively. A Gene Ontology (GO) analysis of the annotated Oncidium contigs revealed that the majority of sequenced genes were associated with cellular processes, unknown molecular functions and intracellular components. Furthermore, a complete flowering-associated expressed sequence that included most of the genes in photoperiod pathway and the 15 CONSTANS-LIKE (COL) homologs with the conserved CCT domain was obtained in this collection. These data revealed that the Oncidium EST database generated in this study has sufficient coverage to be used as a tool to investigate the flowering pathway and various other biological pathways in orchids.
HCCPhos→ Abstract    
Protein phosphorylation is a post-translational modification that affects signal transduction and the immune system in organisms. It is also a common target in diseases that involve cellular signal transduction. Phosphorylation research therefore spans disease, complementary medicine, drug research and biotechnology. According to the Department of Health, malignant tumors (cancer) are the leading cause of death in Taiwan, and among them hepatocellular carcinoma and lung carcinoma are the most important causes of death. Hence, integrating the information on and analysis of phosphorylation and hepatocellular carcinoma (HCC) through phosphorylation motifs would provide high-value information for researchers in this field. Most prediction websites provide information limited to protein phosphorylation and do not offer integrated information on phosphorylation and HCC gene sequences. For this purpose, this research analyzes and integrates the biochemical data and displays the results through data visualization. Users can input a protein sequence and obtain a visualization of protein phosphorylation and HCC through the graphical user interface (GUI) of the system built in this research.
siBase→ Abstract    
siRecords collects a large amount of experimental data. Although siRecords contains more than 10,000 cases, the database is filled with repeated and conflicting data. To solve these problems, we established an siRNA database management system.

The system not only provides functions for checking siRNA efficiency but also makes use of the siPRED prediction system, which recommends more efficient siRNA candidates.
OTPS→ Abstract    
Taiwan is known for its high quality oolong tea. Because of high consumer demand, some tea manufacturers mix lower quality leaves with genuine Taiwan oolong tea in order to increase profits. Robust scientific methods are, therefore, needed to verify the origin and quality of tea leaves. In this study, we investigated whether two-dimensional gel electrophoresis (2-DE) and nanoscale liquid chromatography/tandem mass spectrometry (nano-LC/MS/MS) coupled with a two-layer feature selection mechanism comprising information gain attribute evaluation (IGAE) and support vector machine feature selection (SVM-FS) are useful in identifying characteristic proteins that can be used as markers of the original source of oolong tea. Samples in this study included oolong tea leaves from 23 different sources. We found that our method had an accuracy of 95.5% in correctly identifying the origin of the leaves. Overall, our method is a novel approach for determining the origin of oolong tea leaves.
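The first selection layer, information gain attribute evaluation, ranks each feature by how much knowing its value reduces the entropy of the class label. The sketch below is a minimal version for discrete features. It assumes protein spots are encoded as present or absent, which the abstract does not state, and all names and data are hypothetical. The second layer (SVM-FS) is not shown.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(table, labels):
    """Return feature names sorted from highest to lowest information gain."""
    gains = {name: information_gain(values, labels) for name, values in table.items()}
    return sorted(gains, key=gains.get, reverse=True)


# Made-up example: two tea origins and two hypothetical protein spots.
origins = ["A", "A", "B", "B"]
spots = {
    "spot_1": ["present", "absent", "present", "absent"],  # uninformative
    "spot_2": ["present", "present", "absent", "absent"],  # separates origins
}
ranking = rank_features(spots, origins)
```

Features with the highest gain would be passed on to the second layer, while features with gain near zero carry little information about origin.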
QuaBingo→ Abstract    
Quaternary structures of proteins are closely related to gene regulation, signal transduction, and many other biological functions of proteins. In the current study, a new feature extraction method based on protein-conserved motif composition in block format is proposed, which is termed block composition.

QuaBingo, a system for predicting protein quaternary assembly states that combines block composition with functional domain composition, is built from three layers of classifiers that categorize the quaternary structural attributes monomer, homooligomer, and heterooligomer. The first-layer classifier uses support vector machines (SVM) based on the blocks and functional domains of proteins, and the second-layer SVM processes the outputs of the first layer. Finally, the result is determined by the Random Forest of the third layer. We compared the effectiveness of combining block composition, functional domain composition, and pseudoamino acid composition in the model. Across 11 functional protein families, QuaBingo achieves a Matthews correlation coefficient (MCC) 23% higher than that of the existing prediction system. The results also revealed the biological characterization of the top five block compositions.

QuaBingo provides better predictive ability for the quaternary structural attributes of proteins.
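For reference, the Matthews correlation coefficient used as the performance measure can be computed from confusion-matrix counts as sketched below. The one-vs-rest reduction for the three-class case is an assumption made for illustration, since the abstract does not say how MCC was aggregated across classes. The function names and example labels are hypothetical.

```python
from math import sqrt


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from binary confusion counts.

    Returns 0.0 when the denominator is zero, a common convention.
    """
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


def one_vs_rest_counts(true_labels, predicted_labels, positive):
    """Count TP, TN, FP, FN treating `positive` as the positive class."""
    tp = tn = fp = fn = 0
    for truth, guess in zip(true_labels, predicted_labels):
        if guess == positive and truth == positive:
            tp += 1
        elif guess == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, tn, fp, fn


# Made-up labels for illustration only.
truth = ["monomer", "homooligomer", "heterooligomer", "monomer"]
guess = ["monomer", "monomer", "heterooligomer", "monomer"]
monomer_mcc = mcc(*one_vs_rest_counts(truth, guess, "monomer"))
```

MCC ranges from -1 (total disagreement) through 0 (no better than chance) to 1 (perfect prediction).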
PClass→ Abstract    
A protein quaternary structure complex, also known as a multimer, plays an important role in the cell. The dimer structure of transcription factors is involved in gene regulation, whereas the trimer structure of virus-infection-associated glycoproteins is related to the human immunodeficiency virus. Classifying protein quaternary structure complexes will be of great help to proteomics research in the post-genome era. However, classification systems for protein quaternary structures have not been widely developed. Therefore, we designed a two-layer machine learning architecture in this study and developed the classification system PClass. Protein quaternary structure complexes are divided into five categories, namely, monomer, dimer, trimer, tetramer, and other subunit classes. Within the framework of the bootstrap method with a support vector machine, we propose a new model selection method. Each type of complex is classified based on sequences, entropy, and accessible surface area, thereby generating a plurality of feature modules. Subsequently, the most effective model is selected as the feature module for each kind of complex. At this stage, the optimal performance reaches a Matthews correlation coefficient (MCC) as high as 70%. The second layer combines the first-layer modules through integration mechanisms and uses six machine learning methods to improve the prediction performance, raising the MCC by more than 10%. Finally, we analyzed the performance of our classification system using transcription factors in dimer structure and virus-infection-associated glycoprotein in trimer structure.
REALoc→ Abstract    
Drug development and investigation of protein function both require an understanding of protein subcellular localization. We developed a system, REALoc, that can predict the subcellular localization of singleplex and multiplex proteins in humans. This system, based on a comprehensive strategy, consists of two heterogeneous systematic frameworks that integrate one-to-one and many-to-many machine learning methods and use sequence-based features, including amino acid composition, surface accessibility, weighted sign aa index, and sequence similarity profile, as well as gene ontology function-based features. REALoc can be used to predict localization to six subcellular compartments (cell membrane, cytoplasm, endoplasmic reticulum/Golgi, mitochondrion, nucleus, and extracellular). REALoc yielded a 75.3% absolute true success rate during five-fold cross-validation and a 57.1% absolute true success rate in an independent database test, which was >10% higher than six other prediction systems. Lastly, we analyzed the effects of Vote and GANN models on singleplex and multiplex localization prediction efficacy.
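The sketch below shows one way to compute an absolute true success rate for multiplex proteins. It assumes the usual multi-label meaning of the term, in which a prediction counts as correct only when the predicted set of compartments matches the true set exactly. The abstract does not spell out this definition, and the function name and example data are hypothetical.

```python
def absolute_true_rate(true_sets, predicted_sets):
    """Fraction of proteins whose predicted location set equals the true set.

    A partially correct prediction (a missing or an extra compartment)
    counts as a failure under this strict measure.
    """
    if not true_sets or len(true_sets) != len(predicted_sets):
        raise ValueError("need two non-empty lists of equal length")
    exact = sum(1 for t, p in zip(true_sets, predicted_sets) if set(t) == set(p))
    return exact / len(true_sets)


# Made-up example with one multiplex protein (cytoplasm and nucleus).
truth = [{"nucleus"}, {"cytoplasm", "nucleus"}, {"mitochondrion"}, {"cell membrane"}]
guess = [{"nucleus"}, {"nucleus"}, {"mitochondrion"}, {"cell membrane", "extracellular"}]
rate = absolute_true_rate(truth, guess)  # 2 of 4 proteins match exactly
```

Because partial matches earn no credit, this measure is stricter than per-label accuracy, which helps explain why the reported values are lower than typical single-label accuracies.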