GPS 2.0, a Tool to Predict Kinase-specific Phosphorylation Sites in Hierarchy
Molecular & Cellular Proteomics. 2008;7(9):1598-1608.
Identification of protein phosphorylation sites with their cognate protein kinases (PKs) is a key step to
delineate molecular dynamics and plasticity underlying a variety of cellular processes.
In this work, we adopted a well established rule to classify PKs into a hierarchical structure with four
levels, including group, family, subfamily, and single PK.
In addition, we developed a simple approach to estimate the theoretically maximal false positive rates.
The on-line service and local packages of the GPS (Group-based Prediction System) 2.0 were implemented in
Java with the modified version of the Group-based Phosphorylation Scoring algorithm. As the first stand
alone software for predicting phosphorylation, GPS 2.0 can predict kinase-specific phosphorylation sites for
408 human PKs in hierarchy. A large scale prediction of more than 13,000 mammalian phosphorylation sites by
GPS 2.0 was exhibited with great performance and remarkable accuracy. Thus, the GPS 2.0 is a useful tool for
predicting protein phosphorylation sites and their cognate kinases and is freely available on line.
GPS 2.0 has since been updated to GPS 3.0, which is freely available at http://gps.biocuckoo.org
CSS-Palm 2.0: an updated software for palmitoylation sites prediction
Protein Engineering, Design and Selection. 2008;21(11):639-644.
Protein palmitoylation is an essential post-translational lipid modification of proteins, and reversibly
orchestrates a variety of cellular processes.
In this work, we updated our previous CSS-Palm into version 2.0. An updated clustering and scoring strategy
(CSS) algorithm was employed with great improvement.
The leave-one-out validation and 4-, 6-, 8- and 10-fold cross-validations were adopted to evaluate the
prediction performance of CSS-Palm 2.0.
Also, an additional new data set not included in training was used to test the robustness of CSS-Palm 2.0.
As an application, we performed a small-scale annotation of palmitoylated proteins in budding yeast.
The online service and local packages of CSS-Palm 2.0 are freely available at: http://csspalm.biocuckoo.org
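The validation scheme named in this abstract (leave-one-out plus 4-, 6-, 8- and 10-fold cross-validation) can be sketched as an index-splitting routine. This is a minimal illustration in Python, not the CSS-Palm implementation; the function names, the shuffling seed and the striding scheme are assumptions made for the example.

```python
import random


def k_fold_splits(n_items, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation.

    Hypothetical helper: the items are shuffled once, then dealt into
    k folds whose sizes differ by at most one.
    """
    if not 2 <= k <= n_items:
        raise ValueError("k must be between 2 and n_items")
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, starting at position i.
    folds = [indices[i::k] for i in range(k)]
    for i, test_idx in enumerate(folds):
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def leave_one_out_splits(n_items):
    """Leave-one-out is the special case k == n_items."""
    return k_fold_splits(n_items, n_items)


if __name__ == "__main__":
    for k in (4, 6, 8, 10):
        sizes = [len(test) for _, test in k_fold_splits(100, k)]
        print(k, sizes)
```

Each item is used for testing exactly once per scheme, so the performance measured at different values of k can be compared to judge how stable a predictor is.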
DOG 1.0: illustrator of protein domain structures
Cell Research. 2009;19(2):271-273.
Development of computer software that can illustrate user-designated protein domain structures will be a
great help for biological experimentalists to communicate their research results.
In this work, we present a novel software of DOG (Domain Graph, version 1.0) for experimentalists, to
prepare publication-quality figures of protein domain structures.
The scale of a protein domain and the position of a functional motif/site will be precisely defined.
The DOG 1.0 software was written in JAVA 1.5 (J2SE 5.0) and packed with Install4j 4.0.8.
Then we developed several packages to support three major Operating Systems (OS), including Windows,
Unix/Linux and Mac.
For Windows and Linux systems, a Java Runtime Environment 6 (JRE) package of Sun Microsystems was also
The DOG 1.0 software is freely available from: http://dog.biocuckoo.org
Systematic study of protein sumoylation: Development of a site-specific predictor of SUMOsp 2.0
Protein sumoylation is an important reversible post-translational modification on proteins, and orchestrates
a variety of cellular processes.
In this work, we developed SUMOsp 2.0, an accurate computing program with an improved group-based
phosphorylation scoring algorithm.
Our analysis demonstrated that SUMOsp 2.0 has greater prediction accuracy than SUMOsp 1.0 and other existing
tools, with a sensitivity of 88.17% and a specificity of 92.69% under the medium threshold.
Previously, several large-scale experiments have identified a list of potential sumoylated substrates in
Saccharomyces cerevisiae and Homo sapiens;
however, the exact sumoylation sites in most of these proteins remain elusive. We have predicted potential
sumoylation sites in these proteins using SUMOsp 2.0,
which provides a great resource for researchers and an outline for further mechanistic studies of
sumoylation in cellular plasticity and dynamics.
The online service and local packages of SUMOsp 2.0 are freely available at: http://sumosp.biocuckoo.org
MiCroKit 3.0: an integrated database of midbody, centrosome and kinetochore
Nucleic Acids Research. 2010;38:D155-D160.
During cell division/mitosis, a specific subset of proteins is spatially and temporally assembled into
protein super complexes
in three distinct regions, i.e. centrosome/spindle pole, kinetochore/centromere and midbody/cleavage
neck, and modulates cell division process faithfully. Here, we present the MiCroKit database (http://microkit.biocuckoo.org) of proteins that localize in
midbody, centrosome and/or kinetochore. We collected into the MiCroKit database experimentally
verified microkit proteins from the scientific literature that have unambiguous supportive evidence for
subcellular localization under fluorescent microscope. The current version of MiCroKit 3.0 provides detailed information for 1489
microkit proteins from seven model organisms, including Saccharomyces cerevisiae, Schizosaccharomyces pombe, Drosophila melanogaster,
Xenopus laevis and Homo sapiens. Moreover, the orthologous information was provided for these
microkit proteins, and could be a useful resource for further
PhosSNP for Systematic Analysis of Genetic Polymorphisms That Influence Protein Phosphorylation
Molecular & Cellular Proteomics. 2010;9(4):623-634.
We are entering the era of personalized genomics as breakthroughs in sequencing technology have made it
possible to sequence or genotype an individual person in an efficient and accurate manner.
Preliminary results from HapMap and other similar projects have revealed the existence of tremendous genetic
variations among world populations and among individuals.
It is also generally believed that the genetic variation is the main cause for different susceptibility to
certain diseases or different response to therapeutic treatments.
In this work, using an in-house developed kinase-specific phosphorylation site predictor (GPS 2.0), we
computationally detected that ∼70% of the reported nsSNPs are potential phosSNPs.
Finally, all phosSNPs were integrated into the PhosSNP 1.0 database, which was implemented in JAVA 1.5 (J2SE
The PhosSNP 1.0 database is freely available for academic researchers at: http://phossnp.biocuckoo.org
GPS-SNO: Computational Prediction of Protein S-Nitrosylation Sites with a Modified GPS Algorithm
PLoS One. 2010;5(6):e11290.
As one of the most important and ubiquitous post-translational modifications (PTMs) of proteins, S-nitrosylation
plays important roles in a variety of biological processes, including the regulation of cellular dynamics
and plasticity. Identification of S-nitrosylated substrates with their exact sites is crucial for
understanding the molecular mechanisms of S-nitrosylation. In this work, we developed a novel
software of GPS-SNO 1.0 for the prediction of S-nitrosylation sites. By comparison, the prediction
performance of GPS 3.0 algorithm was better than other methods, with an accuracy of 75.80%, a sensitivity of
53.57% and a specificity of 80.14%. As an application of GPS-SNO 1.0, we predicted putative S-nitrosylation
sites for hundreds of potentially S-nitrosylated substrates for which the exact S-nitrosylation
sites had not been experimentally determined. The online service and local packages of GPS-SNO were
implemented in JAVA and are freely available at: http://sno.biocuckoo.org
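Several abstracts in this list report accuracy, sensitivity and specificity. The sketch below shows the standard definitions of these three measures computed from confusion-matrix counts. It is an illustration only: the function name is an assumption, and the counts in the usage example are invented and are not taken from any of the papers.

```python
def prediction_metrics(tp, fp, tn, fn):
    """Return sensitivity, specificity and accuracy from raw counts.

    tp / fn: known positive sites predicted as positive / negative.
    tn / fp: negative sites predicted as negative / positive.
    """
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative site")
    sensitivity = tp / (tp + fn)          # fraction of true sites recovered
    specificity = tn / (tn + fp)          # fraction of non-sites rejected
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    return {
        "sensitivity": sensitivity,
        "specificity": specificity,
        "accuracy": accuracy,
    }


if __name__ == "__main__":
    # Hypothetical counts, chosen only to exercise the function.
    for name, value in prediction_metrics(tp=8, fp=5, tn=85, fn=2).items():
        print(f"{name}: {value:.2%}")
```

When negative sites greatly outnumber positive ones, as is typical for modification-site data sets, accuracy is dominated by specificity, which is why the two values tend to sit close together in such reports.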
A Summary of Computational Resources for Protein Phosphorylation
Current Protein & Peptide Science. 2010;11(6):485-496.
Protein phosphorylation is the most ubiquitous post-translational modification (PTM), and plays important
roles in most of biological processes. Identification of site-specific phosphorylated substrates is
fundamental for understanding the molecular mechanisms of phosphorylation. Besides experimental approaches,
prediction of potential candidates with computational methods has also attracted great attention for its
convenience, fast-speed and low-cost. In this review, we present a comprehensive but brief summarization of
computational resources of protein phosphorylation, including phosphorylation databases, prediction of
non-specific or organism-specific phosphorylation sites, prediction of kinase-specific phosphorylation sites
or phospho-binding motifs, and other tools. The latest compendium of computational resources for protein
phosphorylation is available at: http://gps.biocuckoo.org/links.php
CPLA 1.0: an integrated database of protein lysine acetylation
Nucleic Acids Research. 2011;39:D1029-1034.
As a reversible post-translational modification (PTM) discovered decades ago, protein lysine acetylation was
known for its regulation of transcription through the modification of histones. Recent studies discovered
that lysine acetylation targets broad substrates and especially plays an essential role in cellular
metabolic regulation. In this work, we presented the compendium of protein lysine acetylation (CPLA) database
for lysine acetylated substrates with their sites. The online services of CPLA database
developed in JAVA 1.5 (J2SE 5.0). The CPLA database
is updated as CPLM and is freely available for all users at: http://cplm.biocuckoo.org
GPS 2.1: enhanced prediction of kinase-specific phosphorylation sites with an algorithm of motif length
Protein Engineering, Design and Selection. 2011;24(3):255-260.
As the most important post-translational modification of proteins, phosphorylation plays essential roles in
all aspects of biological processes. Besides experimental approaches, computational prediction of
phosphorylated proteins with their kinase-specific phosphorylation sites has also emerged as a popular
strategy, for its low cost, high speed and convenience. In this work, we developed a kinase-specific
phosphorylation sites predictor of GPS 2.1 (Group-based Prediction System), with a novel but simple approach
of motif length selection (MLS). By this approach, the robustness of the prediction system was greatly
improved. All algorithms from older GPS versions were also retained and integrated in GPS 2.1. The online
service and local packages of GPS 2.1 were implemented in JAVA 1.5 (J2SE 5.0) and are freely available for
academic research at: http://gps.biocuckoo.org
GPS-YNO2: computational prediction of tyrosine nitration sites in proteins
Mol. BioSyst. 2011;7(4):1197-1204.
The last decade has witnessed rapid progress in the identification of protein tyrosine nitration
(PTN), which is an essential and ubiquitous post-translational modification (PTM) that plays a variety of
important roles in both physiological and pathological processes, such as the immune response, cell death,
aging and neurodegeneration.
Identification of site-specific nitrated substrates is fundamental for understanding the molecular
mechanisms and biological functions of PTN.
In contrast with labor-intensive and time-consuming experimental approaches, here we report the development
of the novel software package GPS-YNO2 to predict PTN sites.
The software demonstrated a promising accuracy of 76.51%, a sensitivity of 50.09% and a specificity of
80.18% from the leave-one-out validation.
Through a statistical functional comparison with the nitric oxide (NO) dependent reversible modification of
S-nitrosylation, we observed that PTN prefers to attack certain fundamental biological processes and
Finally, the online service and local packages of GPS-YNO2 1.0 were implemented in JAVA and freely available
GPS-CCD: A Novel Computational Program for the Prediction of Calpain Cleavage Sites
PLoS One. 2011;6(4):e19001.
As one of the most essential post-translational modifications (PTMs) of proteins, proteolysis, especially
calpain-mediated cleavage, plays an important role in many biological processes, including cell
death/apoptosis, cytoskeletal remodeling, and the cell cycle. Experimental identification of calpain targets
with bona fide cleavage sites is fundamental for dissecting the molecular mechanisms and biological
roles of calpain cleavage. In contrast to time-consuming and labor-intensive experimental approaches,
computational prediction of calpain cleavage sites might more cheaply and readily provide useful information
for further experimental investigation. In this work, we constructed a novel software package of GPS-CCD
(Calpain Cleavage Detector) for the prediction of calpain cleavage sites, with an accuracy of 89.98%,
sensitivity of 60.87% and specificity of 90.07%. With this software, we annotated potential calpain cleavage
sites for hundreds of calpain substrates, for which the exact cleavage sites had not been previously
determined. The online service and local packages of GPS-CCD 1.0 were implemented in JAVA and are freely
available at: http://ccd.biocuckoo.org/
GPS-PUP: computational prediction of pupylation sites in prokaryotic proteins
Mol. BioSyst. 2011;7(10):2737-2740.
Recent experiments revealed the prokaryotic ubiquitin-like protein (PUP) to be a signal for the selective
degradation of proteins in Mycobacterium tuberculosis (Mtb).
By covalently conjugating the PUP, pupylation functions as a critical post-translational modification (PTM)
conserved in actinomycetes.
Here, we designed a novel computational tool of GPS-PUP for the prediction of pupylation sites, which was
shown to have a promising performance.
From small-scale and large-scale studies we collected 238 potentially pupylated substrates for which the
exact pupylation sites were still not determined.
As an example application, we predicted ∼85% of these proteins with at least one potential pupylation site.
Furthermore, through functional analysis,
we observed that pupylation can target various substrates so as to regulate a broad array of biological
processes, such as the response to stress, sulfate and proton transport, and metabolism.
The GPS-PUP 1.0 is freely available at: http://pup.biocuckoo.org
Computational Analysis of Phosphoproteomics: Progresses and Perspectives
Current Protein & Peptide Science. 2011;12(7):591-601.
Phosphorylation is one of the most essential post-translational modifications (PTMs) of proteins, regulates
a variety of cellular signaling pathways, and at least partially determines the biological diversity. Recent
progress in phosphoproteomics has identified more than 100,000 phosphorylation sites, while this number
will easily exceed one million in the next decade. In this regard, how to extract useful information from
the flood of phosphoproteomics data has emerged as a great challenge. In this review, we summarized the leading
edges on computational analysis of phosphoproteomics, including discovery of phosphorylation motifs from
phosphoproteomics data, systematic modeling of phosphorylation network, analysis of genetic variation that
influences phosphorylation, and phosphorylation evolution. Based on existing knowledge, we also raised
several perspectives for further studies. We believe that integration of experimental and computational
analyses will propel the phosphoproteomics research into a new phase.
Systematic Analysis of Protein Phosphorylation Networks From Phosphoproteomic Data
Molecular & Cellular Proteomics. 2012;11(10):1070-1083.
In eukaryotes, hundreds of protein kinases (PKs) specifically and precisely modify thousands of substrates
at specific amino acid residues to faithfully orchestrate numerous biological processes, and reversibly
determine the cellular plasticity. Although over 100,000 phosphorylation sites (p-sites) have been
experimentally identified from studies, the regulatory PKs for most of these sites still remain to be
characterized. Here, we present a novel software package of iGPS for the prediction of in vivo site-specific
kinase-substrate relations mainly from the phosphoproteomic data. By critical evaluations and comparisons,
the performance of iGPS is satisfying and better than other existing tools. Based on the prediction results,
we modeled protein phosphorylation networks and observed that the eukaryotic phospho-regulation is poorly
conserved at the site levels. This work contributes to the understanding of phosphorylation mechanisms at
the systemic level, and provides a powerful methodology for the general analysis of in vivo
post-translational modifications regulating sub-proteomes.
Systematic analysis of the Plk-mediated phosphoregulation in eukaryotes
Briefings in Bioinformatics. 2013;14(3):344-360.
Substantial evidence has confirmed that Polo-like kinases (Plks) play a crucial role in a variety of
cellular processes via phosphorylation-mediated signaling transduction.
Identification of Plk phospho-binding proteins and phosphorylation substrates is fundamental for elucidating
the molecular mechanisms of Plks.
Here, we present an integrative approach for the analysis of Plk-specific phospho-binding and
phosphorylation sites (p-sites) in proteins.
From the currently available phosphoproteomic data, we predicted tens of thousands of potential Plk
phospho-binding and phosphorylation sites in eukaryotes, respectively.
Furthermore, statistical analysis suggested that Plk phospho-binding proteins are more closely implicated in
mitosis than their phosphorylation substrates.
Additional computational analysis together with in vitro and in vivo experimental assays demonstrated that
human Mis18B is a novel interacting partner of Plk1, while pT14 and pS48 of Mis18B were identified as
Taken together, this systematic analysis provides a global landscape of the complexity and diversity of
potential Plk-mediated phosphoregulation, and the prediction results can be helpful for further experimental
GPS-SUMO: a tool for the prediction of sumoylation sites and SUMO-interaction motifs
Nucleic Acids Research. 2014;42: W325-30.
Small ubiquitin-like modifiers (SUMOs) regulate a variety of cellular processes through two distinct
mechanisms, including covalent sumoylation and noncovalent SUMO interaction. The complexity of SUMO
regulations has greatly hampered the large-scale identification of SUMO substrates or interaction partners
on a proteome-wide level. In this work, we developed a new tool called GPS-SUMO for the prediction of both
sumoylation sites and SUMO-interaction motifs (SIMs) in proteins. To obtain an accurate performance, a new
generation group-based prediction system (GPS) algorithm integrated with Particle Swarm Optimization
approach was applied. By critical evaluation and comparison, GPS-SUMO was demonstrated to be substantially
superior against other existing tools and methods. With the help of GPS-SUMO, it is now possible to further
investigate the relationship between sumoylation and SUMO interaction processes. A web service of GPS-SUMO
An integrated overview of spatiotemporal organization and regulation in mitosis in terms of the proteins in
the functional supercomplexes
Frontiers in Microbiology. 2014;5:573.
Eukaryotic cells may divide via the critical cellular process of cell division/mitosis, resulting in two
daughter cells with the same genetic information. A large number of dedicated proteins are involved in this
process and spatiotemporally assembled into three distinct super-complex structures/organelles, including
the centrosome/spindle pole body, kinetochore/centromere and cleavage furrow/midbody/bud neck, so as to
precisely modulate the cell division/mitosis events of chromosome alignment, chromosome segregation and
cytokinesis in an orderly fashion. In recent years, many efforts have been made to identify the protein
components and architecture of these subcellular organelles, aiming to uncover the organelle assembly
pathways, determine the molecular mechanisms underlying the organelle functions, and thereby provide new
therapeutic strategies for a variety of diseases. However, the organelles are highly dynamic structures,
making it difficult to identify the entire components. Here, we review the current knowledge of the
identified protein components governing the organization and functioning of organelles, especially in human
and yeast cells, and discuss the multi-localized protein components mediating the communication between
organelles during cell division.
IBS: an illustrator for the presentation and visualization of biological sequences
ESI Hot Paper
Biological sequence diagrams are fundamental for visualizing various functional elements in protein or
nucleotide sequences that enable a summarization and presentation of existing information as well as means
of intuitive new discoveries. Here, we present a software package called illustrator of biological sequences
(IBS) that can be used for representing the organization of either protein or nucleotide sequences in a
convenient, efficient and precise manner. Multiple options are provided in IBS, and biological sequences can
be manipulated, recolored or rescaled in a user-defined mode. Also, the final representational artwork can
be directly exported into a publication-quality figure.
The standalone package of IBS was implemented in JAVA, while the online service was implemented in HTML5 and
RPFdb: a database for genome wide information of translated mRNA generated from ribosome profiling.
Nucleic Acids Research. 2016;44:D254-D258.
Translational control is crucial in the regulation of gene expression and deregulation of translation is
associated with a wide range of cancers and human diseases. Ribosome profiling is a technique that provides
genome wide information of mRNA in translation based on deep sequencing of ribosome protected mRNA fragments
(RPF). RPFdb is a comprehensive resource for hosting, analyzing and visualizing RPF data, available at http://www.rpfdb.org. The current version
of the database contains 777 samples from 82 studies in 8 species, processed and reanalyzed by a unified
pipeline. Overall our database provides a simple way to search, analyze, compare, visualize and download RPF
GPS-Lipid: a robust tool for the prediction of multiple lipid modification sites
Scientific Reports. 2016;6:28249.
As one of the most common post-translational modifications in eukaryotic cells, lipid modification is an
important mechanism for the regulation of various aspects of protein function. In this work, we developed a
tool called GPS-Lipid for the prediction of four classes of lipid modifications by integrating the Particle
Swarm Optimization with an aging leader and challengers (ALC-PSO) algorithm. GPS-Lipid was proven to be
evidently superior to other similar tools. To facilitate the research of lipid modification, we hosted a
publicly available web server at http://lipid.biocuckoo.org
with not only the implementation of GPS-Lipid, but also an integrative database and visualization tool. We
performed a systematic analysis of the co-regulatory mechanism between different lipid modifications with
GPS-Lipid. The results demonstrated that the proximal dual-lipid modifications among palmitoylation,
myristoylation and prenylation are key mechanisms for regulating various protein functions. In conclusion,
GPS-Lipid is expected to serve as a useful resource for the research on lipid modifications, especially on
VirusMap: A visualization database for the influenza A virus
Journal of Genetics and Genomics. 2017;44(4):281-284.
In this study, we reported a visualization platform called VirusMap, which is available at the website (http://virusmap.renlab.org), for
investigating the epidemiological and geographical distribution of influenza A viruses. We downloaded
615,866 protein and 482,663 nucleotide sequences of influenza A viruses in FASTA format from IVR (Bao et al.,
2008) and IRD (Squires et al., 2012). Under the data submission policy of those databases, the
information of subtype, host, sampling location, sampling time and serotype should be included for each
virus strain. Thus, the title line of each FASTA sequence contains all of the necessary information. We
extracted this information through a semi-automated series of steps. To ensure the data quality, only
entries with the full information of host, serotype and sampling information were preserved. In total, there
were 583,052 protein and 448,495 nucleotide records retained in a MySQL database. As the data were obtained
from the two most popular influenza virus resources, VirusMap contains a comprehensive and frequently
updated dataset on the influenza A virus.
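The extraction and filtering step described in this abstract (read the metadata from each FASTA title line, keep only entries with complete host, serotype and sampling information) can be sketched as follows. The abstract does not give the title-line layout, so the pipe-delimited field order below is purely hypothetical, as are the function and field names.

```python
# Hypothetical title-line layout (an assumption for illustration only):
#   >accession|strain|serotype|host|location|date
FIELDS = ("accession", "strain", "serotype", "host", "location", "date")
REQUIRED = ("serotype", "host", "location", "date")


def parse_title_line(line):
    """Parse one FASTA title line; return a dict, or None if incomplete."""
    if not line.startswith(">"):
        raise ValueError("a FASTA title line must start with '>'")
    values = [v.strip() for v in line[1:].strip().split("|")]
    if len(values) != len(FIELDS):
        return None                       # malformed title line
    record = dict(zip(FIELDS, values))
    if any(not record[key] for key in REQUIRED):
        return None                       # missing required metadata
    return record


def complete_records(lines):
    """Yield metadata for every title line that passes the filter."""
    for line in lines:
        if line.startswith(">"):
            record = parse_title_line(line)
            if record is not None:
                yield record


if __name__ == "__main__":
    demo = [
        ">X0001|A/example/1/2009|H1N1|Human|Guangzhou|2009-05-01",
        "MKAILVVLLY",
        ">X0002|A/example/2/2010|H3N2||Beijing|2010-01-15",  # host missing
        "MKTIIALSYI",
    ]
    for rec in complete_records(demo):
        print(rec["accession"], rec["serotype"], rec["host"])
```

Records that pass the filter could then be written to a database table with one column per field, which mirrors the retained-records step mentioned in the abstract.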
Firmiana: towards a one-stop proteomic cloud platform for data processing and analysis
Nature Biotechnology. 2017;35:409–412.
Improvements in next-generation proteomics, including instrumentation, sample preparation, and computational
analysis, have generated large amounts of data that cover protein profiling, post-translational
modifications, and protein–protein interactions. The first draft of the human proteome, for example, made
use of 2,000 (ref. 6) and 16,000 (ref. 5) raw files. Proteomics now calls for a uniform online pipeline that
can host millions of data sets with the same quality standards, analyze hundreds to thousands of
experiments, and integrate multi-dimensional omics data for knowledge mining and hypothesis generation to
disseminate proteomics to the scientific community. Here, we describe Firmiana (V1.0) (http://www.firmiana.org/), a one-stop proteomic data processing and
integrated omics analysis cloud platform that allows scientists to deposit mass spectrometry (MS) raw files,
perform proteome identification and quantification online, carry out bioinformatics analyses, extract
knowledge, and visualize results using a biologist-friendly web interface without the need for programming
A de novo substructure generation algorithm for identifying the privileged chemical fragments of liver X
Scientific Reports. 2017;7:11121.
Liver X receptor β (LXRβ) is a promising therapeutic target for lipid disorders, atherosclerosis, chronic
inflammation, autoimmunity, cancer and neurodegenerative diseases. Druggable LXRβ agonists have been
explored over the past decades. However, the pocket of LXRβ ligand-binding domain (LBD) is too large to
predict LXRβ agonists with novel scaffolds based on either receptor or agonist structures. In this paper, we
report a de novo algorithm which derives privileged LXRβ agonist fragments by starting with individual
chemical bonds (de novo) from every molecule in an LXRβ agonist library, growing the bonds into substructures
based on the agonist structures with isomorphic and homomorphic restrictions, and electing the privileged
fragments from the substructures with a popularity threshold and background chemical and biological
knowledge. Using these privileged fragments as queries, we were able to figure out the rules to reconstruct
LXRβ agonist molecules from the fragments. The privileged fragments were validated by building regularized
logistic regression (RLR) and support vector machine (SVM) models as descriptors to predict an LXRβ
m6AVar: a database of functional variants involved in m6A modification.
Nucleic Acids Research. 2018; 46(D1): D139-145.
Here, we report m6AVar (http://m6avar.renlab.org), a comprehensive
database of m6A-associated variants that potentially influence m6A modification, which will help to
interpret variants by m6A function. The m6A-associated variants were derived from three different m6A
sources including miCLIP/PA-m6A-seq experiments (high confidence), MeRIP-Seq experiments (medium confidence)
and transcriptome-wide predictions (low confidence). Currently, m6AVar contains 16,132 high, 71,321 medium
and 326,915 low confidence level m6A-associated variants. We also integrated the RBP-binding regions,
miRNA-targets and splicing sites associated with variants to help users investigate the effect of
m6A-associated variants on post-transcriptional regulation. Because it integrates the data from genome-wide
association studies (GWAS) and ClinVar, m6AVar is also a useful resource for investigating the relationship
between the m6A-associated variants and disease. Overall, m6AVar will serve as a useful resource for
annotating variants and identifying disease-causing variants.
Expression and regulation of long noncoding RNAs during the osteogenic differentiation of periodontal ligament
stem cells in the inflammatory microenvironment
Scientific Reports. 2017;7:13991.
Although long noncoding RNAs (lncRNAs) have been emerging as critical regulators in various tissues and
biological processes, little is known about their expression and regulation during the osteogenic
differentiation of periodontal ligament stem cells (PDLSCs) in inflammatory microenvironment. In this study,
we have identified 63 lncRNAs that are not annotated in previous databases. These novel lncRNAs were not
randomly located in the genome but preferentially located near protein-coding genes related to particular
functions and diseases, such as stem cell maintenance and differentiation, development disorders and
inflammatory diseases. Moreover, we have identified 650 differentially expressed lncRNAs among different
subsets of PDLSCs. Pathway enrichment analysis for neighboring protein-coding genes of these differentially
expressed lncRNAs revealed stem cell differentiation related functions. Many of these differentially
expressed lncRNAs function as competing endogenous RNAs that regulate protein-coding transcripts through
competing shared miRNAs.
Background: Large-scale genome sequencing projects have identified many genetic variants for diverse
diseases. A major goal of these projects is to characterize these genetic variants to provide insight into
their function and roles in diseases. N6-methyladenosine (m6A) is one of the most abundant RNA modifications
in eukaryotes. Recent studies have revealed that aberrant m6A modifications are involved in many
Findings: In this study, we present a user-friendly web server called “m6ASNP” that is dedicated to
the identification of genetic variants targeting m6A modification sites. A random forest model was
implemented in m6ASNP to predict whether the methylation status of an m6A site is altered by the variants
surrounding the site. In m6ASNP, genetic variants in a standard VCF format are accepted as the input data,
and the output includes an interactive table containing the genetic variants annotated by m6A function. In
addition, statistical diagrams and a genome browser are provided to visualize the characteristics and
annotate the genetic variants.
Conclusions: We believe that m6ASNP is a highly convenient tool that can be used to boost further
functional studies investigating genetic variants. The web server “m6ASNP” is implemented in JAVA and PHP
and is freely available at http://m6asnp.renlab.org.
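The input step described above, reading variants from a standard VCF file and relating them to m6A sites, can be sketched in a few lines. Only the fixed VCF columns (CHROM, POS, ID, REF, ALT) are standard; the function names, the site list and the flanking window size are assumptions for illustration, and the random forest model itself is not reproduced here.

```python
def read_variants(lines):
    """Yield (chrom, pos, ref, alt) tuples from VCF text lines.

    Header lines start with '#'. Multi-allelic records, where ALT is
    a comma-separated list, are split into one tuple per allele.
    """
    for line in lines:
        if not line.strip() or line.startswith("#"):
            continue
        chrom, pos, _id, ref, alts = line.rstrip("\n").split("\t")[:5]
        for alt in alts.split(","):
            yield chrom, int(pos), ref, alt


def variants_near_sites(variants, sites, flank=2):
    """Return (variant, site) pairs lying within `flank` bases of a site.

    `sites` is a list of (chrom, pos) tuples; `flank` is a hypothetical
    window size, not a value taken from the paper.
    """
    hits = []
    for variant in variants:
        chrom, pos = variant[0], variant[1]
        for site in sites:
            if chrom == site[0] and abs(pos - site[1]) <= flank:
                hits.append((variant, site))
    return hits


if __name__ == "__main__":
    vcf = [
        "##fileformat=VCFv4.2",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
        "chr1\t1001\t.\tA\tG\t.\t.\t.",
        "chr1\t5000\t.\tC\tT,G\t.\t.\t.",
    ]
    for variant, site in variants_near_sites(read_variants(vcf), [("chr1", 1000)]):
        print(variant, "is near", site)
```

A real annotation pipeline would index the sites by chromosome and position rather than scanning every pair, but the nested loop keeps the logic easy to follow.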
Pan-Cancer Analysis Reveals the Functional Importance of Protein Lysine Modification in Cancer
Front. Genet. 9:254. doi: 10.3389/fgene.2018.00254
Large-scale tumor genome sequencing projects have revealed a complex landscape of genomic
mutations in multiple cancer types. A major goal of these projects is to characterize somatic
mutations and discover cancer drivers, thereby providing important clues to uncover diagnostic
or therapeutic targets for clinical treatment. However, distinguishing only a few somatic
mutations from the majority of passenger mutations is still a major challenge facing the
biological community. Fortunately, combining other functional features with mutations to
predict cancer driver genes is an effective approach to solve the above problem. Protein
lysine modifications are an important functional feature that regulates the development of
cancer. Therefore, in this work, we have systematically analyzed somatic mutations on seven
protein lysine modifications and identified several important drivers that are responsible for
tumorigenesis. From published literature, we first collected more than 100,000 lysine
modification sites for analysis. Another 1 million non-synonymous single nucleotide variants
(SNVs) were then downloaded from TCGA and mapped to our collected lysine modification sites.
To identify driver proteins that significantly altered lysine modifications, we further
developed a hierarchical Bayesian model and applied the Markov Chain Monte Carlo (MCMC) method
for testing. Strikingly, the coding sequences of 473 proteins were found to carry a higher
mutation rate in lysine modification sites compared to other background regions.
Hypergeometric tests also revealed that these gene products were enriched in known cancer
drivers. Functional analysis suggested that mutations within the lysine modification regions
possessed higher evolutionary conservation and deleteriousness. Furthermore, pathway
enrichment showed that mutations on lysine modification sites mainly affected cancer related
processes, such as cell cycle and RNA transport. Moreover, clinical studies also suggested
that the driver proteins were significantly associated with patient survival, implying an
opportunity to use lysine modifications as molecular markers in cancer diagnosis or treatment.
By searching within protein-protein interaction networks using a random walk with restart
(RWR) algorithm, we further identified a series of potential treatment agents and therapeutic
targets for cancer related to lysine modifications. Collectively, this study reveals the
functional importance of lysine modifications in cancer development and may benefit the
discovery of novel mechanisms for cancer treatment.
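The random walk with restart (RWR) step mentioned above can be sketched in a few lines of Python. This is a minimal illustration of the general technique on a toy undirected network, not the authors' implementation: the node names, the restart probability of 0.3 and the convergence tolerance are all assumptions.

```python
from typing import Dict, List, Set


def random_walk_with_restart(adj: Dict[str, List[str]], seeds: Set[str],
                             restart: float = 0.3, tol: float = 1e-12,
                             max_iter: int = 1000) -> Dict[str, float]:
    """Iterate p <- (1 - r) * W p + r * p0 until the L1 change drops below tol.

    W spreads the probability of each node evenly over its neighbours;
    p0 is uniform over the seed nodes.
    """
    nodes = sorted(adj)
    p0 = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in nodes}
    p = dict(p0)
    for _ in range(max_iter):
        nxt = {n: restart * p0[n] for n in nodes}
        for u in nodes:
            neighbours = adj[u]
            if not neighbours:
                # Isolated node: keep its walking mass in place (self-loop).
                nxt[u] += (1.0 - restart) * p[u]
                continue
            share = (1.0 - restart) * p[u] / len(neighbours)
            for v in neighbours:
                nxt[v] += share
        change = sum(abs(nxt[n] - p[n]) for n in nodes)
        p = nxt
        if change < tol:
            break
    return p


# Toy path network P1 - P2 - P3 - P4 with P1 as the seed (hypothetical names).
network = {
    "P1": ["P2"],
    "P2": ["P1", "P3"],
    "P3": ["P2", "P4"],
    "P4": ["P3"],
}
scores = random_walk_with_restart(network, seeds={"P1"})
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Nodes with higher steady-state probability are closer, in network terms, to the seed set, which is why such scores are commonly used to prioritise candidates near known driver proteins.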
m6A RNA modification controls autophagy through upregulating ULK1 protein abundance
Cell Research. 2018;
N6-methyladenosine (m6A) is the prominent dynamic mRNA modification, governed
by methyltransferase complex (“writers”), demethylases (“erasers”) and RNA-binding
proteins (“readers”).1 m6A modification directs mRNAs to distinct fates by grouping them for
differential processing, translation and decay in the processes such as cell differentiation,
embryonic development and stress responses. Owing to a deeper understanding of this
modification and the technological advance, functional characterizations of m6A in gene
regulation have become a hot topic that warrants further dissection.
DeepNitro: Prediction of Protein Nitration and Nitrosylation Sites by Deep Learning
Genomics Proteomics Bioinformatics. 2018; 16(4): 294-306.
Protein nitration and nitrosylation are essential post-translational modifications (PTMs) involved in many
fundamental cellular processes. Recent studies have revealed that excessive levels of nitration and nitrosylation
in some critical proteins are linked to numerous chronic diseases. Therefore, the identification of substrates
that undergo such modifications in a site-specific manner is an important research topic in the community and will
provide candidates for targeted therapy. In this study, we aimed to develop a computational tool for predicting
nitration and nitrosylation sites in proteins. We first constructed four types of encoding features, including
positional amino acid distributions, sequence contextual dependencies, physicochemical properties, and
position-specific scoring features, to represent the modified residues. Based on these encoding features, we
established a predictor called DeepNitro using deep learning methods for predicting protein nitration and
nitrosylation. Using n-fold cross-validation, our evaluation gives AUC values for DeepNitro of 0.65 for
tyrosine nitration, 0.80 for tryptophan nitration, and 0.70 for cysteine nitrosylation,
demonstrating the robustness and reliability of our tool. Also, when tested in the independent dataset, DeepNitro
is substantially superior to other similar tools with a 7%−42% improvement in the prediction performance. Taken
together, the application of deep learning method and novel encoding schemes, especially the position-specific
scoring feature, greatly improves the accuracy of nitration and nitrosylation site prediction and may facilitate
the prediction of other PTM sites. DeepNitro is implemented in JAVA and PHP and is freely available for academic
research at http://deepnitro.renlab.org
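For readers unfamiliar with the evaluation metric quoted above, the sketch below shows one standard way to compute the area under the ROC curve (AUC) from predicted scores, together with a simple split of sample indices into n folds. It is a generic illustration, not the DeepNitro evaluation code; the labels, scores and function names are made up for the example.

```python
from typing import List, Sequence


def k_fold_indices(n_samples: int, k: int) -> List[List[int]]:
    """Split range(n_samples) into k near-equal test folds (round-robin)."""
    return [list(range(i, n_samples, k)) for i in range(k)]


def roc_auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """AUC as the probability that a positive outscores a negative.

    Ties between a positive and a negative count as half a win.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy data: 1 = modified site, 0 = unmodified site (invented values).
labels = [1, 1, 0, 0, 1, 0]
scores = [0.9, 0.6, 0.7, 0.2, 0.8, 0.4]
auc = roc_auc(labels, scores)
folds = k_fold_indices(10, 3)
print(auc, folds)
```

In n-fold cross-validation each fold serves once as the test set while the remaining folds are used for training, and the AUC is computed on the held-out predictions.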
lnCAR: a comprehensive resource for lncRNAs from Cancer Arrays.
Cancer Res February 20 2019 DOI: 10.1158/0008-5472.CAN-18-2169
Long non-coding RNAs (lncRNA) have emerged as promising biomarkers in cancer diagnosis, treatment, and prognosis.
Recent studies suggest that a large number of coding gene expression microarray probes could be re-annotated as
lncRNAs. Microarray, once the most cutting-edge high throughput gene expression technology, has been used for
thousands of cancer studies and has brought invaluable resources for studying the functions of lncRNA in cancer
development. However, a comprehensive lncRNA resource based on microarray data is still lacking. Here we present
lnCAR, a comprehensive open resource for providing expression profiles and prognostic landscape of lncRNAs derived
from re-annotation of public microarray data. Currently, lnCAR contains 52,300 samples for differential expression
analysis and 12,883 samples for survival analysis from 10 cancer types. lnCAR allows users to interactively
explore any annotated or novel lncRNAs. We believe lnCAR will serve as a valuable resource for the community
focused on lncRNA research in cancer.
DeepPhagy: a deep learning framework for quantitatively measuring autophagy activity in Saccharomyces
Autophagy. Jun 12 2019 DOI: 10.1080/15548627.2019.1632622
Seeing is believing. The direct observation of GFP-Atg8 vacuolar delivery under confocal microscopy is one of the
most useful end-point measurements for monitoring yeast macroautophagy/autophagy. However, manually labelling
individual cells from large-scale sets of images is time-consuming and labor-intensive, which has greatly hampered
its extensive use in functional screens. Herein, we conducted a time-course analysis of nitrogen
starvation-induced autophagy in wild-type and knockout mutants of 35 AuTophaGy-related (ATG) genes in
Saccharomyces cerevisiae and obtained 1,944 confocal images containing > 200,000 cells. We manually labelled 8,078
autophagic and 18,493 non-autophagic cells as a benchmark dataset and developed a new deep learning tool for
autophagy (DeepPhagy), which exhibited superior accuracy in recognizing autophagic cells compared to other
existing methods, with an area under the curve (AUC) value of 0.9710 from 10-fold cross-validations. We further
used DeepPhagy to automatically analyze all the images and quantitatively classified the autophagic phenotypes of
the 35 atg knockout mutants into 3 classes. The high consistency in our computational and biochemical results
indicated the reliability of DeepPhagy for measuring autophagic activity. Moreover, we used DeepPhagy to analyze 3
additional types of autophagic phenotypes, including the targeting of Atg1-GFP to the vacuole, the vacuolar
delivery of GFP-Atg19, and the disintegration of autophagic bodies indicated by GFP-Atg8, all with satisfying
accuracies. Taken together, our study not only enables the GFP-Atg8 fluorescence assay to become a quantitative
measurement for analyzing autophagic phenotypes in S. cerevisiae but also demonstrates that deep learning-based
methods could potentially be applied to different types of autophagy.
BBCancer: an expression atlas of blood-based biomarkers in the early diagnosis of cancers
Nucleic Acids Research. October 29 2019 DOI: 10.1093/nar/gkz942
The early detection of cancer holds the key to combat and control the increasing global burden of cancer morbidity and mortality. Blood-based screenings using circulating DNAs (ctDNAs), circulating RNA (ctRNAs), circulating tumor cells (CTCs) and extracellular vesicles (EVs) have shown promising prospects in the early detection of cancer. Recent high-throughput gene expression profiling of blood samples from cancer patients has provided a valuable resource for developing new biomarkers for the early detection of cancer. However, a well-organized online repository for these blood-based high-throughput gene expression data is still not available. Here, we present BBCancer (http://bbcancer.renlab.org/), a web-accessible and comprehensive open resource for providing the expression landscape of six types of RNAs, including messenger RNAs (mRNAs), long noncoding RNAs (lncRNAs), microRNAs (miRNAs), circular RNAs (circRNAs), tRNA-derived fragments (tRFRNAs) and Piwi-interacting RNAs (piRNAs) in blood samples, including plasma, CTCs and EVs, from cancer patients with various cancer types. Currently, BBCancer contains expression data of the six RNA types from 5040 normal and tumor blood samples across 15 cancer types. We believe this database will serve as a powerful platform for developing blood biomarkers.
Our group studies post-translational modifications (PTMs) using computational approaches. We have
been developing a highly effective algorithm named GPS (Group-based Prediction System) for the
prediction of PTM sites.
Based on the GPS algorithm, over ten types of PTM predictors have been
released. We have also built a series of
databases for protein phosphorylation, lipid and lysine modifications.
Recently, we have been combining
the computational methods with BiFC (Bimolecular Fluorescence
Complementation) technology to develop a systematic approach for studying
the SUMO regulation in
Gene Editing with CRISPR
Our group also focuses on developing computational tools to assist the design of CRISPR systems.
Currently, we have developed a highly efficient binary alignment scheme to screen potential on-target
and off-target sites across the whole genome. Using machine learning methods such as Random Forest, we
predict the cleavage efficacies of the potential target sites and recommend an optimal gRNA design
to users based on our predictions. Experimental validation will follow in the near future.
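One common way to realise a binary alignment of this kind is to pack each base into two bits and count mismatches with XOR, as in the Python sketch below. This is only an assumed reading of the idea, not the group's actual scheme: the encoding table, the function names, the toy 6-nt guide and the mismatch threshold are all hypothetical, and the sketch ignores PAM checks, the reverse strand and ambiguous bases.

```python
from typing import List, Tuple

# Two bits per base (assumed encoding).
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}


def encode(seq: str) -> int:
    """Pack a DNA string into an integer, two bits per base."""
    value = 0
    for base in seq:
        value = (value << 2) | ENCODE[base]
    return value


def mismatches(a: int, b: int, length: int) -> int:
    """Count differing bases between two packed sequences of equal length."""
    diff = a ^ b
    low_bits = int("01" * length, 2)
    # A base differs if either bit of its 2-bit pair is set in the XOR.
    collapsed = (diff | (diff >> 1)) & low_bits
    return bin(collapsed).count("1")


def scan(genome: str, guide: str, max_mismatches: int = 3) -> List[Tuple[int, int]]:
    """Slide over the genome and report (start, mismatches) for close matches."""
    k = len(guide)
    packed_guide = encode(guide)
    window_mask = (1 << (2 * k)) - 1
    window = 0
    hits = []
    for i, base in enumerate(genome):
        # Shift the next base in and drop the oldest one.
        window = ((window << 2) | ENCODE[base]) & window_mask
        if i >= k - 1:
            mm = mismatches(window, packed_guide, k)
            if mm <= max_mismatches:
                hits.append((i - k + 1, mm))
    return hits


# Toy example; real guides are typically about 20 nt long.
hits = scan("TTACGTACGGACGAACTT", "ACGTAC", max_mismatches=1)
print(hits)
```

A hit with zero mismatches corresponds to a potential on-target site, while hits with a small number of mismatches are potential off-target sites that a scoring model could then rank.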
RNA N6-methyladenosine Modification
RNA N6-methyladenosine (m6A) modification has a critical role in the regulation of many fundamental biological processes.
However, the role of m6A in cancer is poorly understood. We have developed a
computational tool called “m6A Finder” for predicting m6A modification sites at
single-nucleotide resolution. We then systematically investigate the m6A-associated somatic mutations
in cancers using TCGA data. We are also developing algorithms to analyze m6A-Seq data, such as peak
calling and differential methylation analysis.
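As a simple illustration of what single-nucleotide resolution means in practice, the Python sketch below lists candidate adenosines that sit in the DRACH consensus motif (D = A/G/U, R = A/G, H = A/C/U), which is widely reported for m6A. Such a motif scan is only a common pre-filter and is not the m6A Finder algorithm, whose details are not given here; the function name and the example sequence are assumptions.

```python
import re
from typing import List, Tuple

# DRACH written in the DNA alphabet (U -> T); the lookahead allows overlaps.
DRACH = re.compile(r"(?=([AGT][AG]AC[ACT]))")


def candidate_m6a_sites(seq: str) -> List[Tuple[int, str]]:
    """Return (0-based index of the central A, matched 5-mer) for each DRACH hit."""
    seq = seq.upper().replace("U", "T")
    return [(m.start() + 2, m.group(1)) for m in DRACH.finditer(seq)]


# Invented example sequence.
sites = candidate_m6a_sites("CCGGACTTTAAACAGG")
print(sites)
```

Each reported index points at the adenosine itself, so the candidates can be intersected directly with variant positions or passed to a downstream classifier.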