Gene ontology (GO):
The Gene Ontology (GO) is a major bioinformatics initiative to unify the representation of gene and gene product attributes across all species. More specifically, the project aims to: 1) maintain and develop its controlled vocabulary of gene and gene product attributes; 2) annotate genes and gene products, and assimilate and disseminate annotation data; and 3) provide tools for easy access to all aspects of the data provided by the project and enable functional interpretation of experimental data using the GO, for example via enrichment analysis.
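Enrichment analysis of this kind is commonly framed as a one-sided hypergeometric test: given a study set of genes, how surprising is it that so many of them carry a given GO term? The following is a minimal sketch using only the Python standard library. The function name and all counts are hypothetical toy values, and no multiple-testing correction is applied, which a real analysis would need.

```python
from math import comb


def hypergeom_sf(k, population, annotated, sample):
    """Return P(X >= k) for a hypergeometric draw.

    population: number of genes in the background set
    annotated:  background genes annotated with the GO term
    sample:     number of genes in the study set
    k:          study-set genes annotated with the GO term
    """
    upper = min(annotated, sample)
    total = comb(population, sample)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(k, upper + 1)
    )
    return favourable / total


# Hypothetical toy numbers: 1000 background genes, 40 carry the term,
# and 8 of the 50 genes in the study set carry it.
p_value = hypergeom_sf(8, 1000, 40, 50)
print(f"enrichment p-value: {p_value:.3g}")
```

A small p-value suggests the term is over-represented in the study set relative to the background.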
GO is part of a larger classification effort, the Open Biomedical Ontologies (OBO).
Although gene nomenclature itself aims to maintain and develop a controlled vocabulary of genes and gene products, the Gene Ontology extends the effort by using a markup language to make the data (not only the genes and their products but also all their attributes) machine readable, and to do so in a way that is unified across all species (whereas gene nomenclature conventions vary by biological taxon).
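To illustrate what machine readable means in practice, the sketch below parses a single term stanza shaped like the OBO flat-file format, one of the plain-text serializations used for ontologies of this kind. The identifiers and names in the stanza are made up for illustration (the `EX:` prefix is not a real namespace), and the parser is a simplified assumption: it handles only the tags shown and treats every `!` as the start of a comment.

```python
# A hypothetical stanza; the EX: identifiers are placeholders, not real terms.
STANZA = """
[Term]
id: EX:0000002
name: example child process
namespace: biological_process
is_a: EX:0000001 ! example parent process
relationship: part_of EX:0000003 ! example larger process
"""


def parse_stanza(text):
    """Parse one term stanza into a dict with is_a and part_of lists."""
    term = {"is_a": [], "part_of": []}
    for line in text.strip().splitlines():
        line = line.split("!", 1)[0].strip()  # drop trailing "! comment"
        if not line or line.startswith("["):
            continue  # skip blank lines and the [Term] header
        tag, _, value = line.partition(":")
        tag, value = tag.strip(), value.strip()
        if tag == "is_a":
            term["is_a"].append(value)
        elif tag == "relationship":
            relation, target = value.split(None, 1)
            if relation == "part_of":
                term["part_of"].append(target)
        else:
            term[tag] = value
    return term


term = parse_stanza(STANZA)
print(term["id"], term["name"], term["is_a"], term["part_of"])
```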
Terms and ontology:
From a practical view, an ontology is a representation of something we know about. “Ontologies” consist of a representation of things that are detectable or directly observable, and the relationships between those things. There is no universal standard terminology in biology and related domains, and term usages may be specific to a species, research area or even a particular research group. This makes communication and sharing of data more difficult. The Gene Ontology project provides an ontology of defined terms representing gene product properties. The ontology covers three domains:
1- cellular component: The cellular component ontology describes locations, at the levels of subcellular structures and macromolecular complexes. Examples of cellular components include ‘nuclear inner membrane’, with the synonym ‘inner envelope’, and the ‘ubiquitin ligase complex’, with several subtypes of these complexes represented.
Generally, a gene product is located in or is a subcomponent of a particular cellular component. The cellular component ontology includes multi-subunit enzymes and other protein complexes, but not individual proteins or nucleic acids. Cellular component also does not include multicellular anatomical terms.
2- molecular function: the elemental activities of a gene product at the molecular level, such as binding or catalysis.
3- biological process: A biological process is a recognized series of events or molecular functions. A process is a collection of molecular events with a defined beginning and end. Mutant phenotypes often reflect disruptions in biological processes.
Beginning and end:
Every process should have a discrete beginning and end, and these should be clearly stated in the process term definition.
Collections of processes:
The biological process ontology includes terms that represent collections of processes as well as terms that represent a specific, entire process. Generally, the former will have mainly is_a children, and the latter will have part_of children that represent subprocesses.
To determine whether a process term should be an is_a or part_of child of its parent, ask: is an instance of the child process an instance of the entire parent process? That is, does the whole process, from start to finish, take place? If yes, the child is is_a; but if the process is only a portion of the parent process, the child is part_of.
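The distinction between the two relationship types can be sketched as a small graph with typed edges. This is an illustrative data structure only: the term names are hypothetical placeholders rather than real GO terms, and the traversal simply follows every edge upward regardless of its type.

```python
from collections import deque

# (child, relation, parent) triples; all names are hypothetical placeholders.
EDGES = [
    ("specific process A", "is_a", "collection of processes"),
    ("specific process B", "is_a", "collection of processes"),
    ("subprocess A1", "part_of", "specific process A"),
    ("subprocess A2", "part_of", "specific process A"),
]


def children(parent, relation):
    """Return the direct children of a term linked by the given relation."""
    return sorted(c for c, r, p in EDGES if p == parent and r == relation)


def ancestors(term):
    """Return every term reachable by following edges upward (breadth-first)."""
    seen, queue = set(), deque([term])
    while queue:
        current = queue.popleft()
        for child, _relation, parent in EDGES:
            if child == current and parent not in seen:
                seen.add(parent)
                queue.append(parent)
    return seen


print(children("collection of processes", "is_a"))
print(children("specific process A", "part_of"))
print(sorted(ancestors("subprocess A1")))
```

Here the collection term has only is_a children, while the specific process has part_of children for its subprocesses, mirroring the pattern described above.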
Annotation:
Annotation is the practice of capturing data about a gene product, and GO annotations use terms from the GO ontology to do so. The members of the GO Consortium submit their annotations for integration and dissemination on the GO website, where they can be downloaded directly or viewed online using AmiGO. In addition to the gene product identifier and the relevant GO term, GO annotations have the following data: the reference used to make the annotation; an evidence code denoting the type of evidence upon which the annotation is based; and the date and the creator of the annotation.
The evidence code comes from a controlled vocabulary of codes covering both manual and automated annotation methods. For example, Traceable Author Statement (TAS) means a curator has read a published scientific paper and the metadata for that annotation bears a citation to that paper; Inferred from Sequence Similarity (ISS) means a human curator has reviewed the output from a sequence similarity search and verified that it is biologically meaningful. Annotations from automated processes (for example, remapping annotations created using another annotation vocabulary) are given the code Inferred from Electronic Annotation (IEA). As of April 1, 2010, over 98% of all GO annotations were inferred computationally, not by curators. As these annotations are not checked by a human, the GO Consortium considers them to be less reliable and includes only a subset in the data available online in AmiGO.
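As a rough illustration of how evidence codes separate curated from automated annotations, the sketch below uses only the three codes named above. The helper names and the toy list of codes are assumptions made for this example; the real controlled vocabulary contains more codes than are shown here.

```python
# Only the codes named in the text; the full vocabulary is larger.
EVIDENCE_CODES = {
    "TAS": "Traceable Author Statement",
    "ISS": "Inferred from Sequence Similarity",
    "IEA": "Inferred from Electronic Annotation",
}

# Codes assigned by automated processes without curator review.
AUTOMATED = {"IEA"}


def is_curated(code):
    """True if the code is known and was assigned with curator review."""
    return code in EVIDENCE_CODES and code not in AUTOMATED


def automated_fraction(codes):
    """Fraction of annotations in the list that carry an automated code."""
    if not codes:
        return 0.0
    return sum(code in AUTOMATED for code in codes) / len(codes)


# Hypothetical toy data: evidence codes of four annotations.
toy_codes = ["IEA", "IEA", "TAS", "ISS"]
print(f"{automated_fraction(toy_codes):.0%} of the toy annotations are automated")
```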
Recently, many machine learning algorithms have been designed and implemented to predict Gene Ontology annotations.
Example annotation:
Gene product: Actin, alpha cardiac muscle 1, UniProtKB:P68032
GO term: heart contraction; GO:0060047 (biological process)
Evidence code: Inferred from Mutant Phenotype (IMP)
Reference: PMID 17611253
Assigned by: UniProtKB, June 6, 2008
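The annotation above can be modelled as a simple record. The class and field names below are hypothetical choices made for this sketch, not the layout of any official GO annotation file format; the values are taken from the example itself.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class GOAnnotation:
    """One GO annotation; field names are illustrative assumptions."""
    gene_product: str
    gene_product_id: str
    go_term: str
    go_id: str
    domain: str
    evidence_code: str
    reference: str
    assigned_by: str
    assigned_on: date


example = GOAnnotation(
    gene_product="Actin, alpha cardiac muscle 1",
    gene_product_id="UniProtKB:P68032",
    go_term="heart contraction",
    go_id="GO:0060047",
    domain="biological process",
    evidence_code="IMP",
    reference="PMID:17611253",
    assigned_by="UniProtKB",
    assigned_on=date(2008, 6, 6),
)


def summary(annotation):
    """Format the core of an annotation as a one-line string."""
    return (
        f"{annotation.gene_product_id} -> {annotation.go_id} "
        f"({annotation.evidence_code}, {annotation.reference})"
    )


print(summary(example))
```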
Tools:
A large number of tools that use the data provided by the GO project are available, both online and for download. The vast majority of these come from third parties; the GO Consortium develops and supports two tools, AmiGO and OBO-Edit.
AmiGO: AmiGO is a web-based application that allows users to query, browse and visualize ontologies and gene product annotation data. It also has a BLAST tool, tools allowing analysis of larger data sets, and an interface to query the GO database directly. AmiGO can be used online at the GO website to access the data provided by the GO Consortium, or can be downloaded and installed for local use on any database employing the GO database schema. It is free open source software and is available as part of the go-dev software distribution.
OBO-Edit: OBO-Edit is an open source, platform-independent ontology editor developed and maintained by the Gene Ontology Consortium. It is implemented in Java and uses a graph-oriented approach to display and edit ontologies. OBO-Edit includes a comprehensive search and filter interface, with the option to render subsets of terms to make them visually distinct; the user interface can also be customized according to user preferences. OBO-Edit also has a reasoner that can infer links that have not been explicitly stated, based on existing relationships and their properties. Although it was developed for biomedical ontologies, OBO-Edit can be used to view, search and edit any ontology. It is freely available to download.
Computational Tools for Genome Annotation
Meta-servers, web-servers and mirroring of web-servers and databases
Name | Can be used for | Algorithm | Reference
GeneMark | Archaea, Metagenomes, Eukaryotes, Viruses, Phages, Plasmids, EST and cDNA | Hidden Markov model | Besemer J. and Borodovsky M. Nucleic Acids Research, 2005, Vol. 33, Web Server Issue, pp. W451-454
GeneHacker | Microbial genomes | Markov model | Yada T., Hirosawa M. DNA Res., 3, 335-361 (1996). Syst. Mol. Biol. pp. 252-260 (1996). Syst. Mol. Biol. pp. 354-357 (1997).
GeneWalker | Human | Hidden Markov model |
HMMgene (v. 1.1) | Vertebrate and C. elegans | Hidden Markov model | A. Krogh: In Proc. of Fifth Int. Conf. on Intelligent Systems for Molecular Biology, ed. Gaasterland, T. et al., Menlo Park, CA: AAAI Press, 1997, pp. 179-186.
Chemgenome2.0 | Prokaryotes | Ab initio method | Poonam Singhal, B. Jayaram, Surjit B. Dixit and David L. Beveridge. Prokaryotic Gene Finding based on Physicochemical Characteristics of Codons Calculated from Molecular Dynamics Simulations. Biophysical Journal, 2008, Volume 94, Issue 11, 4173-4183
Name | Can be used for | Algorithm
GenomeThreader | Plants | Similarity-based gene prediction program in which additional cDNA/EST and/or protein sequences are used to predict gene structures via spliced alignments.
JIGSAW (formerly “Combiner”) | Eukaryotes | Uses multiple sources of evidence (output from gene finders, splice site prediction programs and sequence alignments) to predict gene models.
GeneZilla | Eukaryotes | GeneZilla is based on the Generalized Hidden Markov Model (GHMM). It evolved out of the ab initio eukaryotic gene finder TIGRscan, which was developed at The Institute for Genomic Research.
AUGUSTUS | Eukaryotic genomic sequences | Allows protein homology information to be used in the prediction.
EuGene | Eukaryotes | EuGène exploits probabilistic models such as Markov models to discriminate coding from non-coding sequences and effective splice sites from false splice sites (using various mathematical models).
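Several of the tools listed above rely on Markov models to tell coding from non-coding sequence. The core idea can be sketched with two first-order Markov chains and a log-odds score. This is a deliberately simplified illustration, not the algorithm of any tool in the table: real gene finders use higher-order, hidden or generalized hidden Markov models trained on large annotated data sets, and the training sequences here are invented toy strings.

```python
from math import log

ALPHABET = "ACGT"


def train(sequences):
    """Estimate first-order transition probabilities with add-one smoothing."""
    counts = {a: {b: 1 for b in ALPHABET} for a in ALPHABET}
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    return {
        a: {b: counts[a][b] / sum(counts[a].values()) for b in ALPHABET}
        for a in ALPHABET
    }


def log_odds(seq, coding_model, noncoding_model):
    """Sum of log likelihood ratios; positive favours the coding model."""
    return sum(
        log(coding_model[prev][nxt] / noncoding_model[prev][nxt])
        for prev, nxt in zip(seq, seq[1:])
    )


# Invented toy training data: GC-rich "coding" and AT-rich "non-coding".
coding = train(["ATGGCCGCGGCC", "ATGGCGGCCGCG"])
noncoding = train(["ATATATTTAATA", "TTATAATATTTA"])

for query in ("GCCGCG", "TATATA"):
    score = log_odds(query, coding, noncoding)
    label = "coding-like" if score > 0 else "non-coding-like"
    print(f"{query}: {score:+.2f} ({label})")
```

Add-one smoothing keeps every transition probability non-zero, so the ratio inside the logarithm is always defined even for transitions never seen in training.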
Name | Description
GeneCards | A database of human genes, their products and their involvement in diseases. It offers concise information about the functions of all human genes that have an approved symbol, as well as selected others. It is especially useful for those searching for information while working in functional genomics and proteomics. The data is collected with knowledge discovery and data mining techniques and accessed by means of a proprietary Guidance System that makes more or less intelligent suggestions to the user about where and how the information may be retrieved.
TRANSFAC | A transcription factor database. It compiles data about gene regulatory DNA sequences and the protein factors binding to them. On this basis, programs are developed that help to identify putative promoter or enhancer structures and to suggest their features.
EpoDB (Erythropoiesis Database) | A database of genes that relate to vertebrate red blood cells. A detailed description of EpoDB can be found in Chapter 5. The database includes DNA sequence, structural features and potential transcription factor binding sites.
PlantProm DB | A database of plant promoters.
RegulonDB | Provides curated information on gene organization and regulation in E. coli. Current information is provided on the gene, operon and regulon level. Future expansion will include information on regulation beyond transcription initiation.