|
POSTER NO: 67 Celera Functional Annotation Pipeline
1Mark Yandell, 1Jian Wang, 1Bixiong Chris Shue, 1William H. Majoros, 1Douglas B. Rusch, 2Paul D. Thomas, 1Jennifer R. Wortman, 1Peter W. Li, 1Richard J. Mural
The availability of nearly complete sets of human and mouse proteins presents both challenges and opportunities for high-throughput prediction of protein function. A series of computational analyses has been applied to the human and mouse proteins to generate a baseline functional annotation. These data are imported into a relational protein database to allow sophisticated biological queries and analyses. Subscribers to the Celera human and mouse genome data can access the functional annotation along with sequence data through the Celera Discovery System (CDS), a web-accessible research workbench for comprehensive information retrieval, analysis and data mining. The functional annotation pipeline includes:
1) GeneNamer assigns gene symbols, names and aliases based on HUGO, MGD and other public resources.
2) LAFS generates BLAST alignments against in-house and public datasets. LAFS also computes physical properties of proteins, such as molecular weight and isoelectric point, and assigns proteins to the GO biological process, molecular function and cellular component categories (Gene Ontology Consortium).
3) Celera's proprietary Panther system classifies proteins based on a library of over 40,000 hidden Markov models (HMMs) and applies controlled annotation to each protein family.
4) InterProScan assigns motifs based on Pfam, PRINTS, ProDom and PROSITE, together with their corresponding InterPro entries.
5) Lek clustering is a sequence-similarity-based protein family classification approach that organizes the protein space into families and subfamilies.
6) The human-mouse ortholog predictor computes putative human-mouse ortholog pairs based on sequence similarity and syntenic gene location.
For the functional annotation to be effective, the gene structures must be accurately designated.
Celera human genes have been exhaustively verified and annotated by expert curators to ensure that the input data to the pipeline are of the highest quality. The annotated proteins and transcripts put the sequence data in biological context by bringing together diverse in-house and public resources, making the discovery of new functions easier, especially for biologists with limited access to a powerful local bioinformatics environment. |
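As an illustration of the kind of physical property the LAFS step reports, the following minimal Python sketch estimates the molecular weight of a protein from its amino acid sequence. It is not the LAFS implementation, which the abstract does not describe. The function name `molecular_weight` and the table `AVERAGE_RESIDUE_MASS` are hypothetical, and the sketch assumes an unmodified chain of the 20 standard residues, average isotopic masses, and one water molecule added for the free termini.

```python
# Illustrative sketch only; not the LAFS code. All names here are hypothetical.
# Average residue masses in daltons for the 20 standard amino acids
# (the mass of each amino acid minus one water, as found inside a peptide chain).
AVERAGE_RESIDUE_MASS = {
    "G": 57.0519, "A": 71.0788, "S": 87.0782, "P": 97.1167, "V": 99.1326,
    "T": 101.1051, "C": 103.1388, "L": 113.1594, "I": 113.1594, "N": 114.1038,
    "D": 115.0886, "Q": 128.1307, "K": 128.1741, "E": 129.1155, "M": 131.1926,
    "H": 137.1411, "F": 147.1766, "R": 156.1875, "Y": 163.1760, "W": 186.2132,
}
WATER_MASS = 18.01528  # one water accounts for the free N- and C-termini


def molecular_weight(sequence: str) -> float:
    """Return the approximate average molecular weight (Da) of a protein.

    Assumes an unmodified chain of standard residues; ambiguous or
    non-standard codes (for example X, B, Z, U) are rejected.
    """
    residues = sequence.strip().upper()
    if not residues:
        raise ValueError("sequence is empty")
    unknown = set(residues) - set(AVERAGE_RESIDUE_MASS)
    if unknown:
        raise ValueError(f"unsupported residue codes: {sorted(unknown)}")
    return sum(AVERAGE_RESIDUE_MASS[r] for r in residues) + WATER_MASS


if __name__ == "__main__":
    # Free glycine is a convenient sanity check for the residue table.
    print(f"{molecular_weight('G'):.2f} Da")
```

Calling `molecular_weight("G")` gives roughly 75.07 Da, the mass of free glycine. A production pipeline would also need a policy for ambiguous residues and post-translational modifications, which this sketch deliberately rejects or ignores.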