HGM2002 Poster Abstracts: 1. Genome Informatics and Annotation
POSTER NO: 46
Clustering human genes based on full-length protein homology and its applications in genome-wide data mining
Bixiong Chris Shue, Jian Wang, William Majoros, Mark Yandel, Richard Mural
A total of 28382 human genes, encoding 32117 proteins in the recent Celera annotation, were clustered based on their full-length homology. The algorithm we developed aims to identify paralogous genes by clustering at different stringency levels. The complete protein set is compared to itself using BlastP (N x N blast), and the blast reports are used as the basis for clustering proteins into paralog groups. Two proteins are asserted to be paralogs based on the extent of their shared similarity over their full lengths; they are not grouped if they merely share similar domain(s). For each alignment, the sensitivity (aligned sequence length over the length of the query sequence) and the specificity (aligned sequence length over the length of the subject sequence) are calculated, and proteins and their corresponding genes are clustered as paralogs only if both values meet preset thresholds. At a threshold level that requires both sensitivity and specificity to be greater than 0.8, more than 25% (8440 out of 32117) of the total proteins are placed into 2262 clusters, each of which contains at least two genes. The resulting clusters at different stringency levels are further analyzed by examining the functional classification of the proteins in each cluster using the Panther classification scheme. One application of these data is to investigate possible large-scale intra- and inter-chromosomal duplications across the whole human genome, based on co-localization of paralogous gene pairs. At sensitivity and specificity thresholds of 0.8, more than 600 pairs of possible segmental duplications were identified. The largest pair spans 13.6Mb on Chr12 and 17.3Mb on Chr17 and contains 21 different paralogous gene pairs. The expression patterns of these co-localized gene pairs, as well as of other paralogous genes that are involved in signal transduction pathways and/or contribute to disease status, are also being studied.
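The thresholding step described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the `Hit` record, the function names, and the use of single-linkage grouping via union-find are assumptions made for illustration, since the abstract does not state how pairwise paralog assertions are merged into clusters. It also assumes that the aligned length for each query/subject pair has already been extracted from the BlastP reports.

```python
from collections import defaultdict
from typing import Iterable, List, NamedTuple, Tuple


class Hit(NamedTuple):
    """One query/subject pair taken from an all-vs-all BlastP report (hypothetical record)."""
    query: str
    subject: str
    aligned_len: int   # aligned sequence length for this pair
    query_len: int     # full length of the query protein
    subject_len: int   # full length of the subject protein


def coverage(hit: Hit) -> Tuple[float, float]:
    """Return (sensitivity, specificity) as defined in the abstract."""
    sensitivity = hit.aligned_len / hit.query_len
    specificity = hit.aligned_len / hit.subject_len
    return sensitivity, specificity


def cluster_paralogs(hits: Iterable[Hit], threshold: float = 0.8) -> List[List[str]]:
    """Group proteins whose hits exceed the threshold on both measures.

    Assumption: qualifying pairs are merged by single linkage (union-find).
    """
    parent = {}

    def find(x: str) -> str:
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for hit in hits:
        if hit.query == hit.subject:
            continue  # ignore self-hits from the N x N comparison
        sens, spec = coverage(hit)
        if sens > threshold and spec > threshold:
            parent[find(hit.query)] = find(hit.subject)

    groups = defaultdict(set)
    for protein in list(parent):
        groups[find(protein)].add(protein)
    # Keep only clusters with at least two members
    return [sorted(members) for members in groups.values() if len(members) >= 2]


example_hits = [
    Hit("A", "B", 90, 100, 100),    # full-length similarity: linked
    Hit("B", "C", 85, 100, 95),     # full-length similarity: linked
    Hit("D", "E", 80, 400, 400),    # shared domain only: not linked
    Hit("A", "A", 100, 100, 100),   # self-hit: ignored
]
clusters = cluster_paralogs(example_hits)
```

In this toy input, proteins D and E share only an 80-residue region out of 400, so both ratios are 0.2 and the pair is left unclustered, which mirrors the abstract's distinction between full-length homology and shared domains. Raising or lowering `threshold` corresponds to clustering at a different stringency level.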