HGM2002 Poster Abstracts: 1. Genome Informatics and Annotation
POSTER NO: 676
Unexpected biases in BLASTP output
In analysing a series of BLAST results, we found a significant 'bias' towards alignments with similarity scores of 25.00%, 30.00%, 33.33%, 40.00%, 42.85% and 45.45%, among others. When we used the latest BLASTP from NCBI to test different sets of protein sequences, the bias was observed in all sets. The same bias appeared when we subjected a series of artificial, randomly generated sequences of varying lengths to BLASTP. Initially, we thought that these biases were due to the presence of short alignments, so we filtered out the short alignments by imposing a more stringent E-value. This treatment, however, did not completely remove the bias. When the same protein sequence sets were subjected to FASTA, we did not observe any bias. Furthermore, we took the BLASTP-generated alignments with a 33.33% similarity score and E-values lower than 10^-10 and subjected these to FASTA. Interestingly, the scores assigned by FASTA to these alignments ranged from 21% to 37%. After discussions with Dr. Tom Madden, group leader of the BLAST developers, we found the origin of the problem. The similarity score of BLASTP is calculated by dividing the number of identical amino acids by the length of the query sequence's aligned segment. Both the numerator and denominator are integers. Considering a series of alignments with lengths ranging from 1 to 100, there can be 50 alignments that have 50% similarity (1 of 2, 2 of 4, ..., 50 of 100), 33 alignments with 33.33% (1 of 3, 2 of 6, ..., 33 of 99), but only 2 alignments with 26.00% (13 of 50 and 26 of 100). These are inherent biases generated by the use of integers in calculating the similarity score between protein segments. Although FASTA can be used as an alternative, it is not without limitations. Aside from being slower than BLAST, FASTA can miss some significant alignments due to the stringency of its algorithm. It appears, therefore, that these two sequence alignment tools are best used in tandem.
FASTA will be especially useful when BLASTP generates alignments with the above-mentioned scores. Authors who find alignments with similarity scores of 25.00%, 30.00%, 33.33%, 40.00%, etc. should therefore exercise special caution when they interpret the data they analyse.
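The integer-ratio argument in the abstract can be checked with a short enumeration. The Python sketch below is illustrative only and is not part of BLASTP: the function name identity_multiplicities and the max_len parameter are hypothetical, and it simply counts every (identical residues, aligned length) pair for lengths 1 to 100 as one possible alignment, which says nothing about how often such alignments occur in real BLASTP output.

```python
from collections import Counter
from fractions import Fraction


def identity_multiplicities(max_len=100):
    """Count how many (identical, length) integer pairs give each exact ratio.

    Hypothetical helper for illustration: every pair with
    1 <= identical <= length <= max_len is counted once.
    """
    counts = Counter()
    for length in range(1, max_len + 1):
        for identical in range(1, length + 1):
            # Fraction reduces to lowest terms, so 1/2, 2/4, ... share one key.
            counts[Fraction(identical, length)] += 1
    return counts


counts = identity_multiplicities(100)

# The three ratios discussed in the abstract: 50%, 33.33% and 26.00%.
for ratio in (Fraction(1, 2), Fraction(1, 3), Fraction(13, 50)):
    print(f"{float(ratio) * 100:6.2f}%  {counts[ratio]} alignments")
```

Under these assumptions the enumeration reproduces the counts given above: 50 pairs for 50%, 33 pairs for 33.33% and 2 pairs for 26.00%. Ratios with small denominators are reachable from many more alignment lengths than ratios with large denominators.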