Brief introduction to PROPSEARCH
Common protein sequence alignment programs are at present unable to detect functional and/or
structural homologs if the sequence identity is below the significance threshold of about 25%. PROPSEARCH was designed to find the putative protein family when querying a new sequence with alignment methods has failed.
PROPSEARCH neglects the order of amino acid residues in a sequence and uses the amino acid composition instead.
In addition, other properties such as molecular weight, content of bulky residues, content of small residues, average hydrophobicity, average charge, and so on, as well as the content of selected dipeptide groups, are calculated from the sequence. In total, 144 such properties are weighted individually and used as the query vector. The weights were trained on a set of protein families with known structures, using a genetic algorithm. Sequences in the database are transformed into vectors in the same way, and the Euclidean distance between the query and each database sequence is calculated. Distances are rank-ordered, and sequences with the lowest distance are reported at the top.
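The vector comparison described above can be illustrated with a minimal Python sketch. It is not the PROPSEARCH implementation: it uses only the 20 amino acid frequencies rather than all 144 properties, the uniform weights are placeholders for the trained ones, and applying each weight as a scaling factor on its component is an assumption. The function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def composition_vector(sequence):
    """Return the fraction of each of the 20 standard amino acids.

    Residue order is ignored; only counts matter.
    """
    seq = sequence.upper()
    counts = [seq.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [c / total for c in counts]


def weighted_distance(query_vec, db_vec, weights):
    """Euclidean distance after scaling each property by its weight.

    Assumption: weights act as per-component scaling factors.
    """
    return math.sqrt(
        sum((w * (q - d)) ** 2 for q, d, w in zip(query_vec, db_vec, weights))
    )


def rank_database(query, database, weights=None):
    """Return (distance, identifier) pairs, lowest distance first."""
    if weights is None:
        # Placeholder weights, NOT the weights trained by the genetic algorithm.
        weights = [1.0] * len(AMINO_ACIDS)
    query_vec = composition_vector(query)
    hits = [
        (weighted_distance(query_vec, composition_vector(seq), weights), name)
        for name, seq in database.items()
    ]
    return sorted(hits)
```

Because residue order is discarded, a sequence and any permutation of it receive a distance of zero in this sketch, which shows why the additional properties and the reliability estimate matter in practice.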
For questions and suggestions contact Uwe Hobohm via email (Uwe.Hobohm@tg.fh-giessen.de).
We appreciate feedback about correct protein identification.
Prof. Dr. Heinz-Uwe Hobohm
University of Applied Sciences
KMUB-Bioinformatics
Wiesenstrasse 14
D-35390 Giessen
Phone/Fax 0641 3092549
TABLE 1: RELIABILITY OF FAMILY IDENTIFICATION AS FUNCTION OF PROPSEARCH-DISTANCE
DIST between | and  | RELIABILITY [percent]
0.0          | 1.3  | 99.9
1.3          | 7.5  | 99.6
7.5          | 8.7  | 94
8.7          | 10.0 | 87
10.0         | 11.2 | 80
11.2         | 12.5 | 68
12.5         | 13.7 | 53
13.7         | 14.9 | 41
14.9         | 16.2 | 36
16.2         | 17.5 | 32
17.5         | 18.7 | 25
18.7         | 19.9 | 19
CORRELATION BETWEEN PROPSEARCH DISTANCE AND PROBABILITY OF IDENTIFICATION
This reliability estimate has been derived from a set of about 1300
sequences belonging to 58 protein families representative of the Brookhaven
Protein Data Bank (PDB).
METHOD:
Each family was assumed to have a different structure, based on the
observation that no sequence identifier can be found in more than one
multiple sequence alignment (PDB sequence used as protein family
representative for the multiple sequence alignment). Using one sequence of
the set, we performed a PROPSEARCH query on the set and determined the rank and
PROPSEARCH distance of the first false positive, i.e. the first sequence not
belonging to the family of the query sequence. Table 1 was derived by doing this
for all 1300 sequences.
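The search for the first false positive can be sketched as follows. This is an illustration only: the function name, the data layout (a ranked list of identifier/distance pairs and a dictionary of family labels), and the decision to skip the query's own entry are assumptions, not details taken from the original procedure.

```python
def first_false_positive(ranked_hits, family_of, query_id):
    """Return (rank, distance) of the first hit outside the query's family.

    ranked_hits: list of (identifier, distance), sorted by ascending distance.
    family_of:   dict mapping identifier -> family label.
    Returns None if every hit belongs to the query's family.
    """
    query_family = family_of[query_id]
    rank = 0
    for identifier, distance in ranked_hits:
        if identifier == query_id:
            continue  # assumption: the query's own entry is not counted
        rank += 1
        if family_of[identifier] != query_family:
            return rank, distance
    return None
```

Repeating this for every sequence in the set gives one (rank, distance) pair per query, from which a distance-dependent reliability can be tabulated.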
EXAMPLE:
A distance between 8.7 and 10.0 translates into a >87% chance that
query sequence and hit are in the same family (implying in the majority of
cases a similar function).
If a hit has a high probability, one should go further and check
whether functional residues (Cys-bridges, active site residues, and so on),
provided they are known, are conserved between query and hit.
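The lookup used in this example can be written as a small Python sketch. The values are copied from Table 1. Treating the lower bound of each interval as inclusive and the upper bound as exclusive is an assumption, since the table does not say how boundary values are assigned, and the function name is hypothetical.

```python
# (lower bound, upper bound, reliability in percent), copied from Table 1.
RELIABILITY_TABLE = [
    (0.0, 1.3, 99.9),
    (1.3, 7.5, 99.6),
    (7.5, 8.7, 94),
    (8.7, 10.0, 87),
    (10.0, 11.2, 80),
    (11.2, 12.5, 68),
    (12.5, 13.7, 53),
    (13.7, 14.9, 41),
    (14.9, 16.2, 36),
    (16.2, 17.5, 32),
    (17.5, 18.7, 25),
    (18.7, 19.9, 19),
]


def reliability(dist):
    """Return the tabulated reliability for a PROPSEARCH distance.

    Returns None for distances outside the tabulated range.
    Assumption: intervals are closed below and open above.
    """
    for low, high, percent in RELIABILITY_TABLE:
        if low <= dist < high:
            return percent
    return None
```

For instance, `reliability(9.0)` falls into the 8.7 to 10.0 interval, and distances of 19.9 or more return `None` because Table 1 gives no value for them.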
PERFORMANCE COMPARED WITH CONVENTIONAL SEQUENCE ALIGNMENT TOOLS:
In our experience, PropSearch on the one hand "loses" some known
relatives by calculating an insignificant PropSearch distance, but on the other
hand tends to find about 10-20% more "remote homologs" with a high
reliability score, i.e. query and hit have the same fold (and most often a
similar function) but insignificant alignment homology (unpublished).
VERIFICATION OF RESULTS
PropSearch can report false positives. If you do not want to rely on the
reliability estimate given by the PropSearch distance, you should verify hits
by other means, such as:
1. Take the hit reported by PropSearch, do a database search using BLAST, FASTA,
SW, and so on, collect clear homologs, do a multiple sequence alignment on those
(using e.g. ClustalW), and check whether conserved residues can be found in the
query sequence as well.
2. Compare secondary structure predictions between query and hit.
OUTPUT:
Rank: Position of protein found after sorting on DIST. In a fragment search,
only one fragment per protein is shown: the top scoring fragment, i.e.
not all hits are printed. The top scoring protein is the top
candidate for identification of the protein. However, if additional
biochemical knowledge about the protein under investigation is at
hand, e.g. if it is known that the protein is of
cytoskeletal origin, membrane bound, or an enzyme, this
knowledge may be used to inspect proteins ranked below the top hit
(rank greater than 1) in the PROPSEARCH output. If the top scoring protein
shows a high distance with low reliability, then there is still a good chance
of finding the correct protein among the first 20 or so hits [for details
see publication].
ID:
Protein identifier from the SwissProt or PirOnly protein database
(PirOnly: sequences not included in SwissProt). SwissProt sequence
identifiers contain an underscore '_'; Pir sequences are indicated by
'>'.
DIST:
PROPSEARCH distance between query composition and database protein
composition. Amino acid specific weights are considered.
LEN2:
Length of matching protein sequence.
POS1,POS2:
Position of fragment in sequence (relevant in fragment search only).
pI:
Calculated isoelectric point of fragment.
DE:
SwissProt DE-line or Pir explanation line.
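When post-processing the output, the ID convention described above (underscore for SwissProt, '>' for Pir) could be applied as in the following sketch. The function name is hypothetical, the exact position of the '>' marker in the identifier is not stated in this text, and checking for '>' before the underscore is an assumption.

```python
def id_source(identifier):
    """Guess the source database of a PROPSEARCH hit from its identifier.

    Assumption: a '>' anywhere in the identifier marks a PirOnly entry,
    otherwise an underscore marks a SwissProt entry.
    """
    if ">" in identifier:
        return "PirOnly"
    if "_" in identifier:
        return "SwissProt"
    return "unknown"
```

Identifiers that carry neither marker are reported as "unknown" rather than guessed.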