Propsearch Information

Brief introduction to PROPSEARCH

Common protein sequence alignment programs currently cannot detect functional and/or structural homologs if the sequence identity is below the significance threshold of about 25%. PROPSEARCH was designed to find the putative protein family when querying a new sequence with alignment methods has failed. PROPSEARCH neglects the order of amino acid residues in a sequence and uses the amino acid composition instead. In addition, other properties such as molecular weight, content of bulky residues, content of small residues, average hydrophobicity, average charge, and so on, as well as the content of selected dipeptide groups, are calculated from the sequence. 144 such properties are weighted individually and used as the query vector. The weights have been trained on a set of protein families with known structures, using a genetic algorithm. Sequences in the database are transformed into vectors as well, and the Euclidean distance between the query and each database sequence is calculated. Distances are rank ordered, and the sequences with the lowest distance are reported on top.
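
The idea of comparing compositions rather than alignments can be illustrated with a minimal Python sketch. It is not the PROPSEARCH implementation: it uses only the 20 amino acid fractions instead of the 144 properties, and uniform weights stand in for the trained ones. All function names are hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def composition_vector(sequence):
    """Return the fraction of each of the 20 standard amino acids."""
    seq = sequence.upper()
    total = sum(seq.count(aa) for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [seq.count(aa) / total for aa in AMINO_ACIDS]

def weighted_distance(query_vec, db_vec, weights):
    """Euclidean distance between two individually weighted property vectors."""
    return math.sqrt(sum((w * (q - d)) ** 2
                         for q, d, w in zip(query_vec, db_vec, weights)))

def rank_database(query, database, weights=None):
    """Return (distance, identifier) pairs sorted by increasing distance."""
    if weights is None:
        # Assumption: uniform weights; the trained weights are not given here.
        weights = [1.0] * len(AMINO_ACIDS)
    query_vec = composition_vector(query)
    hits = [(weighted_distance(query_vec, composition_vector(seq), weights), ident)
            for ident, seq in database.items()]
    return sorted(hits)
```

Because residue order is ignored, a query such as "ACDK" has distance 0.0 to a database entry "KDCA", which an alignment-based method would not treat as identical.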

For questions and suggestions contact Uwe Hobohm via email (Uwe.Hobohm@tg.fh-giessen.de). We appreciate feedback about correct protein identification.

Prof.Dr. Heinz-Uwe Hobohm
University of Applied Sciences
KMUB-Bioinformatics
Wiesenstrasse 14
D-35390 Giessen

Phone/Fax 0641 3092549

TABLE 1: RELIABILITY OF FAMILY IDENTIFICATION AS FUNCTION OF PROPSEARCH-DISTANCE

DIST from	DIST to	RELIABILITY [percent]
0.0	1.3	99.9
1.3	7.5	99.6
7.5	8.7	94
8.7	10.0	87
10.0	11.2	80
11.2	12.5	68
12.5	13.7	53
13.7	14.9	41
14.9	16.2	36
16.2	17.5	32
17.5	18.7	25
18.7	19.9	19

CORRELATION BETWEEN PROPSEARCH DISTANCE AND PROBABILITY OF IDENTIFICATION

This reliability estimate has been derived from a set of about 1300 sequences belonging to 58 protein families representative of the Brookhaven Protein Data Bank (PDB).

METHOD:

Each family was assumed to have a different structure, based on the observation that no sequence identifier can be found in more than one multiple sequence alignment (a PDB sequence was used as the protein family representative for each multiple sequence alignment). Taking one sequence of the set as the query, we performed a PROPSEARCH query on the set and determined the rank and PROPSEARCH distance of the first false positive, i.e. the first sequence not belonging to the family of the query sequence. Table 1 was derived by doing this for all 1300 sequences.
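
The per-query step of this procedure can be sketched as follows. This is an illustration under assumptions, not the original evaluation code: the function name and the (distance, family) tuple layout are hypothetical. How the per-query values were binned into Table 1 is not described here, so the sketch stops before that step.

```python
def first_false_positive(query_family, ranked_hits):
    """Return (rank, distance) of the first hit outside the query's family.

    ranked_hits is a list of (distance, family) tuples, already sorted by
    increasing distance. Ranks are 1-based. Returns None if every hit
    belongs to the query family.
    """
    for rank, (distance, family) in enumerate(ranked_hits, start=1):
        if family != query_family:
            return rank, distance
    return None
```

With made-up hits `[(1.2, "globin"), (6.9, "globin"), (9.4, "kinase")]` and a query from the "globin" family, the first false positive is at rank 3 with distance 9.4.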

EXAMPLE:

A distance between 8.7 and 10.0 translates into a >87% chance that query sequence and hit are in the same family (implying, in the majority of cases, a similar function). If you get a hit with high probability, you should go further and check whether functional residues (Cys-bridges, active site residues, and so on), provided they are known, are conserved between query and hit.
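
Reading Table 1 for a given distance can be expressed as a small lookup. The values below are copied from Table 1; the function name is hypothetical, and treating each interval as including its lower bound and excluding its upper bound is an assumption, since the table does not say to which interval a boundary value belongs.

```python
# (lower bound, upper bound, reliability in percent), copied from Table 1
RELIABILITY_TABLE = [
    (0.0, 1.3, 99.9),
    (1.3, 7.5, 99.6),
    (7.5, 8.7, 94),
    (8.7, 10.0, 87),
    (10.0, 11.2, 80),
    (11.2, 12.5, 68),
    (12.5, 13.7, 53),
    (13.7, 14.9, 41),
    (14.9, 16.2, 36),
    (16.2, 17.5, 32),
    (17.5, 18.7, 25),
    (18.7, 19.9, 19),
]

def reliability_for_distance(dist):
    """Return the Table 1 reliability for a distance, or None outside the table."""
    for low, high, reliability in RELIABILITY_TABLE:
        # Assumption: intervals are half-open, [low, high).
        if low <= dist < high:
            return reliability
    return None
```

For example, `reliability_for_distance(9.0)` returns 87, and a distance of 19.9 or more returns None because Table 1 gives no estimate there.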

PERFORMANCE COMPARED WITH CONVENTIONAL SEQUENCE ALIGNMENT TOOLS:

In our experience, PropSearch on the one hand "loses" some known relatives by calculating an insignificant PropSearch distance, but on the other hand tends to find about 10-20% more "remote homologs" with a high reliability score, i.e. query and hit have the same fold (and most often a similar function) but insignificant alignment homology (unpublished).

VERIFICATION OF RESULTS

PropSearch can report false positives. If you do not want to rely on the reliability estimate given by the PropSearch distance, you should verify hits by other means, such as:
1. Take the hit reported by PropSearch, do a database search using BLAST, FASTA, Smith-Waterman (SW), and so on, collect clear homologs, do a multiple sequence alignment on those (using e.g. ClustalW), and check whether conserved residues can be found in the query sequence as well.
2. Compare secondary structure predictions between query and hit.

OUTPUT:
Rank:
Position of the protein after sorting on DIST. In a fragment search, only one fragment per protein is shown, namely the top scoring fragment, so not all hits are printed. The top scoring protein is the top candidate for identification of the protein. However, if additional biochemical knowledge about the protein under investigation is at hand, e.g. if it is known that the protein is of cytoskeletal origin, membrane bound, an enzyme, and so on, this knowledge may be used to inspect proteins ranked below position 1 in the PROPSEARCH output. If the top scoring protein shows a high distance with low reliability, then there is still a good chance of finding the correct protein among the first 20 or so hits [for details see publication].
ID:
Protein identifier from the SwissProt or PirOnly protein database (PirOnly: sequences not included in SwissProt). SwissProt sequence identifiers contain an underscore '_'; Pir sequences are indicated by '>'.
DIST:
PROPSEARCH distance between query composition and database protein composition. Amino acid specific weights are considered.
LEN2:
Length of matching protein sequence.
POS1,POS2:
Position of fragment in sequence (relevant in fragment search only).
pI:
Calculated isoelectric point of fragment.
DE:
SwissProt DE-line or Pir explanation line.
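
The Rank rule for fragment searches (one fragment per protein, then sorting on DIST) can be sketched as below. This is an illustration, not the server's code: the function name and the dictionary keys are hypothetical, and "top scoring" is taken to mean the lowest DIST, in line with the lowest distances being reported on top.

```python
def best_fragment_per_protein(fragment_hits):
    """Keep the lowest-distance fragment per protein and assign ranks.

    fragment_hits is an iterable of dicts with the hypothetical keys
    "id", "dist", "pos1" and "pos2". Returns a list of new dicts sorted by
    "dist", each with an added 1-based "rank" key.
    """
    best = {}
    for hit in fragment_hits:
        current = best.get(hit["id"])
        if current is None or hit["dist"] < current["dist"]:
            best[hit["id"]] = hit
    ordered = sorted(best.values(), key=lambda h: h["dist"])
    return [dict(hit, rank=rank) for rank, hit in enumerate(ordered, start=1)]
```

If a protein matches with two fragments at distances 9.1 and 7.4, only the fragment at 7.4 is kept, and its POS1 and POS2 values are the ones reported.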