Protein domain database: DOMO

DOMO: a database of aligned protein domains.

The database is accessible here.

Jérôme GRACY
jgracy@cbs.cnrs.fr
CBS, 29 rue de Navacelles, 34090 Montpellier

Patrick ARGOS
argos@embl-heidelberg.de
EMBL, Meyerhof Strasse 1, Postfach 10.2209, D-69012 Heidelberg, Germany.

A major issue in protein classification is the decomposition of biomolecules into their constituent structural domains, typically defined as independent, globular folding units within a three-dimensional protein structure (1). At the sequence level, accurately delineating the boundaries of homologous protein domains is a prerequisite for their subsequent multiple sequence alignment. Tertiary structural data that could guide a visual determination of such domain boundaries are missing for the vast majority of protein families. For this reason, although many motif (2), block (3), and full-sequence alignment (4) databases are available, currently only two databases of aligned domains (5,6) have been constructed by a fully automated process using only sequence information.

Here, we describe DOMO, a new database of 8877 multiple sequence alignments including 99058 protein domains as well as repeating sequence regions extracted from 83054 non-redundant molecular primary structures contained in the SWISS-PROT (7) and PIR (8) databases. The domain boundaries and alignments have been inferred by a fully automated analysis process involving the detection and subsequent clustering of amino acid sequence similarities, followed by delineation of the domain boundaries, and then multiple sequence alignment of the related protein segments (9,10). The domain boundaries were not inferred from three-dimensional data but exclusively from the relative positions of homologous segment pairs within the same protein (repeats) or within homologous proteins at different distances from their respective N- or C-termini. The completeness and accuracy of the protein classifications, the correctness of the domain boundaries, and the quality of the multiple sequence alignments have been shown to be greatly improved in DOMO when compared to other databases (9,10).

Database format

Each entry of the database corresponds to one family of homologous domains. Its major fields provide information about the related proteins, their functional families, domain decomposition, multiple sequence alignment, conserved residues, and evolutionary classification tree.

The fields, identified from top to bottom of Figure 1 below, contain information from left to right according to the following format:

id: domain identifier (DM....), domain name, average number of amino acids (aa.) in the domain, number of aligned domain segments (dom.), and number of different proteins containing the aligned segments (seq.).
kw: keywords found most frequently in the sequence descriptions.
sq: protein accession number (access), sequence database (dtb) to which the protein belongs (SWP=SWISS-PROT), protein family codes (families) whose meanings are given by the fa fields, number of domain repeats in the protein (#do), short protein description.
fa: family code (f), family name (description), associated domain identifier (access), number of proteins (#sq) in the associated domain, number of proteins (#sq) in the associated PROSITE family, PROSITE family identifier (access).
do: protein accession number (access), N-terminal sequence position of the first domain (pos), domain identifier (access), C-terminal sequence position of that domain plus one (pos), which may also be the N-terminal position of the next domain, and so on along the sequence. Sequence regions with no detected homology are indicated by question marks.
tr: protein accession number (access), number of the current domain (dom), where all domains detected in the given protein sequence have been numbered sequentially, and a line whose length indicates the identity percentage of aligned amino acids between the closest sequences of the two protein clusters related by the corresponding tree node (the scale is represented above the clustering tree).
al: protein accession number (access), domain number (dom), sequence position of the first aligned amino acid (beg), aligned sequence fragment, sequence position of the last aligned amino acid (end).
co: consensus sequence summarizing the amino acids (a lowercase letter if the conservation level is above 85% and below 100%, an uppercase letter if the residue is strictly conserved) or physico-chemical properties conserved within the corresponding alignment columns. Conserved properties are represented by the following non-alphanumeric symbols (listed in decreasing order of priority):

            + indicates at least 85% positively charged amino acids (K, R, or H).
            - indicates at least 85% negatively charged residues (E, D).
            * indicates at least 85% small amino acids (G, P, A, S, T, or N).
            @ indicates at least 85% aromatic amino acids (F, Y, W, or H).
            = indicates at least 85% hydrophobic amino acids (C, V, I, L, M, F, Y, W, G, A).
            # indicates at least 85% hydrophilic amino acids (E, D, N, Q, S, T, G, K, R, H).
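
The priority rules for the co field can be illustrated with a minimal Python sketch. It is not the DOMO implementation: the function name `consensus_symbol`, the use of "." for columns with no conserved residue or property, the counting of gap characters in the denominator, and the treatment of exactly 85% as conserved are all assumptions made for illustration.

```python
from collections import Counter

# Property classes in decreasing priority, as listed for the co field.
PROPERTY_CLASSES = [
    ("+", set("KRH")),         # positively charged
    ("-", set("ED")),          # negatively charged
    ("*", set("GPASTN")),      # small
    ("@", set("FYWH")),        # aromatic
    ("=", set("CVILMFYWGA")),  # hydrophobic
    ("#", set("EDNQSTGKRH")),  # hydrophilic
]


def consensus_symbol(column, threshold=0.85):
    """Return a consensus character for one alignment column.

    `column` is a string with one character per aligned sequence.
    Assumptions (not stated in the source): gap characters count in the
    denominator, and "." is returned when nothing is conserved.
    """
    total = len(column)
    counts = Counter(c.upper() for c in column if c.isalpha())
    if total == 0 or not counts:
        return "."
    residue, top = counts.most_common(1)[0]
    if top == total:
        return residue            # strictly conserved: uppercase
    if top / total >= threshold:
        return residue.lower()    # conserved above threshold: lowercase
    for symbol, members in PROPERTY_CLASSES:
        shared = sum(n for aa, n in counts.items() if aa in members)
        if shared / total >= threshold:
            return symbol         # first matching property wins
    return "."


print(consensus_symbol("C" * 16))        # C
print(consensus_symbol("KRKRKRKRHK"))    # +
```

Because the classes overlap (for example H is both positively charged and aromatic), the order of `PROPERTY_CLASSES` matters: the first class reaching the threshold determines the symbol.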

In the particular example of Figure 1 below, the multiple alignment includes 16 apple domains, namely four self-repeats from each of four homologous serine proteases. It should be noted that the aligned fragments, each corresponding to one apple domain, were automatically delineated and spliced from the complete sequences before their multiple alignment.
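
As an illustration of how the do field encodes domain boundaries, the following Python sketch converts one do line into a list of (domain identifier, start, end) segments. The function name `parse_do_line` and the choice to represent regions without detected homology as `None` are assumptions for this example; the input line is taken from the entry shown in Figure 1.

```python
def parse_do_line(line):
    """Parse a DOMO 'do' line into (accession, segments).

    Each segment is (domain_id, start, end) with inclusive positions.
    The position following a domain identifier is the C-terminal
    position plus one, so the end of a segment is that value minus 1.
    Regions shown as question marks get domain_id None (an assumption
    of this sketch, not a DOMO convention).
    """
    fields = line.split()
    if len(fields) < 6 or fields[0] != "do" or fields[2] != ":":
        raise ValueError("not a 'do' line: %r" % line)
    accession = fields[1]
    tokens = fields[3:]  # pos, id, pos, id, ..., pos
    if len(tokens) % 2 == 0:
        raise ValueError("expected alternating positions and identifiers")
    segments = []
    for i in range(1, len(tokens), 2):
        start = int(tokens[i - 1])
        end = int(tokens[i + 1]) - 1
        domain_id = tokens[i]
        if set(domain_id) == {"?"}:
            domain_id = None
        segments.append((domain_id, start, end))
    return accession, segments


example = ("do P14272 :   37 DM00800 126 DM00800 216 DM00800 "
           "307 DM00800 391 DM00018 626 ??????? 639")
accession, segments = parse_do_line(example)
for domain_id, start, end in segments:
    print(accession, domain_id, start, end)
```

For P14272 this yields four DM00800 segments (37-125, 126-215, 216-306, 307-390), which match the beg and end positions on the corresponding al lines of the entry.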

Figure 1: Example entry for the APPLE domain.

id DM00800 APPLE                                                        88 aa. 16 dom.(16)   4 seq.(4)
#
kw SERINE PROTEASE TRYPSIN HISTIDINE
#
# access dtb identifier families #do description
sq P14272 SWP KAL_RAT    ABCDEF--   4 PLASMA KALLIKREIN PRECURSOR.
sq P26262 SWP KAL_MOUSE  ABCDEF--   4 PLASMA KALLIKREIN PRECURSOR.
sq P03952 SWP KAL_HUMAN  ABCD--GH   4 PLASMA KALLIKREIN PRECURSOR.
sq P03951 SWP FA11_HUMAN ABCDEFGH   4 COAGULATION FACTOR XI PRECURSOR.
#
# f description                                  access #seq / #seq prosite
fa A APPLE                                        DM00800    4 /    4
fa B APPLE DOMAIN                                            4 /    4 PS00495
fa C SERINE PROTEASES, TRYPSIN FAMILY, HISTIDINE             4 /  203 PS00134
fa D SERINE PROTEASES, TRYPSIN FAMILY, SERINE                4 /  202 PS00135
fa E COAGULATION FACTOR XI                                   3 /    3
fa F TRYPSIN                                      DM00018    3 /  236
fa G APPLE DOMAIN                                            2 /    2 PS00495
fa H SERINE PROTEASES, TRYPSIN FAMILY, HISTIDINE             2 /   93 PS00134
#
# access : pos access   pos ...
do P14272 :   37 DM00800 126 DM00800 216 DM00800 307 DM00800 391 DM00018 626 ??????? 639
do P26262 :   37 DM00800 126 DM00800 216 DM00800 307 DM00800 391 DM00018 626 ??????? 639
do P03952 :   37 DM00800 126 DM00800 216 DM00800 307 DM00800 392 DM00018 626 ??????? 639
do P03951 :   36 DM00800 125 DM00800 215 DM00800 306 DM00800 389 DM00018 623 ??????? 626
#
#            100% 90% 80% 70% 60% 50% 40% 30% 20% 10%   0%
# access dom |____|____|____|____|____|____|____|____|____|____|
tr P14272   4 _________________________________
tr P26262   4 ______|     |             |    | |
tr P03952   4 ____________|             |    | |
tr P03951   4 __________________________|    | |
tr P03952   2 _______________________________| |
tr P14272   2 _________________|     |         |
tr P26262   2 ______|                |         |
tr P03951   2 _______________________|         |
tr P03951   1 _________________________________|
tr P26262   1 _____________________|         |
tr P14272   1 ____|        |                 |
tr P03952   1 _____________|                 |
tr P03951   3 _______________________________|
tr P03952   3 ______________________|
tr P26262   3 ___________|
tr P14272   3 _____|
#
# access dom beg                                                                                          end
al P14272   4 307 [LNATFVQGA DACQETCTKT IRCQFFTYSL LPQDCKAEGC K.CSLRLSTD GSPTRITYEA QGSSGYSLRL CKVVESSDCT 384
al P26262   4 307 [LNVTFVQGA DVCQETCTKT IRCQFFIYSL LPQDCKEEGC K.CSLRLSTD GSPTRITYGM QGSSGYSLRL CKLVDSPDCT 384
al P03952   4 307 [LNVTFVKGV NVCQETCTKM IRCQFFTYSL LPEDCKEEKC K.CFLRLSMD GSPTRIAYGT QGSSGYSLRL CNTGDNSVCT 384
al P03951   4 306 [LDIVAAKSH EACQKLCTNA VRCQFFTYTP AQASCNEGKG K.CYLKLSSN GSPTKILHGR GGISGYTLRL CKM..DNECT 381
al P03952   2 126 [FNVSKVSSV EECQKRCTNN IRCQFFSYAT QTFHKAEYRN N.CLLKYSPG GTPTAIKVLS NVESGFSLKP CALS.EIGCH 202
al P14272   2 126 [FNISKTDSI EECQKLCTNN IHCQFFTYAT KAFHRPEYRK S.CLLKRSSS GTPTSIKPVD NLVSGFSLKS CALS.EIGCP 202
al P26262   2 126 [FNISKTDNI EECQKLCTNN FHCQFFTYAT SAFYRPEYRK K.CLLKHSAS GTPTSIKSAD NLVSGFSLKS CALS.EIGCP 202
al P03951   2 125 [YNSSVAKSA QECQERCTDD VHCHFFTYAT RQFPSLEHRN I.CLLKHTQT GTPTRITKLD KVVSGFSLKS CALS.NLACI 201
al P03951   1   36 -[TTVFTPSA KYCQVVCTYH PRCLLFTFTA ESPSEDPTRW FTCVLKDSVT ET.LPRVNRT AAISGYSFKQ CSHQISA.CN 111
al P26262   1   37 -[AAIYTPDA QYCQKMCTFH PRCLLFSFLA VTPPKETNKR FGCFMKESIT GT.LPRIHRT GAISGHSLKQ CGHQISA.CH 112
al P14272   1   37 -[AAIYTPDA QHCQKMCTFH PRCLLFSFLA VSPTKETDKR FGCFMKESIT GT.LPRIHRT GAISGHSLKQ CGHQLSA.CH 112
al P03952   1   37 -[ASMYTPNA QYCQMRCTFH PRCLLFSFLP ASSINDMEKR FGCFLKDSVT GT.LPKVHRT GAVSGHSLKQ CGHQISA.CH 112
al P03951   3 215 [IDSVMAPDA FVCGRICTHH PGCLFFTFFS QEWPKESQRN L.CLLKTSES GLPSTRIKKS KALSGFSLQS CRHSIPVFCH 292
al P03952   3 216 [VARVLTPDA FVCRTICTYH PNCLFFTFYT NVWKIESQRN V.CLLKTSES GTPSSSTPQE NTISGYSLLT CKRTLPEPCH 293
al P26262   3 216 [VSQVITPDA FVCRTICTFH PNCLFFTFYT NEWETESQRN V.CFLKTSKS GRPSPPIPQE NAISGYSLLT CRKTRPEPCH 293
al P14272   3 216 [VSQVVTPDA FVCRTVCTFH PNCLFFTFYT NEWETESQRN V.CFLKTSKS GRPSPPIIQE NAVSGYSLFT CRKARPEPCH 293
co               1 =*===t**a #=Cq#=CT## =rC#fFty=* #*@#####r# =.C=lk#s#* gtpt*i=##* *==SG@slk* C*#*.*=.C#   80
#
al P14272   4 385 TKINAR]--- ---------- ---------- ---------- ---------- ---------- ---------- ---------- 390
al P26262   4 385 TKINAR]--- ---------- ---------- ---------- ---------- ---------- ---------- ---------- 390
al P03952   4 385 TKTSTRI]-- ---------- ---------- ---------- ---------- ---------- ---------- ---------- 391
al P03951   4 382 TKIKPRI]-- ---------- ---------- ---------- ---------- ---------- ---------- ---------- 388
al P03952   2 203 MNIFQHLAFS DVD]------ ---------- ---------- ---------- ---------- ---------- ---------- 215
al P14272   2 203 MDIFQHFAFA DLN]------ ---------- ---------- ---------- ---------- ---------- ---------- 215
al P26262   2 203 MDIFQHSAFA DLN]------ ---------- ---------- ---------- ---------- ---------- ---------- 215
al P03951   2 202 RDIFPNTVFA DSN]------ ---------- ---------- ---------- ---------- ---------- ---------- 214
al P03951   1 112 KDIYVDLDMK GIN]------ ---------- ---------- ---------- ---------- ---------- ---------- 124
al P26262   1 113 RDIYKGLDMR GSN]------ ---------- ---------- ---------- ---------- ---------- ---------- 125
al P14272   1 113 QDIYEGLDMR GSN]------ ---------- ---------- ---------- ---------- ---------- ---------- 125
al P03952   1 113 RDIYKGVDMR GVN]------ ---------- ---------- ---------- ---------- ---------- ---------- 125
al P03951   3 293 SSFYHDTDFL GEE]------ ---------- ---------- ---------- ---------- ---------- ---------- 305
al P03952   3 294 SKIYPGVDFG GEE]------ ---------- ---------- ---------- ---------- ---------- ---------- 306
al P26262   3 294 SKIYSGVDFE GEE]------ ---------- ---------- ---------- ---------- ---------- ---------- 306
al P14272   3 294 FKIYSGVAFE GEE]------ ---------- ---------- ---------- ---------- ---------- ---------- 306
co              81 ##i@##=.f. #.#                                                                            93
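
Because the alignment in an entry is printed in successive blocks, a reader processing the al lines has to join the fragments belonging to the same domain. The Python sketch below shows one way to do this. The function names are hypothetical, and the interpretation of the non-letter characters ("[" and "]" as domain boundary markers, "." as alignment gaps, "-" as padding outside the domain) is an assumption inferred from Figure 1 rather than a documented rule.

```python
def parse_al_line(line):
    """Split one 'al' line into (accession, dom, beg, fragment, end)."""
    fields = line.split()
    if len(fields) < 6 or fields[0] != "al":
        raise ValueError("not an 'al' line: %r" % line)
    accession = fields[1]
    dom = int(fields[2])
    beg = int(fields[3])
    end = int(fields[-1])
    fragment = "".join(fields[4:-1])  # drop the 10-column spacing
    return accession, dom, beg, fragment, end


def assemble_domains(lines):
    """Join the aligned fragments of each (accession, dom) across blocks."""
    domains = {}
    for line in lines:
        if not line.startswith("al "):
            continue  # skip co lines, comments, and other fields
        accession, dom, beg, fragment, end = parse_al_line(line)
        entry = domains.setdefault(
            (accession, dom), {"beg": beg, "end": end, "aligned": ""}
        )
        entry["aligned"] += fragment
        entry["end"] = end  # the last block carries the final position
    for entry in domains.values():
        # Keep letters only; brackets, dots, and dashes are assumed
        # to be markup rather than residues.
        entry["residues"] = "".join(c for c in entry["aligned"] if c.isalpha())
    return domains


lines = [
    "al P14272   4 307 [LNATFVQGA DACQETCTKT IRCQFFTYSL LPQDCKAEGC "
    "K.CSLRLSTD GSPTRITYEA QGSSGYSLRL CKVVESSDCT 384",
    "al P14272   4 385 TKINAR]--- ---------- ---------- ---------- "
    "---------- ---------- ---------- ---------- 390",
]
domain = assemble_domains(lines)[("P14272", 4)]
print(domain["beg"], domain["end"], len(domain["residues"]))
```

A simple consistency check on the result is that the number of residues equals end - beg + 1, which holds for the fourth apple domain of P14272 used here (positions 307 to 390, 84 residues).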

Accessing DOMO through the World Wide Web

DOMO can be accessed through the sequence retrieval system SRS (11) here.

It provides a form-based query manager able to retrieve familial domain alignments by their identifiers, any included sequence accession numbers, or keywords. Further, the query results can be linked to other sequence databases to collect complementary information on the relevant proteins or their families.

Moreover, the domains from DOMO have been compiled in FASTA format to permit a fast search for domains homologous to a query protein using BLAST2 (19), followed by a multiple sequence alignment using CLUSTALW (20).

The different SRS steps of this analysis are described below:

1. From the SRS "Select Libraries" page, select SWISSPROT or any other protein sequence database (but not DOMO) and continue.
2. From the "Query Form" page, formulate a sequence query and continue.
3. From the "Query Result" page, launch BLAST2.
4. From the "BLAST2 options" page, select DOMO as the databank to search, edit the sequence in the query box or leave it unchanged, and continue.
5. From the "BLAST2 Query Result" page, check the "selected" button, select some hits, and launch CLUSTALW.
6. From the "CLUSTALW options" page, check the "Align the selected hits with the query" button or leave it unchecked, and continue.
7. If the protein has multiple domains, go back to the "BLAST2 Query Result" page and align the query with the remaining hits belonging to other families.

This sequence analysis environment (domain database, query manager, homology search, and multiple sequence alignment tools) makes it straightforward to determine the domain arrangement of a query protein sequence, its evolutionary relationships with related sequences, and its key structural and functional amino acids.

References

1. Rossmann, M.G. and Argos, P. (1981) Ann. Rev. Biochem., 50, 497-532.
2. Bairoch, A., Bucher, P. and Hofmann, K. (1996) Nucleic Acids Res., 24, 189-196.
3. Henikoff, J.G. and Henikoff, S. (1996) Methods Enzymol., 266, 88-105.
4. Gonnet, G., Cohen, M.A. and Benner, S.A. (1992) Science, 256, 1443-1445.
5. Sonnhammer, E.L. and Kahn, D. (1994) Protein Sci., 3, 482-492.
6. Sonnhammer, E.L., Eddy, S.R. and Durbin, R. (1997) Proteins, 28, 405-420.
7. Bairoch, A. and Apweiler, R. (1996) Nucleic Acids Res., 24, 21-25.
8. George, D.G. et al. (1996) Nucleic Acids Res., 24, 17-20.
9. Gracy, J. and Argos, P. (1998) Bioinformatics, 14, 163-173.
10. Gracy, J. and Argos, P. (1998) Bioinformatics, 14, 174-187.
11. Etzold, T., Ulyanov, A. and Argos, P. (1996) Methods Enzymol., 266, 114-128.
12. Sayle, R.A. and Milner-White, E.J. (1995) Trends Biochem. Sci., 20, 374.
13. Frishman, D. and Argos, P. (1996) Protein Eng., 9, 133-142.
14. Levin, J.M. (1997) Protein Eng., 10, 771-776.
15. King, R.D. and Sternberg, M.J.E. (1996) Protein Sci., 5, 2298-2310.
16. Gracy, J., Chiche, L. and Sallantin, J. (1993) Biochimie, 75, 353-361.
17. Lupas, A. (1996) Trends Biochem. Sci., 21, 375-382.
18. Persson, B. and Argos, P. (1994) J. Mol. Biol., 237, 182-192.
19. Altschul, S.F. et al. (1997) Nucleic Acids Res., 25, 3389-3402.
20. Higgins, D.G., Thompson, J.D. and Gibson, T.J. (1996) Methods Enzymol., 266, 383-402.