DOMO: a database of aligned protein domains.

The database is accessible here.
Jérôme GRACY
jgracy@cbs.cnrs.fr
CBS, 29 rue de Navacelles, 34090 Montpellier

Patrick ARGOS
argos@embl-heidelberg.de
EMBL, Meyerhof Strasse 1, Postfach 10.2209, D-69012 Heidelberg, Germany.


A major issue in protein classification is the decomposition of biomolecules into their constituent structural domains, typically defined as independent, globular folding units within a three-dimensional protein structure (1). At the sequence level, accurately delineating the boundaries of homologous protein domains is a prerequisite for their subsequent multiple sequence alignment. Tertiary structural data that could guide a visual determination of such domain boundaries are missing for the vast majority of protein families. For this reason, although many motif (2), block (3), and full-sequence alignment (4) databases are available, only two databases of aligned domains (5,6) have so far been constructed by a fully automated process using sequence information alone.

Here, we describe DOMO, a new database of 8877 multiple sequence alignments comprising 99058 protein domains and repeating sequence regions extracted from 83054 non-redundant protein sequences contained in the SWISS-PROT (7) and PIR (8) databases. The domain boundaries and alignments were inferred by a fully automated analysis process involving the detection and subsequent clustering of amino acid sequence similarities, followed by delineation of the domain boundaries and then multiple sequence alignment of the related protein segments (9,10). The domain boundaries were not inferred from three-dimensional data but exclusively from the relative positions of homologous segment pairs within the same protein (repeats) or within homologous proteins at different distances from their respective N- or C-termini. The completeness and accuracy of the protein classifications, the correctness of the domain boundaries, and the quality of the multiple sequence alignments have been shown to be greatly improved in DOMO compared with other databases (9,10).
 
 
Database format
Each entry of the database corresponds to one family of homologous domains. Its major fields provide information about the related proteins, their functional families, domain decomposition, multiple sequence alignment, conserved residues, and evolutionary classification tree.

The fields, identified from top to bottom of Figure 1 below, contain information from left to right according to the following format:

            + indicates at least 85% positively charged amino acids (K, R, or H).
            - indicates at least 85% negatively charged amino acids (E or D).
            * indicates at least 85% small amino acids (G, P, A, S, T, or N).
            @ indicates at least 85% aromatic amino acids (F, Y, W, or H).
            = indicates at least 85% hydrophobic amino acids (C, V, I, L, M, F, Y, W, G, A).
            # indicates at least 85% hydrophilic amino acids (E, D, N, Q, S, T, G, K, R, H).
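
The symbol legend above can be read as a set of per-column tests. The following is a minimal sketch of such a test, not the procedure actually used to build DOMO: the function name `column_classes` is hypothetical, the choice to ignore gap characters when computing the percentage is an assumption, and because the text does not say which symbol wins when several classes qualify, the sketch simply returns all of them.

```python
# Residue classes copied from the symbol legend above.
CLASSES = {
    "+": set("KRH"),
    "-": set("ED"),
    "*": set("GPASTN"),
    "@": set("FYWH"),
    "=": set("CVILMFYWGA"),
    "#": set("EDNQSTGKRH"),
}


def column_classes(column, threshold=0.85):
    """Return every legend symbol whose residue class covers at least
    `threshold` of the residues in one alignment column.

    Assumption: gap characters ('.' and '-') are ignored, so the
    percentage is taken over the residues actually present.
    """
    residues = [c for c in column.upper() if c.isalpha()]
    if not residues:
        return []
    return [
        symbol
        for symbol, members in CLASSES.items()
        if sum(r in members for r in residues) / len(residues) >= threshold
    ]
```

For example, a column containing only K and R satisfies both the positively charged and the hydrophilic class, so both symbols are returned; how DOMO chooses between them is not stated here.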
 
In the particular example of Figure 1 below, the multiple alignment includes 16 apple domains: four self-repeats from each of four homologous serine proteases. Note that the aligned fragments, each corresponding to one apple domain, were automatically delineated and spliced from the complete sequences before their multiple alignment.

Figure 1: Example DOMO entry for the APPLE domain family.
 
id DM00800 APPLE                                                        88 aa.  16 dom.(16)   4 seq.(4)
#
kw SERINE PROTEASE TRYPSIN HISTIDINE
#
#  access dtb identifier families #do description
sq P14272 SWP KAL_RAT    ABCDEF--   4 PLASMA KALLIKREIN PRECURSOR.
sq P26262 SWP KAL_MOUSE  ABCDEF--   4 PLASMA KALLIKREIN PRECURSOR.
sq P03952 SWP KAL_HUMAN  ABCD--GH   4 PLASMA KALLIKREIN PRECURSOR.
sq P03951 SWP FA11_HUMAN ABCDEFGH   4 COAGULATION FACTOR XI PRECURSOR.
#
#  f description                                  access  #seq / #seq prosite
fa A APPLE                                        DM00800    4 /    4
fa B APPLE DOMAIN                                            4 /    4 PS00495
fa C SERINE PROTEASES, TRYPSIN FAMILY, HISTIDINE             4 /  203 PS00134
fa D SERINE PROTEASES, TRYPSIN FAMILY, SERINE                4 /  202 PS00135
fa E COAGULATION FACTOR XI                                   3 /    3
fa F TRYPSIN                                      DM00018    3 /  236
fa G APPLE DOMAIN                                            2 /    2 PS00495
fa H SERINE PROTEASES, TRYPSIN FAMILY, HISTIDINE             2 /   93 PS00134
#
#  access :  pos access   pos ...
do P14272 :   37 DM00800  126 DM00800  216 DM00800  307 DM00800  391 DM00018  626 ???????  639
do P26262 :   37 DM00800  126 DM00800  216 DM00800  307 DM00800  391 DM00018  626 ???????  639
do P03952 :   37 DM00800  126 DM00800  216 DM00800  307 DM00800  392 DM00018  626 ???????  639
do P03951 :   36 DM00800  125 DM00800  215 DM00800  306 DM00800  389 DM00018  623 ???????  626
#
#            100%  90%  80%  70%  60%  50%  40%  30%  20%  10%   0%
#  access dom  |____|____|____|____|____|____|____|____|____|____|
tr P14272   4  _________________________________
tr P26262   4  ______|     |             |    | |
tr P03952   4  ____________|             |    | |
tr P03951   4  __________________________|    | |
tr P03952   2  _______________________________| |
tr P14272   2  _________________|     |         |
tr P26262   2  ______|                |         |
tr P03951   2  _______________________|         |
tr P03951   1  _________________________________|
tr P26262   1  _____________________|         |
tr P14272   1  ____|        |                 |
tr P03952   1  _____________|                 |
tr P03951   3  _______________________________|
tr P03952   3  ______________________|
tr P26262   3  ___________|
tr P14272   3  _____|
#
#  access dom  beg                                                                                          end
al P14272   4  307 [LNATFVQGA DACQETCTKT IRCQFFTYSL LPQDCKAEGC K.CSLRLSTD GSPTRITYEA QGSSGYSLRL CKVVESSDCT  384
al P26262   4  307 [LNVTFVQGA DVCQETCTKT IRCQFFIYSL LPQDCKEEGC K.CSLRLSTD GSPTRITYGM QGSSGYSLRL CKLVDSPDCT  384
al P03952   4  307 [LNVTFVKGV NVCQETCTKM IRCQFFTYSL LPEDCKEEKC K.CFLRLSMD GSPTRIAYGT QGSSGYSLRL CNTGDNSVCT  384
al P03951   4  306 [LDIVAAKSH EACQKLCTNA VRCQFFTYTP AQASCNEGKG K.CYLKLSSN GSPTKILHGR GGISGYTLRL CKM..DNECT  381
al P03952   2  126 [FNVSKVSSV EECQKRCTNN IRCQFFSYAT QTFHKAEYRN N.CLLKYSPG GTPTAIKVLS NVESGFSLKP CALS.EIGCH  202
al P14272   2  126 [FNISKTDSI EECQKLCTNN IHCQFFTYAT KAFHRPEYRK S.CLLKRSSS GTPTSIKPVD NLVSGFSLKS CALS.EIGCP  202
al P26262   2  126 [FNISKTDNI EECQKLCTNN FHCQFFTYAT SAFYRPEYRK K.CLLKHSAS GTPTSIKSAD NLVSGFSLKS CALS.EIGCP  202
al P03951   2  125 [YNSSVAKSA QECQERCTDD VHCHFFTYAT RQFPSLEHRN I.CLLKHTQT GTPTRITKLD KVVSGFSLKS CALS.NLACI  201
al P03951   1   36 -[TTVFTPSA KYCQVVCTYH PRCLLFTFTA ESPSEDPTRW FTCVLKDSVT ET.LPRVNRT AAISGYSFKQ CSHQISA.CN  111
al P26262   1   37 -[AAIYTPDA QYCQKMCTFH PRCLLFSFLA VTPPKETNKR FGCFMKESIT GT.LPRIHRT GAISGHSLKQ CGHQISA.CH  112
al P14272   1   37 -[AAIYTPDA QHCQKMCTFH PRCLLFSFLA VSPTKETDKR FGCFMKESIT GT.LPRIHRT GAISGHSLKQ CGHQLSA.CH  112
al P03952   1   37 -[ASMYTPNA QYCQMRCTFH PRCLLFSFLP ASSINDMEKR FGCFLKDSVT GT.LPKVHRT GAVSGHSLKQ CGHQISA.CH  112
al P03951   3  215 [IDSVMAPDA FVCGRICTHH PGCLFFTFFS QEWPKESQRN L.CLLKTSES GLPSTRIKKS KALSGFSLQS CRHSIPVFCH  292
al P03952   3  216 [VARVLTPDA FVCRTICTYH PNCLFFTFYT NVWKIESQRN V.CLLKTSES GTPSSSTPQE NTISGYSLLT CKRTLPEPCH  293
al P26262   3  216 [VSQVITPDA FVCRTICTFH PNCLFFTFYT NEWETESQRN V.CFLKTSKS GRPSPPIPQE NAISGYSLLT CRKTRPEPCH  293
al P14272   3  216 [VSQVVTPDA FVCRTVCTFH PNCLFFTFYT NEWETESQRN V.CFLKTSKS GRPSPPIIQE NAVSGYSLFT CRKARPEPCH  293
co               1  =*===t**a #=Cq#=CT## =rC#fFty=* #*@#####r# =.C=lk#s#* gtpt*i=##* *==SG@slk* C*#*.*=.C#   80
#
al P14272   4  385 TKINAR]--- ---------- ---------- ---------- ---------- ---------- ---------- ----------  390
al P26262   4  385 TKINAR]--- ---------- ---------- ---------- ---------- ---------- ---------- ----------  390
al P03952   4  385 TKTSTRI]-- ---------- ---------- ---------- ---------- ---------- ---------- ----------  391
al P03951   4  382 TKIKPRI]-- ---------- ---------- ---------- ---------- ---------- ---------- ----------  388
al P03952   2  203 MNIFQHLAFS DVD]------ ---------- ---------- ---------- ---------- ---------- ----------  215
al P14272   2  203 MDIFQHFAFA DLN]------ ---------- ---------- ---------- ---------- ---------- ----------  215
al P26262   2  203 MDIFQHSAFA DLN]------ ---------- ---------- ---------- ---------- ---------- ----------  215
al P03951   2  202 RDIFPNTVFA DSN]------ ---------- ---------- ---------- ---------- ---------- ----------  214
al P03951   1  112 KDIYVDLDMK GIN]------ ---------- ---------- ---------- ---------- ---------- ----------  124
al P26262   1  113 RDIYKGLDMR GSN]------ ---------- ---------- ---------- ---------- ---------- ----------  125
al P14272   1  113 QDIYEGLDMR GSN]------ ---------- ---------- ---------- ---------- ---------- ----------  125
al P03952   1  113 RDIYKGVDMR GVN]------ ---------- ---------- ---------- ---------- ---------- ----------  125
al P03951   3  293 SSFYHDTDFL GEE]------ ---------- ---------- ---------- ---------- ---------- ----------  305
al P03952   3  294 SKIYPGVDFG GEE]------ ---------- ---------- ---------- ---------- ---------- ----------  306
al P26262   3  294 SKIYSGVDFE GEE]------ ---------- ---------- ---------- ---------- ---------- ----------  306
al P14272   3  294 FKIYSGVAFE GEE]------ ---------- ---------- ---------- ---------- ---------- ----------  306
co              81 ##i@##=.f. #.#                                                                            93
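
The `do` lines of Figure 1 alternate sequence positions and domain identifiers. As a rough illustration of how they might be read programmatically, the sketch below turns one `do` line into a list of segments. The function name `parse_do_line` is hypothetical, and the interpretation that each segment ends one residue before the next listed position is inferred from the `al` lines of the figure (for P14272, domain 4 starts at 307 and ends at 390, and the next `do` position is 391), not from a format specification.

```python
def parse_do_line(line):
    """Split a DOMO 'do' line into (accession, segments).

    Each segment is a tuple (domain_id, start, end) with inclusive
    1-based positions. Assumption: a segment ends one residue before
    the next listed position.
    """
    fields = line.split()
    if len(fields) < 6 or fields[0] != "do" or fields[2] != ":":
        raise ValueError("not a 'do' line: %r" % line)
    accession = fields[1]
    rest = fields[3:]
    positions = [int(p) for p in rest[0::2]]  # 37, 126, 216, ...
    domains = rest[1::2]                      # DM00800, DM00800, ...
    segments = [
        (domain, start, following - 1)
        for domain, start, following in zip(domains, positions, positions[1:])
    ]
    return accession, segments


line = ("do P14272 :   37 DM00800  126 DM00800  216 DM00800  307 DM00800"
        "  391 DM00018  626 ???????  639")
accession, segments = parse_do_line(line)
for domain, start, end in segments:
    print(accession, domain, start, end)
```

Segments labelled `???????` are kept as they appear in the entry; the sketch does not attempt to interpret them.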


Accessing DOMO through the World Wide Web
DOMO can be accessed through the sequence retrieval system SRS (11) here.

It provides a form-based query manager able to retrieve familial domain alignments by their identifiers, any included sequence accession numbers, or keywords. Further, the query results can be linked to other sequence databases to collect complementary information on the relevant proteins or their families.

Moreover, the domains from DOMO have been compiled in FASTA format to permit a fast search for domains homologous to a query protein using BLAST2 (19), followed by a multiple sequence alignment using CLUSTALW (20).
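
To illustrate what such a FASTA compilation could look like, the sketch below rebuilds ungapped domain sequences from the `al` lines of an entry such as Figure 1 and writes them as FASTA records. It is an assumption-laden example rather than the procedure used for DOMO: the function names and the `>accession_domain` header layout are hypothetical, and every non-letter character in the alignment columns ('.', '-', '[' and ']') is treated as gap or boundary markup and discarded.

```python
def domain_sequences(entry_text):
    """Collect ungapped domain sequences from the 'al' lines of an entry.

    Returns a dict keyed by (accession, domain_number). Alignment blocks
    belonging to the same domain are concatenated in order of appearance.
    """
    sequences = {}
    for line in entry_text.splitlines():
        fields = line.split()
        if len(fields) < 6 or fields[0] != "al":
            continue
        key = (fields[1], int(fields[2]))
        # fields[3] is the start position and fields[-1] the end position;
        # everything in between is aligned sequence.
        aligned = "".join(fields[4:-1])
        residues = "".join(c for c in aligned if c.isalpha())
        sequences[key] = sequences.get(key, "") + residues
    return sequences


def to_fasta(sequences):
    """Format the sequences as FASTA text (hypothetical header layout)."""
    return "".join(
        ">%s_%d\n%s\n" % (accession, domain, residues)
        for (accession, domain), residues in sequences.items()
    )
```

Applied to the two `al` blocks of P14272 domain 4 in Figure 1, this yields a single 84-residue record, matching the span 307 to 390 given in the entry.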

The different SRS steps of this analysis are described below:


The sequence analysis environment (domain database, query manager, homology search, and multiple sequence alignment tools) makes it straightforward to determine the domain arrangement of a query protein sequence, its evolutionary relationships with related sequences, and its key structural and functional amino acids.
 
 
References