DOMO: a database of aligned protein domains.
Patrick ARGOS
argos@embl-heidelberg.de
EMBL,
Meyerhof Strasse 1, Postfach 10.2209, D-69012 Heidelberg, Germany.
A major issue in protein classification is the decomposition of biomolecules into their constituent structural domains, typically defined as independent, globular folding units within a three-dimensional protein structure (1). At the sequence level, accurate delineation of the boundaries of homologous protein domains is a prerequisite for their subsequent multiple sequence alignment. Tertiary structural data, which could guide a visual determination of such domain boundaries, are missing for the vast majority of protein families. For this reason, although many motif (2), block (3), and full-sequence alignment (4) databases are available, only two databases of aligned domains (5,6) have so far been constructed by a fully automated process that uses only sequence information.
Here, we describe DOMO,
a new database of 8877 multiple sequence alignments including 99058 protein
domains as well as repeating sequence regions extracted from 83054 non-redundant
molecular primary structures contained in the SWISS-PROT (7) and PIR (8)
databases. The domain boundaries and alignments have been inferred by a
fully automated analysis process involving the detection and subsequent
clustering of amino acid sequence similarities, followed by delineation
of the domain boundaries, and then multiple sequence alignment of the related
protein segments (9,10). The domain boundaries were not inferred from three-dimensional
data but exclusively from the relative positions of homologous segment
pairs within the same protein (repeats) or within homologous proteins at
different distances from their respective N- or C-termini. The completeness
and accuracy of the protein classifications, the correctness of the domain
boundaries, and the quality of the multiple sequence alignments have been
shown to be greatly improved in DOMO when compared to other databases (9,10).
Database format
Figure 1: Example DOMO entry for the APPLE domain.
id DM00800 APPLE 88 aa. 16 dom.(16) 4 seq.(4)
#
kw SERINE PROTEASE TRYPSIN HISTIDINE
#
#  access dtb identifier families #do description
sq P14272 SWP KAL_RAT    ABCDEF--   4 PLASMA KALLIKREIN PRECURSOR.
sq P26262 SWP KAL_MOUSE  ABCDEF--   4 PLASMA KALLIKREIN PRECURSOR.
sq P03952 SWP KAL_HUMAN  ABCD--GH   4 PLASMA KALLIKREIN PRECURSOR.
sq P03951 SWP FA11_HUMAN ABCDEFGH   4 COAGULATION FACTOR XI PRECURSOR.
#
#  f description                                 access  #seq / #seq prosite
fa A APPLE                                       DM00800 4 / 4
fa B APPLE DOMAIN                                        4 / 4   PS00495
fa C SERINE PROTEASES, TRYPSIN FAMILY, HISTIDINE         4 / 203 PS00134
fa D SERINE PROTEASES, TRYPSIN FAMILY, SERINE            4 / 202 PS00135
fa E COAGULATION FACTOR XI                               3 / 3
fa F TRYPSIN                                     DM00018 3 / 236
fa G APPLE DOMAIN                                        2 / 2   PS00495
fa H SERINE PROTEASES, TRYPSIN FAMILY, HISTIDINE         2 / 93  PS00134
#
#  access : pos access pos ...
do P14272 :  37 DM00800 126 DM00800 216 DM00800 307 DM00800 391 DM00018 626 ??????? 639
do P26262 :  37 DM00800 126 DM00800 216 DM00800 307 DM00800 391 DM00018 626 ??????? 639
do P03952 :  37 DM00800 126 DM00800 216 DM00800 307 DM00800 392 DM00018 626 ??????? 639
do P03951 :  36 DM00800 125 DM00800 215 DM00800 306 DM00800 389 DM00018 623 ??????? 626
#
# 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0%
# access dom |____|____|____|____|____|____|____|____|____|____|
tr P14272 4 _________________________________
tr P26262 4 ______| | | | |
tr P03952 4 ____________| | | |
tr P03951 4 __________________________| | |
tr P03952 2 _______________________________| |
tr P14272 2 _________________| | |
tr P26262 2 ______| | |
tr P03951 2 _______________________| |
tr P03951 1 _________________________________|
tr P26262 1 _____________________| |
tr P14272 1 ____| | |
tr P03952 1 _____________| |
tr P03951 3 _______________________________|
tr P03952 3 ______________________|
tr P26262 3 ___________|
tr P14272 3 _____|
#
# access dom beg end
al P14272 4 307 [LNATFVQGA DACQETCTKT IRCQFFTYSL LPQDCKAEGC K.CSLRLSTD GSPTRITYEA QGSSGYSLRL CKVVESSDCT 384
al P26262 4 307 [LNVTFVQGA DVCQETCTKT IRCQFFIYSL LPQDCKEEGC K.CSLRLSTD GSPTRITYGM QGSSGYSLRL CKLVDSPDCT 384
al P03952 4 307 [LNVTFVKGV NVCQETCTKM IRCQFFTYSL LPEDCKEEKC K.CFLRLSMD GSPTRIAYGT QGSSGYSLRL CNTGDNSVCT 384
al P03951 4 306 [LDIVAAKSH EACQKLCTNA VRCQFFTYTP AQASCNEGKG K.CYLKLSSN GSPTKILHGR GGISGYTLRL CKM..DNECT 381
al P03952 2 126 [FNVSKVSSV EECQKRCTNN IRCQFFSYAT QTFHKAEYRN N.CLLKYSPG GTPTAIKVLS NVESGFSLKP CALS.EIGCH 202
al P14272 2 126 [FNISKTDSI EECQKLCTNN IHCQFFTYAT KAFHRPEYRK S.CLLKRSSS GTPTSIKPVD NLVSGFSLKS CALS.EIGCP 202
al P26262 2 126 [FNISKTDNI EECQKLCTNN FHCQFFTYAT SAFYRPEYRK K.CLLKHSAS GTPTSIKSAD NLVSGFSLKS CALS.EIGCP 202
al P03951 2 125 [YNSSVAKSA QECQERCTDD VHCHFFTYAT RQFPSLEHRN I.CLLKHTQT GTPTRITKLD KVVSGFSLKS CALS.NLACI 201
al P03951 1  36 -[TTVFTPSA KYCQVVCTYH PRCLLFTFTA ESPSEDPTRW FTCVLKDSVT ET.LPRVNRT AAISGYSFKQ CSHQISA.CN 111
al P26262 1  37 -[AAIYTPDA QYCQKMCTFH PRCLLFSFLA VTPPKETNKR FGCFMKESIT GT.LPRIHRT GAISGHSLKQ CGHQISA.CH 112
al P14272 1  37 -[AAIYTPDA QHCQKMCTFH PRCLLFSFLA VSPTKETDKR FGCFMKESIT GT.LPRIHRT GAISGHSLKQ CGHQLSA.CH 112
al P03952 1  37 -[ASMYTPNA QYCQMRCTFH PRCLLFSFLP ASSINDMEKR FGCFLKDSVT GT.LPKVHRT GAVSGHSLKQ CGHQISA.CH 112
al P03951 3 215 [IDSVMAPDA FVCGRICTHH PGCLFFTFFS QEWPKESQRN L.CLLKTSES GLPSTRIKKS KALSGFSLQS CRHSIPVFCH 292
al P03952 3 216 [VARVLTPDA FVCRTICTYH PNCLFFTFYT NVWKIESQRN V.CLLKTSES GTPSSSTPQE NTISGYSLLT CKRTLPEPCH 293
al P26262 3 216 [VSQVITPDA FVCRTICTFH PNCLFFTFYT NEWETESQRN V.CFLKTSKS GRPSPPIPQE NAISGYSLLT CRKTRPEPCH 293
al P14272 3 216 [VSQVVTPDA FVCRTVCTFH PNCLFFTFYT NEWETESQRN V.CFLKTSKS GRPSPPIIQE NAVSGYSLFT CRKARPEPCH 293
co            1  =*===t**a #=Cq#=CT## =rC#fFty=* #*@#####r# =.C=lk#s#* gtpt*i=##* *==SG@slk* C*#*.*=.C# 80
#
al P14272 4 385 TKINAR]--- ---------- ---------- ---------- ---------- ---------- ---------- ---------- 390
al P26262 4 385 TKINAR]--- ---------- ---------- ---------- ---------- ---------- ---------- ---------- 390
al P03952 4 385 TKTSTRI]-- ---------- ---------- ---------- ---------- ---------- ---------- ---------- 391
al P03951 4 382 TKIKPRI]-- ---------- ---------- ---------- ---------- ---------- ---------- ---------- 388
al P03952 2 203 MNIFQHLAFS DVD]------ ---------- ---------- ---------- ---------- ---------- ---------- 215
al P14272 2 203 MDIFQHFAFA DLN]------ ---------- ---------- ---------- ---------- ---------- ---------- 215
al P26262 2 203 MDIFQHSAFA DLN]------ ---------- ---------- ---------- ---------- ---------- ---------- 215
al P03951 2 202 RDIFPNTVFA DSN]------ ---------- ---------- ---------- ---------- ---------- ---------- 214
al P03951 1 112 KDIYVDLDMK GIN]------ ---------- ---------- ---------- ---------- ---------- ---------- 124
al P26262 1 113 RDIYKGLDMR GSN]------ ---------- ---------- ---------- ---------- ---------- ---------- 125
al P14272 1 113 QDIYEGLDMR GSN]------ ---------- ---------- ---------- ---------- ---------- ---------- 125
al P03952 1 113 RDIYKGVDMR GVN]------ ---------- ---------- ---------- ---------- ---------- ---------- 125
al P03951 3 293 SSFYHDTDFL GEE]------ ---------- ---------- ---------- ---------- ---------- ---------- 305
al P03952 3 294 SKIYPGVDFG GEE]------ ---------- ---------- ---------- ---------- ---------- ---------- 306
al P26262 3 294 SKIYSGVDFE GEE]------ ---------- ---------- ---------- ---------- ---------- ---------- 306
al P14272 3 294 FKIYSGVAFE GEE]------ ---------- ---------- ---------- ---------- ---------- ---------- 306
co           81 ##i@##=.f. #.# 93
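As an illustration of the "do" records in Figure 1, the following Python sketch splits one record into its domain segments. It is not part of DOMO: the names Segment and parse_do_line are hypothetical, and the reading of the record is an assumption, namely that each position is the first residue of the segment named after it and that the final position only closes the last segment. That reading is suggested by the begin and end values of the "al" records for the same entry (for example, the fourth APPLE domain of P14272 ends at 390 and the next position listed is 391).

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    """One domain segment of a protein (hypothetical helper type)."""
    accession: str  # protein accession, e.g. "P14272"
    start: int      # first residue of the segment
    end: int        # last residue of the segment (assumed: next position - 1)
    domain: str     # DOMO entry id, or "???????" when no entry is assigned


def parse_do_line(line: str) -> List[Segment]:
    """Split a 'do' record into segments.

    Assumed layout: do <accession> : pos id pos id ... pos
    """
    tag, accession, colon, *fields = line.split()
    if tag != "do" or colon != ":":
        raise ValueError("not a 'do' record: %r" % line)
    # Positions and domain ids alternate and the record ends with a position.
    if len(fields) < 3 or len(fields) % 2 == 0:
        raise ValueError("unexpected number of fields: %r" % line)
    positions = [int(p) for p in fields[0::2]]
    domains = fields[1::2]
    return [
        Segment(accession, start, nxt - 1, dom)
        for start, nxt, dom in zip(positions, positions[1:], domains)
    ]


# Record copied from Figure 1.
DO_SAMPLE = (
    "do P14272 : 37 DM00800 126 DM00800 216 DM00800 307 DM00800 "
    "391 DM00018 626 ??????? 639"
)

if __name__ == "__main__":
    for seg in parse_do_line(DO_SAMPLE):
        print(seg.accession, seg.start, seg.end, seg.domain)
```

Because the record is split on whitespace, the sketch does not depend on the exact column alignment of the entry.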
Accessing DOMO through the World Wide Web
Moreover, the domains from DOMO have been compiled in FASTA format to permit a fast search for domains homologous to a query protein using BLAST2 (19), followed by a multiple sequence alignment using CLUSTALW (20).
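To make the FASTA compilation concrete, the Python sketch below turns "al" records such as those in Figure 1 into ungapped FASTA records, one per domain occurrence. It is only an illustration under stated assumptions: the function names and the header layout (accession, domain index, residue range) are hypothetical and are not the format actually distributed with DOMO. Gap symbols (".", "-") and boundary brackets are dropped by keeping letters only, and the fragments of one domain are joined across alignment blocks.

```python
from typing import Dict, Iterable, Tuple


def parse_al_line(line: str) -> Tuple[str, int, int, int, str]:
    """Parse an 'al' record: al <accession> <dom> <beg> <blocks...> <end>."""
    tokens = line.split()
    if len(tokens) < 6 or tokens[0] != "al":
        raise ValueError("not an 'al' record: %r" % line)
    accession, dom = tokens[1], int(tokens[2])
    begin, end = int(tokens[3]), int(tokens[-1])
    # Keep residues only; '.', '-', '[' and ']' are alignment decoration.
    residues = "".join(ch for ch in "".join(tokens[4:-1]) if ch.isalpha())
    return accession, dom, begin, end, residues


def al_lines_to_fasta(lines: Iterable[str], width: int = 60) -> str:
    """Join the blocks of each domain and emit FASTA text.

    The header layout '>ACCESSION_DOM BEGIN-END' is a hypothetical choice.
    """
    segments: Dict[Tuple[str, int], Tuple[int, int, str]] = {}
    for line in lines:
        if not line.startswith("al "):
            continue  # skip 'co', '#', and other record types
        acc, dom, begin, end, residues = parse_al_line(line)
        key = (acc, dom)
        if key in segments:
            first, _, seq = segments[key]
            segments[key] = (first, end, seq + residues)
        else:
            segments[key] = (begin, end, residues)
    records = []
    for (acc, dom), (begin, end, seq) in segments.items():
        body = "\n".join(seq[i:i + width] for i in range(0, len(seq), width))
        records.append(">%s_%d %d-%d\n%s" % (acc, dom, begin, end, body))
    return "\n".join(records) + "\n" if records else ""


# Two blocks of the same domain, copied from Figure 1.
AL_SAMPLE = [
    "al P14272 4 307 [LNATFVQGA DACQETCTKT IRCQFFTYSL LPQDCKAEGC "
    "K.CSLRLSTD GSPTRITYEA QGSSGYSLRL CKVVESSDCT 384",
    "al P14272 4 385 TKINAR]--- ---------- ---------- ---------- "
    "---------- ---------- ---------- ---------- 390",
]

if __name__ == "__main__":
    print(al_lines_to_fasta(AL_SAMPLE), end="")
```

A file written this way could then be indexed and searched with a local BLAST installation; the commands for that step depend on the installed version and are not shown here.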
The different SRS steps of this analysis are described below:
The sequence analysis environment (domain database, query manager, homology
search and multiple sequence alignment tools) makes it straightforward to
determine the domain arrangement of a query protein sequence, its evolutionary
relationships with related sequences, and its key structural and functional
amino acids.
References