Introduction to Molecular Biology Databases

[In: The EBI Online Manual on Molecular Biology Databases,
 Apweiler R., Lopez R., Marx B. (1999).]


Rolf Apweiler

SWISS-PROT Coordinator, EMBL Outstation – The European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, UK.

Telephone:   ++ 44 1223 494435
Fax:            ++ 44 1223 494468





Recent years have seen an explosive growth in biological data, which are often no longer published in the conventional sense but deposited in a database. Sequence data from mega-sequencing projects may not even be linked to a conventional publication. This trend, together with the need for computational analysis of the data, has made databases essential tools for biological research. The goal of this material is to describe the different molecular biology databases available to researchers. There are so many specialised databases that it is not reasonable to list the URLs of all of them, especially since this category of databases changes quickly and any list provided here would soon be outdated.
However, a WWW document that lists information sources for molecular biologists is kept constantly up to date.


Bibliographic Databases

Services that abstract the scientific literature began to make their data available in machine-readable form in the early 1960s. You should be aware that none of the abstracting services has complete coverage. The best known is "MEDLINE", now "PUBMED", which abstracts mainly the medical literature.
MEDLINE/PUBMED is best accessed through NCBI's ENTREZ.
EMBASE is a commercial product for the medical literature.
BIOSIS, the successor of the old Biological Abstracts, covers a broad biological field; the Zoological Record indexes the zoological literature.
CAB International maintains abstract databases in the fields of agriculture and parasitic diseases. AGRICOLA is for the agricultural field what MEDLINE is for the medical field. With the exception of MEDLINE/PUBMED, the bibliographic databases are only available through commercial database vendors.


Taxonomy Databases

Taxonomic databases are rather controversial, since the soundness of the taxonomic classifications made by one taxonomist will be directly questioned by the next taxonomist!
Various efforts are under way to create a taxonomy resource (e.g. "The Tree of Life" project, "Species 2000", the International Organization for Plant Information, the Integrated Taxonomic Information System, etc.). The most generally useful taxonomic database is that maintained by the NCBI. This hierarchical taxonomy is used by the Nucleotide Sequence Databases, SWISS-PROT and TrEMBL, and is curated by an informal group of experts.


Nucleotide Sequence Databases

The International Nucleotide Sequence Database Collaboration (often, though inaccurately, referred to as "GenBank") is a joint production of the nucleotide sequence database by the DDBJ (DNA Data Bank of Japan), the EBI (European Bioinformatics Institute), and the NCBI (National Center for Biotechnology Information). In Europe, the vast majority of the nucleotide sequence data produced is collected, organised and distributed by the EMBL Nucleotide Sequence Database, located at the European Bioinformatics Institute (Cambridge, UK), an Outstation of the European Molecular Biology Laboratory (EMBL) in Heidelberg, Germany. The nucleotide sequence databases are data repositories, accepting nucleic acid sequence data from the community and making it freely available. The databases strive for completeness, with the aim of recording every publicly known nucleic acid sequence. These data are heterogeneous: they vary with respect to the source of the material (e.g. genomic versus cDNA), the intended quality (e.g. finished versus single pass sequences), the extent of sequence annotation and the intended completeness of the sequence relative to its biological target (e.g. complete versus partial coverage of a gene or a genome). EMBL, NCBI and DDBJ automatically update each other every 24 hours with the new sequences they have collected or updated. The result is that they contain exactly the same information, except for sequences that have been added in the last 24 hours.

Each entry in a database must have a unique identifier: a string of letters and/or numbers that belongs to that record alone. This unique identifier, known as the accession number, can be quoted in the scientific literature, as it will never change. Because the accession number must always remain the same, another code is used to indicate the different versions that result from sequence corrections. You should therefore always take care to quote both the unique identifier and the version number when referring to records in a nucleotide sequence database.

The AC (ACcession number) line in a nucleotide sequence record lists the accession numbers associated with this entry. The accession number consists of one letter followed by five digits (X12345), or (more recently) two letters followed by six digits (XY123456).

An example of an accession number line is shown below:

AC Y00321; J05348;

An accession number is dropped from the database only when the data to which it was assigned have been completely removed from the database.
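
The accession number format described above can be checked mechanically. The following is a minimal sketch, assuming upper-case letters and the two formats named in the text; the helper names `parse_ac_line` and `is_valid_accession` are illustrative and not part of any database tool.

```python
import re

# One letter + five digits (X12345) or two letters + six digits (XY123456),
# as described in the text. Upper-case letters are an assumption.
ACCESSION_RE = re.compile(r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6})$")


def parse_ac_line(line):
    """Return the accession numbers listed on an 'AC' line, in order."""
    if not line.startswith("AC"):
        raise ValueError("not an AC line: %r" % line)
    body = line[2:].strip()
    return [token.strip() for token in body.split(";") if token.strip()]


def is_valid_accession(accession):
    """True if the string matches one of the two documented formats."""
    return ACCESSION_RE.match(accession) is not None


accessions = parse_ac_line("AC Y00321; J05348;")
print(accessions)
print([is_valid_accession(a) for a in accessions])
```

The order of the accession numbers on the line is preserved.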

The SV (Sequence Version) line contains the nucleotide sequence identifier, which allows you to recognise the sequence version of this record.

An example of a Sequence Version line is shown below:

SV AJ000012.1

The nucleotide sequence identifier has the form 'Accession.Version' (e.g. AJ000012.1). The first part is the accession number, which never changes; it is followed by a period and a version number. The accession number part is stable, but the version part is incremented whenever the sequence changes.
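
As an illustration of the 'Accession.Version' form, the sketch below splits an SV line into its two parts. It assumes the line layout shown in the example above; the function names are hypothetical.

```python
def parse_sequence_version(sv_line):
    """Split an 'SV' line into (accession, version).

    Assumes the layout 'SV <Accession>.<Version>' shown in the text.
    """
    parts = sv_line.split(None, 1)
    if len(parts) != 2 or parts[0] != "SV":
        raise ValueError("not an SV line: %r" % sv_line)
    identifier = parts[1].strip()
    # Split on the last period: accession before it, version after it.
    accession, _, version = identifier.rpartition(".")
    if not accession or not version.isdigit():
        raise ValueError("malformed identifier: %r" % identifier)
    return accession, int(version)


def same_record(identifier_a, identifier_b):
    """True if two 'Accession.Version' strings share the same accession."""
    return identifier_a.rpartition(".")[0] == identifier_b.rpartition(".")[0]


print(parse_sequence_version("SV AJ000012.1"))
print(same_record("AJ000012.1", "AJ000012.2"))
```

Comparing only the accession part identifies the record; comparing the version part as well tells you whether the sequence itself has changed.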

Although the nucleotide sequence data are checked for integrity and obvious errors by the data library staff, the quality of the data is the responsibility of the submitter. As a consequence, there are many errors in the database: many sequence entries are either mislabelled, contaminated, incompletely or erroneously annotated, or contain sequencing errors. In addition, the database is very redundant, in the sense that the same sequence from the same organism may be included many times, simply reflecting the redundancy of the original scientific reports.

Sequence-cluster databases such as UniGene and STACK (Sequence Tag Alignment and Consensus Knowledgebase) address the redundancy problem by coalescing sequences that are sufficiently similar that one may reasonably infer that they are derived from the same gene.

Several specialised sequence databases are also available. Some of these deal with particular classes of sequence, e.g.
the Ribosomal Database Project (RDP),
the HIV Sequence Database, and
IMGT, the ImMunoGeneTics database;
others focus on particular features, such as TRANSFAC for transcription factors and transcription factor binding sites, EPD (Eukaryotic Promoter Database) for promoters, and REBASE for restriction enzymes and restriction enzyme sites. GOBASE is a specialised database of organelle genomes. A database for mitochondrial genomics is mitBASE from the EBI.


Genetic Databases

For organisms of major interest to geneticists, there is a long history of conventionally published catalogues of genes or mutations. In the past few years, most of these have been made available in electronic form, and a variety of new databases have been developed. These databases vary greatly in form and content, both in the classes of data captured and in how these data are stored.

There are several databases for Escherichia coli. CGSC, the E. coli Genetic Stock Center, maintains a database of E. coli genetic information, including genotypes and reference information for the strains in the CGSC collection, gene names, properties, and linkage map, gene product information, and information on specific mutations. The E. coli Database collection (ECDC) in Giessen, Germany, maintains curated gene-based sequence records for E. coli. EcoCyc, the "Encyclopedia of E. coli Genes and Metabolism", is a database of E. coli genes and metabolic pathways.

The MIPS yeast database is an important resource for information on the yeast genome and its products.
The Saccharomyces Genome Database is another major yeast database.

MaizeDB is the database for genetic data on maize. For other plants, AGIS (Agricultural Genome Information System) provides access to many different genome databases (mostly in ACEDB format), including Chlamydomonas, cotton, alfalfa, wheat, barley, rye, rice, millet, sorghum and species of Solanaceae and trees. MENDEL is a plant-wide database for plant genes.

ACeDB is the database for genetic and molecular data concerning Caenorhabditis elegans. The database management system written for ACeDB by R Durbin and J Thierry-Mieg has proved very popular and has been used in many other species-specific databases. ACEDB (spelled with a capital ‘E’) is now the name of this database management system, which causes some confusion with the C. elegans database. The entire database can be downloaded from the Sanger Institute.

Two of the best-curated genetic databases are FlyBase, the database for Drosophila melanogaster, and the Mouse Genome Database (MGD). ZFIN, a database for another important model organism, the zebrafish Brachydanio rerio, has been implemented recently.

There are also genetic databases available for several animals of economic importance to humans. These include pigs (PIGBASE), cattle (BovGBASE), sheep (SheepBASE) and chickens (ChickBASE). In addition, Mendelian Inheritance in Animals is a database of mutant phenotypes modelled on Mendelian Inheritance in Man.
All these databases are available via the AGIS server, and most are also available from the Roslin Institute server and from the Japanese Animal Genome Database.

Two major databases for human genes and genomics exist. V McKusick's Mendelian Inheritance in Man (MIM) is a catalogue of human genes and genetic disorders and is available in an online form (OMIM) from the NCBI. The Genome Database (GDB) is the major human genome database, including both molecular and mapping data.
Both OMIM and GDB include information on genetic variation in humans, but there is also the human mutation server at the EBI, with links to the many single sequence variation databases at the EBI and to the SRS (Sequence Retrieval System) interface to many human mutation databases.
The GeneCards resource at the Weizmann Institute integrates information about human genes from a variety of databases, including GDB, OMIM, SWISS-PROT and the nucleotide sequence databases.
GENATLAS also provides a database of human genes, with links to diseases and maps.

A parasite genome database is supported by the World Health Organisation (WHO) at the EBI, covering the five ‘targets’ of its Tropical Diseases Research programme: Leishmania, Trypanosoma cruzi, African Trypanosomes, Schistosoma and Filariasis. Databases for some vectors of parasitic diseases are also available, such as AnoDB for Anopheles and AaeDB for Aedes aegypti.


Protein Sequence Databases

The protein sequence databases are the most comprehensive source of information on proteins. It is necessary to distinguish between universal databases covering proteins from all species and specialised data collections storing information about specific families or groups of proteins, or about the proteins of a specific organism. Two categories of universal protein sequence databases can be discerned: simple archives of sequence data; and annotated databases where additional information has been added to the sequence record. In the following you will find a short description of the Protein Information Resource (PIR), the oldest protein sequence database; and a more detailed description of SWISS-PROT, an annotated universal sequence database; and of TrEMBL, the supplement of SWISS-PROT, which can be classified as a computer-annotated sequence repository. There will be furthermore a discussion of the issues of completeness and redundancy, and finally some examples of specialised protein sequence collections.


The Protein Information Resource (PIR)

PIR (Barker et al., 1999) was established in 1984 by the National Biomedical Research Foundation (NBRF) as a successor of the original NBRF Protein Sequence Database, developed over a 20 year period by the late Margaret O. Dayhoff and published as the `Atlas of Protein Sequence and Structure' (Dayhoff et al., 1965; Dayhoff, 1979). Since 1988 the database has been maintained by PIR-International, a collaboration between the NBRF, the Munich Information Center for Protein Sequences (MIPS), and the Japan International Protein Information Database (JIPID).

The PIR release 60.10 (June 15, 1999) contained 131,026 entries. The database is partitioned into four sections, PIR1 (14,753 entries), PIR2 (115,383 entries), PIR3 (560 entries) and PIR4 (330 entries). Entries in PIR1 are fully classified by superfamily assignment, fully annotated and fully merged with respect to other entries in PIR1. The annotation content as well as the level of redundancy reduction varies in PIR2 entries. Many entries in PIR2 are merged, classified, and annotated. Entries in PIR3 are not classified, merged or annotated. PIR3 serves as a temporary buffer for new entries. PIR4 was created to include sequences identified as not naturally occurring or expressed, such as known pseudogenes, unexpressed ORFs, synthetic sequences, and non-naturally occurring fusion, crossover or frameshift mutations.

PIR also provides some degree of cross-referencing to other biomolecular databases by linking to the DDBJ/EMBL/GenBank nucleotide sequence databases, PDB, GDB, FlyBase, OMIM, SGD, and MGD.



Introduction. SWISS-PROT (Bairoch and Apweiler, 1999) is an annotated protein sequence database established in 1986 and maintained collaboratively by the Swiss Institute of Bioinformatics and the EMBL Outstation - The European Bioinformatics Institute (EBI). It strives to provide a high level of annotation, a minimal level of redundancy, a high level of integration with other biomolecular databases as well as extensive external documentation. Each entry in SWISS-PROT is thoroughly analysed and annotated by biologists, ensuring a high standard of annotation and maintaining the quality of the database (Apweiler et al., 1997). SWISS-PROT contains data that originate from a wide variety of organisms; release 38 (July 1999) contained around 80,000 annotated sequence entries from more than 6000 different species. However, half of the entries come from about 20 organisms that are the target of many biological studies (ranked by number of entries): Homo sapiens, Saccharomyces cerevisiae, Escherichia coli, Mus musculus, Rattus norvegicus, Bacillus subtilis, Caenorhabditis elegans, Haemophilus influenzae, Schizosaccharomyces pombe, Methanococcus jannaschii, Bos taurus, Drosophila melanogaster, Mycobacterium tuberculosis, Gallus gallus, Arabidopsis thaliana, Salmonella typhimurium, Xenopus laevis, Synechocystis sp. (strain PCC 6803), Sus scrofa, and Oryctolagus cuniculus.


A close look at a SWISS-PROT entry. A sample SWISS-PROT entry is shown in Figure 1. SWISS-PROT entries are made up of different line types, each of them beginning with a two-character line code indicative of the type of data stored in the line. There are 22 different line types in SWISS-PROT. Some line types may occur more than once in an entry, and some entries do not contain all line types. Let us have a close look at the entry in Figure 1 to explain the information found in the different lines:


ID   TNF5_HUMAN     STANDARD;      PRT;   261 AA.
AC   P29965;
DT   01-APR-1993 (Rel. 25, Created)
DT   01-APR-1993 (Rel. 25, Last sequence update)
DT   15-JUL-1999 (Rel. 38, Last annotation update)
OS   Homo sapiens (Human).
OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;
OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.

The identification line (ID) is the first line in every SWISS-PROT entry. It contains the entry name, which provides an easy way of labelling an entry. In our example, TNF5_HUMAN is the entry name for the human CD40 ligand, while P29965 is its accession number, shown in the AC (ACcession) line(s). For reasons of consistency it is sometimes necessary to change entry names from one release of the database to another. Accession numbers provide an unambiguous way to refer to sequence entries and should always be used when you need to cite a particular entry, since they never change! It sometimes happens that the AC line contains more than one accession number. In this case you should always cite the first one, the so-called "primary accession number". The three DT (DaTe) lines, which follow the AC line, show you when the entry was created, when the sequence was last updated and when the most recent annotation was added.
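
Because every line starts with a two-character line code, an entry can be grouped by line type with very little code. The following is a minimal sketch that uses the first lines of the sample entry; the function name `group_by_line_code` is illustrative, and the parsing assumes only the layout visible in the sample.

```python
from collections import defaultdict

SAMPLE = """\
ID   TNF5_HUMAN     STANDARD;      PRT;   261 AA.
AC   P29965;
DT   01-APR-1993 (Rel. 25, Created)
DT   01-APR-1993 (Rel. 25, Last sequence update)
DT   15-JUL-1999 (Rel. 38, Last annotation update)
OS   Homo sapiens (Human).
"""


def group_by_line_code(text):
    """Map each two-character line code to the list of its line contents."""
    lines = defaultdict(list)
    for raw in text.splitlines():
        if len(raw) < 2:
            continue  # skip blank lines
        code, content = raw[:2], raw[2:].strip()
        lines[code].append(content)
    return dict(lines)


entry = group_by_line_code(SAMPLE)

# The entry name is the first token of the ID line.
id_tokens = entry["ID"][0].split()
entry_name = id_tokens[0]
sequence_length = int(id_tokens[-2])

# The primary accession number is the first one on the AC line.
primary_accession = entry["AC"][0].split(";")[0].strip()

print(entry_name, primary_accession, sequence_length, len(entry["DT"]))
```

Keeping a list per line code reflects the fact that some line types, such as DT, occur more than once in an entry.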

The DE (Description) line(s) list all the names under which a particular protein is or has been known. The next line, the GN (Gene Name) line, lists the designation(s) of the protein's gene. This line can be absent if no gene name has been given, or it can be quite extensive, as can some DE lines, if multiple symbols have been assigned by different groups. The DE line also gives an indication of how well the protein has been characterised. Our example describes the protein as ‘CD40 LIGAND’. That means that this protein has been experimentally characterised as the CD40 ligand. With the increasing amount of data coming from mega-sequencing projects, you will find more and more proteins in SWISS-PROT with no experimental characterisation. These proteins can be identified through the standardised labelling of their DE lines.

When a protein exhibits extensive sequence similarity to a characterised protein and/or has the same conserved regions then the label ‘probable' is used in the DE line. It is normally followed by the full name of a protein from the same family that it matches.



The label ‘putative’ is used in the DE line of proteins that exhibit limited sequence similarity to characterised proteins. These proteins often have a conserved site (e.g. an ATP-binding site) but no other significant similarity to a characterised protein. The label is most frequently used for sequences from genome projects.



The assignment of the labels ‘probable' and ‘putative' is dependent primarily on the results of sequence similarity searches against SWISS-PROT. It is important to point out here that no specific cut-off point is used to assign a protein as ‘putative' or ‘probable', i.e. it is not the case that <50% identity = putative and >50% = probable. Let us take Q10480, a predicted Schizosaccharomyces pombe protein, as an example.

This entry has the following description line:


The FastA results show that the sequence is 47% identical over the entire length to the mitochondrial nuclease from Saccharomyces cerevisiae:

101233036 residues in 321608 sequences
statistics extrapolated from 50000 to 321410 sequences
Expectation_n fit: rho(ln(x))= 5.8023+/-0.00053; mu= 3.8850+/- 0.030;
mean_var=70.4844+/-13.963, 0's: 144 Z-trim: 31 B-trim: 1593 in 1/64
FASTA (3.2 December, 1998) function [optimised, +1/-3 matrix (15:-5)] ktup:2
join: 37, opt: 25, gap-pen: -12/ -2, width: 16 reg.-scaled
Scan time: 115.367
The best scores are: initn init1 opt z-sc E(321410)
SW:NUC1_YEAST P08466 MITOCHONDRIAL NU ( 329) 941 630 1017 1216.7 1.9e-60
initn: 941 init1: 630 opt: 1017 Z-score: 1216.7 expect() 1.9e-60
Smith-Waterman score: 1017; 47.147% identity in 333 aa overlap (1-326:1-325)

Large segments contain identical residues, the E-value (an assessment of statistical significance based on the extreme value distribution) of the alignment is highly significant, and the active site is conserved, so we tentatively classify the protein as a ‘PROBABLE MITOCHONDRIAL NUCLEASE’.
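
The figures quoted from the FastA report can be pulled out of the summary line programmatically. The sketch below parses only the ‘Smith-Waterman score’ line exactly as it appears above; the regular expression is an assumption based on that single line and is not a general FASTA output parser. As the text stresses, the identity value alone does not decide between ‘putative’ and ‘probable’.

```python
import re

SW_LINE = (
    "Smith-Waterman score: 1017; 47.147% identity "
    "in 333 aa overlap (1-326:1-325)"
)

# Pattern derived from the single summary line shown in the text.
SW_RE = re.compile(
    r"Smith-Waterman score:\s*(\d+);\s*([\d.]+)% identity in (\d+) aa overlap"
)


def parse_sw_summary(line):
    """Extract score, percent identity and overlap length from the line."""
    match = SW_RE.search(line)
    if match is None:
        raise ValueError("no Smith-Waterman summary found")
    score, identity, overlap = match.groups()
    return {
        "score": int(score),
        "identity": float(identity),
        "overlap": int(overlap),
    }


summary = parse_sw_summary(SW_LINE)
print(summary)
```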

All predicted protein sequences lacking any significant sequence similarity to characterised proteins are labelled ‘hypothetical proteins’. The majority of these cases come from the genome sequencing projects.



The next lines, the OS (Organism Species) and OC (Organism Classification) lines, describe the species from which the protein has been derived. The OS line shows the scientific name of the organism and, if existing, the common English name. The OC lines give the taxonomic tree. SWISS-PROT, as well as the DDBJ/EMBL/GenBank nucleotide sequence databases, uses the NCBI taxonomy to standardise the taxonomies of the molecular sequence databases.
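
The OS and OC lines of the sample entry can be turned into a species name and a lineage list. This is a minimal sketch under the assumption that taxa are separated by semicolons and the last OC line ends with a period, as in the sample; the function names are illustrative. Treating any trailing parenthesis on the OS line as the common name is a simplification.

```python
import re

OS_LINE = "OS   Homo sapiens (Human)."
OC_LINES = [
    "OC   Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Mammalia;",
    "OC   Eutheria; Primates; Catarrhini; Hominidae; Homo.",
]


def parse_os(os_line):
    """Return (scientific name, common name or None) from an OS line."""
    text = os_line[2:].strip().rstrip(".")
    match = re.match(r"^(.*?)\s*\((.+)\)$", text)
    if match:
        return match.group(1), match.group(2)
    return text, None


def parse_oc(oc_lines):
    """Join the OC lines and return the lineage as a list of taxa."""
    joined = " ".join(line[2:].strip() for line in oc_lines)
    return [taxon.strip() for taxon in joined.rstrip(".").split(";") if taxon.strip()]


scientific_name, common_name = parse_os(OS_LINE)
lineage = parse_oc(OC_LINES)
print(scientific_name, common_name)
print(" > ".join(lineage))
```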

A line not present in our example is the OG (OrGanelle) line. This line is used to indicate in what organelle or extrachromosomal element the gene is encoded.


OG   Chloroplast.

The next part of our sample entry contains various references:

RN   [1]
RX   MEDLINE; 93076854.
RT   "Cloning of TRAP, a ligand for CD40 on human T cells.";
RL   Eur. J. Immunol. 22:3191-3194(1992).
.. 6 references omitted
RN   [7]
RX   MEDLINE; 96131874.
RT   "2-A crystal structure of an extracellular fragment of human CD40
RT   ligand.";
RL   Structure 3:1031-1039(1995).
RN   [8]
RX   MEDLINE; 98266353.
RT   "The role of polar interactions in the molecular recognition of CD40L
RT   with its receptor CD40.";
RL   Protein Sci. 7:1124-1135(1998).
RN   [9]
.. 6 references omitted
RN   [14]
RP   VARIANTS HIGM1 ARG-36; CYS-140; SER-231; MET-254 AND GLY-227 DEL.
RX   MEDLINE; 97295077.
RA   YATA J.-I., OCH H.D.;
RT   "Mutations of the CD40 ligand gene in 13 Japanese patients with
RT   X-linked hyper-IgM syndrome.";
RL   Hum. Genet. 99:624-627(1997).

Each reference is a block of lines starting with ‘R’: RN, RP, RC, RX, RA, RT and RL. The RN (Reference Number) line simply gives the number of the reference in an entry. The RP line provides a short indication of the work described in the publication. In the RC (Reference Comment) line you will find information such as the tissue or strain from which the protein was extracted. The references shown above have no RC lines, so some examples to illustrate the type of information you can find in RC lines:


The RX line (‘X’ for cross-reference) is used for the identifier assigned to a specific reference in a bibliographic database like Medline. The RA (Reference Author) line lists the authors of the citation, the RT (Reference Title) line contains the title, and the RL (Reference Location) line contains the conventional citation information of the reference.
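
The block structure of the reference lines lends itself to simple grouping: each RN line opens a new reference, and the following R lines belong to it. The sketch below uses two of the references from the sample entry; the function name `parse_references` and the dictionary layout are illustrative choices.

```python
REFERENCE_LINES = '''\
RN   [7]
RX   MEDLINE; 96131874.
RT   "2-A crystal structure of an extracellular fragment of human CD40
RT   ligand.";
RL   Structure 3:1031-1039(1995).
RN   [8]
RX   MEDLINE; 98266353.
RT   "The role of polar interactions in the molecular recognition of CD40L
RT   with its receptor CD40.";
RL   Protein Sci. 7:1124-1135(1998).
'''


def parse_references(text):
    """Group R* lines into one dictionary per reference block."""
    references = []
    for raw in text.splitlines():
        code, content = raw[:2], raw[2:].strip()
        if code == "RN":
            # An RN line starts a new reference block.
            references.append({"RN": int(content.strip("[]"))})
        elif code.startswith("R") and references:
            # Continuation lines of the same type are joined with a space.
            previous = references[-1].get(code)
            references[-1][code] = content if previous is None else previous + " " + content
    return references


references = parse_references(REFERENCE_LINES)
for ref in references:
    title = ref["RT"].strip('";')
    print(ref["RN"], title, "|", ref["RL"])
```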

You can see in our example that SWISS-PROT includes in addition to citations about sequencing work also references to other scientific work like 3-D structure determination, mutagenesis, and detection of post-translational modifications and variants. It is also important to know that you will find not only references to published journal articles, books and theses in SWISS-PROT, but also to information directly submitted to the database. Many scientific data are not published anymore in the conventional sense. It has already been some years since most journals have declined to publish sequence data – these are now simply deposited in the sequence databases. Sequence data from the mega-sequencing projects may not even be linked to conventional publications. There is an increasing trend for other classes of data to be published only in a database. It is important to be aware of these developments and to realise that biomolecular databases are becoming much more than a repository of data that can be found elsewhere.

Continuing in the sample entry we arrive at the following part:

CC   -!- DATABASE: NAME=CD40Lbase;
CC       NOTE=European CD40L defect database (mutation db);
CC       WWW="";
CC       FTP="".
CC   -!- DATABASE: NAME=PROW; NOTE=CD guide CD154 entry;
CC       WWW="".

The CC (Comments) lines contain various textual comments grouped under different topics. There are altogether 20 different topics. The current topics and their definitions are listed in the table below.




Topic                  Description

ALTERNATIVE PRODUCTS   Description of the existence of related protein sequence(s) produced by alternative splicing of the same gene or by the use of alternative initiation codons
CATALYTIC ACTIVITY     Description of the reaction(s) catalyzed by an enzyme
CAUTION                This topic warns you about possible errors and/or grounds for confusion
COFACTOR               Description of an enzyme cofactor
DATABASE               Description of a cross-reference to a network database/resource for a specific protein
DEVELOPMENTAL STAGE    Description of the developmental specific expression of a protein
DISEASE                Description of the disease(s) associated with a deficiency of a protein
DOMAIN                 Description of the domain structure of a protein
ENZYME REGULATION      Description of an enzyme regulatory mechanism
FUNCTION               General description of the function(s) of a protein
INDUCTION              Description of the compound(s) which stimulate the synthesis of a protein
MASS SPECTROMETRY      Reports the exact molecular weight of a protein or part of a protein as determined by mass spectrometric methods
MISCELLANEOUS          Any comment which does not belong to any of the other defined topics
PATHWAY                Description of the metabolic pathway(s) to which a protein is associated
POLYMORPHISM           Description of polymorphism(s)
PTM                    Description of a post-translational modification
SIMILARITY             Description of the similarity(ies) (sequence or structural) of a protein with other proteins
SUBCELLULAR LOCATION   Description of the subcellular location of the mature protein
SUBUNIT                Description of the quaternary structure of a protein
TISSUE SPECIFICITY     Description of the tissue specificity of a protein


The CC lines, like the DE lines, give an indication of the level of characterisation of a protein. In our example you can find experimentally verified information about the ‘FUNCTION’, the quaternary structure (‘SUBUNIT’), the ‘SUBCELLULAR LOCATION’ and the ‘TISSUE SPECIFICITY’ of the protein. You also find a description of the ‘DISEASE(s)’ known to be associated with a deficiency of the protein, a description of the ‘SIMILARITY’ of the protein with other proteins, and a cross-reference to network ‘DATABASE’ resource(s) for this specific protein.
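
The topic structure of the CC lines can be recovered by treating each ‘-!-’ marker as the start of a new comment. The following is a minimal sketch using the CC lines of the sample entry as they appear above (including their empty WWW and FTP fields). The function name `parse_comments` is illustrative, and stopping at the first dashed line is an assumption made here to skip the copyright block.

```python
CC_LINES = '''\
CC   -!- DATABASE: NAME=CD40Lbase;
CC       NOTE=European CD40L defect database (mutation db);
CC       WWW="";
CC       FTP="".
CC   -!- DATABASE: NAME=PROW; NOTE=CD guide CD154 entry;
CC       WWW="".
'''


def parse_comments(text):
    """Return a list of (topic, text) pairs from a block of CC lines."""
    comments = []
    for raw in text.splitlines():
        if not raw.startswith("CC"):
            continue
        content = raw[2:].strip()
        if content.startswith("---"):
            break  # assumed start of the copyright block
        if content.startswith("-!-"):
            # '-!- TOPIC: first line of text' opens a new comment.
            topic, _, body = content[3:].partition(":")
            comments.append([topic.strip(), body.strip()])
        elif comments:
            # Any other CC line continues the current comment.
            comments[-1][1] += " " + content
    return [tuple(comment) for comment in comments]


comments = parse_comments(CC_LINES)
for topic, body in comments:
    print(topic, "->", body)
```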

Let us look again at Q10480, the ‘PROBABLE MITOCHONDRIAL NUCLEASE’ of Schizosaccharomyces pombe, as an example of a protein without biochemical characterisation. As mentioned before, the sequence is 47% identical over its entire length to the biochemically characterised mitochondrial nuclease from Saccharomyces cerevisiae, and so it was tentatively classified as a mitochondrial nuclease. In Q10480 you can find the following CC lines:

CC       FAMILY.

The function, cofactor and subunit comments are all labelled ‘by similarity’. This indicates that they have been assigned on the basis of similarity to an existing characterised entry, in this case the mitochondrial nuclease from Saccharomyces cerevisiae. The label ‘potential’ is also used to indicate assignment by comparative analysis. In general this label is used if there is no experimental proof for the information given in a CC topic, but similarity searches or other prediction methods suggest it (in the example of Q10480, the subcellular location). If comparative analysis makes a comment highly likely, then the label ‘probable’ is used:


There is one more type of CC line, which has not yet been explained with the other CC lines, and that is the CC block with the Copyright statement:

CC   --------------------------------------------------------------------------
CC   This SWISS-PROT entry is copyright. It is produced through a collaboration
CC   between  the Swiss Institute of Bioinformatics  and the  EMBL outstation -
CC   the European Bioinformatics Institute.  There are no  restrictions on  its
CC   use  by  non-profit  institutions as long  as its content  is  in  no  way
CC   modified and this statement is not removed.  Usage  by  and for commercial
CC   entities requires a license agreement (See
CC   or send an email to
CC   --------------------------------------------------------------------------


Some background information about this very special type of CC lines:

The enormous growth in the quantity of sequence and characterisation data has made the task of producing an annotated and comprehensive protein sequence database a major challenge. While automation of some aspects of this work has made it possible to obtain significant progress in productivity, it nonetheless remains a task which is intensive in terms of human resources, and which requires an increasing amount of expertise. Recent years have shown that public funding for such an activity is not going to keep pace with its financial requirements. During the same period, the importance of high quality annotation for all kinds of life sciences research activities has grown. We are therefore faced with the paradoxical situation where no major life sciences research lab can function without a database such as SWISS-PROT, yet the existence and continued development of such a resource is in jeopardy. SWISS-PROT decided that the only feasible solution to this problem is to obtain additional funds through the payment of yearly license fees by non-academic users for access to SWISS-PROT. The copyright statement should remind commercial users of their obligation to contribute to the further development of SWISS-PROT by concluding a license agreement.

The groups in charge of the production of SWISS-PROT at EMBL and at the Swiss Institute of Bioinformatics announced in July 1998 that they would request license fees from commercial users in order to raise revenues, which would be used entirely to improve SWISS-PROT. Today, nearly a year later, we are in a position to take stock: academic access to SWISS-PROT, and its use and redistribution, has not been affected, and we are beginning to see quality improvements resulting from the extra resources raised. Indeed, even in the commercial sector, aside from requests for subscriptions to be paid, nothing has changed in the way that SWISS-PROT is made available. Companies are showing their appreciation of the scientific curation of the information in SWISS-PROT. The major pharmaceutical companies have signed, or are in the process of signing, license agreements. Smaller companies are starting to follow suit.

The producers of SWISS-PROT would have welcomed a survival plan for SWISS-PROT funded by public bodies and uncomplicated by subscriptions. However, Europe was organisationally unable to come up with the goods. The current pragmatic expedient to raise revenues has solved the problem for SWISS-PROT while avoiding commercialisation, and for that the users of SWISS-PROT are thankful.

But now back to the scientific content of the SWISS-PROT database. The next section contains the DR (Database cross-References) lines:

DR   EMBL; X68550; CAA48554.1; -.
DR   EMBL; Z15017; CAA78737.1; -.
DR   EMBL; X67878; CAA48077.1; -.
DR   EMBL; L07414; AAA35662.1; -.
DR   EMBL; D31797; BAA06599.1; -.
DR   EMBL; D31793; BAA06599.1; JOINED.
DR   EMBL; D31794; BAA06599.1; JOINED.
DR   EMBL; D31795; BAA06599.1; JOINED.
DR   EMBL; D31796; BAA06599.1; JOINED.
DR   PIR; S25684; S25684.
DR   PIR; S26694; S26694.
DR   PIR; S28017; S28017.
DR   PIR; S28852; S28852.
DR   PIR; JH0793; JH0793.
DR   PDB; 1ALY; 17-SEP-97.
DR   MIM; 308230; -.
DR   PROSITE; PS00251; TNF_1; 1.
DR   PROSITE; PS50049; TNF_2; 1.
DR   PFAM; PF00229; TNF; 1.

The DR lines link SWISS-PROT to other biomolecular databases. SWISS-PROT is currently linked to 29 different databases. In the example above you see links to 19 different entries in six different databases. The cross-references allow users to navigate to linked databases in order to retrieve part or all of the related information. The format of a DR line, except for cross-references to PROSITE (Hofmann et al., 1999), Pfam (Bateman et al., 1999), and the EMBL nucleotide sequence databases (Stoesser et al., 1999), is the following:


The database identifier is the name of the database that contains the linked entry. The primary identifier (in most cases the accession number) is the entry's primary key, while the secondary identifier complements the information given by the first identifier. The currently linked databases are listed below:



Database description

  • Nucleotide sequence database of EMBL (EBI)
  • Dictyostelium discoideum genome database
  • Escherichia coli gene-protein database (2D gel spots) (ECO2DBASE)
  • Escherichia coli K12 genome database (EcoGene)
  • Drosophila genome database (FlyBase)
  • G-protein-coupled receptor database (GCRDb)
  • HIV sequence database
  • Harefield hospital 2D gel protein databases (HSC-2DPAGE)
  • Homology-derived secondary structure of proteins database (HSSP)
  • Maize genome database (MaizeDB)
  • Maize genome 2D Electrophoresis database (Maize-2DPAGE)
  • Plant gene nomenclature database (Mendel)
  • Mouse genome database (MGD)
  • Mendelian Inheritance in Man Database (MIM)
  • Brookhaven Protein Data Bank (PDB)
  • Pfam protein domain database
  • Protein sequence database of the Protein Information Resource (PIR)
  • PROSITE protein domains and families database
  • Restriction enzyme database (REBASE)
  • Human keratinocyte 2D gel protein database from Aarhus and Ghent universities
  • Saccharomyces Genome Database (SGD)
  • Salmonella typhimurium LT2 genome database (StyGene)
  • Bacillus subtilis 168 genome database (SubtiList)
  • Human 2D Gel Protein Database from the University of Geneva (SWISS-2DPAGE)
  • The bacterial database(s) of The Institute for Genomic Research (TIGR)
  • Transcription factor database (TRANSFAC)
  • Caenorhabditis elegans genome sequencing project protein database (WormPep)
  • Yeast electrophoresis protein database (YEPD)
  • Zebrafish Information Network genome database (ZFIN)
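The general layout of the DR lines (a database identifier followed by semicolon-separated identifiers and a terminating full stop) can be illustrated with a minimal Python sketch. The function name `parse_dr` is hypothetical, and splitting on semicolons is an assumption based only on the example lines shown earlier, not an official parser.

```python
from collections import Counter

# The DR lines of the example entry, copied from the text above.
DR_LINES = """\
DR   EMBL; X68550; CAA48554.1; -.
DR   EMBL; Z15017; CAA78737.1; -.
DR   EMBL; X67878; CAA48077.1; -.
DR   EMBL; L07414; AAA35662.1; -.
DR   EMBL; D31797; BAA06599.1; -.
DR   EMBL; D31793; BAA06599.1; JOINED.
DR   EMBL; D31794; BAA06599.1; JOINED.
DR   EMBL; D31795; BAA06599.1; JOINED.
DR   EMBL; D31796; BAA06599.1; JOINED.
DR   PIR; S25684; S25684.
DR   PIR; S26694; S26694.
DR   PIR; S28017; S28017.
DR   PIR; S28852; S28852.
DR   PIR; JH0793; JH0793.
DR   PDB; 1ALY; 17-SEP-97.
DR   MIM; 308230; -.
DR   PROSITE; PS00251; TNF_1; 1.
DR   PROSITE; PS50049; TNF_2; 1.
DR   PFAM; PF00229; TNF; 1.
""".splitlines()


def parse_dr(line):
    """Split one DR line into (database identifier, list of further fields)."""
    if not line.startswith("DR   "):
        raise ValueError("not a DR line: " + line)
    body = line[5:].rstrip()
    if body.endswith("."):
        body = body[:-1]  # drop the terminating full stop only
    fields = [field.strip() for field in body.split(";")]
    return fields[0], fields[1:]


# Count the cross-references per linked database.
counts = Counter(parse_dr(line)[0] for line in DR_LINES)
for database, n in counts.items():
    print(database, n)
```

Counting the parsed lines this way reproduces the figures quoted in the text: 19 cross-references to six databases.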


The specific format for cross-references to the EMBL nucleotide sequence database is:



The secondary identifier here is the 'PROTEIN_ID', which stands for 'Protein Sequence Identifier'. It is a string stored, in nucleotide sequence entries, in a qualifier called '/protein_id', which is attached to every CDS in the nucleotide database.


FT   CDS             302..2674
FT                   /protein_id="CAA03857.1"
FT                   /db_xref="SWISS-PROT:P26345"
FT                   /gene="recA"
FT                   /product="RecA protein"


The Protein_ID consists of a stable ID portion (8 characters: 3 letters followed by 5 digits) plus a version number after a decimal point. The version number changes only when the protein sequence coded by the CDS changes, while the stable part remains unchanged.
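The structure of the Protein_ID lends itself to a small illustration. The following Python sketch splits an identifier into its stable part and its version number; the function names `split_protein_id` and `same_cds` are hypothetical and the regular expression simply encodes the description given here.

```python
import re

# Three capital letters, five digits, a decimal point, and a version number.
PROTEIN_ID_RE = re.compile(r"^([A-Z]{3}[0-9]{5})\.([0-9]+)$")


def split_protein_id(protein_id):
    """Return (stable_id, version) for a Protein_ID such as 'CAA03857.1'."""
    match = PROTEIN_ID_RE.match(protein_id)
    if match is None:
        raise ValueError("not a valid Protein_ID: " + protein_id)
    return match.group(1), int(match.group(2))


def same_cds(id_a, id_b):
    """Two Protein_IDs refer to the same CDS if their stable parts agree."""
    return split_protein_id(id_a)[0] == split_protein_id(id_b)[0]


print(split_protein_id("CAA03857.1"))
```

Two identifiers that differ only in the version number would thus point to the same CDS, with the higher version reflecting a changed protein sequence.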

The 'STATUS_IDENTIFIER' provides information about the relationship between the sequence in the SWISS-PROT entry and the CDS in the corresponding EMBL entry.

The specific format for cross-references to the PROSITE and Pfam protein domain and family databases is:


'ACCESSION_NUMBER' stands for the accession number of the PROSITE or Pfam pattern, profile or HMM entry; 'ENTRY_NAME' is the name of the entry and 'STATUS' is one of the following:


'n' is the number of hits of the pattern or profile in that particular protein sequence. The 'FALSE_NEG' status indicates that, although the pattern or profile did not detect the protein sequence, the protein is a member of that particular family or domain. The 'PARTIAL' status indicates that the pattern or profile did not detect the sequence because the sequence is incomplete and lacks the region on which the pattern/profile is based. Finally, the 'UNKNOWN' status indicates uncertainty as to whether the sequence is a member of the family or domain described by the pattern/profile. Pfam cross-references do not make use of the 'FALSE_NEG' and 'UNKNOWN' statuses.
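As a rough illustration of how a program might interpret the status field of a PROSITE or Pfam cross-reference, consider the following Python sketch. The function name `interpret_status` and the wording of the notes are assumptions made for this example; only the status values themselves come from the description above.

```python
# Notes paraphrasing the meaning of the non-numeric statuses described above.
STATUS_NOTES = {
    "FALSE_NEG": "not detected, but known to be a member of the family or domain",
    "PARTIAL": "not detected because the sequence is incomplete",
    "UNKNOWN": "membership of the family or domain is uncertain",
}


def interpret_status(status):
    """Return (number_of_hits, note); the hit count is None when not applicable."""
    if status.isdigit():
        return int(status), "detected by the pattern or profile"
    if status in STATUS_NOTES:
        return None, STATUS_NOTES[status]
    raise ValueError("unexpected status: " + status)


# Status of 'DR   PROSITE; PS00251; TNF_1; 1.' from the example entry.
print(interpret_status("1"))
print(interpret_status("PARTIAL"))
```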

After the DR lines you will find the KW (KeyWord) lines, which list relevant keywords that can be used to retrieve a specific subset of protein entries from the database:

KW   Cytokine; Transmembrane; Glycoprotein; Signal-anchor; 3D-structure;
KW   Disease mutation; Polymorphism.

We now arrive at the FT (FeaTure) lines, which describe regions or sites of interest in the sequence:


FT   DOMAIN        1     22       CYTOPLASMIC (POTENTIAL).
FT   DOMAIN       47    261       EXTRACELLULAR (POTENTIAL).
FT   DISULFID    178    218       POTENTIAL.
FT   CARBOHYD    240    240       POTENTIAL.
FT   VARIANT      36     36       M -> R (IN H1GM1).
.. 15 FT lines omitted

In general the feature table lists post-translational modifications, binding sites, active sites of an enzyme, the secondary structure, sequence conflicts and variations, signal sequences, transit peptides, propeptides, transmembrane regions, and other characteristics.
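To make the layout of the FT lines concrete, here is a minimal Python sketch that reads the five feature lines shown above. It assumes that simple whitespace splitting is sufficient, which holds for these lines but would not cover every feature line of the format; the names `parse_ft` and `is_predicted` are hypothetical.

```python
FT_LINES = [
    "FT   DOMAIN        1     22       CYTOPLASMIC (POTENTIAL).",
    "FT   DOMAIN       47    261       EXTRACELLULAR (POTENTIAL).",
    "FT   DISULFID    178    218       POTENTIAL.",
    "FT   CARBOHYD    240    240       POTENTIAL.",
    "FT   VARIANT      36     36       M -> R (IN H1GM1).",
]


def parse_ft(line):
    """Split an FT line into key, start, end and free-text description."""
    tag, key, start, end, description = line.split(None, 4)
    if tag != "FT":
        raise ValueError("not an FT line: " + line)
    return {
        "key": key,
        "from": int(start),
        "to": int(end),
        "description": description.rstrip("."),
    }


def is_predicted(feature):
    """True if the description carries a non-experimental evidence label."""
    text = feature["description"].upper()
    return any(label in text for label in ("POTENTIAL", "PROBABLE", "BY SIMILARITY"))


features = [parse_ft(line) for line in FT_LINES]
for feature in features:
    print(feature["key"], feature["from"], feature["to"], is_predicted(feature))
```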

The feature table, like the CC and DE lines, gives the user an indication of the level of characterisation of a protein. In the example above only the variants are experimentally verified. The other features have been derived from sequence similarity searches and prediction programs. If a feature is highly likely, the label 'probable' is used. The label 'potential' is also used to indicate an assignment by comparative analysis. In our example it is known that this is a glycosylated type II membrane protein containing disulfide bonds, but the correct topology of the protein, the glycosylation site(s) and the disulfide bonds have not been experimentally confirmed. The label 'potential' therefore indicates the predicted character of the information given in the features 'DOMAIN', 'DISULFID', and 'CARBOHYD'. Another label used to indicate that a feature has not been experimentally proven but only inferred through sequence analysis is 'by similarity':


This example comes again from Q10480, the 'PROBABLE MITOCHONDRIAL NUCLEASE' of Schizosaccharomyces pombe, which we have already used a few times as an example of a protein without biochemical characterisation. The label 'by similarity' indicates that this feature has been assigned due to similarity to an existing characterised entry, in this case the mitochondrial nuclease from Saccharomyces cerevisiae.

Now we are at the end of the in-depth view of a SWISS-PROT entry and arrive at the SQ (SeQuence header) line and the sequence itself:

SQ   SEQUENCE   261 AA;  29273 MW;  DC2AD21F CRC32;
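The three values carried by the SQ line (sequence length, molecular weight and CRC32 checksum) can be extracted with a short Python sketch. The regular expression is an assumption fitted to the single example line shown here, and `parse_sq` is a hypothetical name.

```python
import re

SQ_RE = re.compile(
    r"^SQ\s+SEQUENCE\s+(\d+) AA;\s+(\d+) MW;\s+([0-9A-F]{8}) CRC32;$"
)


def parse_sq(line):
    """Return (length in amino acids, molecular weight, CRC32 as hex string)."""
    match = SQ_RE.match(line.strip())
    if match is None:
        raise ValueError("not an SQ line: " + line)
    return int(match.group(1)), int(match.group(2)), match.group(3)


print(parse_sq("SQ   SEQUENCE   261 AA;  29273 MW;  DC2AD21F CRC32;"))
```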



Introduction. Sequence data are increasing tremendously owing to technological advances (such as sequencing machines), the use of new biochemical methods (such as PCR technology), and the implementation of projects to sequence complete genomes. These advances have brought along an enormous flood of sequence information. Maintaining the high quality of SWISS-PROT requires, for each entry, a time-consuming process that involves the extensive use of sequence analysis tools along with detailed curation steps by expert annotators. It is the rate-limiting step in the production of the database. Since it is vital to make new sequences available as quickly as possible without relaxing the high editorial standards of SWISS-PROT, a supplement to SWISS-PROT was created in 1996. This supplement, TrEMBL (Translation of EMBL nucleotide sequence database), consists of computer-annotated entries derived from the translation of all coding sequences (CDS) in the EMBL nucleotide sequence database, except for those already included in SWISS-PROT. TrEMBL is split into two main sections, SP-TrEMBL and REM-TrEMBL. SP-TrEMBL (SWISS-PROT TrEMBL) contains the entries that should eventually be incorporated into SWISS-PROT. REM-TrEMBL (REMaining TrEMBL) contains the entries that will not be included in SWISS-PROT. What follows is mainly a description of SP-TrEMBL. Therefore, unless otherwise specified, the word "TrEMBL" will stand for SP-TrEMBL in the rest of this chapter.

A typical TrEMBL entry is shown in Figure 2. As you can see, a TrEMBL entry looks very much like a SWISS-PROT entry, since TrEMBL follows the SWISS-PROT format and conventions as closely as possible. But there are a few necessary differences affecting the ID and DT lines.

It was already explained above that the very first line of a SWISS-PROT entry is the ID line ('ID' for identification), which is made of four different parts:


A TrEMBL 'ID' line is also made of four parts and looks like this one:


You can see that the SWISS-PROT and TrEMBL ID lines differ in their first two parts. The first part is the entry name: 'ANP_NOTCO' in the SWISS-PROT example and 'Q12757' in the TrEMBL example. The entry name used in all SP-TrEMBL entries is always the same as the accession number of the entry. The entry name used in REM-TrEMBL is the Protein_ID tagged to the corresponding CDS in the EMBL Nucleotide Sequence Database. To the right of the entry name you will find the data class: either 'PRELIMINARY' (in the TrEMBL entry) or 'STANDARD' (in the SWISS-PROT entry). The data class used in TrEMBL is always 'PRELIMINARY'. This means that the entry has been thoroughly checked by a computer, but none of the biologists curating SWISS-PROT and TrEMBL has yet had time to read the necessary papers to finalise the annotation.

There is one last difference between SWISS-PROT and TrEMBL entries, which affects the DT (DaTe) lines. The syntax and definition of the DT lines, which indicate when an entry was created and updated, are identical to those defined in SWISS-PROT, but the DT lines in TrEMBL refer to the TrEMBL release. The difference is shown in the example below.

DT lines in a SWISS-PROT entry:

DT   01-JAN-1988 (Rel. 06, Created)
DT   01-JUL-1989 (Rel. 11, Last sequence update)
DT   01-AUG-1992 (Rel. 23, Last annotation update)

DT lines in a TrEMBL entry:

DT   01-NOV-1996 (TrEMBLrel. 01, Created)
DT   01-FEB-1997 (TrEMBLrel. 02, Last sequence update)
DT   01-JUN-1998 (TrEMBLrel. 06, Last annotation update)
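A program reading both databases needs to tell the two release counters apart. The following Python sketch parses the DT lines shown above and reports which database the release number refers to; the regular expression is fitted to these six example lines only, and `parse_dt` is a hypothetical name.

```python
import re

DT_RE = re.compile(
    r"^DT\s+(\d{2}-[A-Z]{3}-\d{4}) \((Rel\.|TrEMBLrel\.) (\d+), (.+)\)$"
)


def parse_dt(line):
    """Return (date, database, release number, event) for one DT line."""
    match = DT_RE.match(line.strip())
    if match is None:
        raise ValueError("not a DT line: " + line)
    date, release_tag, release, event = match.groups()
    database = "TrEMBL" if release_tag == "TrEMBLrel." else "SWISS-PROT"
    return date, database, int(release), event


print(parse_dt("DT   01-JAN-1988 (Rel. 06, Created)"))
print(parse_dt("DT   01-NOV-1996 (TrEMBLrel. 01, Created)"))
```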

The production of TrEMBL. To understand what information you can find in TrEMBL, you need to have some basic understanding of the TrEMBL production procedures. The production of TrEMBL is illustrated in Figure 3. It starts with the translation of coding sequences (CDS) in the EMBL nucleotide sequence database. At this stage all annotation you can find in a TrEMBL entry comes from the corresponding EMBL entry. At the next stage, the post-processing phase, the redundancy in TrEMBL is reduced and additional annotation is automatically added to bring TrEMBL entries closer to SWISS-PROT standard.

All EMBL nucleotide sequence database divisions are regularly scanned for new or updated CDS features. These are translated to "TrEMBLnew" entries, which are in SWISS-PROT format. Each CDS leading to a correct translation results in one entry whose ID is the Protein_ID of the CDS. In the next step the original EMBL entries are scanned to extract relevant data, to filter it and eventually to insert it properly formatted into the TrEMBLnew entry. Only bibliographic references relevant to the given CDS are kept in the TrEMBLnew entry. This is achieved by scanning the RP (Reference Position) lines of the EMBL entry and matching with the CDS position in the sequence. The RC (Reference Comment) line is built by assigning the SWISS-PROT equivalent of the following EMBL qualifiers:

  • "/plasmid" -> "PLASMID="
  • "/strain" -> "STRAIN="
  • "/isolate" -> "STRAIN=" (2nd choice)
  • "/cultivar" -> "STRAIN=CV. "
  • "/tissue_type" -> "TISSUE="
  • "/transposon" -> "TRANSPOSON="
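The qualifier mapping listed above can be sketched in Python as follows. This is an illustration only: the function name `build_rc` is hypothetical, the rule that the first qualifier found for a given token wins (so that /strain takes precedence over /isolate and /cultivar) is an assumption, and so is the way several tokens are joined on one RC line.

```python
# EMBL qualifier -> SWISS-PROT RC token, in order of preference.
QUALIFIER_TO_TOKEN = [
    ("plasmid", "PLASMID="),
    ("strain", "STRAIN="),
    ("isolate", "STRAIN="),       # 2nd choice
    ("cultivar", "STRAIN=CV. "),
    ("tissue_type", "TISSUE="),
    ("transposon", "TRANSPOSON="),
]


def build_rc(qualifiers):
    """Build an RC line from a dict of EMBL qualifiers, or None if none apply."""
    parts = []
    used_tokens = set()
    for name, token in QUALIFIER_TO_TOKEN:
        value = qualifiers.get(name)
        if value is None:
            continue
        token_name = token.split("=")[0]
        if token_name in used_tokens:
            continue  # assumption: an earlier, preferred qualifier already filled it
        used_tokens.add(token_name)
        parts.append(token + value + ";")
    if not parts:
        return None
    return "RC   " + " ".join(parts)


print(build_rc({"strain": "K12", "isolate": "X", "tissue_type": "Liver"}))
```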

The description line (DE) comes from the /product qualifier when present; otherwise the EMBL DE line and the /gene and /note qualifiers are parsed. The EMBL DE line is only considered if the EMBL entry contains only one CDS, and it is stripped of non-pertinent information such as the organism name or phrases like 'complete CDS'. The /gene qualifier is also used for the TrEMBLnew GN line. In most cases these procedures lead to some sort of informative DE line. However, in some cases the information content of the corresponding EMBL entry is quite low and the TrEMBLnew entry ends up with DE lines providing nonsense information like:


The EMBL keywords are included in the TrEMBLnew entry, but only when they match a subset of SWISS-PROT keywords which have the same meaning. Another condition is that the EMBL entry has just one CDS so that no ambiguity is possible. Some extra keywords derived from the features and description lines are added. A subset of SWISS-PROT features can be derived from the EMBL entry features. These are:

  • SIGNAL from sig_peptide
  • TRANSIT from transit_peptide
  • CHAIN from mat_peptide
  • VARIANT from allele, variation, misc_difference and mutation
  • CONFLICT from conflict


Two examples of TrEMBLnew entries, created in the way described before, are shown in Figure 4. In addition to this information parsed into TrEMBLnew entries, data is put in the annotator's section of the entry, which is not visible to the public. This is used for further analysis both by programs and by biologists and consists of:

  • The EMBL entry description lines
  • EMBL CC lines
  • Bibliographic reference titles
  • Full CDS feature text
  • Full text of other relevant features within the CDS range
  • Number of CDS in the EMBL entry
  • The date of the last entry update
  • Information if the organism already exists in SWISS-PROT

At this stage different types of TrEMBLnew entries are put into different output files:

  • CDS with a /dbxref="SWISS-PROT" or a /dbxref="SPTREMBL" are not translated (already in SWISS-PROT + TrEMBL)
  • CDS from mhc genes -> mhc.dat
  • CDS from patent data -> patent.dat
  • CDS from immunoglobulins and t-cell receptors -> immuno.dat
  • CDS smaller than 8 amino acids -> smalls.dat
  • CDS from artificial, synthetic or chimeric genes -> synthetic.dat
  • CDS from pseudogenes -> pseudo.dat
  • remaining CDS -> stay in their relative taxonomic TrEMBLnew divisions
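The sorting of TrEMBLnew entries into output files can be pictured as a simple chain of tests. The Python sketch below follows the order of the list above; the dictionary keys (`db_xref`, `is_mhc`, `division`, and so on) and the function name `route_cds` are hypothetical, and the order in which the tests are applied is an assumption.

```python
def route_cds(cds):
    """Return the output file for a CDS, or None if it is not translated."""
    # Already in SWISS-PROT + TrEMBL: not translated at all.
    if cds.get("db_xref") in ("SWISS-PROT", "SPTREMBL"):
        return None
    if cds.get("is_mhc"):
        return "mhc.dat"
    if cds.get("is_patent"):
        return "patent.dat"
    if cds.get("is_immunoglobulin_or_tcr"):
        return "immuno.dat"
    if len(cds["translation"]) < 8:
        return "smalls.dat"
    if cds.get("is_synthetic"):
        return "synthetic.dat"
    if cds.get("is_pseudogene"):
        return "pseudo.dat"
    # Everything else stays in its taxonomic TrEMBLnew division.
    return cds["division"] + ".dat"


print(route_cds({"translation": "MKTAY", "division": "hum"}))
print(route_cds({"translation": "M" * 50, "division": "hum"}))
```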

Now the entries from the composite divisions of the EMBL database (HTG, STS, EST, and UNC) are added to their respective taxonomic TrEMBLnew divisions. Then all files are searched for entries that have recently been added to SWISS-PROT or TrEMBL and are thus missing a /dbxref="SWISS-PROT" or a /dbxref="SPTREMBL" qualifier in EMBL. These entries are removed. The entries put in the files patent.dat, immuno.dat, smalls.dat, synthetic.dat and pseudo.dat have now reached the end of their production line. They are new entries in REM-TrEMBL (REMaining TrEMBL), which contains the entries (about 44,000 in release 10) that will not be included in SWISS-PROT. This section is organised in five subsections:

  1. Immunoglobulins and T-cell receptors (file name Immuno.dat): Most REM-TrEMBL entries are immunoglobulins and T-cell receptors. The integration of further immunoglobulins and T-cell receptors into SWISS-PROT has been stopped, since SWISS-PROT does not want to add all known somatically recombined variations of these proteins to the database. At the moment there are more than 18,000 immunoglobulins and T-cell receptors in REM-TrEMBL. SWISS-PROT plans to create a specialised database dealing with these sequences as a further supplement to SWISS-PROT but will keep only a representative cross-section of these proteins in SWISS-PROT.
  2. Synthetic sequences (file name Synth.dat): Another category of data which will not be included in SWISS-PROT are synthetic sequences.
  3. Small fragments (file name Smalls.dat): A subsection with protein fragments with less than eight amino acids.
  4. Patent application sequences (file name Patent.dat): Coding sequences captured from patent applications. A thorough survey of these entries has shown that, apart from a small minority (which have already been integrated in SWISS-PROT), most of these sequences contain either erroneous data or concern artificially generated sequences outside the scope of SWISS-PROT.
  5. CDS not coding for real proteins (file name Pseudo.dat): The last subsection consists of CDS translations which are most probably not coding for real proteins.


The remaining 14 TrEMBLnew files (arc.dat, fun.dat, inv.dat, hum.dat, mam.dat, mhc.dat, org.dat, phg.dat, pln.dat, pro.dat, rod.dat, unc.dat, vrl.dat and vrt.dat) undergo further post-processing. These steps add a lot of value to the TrEMBL data. Up to this stage the annotation of the TrEMBL entries reflects the status of the annotation of the CDS features in the EMBL nucleotide sequence database. Whenever a submitter to the DDBJ/EMBL/GenBank nucleotide sequence databases has provided insufficient or wrong annotation, most of this erroneous information is parsed into the TrEMBL entries, although many filters are already in place to get rid of the most frequently occurring junk annotation.

The first post-processing step is the reduction of redundancy (O'Donovan et al., 1999). One of SWISS-PROT's leading concepts from the very beginning was to minimise the redundancy of the database by merging separate entries corresponding to different literature reports. If conflicts exist between various sequencing reports, they are indicated in the feature table of the corresponding entry. This stringent requirement of minimal redundancy applies equally to SWISS-PROT + TrEMBL. However, it will still take some time before TrEMBL has the same low level of redundancy as SWISS-PROT. TrEMBL is partially redundant against SWISS-PROT and against itself, since a significant percentage of the entries are actually additional reports of proteins already present in SWISS-PROT + TrEMBL. There are two different kinds of redundancy, which are commonplace in many sequence databases:

  1. Different literature and sequence reports of a given protein sequence.
  2. Mutations, polymorphism, variations in the sequence that are often given separate entries in the nucleotide sequence databases.

These redundancies should not be present in SWISS-PROT or TrEMBL; and thus it was necessary to find methods to manipulate the data from redundant source databases to meet the stringent standards of minimal redundancy. The objective was to recognise and eliminate the redundancy already present in the databases, and to prevent further redundancy entering the database.

A very fast and efficient method, which allows the identification of thousands of TrEMBL entries exactly matching SWISS-PROT or TrEMBL entries, is the use of the CRC32 checksum. The Cyclic Redundancy Check (CRC) calculates a nearly unique and very compact checksum for each sequence and this allows fast and accurate detection of identical sequences. At every TrEMBL release, a CRC32 check is carried out to identify identical sequences in TrEMBL and SWISS-PROT. A curator then merges these entries manually. There is also a CRC32 check of TrEMBLnew (the weekly TrEMBL updates) against TrEMBL and SWISS-PROT. The TrEMBLnew entries, which match SWISS-PROT entries, are collated for annotation by curators. TrEMBLnew entries that match TrEMBL entries are merged into one entry automatically, with the following exceptions:

  • Viral protein fragments
  • Cross-species protein fragments
  • MHC fragments
  • Plasmodium merozoites surface antigen fragments
  • Outer membrane protein fragments
  • Fusion protein fragments
  • Homeobox or Homeodomain protein fragments
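The idea behind the CRC32 check can be illustrated with a small Python sketch: sequences are first grouped by checksum, and since the checksum is only nearly unique, candidates in one group are then compared directly. The sketch uses `zlib.crc32` from the Python standard library, which is not guaranteed to yield the same values as the checksums printed in SWISS-PROT SQ lines; the function names and the toy accession numbers are hypothetical.

```python
import zlib
from collections import defaultdict


def crc32_hex(sequence):
    """CRC32 of a sequence as an 8-digit upper-case hexadecimal string."""
    return format(zlib.crc32(sequence.encode("ascii")) & 0xFFFFFFFF, "08X")


def find_identical(entries):
    """Group accessions whose sequences are identical.

    `entries` maps accession -> sequence. Returns a sorted list of sorted
    accession lists, one per set of identical sequences.
    """
    buckets = defaultdict(list)
    for accession, sequence in entries.items():
        buckets[crc32_hex(sequence)].append(accession)

    groups = []
    for accessions in buckets.values():
        # Confirm by exact comparison, in case two sequences share a checksum.
        by_sequence = defaultdict(list)
        for accession in accessions:
            by_sequence[entries[accession]].append(accession)
        groups.extend(sorted(g) for g in by_sequence.values() if len(g) > 1)
    return sorted(groups)


toy_entries = {"SP_1": "MKTAYIAK", "TR_1": "MKTAYIAK", "TR_2": "MKTAYIAR"}
print(find_identical(toy_entries))
```

Because only one compact value per sequence has to be compared in the first pass, this kind of check scales well to the weekly TrEMBLnew updates.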

The redundancy removal based on CRC32 matching eliminated the most obvious redundancy from TrEMBL. However, there are still tens of thousands of cases of potential (not easily detectable) redundancy, which need to be eliminated:

  • Exact matches of fragments (a TrEMBL entry is a fragment of a SWISS-PROT entry or vice-versa; or a TrEMBL entry is a fragment of another TrEMBL entry).
  • SWISS-PROT and TrEMBL protein entries from the same organism which should be identical but differ due to sequencing errors, variants, frameshifts etc.

The next step in reducing redundancy was to merge exact subfragments to longer length sequences by using LASSAP (Large Scale Sequence compArison Package), a software package developed by Glemet and Codani (1997) at INRIA in France. LASSAP has been modified specifically to identify redundancy in SWISS-PROT and TrEMBL. The subfragment discovery and removal is an integral part of the TrEMBL production process at each release in order to check for such subfragment redundancy within TrEMBLnew itself and then between TrEMBLnew, TrEMBL and SWISS-PROT.
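What counts as an exact subfragment can be shown with a deliberately naive Python sketch. It is not LASSAP and makes no claim about how LASSAP works; it simply tests every pair of sequences for exact containment, which takes quadratic time and would be far too slow for real databases. The function name and the toy accessions are hypothetical.

```python
def find_subfragments(entries):
    """Return sorted (fragment, container) accession pairs.

    `entries` maps accession -> sequence. A pair is reported when the first
    sequence is strictly shorter than the second and occurs in it exactly.
    """
    pairs = []
    for fragment_acc, fragment in entries.items():
        for full_acc, full in entries.items():
            if fragment_acc == full_acc:
                continue
            if len(fragment) < len(full) and fragment in full:
                pairs.append((fragment_acc, full_acc))
    return sorted(pairs)


toy_entries = {"SP_1": "MKTAYIAKQR", "TR_1": "TAYIAK", "TR_2": "GGGSSG"}
print(find_subfragments(toy_entries))
```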

To give you some indication of the scope of the task, let us have a look at TrEMBL release 10, which consisted of 244,862 entries. TrEMBL 10 supplemented SWISS-PROT release 38 (around 80,000 entries) and was produced from EMBL release 58, which contained 384,000 CDS. Of these 384,000 CDS, 120,000 were already present as sequence reports in SWISS-PROT and were excluded from the TrEMBL production process. The remaining 264,000 CDS were merged whenever possible as described above and the final result was 244,862 entries. This removal of some 19,000 entries clearly shows the value of the redundancy procedures that have already been developed and implemented. Figure 5 shows an example of an automatically merged TrEMBL entry, created by merging the two TrEMBL entries shown in Figure 4.

TrEMBL is still partially redundant against SWISS-PROT since approximately 40,000 of these entries are actually additional sequence reports of proteins already in SWISS-PROT. This remaining redundancy is more difficult to eliminate since the protein entries, which should be merged, differ due to sequencing errors, variants, frameshifts etc. Although the merging operations are automated, all merged entries are finally checked by biologists to avoid the merging of sequences from two different but highly similar genes into one entry. These time-consuming checks are the reason why it will still take some time before SWISS-PROT + TrEMBL will have the same low level of redundancy as SWISS-PROT. The biologists working on the curation of SWISS-PROT and TrEMBL are sifting through the entries where two or more teams report what should be an identical sequence and their sequences differ by one residue or more. In all these cases the curators need to decide whether these conflicting reports are really reports of the same gene. If they are sure that these reports should be merged, they need to find out the nature of the conflict: Are the differences due to strain differences, or alleles and polymorphisms, or due to disease-causing mutations or the product of alternative splicing? Or has a site been experimentally altered? Or are some of the differences only plain sequencing errors? The answers to these questions influence the way to annotate the differences.

The second post-processing step is the automated enhancement of the TrEMBL annotation to bring TrEMBL entries closer to SWISS-PROT standard. There is an increasing need for reliable automatic functional annotation to cope with the rapidly increasing amount of sequence data. Most of the current approaches are still based on sequence similarity searches against known proteins. Some groups try to collect the results of different prediction tools in a simple way, e.g. PEDANT (Frishman and Mewes, 1997) or GeneQuiz (Scharf et al., 1994). However, several pitfalls of these methods have been reported (Bork and Koonin, 1998).

A single sentence describing some properties of the unknown protein is not regarded as optimal automatic annotation of TrEMBL. Required is, as in SWISS-PROT, as much information as possible about properties like function(s) of the protein, domains and sites, catalytic activity, cofactors, regulation, induction, pathways, tissue specificity, developmental stages, subcellular location etc.

To enhance the annotation of TrEMBL, a novel method for the prediction of this information has been developed (Fleischmann et al., 1999). The principle is very simple: The method tries to find SWISS-PROT entries belonging to the same protein family as the unannotated TrEMBL entry, extracts the annotation shared by all SWISS-PROT entries, assigns this common annotation to the unannotated TrEMBL entry, and flags this annotation as annotated by similarity. The whole procedure starts with the scanning of all TrEMBL entries for PROSITE patterns. If a matching pattern is found, a three-step procedure is used to reduce the number of false positive hits. Firstly, the taxonomic classification of the TrEMBL entry must be within the known taxonomic range of the PROSITE pattern. For instance, a match of an a-priori prokaryotic pattern against a human protein is regarded as false positive and filtered out.

Secondly, the significance of the PROSITE pattern match is checked. This is done by a second check of the TrEMBL sequence with a set of secondary patterns derived from the PROSITE pattern. These secondary patterns are computed with the eMotif algorithm (Nevill-Manning et al., 1997). The PROSITE database contains a list of all SWISS-PROT proteins that are true members of the relevant protein family. For each pattern, the true positive sequences are aligned and fed into eMotif, which computes a nearly optimal set of regular expressions, based on statistical rather than biological evidence. A stringency of 10^-9 is used, so that each eMotif pattern is expected to produce, by chance, one false positive hit in 10^9 matches.

Thirdly, in cases where a protein family is characterised by more than one PROSITE signature, all signatures must be found in the entry. For instance, bacterial rhodopsins have a signature for a conserved region in helix C and another signature for the retinal binding lysine. If a TrEMBL entry matches only the helix-C-pattern, but not the retinal-binding pattern, it will not be regarded as a bacterial rhodopsin.

The raw PROSITE hits and all results of the confirmation steps are stored in a hidden section of the TrEMBL entry, but only those hits that satisfy all confirmation conditions are made publicly visible in a 'DR PROSITE' line.
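The three confirmation steps can be summarised in a short Python sketch. Everything in it is a stand-in: the data structures, the signature names 'SIG_1' and 'SIG_2', and the two regular expressions that take the place of eMotif-derived secondary patterns are invented for illustration and do not correspond to real PROSITE entries.

```python
import re


def confirm_family(entry, family):
    """Apply the three confirmation steps to the raw hits of one entry."""
    # Step 1: the entry must lie within the known taxonomic range.
    if not set(entry["taxonomy"]) & family["taxonomic_range"]:
        return False

    # Step 2: each raw hit must be backed by at least one secondary pattern.
    confirmed = set()
    for signature in entry["raw_hits"] & family["signatures"]:
        patterns = family["secondary_patterns"].get(signature, [])
        if any(re.search(p, entry["sequence"]) for p in patterns):
            confirmed.add(signature)

    # Step 3: every signature of the family must be found and confirmed.
    return family["signatures"] <= confirmed


toy_family = {
    "signatures": {"SIG_1", "SIG_2"},
    "taxonomic_range": {"Archaea"},
    "secondary_patterns": {"SIG_1": [r"WAC"], "SIG_2": [r"K[LI]M"]},
}
toy_entry = {
    "taxonomy": ["Archaea"],
    "sequence": "GGWACGGKLMGG",
    "raw_hits": {"SIG_1", "SIG_2"},
}
print(confirm_family(toy_entry, toy_family))
```

An entry that matches only one of the two signatures, or that lies outside the taxonomic range, is rejected by this filter, mirroring the bacterial rhodopsin example.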

PROSITE signatures can characterise approximately 35% of all TrEMBL entries, but only around 30% of all TrEMBL entries are true positive matches. The characterization based only on PROSITE would lead to 10-20% of false positive assignments. The confirmation steps reduce the level of characterization by nearly a third to 25%. At this stage, we achieve a level of less than 0.07% of false positive assignments.

Whenever a TrEMBL entry is recognised by these procedures as a true member of a certain protein family, annotation about the potential function, active sites, cofactors, binding sites, domains, subcellular locations is added to the entry. The main source of the annotation is compiled by extracting the annotation that is common to all SWISS-PROT entries of the relevant protein family. Other sources include manual descriptions of protein families and translations of trustworthy description libraries into SWISS-PROT wording. For example, there is a '/SITE=9,heme_iron' description for the cytochrome_b_heme pattern in PROSITE. This is translated to the correct SWISS-PROT syntax:


In other words, for every protein family, a "virtual SWISS-PROT entry" is created computationally, which is based on the specific annotation valid for all SWISS-PROT members of this family. If a new TrEMBL protein belongs to a certain family, the annotation of the virtual entry for this family is immediately transferred to this TrEMBL entry.
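The notion of a "virtual SWISS-PROT entry" amounts to taking the intersection of the annotation of all family members. The Python sketch below shows this with sets of annotation strings; the representation of annotation as plain strings, the function names, and the toy annotation values are all assumptions made for the example.

```python
def virtual_entry(member_annotations):
    """Return the annotation shared by all members of a protein family."""
    if not member_annotations:
        return set()
    common = set(member_annotations[0])
    for annotations in member_annotations[1:]:
        common &= set(annotations)
    return common


def transfer(virtual):
    """Attach the 'BY SIMILARITY' flag to each transferred annotation."""
    return sorted((annotation, "BY SIMILARITY") for annotation in virtual)


# Toy family with three annotated members.
members = [
    {"KW Cytokine", "KW Transmembrane", "CC SUBCELLULAR LOCATION: MEMBRANE"},
    {"KW Cytokine", "KW Transmembrane", "KW Glycoprotein"},
    {"KW Cytokine", "KW Transmembrane"},
]
print(transfer(virtual_entry(members)))
```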

The "virtual SWISS-PROT entries" have a far-reaching effect on TrEMBL. For example, the virtual entry for the Rubisco large chain affects 3300 TrEMBL entries. Therefore a system has been developed to decompose these virtual entries into rules, which are stored in a relational database with proper version control features.

This rule-based system allows the membership criteria for each protein family to be expressed in a formal language. Furthermore, subfamilies have been introduced to meet the SWISS-PROT standard more closely. For example, the ribosomal protein L1 family contains eukaryotes as well as prokaryotes, but the annotation added to TrEMBL entries of this family obviously depends on the taxonomic kingdom. The description reads '50S RIBOSOMAL PROTEIN L1' for prokaryotes, archaebacteria, chloroplasts, and cyanelles, and '60S RIBOSOMAL PROTEIN L10A' for nuclear encoded proteins of eukaryotes.
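The ribosomal protein L1 example can be written down as a small rule in Python. This is only an illustration of a taxonomy-dependent rule, not the formal language of the rule-based system; the function name and the category labels used as arguments are hypothetical.

```python
def ribosomal_l1_description(kingdom, organelle=None):
    """Choose the DE line for a member of the ribosomal protein L1 family."""
    if organelle in ("chloroplast", "cyanelle"):
        return "50S RIBOSOMAL PROTEIN L1"
    if kingdom in ("prokaryote", "archaebacteria"):
        return "50S RIBOSOMAL PROTEIN L1"
    if kingdom == "eukaryote" and organelle is None:
        # Nuclear encoded protein of a eukaryote.
        return "60S RIBOSOMAL PROTEIN L10A"
    return None  # no rule applies; leave the description untouched


print(ribosomal_l1_description("prokaryote"))
print(ribosomal_l1_description("eukaryote"))
print(ribosomal_l1_description("eukaryote", organelle="chloroplast"))
```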

The ENZYME database (Bairoch, 1996) is also used to generate standardised description lines for enzyme entries and to allow information such as catalytic activity, cofactors and relevant keywords to be taken from ENZYME and to be added automatically to TrEMBL entries. Additionally specialised databases like FlyBase (FlyBase Consortium, 1999) and MGD (Blake et al., 1999) are used to transfer information like the correct gene nomenclature and cross-references to these databases into TrEMBL entries. The automatic analysis and annotation of TrEMBL entries is redone and updated every TrEMBL release.

The now fully post-processed TrEMBL entry, already used as an example before, is shown in Figure 6. Although this computer-generated annotation is already enhancing the information about the sequence drastically, it is still a long way to the quality of the corresponding SWISS-PROT entry (shown in Figure 7), fully annotated by biologists.


InterPro and EDITtoTrEMBL. Currently around 20% of the TrEMBL entries get additional annotation in the way described above. There are two main reasons for this low coverage:

  1. To avoid overprediction very stringent criteria have been used.
  2. So far rules have been created for only a quarter of all PROSITE families.

A higher coverage can easily be achieved if more patterns and improved conditions are used. The procedures have been found to be stable and reliable, therefore it is planned to add more rules to the RuleBase. The patterns and conditions will be based on the characterisation of SWISS-PROT and TrEMBL entries by InterPro, the Integrated Resource of Protein Domains and Functional Sites, a joint initiative of the databases PROSITE (Hofmann et al., 1999), Pfam (Bateman et al., 1999), PRINTS (Attwood et al., 1999), ProDom (Corpet et al., 1999), and SWISS-PROT + TrEMBL (Bairoch and Apweiler, 1999). InterPro will serve as a common co-ordinate system, harmonising domain definitions, nomenclature, annotation, match-lists and hyperlinks, while the participating databases will maintain their individual approaches with all the known benefits. Up to now, it was difficult to compare hits to the different databases, as they are based on different protein database versions. This synchronisation problem has been solved. The above mentioned motif databases will continue with their release schedules, while the time between releases is covered by the EBI on a weekly basis. InterPro entries contain links to the motif databases, a general description, method specific descriptions, references, and a list of matched proteins. Every entry is classified as describing a protein family, a domain, or a post-translational modification site. A more detailed description of InterPro is given in the chapter about secondary protein sequence databases.

The addition of InterPro-based rules to the RuleBase is of huge importance, since the RuleBase is a central component of EDITtoTrEMBL (Environment for Distributed Information Transfer to TrEMBL), which was used for the first time in August 1998 for the production of TrEMBL release 7 (Möller et al., 1999). EDITtoTrEMBL aims to provide a stable framework into which different analysis programs can be integrated in a plug-and-play manner. Not only is the amount of data rapidly increasing, but the number of analysis programs enabling the prediction of functional properties of proteins is also constantly rising. EDITtoTrEMBL executes analysis programs, which are controlled by conditions that must be fulfilled to make their application meaningful. These conditions are stored in the RuleBase. EDITtoTrEMBL is implemented in Java and facilitates communication between programs using Remote Method Invocation. Figure 8 depicts the flow of data inside the framework.

Databases and applications are used as potential sources of protein annotation. Although there is a certain difference between these two sources, since databases are queried while applications are started, the system does not distinguish between them. In both cases it is necessary to provide so-called wrappers written in Java to support the physical distribution of annotation processes. These wrappers solve three tasks:

  • Reformatting of a TrEMBL entry to a valid input for a program or a query. For programs, this is usually easy since most programs either accept TrEMBL entries directly or use FASTA format. For queries, the wrapper extracts certain parts of the TrEMBL entry, which are then sent to the database.
  • Each wrapper chooses an optimal setting of parameters for each individual entry.
  • To ensure consistency with the controlled vocabulary of SWISS-PROT, the program output is transformed according to the manually curated set of rules in the RuleBase.
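The first wrapper task, reformatting an entry into FASTA input, can be sketched as follows. This is only an illustrative sketch in Python, not the Java code used in EDITtoTrEMBL: the function name, the FASTA header layout and the example entry (accession Q00000) are assumptions made for illustration. It relies only on the ID, AC, DE and SQ line types and the "//" terminator of the SWISS-PROT flat-file format.

```python
def entry_to_fasta(entry_text: str) -> str:
    """Convert one SWISS-PROT-format entry to a FASTA record.

    The header layout (">accession|ID description") is an assumption
    chosen for this sketch, not a format prescribed by the source.
    """
    entry_id = ""
    accession = ""
    description_parts = []
    sequence_parts = []
    in_sequence = False
    for line in entry_text.splitlines():
        if line.startswith("//"):          # end-of-entry marker
            break
        if in_sequence:                    # sequence lines follow the SQ line
            sequence_parts.append(line.replace(" ", ""))
        elif line.startswith("ID   "):
            entry_id = line[5:].split()[0]
        elif line.startswith("AC   ") and not accession:
            accession = line[5:].split(";")[0].strip()   # first AC is primary
        elif line.startswith("DE   "):
            description_parts.append(line[5:].strip())
        elif line.startswith("SQ   "):
            in_sequence = True
    sequence = "".join(sequence_parts)
    header = f">{accession}|{entry_id} {' '.join(description_parts)}"
    wrapped = [sequence[i:i + 60] for i in range(0, len(sequence), 60)]
    return "\n".join([header] + wrapped) + "\n"


# A made-up entry used only to exercise the function.
EXAMPLE_ENTRY = """\
ID   Q00000      PRELIMINARY;      PRT;    12 AA.
AC   Q00000;
DE   HYPOTHETICAL PROTEIN.
SQ   SEQUENCE   12 AA;
     MKTAYIAKQR QI
//
"""

fasta_record = entry_to_fasta(EXAMPLE_ENTRY)
```

Note that all annotation except the description line is dropped in the conversion, which is why FASTA is suitable as program input but not as a storage format.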

The unit of a wrapper with its associated program or database query is called an analyser. Analysers are often highly specific. The correctness of their results depends partially on certain conditions, such as the taxonomic specification. Annotation added by an analyser is often in turn exploited by other analysers executed later. EDITtoTrEMBL uses the conditions, which are stored in the RuleBase, for the execution of analysers. Dispatchers, programs that coordinate the flow of entries between different analysers, evaluate these conditions.
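The interplay of analysers, conditions and dispatchers can be illustrated with a small sketch. It is written in Python for brevity and is not the EDITtoTrEMBL implementation; the class, the function and the two example analysers are hypothetical. Each analyser carries a condition (standing in for a RuleBase condition), and the dispatcher keeps running analysers whose conditions are fulfilled, so that annotation added by one analyser can enable another.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Entry = Dict[str, str]  # simplified stand-in for a TrEMBL entry


@dataclass
class Analyser:
    name: str
    condition: Callable[[Entry], bool]  # stands in for a RuleBase condition
    run: Callable[[Entry], Entry]       # returns the annotation to add


def dispatch(entry: Entry, analysers: List[Analyser]) -> List[str]:
    """Run each analyser at most once, as soon as its condition holds.

    Passes are repeated until no further analyser becomes applicable.
    Returns the names of the executed analysers in execution order.
    """
    executed = []
    pending = list(analysers)
    progress = True
    while progress:
        progress = False
        for analyser in list(pending):
            if analyser.condition(entry):
                entry.update(analyser.run(entry))
                executed.append(analyser.name)
                pending.remove(analyser)
                progress = True
    return executed


# Hypothetical analysers: the second depends on annotation added by the first.
signal_peptide = Analyser(
    name="signal_peptide",
    condition=lambda e: e.get("taxonomy") == "Eukaryota",
    run=lambda e: {"signal": "predicted"},
)
localisation = Analyser(
    name="localisation",
    condition=lambda e: "signal" in e,
    run=lambda e: {"location": "secreted (by similarity)"},
)

entry = {"taxonomy": "Eukaryota"}
order = dispatch(entry, [localisation, signal_peptide])
```

Here the localisation analyser is listed first but runs second, because its condition only becomes true after the signal peptide analyser has added its annotation.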


SWISS-PROT + TrEMBL, a complete and non-redundant view on the protein world?

This section will focus on the use of SWISS-PROT + TrEMBL for sequence similarity searches. Searches in protein sequence databases have now become a standard research tool in the life sciences. To produce valuable results, the source databases should be comprehensive, non-redundant, well annotated and up-to-date. However, the lack of a single protein sequence database satisfying all four criteria has previously forced users to perform searches across multiple databases to avoid incomplete results. This strategy normally produces complete, but redundant results due to different versions of the same sequence report in different databases.

To improve this unsatisfying situation, many bioinformatics sites construct non-redundant databases from a number of component databases, or they use external non-redundant databases, e.g. OWL (Bleasby et al., 1994). Both strategies improve the situation for the end user considerably, but they require the time- and resource-consuming maintenance of multiple databases or the acceptance of a certain time lag between creation of an entry and its appearance in the non-redundant database. Furthermore, both strategies lead to a loss of information in the individual entry due to the diversity of database formats. While OWL preserves most information of an entry and some of its structure, the NRDB program requires a conversion of the component databases to FASTA format, which contains only one description line per entry.

SP_TR_NRDB (abbreviated SPTR) was created to overcome these limitations. SPTR provides a comprehensive, non-redundant and up-to-date protein sequence database with a high information content. The components are:

  • The weekly updated SWISS-PROT work release. It contains the last SWISS-PROT release as well as the new or updated entries.
  • The weekly updated SP-TrEMBL work release. REM-TrEMBL is not included in SP_TR_NRDB, since REM-TrEMBL contains the entries that will not be included into SWISS-PROT, e.g. synthetic sequences and pseudogenes.
  • TrEMBLnew, the weekly updates to TrEMBL.

During the weekly SP_TR_NRDB building process, all three components undergo a syntax error check and a redundancy check. Entries that are filtered out during the error check or the redundancy check are manually updated and reintegrated in the next weekly SPTR release. In the interest of regular updates, SPTR production is not delayed until the erroneous entries have been corrected. This introduces a minimal incompleteness in SPTR, but the current average of five extracted entries, or 0.002% of all entries, per weekly release is regarded as tolerable.

The redundancy check used during the weekly SPTR production ensures non-redundancy on the level of accession numbers, IDs, and Protein_IDs. Entries with sequence similarity are not merged into single entries at this stage, because this would also merge entries that should be kept separate, e.g. fragments of different viral strains. When building the quarterly major releases of the component databases, LASSAP enables the identification of entries that are candidates for merging. The TrEMBL redundancy removal procedures have already been described in detail above.
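An identifier-level redundancy check of this kind can be sketched as follows. This is a minimal illustration, not the production procedure: the dictionary layout of an entry and the function name are assumptions. An entry is extracted when any of its accession numbers, its ID, or any of its Protein_IDs has already been seen; sequence similarity is deliberately not considered.

```python
def split_redundant(entries):
    """Separate entries into (kept, extracted) by identifier clashes.

    Each entry is a dict with the hypothetical keys "id", "accessions"
    and (optionally) "protein_ids". Extracted entries would be handed
    to manual curation and reintegrated in a later release.
    """
    seen = {"accession": set(), "id": set(), "protein_id": set()}
    kept, extracted = [], []
    for entry in entries:
        keys = {
            "accession": set(entry["accessions"]),
            "id": {entry["id"]},
            "protein_id": set(entry.get("protein_ids", [])),
        }
        if any(keys[kind] & seen[kind] for kind in seen):
            extracted.append(entry)
            continue
        for kind in seen:
            seen[kind] |= keys[kind]
        kept.append(entry)
    return kept, extracted


# Made-up entries: the third reuses a Protein_ID of the first.
entries = [
    {"id": "ENTRY_A", "accessions": ["P00001"], "protein_ids": ["g100"]},
    {"id": "ENTRY_B", "accessions": ["Q00002"], "protein_ids": ["g200"]},
    {"id": "ENTRY_C", "accessions": ["Q00003"], "protein_ids": ["g100"]},
]
kept, extracted = split_redundant(entries)
```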

Various verification steps have been introduced to ensure that SPTR is comprehensive and contains all relevant data sources. The main source of new protein sequences is the translations of CDS in the nucleotide sequence databases. The up-to-date inclusion of new protein sequence entries is ensured by the weekly translation of EMBL-NEW (the updates to the EMBL nucleotide sequence database). The three collaborating nucleotide sequence databases DDBJ, EMBL and GenBank exchange their data on a daily basis. Therefore any protein coding sequence submitted to DDBJ/EMBL/GenBank will appear in SPTR within two weeks in the worst case and within less than one week in the average case.

Another major source are the amino acid sequences directly derived from protein sequencing. Thousands of such sequences have been detected by the SWISS-PROT curators in publications (or have been directly submitted by researchers to SWISS-PROT) and entered into the database. Protein sequences detected by the NCBI journal scan have also been included. For some proteins the Brookhaven Protein Data Bank (PDB) (Abola et al., 1996) is the only source for the sequence information. The PDB entries are regularly checked and new SWISS-PROT entries get created whenever necessary.

The only additional publicly available protein sequence data, which might not be included in SPTR, are sequences that have been overlooked by SWISS-PROT, but have been detected by PIR. Detailed checks of this data source have been made, to be sure that SPTR contains all publicly available naturally occurring proteins. As a first step it was checked which PIR entries have been cross-referenced by SPTR entries. These entries have been marked as matched because the cross-references to PIR are manually added to SPTR entries and refer to directly corresponding entries. Then the entries containing a PID have been marked as matched because all these entries are contained in SPTR as manually curated SWISS-PROT entries or as EMBL translations in TrEMBL/TrEMBLnew. Finally, full-length sequence matches and matches of PIR fragments against longer SPTR and REM-TrEMBL entries have been marked. The remaining PIR entries (around 10% of PIR) have been manually checked. In the majority of cases, these entries were different (redundant) reports for the same sequence and already included in SPTR in entries with merged sequence reports. In the cases where the entries were really missing in SPTR (around 3% of PIR entries), the SWISS-PROT curators went back to the original publication and created from the original publication new SWISS-PROT entries to complete SPTR. These checking procedures are being done continuously, so that SPTR offers a comprehensive view of the protein sequence world. The only protein sequences not contained in SPTR are the sequences from the REM-TrEMBL entries, since REM-TrEMBL contains the entries that will not be included into SWISS-PROT, e.g. synthetic sequences and pseudogenes, but these remain available in the REM-TrEMBL distribution.

SPTR has been produced weekly since its start in January 1998. On 10.9.1999, SPTR contained 352,393 entries: 80,681 SWISS-PROT entries, 198,791 TrEMBL entries and 72,921 TrEMBLnew entries. As the rate of incoming data and the addition of value through manual curation and automatic annotation increase, it is planned to start producing SPTR daily in the near future.

SPTR is distributed in three files: sprot.dat.Z, trembl.dat.Z and trembl_new.dat.Z. These files are, as indicated by their "Z" extension, Unix "compress" format files which, when decompressed, will produce ASCII files in SWISS-PROT format. Three other files are also available (sprot.fas.Z, trembl.fas.Z and trembl_new.fas.Z), which are compressed "fasta" format sequence files useful for building the databases used by FASTA, BLAST and other sequence similarity search programs. Please do not use these files for other purposes, as you lose all annotation by using this format. The SPTR files are stored in the directory "/pub/databases/sp_tr_nrdb" on the EBI FTP server and in the directory "/databases/sp_tr_nrdb" on the ExPASy FTP server.

Please note that

  • the SWISS-PROT file continuously grows as new annotated sequences are added.
  • the TrEMBL file decreases in size as sequences are moved out of that section after being annotated and moved into SWISS-PROT. Four times a year a new release of TrEMBL is built at EBI and at this point the TrEMBL file increases in size as it then includes all of the new data that has accumulated since the last release.
  • the TrEMBLnew file starts as a small file and grows in size until a new release of TrEMBL is available.

You will not find any primary accession number duplicated between SWISS-PROT and TrEMBL, since they share the same system of accession numbers. A TrEMBL entry (and its associated accession number(s)) can either move to SWISS-PROT as a new entry or be merged with an existing SWISS-PROT or TrEMBL entry. In the latter case, the accession number(s) of that TrEMBL entry are added to those of the existing entry.
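The handling of accession numbers on merging can be illustrated with a short sketch. The entry representation and function name are assumptions made for illustration; the sketch follows the SWISS-PROT convention that the first accession number of an entry is its primary one, so the existing entry keeps its primary accession and the merged entry's numbers are appended as secondary ones.

```python
def merge_into(target, trembl_entry):
    """Return a copy of `target` with the accessions of `trembl_entry` added.

    The target's primary accession (first in the list) is preserved;
    accessions of the merged TrEMBL entry are appended without duplicates.
    """
    merged = list(target["accessions"])
    for accession in trembl_entry["accessions"]:
        if accession not in merged:
            merged.append(accession)
    return {**target, "accessions": merged}


# Made-up entries for illustration.
swissprot_entry = {"id": "EXAMPLE_HUMAN", "accessions": ["P00001"]}
trembl_entry = {"id": "Q00002", "accessions": ["Q00002", "O00003"]}
merged_entry = merge_into(swissprot_entry, trembl_entry)
```

Because the old accession numbers survive as secondary numbers, a reference to the former TrEMBL entry can still be resolved after the merge.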


Specialised protein sequence databases

And now a few words about specialised protein sequence databases. There are many of them, some of them are quite small and only contain a handful of entries, and others are wider in scope and larger in size. This chapter will finish with a brief description of three representative examples of specialised protein sequence databases. As this category of databases is quite changeable, any list provided here would soon be outdated. However, under the URL will find a WWW document that lists information sources for molecular biologists, which is kept constantly up-to-date.


MEROPS. The MEROPS database (Rawlings and Barrett, 1999) provides a catalogue and structure-based classification of peptidases (i.e. all proteolytic enzymes). An index of the peptidases by name or synonym gives access to a set of files termed PepCards, each of which provides information on a single peptidase. Each card file contains information on classification and nomenclature, and hypertext links to the relevant entries in other databases. The peptidases are classified into families on the basis of statistically significant similarities between the protein sequences in the part termed the `peptidase unit' that is most directly responsible for activity. Families that are thought to have common evolutionary origins and are known or expected to have similar tertiary folds are grouped into clans. The MEROPS database provides sets of files called FamCards and ClanCards describing the individual families and clans. Each FamCard document provides links to other databases for sequence motifs and secondary and tertiary structures, and shows the distribution of the family across the major taxonomic kingdoms.


GCRDb. GCRDb (Kolakowski, 1994) is a database of sequences and other data relevant to the biology of G-protein coupled receptors (GCRs), a very large protein family of critical components of many different signalling systems in animals. As can be seen in Figure 9, the information available in a GCRDb entry is not much more extensive than what you would find in the EMBL nucleotide sequence entry from which it is derived. What makes this database useful is not the entries themselves, but the analyses (e.g. multiple alignments, classification into subfamilies) which have been made on the data and which are available from the GCRDb database. It is a good example of a specialised database adding value by offering an analytical view on data which a universal sequence database is unable to provide.


YPD. YPD (Hodges et al., 1997) is a database for the proteins of S. cerevisiae. Based on the detailed curation of the scientific literature for the yeast Saccharomyces cerevisiae, YPD contains more than 50 000 annotation lines derived from the review of 8500 research publications. The information concerning each of the more than 6000 yeast proteins is structured around a one-page format, the Yeast Protein Report, with additional information provided as pop-up windows. Protein classification schemas define each protein's cellular role, function and pathway. YPD provides the user with a succinct summary of the protein's function and its place in the biology of the cell. The first transcript profiling data has been integrated into the YPD Protein Reports, providing the framework for the presentation of genome-wide functional data. Altogether YPD is a very useful data collection for all yeast researchers and especially for those working on the yeast proteome.


Specialised protein databases

The ENZYME database is an annotated extension of the Enzyme Commission's publication, linked to SWISS-PROT. There are also databases of enzyme properties: BRENDA, the Ligand Chemical Database for Enzyme Reactions (LIGAND), and the Database of Enzymes and Metabolic Pathways (EMP).
BRENDA, LIGAND and EMP are searchable via SRS at the EBI. LIGAND is linked to the metabolic pathways in KEGG.

Databases of two-dimensional gel electrophoresis data are available from ExPASy and the Danish Centre for Human Genome Research. A useful resource for mass spectrometry protein data, including protein cleavage products, is maintained at Rockefeller University.


Secondary protein databases

Very often the sequence of an unknown protein is too distantly related to any protein of known structure to detect its resemblance by overall sequence alignment, but it can be identified by the occurrence in its sequence of a particular cluster of residue types which is variously known as a pattern, motif, signature, or fingerprint. These motifs arise because of particular requirements on the structure of specific region(s) of a protein, which may be important, for example, for their binding properties or for their enzymatic activity. These requirements impose very tight constraints on the evolution of those limited (in size) but important portion(s) of a protein sequence. A signature modelling such a site must be as short as possible, should detect all or most of the sequences it is designed to describe and should not give too many false positive results. In other words it must exhibit both high sensitivity and high specificity.
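The two quality measures named above have standard definitions, which can be made concrete with a small calculation. The counts used here are invented purely for illustration: a signature is scanned against 1,000 sequences of which 50 truly belong to the family; it matches 48 of them and also hits 2 unrelated sequences.

```python
def sensitivity(true_pos: int, false_neg: int) -> float:
    """Fraction of true family members that the signature detects."""
    return true_pos / (true_pos + false_neg)


def specificity(true_neg: int, false_pos: int) -> float:
    """Fraction of unrelated sequences that the signature correctly rejects."""
    return true_neg / (true_neg + false_pos)


# Invented counts for illustration only.
total_sequences = 1000
family_members = 50
true_pos, false_pos = 48, 2
false_neg = family_members - true_pos                       # 2 members missed
true_neg = total_sequences - family_members - false_pos     # 948 rejected

sens = sensitivity(true_pos, false_neg)   # 48 / 50
spec = specificity(true_neg, false_pos)   # 948 / 950
```

A shorter signature tends to raise sensitivity at the cost of specificity, and a longer one does the reverse, which is the trade-off the signature designer has to balance.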

There are a few databases available, which use different methodology and a varying degree of biological information on the characterised protein families, domains and sites. PROSITE includes extensive documentation on many protein families, as defined by sequence domains or motifs.
Other databases in which proteins are grouped, using various algorithms, by sequence similarity include PRINTS, Pfam, BLOCKS, and SBASE.

These secondary protein sequence databases have become vital tools for identifying distant relationships in novel sequences and hence for inferring protein function. During the last decade, these databases have evolved by using signature-recognition methods to address different sequence analysis problems, resulting in rather different and independent databases. To perform a comprehensive analysis, a user therefore has to know several important things. For example, what are the resources and where can they be found? What is the difference between them in terms of diagnostic performance and family coverage? What do the different search outputs mean? Is it sufficient to use just one of the databases, and if so, which one? Or, given the seeming complexity, won't PSI-BLAST (Altschul et al., 1997) do just as well?

Diagnostically, the most commonly used secondary protein databases (PROSITE, PRINTS and Pfam) have different areas of optimum application owing to the different strengths and weaknesses of their underlying analysis methods (regular expressions, profiles, Hidden Markov Models and fingerprints). For example, regular expressions are likely to be unreliable in the identification of members of highly divergent super-families (where profiles and HMMs excel); fingerprints perform relatively poorly in the diagnosis of very short motifs (where regular expressions do well); and profiles and HMMs are less likely to give specific sub-family diagnoses (where fingerprints excel).

In terms of family coverage, PROSITE, PRINTS and Pfam are similar in size but differ in content: each contains between 1,000 and 1,500 entries, spanning a range of globular and membrane proteins, modules and mosaics, repeats, and so on. While all of the resources share a common interest in protein sequence classification, some focus on divergent domains (e.g., Pfam), some focus on functional sites (e.g., PROSITE), and others focus on families, specialising in hierarchical definitions from super-family down to sub-family levels in order to pin-point specific functions (e.g., PRINTS).

A number of sequence cluster databases are also commonly used in sequence analysis, for example to facilitate domain identification (e.g., ProDom). Unlike pattern databases, the clustered resources are derived automatically from sequence databases, using different clustering algorithms. This allows them to be relatively comprehensive, because they do not depend on manual crafting and validation of family discriminators; but the biological relevance of clusters can be ambiguous and may just be artefacts of particular thresholds.

Given these complexities, analysis strategies should endeavour to combine a range of databases, as none alone is sufficient. In concert, however, they can complement routine sequence database searches by providing more specific diagnoses than are possible with tools such as PSI-BLAST. PSI-BLAST highlights generic similarities by gathering sequences into families using an iterative profiling technique. However, there are problems with this approach. For example, if a multi-domain protein is matched, it may not be clear whether the region matched is the functional part of the protein, and hence whether functional annotations can be reliably transferred to the query; similarly, if a large super-family has been matched, it may be difficult to make the correct family or sub-family diagnosis.

Unfortunately, these secondary databases do not share the same formats and nomenclature, which makes the use of all of them in an automated way difficult. In response to this the SWISS-PROT and TrEMBL group at the EBI is working with the PROSITE, PRINTS, Pfam and ProDom groups on the integration of these databases into an Integrated resource of Protein domains and functional sites (InterPro). InterPro will allow users access to a wider, complementary range of site and domain recognition methods in a single package.


PROSITE. The special value of this database is the extensive documentation on many protein families, as defined by sequence domains or motifs. PROSITE contains biologically significant sites and patterns formulated in such a way that, with appropriate computational tools, the family of proteins to which a new sequence belongs can be identified rapidly and reliably.

Release 15 of PROSITE contained motifs for 1034 protein families and sites. Most of the motifs in PROSITE are regular expressions, so-called patterns. Around one hundred of the motifs are so-called extended profiles. Of the available analysis methods, regular expressions are the simplest to derive. Conserved motifs within sequence alignments are reduced into consensus expressions, in which all but the most significant residue information is discarded. In terms of their performance in pattern recognition, regular expressions have certain limitations. Patterns may themselves encode flexibility, or fuzziness, but require query sequences to match them exactly. Thus sequences that differ only slightly from the definition will be missed. Also, there are a number of protein families as well as functional or structural domains that cannot be detected using regular expressions due to their extreme sequence divergence.
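The exact-match behaviour of patterns is easy to demonstrate by translating a PROSITE pattern into an ordinary regular expression. The converter below is a simplified sketch that handles only a subset of the PROSITE pattern syntax (single residues, x, [..], {..}, repetition counts and the < and > anchors); the function name is an assumption. The example uses the N-glycosylation site pattern N-{P}-[ST]-{P}.

```python
import re


def prosite_to_regex(pattern: str) -> str:
    """Translate a (subset of) PROSITE pattern syntax into a Python regex."""
    parts = []
    for element in pattern.rstrip(".").split("-"):
        prefix = suffix = repeat = ""
        if element.startswith("<"):            # anchored at the N-terminus
            prefix, element = "^", element[1:]
        if element.endswith(">"):              # anchored at the C-terminus
            suffix, element = "$", element[:-1]
        counted = re.fullmatch(r"(.+?)\((\d+)(?:,(\d+))?\)", element)
        if counted:                            # x(3) or x(2,4)
            element = counted.group(1)
            upper = "," + counted.group(3) if counted.group(3) else ""
            repeat = "{" + counted.group(2) + upper + "}"
        if element == "x":                     # any residue
            core = "."
        elif element.startswith("{") and element.endswith("}"):
            core = "[^" + element[1:-1] + "]"  # any residue except these
        else:
            core = element                     # single residue or [ABC]
        parts.append(prefix + core + repeat + suffix)
    return "".join(parts)


glyco = prosite_to_regex("N-{P}-[ST]-{P}.")
hit = re.search(glyco, "AANGSAA")    # N, not P, S, not P: matches "NGSA"
miss = re.search(glyco, "AANPSAA")   # proline after N: no match at all
```

The second sequence differs from the first in a single residue, yet it produces no match, which illustrates why slightly divergent sequences are missed by patterns.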

The building of a PROSITE pattern usually starts by studying review(s) on a group or family of proteins. Then an alignment of the proteins discussed in that review and of additional sequences relevant to the subject under consideration is built. Using such alignments, particular attention is paid to the residues and regions thought or proved to be important to the biological function of that group of proteins. Now a `core' pattern, a short conserved sequence, is created that is part of a region known to be important or which includes biologically significant residue(s). The most recent version of SWISS-PROT is then scanned with these core pattern(s). If a core pattern detects all the proteins under consideration and none (or very few) of the other proteins, the core pattern is used as the signature. In most cases a core pattern picks up additional sequences which clearly do not belong to the group of proteins under consideration. Iterative series of scans, involving a gradual increase in the size of the pattern, are then necessary.

There are a number of protein families as well as functional or structural domains that cannot be detected using patterns due to their extreme sequence divergence; the use of techniques based on profiles or weight matrices allows the detection of such proteins or domains with sequence divergence. A profile is a table of position-specific amino acid weights and gap costs. These numbers are used to calculate a similarity score for any alignment between a profile and a sequence, or parts of a profile and a sequence. An alignment with a similarity score higher than or equal to a given cut-off value constitutes a motif occurrence. As with regular expressions, there may be several matches to a profile in one sequence, but multiple occurrences in the same sequence must be disjoint (non-overlapping) according to a specific definition included in the profile. Unlike patterns, profiles are usually not confined to small regions with high sequence similarity, but attempt to characterise a protein family or domain over its entire length. Profiles are supposed to be more sensitive and more robust than patterns because they provide discriminatory weights not only for the residues already found at a given position of a motif but also for those not yet found. The weights for those not yet found are extrapolated from the observed amino acid compositions using empirical knowledge about amino acid substitutability.
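The scoring idea can be sketched with a deliberately reduced model. The code below assumes a profile without gaps (so gap costs are ignored entirely), represents it as one residue-to-weight table per position, and resolves overlaps greedily from left to right; real generalised profiles are considerably richer, and all names and numbers here are invented for illustration.

```python
def scan_profile(profile, sequence, cutoff, default=-1.0):
    """Slide an ungapped profile along a sequence and report disjoint hits.

    profile: list of dicts mapping residue -> weight, one per position.
    default: weight for residues not listed at a position.
    Returns a list of (start, score) for windows scoring >= cutoff.
    """
    width = len(profile)
    hits = []
    next_free = 0                       # first position not covered by a hit
    for start in range(len(sequence) - width + 1):
        if start < next_free:
            continue                    # would overlap the previous hit
        window = sequence[start:start + width]
        score = sum(column.get(residue, default)
                    for column, residue in zip(profile, window))
        if score >= cutoff:
            hits.append((start, score))
            next_free = start + width
    return hits


# Invented three-position profile: preferred residues score 2.0,
# acceptable substitutes score less, everything else scores -1.0.
toy_profile = [
    {"G": 2.0, "A": 1.0},
    {"K": 2.0, "R": 1.5},
    {"S": 2.0, "T": 1.5},
]
hits = scan_profile(toy_profile, "MGKSAARTLL", cutoff=3.5)
```

The second hit ("ART") contains none of the preferred residues but still exceeds the cut-off, because every position carries a weight for its substitute. A regular expression built from the preferred residues alone would have missed it.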

Since 1994, PROSITE has complemented regular expression entries by gradually adding profile entries. The profile structure used in PROSITE is similar to but slightly more general than the one introduced by Gribskov and co-workers (Gribskov et al., 1987). Generalised profiles are remarkably similar to the specific type of Hidden Markov Models (HMMs) used in Pfam (explained below).


PRINTS. A different approach to pattern recognition, termed "fingerprinting" is used by PRINTS. Within a sequence alignment, it is usual to find not one, but several motifs that characterise the aligned family. Diagnostically, it makes sense to use many, or all, of the conserved regions to build a family signature. In a database search, there is then a greater chance of identifying a distant relative, whether or not all parts of the signature are matched. Thus, for example, a sequence that matches only four of seven motifs may still be diagnosed as a true match if the motifs are matched in the correct order in the sequence and the distances between them are consistent with that expected of true neighbouring motifs. The ability to tolerate mismatches, both at the level of residues within individual motifs, and at the level of motifs within the fingerprint as a whole, renders fingerprinting a powerful diagnostic technique.
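The order-and-distance reasoning can be sketched as a simple decision function. This is not the scoring scheme used by PRINTS; it is a hypothetical illustration in which each motif of a fingerprint has an expected start position (taken from some reference family member), the query reports a start position or no match for each motif, and a fixed tolerance on inter-motif distances is assumed.

```python
def fingerprint_match(observed, expected, min_motifs, tolerance):
    """Decide whether a partial set of motif matches forms a plausible hit.

    observed: start position of each motif in the query, or None if the
              motif was not matched, listed in fingerprint order.
    expected: start position of each motif in a reference sequence.
    A hit requires at least `min_motifs` matched motifs, in the correct
    order, with distances between consecutive matched motifs within
    `tolerance` residues of the expected distances.
    """
    matched = [(i, pos) for i, pos in enumerate(observed) if pos is not None]
    if len(matched) < min_motifs:
        return False
    for (i, pos_i), (j, pos_j) in zip(matched, matched[1:]):
        if pos_j <= pos_i:
            return False                            # motifs out of order
        expected_gap = expected[j] - expected[i]
        if abs((pos_j - pos_i) - expected_gap) > tolerance:
            return False                            # implausible spacing
    return True


# Invented seven-motif fingerprint; the query matches only four motifs.
expected_starts = [10, 40, 75, 120, 160, 200, 250]
partial = [12, None, 80, None, 163, None, 255]
shuffled = [80, None, 12, None, 163, None, 255]

accepted = fingerprint_match(partial, expected_starts, min_motifs=4, tolerance=10)
rejected = fingerprint_match(shuffled, expected_starts, min_motifs=4, tolerance=10)
```

The first query is accepted with four of seven motifs because order and spacing are consistent; the second has the same four motifs but two of them swapped, and is rejected.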

PRINTS provides for each fingerprint extensive documentation about the characterised protein family, domain, or functional site. Release 23.1 contained 1159 fingerprints.

Pfam. Another important secondary protein database is Pfam. Release 4.1 of Pfam contained 1488 entries. The methodology used by Pfam to create protein family or domain signatures is Hidden Markov Models (HMMs). HMMs are closely related to profiles, but are based on probability theory methods. These allow a direct statistical approach to identifying and scoring matches, and also to combining information from a multiple alignment with prior knowledge. One feature that distinguishes HMMs and profiles from regular expressions and fingerprints is that the former allow the full extent of a domain to be identified in a sequence. They are thus particularly useful when analysing multidomain proteins.

Pfam consists of two parts. Pfam-A is curated and contains well-characterised protein families with high-quality alignments, which are maintained by using manually checked seed alignments and HMMs to find and align all members. Pfam-B is based on ProDom, and clusters and aligns the remaining protein sequences after removal of Pfam-A domains. Pfam-A families have stable accession numbers and form a library of HMMs available for scanning of protein sequences. The biggest drawback of Pfam is its lack of biological information (annotation) of the protein families.


InterPro. In the task of sequence characterisation, we need more reliable, concerted methods for identifying protein family traits and for inheriting functional annotation. This is especially important given our dependence on automatic methods for assigning functions to the raw sequence data issuing from genome projects. But rationalising this process by creating a single coherent resource for diagnosis and documentation of protein families is difficult, given entirely different database formats, different search tools and different search outputs. InterPro is an attempt to address some of these issues. This new resource provides an integrated view of a number of commonly used pattern databases, and provides an intuitive interface for text- and sequence-based searches.

The first release of InterPro was built from Pfam 4.1 (1,488 domains), PRINTS 23.1 (1,159 fingerprints) and PROSITE 15 (1,034 families).

Flat-files submitted by each of the groups were systematically merged and dismantled. Where relevant, family annotations were amalgamated, and all method-specific annotation separated out. This process was complicated by the relationships that can exist, both between entries in the same database, and between entries in different databases. Different types of parent-child relationship were evident, leading to the differentiation into ‘sub-types' and ‘sub-strings'. A sub-string means that a motif or motifs are contained within a region of sequence encoded by a wider pattern (e.g., a PROSITE pattern is typically contained within a PRINTS fingerprint; or a fingerprint might be contained within a Pfam domain). A sub-type means that one or more motifs are specific for a sub-set of sequences captured by another more general pattern (e.g., a super-family fingerprint may contain several family- and sub-family-specific fingerprints; or a generic Pfam domain may include several family fingerprints).

Having classified the parent-child relationships of overlapping PROSITE, PRINTS and Pfam entries, all recognisably distinct entities were assigned unique accession numbers (which take the form IPR00000). In doing this, the general principle was adopted that parents and children with sub-string relationships usually have the same IPR numbers, while sub-type parent-child relationships warrant their own IPRs.

To facilitate in-house maintenance, InterPro is managed within a relational database system. For users, however, the core InterPro entries are released in a single ASCII (text) flatfile, which is written in XML. The overall data flow, from individual data provider, through the DBMS, out to the flatfile and on to the user, is fairly complex – a flavour of this complexity is given in Figure 10.

Release 1.0 (November 1999) contains nearly 2,300 entries, representing families, domains, repeats and PTMs encoded by 4,300 different regular expressions, profiles, fingerprints and HMMs. Overall, InterPro entries have more than 370,000 hits against sequences in SWISS-PROT and TrEMBL.

InterPro is accessible for interactive use via the EBI Web server, which can also be reached via each of the member databases. The Web interface allows text-based searches using SRS (Etzold et al., 1996) and sequence-based searches using software provided by the consortium members. Interpretation of output is facilitated by means of a graphical user interface, which has been extended from the tools used to visualise ProDom families. Thus, for each sequence, the domain and/or motif organisation can be seen at a glance.

The flatfile distribution may be retrieved from the EBI anonymous-ftp server.

While the initial InterPro release was created around PRINTS, PROSITE and Pfam, ProDom will shortly also be included. Various factors rendered a step-wise approach to the development of InterPro desirable. First, the scale of the task of amalgamating just the first three databases was immense. The rational merging of apparently equivalent database entries that in fact simultaneously define a specific family, domains within that family, or even repeats within those domains, presented an enormous challenge. Thus, the immediate goal for InterPro was to limit the problem only to databases that offered annotation. A second important consideration was that while Pfam, PRINTS and PROSITE are true pattern databases, ProDom is based solely on automatic clustering of sequences by similarity (i.e., discriminators are not derived). Resulting clusters need not have precise biological correlations and some family designations have changed between database versions. It was therefore necessary that ProDom should adopt stable accession numbers before its entries could be meaningfully considered for inclusion in InterPro. The full integration of ProDom into InterPro will be achieved in release 2 (May 2000).

Once the founder members of the InterPro consortium have been assimilated into the unified resource, other pattern databases will also be included. First, scheduled for release 3 (November 2000), will be the SMART resource (Schultz et al., 1998). In addition, the Blocks database (Henikoff et al., 1999) is planning to use InterPro as the basis for the creation of Blocks. As Blocks does not include annotation and will be based on families already in InterPro, the process of cross-referencing between Blocks and InterPro, and even the full integration of Blocks within InterPro, should be relatively straightforward. Ultimately, InterPro will include many other protein family databases to give a more comprehensive view of the resources available.

A primary application of InterPro's family, domain and functional site definitions will be in the computational functional classification of newly determined sequences that lack biochemical characterisation. For instance, the EBI will use InterPro for enhancing the automated annotation of TrEMBL. This process should be more efficient and reliable than using each of the pattern databases separately, because InterPro will provide internal consistency checks and deeper coverage. This has been already outlined in detail earlier in this article.

Another major use of InterPro will be in identifying those families and domains for which the existing discriminators are not optimal and could hence be usefully supplemented with an alternative pattern (e.g., where a regular expression identifies large numbers of false matches it could be useful to develop an HMM, or where a Pfam entry covers a vast super-family it could be beneficial to develop discrete family fingerprints, and so on). Alternatively, InterPro is likely to highlight key areas where none of the databases has yet made a contribution and hence where the development of some sort of pattern might be useful.


Structure databases

The number of known protein structures is increasing very rapidly and these are available through the Protein Data Bank (PDB). The Nucleic Acid Database (NDB) is the database for structural information about nucleic acid molecules. There is also a database of structures of ‘small’ molecules, of interest to biologists concerned with protein-ligand interactions, from the Cambridge Crystallographic Data Centre.




References

Abola, E.E., Manning, N.O., Prilusky, J., Stampf, D.R., and Sussman, J.L. (1996). J. Res. Natl. Inst. Stand. Technol. 101, 231-241.

Altschul, S.F., Madden, T.L., Schaffer, A.A., Zhang, J., Zhang, Z., Miller, W., and Lipman, D.J. (1997). Nucl. Acids Res. 25, 3389-3402.

Apweiler, R., Gateau, A., Contrino, S., Martin, M.J., Junker, V., O'Donovan, C., Lang, F., Mitaritonna, N., Kappus, S., and Bairoch, A. (1997). In: "Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology (ISMB)" (T. Gaasterland, P. Karp, K. Karplus, C. Ouzounis, C. Sander, and A. Valencia, eds.), pp. 33-43. AAAI Press, Menlo Park.

Attwood, T.K., Flower, D.R., Lewis, A.P., Mabey, J.E., Morgan, S.R., Scordis, P., Selley, J.N., and Wright, W. (1999). Nucl. Acids Res. 27, 220-225.

Bairoch, A. (1996). Nucl. Acids Res. 24, 221-222.

Bairoch, A., and Apweiler, R. (1999). Nucl. Acids Res. 27, 49-54.

Barker, W.C., Garavelli, J.S., McGarvey, P.B., Marzec, C.R., Orcutt, B.C., Srinivasarao, G.Y., Yeh, L.-S.L., Ledley, R.S., Mewes, H.-W., Pfeiffer, F., and Tsugita, A. (1999). Nucl. Acids Res. 27, 39-43.

Bateman, A., Birney, E., Durbin, R., Eddy, S.R., Finn, R.D., and Sonnhammer, E.L.L. (1999). Nucl. Acids Res. 27, 260-262.

Blake, J.A., Richardson, J.E., Davisson, M.T., and Eppig, J.T. (1999). Nucl. Acids Res. 27, 95-98.

Bleasby, A., Akrigg, D., and Attwood, T.K. (1994). Nucl. Acids Res. 22, 3574-3577.

Bork, P., and Koonin, E.V. (1998). Nature Genet. 18, 313-318.

Corpet, F., Gouzy, J., and Kahn, D. (1999). Nucl. Acids Res. 27, 263-267.

Dayhoff, M.O., Eck, R.V., Chang, M.A., and Sochard, M.R. (1965). Atlas of Protein Sequence and Structure Vol. 1. National Biomedical Research Foundation, Silver Spring, MD.

Dayhoff, M.O. (1979). Atlas of Protein Sequence and Structure Vol. 5, Supplement 3. National Biomedical Research Foundation, Washington, DC.

Etzold, T., Ulyanov, A., and Argos, P. (1996). Methods Enzymol. 266, 114-128.

Fleischmann, W., Möller, S., Gateau, A., and Apweiler, R. (1999). Bioinformatics 15, 228-233.

FlyBase Consortium (1999). Nucl. Acids Res. 27, 85-88.

Frishman, D., and Mewes, H.-W. (1997). Trends in Genetics 13, 415-416.

Glemet, E., and Codani, J.-J. (1997). Comp. Appl. Bio. Sci. 13, 137-143.

Gribskov, M., McLachlan, A.D., and Eisenberg, D. (1987). Proc. Natl. Acad. Sci. U.S.A. 84, 4355-4358.

Henikoff, S., Henikoff, J.G., and Pietrokovski, S. (1999). Bioinformatics 15, 471-479.

Hodges, P.E., McKee, A.H.Z., Davis, B.P., Payne, W.E., and Garrels, J.I. (1999). Nucl. Acids Res. 27, 69-73.

Hofmann, K., Bucher, P., Falquet, L., and Bairoch, A. (1999). Nucl. Acids Res. 27, 215-219.

Kolakowski, L.F. Jr. (1994). Receptors Channels 2, 1-7.

Möller, S., Leser, U., Fleischmann, W., and Apweiler, R. (1999). Bioinformatics 15, 219-227.

Nevill-Manning, C.G., Sethi, K.S., Wu, T.D., and Brutlag, D.L. (1997). In: "Proceedings of the Fifth International Conference on Intelligent Systems for Molecular Biology (ISMB)" (T. Gaasterland, P. Karp, K. Karplus, C. Ouzounis, C. Sander, and A. Valencia, eds.), pp. 202-209. AAAI Press, Menlo Park.

O'Donovan, C., Martin, M.J., Glemet, E., Codani, J.-J., and Apweiler, R. (1999). Bioinformatics 15, 258-269.

Rawlings, N.D., and Barrett, A.J. (1999). Nucl. Acids Res. 27, 325-331.

Scharf, M., Schneider, R., Casari, G., Bork, P., Valencia, A., Ouzounis, C., and Sander, C. (1994). In: "Proceedings of the Second International Conference on Intelligent Systems for Molecular Biology (ISMB)" (R. Altman, D. Brutlag, P. Karp, R. Lathrop, and D. Searls, eds.), pp. 348-353. AAAI Press, Menlo Park.

Schultz, J., Milpetz, F., Bork, P., and Ponting, C.P. (1998). Proc. Natl. Acad. Sci. U.S.A. 95, 5857-5864.

Stoesser, G., Tuli, M.A., Lopez, R., and Sterk, P. (1999). Nucl. Acids Res. 27, 18-24.



